I have a database that keeps stats on the mail messages that arrive for me, with one record summarizing each message. It is updated each night at midnight with the preceding day's data, and the data goes back to September 8, 2008. There are records for 5,639 days' worth of data (a few days are missing), using up 268.7 MB of storage. It has records for 2,568,429 messages received since then. Of that total, 2,118,711 (82%) were immediately classified as almost certainly Spam and not even looked at. So on average over that time I got 455 messages per day and classified 375 as Spam. The total size of all messages over this period was 31.0 GB (5.6 MB per day), of which 24.2 GB was Spam (78% of all bytes).
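The averages and percentages quoted above follow directly from the totals. Here is a minimal sketch of that arithmetic; the variable names are mine, and two details are assumptions rather than anything the page states: that per-day counts are truncated to whole messages, and that 1 GB is taken as 1024 MB. Both happen to reproduce the figures shown.

```python
# Totals quoted in the paragraph above; names are illustrative only.
days = 5_639
total_messages = 2_568_429
spam_messages = 2_118_711
total_gb = 31.0
spam_gb = 24.2

# Fraction of messages and of bytes classified as Spam.
spam_fraction = spam_messages / total_messages
spam_byte_fraction = spam_gb / total_gb

# Per-day averages. Integer division is an assumption: it gives the
# quoted 455 and 375, whereas rounding would turn 375.7 into 376.
messages_per_day = total_messages // days
spam_per_day = spam_messages // days

# Assumes 1 GB = 1024 MB, which is consistent with the quoted 5.6 MB.
mb_per_day = total_gb * 1024 / days

print(f"Spam: {spam_fraction:.0%} of messages, {spam_byte_fraction:.0%} of bytes")
print(f"Per day: {messages_per_day} messages, {spam_per_day} Spam, {mb_per_day:.1f} MB")
```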
The day with the highest total message count was Wednesday, June 9, 2010, with 3,905 messages, of which 3,794 (97%) were Spam. That was also (not surprisingly) the day with the most Spam messages. The day with the highest Spam fraction was Saturday, October 17, 2015, with 1 message, which was Spam (100%).
I used to list low-count days here as well, but they turned out to be days when the mail server was down most of the day. So the low counts weren't because I wasn't being sent much, but because the mail couldn't be delivered.
In the last week I have received 1,011 messages, of which 499 (49%) were Spam. So on average over the week I got 144 messages per day and classified 71 as Spam.
Here are some plots over time. In each of these plots the data is averaged for each month. There is a vertical bar for each calendar month; the year labels on the X axis mark January of each year. Because of the small number of days so far this month, the last value in each plot may be a statistical aberration, since it's averaging over only 1 day's worth of data.
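To make the monthly averaging concrete, here is a rough sketch of how per-day records could be rolled up into one value per calendar month. The function name, the tuple layout, and the sample numbers are all made up for illustration, and it assumes each month is averaged over the days that actually have records. It also shows why a month with only a day or two of data can look odd.

```python
from collections import defaultdict
from datetime import date


def monthly_averages(daily):
    """Average (total, spam) per day for each calendar month.

    daily: iterable of (date, total_count, spam_count), one per recorded day.
    Returns {(year, month): (avg_total_per_day, avg_spam_per_day)}.
    """
    sums = defaultdict(lambda: [0, 0, 0])  # [days seen, total, spam]
    for day, total, spam in daily:
        bucket = sums[(day.year, day.month)]
        bucket[0] += 1
        bucket[1] += total
        bucket[2] += spam
    return {
        month: (total / n, spam / n)
        for month, (n, total, spam) in sorted(sums.items())
    }


# Made-up sample values, not taken from the real database.
sample = [
    (date(2024, 2, 28), 150, 70),
    (date(2024, 2, 29), 130, 80),
    (date(2024, 3, 1), 140, 60),  # a month with a single day so far
]
averages = monthly_averages(sample)
for (year, month), (avg_total, avg_spam) in averages.items():
    print(f"{year}-{month:02d}: {avg_total:.1f} messages/day, {avg_spam:.1f} Spam/day")
```

With a single day in the bucket, the "average" is just that day's count, so one unusual day sets the whole bar.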
As someone who has been around the net for a long time (over 48 years) I'm on every Spammer's list. I have set up some very strict filters for incoming messages. These first charts look at how much of my arriving mail gets preclassified as Spam.
In this first graph, for each month I have data, I plot the average number of messages per day showing the Spam/not Spam distinction. The total height of the bar is the average number of messages per day; the red part was Spam and the green was (maybe) not. Historically Spam has really swamped the good stuff at times. Also, while it looks like there was more good stuff at the beginning, that's only because I'm plotting what the incoming filters decided. They just weren't as good at identifying Spam for the first few months of this data. More explanation with the later graphs.
This shows how the fraction that's classified as Spam has varied over time. Notice that right at the beginning there's a steep rise. It was rising Spam rates that made me want to track this data. It's a pity I don't have data going back further to show how it was before that. The reason the fraction classified as Spam went up sharply was that I was adjusting the Spam filters to get better.
OK, enough about getting swamped with Spam. Here are plots of just the potentially useful messages I got. They are only "potentially" useful because some Spam still gets through the filters, and I don't record whether I'm deleting a message because I've read it and don't need it or because it was Spam that got through. The database records the delivery and not what I do with it.
First, a plot of how many messages didn't get pulled as Spam. In this graph you'll notice a high peak at the start and then a precipitous drop. This was because I noticed that unfiltered Spam was rising greatly, realized I wanted to track it over time, and started this database. So the data starts around the time I started to deal with it (when it was at its worst). But then I worked on the Spam filters and improved things quite a bit, after which it settled down.
If I leave off 2008, which is relatively easy, you can see the extra detail. Note: this also makes both scales change, so account for that when comparing the graphs.
In these graphs I plot incoming messages by the day of the week they arrive. The first plot has both Spam and non-Spam messages; the second shows just the non-Spam messages, so the vertical axis gets rescaled. There does seem to be less mail (both Spam and non-Spam) on weekends.
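For the curious, a day-of-week breakdown like this can be sketched in a few lines. This is only an illustration: the function name and sample counts are invented, and it assumes the plots show the average count per weekday rather than a running total.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_averages(daily):
    """Average message count for each day of the week.

    daily: iterable of (date, count) pairs, one per recorded day.
    Returns {weekday_name: average_count}, in Monday-to-Sunday order.
    """
    counts = defaultdict(list)
    for day, count in daily:
        counts[day.weekday()].append(count)  # Monday is 0, Sunday is 6
    return {
        WEEKDAYS[index]: sum(values) / len(values)
        for index, values in sorted(counts.items())
    }


# Made-up non-Spam counts, not taken from the real database.
sample = [
    (date(2024, 2, 24), 40),  # Saturday
    (date(2024, 2, 26), 90),  # Monday
    (date(2024, 3, 2), 60),   # Saturday
]
by_weekday = weekday_averages(sample)
for name, average in by_weekday.items():
    print(f"{name}: {average:.1f}")
```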
And, if you want the actual numbers behind the second graph, here they are in a table:
Web site designed, developed and hosted by
MAP Network Engineering
MAP Network Engineering uses, supports and recommends Open Source software and open standards.
Page generated 2024-03-03
at 02:43 GMT
Copyright © MMXXIV Michael A. Patton