Infopost | 2015.03.22

A few months ago Steve and I noted that our respective sites were getting tons of hits from Samara Oblast, an obscure(?) territory in Russia. Russian search engine maybe? Cybercriminals? Proxy for the American or Chinese or Syrian electronic armies? Who really cares? Only port 80 should be open, and it's doing nothing fancy.

But since this kilroy thing has gotten pretty lengthy I was scoping the possibility of doing some sort of 'top content' thing based on hits. So I pulled my server logs and was looking through them to see how hard it'd be to parse.
Attack surface

Missile command screenshot

Well this is fun: "GET /kilroy/archive/2008/04/index.html HTTP/1.0"... "GET /kilroy/2008/01/leader-board-r.html HTTP/1.0"... "GET /kilroy/2008/01/index.php HTTP/1.0"... "GET /2008/01/index.php HTTP/1.0"... "GET /kilroy/2008/01/index.php HTTP/1.0"... "GET /2008/01/index.php HTTP/1.0"... "GET /kilroy/2008/01/index.php HTTP/1.0"... "GET /2008/01/index.php HTTP/1.0"...

How am I going to count hits for 2008/01/index.php when there is no anything.php?

Eight sequential hits from the same person, within 10 seconds. That's what I call quick on the mouse. Whois says it's from Ukraine. I'm going to stop myself right here: this is my first time actually looking at HTTP traffic, and this is old hat to 80% of the world. Okay, let's continue.
Maybe they're just guessing at the site map, but probably they're looking to have some fun with PHP.
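
Runs like that are easy to spot mechanically. Here's a minimal Python sketch, assuming the log has already been boiled down to (ip, timestamp) pairs. The `find_bursts` name and both thresholds are made up for illustration; the defaults just mirror the eight-hits-in-ten-seconds run above.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_bursts(hits, min_hits=8, window=timedelta(seconds=10)):
    """Return the set of IPs that made at least min_hits requests within window.

    hits is an iterable of (ip, datetime) pairs. Both thresholds are
    illustrative guesses, not tuned values.
    """
    by_ip = defaultdict(list)
    for ip, when in hits:
        by_ip[ip].append(when)

    bursty = set()
    for ip, times in by_ip.items():
        times.sort()
        # Check every run of min_hits consecutive requests from this IP.
        for i in range(len(times) - min_hits + 1):
            if times[i + min_hits - 1] - times[i] <= window:
                bursty.add(ip)
                break
    return bursty
```

A human clicking through an archive could conceivably trip this, so it's better treated as a hint than a verdict.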

Another interesting one:

POST /cgi-bin/php?%2D%64+%61%6C%6C%6F%77%5F%75%72%6C%5F%69%6E%63%6C%75%64%65
%69%72%65%63%74%5F%73%74%61%74%75%73%5F%65%6E%76%3D%30+%2D%6E HTTP/1.1

Looking to do injection or overflow or something? Not really my wheelhouse, but it was kind of a fun digression.
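
For the curious, the percent-encoding is trivial to undo. A small Python sketch using the standard library's `urllib.parse.unquote_plus`, fed only the two query-string fragments that appear in the log excerpt above (whatever sat between them isn't shown, so it isn't reconstructed here):

```python
from urllib.parse import unquote_plus

# The two fragments exactly as they appear in the excerpt above.
fragments = [
    "%2D%64+%61%6C%6C%6F%77%5F%75%72%6C%5F%69%6E%63%6C%75%64%65",
    "%69%72%65%63%74%5F%73%74%61%74%75%73%5F%65%6E%76%3D%30+%2D%6E",
]

# unquote_plus turns "+" into a space and "%XX" into the matching character.
decoded = [unquote_plus(fragment) for fragment in fragments]
for text in decoded:
    print(text)
# -d allow_url_include
# irect_status_env=0 -n
```

Decoded, it reads like command-line switches (`-d`, `-n`) plus a PHP setting name, so this looks more like an attempt to pass options to a PHP CGI binary through the query string than an overflow.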

So I wrote some code to classify site traffic into categories such as visits, bot indexing, and malicious requests.
Some of it was pretty easy: bots tend to declare themselves in the user agent string and hit robots.txt first. Malicious stuff sends PUTs and looks for files that aren't .html/.jpg/etc. And, of course, sequential traffic from the same IP can be classified together. This is important because an attack might hit numerous legit links, but it's not visit traffic.
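
Those rules can be sketched in a few lines of Python. This is an illustration, not the code I actually ran: it assumes the server writes Apache-style "combined" log lines, and the `BOT_HINTS` and `EXPECTED_EXTENSIONS` lists and the label names are placeholders. It also simplifies "sequential traffic from the same IP" to "any traffic from the same IP", and treats anything other than GET/HEAD as suspect, which only makes sense for a static site.

```python
import re

# Assumes Apache-style "combined" log lines; adjust for other formats.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative lists only; a real site would tune both.
BOT_HINTS = ("bot", "spider", "crawl")
EXPECTED_EXTENSIONS = (".html", ".htm", ".jpg", ".jpeg", ".png", ".gif",
                       ".css", ".js", ".ico", ".txt", ".xml")


def classify(line):
    """Return (ip, label) for one log line; the label is a rough guess."""
    m = LOG_LINE.match(line)
    if m is None:
        return None, "unparsed"
    agent = m["agent"].lower()
    path = m["path"].split("?", 1)[0].lower()  # drop any query string
    filename = path.rsplit("/", 1)[-1]
    if path == "/robots.txt" or any(hint in agent for hint in BOT_HINTS):
        label = "bot"
    elif m["method"] not in ("GET", "HEAD"):
        label = "malicious"  # PUT, POST, etc. have no business here
    elif "." in filename and not filename.endswith(EXPECTED_EXTENSIONS):
        label = "malicious"  # probing for .php and friends
    else:
        label = "visit"
    return m["ip"], label


def classify_all(lines):
    """Classify every line, then relabel all hits from any offending IP."""
    labelled = [classify(line) for line in lines]
    bad_ips = {ip for ip, label in labelled if label == "malicious"}
    return ["malicious" if ip in bad_ips else label for ip, label in labelled]
```

The second pass is what keeps an attacker's hits on legit links out of the visit count.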

Logs go back about a year. Here's some Excel, because easy.

Classification of web site hits

I get indexed about twice as much as I get visited. There have been more than 20,000 malicious HTTP requests.

Web site bot hits histogram

Google, Baidu, and Majestic 12 (a distributed indexing project) turned up the most, but there are quite a few bots out there.

So the top visited content, the main reason for this whole endeavor:

Labels - which are now just links to search
Data skew: some content has been around longer. On the other hand, the logs are only from about a year back.
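
Once the visit traffic is separated out, the ranking itself is just a tally. A minimal Python sketch, assuming a list of request paths already filtered down to real visits; the `top_content` name is hypothetical, and it does nothing about the age skew mentioned above:

```python
from collections import Counter


def top_content(visit_paths, n=10):
    """Return the n most-requested paths as (path, hits) pairs."""
    # Strip query strings so /post.html?x=1 counts toward /post.html.
    counts = Counter(path.split("?", 1)[0] for path in visit_paths)
    return counts.most_common(n)
```

Dividing each count by the number of days the page has existed within the log window would be one way to even out the skew.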

When I get some more fun-coding time I'll see about putting this in the sidebar.