Apache log is a script that parses an Apache access.log file and produces statistics such as top referrers and top pages. It is useful for skipping search engine traffic (more than 90% of total traffic on some web servers) and getting statistics about your real (human) visitors.

Features

  • filter method: only process POST, GET and HEAD requests (skip WebDAV requests, used by Subversion)
  • filter host (IP): ignore search engine hosts (e.g. search engines that send a valid user agent; see bot_hosts.py)
  • filter user: skip requests of authenticated users (we already know them)
  • filter URL: ignore requests for CSS, pictures, etc.
  • filter referrer: ignore self-references, for example
  • filter user agent: ignore automatic traffic (search engine crawlers), with a white list (see user_agents.py) to avoid false positives
  • filters are written as regular expressions
  • fully written in Python
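
The filters listed above can be pictured as a chain of checks applied to each request. The following is only a minimal sketch of that idea, not the code of apache_log: the function name is_human_request and the patterns are assumptions made for illustration, and the real host and user agent lists live in bot_hosts.py and user_agents.py.

```python
import re

# Hypothetical patterns, for illustration only (not copied from apache_log).
ALLOWED_METHODS = {"GET", "POST", "HEAD"}
STATIC_URL = re.compile(r"\.(css|js|png|jpe?g|gif|ico)(\?.*)?$", re.IGNORECASE)
BOT_AGENT = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def is_human_request(method, url, user, user_agent):
    """Return True if the request passes every filter."""
    if method not in ALLOWED_METHODS:
        return False  # e.g. WebDAV methods such as PROPFIND
    if user != "-":
        return False  # authenticated user ("-" means anonymous in the log)
    if STATIC_URL.search(url):
        return False  # CSS, pictures, etc.
    if BOT_AGENT.search(user_agent):
        return False  # search engine crawler
    return True
```

A request is counted as human traffic only if no filter rejects it; a white list, as used by the script for user agents, would be checked before the crawler pattern.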

Statistics

  • top pages by referrer
  • top pages by number of hits
  • top pages by host
  • first and last timestamp
  • human traffic as a percentage of total traffic (between 0.1% and 5% on my servers)
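
As a rough sketch of how such rankings and the human traffic percentage can be computed, assuming the filtered requests are already available as plain lists. The names top and human_percent are hypothetical and are not part of apache_log.

```python
from collections import Counter

def top(items, limit=10):
    """Return the `limit` most frequent items as (item, hits) pairs."""
    return Counter(items).most_common(limit)

def human_percent(human_hits, total_hits):
    """Return human traffic as a percentage of total traffic."""
    if not total_hits:
        return 0.0
    return 100.0 * human_hits / total_hits

# Hypothetical input: the page of each request kept by the filters.
pages = ["/", "/wiki/hachoir-parser", "/", "/wiki/hachoir-parser", "/"]
for rank, (page, hits) in enumerate(top(pages), 1):
    print("#%s: %s (%s hits)" % (rank, page, hits))
```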

Download

svn co http://haypo.hachoir.org/svn/apache_log

Browse source code

Browse Python source code (see also root directory).

Example of result

Load file /var/log/apache2/hachoir.log ...
File /var/log/apache2/hachoir.log parsed (2037 lines).

=== Top host ===
#1: 209.85.238.2 (9 hits)
#2: 124.168.200.44 (7 hits)
#3: 212.226.169.252 (7 hits)
#4: 80.236.234.95 (5 hits)

=== Top page ===
#1: / (24 hits)
#2: /wiki/hachoir-parser (13 hits)
#3: /wiki/hachoir-metadata (10 hits)
#4: /log/?limit=100&mode=stop_on_copy&format=rss (9 hits)
#5: /wiki/hachoir-core (6 hits)
#6: /wiki/hachoir-urwid (3 hits)
#7: /ticket/153 (2 hits)
#8: /wiki/WikiStart (2 hits)
#9: /wiki/Canoscan5000F (2 hits)

=== Top referrer ===
#1: http://www.forensicfocus.com/index.php?name=News&file=article&sid=762 (6 hits)
#2: http://linuxfr.org/2006/12/19/21787.html (6 hits)
#3: http://www.haypocalc.com/wiki/Hachoir (4 hits)
#4: http://wiki.wireshark.org/CaptureSetup/USB (2 hits)
#5: http://themacelite.com/forums/viewtopic.php?t=9&postdays=0&postorder=asc&start=15 (2 hits)
#6: http://www.advogato.org/person/follower/diary.html?start=80 (2 hits)
#7: http://cheeseshop.python.org/pypi/hachoir-parser (2 hits)
#8: http://www.haypocalc.com/wiki/Détecter_un_charset (1 hits)
#9: http://formats-ouverts.org/blog/2006/11/04/995-pour-lire-les-formats-sortez-le-hachoir (1 hits)

Human trafic: 90 hits on 2037 total hits (4.4%)
Period: 2007-09-02 07:41:13 to 2007-09-03 12:04:41 (1 day, 4:23:28)
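
The duration on the Period line can be checked from the first and last timestamps. This is a small sketch using the timestamp format printed above; the function name period is an assumption for illustration, not apache_log API.

```python
from datetime import datetime

# Format of the timestamps as printed on the "Period" line.
FORMAT = "%Y-%m-%d %H:%M:%S"

def period(first, last):
    """Return the elapsed time between two timestamps as a timedelta."""
    return datetime.strptime(last, FORMAT) - datetime.strptime(first, FORMAT)

print(period("2007-09-02 07:41:13", "2007-09-03 12:04:41"))  # 1 day, 4:23:28
```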

Example of use

from apache_log import (
    GenericFilter, ApacheLogParser_Stat,
    printSummary)

# Filter built from a single regular expression matching hachoir.org
class HachoirOrg(GenericFilter):
    def __init__(self):
        GenericFilter.__init__(self, [r"hachoir\.org"])

def runHachoir(*filenames):
    # Fields of each log line, in the order they appear
    syntax = "{host} - {user} {date} {request} {answer} {referrer} {user_agent}"
    r = ApacheLogParser_Stat(syntax, "hachoir.org")
    # Install the filter's handler on the parser
    r.ignore_handler = HachoirOrg().ignoreHandler
    # Parse each log file in turn, then print the statistics
    for filename in filenames:
        r.parseFile(filename)
    printSummary(r)

runHachoir(
    "/var/log/apache2/hachoir.log",
    "/var/log/apache2/hachoir.log.1",
)
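
The syntax string above names the fields of each log line. To illustrate what extracting those fields involves, here is a minimal sketch that assumes the standard Apache "combined" log format. The regular expression and the function parse_line are assumptions written for this example; they are not how apache_log itself parses lines.

```python
import re

# Assumption: Apache "combined" log format, e.g.
#   host ident user [date] "request" status size "referrer" "user agent"
LINE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE.match(line)
    return match.groupdict() if match else None
```

Lines that do not match the expected format return None, so a caller can count or skip them instead of failing on the whole file.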