Apache log is a script that parses an Apache access.log file and produces statistics such as top referrers and top pages. It is useful for skipping search engine traffic (more than 90% of total traffic on some web servers) so that the statistics describe your real (human) visitors.
Features
- filter method: only process POST, GET and HEAD requests (skip WebDAV requests, used by Subversion)
- filter host (IP): ignore search engine hosts (e.g. search engines that send a valid user agent; see bot_hosts.py)
- filter user: skip requests from authenticated users (we already know them)
- filter URL: ignore requests for CSS, pictures, etc.
- filter referrer: ignore self-references, for example
- filter user agent: ignore automatic traffic (search engine crawlers); a whitelist (see user_agents.py) avoids false positives
- filters use regular expressions
- fully written in Python
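The filter chain above can be sketched as follows. This is only an illustration of the idea, not the code shipped in apache_log: the function name is_human_request, every pattern, and the order of the checks are assumptions, and the real host and user agent lists in bot_hosts.py and user_agents.py are not reproduced here.

```python
import re

# All patterns below are illustrative assumptions; the project's real lists
# live in bot_hosts.py and user_agents.py and are not reproduced here.
ALLOWED_METHODS = {"GET", "POST", "HEAD"}
BOT_HOSTS = re.compile(r"^192\.0\.2\.")  # placeholder (documentation IP range)
STATIC_URL = re.compile(r"\.(css|js|png|jpe?g|gif|ico)(\?.*)?$", re.IGNORECASE)
SELF_REFERRER = re.compile(r"hachoir\.org")
BOT_AGENT = re.compile(r"bot|crawler|spider", re.IGNORECASE)
AGENT_WHITELIST = re.compile(r"Firefox|Opera|MSIE")


def is_human_request(method, host, user, url, referrer, user_agent):
    """Return True when a request passes every filter (hypothetical helper)."""
    if method not in ALLOWED_METHODS:
        return False  # e.g. WebDAV methods such as PROPFIND
    if BOT_HOSTS.search(host):
        return False  # known search engine host
    if user != "-":
        return False  # Apache logs "-" when nobody is authenticated
    if STATIC_URL.search(url):
        return False  # CSS, pictures, etc.
    if SELF_REFERRER.search(referrer):
        return False  # visitor came from our own site
    # The whitelist is checked first so that a known browser is never
    # rejected just because its user agent happens to contain "bot".
    if not AGENT_WHITELIST.search(user_agent) and BOT_AGENT.search(user_agent):
        return False
    return True
```

Checking the whitelist before the crawler pattern is one way to read "avoid false positives": a browser whose user agent merely contains a crawler-like substring still counts as human.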
Statistics
- top pages by referrer
- top pages by number of hits
- top pages by host
- first and last timestamp
- human traffic as a percentage of total traffic (between 0.1% and 5% on my servers)
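As a rough sketch of how such statistics could be computed once the filters have run, the snippet below counts hosts, pages and referrers with collections.Counter. The summarize function, the tuple layout and the sample hits are all made up for illustration; they are not part of apache_log.

```python
from collections import Counter
from datetime import datetime


def summarize(human_hits, total_hits, top=3):
    """Summarize hits given as (timestamp, host, page, referrer) tuples.

    Hypothetical helper: human_hits are the requests that survived the
    filters, total_hits is the number of lines parsed.
    """
    timestamps = [hit[0] for hit in human_hits]
    return {
        "top_hosts": Counter(hit[1] for hit in human_hits).most_common(top),
        "top_pages": Counter(hit[2] for hit in human_hits).most_common(top),
        # Apache logs "-" when the client sent no referrer; skip those.
        "top_referrers": Counter(
            hit[3] for hit in human_hits if hit[3] != "-"
        ).most_common(top),
        "first": min(timestamps) if timestamps else None,
        "last": max(timestamps) if timestamps else None,
        "human_percent": (
            100.0 * len(human_hits) / total_hits if total_hits else 0.0
        ),
    }


# Made-up sample data, using documentation IP ranges and example.org.
hits = [
    (datetime(2007, 9, 2, 7, 41, 13), "198.51.100.7", "/", "-"),
    (datetime(2007, 9, 2, 9, 0, 0), "198.51.100.7", "/",
     "http://example.org/a"),
    (datetime(2007, 9, 3, 12, 4, 41), "203.0.113.5", "/wiki/hachoir-parser",
     "http://example.org/a"),
]
summary = summarize(hits, total_hits=60)
print(summary["top_pages"])
print("Human traffic: %.1f%%" % summary["human_percent"])
print("Period:", summary["first"], "to", summary["last"])
```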
Download
svn co http://haypo.hachoir.org/svn/apache_log
Browse source code
Browse Python source code (see also root directory).
Example of result
Load file /var/log/apache2/hachoir.log ...
File /var/log/apache2/hachoir.log parsed (2037 lines).

=== Top host ===
#1: 209.85.238.2 (9 hits)
#2: 124.168.200.44 (7 hits)
#3: 212.226.169.252 (7 hits)
#4: 80.236.234.95 (5 hits)

=== Top page ===
#1: / (24 hits)
#2: /wiki/hachoir-parser (13 hits)
#3: /wiki/hachoir-metadata (10 hits)
#4: /log/?limit=100&mode=stop_on_copy&format=rss (9 hits)
#5: /wiki/hachoir-core (6 hits)
#6: /wiki/hachoir-urwid (3 hits)
#7: /ticket/153 (2 hits)
#8: /wiki/WikiStart (2 hits)
#9: /wiki/Canoscan5000F (2 hits)

=== Top referrer ===
#1: http://www.forensicfocus.com/index.php?name=News&file=article&sid=762 (6 hits)
#2: http://linuxfr.org/2006/12/19/21787.html (6 hits)
#3: http://www.haypocalc.com/wiki/Hachoir (4 hits)
#4: http://wiki.wireshark.org/CaptureSetup/USB (2 hits)
#5: http://themacelite.com/forums/viewtopic.php?t=9&postdays=0&postorder=asc&start=15 (2 hits)
#6: http://www.advogato.org/person/follower/diary.html?start=80 (2 hits)
#7: http://cheeseshop.python.org/pypi/hachoir-parser (2 hits)
#8: http://www.haypocalc.com/wiki/Détecter_un_charset (1 hits)
#9: http://formats-ouverts.org/blog/2006/11/04/995-pour-lire-les-formats-sortez-le-hachoir (1 hits)

Human trafic: 90 hits on 2037 total hits (4.4%)
Period: 2007-09-02 07:41:13 to 2007-09-03 12:04:41 (1 day, 4:23:28)
Example of use
from apache_log import (
    GenericFilter, ApacheLogParser_Stat, printSummary)

class HachoirOrg(GenericFilter):
    def __init__(self):
        GenericFilter.__init__(self, [r"hachoir\.org"])

def runHachoir(*filenames):
    syntax = "{host} - {user} {date} {request} {answer} {referrer} {user_agent}"
    r = ApacheLogParser_Stat(syntax, "hachoir.org")
    r.ignore_handler = HachoirOrg().ignoreHandler
    for filename in filenames:
        r.parseFile(filename)
    printSummary(r)

runHachoir(
    "/var/log/apache2/hachoir.log",
    "/var/log/apache2/hachoir.log.1",
)
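The syntax string names the fields of one log line. How ApacheLogParser_Stat interprets it internally is not shown here; one plausible approach, sketched below, is to turn each {name} placeholder into a named regular expression group. The compile_syntax function, the per-field patterns and the sample log line are assumptions, and the sketch supposes Apache's combined log format, with {answer} covering both the status code and the response size.

```python
import re

# Assumed per-field patterns for Apache's combined log format.
# These are hypothetical; apache_log may define its fields differently.
FIELD_PATTERNS = {
    "host": r"\S+",
    "user": r"\S+",
    "date": r"\[[^\]]+\]",
    "request": r'"[^"]*"',
    "answer": r"\d{3} \S+",  # assumption: status code followed by size
    "referrer": r'"[^"]*"',
    "user_agent": r'"[^"]*"',
}


def compile_syntax(syntax):
    """Build a regex with one named group per {field} placeholder."""
    parts = []
    pos = 0
    for match in re.finditer(r"\{(\w+)\}", syntax):
        # Text between placeholders (spaces, " - ") must match literally.
        parts.append(re.escape(syntax[pos:match.start()]))
        name = match.group(1)
        parts.append("(?P<%s>%s)" % (name, FIELD_PATTERNS[name]))
        pos = match.end()
    parts.append(re.escape(syntax[pos:]))
    return re.compile("^" + "".join(parts) + "$")


syntax = "{host} - {user} {date} {request} {answer} {referrer} {user_agent}"
parser = compile_syntax(syntax)

# Made-up log line (documentation IP address and example.org URL).
line = (
    '192.0.2.10 - - [02/Sep/2007:07:41:13 +0200] '
    '"GET /wiki/hachoir-parser HTTP/1.1" 200 5120 '
    '"http://example.org/article" "Mozilla/5.0 (X11; Linux) Firefox/2.0"'
)
fields = parser.match(line).groupdict()
print(fields["host"], fields["request"], fields["answer"])
```

Escaping the literal text between placeholders keeps separators such as " - " from being read as regex syntax, and a line that does not fit the syntax simply fails to match.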
