iwla Commit Details

Date: 2017-05-25 21:04:18
Author: Grégory Soutadé
Branch: dev, master
Commit: 68a67adecc716c09f68356f42a2e22f02b494a67
Parents: 4bc2c1ad4ba1f1f2cf560722ec247e4ee51e5233
Message: Add one more rule to robot detection : more than ten 404 pages viewed

Changes:
M plugins/pre_analysis/robots.py (2 diffs)

File differences

plugins/pre_analysis/robots.py
         self._setRobot(k, super_hit)
         continue
+    not_found_pages = 0
     for hit in super_hit['requests']:
         # 3) /robots.txt read
         if hit['extract_request']['http_uri'].endswith('/robots.txt'):
             self._setRobot(k, super_hit)
             break
+        if int(hit['status']) == 404:
+            not_found_pages += 1
         # 4) Any referer for hits
         if not hit['is_page'] and hit['http_referer']:
             referers += 1
......
         self._setRobot(k, super_hit)
         continue
+    # 5) more than 10 404 pages
+    if not_found_pages > 10:
+        self._setRobot(k, super_hit)
+        continue
     if not super_hit['viewed_pages'] and \
        (super_hit['viewed_hits'] and not referers):
         self._setRobot(k, super_hit)
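
Rule 5 can be tried in isolation from the plugin. The following is a minimal sketch, not the plugin's own code: the function names count_not_found and exceeds_not_found_limit, the limit parameter, and the sample data are assumptions made for illustration. Only the 'status' key, the int() conversion, and the threshold of ten are taken from the diff.

```python
def count_not_found(requests):
    """Count the requests whose HTTP status is 404."""
    # The diff converts the status with int() before comparing,
    # so the same conversion is applied here.
    return sum(1 for hit in requests if int(hit['status']) == 404)


def exceeds_not_found_limit(requests, limit=10):
    """Return True when strictly more than `limit` requests returned 404."""
    return count_not_found(requests) > limit


if __name__ == '__main__':
    # Hypothetical visitor: eleven 404 responses and three successful ones.
    visitor = [{'status': '404'}] * 11 + [{'status': '200'}] * 3
    print(exceeds_not_found_limit(visitor))
```

The comparison is strict, as in the commit: a visitor with exactly ten 404 responses is not flagged by this rule alone.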
