Discussion:
Blocking SEO robots
David Anderson
2014-08-06 09:50:52 UTC
This isn't specifically a WP issue, but I think it will be relevant to
lots of us, trying to maximise our resources...

Issue: I find that a disproportionate amount of server resources is
consumed by a certain subset of crawlers/robots which contribute nothing.
I'd like to just block them. I have in mind the various semi-private
search engines run by SEO companies/backlink-checkers, e.g.
http://en.seokicks.de/, https://ahrefs.com/. These things happily spider
a few thousand pages: every author, tag, category, etc., archive. Some
of them refuse to obey robots.txt (the one that specifically annoys me
is when they ignore the Crawl-delay directive; I even came across one
that proudly had a section on its website explaining that robots.txt was
a stupid idea, so they always ignored it!).

I'd like to just block such crawlers. So: does anyone know where a
reliable list of the IP addresses used by these services is kept?
Specifically, I want to block the semi-private or obscure crawlers that
do nothing useful for my sites. I don't want to block mainstream search
engines, of course. I've done some Googling, and haven't managed to find
something that makes this distinction.

Or alternatively - anyone think this is a bad idea?

Best wishes,
David
--
UpdraftPlus - best WordPress backups - http://updraftplus.com
WordShell - WordPress fast from the CLI - http://wordshell.net
Eric Hendrix
2014-08-06 09:58:00 UTC
This is not a bad idea at all, and I'd like to second the request if
anyone has researched this previously. David is correct; I've found the
same issue with valuable server resources being eaten up, especially
when you're running a handful of heavy WP sites.

So, bot experts, what say you?
Post by David Anderson
[snip]
_______________________________________________
wp-hackers mailing list
http://lists.automattic.com/mailman/listinfo/wp-hackers
--
Eric A. Hendrix, USA, MSG(R)
****@gmail.com
(910) 644-8940

*"Non Timebo Mala"*
Haluk Karamete
2014-08-06 10:06:47 UTC
Could this list help you? -> http://www.robotstxt.org/db/all.txt

Source:
http://stackoverflow.com/questions/1717049/tell-bots-apart-from-human-visitors-for-stats
Post by Eric Hendrix
[snip]
Blue Chives
2014-08-06 10:15:05 UTC
Depending on the web server software you are using, you can look at using the .htaccess file and blocking users/bots based on their user agent.

This article should help:

http://www.javascriptkit.com/howto/htaccess13.shtml
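
For example, a minimal sketch of such rules for Apache with mod_rewrite
(the user-agent names here are illustrative, not a vetted list):

```apache
# Return 403 for requests whose User-Agent matches any listed bot.
# Bot names below are examples only - build your own list from your logs.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (SEOkicks|AhrefsBot|MJ12bot) [NC]
RewriteRule .* - [F,L]
```

Matching requests get a 403 before WordPress (and MySQL) are ever
touched, which is where the saving comes from.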

Alternatively, do drop me a line if you would like a hand with this, as we manage the hosting for a number of WordPress blogs/websites.


Cheers
John
Post by Eric Hendrix
[snip]
David Anderson
2014-08-06 12:08:26 UTC
Post by Haluk Karamete
Could this list help you? -> http://www.robotstxt.org/db/all.txt
At first glance this looks potentially useful, since it is in a
machine-readable format and can be parsed to find the bots that match
specified criteria... but on a second look, it seems less useful. I
searched for three of the bots I've seen most regularly in my recent
logs (SEOKicks, Ahrefs, Majestic12), and it doesn't have any of them.
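
For what it's worth, the parsing side is easy. A sketch in Python,
assuming the database's blank-line-separated "field: value" record
format (the sample records below are made up):

```python
def parse_robot_db(text):
    """Parse a robotstxt.org-style database dump into a list of dicts.

    Assumes records are blank-line-separated blocks of 'field: value'
    lines, as in http://www.robotstxt.org/db/all.txt.
    """
    robots = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                record[key.strip()] = value.strip()
        if record:
            robots.append(record)
    return robots

# Illustrative sample data, not real entries from the database
sample = """robot-id: examplebot
robot-name: Example Bot
robot-useragent: ExampleBot/1.0

robot-id: otherbot
robot-name: Other Bot
robot-useragent: OtherBot/2.0"""

agents = [r["robot-useragent"] for r in parse_robot_db(sample)]
```

From there it would be trivial to emit .htaccess rules, if only the
database actually contained the bots in question.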
Post by Blue Chives
Depending on the web server software you are using you can look at using the htaccess file and block users/bot based on their user agent.
http://www.javascriptkit.com/howto/htaccess13.shtml
The issue's not about how to write blocklist rules; it's about having a
reliable, maintained, categorised list of bots, such that it's easy to
automate the blocklist. Turning the list into .htaccess rules is the
easy bit; what I want to avoid is having to spend hours churning through
log files to obtain the source data, because it feels very much like
something for which pre-existing data 'ought' to exist, given how many
watts the world's servers must be wasting on such bots.

Best wishes,
David
--
UpdraftPlus - best WordPress backups - http://updraftplus.com
WordShell - WordPress fast from the CLI - http://wordshell.net
Jeremy Clarke
2014-08-06 12:31:29 UTC
Post by David Anderson
[snip]
The best answer is the htaccess-based blacklists from PerishablePress. I
think this is the latest one:

http://perishablepress.com/5g-blacklist-2013/

He uses a mix of blocked user agents, blocked IPs and blocked requests
(e.g. /admin.php, intrusion scans for other software). He's been
updating it for years, and it's definitely a WP-centric project.

In the past some good stuff has been blocked by his lists (the Facebook
spider was blocked because it had an empty user agent; common spiders
used by academics were blocked), but that's bound to happen, and I'm
sure every UA has been used by a spammer at some point.

I run a ton of sites on my server, so I hate the .htaccess format
(which is a pain to implement alongside WP Super Cache rules). If I
used multisite it would be less of a big deal. Either way, know that
you can block UAs for all virtual hosts if that's relevant.

Note that IP blocking is a lot more effective at the server level,
because blocking with Apache still uses a ton of resources (though at
least no MySQL etc.). On Linux, an iptables-based block is much more
efficient.
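
For instance, a sketch of dropping a crawler's addresses at the
firewall (the range below is a documentation range, purely
illustrative):

```shell
# Drop all traffic from an offending crawler's range before Apache
# ever sees it. 203.0.113.0/24 is an illustrative placeholder range.
iptables -A INPUT -s 203.0.113.0/24 -j DROP

# Persist across reboots (Debian/Ubuntu-style example)
iptables-save > /etc/iptables/rules.v4
```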
--
Jeremy Clarke
Code and Design • globalvoicesonline.org
Daniel
2014-08-07 04:26:22 UTC
Set up a trap: a link hidden by CSS on each page that, if hit, gets the
IP blacklisted for a period of time. No human will ever come across the
link unless they're digging, and no bot actually renders the entire
page out before deciding what to follow.
Post by Jeremy Clarke
[snip]
--
-Dan
Daniel
2014-08-07 04:28:33 UTC
Almost forgot: the link should be in a subdirectory that is disallowed
in robots.txt, so only clients ignoring robots.txt hit it.
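
The robots.txt side of that is just (the /trap/ path is illustrative):

```
User-agent: *
Disallow: /trap/
```

Compliant crawlers never enter /trap/, so anything that does is either
a human digging around or a bot that ignores robots.txt.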
Post by Daniel
[snip]
--
-Dan
Micky Hulse
2014-08-07 04:30:43 UTC
Post by Daniel
the link should be in a subdirectory that is marked in
robots.txt to ignore, so anything ignoring robots.txt is what's hit.
That's an awesome tip! :)

Thanks!!!!
--
<git.io/micky>
Daniel
2014-08-07 05:31:13 UTC
It also works for forms: in addition to a captcha, have a hidden form
field, and anything touching that input gets denied :)
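
A sketch of that check in Python, assuming the page includes an input
named "website_url" hidden with CSS (the field name is illustrative):

```python
def is_spam_submission(form):
    """Reject a form post if the hidden honeypot field was filled in.

    Real users never see the CSS-hidden 'website_url' input, so they
    leave it empty; naive form-filling bots populate every field.
    """
    return bool(form.get("website_url", "").strip())

# Example submissions (made-up data)
human = {"name": "Alice", "comment": "Hi!", "website_url": ""}
bot = {"name": "x", "comment": "buy stuff", "website_url": "http://spam.example"}
```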
Post by Micky Hulse
[snip]
--
-Dan
Micky Hulse
2014-08-07 05:59:55 UTC
Post by Daniel
It also works for forms: in addition to a captcha, have a hidden form
field, and anything touching that input gets denied :)
Nice! I'm looking forward to giving that a try. Thanks again for sharing tips!

This thread has been a good read.
--
<git.io/micky>
Daniel Fenn
2014-08-07 04:31:16 UTC
I like to use a nice tool from http://www.spambotsecurity.com/, though
it may cause more issues for some people. The best thing is that it's
very fast and doesn't slow things down, unlike .htaccess.
Regards,
Daniel Fenn
Post by Daniel
[snip]
David Anderson
2014-08-07 08:20:18 UTC
Post by Jeremy Clarke
The best answer is the htaccess-based blacklists from PerishablePress. I
http://perishablepress.com/5g-blacklist-2013/
This looks like an interesting list, but it doesn't fit the use case.
The proprietor says "the 5G Blacklist helps reduce the number of
malicious URL requests that hit your website", and reading the list
confirms that's what he's aiming for. I'm aiming to block non-malicious
actors who are running their own private search engines, i.e. those who
want to spider the web as part of creating their own non-public
products (e.g. databases of SEO back-links). It's not about site
security; it's about not being spidered every day by search engines
that Joe Public will never use. If you have a shared server hosting
many sites for your managed clients, this quickly adds up.

At the moment the best solution I have is adding a robots.txt to every
site with "Crawl-delay: 15" in it, to slow down the rate of compliant
bots and spread the load around a bit.
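
That robots.txt looks roughly like:

```
User-agent: *
Crawl-delay: 15
```

Note that Crawl-delay is a non-standard extension: some crawlers honour
it, some (including, by their own account, Google) ignore it, so it
only throttles the better-behaved bots.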

Best wishes,
David
--
UpdraftPlus - best WordPress backups - http://updraftplus.com
WordShell - WordPress fast from the CLI - http://wordshell.net