Yahoo’s Open Search strategy is great news for mashup developers, but scrapers could also use it to grab your content and republish it on their own sites. Thankfully, Yahoo play by the rules and honour the robots exclusion protocol. This article will help you to block the services that can be used to scrape or remix your content.
None of the robots.txt changes will stop die-hard blackhats, though, because they are unlikely to be using Yahoo tools for their nefarious activities. Most blackhats have their own toolkits, so you would need to go through your server logs looking for patterns in IP addresses, user agents, cookie use and so on to block them effectively.
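As a rough starting point for that kind of log trawl, here is a minimal Python sketch that counts requests per IP address and user agent pair. It assumes your server writes the common "combined" access log format. The sample lines, addresses and the `ExampleScraper/0.1` agent are made up for illustration, so adjust the pattern to match your own logs.

```python
import re
from collections import Counter

# Assumption: lines follow the "combined" access log format:
# ip ident user [time] "request" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def tally_clients(lines):
    """Count requests per (IP address, user agent) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts


# Made-up sample data; in practice you would pass an open log file instead,
# e.g. tally_clients(open("access.log")) with your own (hypothetical) path.
sample = [
    '203.0.113.9 - - [10/Oct/2008:13:55:36 +0000] "GET /feed/ HTTP/1.1" 200 5120 "-" "ExampleScraper/0.1"',
    '203.0.113.9 - - [10/Oct/2008:13:55:39 +0000] "GET /feed/ HTTP/1.1" 200 5120 "-" "ExampleScraper/0.1"',
    '198.51.100.7 - - [10/Oct/2008:13:56:02 +0000] "GET / HTTP/1.1" 200 8042 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

# Busiest clients first
for (ip, agent), hits in tally_clients(sample).most_common():
    print(hits, ip, agent)
```

A client that hammers your feed far more often than any real reader would is a good candidate for a closer look, but the counts alone do not prove anything.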
It should be noted that I fully support Yahoo’s efforts to open up the search results and that I only recommend blocking their crawlers if you’re specifically having problems with content theft. I haven’t implemented any of these techniques on this blog, so you can remix away to your heart’s content.
I do take content theft and scraping seriously though, so I regularly check for plagiarism using tools like CopyScape and FairShare. I report all content theft to Google via its webmaster tools (which usually delists the site and stops the problem) and if problems persist I’ll contact the infringer’s ISP to get their site shut down. Luckily, I’ve not (yet) had to take any further action - probably a sign of the blog’s unpopularity :-)
Yahoo Pipes 
One of the most common ways that your content gets scraped is via your blog’s RSS feed, and Yahoo Pipes is a handy tool for tweaking and mashing RSS feeds. If you are publishing your feed through Feedburner, blocking Yahoo Pipes is easy. Just follow these steps:
- Log in to your Feedburner account and choose the feed that you are concerned about
- Click the “publicize” tab and then the “NoIndex” service
- You can then choose to block Yahoo Pipes by clicking the second check box - don’t forget to activate the service
If you don’t use a service like Feedburner but serve your RSS feed yourself, you need to block using either server configuration changes or robots.txt. From the FAQ, we can tell that the Yahoo Pipes user agent is “Yahoo Pipes 1.0”, so add the following to your robots.txt file:
User-agent: Yahoo Pipes 1.0
Disallow: /
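If you want to sanity-check a rule like this before uploading it, Python's standard `urllib.robotparser` module can parse the text and tell you how it reads it. This is only a sketch: the feed URL and the `SomeOtherReader/1.0` agent are hypothetical, and it shows Python's interpretation of the file, which is not necessarily identical to how Yahoo's crawler matches user agents.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule from the article
ROBOTS_TXT = """\
User-agent: Yahoo Pipes 1.0
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if Python's parser thinks user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


FEED_URL = "http://example.com/feed/"  # hypothetical feed address

print("Yahoo Pipes 1.0 allowed:", is_allowed(ROBOTS_TXT, "Yahoo Pipes 1.0", FEED_URL))
print("Other reader allowed:", is_allowed(ROBOTS_TXT, "SomeOtherReader/1.0", FEED_URL))
```

The first check should come back as not allowed and the second as allowed, since the file has no rule for other agents.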
YQL
I found it quite hard to find any information on YQL’s user agent string and ended up asking on Twitter. @jonathantrevor provided the answer:
YQL uses "Yahoo Pipes 2.0 " for fetching robots to see if its allowed, and then uses mozilla for the content
My tests confirmed this as correct, so a robots.txt entry for the “Yahoo Pipes 2.0” user agent is what blocks YQL.
Yahoo BOSS
Yahoo BOSS allows you to create your own search engines using Yahoo’s data (like my recipe search engine).
As it uses Yahoo’s search index, the only way to prevent access to your content through BOSS is to block Yahoo’s search crawler, Slurp, which is probably not what you want.
There are rumors of a closer integration of BOSS to some of the other services mentioned here, so if/when that happens the other blocking methods given here should apply.
SearchMonkey Data Services
SearchMonkey is a technology that allows developers to create widgets to embed into the Yahoo search results (which I mentioned before). It behaves differently to YQL and Pipes in that it does not use a web crawler, so you cannot use a robots.txt entry to deny access. Instead, you can modify your web server configuration to deny access to the user agent.
SearchMonkey user agent strings are quite distinctive, so you can alter your httpd.conf or even .htaccess files (if your host allows) to deny access with the following code:
SetEnvIfNoCase User-Agent "Yahoo! SearchMonkey 1.0" noMonkey
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=noMonkey
</Limit>
This code was taken from the SearchMonkey user guide, which also lists an email address that you can contact to have your pages blocked at the source.
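If you are not running Apache, the same idea can be applied at the application level. The following is a minimal sketch of a Python WSGI middleware that refuses requests whose user agent contains the SearchMonkey string. It is my own stand-in for the Apache rule rather than anything from the SearchMonkey user guide; `deny_search_monkey`, `hello_app` and `status_for` are hypothetical names, and unlike the `<Limit GET POST>` block it refuses every request method.

```python
from wsgiref.util import setup_testing_defaults

# User agent string from the Apache rule, lowercased for a
# case-insensitive substring match (roughly what SetEnvIfNoCase does)
BLOCKED_AGENT = "yahoo! searchmonkey 1.0"


def deny_search_monkey(app):
    """Wrap a WSGI app and answer 403 when the User-Agent matches."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if BLOCKED_AGENT in agent:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Trivial stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(agent):
    """Run one fake request through the wrapped app and return its status line."""
    environ = {"HTTP_USER_AGENT": agent}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    seen = []

    def start_response(status, headers, exc_info=None):
        seen.append(status)

    list(deny_search_monkey(hello_app)(environ, start_response))
    return seen[0]


print(status_for("Yahoo! SearchMonkey 1.0"))
print(status_for("Mozilla/5.0"))
```

Blocking in the web server is usually the better option where you have it, because the request never reaches your application at all.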
Putting It All Together
If you want to block all of these services, you’ll need to add the following to your robots.txt file:
User-agent: Yahoo Pipes 1.0
Disallow: /

User-agent: Yahoo Pipes 2.0
Disallow: /
Add this code to your server configuration too:
SetEnvIfNoCase User-Agent "Yahoo! SearchMonkey 1.0" noMonkey
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=noMonkey
</Limit>
There you have it, a few configuration changes to block opportunist scrapers.







More information on Pipes
Hi,
We document how to do a number of these (for Pipes), and a number of other things you can do to block the engine, in our documentation:
http://pipes.yahoo.com/pipes/docs?doc=troubleshooting#q6
I think it’s worth pointing out that once your content (pages, images) is indexed by a search engine (or syndicated via RSS for blog readers) it’s very much available outside of your web site. So if blocking a particular tool above is necessary, you probably want to block everything - in other words, decide if you want your content available to be found/used away from your site by everyone or not.
You might also want to cover the numerous technologies and companies that also go after content (Google, Dapper, Openkapow, IBM Mashup Center ...) and how to stop those too. Just do a search for mashup tools.
Jonathan