
Use Robots.txt To Prevent Yahoo Pipes & YQL From Scraping Your Site

Posted on 10 Jul 2009

Yahoo’s Open Search strategy is great news for mashup developers, but scrapers could also use it to grab your content and republish it on their own sites. Thankfully, Yahoo play by the rules and honour the robots exclusion protocol. This article will help you to block the services that can be used to scrape or remix your content.

None of the robots.txt changes will prevent die-hard blackhats, though, because they are unlikely to be using Yahoo tools for their nefarious activities. Most blackhats will have their own toolkits so you would need to go through your server logs to look for patterns of IP address, user agent, cookie use etc to block them effectively.
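As a starting point for that kind of log analysis, here is a rough Python sketch that counts requests per IP address and user agent pair. It assumes your server writes the common Apache "combined" log format; the `count_clients` helper and the sample log lines are made up for illustration, so adjust the pattern to whatever your own logs look like.

```python
import re
from collections import Counter

# Assumption: logs are in the Apache "combined" format. Adjust if yours differ.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_clients(lines):
    """Count requests per (IP address, user agent) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that are not in the expected format
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Jul/2009:10:00:00 +0000] "GET /feed.xml HTTP/1.1" 200 5120 "-" "ExampleScraper/0.1"',
    '203.0.113.7 - - [10/Jul/2009:10:00:05 +0000] "GET /post-1 HTTP/1.1" 200 8000 "-" "ExampleScraper/0.1"',
    '198.51.100.2 - - [10/Jul/2009:10:01:00 +0000] "GET / HTTP/1.1" 200 4000 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    # Print the busiest clients first.
    for (ip, agent), hits in count_clients(SAMPLE_LOG).most_common(10):
        print(hits, ip, agent)
```

Clients that request a lot of pages in a short time, or that send an unusual user agent, are the ones worth a closer look before you decide to block anything.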

It should be noted that I fully support Yahoo’s efforts to open up the search results and that I only recommend blocking their crawlers if you’re specifically having problems with content theft. I haven’t implemented any of these techniques on this blog, so you can remix away to your heart’s content.

I do take content theft and scraping seriously though, so I regularly check for plagiarism using tools like CopyScape and FairShare. I report all content theft to Google via its webmaster tools (which usually delists the site and stops the problem) and if problems persist I’ll contact the infringer’s ISP to get their site shut down. Luckily, I’ve not (yet) had to take any further action - probably a sign of the blog’s unpopularity :-)

Yahoo Pipes

One of the most common ways that your content gets scraped is via your blog’s RSS feed, and Yahoo Pipes is a handy tool for tweaking and mashing RSS feeds. If you publish your feed through Feedburner, blocking Yahoo Pipes is easy. Just follow these steps:

  1. Log in to your Feedburner account and choose the feed that you are concerned about
  2. Click the “publicize” tab and then the “NoIndex” service
  3. You can then choose to block Yahoo Pipes by clicking the second check box - don’t forget to activate the service

If you don’t use a service like Feedburner but serve your RSS feed yourself, you need to block using either server configuration changes or robots.txt. From the FAQ, we can tell that the Yahoo Pipes user agent is “Yahoo Pipes 1.0”, so add the following to your robots.txt file:

User-agent: Yahoo Pipes 1.0
Disallow: /
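
If you want to sanity-check the rule before deploying it, one option is to run it through the robots.txt parser in Python's standard library. This is only a sketch: the `is_allowed` helper and the `/feed.xml` path are hypothetical, and Python's matching rules are its own, so a pass here does not prove that Yahoo's fetcher interprets the file in exactly the same way.

```python
from urllib.robotparser import RobotFileParser

# The rule from this post, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Yahoo Pipes 1.0
Disallow: /
"""

def is_allowed(robots_txt, user_agent, path):
    """Return True if Python's parser would let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

if __name__ == "__main__":
    # /feed.xml is a hypothetical feed location.
    for agent in ("Yahoo Pipes 1.0", "Mozilla/5.0"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, "/feed.xml"))
```

The Pipes user agent should come back as blocked while an ordinary browser user agent is still allowed.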

YQL

I found it quite hard to find any information on YQL’s user agent string and ended up asking on Twitter. @jonathantrevor provided the answer:

YQL uses "Yahoo Pipes 2.0 " for fetching robots to see if its allowed, and then uses mozilla for the content

My tests confirmed this as correct.
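
Since YQL checks robots.txt as "Yahoo Pipes 2.0", a record that only names "Yahoo Pipes 1.0" will not cover it. The sketch below illustrates this with Python's standard robots.txt parser; the `can_fetch` helper is hypothetical, and Python's matching is only an approximation of how Yahoo's fetcher reads the file.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that only names the Pipes user agent.
PIPES_ONLY = """\
User-agent: Yahoo Pipes 1.0
Disallow: /
"""

# The same file with a second record for the agent YQL uses.
PIPES_AND_YQL = PIPES_ONLY + """
User-agent: Yahoo Pipes 2.0
Disallow: /
"""

def can_fetch(robots_txt, user_agent, path="/"):
    """Return True if Python's parser would let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

if __name__ == "__main__":
    yql_agent = "Yahoo Pipes 2.0"
    print("Pipes-only file allows YQL:", can_fetch(PIPES_ONLY, yql_agent))
    print("Two-record file allows YQL:", can_fetch(PIPES_AND_YQL, yql_agent))
```

Because YQL then fetches the content itself with a Mozilla user agent, the robots.txt record is the place to block it; filtering on the "Yahoo Pipes 2.0" string in your server configuration would not catch the content request.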

Yahoo BOSS

Yahoo BOSS allows you to create your own search engines using Yahoo’s data (like my recipe search engine).

As it uses Yahoo’s search index, the only way to prevent access to your content through BOSS is to block Yahoo’s search crawler, Slurp, which is probably not what you want.

There are rumors of a closer integration of BOSS with some of the other services mentioned here, so if/when that happens the other blocking methods given here should apply.

Search Monkey Data Services

Search Monkey is a technology that allows developers to create widgets to embed into the Yahoo search results (which I mentioned before). It behaves differently to YQL and Pipes in that it does not use a web crawler, so you cannot use a robots.txt entry to deny access. Instead, you can modify your web server configuration to deny access to the user agent.

Search Monkey user agent strings are quite distinctive, so you can alter your httpd.conf or even .htaccess files (if your host allows) to deny access with the following code:

SetEnvIfNoCase User-Agent "Yahoo! SearchMonkey 1.0" noMonkey
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=noMonkey
</Limit>
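
To get a feel for which requests that directive would catch, here is a small Python sketch. SetEnvIfNoCase treats its pattern as a case-insensitive regular expression matched anywhere in the header, so `re.search` with `re.IGNORECASE` is a reasonable stand-in for this simple pattern. The `would_be_denied` helper and the sample user agents are made up for illustration, and Apache's regex engine is not Python's, so treat this as an approximation.

```python
import re

# Same pattern as in the SetEnvIfNoCase line. Note that the "." in "1.0" is a
# regex wildcard, so it matches any single character, not just a literal dot.
SEARCHMONKEY = re.compile(r"Yahoo! SearchMonkey 1.0", re.IGNORECASE)

def would_be_denied(user_agent):
    """Return True if the pattern matches anywhere in the User-Agent value."""
    return SEARCHMONKEY.search(user_agent) is not None

if __name__ == "__main__":
    # Hypothetical User-Agent values for illustration.
    samples = [
        "Yahoo! SearchMonkey 1.0",
        "yahoo! searchmonkey 1.0",
        "Mozilla/5.0 (compatible; Yahoo! Slurp)",
    ]
    for agent in samples:
        print(agent, "->", "deny" if would_be_denied(agent) else "allow")
```

The important thing to check is that Slurp, Yahoo's ordinary search crawler, does not match, so this rule leaves your normal search indexing alone.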

This code was taken from the SearchMonkey user guide, which also lists an email address that you can contact to have your pages blocked at the source.

Putting It All Together

If you want to block all of these services, you’ll need to add the following to your robots.txt file:

User-agent: Yahoo Pipes 1.0
Disallow: /

User-agent: Yahoo Pipes 2.0
Disallow: /

Add this code to your server configuration too:

SetEnvIfNoCase User-Agent "Yahoo! SearchMonkey 1.0" noMonkey
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=noMonkey
</Limit>

There you have it, a few configuration changes to block opportunist scrapers.



4 Comments

 Jonathan Trevor said at 2009-07-10 17:33

More information on Pipes

Hi,

We document how to do a number of these (for Pipes), and a number of other things you can do to block the engine, in our documentation:

http://pipes.yahoo.com/pipes/docs?doc=troubleshooting#q6

I think it's worth pointing out that once your content (pages, images) is indexed by a search engine (or syndicated via RSS for blog readers) it's very much available outside of your web site. So if blocking a particular tool above is necessary, you probably want to block everything - in other words, decide if you want your content available to be found/used away from your site by everyone or not.

You might also want to cover the numerous technologies and companies that also go after content (google, dapper, open kapow, ibm mashup center ...) and how to stop those too. Just do a search for mashup tools.

Jonathan

 MMMeeja said at 2009-07-11 01:13

@Jonathan

Thanks for your input. You are of course right that there are many services that can be used legitimately to mashup your content, or for more nefarious purposes.

I wasn’t trying to pick on Yahoo technologies in particular, just didn't find the user agent information very readily so I thought that it would make a good blog post. I'll look into doing the same for other services too.

Thanks for stopping by and commenting.

 Jonathan Trevor said at 2009-07-11 06:38

@MMMeeja

Pipes information is pretty well documented :-) but we need to add the same information for YQL (it's pretty much the same as Pipes as you've shown), along with other information for content/api providers.

While the article is clearly valuable, the problem that many content providers face is that the "nefarious" purpose applications won't be well behaved and respect conventions like the useful tools you covered (who all respect the rights of the content provider to either explicitly or implicitly opt-out).

I know you use and enjoy these tools yourself, so please let us know (via the message boards) if you have any concerns we can address.

Thanks

Jonathan

 Makeup Tips said at 2009-08-19 10:23

Mcafee Warning

 Yesterday I got a security warning from McAfee telling me to edit or remove robots.txt. The warning was about the security risk involved with it. I know it matters from an SEO perspective. Most users target only Googlebot and forget about other indexing services. Thanks for the post.


Copyright © 2006-2009 MMMeeja Pty. Ltd. All rights reserved.