Having covered why sitemaps are important, we now turn our attention to the importance of a good robots.txt file for a Magento store. (If you haven’t read it already, both topics arose from our post containing a ten-point health check for your online store.)
What is a robots.txt file?
A robots.txt file is, as its name suggests, a small text file which is there for robots. Not actual robots, obviously, but search engine spiders. It’s used by website owners to give instructions to the search engine spiders which constantly crawl the web updating their owners’ search results. Any spider which adheres to the robots.txt protocol (it’s not binding, as there’s no way to enforce it, but all the good spiders adhere to it) checks for a robots.txt file at the root of a site before crawling it.
What does a robots.txt file contain?
The most important part of a robots.txt file contains instructions telling spiders whether they’re welcome or not. The very simplest robots.txt would ban all spiders by saying the following:
User-agent: *
Disallow: /
The first line says “this applies to all of you spiders”, and the second line says “don’t index anything”. This is fine if you’re developing a site and it’s not ready to be indexed yet, but beyond that it’s not going to do much for your search engine optimisation.
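One way to confirm what a rule set does before publishing it is to run it through a parser. The sketch below uses the urllib.robotparser module from Python's standard library; the bot name ExampleBot, the helper can_crawl and the page address are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The "ban everyone" robots.txt from above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a well-behaved bot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# ExampleBot is a hypothetical spider name.
print(can_crawl(ROBOTS_TXT, "ExampleBot", "http://www.yoursite.com/any-page.html"))  # prints False
```

Because the rule is "Disallow: /", every path on the site is refused, whichever bot asks.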
If you want to block bots that are annoying (they put too much strain on your server by constantly crawling your site, or they belong to search engines in which you’re never going to want to appear), you can target specific bots as follows:
User-agent: FatBot
Disallow: /
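As a rough check that a targeted rule only affects the bot it names, the same standard-library parser can be asked about two different user agents. FatBot comes from the example above; FriendlyBot, the helper name and the URL are assumptions made for the sketch.

```python
from urllib.robotparser import RobotFileParser

# Only FatBot is named, so no other spider has a rule applying to it.
ROBOTS_TXT = """\
User-agent: FatBot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules tell this user agent to stay away from the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


url = "http://www.yoursite.com/"              # placeholder address
print(is_blocked(ROBOTS_TXT, "FatBot", url))       # prints True
print(is_blocked(ROBOTS_TXT, "FriendlyBot", url))  # prints False (hypothetical bot, no rule matches it)
```

A bot with no matching group and no "User-agent: *" fallback is treated as allowed.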
Preventing specific files or folders from being indexed
To disallow specific files, add the following to the robots.txt:
Disallow: /file1.php
Disallow: /file2.sh
To prevent a directory and all of its contents from being indexed, add the following:
Disallow: /directory1/
Disallow: /directory2/
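File and directory rules are matched as path prefixes, which is easy to see by testing a few paths. This is a minimal sketch using Python's urllib.robotparser; the "User-agent: *" line is added here as an assumption (a parser ignores Disallow lines that sit outside a group), and ExampleBot and the sample paths are invented.

```python
from urllib.robotparser import RobotFileParser

# "User-agent: *" is assumed; the Disallow lines are the ones shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /file1.php
Disallow: /file2.sh
Disallow: /directory1/
Disallow: /directory2/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Return True if the hypothetical ExampleBot may fetch this path."""
    return parser.can_fetch("ExampleBot", path)


print(allowed("/file1.php"))              # prints False: the file is listed
print(allowed("/directory1/page.html"))   # prints False: everything under the directory is covered
print(allowed("/directory10/page.html"))  # prints True: a different directory, the prefix does not match
print(allowed("/index.html"))             # prints True: no rule mentions it
```

The trailing slash on a directory rule matters: "/directory1/" does not catch "/directory10/".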
A robots.txt file for Magento stores
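Over the years we’ve developed a set of instructions for a robots.txt specifically for a Magento store. Like any Disallow rules, they belong beneath a User-agent line such as the User-agent: * shown earlier: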
Disallow: /app/
Disallow: /downloader/
Disallow: /errors/
Disallow: /cgi-bin/
Disallow: /includes/
Disallow: /lib/
Disallow: /pkginfo/
Disallow: /shell/
Disallow: /var/
Disallow: /catalogsearch/
Disallow: /catalog/seo_sitemap/category/
Disallow: /catalog/seo_sitemap/product/
Disallow: /index.php/
Disallow: /catalogsearch/result/
Disallow: /catalogsearch/result/index/
Disallow: /catalogsearch/result/index/?*
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
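With a list this long, it is worth checking that the pages you do want in search results have not been caught by accident. The sketch below sorts a handful of store paths into allowed and blocked using Python's urllib.robotparser. It uses only a short excerpt of the rules, assumes a "User-agent: *" line above them, and the sample paths and ExampleBot are invented for illustration.

```python
from typing import List, Tuple
from urllib.robotparser import RobotFileParser

# Excerpt of the Magento rules; the "User-agent: *" line is an assumption.
MAGENTO_EXCERPT = """\
User-agent: *
Disallow: /catalogsearch/
Disallow: /index.php/
Disallow: /customer/
Disallow: /cron.php
"""

# Hypothetical store paths, not taken from a real site.
SAMPLE_PATHS = [
    "/mens-shirts.html",
    "/customer/account/login/",
    "/catalogsearch/result/?q=shirt",
    "/index.php/mens-shirts.html",
    "/cron.php",
]


def split_by_access(robots_txt: str, paths: List[str]) -> Tuple[List[str], List[str]]:
    """Return (allowed, blocked) paths for a hypothetical bot called ExampleBot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed, blocked = [], []
    for path in paths:
        target = allowed if parser.can_fetch("ExampleBot", path) else blocked
        target.append(path)
    return allowed, blocked


allowed, blocked = split_by_access(MAGENTO_EXCERPT, SAMPLE_PATHS)
print(allowed)  # prints ['/mens-shirts.html']
print(blocked)
```

Note that this parser does plain prefix matching, so it will not tell you how a particular search engine treats the wildcard in a rule such as /catalogsearch/result/index/?*.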
To restrain the behaviour of bots
You can limit the speed with which bots crawl your site as follows:
User-agent: *
Crawl-delay: 10

User-agent: Baiduspider
Crawl-delay: 20
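This asks bots to crawl no more than one page every ten seconds, with a more specific rule for Baidu’s spider that slows it to one page every twenty seconds. Crawl-delay is a non-standard directive, so not every spider honours it; Googlebot, for one, ignores it.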
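To see which delay a given bot would pick up, the rules can be read back with the crawl_delay method of Python's urllib.robotparser (available from Python 3.6). This is only a sketch: the helper delay_for and the bot name ExampleBot are assumptions, and it shows what a compliant parser reads, not what any particular search engine does.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10

User-agent: Baiduspider
Crawl-delay: 20
"""


def delay_for(robots_txt: str, user_agent: str) -> Optional[int]:
    """Return the Crawl-delay in seconds that applies to this user agent, if any."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


print(delay_for(ROBOTS_TXT, "Baiduspider"))  # prints 20: the specific group wins
print(delay_for(ROBOTS_TXT, "ExampleBot"))   # prints 10: falls back to the * group
```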
Linking in your sitemap
The final use for a robots.txt is to signpost to bots where your sitemap is – since bots check in with robots.txt first, it’s a nice little marker for them, particularly if your sitemap isn’t at a standard location like www.yoursite.com/sitemap.xml.
# Website Sitemap
Sitemap: http://www.theyorkshirepantry.com/sitemap.xml
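A crawler finds the signpost simply by reading the Sitemap lines out of the file. The following sketch does the same with the site_maps method of Python's urllib.robotparser, which needs Python 3.8 or later; the helper name sitemap_urls is an assumption.

```python
from typing import List
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Website Sitemap
Sitemap: http://www.theyorkshirepantry.com/sitemap.xml
"""


def sitemap_urls(robots_txt: str) -> List[str]:
    """Return every sitemap URL declared in the robots.txt text (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


print(sitemap_urls(ROBOTS_TXT))  # prints ['http://www.theyorkshirepantry.com/sitemap.xml']
```

The Sitemap line is independent of any User-agent group, so it can sit anywhere in the file.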
One thing to bear in mind
A robots.txt file is public, so there’s no point trying to use it to hide files or folders that you don’t want people to find: the act of hiding them will itself be visible in the robots.txt file. Use browser authentication (a password prompt) instead.