Robots.txt is a plain text file placed at the root of a website that provides instructions to web crawlers about which parts of the site they should or should not access. The file follows the Robots Exclusion Protocol (standardized in 2022 as RFC 9309), a convention that well-behaved crawlers respect, though it is advisory rather than enforceable.
The syntax is straightforward: User-agent directives specify which crawler the rules apply to, and Disallow directives list paths that should not be crawled. An Allow directive can override a Disallow for specific paths. The file can also include a Sitemap directive pointing to the site's XML sitemap.
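A minimal sketch of the syntax (the paths and sitemap URL are illustrative):

```
User-agent: *
Disallow: /admin/
Allow: /admin/help
Sitemap: https://example.com/sitemap.xml
```

Here the Allow line carves out a single public page from an otherwise blocked directory; under longest-match semantics the more specific Allow wins for /admin/help.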
For URL shortening services, robots.txt plays a dual role. The service's own website uses robots.txt to guide crawlers to important pages while blocking administrative areas. The redirect endpoints, however, should generally be accessible to crawlers so that search engines can follow short URLs and discover the destination content.
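This split can be checked with Python's standard-library `urllib.robotparser`; the domain `sho.rt` and the paths below are hypothetical examples of a shortener's layout:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for a URL shortener: block the admin area,
# leave the short-code redirect paths open to crawlers.
rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Redirect endpoints stay crawlable so search engines can
# follow short URLs to the destination content.
print(parser.can_fetch("*", "https://sho.rt/abc123"))       # True
print(parser.can_fetch("*", "https://sho.rt/admin/users"))  # False
```

In production the parser would fetch the live file via `set_url()` and `read()`; parsing an in-memory string keeps the sketch self-contained.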
Important considerations include not using robots.txt to hide sensitive content (it does not prevent indexing if pages are linked from elsewhere), testing the file with a robots.txt testing tool such as the one in Google Search Console, and being aware that different crawlers may interpret edge cases differently.