Everything You Need to Know About Robots.txt for SEO

Between your website and search engines, one small text file can make a big difference: robots.txt. This file tells web crawlers (like Googlebot) which parts of your site they should visit and which parts they should skip. It’s often seen as a way to “stop Google from indexing a page,” but that’s not exactly how it works (we’ll explain that later). What it really does is control crawling, which can affect how efficiently search engines use their time on your site and which pages they focus on.

Knowing how to set up robots.txt is a basic SEO skill. The Robots Exclusion Protocol has existed since 1994 and became an official standard in 2022. Used the right way, robots.txt helps you direct crawl activity, reduce wasted crawler visits, and keep bots out of areas you don’t want crawled. Below is a clear guide to what it is, how it works, and how to use it safely.

What is robots.txt and why does it impact SEO?

robots.txt is a plain text file placed at the main (root) level of your website. It contains rules for bots that crawl the web and collect pages for search results. You can think of it as a sign at the front door that says, “You can go here, but please don’t go there.” Search engines usually respect these rules, and that affects what they crawl and how much time they spend on different parts of your site.
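
Here is a minimal, hypothetical example of a complete robots.txt file (each directive is explained in detail later in this guide):

# Allow all bots to crawl everything except the checkout area
User-agent: *
Disallow: /checkout/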

Blocking crawlers from certain URLs can help search engines spend more time on your most important pages. It can also stop crawlers from getting stuck in endless URL paths (for example, pages created by filters that generate thousands of near-duplicate URLs). This leads to cleaner crawling, better use of crawl budget, and fewer wasted requests. Without a good robots.txt, crawlers may spend too much time on low-value pages, put pressure on your server, or miss important content because they ran out of crawl time.

How do search engines use robots.txt?

Before a crawler visits pages on a domain it hasn’t seen before, it will usually request robots.txt first. It reads that file to learn the rules for that site, then decides which URLs it is allowed to crawl and which ones it should skip.
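
As a rough sketch, that first request looks something like this (the exact user-agent string varies by crawler):

GET /robots.txt HTTP/1.1
Host: www.example.com
User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)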

Most major search engines follow these rules, mainly to avoid hitting your server too hard and to use their crawl resources wisely. But robots.txt is not a security tool. Good bots follow it, but bad bots may ignore it. Some scrapers may even read it to find areas you tried to keep out of sight.

What is the Robots Exclusion Protocol?

robots.txt is based on the Robots Exclusion Protocol (REP). REP started in 1994 as a shared agreement on how websites could communicate crawl rules to bots. It was widely used for decades and became an official standard in 2022.

REP defines the basic format and commands used in robots.txt. Site owners use it to manage bot access, reduce unnecessary crawling, and sometimes suggest crawl pacing. Bots are not forced to obey, but major search engines like Google, Bing, and Yandex generally do. That makes robots.txt a practical way to guide crawling, even though it’s still based on cooperation.

Where should you place the robots.txt file on your website?

robots.txt only works if it’s in the correct location. It must be placed in the root directory of your domain. If your site is www.example.com, the file must be reachable at:

https://www.example.com/robots.txt

If you put it anywhere else (like https://www.example.com/folder/robots.txt), crawlers will not use it.

The filename is also case-sensitive. It must be exactly robots.txt. Variations like Robots.txt can cause crawlers to miss it. Also, each subdomain needs its own file. So if you have blog.example.com and shop.example.com, each subdomain needs its own robots.txt in its own root.
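
For example, these would be three independent files, each governing only its own host:

https://www.example.com/robots.txt
https://blog.example.com/robots.txt
https://shop.example.com/robots.txt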

How to access your robots.txt file online

You can view a site’s robots.txt by adding /robots.txt to the main domain. Example:

https://www.cloudflare.com/robots.txt

Because it is public, you should never use robots.txt to hide private data. Anyone can read it.

For your own site, you can edit robots.txt through your hosting file manager, FTP, or sometimes through a CMS. Many WordPress SEO plugins (like Yoast SEO) offer a simple editor so you don’t have to edit server files directly. This also helps reduce the risk of mistakes that could block your whole site.

Robots.txt syntax: Rules, directives, and structure

robots.txt uses a simple format made of groups of rules (often called “blocks”). Each block starts with a User-agent line (which bot the rules apply to), followed by lines like Disallow or Allow.

# A group of rules for a specific bot
User-agent: Googlebot
Disallow: /private/

# A group of rules for all bots
User-agent: *
Disallow: /tmp/

Directive names like Allow and Disallow are not case-sensitive, but the URL paths you list are. Many people capitalize the directives purely for readability, which helps when you review the file later.
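
For example, these two rules block two different directories, because paths are matched case-sensitively:

# Paths are case-sensitive: these rules target different directories
User-agent: *
Disallow: /Photo/
Disallow: /photo/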

User-agent directive explained

User-agent tells which crawler the rules apply to. Examples:

  • User-agent: Googlebot applies to Google’s main crawler.
  • User-agent: Bingbot applies to Bing’s crawler.

Search engines may also use special crawlers like Googlebot-Image or Googlebot-Video.
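
For example, you could keep Google’s image crawler out of an image directory while regular search crawling continues (the path here is hypothetical):

# Applies only to Google's image crawler
User-agent: Googlebot-Image
Disallow: /photos/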

You can use * as a wildcard to target all bots that follow REP:

User-agent: *

If your file has multiple blocks, crawlers pick the most specific match. For example, a rule written for Googlebot overrides a general * rule for Googlebot.
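
Here is a hypothetical file showing that behavior:

# Googlebot follows only the Googlebot group below, so it may still crawl
# /archive/ while all other compliant bots are kept out of it.
User-agent: *
Disallow: /archive/

User-agent: Googlebot
Disallow: /old-articles/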

Disallow directive usage

Disallow tells a bot which paths it should not crawl. Examples:

  • Disallow: /path/to/directory/
  • Disallow: /filename.html

If you leave it empty (like Disallow:), nothing is blocked, so crawling is allowed.

Example rules (shown in the block below): the first group blocks the /private/ folder for all bots that follow REP, and the second blocks the /old-articles/ folder for Google only.

Be careful with path matching and capitalization. /Photo is different from /photo. And because rules are prefix matches, Disallow: /Photo also blocks /Photography/, since that path starts the same way.

# Block the /private/ folder for all bots
User-agent: *
Disallow: /private/

# Block the /old-articles/ folder for Google only
User-agent: Googlebot
Disallow: /old-articles/

Allow directive usage

Allow was not in the original REP, but major search engines (including Google) support it. It lets you open access to a specific file or subfolder inside a larger area you blocked with Disallow.

Example (common on WordPress sites):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

This blocks the admin area, but still allows one file needed for site features.

Wildcard patterns and regular expressions

robots.txt can use wildcard matching. The main wildcard is *, which matches any sequence of characters. Example:

Disallow: /*.php blocks any URL containing .php.

The $ character means “end of the URL.” Example:

Disallow: /*.php$ blocks /index.php but does not block /index.php?p=1 because that URL does not end with .php.

User-agent: *
# Blocks URL paths like /products.php, /articles.php, etc.
Disallow: /*.php

# Blocks URLs that end with .php, but allows those with parameters
Disallow: /*.php$

This is useful for large sites with many parameter-based URLs (filters, sorting, internal search), where crawling can spiral into huge sets of near-duplicate pages.

Crawl-delay directive

Crawl-delay is a common but unofficial directive. It asks bots to wait a certain number of seconds between requests, to reduce load on your server. Example:

# Asks all bots to wait 10 seconds between requests (not supported by Google).
User-agent: *
Crawl-delay: 10

Some search engines, such as Bing, may respect it, but Google ignores it entirely; Googlebot adjusts its crawl rate automatically based on how quickly your server responds, and many other modern bots do the same.

Sitemap directive for XML sitemaps

The Sitemap directive helps crawlers find your XML sitemap. You should still submit your sitemap in Google Search Console and Bing Webmaster Tools, but listing it in robots.txt adds another discovery path.

Example:

Sitemap: https://www.example.com/sitemap.xml

You can list more than one sitemap if needed (for example, separate sitemaps for products, categories, and pages). Use full URLs that start with https://.

Adding comments in robots.txt

You can add comments using #. Bots ignore these lines, but they help humans understand why rules exist.

# This robots.txt file was last updated on 2026-05-15
# Disallow sensitive admin paths
User-agent: *
Disallow: /admin/
Disallow: /private-docs/

Adding dates is helpful for debugging, especially if an older backup gets restored and suddenly causes crawling issues.

How does robots.txt affect search engine crawling and indexing?

Many SEO problems happen because people mix up crawling and indexing. robots.txt mainly controls crawling (whether a bot can fetch a page). It does not directly control indexing (whether the page can appear in search results).

Blocking crawling often means the content won’t be indexed because the bot can’t read it. But there’s an important exception: a blocked URL can still show up in search results if search engines find it through links from other pages. In that case, the result may appear without a proper title or snippet because the bot couldn’t crawl the page.

Managing crawl budget with robots.txt

Search engines have limited time and resources for crawling each site. For bigger sites, this is often called “crawl budget,” meaning the rough number of URLs a bot will crawl in a period of time. It depends on things like how trusted your site is and how fast your server responds.

robots.txt helps you steer that crawl budget to the pages that matter most. Common pages to block include:

  • Internal search results
  • Cart and checkout pages
  • Staging environments and admin sections
  • Duplicate URLs created by parameters (example: /?category=shoes&color=blue)

If you don’t control parameter URLs, crawlers can spend lots of time on pages that won’t rank, and they may crawl key category and product pages less often.

# Block common low-value URL patterns to preserve crawl budget
User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?sort=
Disallow: /*&color=

Risks of blocking content in robots.txt

robots.txt is powerful, and mistakes can be expensive. If you block important pages (or your whole site), search engines may stop crawling them, and your rankings can drop. This often happens during site moves or redesigns.

One classic mistake is blocking CSS and JavaScript files. Google needs these files to render pages and understand layout and usability. If you block them, Google may see an incomplete or broken version of your pages, which can hurt how your site is evaluated.

# BAD PRACTICE: Blocking essential rendering files in WordPress
User-agent: *
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
# This can block CSS and JS files, harming rendering.

Why robots.txt doesn’t prevent indexing by itself

robots.txt is not a reliable way to keep a URL out of Google’s index. Google has said its main purpose is to reduce unnecessary crawling and avoid putting too much load on your site. If other sites link to a blocked URL, Google can still find that URL and may show it in results. Often it will appear as a “bare” listing with little or no snippet because Google couldn’t crawl the page.

If you truly need a page to stay out of search results, use one of these options:

  • Add a noindex meta tag in the page’s <head> or send a noindex HTTP header. The page must be crawlable for bots to see the tag.
  • Password-protect the page or folder.
  • Remove the page so it returns a 404 or 410 status code.

Avoid mixing rules in a way that fights itself (for example, blocking a page in robots.txt and also trying to use noindex on that page). If bots can’t crawl it, they can’t read the noindex.

# INCORRECT: This blocks the crawler from seeing the "noindex" tag on the page.
# The page must be crawlable for "noindex" to work.
User-agent: *
Disallow: /page-i-want-to-noindex/

Best practices for robots.txt in SEO

Using robots.txt well is more than knowing the commands. You need to plan rules so they support your SEO goals instead of harming them. A well-written file can improve crawl efficiency, keep bots away from low-value areas, and help search engines spend time where it counts.

The main idea is balance: give crawlers helpful guidance, but don’t block resources or pages needed for ranking. Review your file regularly and treat it as an active part of technical SEO, not a one-time setup.

Avoid blocking important CSS and JavaScript files

Blocking CSS and JavaScript is one of the most damaging robots.txt mistakes. Google has warned against it since 2015. Search engines render pages in a browser-like way, so they need access to resources that control design and behavior.

If those files are blocked, Google may not understand your layout, mobile experience, or interactive features. That can hurt rankings. Make sure your robots.txt does not block key CSS/JS paths used by your site.

# Good Practice: Allow access to all files unless a path is specifically disallowed.
User-agent: *
Disallow: /private/
# By not disallowing /css/ or /js/, they are implicitly allowed.

Minimizing complexity and syntax errors

It’s usually better to keep robots.txt simple. More rules mean more chances to make a mistake. A small typo or a missing slash can cause crawlers to read your rules differently than you expected, which can block pages you meant to allow (or allow pages you meant to block).
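
For example (with hypothetical paths), a missing trailing slash broadens the match:

# Blocks /blog/, but also /blog-news/ and /blogging-tips/
Disallow: /blog

# Blocks only URLs under the /blog/ directory
Disallow: /blog/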

Check the file often, and use tools like the robots.txt report in Google Search Console (under Settings). Always test rule changes before you publish them.

Combining multiple directives for better control

If you need tighter control, you can combine Disallow and Allow. This lets you block a whole folder but open access to specific files inside it.

Example use case: block a mostly private area but allow a few public documents. That said, the default behavior is “allow everything,” so if you only need to block a small set of URLs, a few targeted Disallow lines are often enough.
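
A minimal sketch of that pattern, with hypothetical paths:

# Block a private documents area, but allow two public files inside it
User-agent: *
Disallow: /documents/
Allow: /documents/press-kit.pdf
Allow: /documents/annual-report.pdf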

Linking to XML sitemaps from robots.txt

Adding your sitemap URL(s) in robots.txt is a solid habit. Even though sitemap submission in Google Search Console and Bing Webmaster Tools matters most for reporting and control, the Sitemap line in robots.txt gives crawlers another way to find it.

For clarity, many site owners place sitemap lines near the end of the file and always use full URLs. Large sites often use multiple sitemaps and list all of them.

# List multiple sitemaps, which is common for large sites
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/blog-sitemap.xml
Sitemap: https://www.example.com/products-sitemap.xml

User-agent: *
Disallow: /private/

Common robots.txt mistakes and how to avoid them

robots.txt looks simple, but it often causes SEO issues. Mistakes range from minor crawl waste to major blocks that remove large parts of a site from being crawled. Avoiding problems comes down to understanding what it can and cannot do, writing clear rules, and testing often.

Many errors happen because people assume all bots behave the same way or because they expect robots.txt to work like a privacy tool. Knowing the common failures helps protect your organic traffic.

Incorrect syntax and unsupported directives

Bad syntax is a basic but common issue. Some bots may ignore a rule if it’s written incorrectly. Also, some patterns may work in Google but not in other search engines, which can create different crawl behavior across platforms.

Always check formatting and path capitalization. Use a validator or testing tool before publishing changes so you catch errors early.
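
For example, here is a common formatting mistake and its fix (the paths are hypothetical):

# INCORRECT: each Disallow line takes exactly one path
Disallow: /tmp/ /private/

# CORRECT: one path per line
Disallow: /tmp/
Disallow: /private/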

Over-restricting crawler access

It’s easy to go too far with Disallow. Blocking too much can stop bots from reaching pages that matter, and it can reduce how often important pages are crawled. This can be tricky with parameter URLs: they may create duplicates, but sometimes they also have value (and might be better handled with canonicals instead of blocking).

Before blocking a folder or URL pattern, think about whether it matters for search traffic or conversions. If your goal is to keep a page out of search results, use noindex or canonical tags where appropriate instead of cutting off crawling completely.

# WARNING: This single line will block all compliant crawlers from your entire site.
User-agent: *
Disallow: /

Bots that ignore robots.txt rules

robots.txt is optional for bots. Major search engines typically follow it closely, but scrapers and malicious bots may ignore it. Some may even treat disallowed paths as a list of interesting targets.

So don’t use robots.txt as protection for private data. If content must be private, use real access controls like passwords, IP rules, or server authentication.

Special considerations for AI bots and new crawlers

AI-focused crawlers like GPTBot (OpenAI) and ClaudeBot (Anthropic) have become more common. These bots often collect data for training, not for classic search results. Many of them follow REP rules like other bots do.

If your robots.txt allows all bots (User-agent: * followed by an empty Disallow:), these AI bots will often be allowed too. If you block all bots (User-agent: * with Disallow: /), they are usually blocked as well. You can also target specific AI crawlers by listing their user-agent name and setting rules for them.

Choosing to allow or block them is a business decision. Allowing them may help your content show up in AI answers. Blocking them may help you keep tighter control over unique content like original reviews or special inventory data.

# Block a specific AI crawler from the entire site
User-agent: GPTBot
Disallow: /

# Allow all other user agents by default
User-agent: *
Disallow:

Testing and validating your robots.txt file

One basic rule: don’t publish robots.txt changes without testing. A single wrong line can block your entire site from crawling and cause major search visibility problems. Testing is how you prevent that.

Use testing tools as a safety check before and after updates. They help you confirm that important pages are crawlable and that blocked areas are blocked on purpose.

Tools for testing robots.txt (Google Search Console and others)

The key tool for Google is the robots.txt report in Google Search Console (under Settings). It lets you:

  • See how Googlebot fetched and parsed your robots.txt.
  • Review any syntax errors and warnings Google found.
  • Request a recrawl after you update the file.

To check whether a specific URL is blocked for Googlebot, use the URL Inspection tool in the same console.

You can also use third-party testers to check formatting and compare behavior for other crawlers. But for Google rules, Google Search Console is the main source you should trust.

Google Search Console also has a “Page Indexing” report with a “Blocked by robots.txt” area, which shows URLs Google found but couldn’t crawl because of your rules.

How to validate robots.txt syntax

Syntax validation means checking that your file follows the expected format and that paths are correct. Key items to check include:

  • Each rule group starts with User-agent.
  • Disallow and Allow each appear on their own lines.
  • Paths are correct and include slashes where needed.
  • Wildcards (*) and end markers ($) are used correctly.
  • Path capitalization matches your real URLs.

The Search Console report highlights syntax issues, but it’s also smart to manually review the logic (or have another person review it) to catch rule problems tools might not flag.

# A well-formed robots.txt file example

# Rules should be in separate lines
User-agent: *
Disallow: /admin/

# Use wildcards to match patterns
# This blocks any URL ending in .pdf
Disallow: /*.pdf$

Interpreting test results and fixing issues

After testing, compare results to what you expected:

  • If an important URL is blocked, update the Disallow rules.
  • If a private URL is allowed, add or adjust your Disallow rules.

If CSS or JavaScript is blocked, update rules so those files are allowed. After edits, test again right away. Also remember that search engines cache robots.txt (Google may cache it for up to 24 hours), so a change may not take effect immediately; confirm it by testing again once the file has been refetched.

Frequently asked questions about robots.txt for SEO

Does every website need a robots.txt file?

No. A very small website where every page should be public may not need one. If there’s no robots.txt, crawlers usually assume they can crawl everything.

# An explicit "allow-all" robots.txt file. This is the default behavior if no file exists.
# It is useful for making your intention clear.
User-agent: *
Disallow:

But most real sites benefit from it, especially sites with:

  • Dynamic URLs
  • Internal search
  • Account pages
  • Admin areas
  • Large numbers of product and filter pages (marketplaces)

Some CMS platforms (like Wix or Blogger) may hide direct access or manage this through built-in settings. For custom sites and large platforms, robots.txt is a core technical SEO item.

Can robots.txt be used to block AI crawlers?

Yes. You can block or allow specific AI crawlers like GPTBot or ClaudeBot by using their user-agent names and adding Disallow or Allow rules. Many AI crawlers follow REP rules.

This choice depends on your goals. Allowing them may increase how often your content appears in AI answers. Blocking them can help you keep tighter control over unique content and data.

How often should you update your robots.txt file?

Review and update robots.txt when your site changes in ways that affect URLs and crawling. Common reasons include:

  • New site sections or new URL patterns
  • Changes to filters or URL parameters
  • Domain changes or site migrations
  • New duplicate/low-value pages that waste crawl budget
  • New AI crawlers you want to allow or block

For large, active sites, it’s practical to review robots.txt during releases or during regular technical SEO audits.

Can you use robots.txt to remove a page from Google results?

No. robots.txt does not reliably remove a page from Google results. It blocks crawling, but a URL can still appear in results if Google finds it through links. Often it will show without a normal snippet because Google couldn’t crawl the page.

If you need to keep a page out of Google results, use one of these methods:

  • Add a noindex meta tag or noindex HTTP header (and keep the page crawlable so Google can see it).
  • Password-protect the page or folder.
  • Delete the page and return 404 or 410.

<!-- Method 1: Add a "noindex" meta tag to the page's HTML <head> -->
<meta name="robots" content="noindex">

# Method 2: Send a "noindex" HTTP header with the server response for the file
X-Robots-Tag: noindex

Using robots.txt for removal often leads to partial results and confusion.

Author: Rad Paluszak

With over 25 years of web development experience and more than 15 years in technical SEO, Rad is an international SEO conference speaker, co-founder, and CTO at NON.agency – a London-based international SEO agency – and former CTO at SUSO Digital, where he collaborated with industry leaders like Matt Diggity and Matthew Woodward. Before that, he worked at Poland’s largest digital marketing agency and co-founded Husky Hamster in 2021, which later evolved into NON.agency Global. Rad is based in London, UK.

👉 Rad’s LinkedIn Profile
👉 Rad’s X Account
