 How do I use a robots.txt file to control access to my site?
Solution

Removing your entire website using a robots.txt file

You can use a robots.txt file to request that search engines remove your site and prevent robots from crawling it in the future. (It's important to note that if a robot discovers your site by other means - for example, by following a link to your URL from another site - your content may still appear in our index and our search results. To entirely prevent a page from being added to the Google index even if other sites link to it, use a noindex meta tag.)

To prevent robots from crawling your site, place the following robots.txt file in your server root:

User-agent: *
Disallow: /

To remove your site from Google only and prevent just Googlebot from crawling your site in the future, place the following robots.txt file in your server root:

User-agent: Googlebot
Disallow: /

Each protocol and port combination must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each protocol. For example, to allow Googlebot to index all http pages but no https pages, you'd use the robots.txt files below.

For your http protocol (http://replace-with-your-domain.com/robots.txt):

User-agent: *
Allow: /

For the https protocol (https://replace-with-your-domain.com/robots.txt):

User-agent: *
Disallow: /
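
The effect of a block-everything file like this can be sanity-checked with Python's standard-library urllib.robotparser (a quick sketch; example.com stands in for your domain):

```python
import urllib.robotparser

# Rules equivalent to the https robots.txt above: block every bot from every path.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

print(rp.can_fetch("Googlebot", "https://example.com/page.html"))  # False
print(rp.can_fetch("Bingbot", "https://example.com/"))             # False
```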

------------------------------------------------------------
Where do I place my robots.txt file?

The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.replace-with-your-domain.com/robots.txt is a valid location. But, http://www.replace-with-your-domain.com/mysite/robots.txt is not. If you don't have access to the root of a domain, you can restrict access using the Robots META tag.
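
This "root only" rule is why bots resolve /robots.txt against the bare domain, no matter which page they arrived from. A small illustration using Python's standard library (example.com is a placeholder domain):

```python
from urllib.parse import urljoin

# However deep the page a bot is visiting, the robots.txt it checks
# is resolved against the root of the domain.
url = urljoin("http://www.example.com/mysite/page.html", "/robots.txt")
print(url)  # http://www.example.com/robots.txt
```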

------------------------------------------------------------

How do I create a robots.txt file?

If you want to create the file yourself, you can use any text editor. It should be an ASCII-encoded text file, not an HTML file. The filename should be lowercase.

Syntax
The simplest robots.txt file uses two rules:

* User-agent: the robot the following rule applies to
* Disallow: the pages you want to block

These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
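
A sketch of a file with two entries, checked with Python's standard-library urllib.robotparser (the directory names are invented for the example):

```python
import urllib.robotparser

# Two entries: Googlebot gets two Disallow lines; every other bot is
# blocked entirely. Entries are separated by a blank line.
rules = """\
User-agent: Googlebot
Disallow: /private/
Disallow: /tmp/

User-agent: *
Disallow: /
"""
rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "http://example.com/public/page.html"))     # True
print(rp.can_fetch("Googlebot", "http://example.com/private/a.html"))       # False
print(rp.can_fetch("SomeOtherBot", "http://example.com/public/page.html"))  # False
```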

What should be listed on the User-agent line?
A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:

User-agent: *

Google uses several different bots (user-agents). The bot we use for our web search is Googlebot. Our other bots like Googlebot-Mobile and Googlebot-Image follow rules you set up for Googlebot, but you can set up additional rules for these specific bots as well.
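
Python's urllib.robotparser approximates this "family of bots" behavior: an entry naming Googlebot also applies when you check access as Googlebot-Image (a sketch, with example.com as a placeholder):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: Googlebot", "Disallow: /"])

# The Googlebot entry also matches Googlebot-Image; a bot with no
# matching entry (and no * entry in the file) is allowed by default.
print(rp.can_fetch("Googlebot", "http://example.com/"))        # False
print(rp.can_fetch("Googlebot-Image", "http://example.com/"))  # False
print(rp.can_fetch("Bingbot", "http://example.com/"))          # True
```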

What should be listed on the Disallow line?
The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).

* To block the entire site, use a forward slash.

Disallow: /

* To block a directory and everything in it, follow the directory name with a forward slash.

Disallow: /private_directory/

* To block a page, list the page.

Disallow: /private_file.html

URLs are case-sensitive. For instance, Disallow: /private_file.html would block http://www.replace-with-your-domain.com/private_file.html, but would allow http://www.replace-with-your-domain.com/Private_file.html.
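
The case-sensitivity of path matching can be demonstrated with urllib.robotparser as well (example.com is a placeholder):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private_file.html"])

# Path matching is case-sensitive: only the exact lowercase URL is blocked.
print(rp.can_fetch("*", "http://example.com/private_file.html"))  # False
print(rp.can_fetch("*", "http://example.com/Private_file.html"))  # True
```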

------------------------------------------------------------

Block or remove pages using meta tags

Rather than use a robots.txt file to block crawler access to pages, you can add a <META> tag to an HTML page to tell robots not to index the page.

To prevent all robots from indexing a page on your site, you'd place the following meta tag into the <HEAD> section of your page:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

To allow other robots to index the page on your site, preventing only Google's robots from indexing the page, you'd use the following tag:

<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">

To allow robots to index the page on your site but instruct them not to follow outgoing links, you'd use the following tag:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

To allow robots to index the page on your site but instruct them not to index images on that page, you'd use the following tag:

<META NAME="ROBOTS" CONTENT="NOIMAGEINDEX">
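
As a rough sketch of how a crawler could read such tags, here is a small extractor built on Python's standard-library html.parser (the class name and page snippet are invented for the example):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name", "").lower() in ("robots", "googlebot"):
                content = attrs.get("content", "")
                self.directives += [d.strip().lower() for d in content.split(",")]

page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # ['noindex', 'nofollow']
```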

------------------------------------------------------------

Block or remove pages using a robots.txt file

You can use a robots.txt file to block Googlebot from crawling pages on your site.

For example, if you're manually creating a robots.txt file, to block Googlebot from crawling all pages under a particular directory (for example, lemurs), you'd use the following robots.txt entry:

User-agent: Googlebot
Disallow: /lemurs/

To block Googlebot from crawling all files of a specific file type (for example, .gif), you'd use the following robots.txt entry:

User-agent: Googlebot
Disallow: /*.gif$

To block Googlebot from crawling any URL that includes a ? (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):

User-agent: Googlebot
Disallow: /*?
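
The * and $ wildcards in these patterns are crawler-specific extensions to the basic robots.txt syntax (Python's urllib.robotparser, for one, does not interpret them). A hypothetical helper that translates such patterns to regular expressions makes their matching behavior concrete:

```python
import re

def robots_pattern_to_regex(pattern):
    # Hypothetical helper: translate a Google-style robots.txt path pattern
    # (* matches any string, a trailing $ anchors the end of the URL) to a regex.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile("^" + body + ("$" if anchored else ""))

print(bool(robots_pattern_to_regex("/*.gif$").match("/images/photo.gif")))  # True
print(bool(robots_pattern_to_regex("/*.gif$").match("/photo.gif?x=1")))     # False
print(bool(robots_pattern_to_regex("/*?").match("/search?q=lemurs")))       # True
print(bool(robots_pattern_to_regex("/lemurs/").match("/lemurs/a.html")))    # True
```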

While we won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org) can appear in Google search results. However, no content from your pages will be crawled, indexed, or displayed.

To entirely prevent a page from being added to the Google index even if other sites link to it, use a noindex meta tag, and ensure that the page is not blocked by robots.txt (otherwise the crawler will never see the tag). When Googlebot crawls the page, it will recognize the noindex meta tag and drop the URL from the index.

Article Details
Article ID: 103
Created On: 18 Aug 2008 12:49 AM


Copyright © 1999-2008 by Internet Planners LLC. Read our Terms and Conditions. All rights reserved.