How to Use Robots.txt to Control Web Crawlers

Introduction

A robots.txt file is a plain text file containing rules that instruct web crawlers and search engines to either access or ignore specific sections of your website. Well-behaved crawlers, commonly referred to as web robots, read the directives in the robots.txt file before scanning any part of your website. The robots.txt file must be in the website’s document root directory so that any web crawler can access it.

This article explains how you can use robots.txt to control web crawlers on your website.

Prerequisites

Deploy a Cloud Server on Vultr.
Point an active domain name to the server.
Log in through SSH as a non-root user with sudo privileges.
Host a website, such as WordPress, on the server.

The Robots.txt File Structure

A valid robots.txt file contains one or more directives, each declared in the format: field, colon, value.

User-agent: Declares the web crawler a rule applies to.
Allow: Specifies the path a web crawler should access.
Disallow: Declares the path a web crawler should not access.
Sitemap: Full URL to the website structure sitemap.

Values must be relative paths for the Allow and Disallow fields, absolute URLs for the Sitemap field, and web crawler names for the User-agent field. Common user-agent names you can safely declare in a robots.txt file, listed by search engine, include:

Alexa
- ia_archiver
AOL
- aolbuild
Bing
- Bingbot
- BingPreview
DuckDuckGo
- DuckDuckBot
Google
- Googlebot
- Googlebot-Image
- Googlebot-Video
Yahoo
- Slurp
Yandex
- Yandex

Crawlers that are not declared by name follow the rules in the wildcard (*) group.
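As a rough illustration of the field, colon, value format described in this section, the following Python sketch splits robots.txt text into directive pairs. The function name parse_directives and the sample rules are hypothetical, and the sketch ignores grouping and validation.

```python
def parse_directives(text):
    """Return (field, value) pairs found in robots.txt text."""
    directives = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a field
        field, value = line.split(":", 1)  # split on the first colon only
        directives.append((field.strip().lower(), value.strip()))
    return directives


# Hypothetical sample content, for illustration only.
sample = """\
User-agent: *
Disallow: /wp-admin/   # hypothetical rule
Sitemap: https://www.example.com/sitemap_index.xml
"""

for field, value in parse_directives(sample):
    print(f"{field} -> {value}")
```

Splitting on the first colon only keeps the colon inside the sitemap URL intact.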

Common Robots.txt Directives

Rules in the robots.txt file must use valid syntax because web crawlers ignore rules they cannot parse. A valid rule must include a path or, for the Sitemap field, a fully qualified URL. The examples below explain how to allow, disallow, and control web crawlers in the robots.txt file.

1. Grant Web Crawlers Access to Website Files

Allow a single web crawler to access all website files.

User-agent: Bingbot
Allow: /

Allow all web crawlers to access website files.

User-agent: *
Allow: /

Grant a web crawler access to a single file.

User-agent: Bingbot
Allow: /documents/helloworld.php

Grant all web crawlers access to a single file.

User-agent: *
Allow: /documents/helloworld.php

2. Deny Web Crawlers Access to Website Files

Deny a web crawler access to all website files.

User-agent: Googlebot
Disallow: /

Deny all web crawlers access to website files.

User-agent: *
Disallow: /
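One way to check how a parser interprets the two deny rules above is Python's standard urllib.robotparser module. The sketch below parses the rules from strings rather than fetching them; the helper name build_parser and the example URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Deny only Googlebot; other crawlers have no matching group.
googlebot_denied = build_parser("User-agent: Googlebot\nDisallow: /\n")
# Deny every crawler through the wildcard group.
all_denied = build_parser("User-agent: *\nDisallow: /\n")

url = "https://www.example.com/index.html"  # hypothetical URL
print(googlebot_denied.can_fetch("Googlebot", url))  # False
print(googlebot_denied.can_fetch("Bingbot", url))    # True
print(all_denied.can_fetch("Bingbot", url))          # False
```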

Deny a web crawler access to a single image.

User-agent: MSNBot-Media
Disallow: /documents/helloworld.jpg

Deny a web crawler access to all images of a specific type.

User-agent: MSNBot-Media
Disallow: /*.jpg$
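The * and $ characters in the rule above act as a wildcard and an end-of-URL anchor. As a minimal sketch of how such a pattern could be evaluated, the following Python code translates it into a regular expression. The function name pattern_to_regex is hypothetical, and real crawlers may differ in details.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")  # "$" pins the match to the URL end
    if anchored:
        pattern = pattern[:-1]
    # "*" matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


rule = pattern_to_regex("/*.jpg$")
print(bool(rule.match("/documents/helloworld.jpg")))             # True
print(bool(rule.match("/documents/helloworld.png")))             # False
print(bool(rule.match("/documents/helloworld.jpg?size=large")))  # False
```

Note that Python's own urllib.robotparser treats paths as plain prefixes and does not interpret these wildcard characters.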

You can also deny a specific image crawler access to all website images. For example, the following rule instructs Google Images to ignore all images on your website and remove already indexed images from its database.

User-agent: Googlebot-Image
Disallow: /

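Deny all web crawlers access to all files except for one file. List the more specific Allow rule for the exempt file together with a Disallow rule for everything else.

User-agent: *
Allow: /documents/helloworld.php
Disallow: /

To deny access to several individual files instead, repeat the Disallow rule for each file: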

User-agent: *
Disallow: /~documents/hello.php
Disallow: /~documents/world.php
Disallow: /~documents/again.php

Instruct all web crawlers to access website files, but ignore a specific file.

User-agent: *
Allow: /
Disallow: /documents/index.html
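When an Allow and a Disallow rule both match a URL, as in the rule above, the robots.txt standard (RFC 9309) applies the most specific match, meaning the longest matching path, and prefers Allow on a tie. The following Python sketch shows that precedence for plain path prefixes; the function name is_allowed and the rule list layout are assumptions for illustration.

```python
def is_allowed(rules, path):
    """Apply longest-match precedence to (directive, prefix) rules."""
    best_length = -1
    allowed = True  # paths that match no rule are allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules do not apply
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and directive == "allow"
        if longer or tie_for_allow:
            best_length = len(prefix)
            allowed = directive == "allow"
    return allowed


RULES = [("allow", "/"), ("disallow", "/documents/index.html")]
print(is_allowed(RULES, "/documents/index.html"))  # False
print(is_allowed(RULES, "/about.html"))            # True
```

Some older parsers, including Python's urllib.robotparser, apply the first matching rule instead, so rule order can matter for them.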

Instruct all web crawlers to ignore a specific directory. For example: wp-admin.

User-agent: *
Disallow: /wp-admin/

3. Grouping Robots.txt Directives

To apply robots.txt directives in groups, declare multiple user agents one after another, then list the rules they share.

For example:

User-agent: Googlebot    # First Group
User-agent: Googlebot-News
Allow: /
Disallow: /wp-admin/

User-agent: Bingbot   # Second Group
User-agent: Slurp
Allow: /
Disallow: /wp-includes/
Disallow: /wp-content/uploads/  # Ignore WordPress Images

Each group above applies its rules to every user agent declared at the top of that group.
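To see grouping in action, the Python sketch below feeds a simplified two-group file to the standard urllib.robotparser module. The sample rules and URLs are hypothetical. The Allow: / lines are left out because this parser applies the first matching rule in a group, so a leading Allow: / would override the Disallow rules that follow it.

```python
from urllib.robotparser import RobotFileParser

# Simplified, hypothetical two-group file for illustration.
robots_txt = """\
User-agent: Googlebot       # first group
User-agent: Googlebot-News
Disallow: /wp-admin/

User-agent: Bingbot         # second group
User-agent: Slurp
Disallow: /wp-includes/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

admin = "https://www.example.com/wp-admin/options.php"
includes = "https://www.example.com/wp-includes/script.js"

print(parser.can_fetch("Googlebot-News", admin))  # False
print(parser.can_fetch("Slurp", admin))           # True
print(parser.can_fetch("Slurp", includes))        # False
```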

4. Control Web Crawler Intervals

Web crawler requests can increase your server load, so you may need to regulate the rate, in seconds, at which crawlers scan your website. Crawl-delay is not part of the robots.txt standard, so support varies between crawlers.

For example, the following directive instructs all web crawlers to wait at least 60 seconds between successive requests to your server.

User-agent: *
Crawl-delay: 60
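A crawler written in Python could read this value with the standard urllib.robotparser module, as in the sketch below. The user agent name ExampleBot is hypothetical; because it is not declared by name, the wildcard group applies.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 60"])

# ExampleBot is a hypothetical crawler name used for illustration.
delay = parser.crawl_delay("ExampleBot")
print(delay)  # 60
```

A polite crawler would then wait at least that many seconds between requests, for example by calling time.sleep(delay).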

Example

The following robots.txt sample instructs all web crawlers to access website files, ignore critical directories, and use the sitemap to understand the website’s structure.

User-agent: *
Allow: /
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/

Sitemap: https://www.example.com/sitemap_index.xml
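As a quick check that the Sitemap line in a file like this one is readable, the following sketch parses the sample with Python's standard urllib.robotparser module (the site_maps method requires Python 3.8 or later). It only extracts the sitemap URL and does not evaluate the Allow and Disallow rules.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/

Sitemap: https://www.example.com/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

sitemaps = parser.site_maps()
print(sitemaps)  # ['https://www.example.com/sitemap_index.xml']
```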

To view and test your robots.txt file, append /robots.txt to your domain name in a web browser. For example: http://example.com/robots.txt.

If your website returns a 404 error, create a new robots.txt file and upload it to your document root directory, usually /var/www/html or /var/www/public_html.
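The same check can be scripted. The sketch below, which uses only the Python standard library, builds the robots.txt URL for a site and reports the HTTP status code the server returns. The function names robots_url and robots_status are hypothetical, and robots_status is defined but not called here because it needs network access.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


def robots_url(site_url):
    """Return the robots.txt URL for the site that serves site_url."""
    return urljoin(site_url, "/robots.txt")


def robots_status(site_url, timeout=10):
    """Return the HTTP status code the server sends for robots.txt."""
    try:
        with urllib.request.urlopen(robots_url(site_url), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # for example, 404 when the file is missing


print(robots_url("http://example.com/blog/post.html"))
# http://example.com/robots.txt
```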

Most web crawlers follow your robots.txt directives. However, bad bots and malware crawlers may ignore your rules. If your server runs the LAMP stack, you can block bad bots by adding the following lines to the .htaccess file:

SetEnvIfNoCase User-Agent ([a-z0-9]{2000}) bad_bots
SetEnvIfNoCase User-Agent (archive.org|binlar|casper|checkpriv|choppy|clshttp|cmsworld|diavol|dotbot|extract|feedfinder|flicky|g00g1e|harvest|heritrix|httrack|kmccrew|loader|miner|nikto|nutch|planetwork|postrank|purebot|pycurl|python|seekerspider|siclab|skygrid|sqlmap|sucker|turnit|vikspider|winhttp|xxxyy|youda|zmeu|zune) bad_bots
Order Allow,Deny
Allow from All
Deny from env=bad_bots
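Before deploying a rule like this, it can help to check which user agent strings the pattern would catch. The Python sketch below approximates the case-insensitive, unanchored match that SetEnvIfNoCase performs, using a shortened subset of the names above. The function name is_bad_bot and the sample user agent strings are assumptions for illustration.

```python
import re

# Shortened subset of the names in the .htaccess rule, for illustration only.
BAD_BOTS = re.compile(r"(httrack|nikto|sqlmap|zmeu)", re.IGNORECASE)


def is_bad_bot(user_agent):
    """Return True when the user agent contains a blocked name."""
    return BAD_BOTS.search(user_agent) is not None


print(is_bad_bot("sqlmap/1.7 (https://sqlmap.org)"))  # True
print(is_bad_bot("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))  # False
```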

