robots.txt Templates and a Brief Analysis

Category: Website Building | Published: 2023-05-16

Although robots.txt seems to have faded from general attention, it is still worth documenting. Here are 10 templates.

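1. Disallow All: block all crawling

This rule blocks crawlers from fetching any page on the site. It is useful in several situations, for example:

The site is not ready yet.
You do not want the site to appear in search engine results.
The site is a beta that precedes the official release.

Template code: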

User-agent: *
Disallow: /

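There are two directives here:

User-agent: names the specific crawler the rules apply to; * applies them to all crawlers.

Disallow: tells the crawler which part of the site it must not crawl; setting it to "/" blocks every page.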
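
To see how a parser reads this template, here is a minimal sketch using Python's standard-library urllib.robotparser. The crawler name "MyCrawler" and the example.com URLs are made up for illustration and are not part of the template.

```python
from urllib.robotparser import RobotFileParser

# The "Disallow All" template from above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical crawler name and URLs, used only for illustration.
for url in ("https://example.com/", "https://example.com/blog/post.html"):
    # can_fetch() returns False for every URL under this template.
    print(url, parser.can_fetch("MyCrawler", url))
```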

2. Allow All: allow crawling of everything

Example code:

User-agent: *
Disallow:

Use this template if you want crawlers to fetch every page. An empty Disallow value tells the crawler that no URL is blocked.

3. Block a Folder: block crawling of a directory

Example code:

User-agent: *
Disallow: /admin/

This rule blocks crawling of every file under the admin directory.
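
Disallow values are matched as path prefixes, so the trailing slash matters. The sketch below checks a few paths against this template with Python's urllib.robotparser; the paths and the example.com host are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_allowed(path: str) -> bool:
    """Return True if a generic crawler may fetch the given path."""
    return parser.can_fetch("*", "https://example.com" + path)

print(is_allowed("/admin/users.html"))  # False: inside the blocked folder
print(is_allowed("/administrator/"))    # True: does not start with "/admin/"
print(is_allowed("/index.html"))        # True: unrelated page
```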

4. Block a File: block crawling of a single file

Example code:

User-agent: *
Disallow: /admin.html

This rule blocks crawling of the admin.html file in the root directory.

5. Disallow a File Extension: block files with a given extension

Example code:

User-agent: *
Disallow: /*.pdf$
Disallow: /*.xls$

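These rules block crawling of files whose URLs end in .pdf or .xls: * matches any sequence of characters and $ anchors the end of the URL. For example, the .xls rule matches the following URLs: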

https://example.com/files/spreadsheet1.xls
https://example.com/files/folder2/profit.xls
https://example.com/users.xls
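
The wildcard matching can be sketched by translating each rule into a regular expression. This is an illustrative approximation, not how any particular search engine implements it; the helper names rule_to_regex and is_blocked are invented here. As far as I know, Python's urllib.robotparser only does plain prefix matching, so the sketch does not rely on it.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule containing * and $ into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces, then join them with ".*" where * appeared.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(path: str, disallow_rules: list[str]) -> bool:
    """Return True if any Disallow rule matches from the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)

RULES = ["/*.pdf$", "/*.xls$"]

print(is_blocked("/files/spreadsheet1.xls", RULES))    # True
print(is_blocked("/files/folder2/profit.xls", RULES))  # True
print(is_blocked("/users.xls", RULES))                 # True
print(is_blocked("/users.xlsx", RULES))                # False: $ requires the URL to end in .xls
```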

6. Allow Only Googlebot: allow Google's crawler and block all others

Example code:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

These rules allow only Google's crawler to crawl the site and block all other crawlers.
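
A crawler follows the group that names it and falls back to the * group otherwise. A small sketch with Python's urllib.robotparser shows the effect for a few crawler names; the URL is a hypothetical example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "https://example.com/page.html"  # hypothetical URL
for agent in ("Googlebot", "Bingbot", "DuckDuckBot"):
    # Only Googlebot has its own group; the others fall back to "*".
    print(agent, parser.can_fetch(agent, URL))
```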

7. Disallow a Specific Bot: block one crawler and allow the rest

Example code:

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:

These rules block Google's crawler but allow all other crawlers.
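
If several crawlers need to be blocked, the same pattern can be generated rather than written by hand. The following is a minimal sketch; build_robots_txt is a hypothetical helper name, not part of any library.

```python
def build_robots_txt(blocked_agents: list[str]) -> str:
    """Build a robots.txt that blocks the named crawlers and allows the rest."""
    # One group per blocked crawler, each disallowing the whole site.
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    # Catch-all group: an empty Disallow value permits everything.
    groups.append("User-agent: *\nDisallow:")
    return "\n\n".join(groups) + "\n"

# Reproduces the template above when given a single crawler name.
print(build_robots_txt(["Googlebot"]))
```

Passing more names, for example build_robots_txt(["Googlebot", "Bingbot"]), adds one blocking group per crawler before the catch-all group.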

8. Link to Your Sitemap: specify the sitemap URL

Example code:

User-agent: *
Sitemap: https://pagedart.com/sitemap.xml

This rule explicitly specifies the location of the site's sitemap.xml.
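
A crawler can read the sitemap location straight from the file. The sketch below uses site_maps() from Python's urllib.robotparser, which assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Sitemap: https://pagedart.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None when no Sitemap line is present.
sitemaps = parser.site_maps()
print(sitemaps)  # ['https://pagedart.com/sitemap.xml']
```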

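9. Slow the Crawl Speed: set a crawl delay

Bing, Yahoo, and Yandex are listed as supporting the Crawl-delay directive (Google ignores it). The directive lets you set a delay between two successive fetches.

Example code: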

User-agent: *
Crawl-delay: 10

This rule tells the crawler to wait 10 seconds before fetching the next page. The delay can be set to a value from 1 to 30 seconds.
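
On the crawler side, honoring the delay could look roughly like this. It is a sketch under assumptions: "MyCrawler" and polite_fetch_all are hypothetical names, and the fetch function is supplied by the caller.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns None when no delay applies, so fall back to 0.
delay = parser.crawl_delay("MyCrawler") or 0
print(delay)  # 10

def polite_fetch_all(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results
```

The sleep parameter exists only so the pause can be swapped out in tests; a real crawler would leave it at time.sleep.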

10. Bot User Agents: user agents of common crawlers

Googlebot - Used for Google Search
Bingbot - Used for Bing Search
Slurp - Yahoo's web crawler
DuckDuckBot - Used by the DuckDuckGo search engine
Baiduspider - Used by Baidu, a Chinese search engine
YandexBot - Used by Yandex, a Russian search engine
Facebot - Used by Facebook
Pinterestbot - Used by Pinterest
Twitterbot - Used by Twitter

Article link: https://wenge365.com/a/djZiRnZtQ0cvZTNyNk5HSm9EL2tGUT09.html