Crawler FAQs
- Q: What is Fatbot?
- A: Fatbot is the web crawling robot for TheFind, Inc. It gathers data (e.g., HTML pages, CSS stylesheets, image files) from the web to build the index for our search engine. You can visit our site at http://www.thefind.com
- Q: What is your crawler's HTTP user-agent string?
- A: Fatbot.
- Q: How often will Fatbot access my web site?
- A: Fatbot is built to politely access your site. Fatbot attempts to access each web server no more than once every few seconds.
- Q: How do I request that Fatbot not crawl parts or all of my site?
- A: Set up a robots.txt file. This standard document tells Fatbot not to download some or all of the information on your site. To learn more, see The Robot Exclusion Standard. Please note that Fatbot checks for changes to your server's robots.txt file only a few times a day. Any changes you make will therefore not take effect immediately; they will be noted in the next crawl.
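As an illustration, the robots.txt rules below ask Fatbot to stay out of one directory while leaving the rest of the site open. The `/private/` path and the example URLs are hypothetical. The check uses Python's standard `urllib.robotparser`, which only approximates how any particular crawler, Fatbot included, reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Fatbot out of /private/, allow everything else.
ROBOTS_TXT = """\
User-agent: Fatbot
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs used only to exercise the rules above.
    for url in (
        "http://www.myhost.com/index.html",
        "http://www.myhost.com/private/report.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "Fatbot", url))
```

To ask Fatbot to skip the whole site, the rule would be `Disallow: /` instead of `Disallow: /private/`.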
- Q: Why is Fatbot trying to access a file called robots.txt that isn't on my server?
- A: Fatbot looks for a robots.txt file to find out whether you have restricted crawling of your server. You can use a robots.txt file to tell Fatbot not to download some or all information from your web server. For information on how to create a robots.txt file, see The Robot Exclusion Standard. To prevent the "file not found" error messages in your web server log, you can create an empty file named robots.txt.
- Q: Why is Fatbot attempting to download incorrect or non-existent links from my server?
- A: Fatbot discovers web pages by extracting links from other known web pages. If someone publishes an incorrect link to your site (i.e., there is a typo or spelling error in the URL), or if they fail to update links to reflect changes on your server, Fatbot will try to follow those links, which results in attempts to download incorrect or non-existent pages.
- Q: Why isn't Fatbot respecting my robots.txt file?
- A: Fatbot only downloads a copy of your robots.txt a few times a day, so there may be a delay before Fatbot learns of changes to the file.
- It is also important to make sure that you are following The Robot Exclusion Standard exactly. A common source of problems is that the robots.txt file isn't placed in the top directory of the server (e.g., www.myhost.com/robots.txt).
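A quick way to confirm the expected location is to derive it from any page URL on the same server. The sketch below uses Python's standard `urllib.parse`; the function name `robots_txt_url` and the sample page URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the top-level robots.txt URL for the server hosting page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; the file always lives at the server root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://www.myhost.com/shop/items/index.html"))
```

A robots.txt placed anywhere else, such as inside the `/shop/` directory in this example, is not at the location the standard specifies.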
- Q: Can you tell me the IP addresses from which Fatbot crawls so that I can filter my logs?
- A: The best way to filter is by user-agent (Fatbot), since the IP addresses used by Fatbot change from time to time.
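A minimal sketch of such a filter follows. It assumes your server writes the common "combined" log format, in which the user-agent is the last double-quoted field on each line; the function name `fatbot_lines` and the sample log lines are hypothetical, so adjust the pattern if your log layout differs.

```python
import re

# Assumption: combined log format, where the user-agent is the last
# double-quoted field on the line.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def fatbot_lines(log_lines, agent="Fatbot"):
    """Return the log lines whose user-agent field contains agent."""
    matches = []
    for line in log_lines:
        found = LAST_QUOTED_FIELD.search(line)
        if found and agent.lower() in found.group(1).lower():
            matches.append(line)
    return matches


if __name__ == "__main__":
    # Hypothetical log lines for illustration only.
    sample = [
        '192.0.2.10 - - [01/Jan/2010:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 404 209 "-" "Fatbot"',
        '192.0.2.55 - - [01/Jan/2010:00:00:02 +0000] "GET /fatbot-news.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    ]
    for line in fatbot_lines(sample):
        print(line)
```

Matching on the user-agent field rather than the whole line avoids picking up requests from other clients whose URL merely contains the word "fatbot", as in the second sample line.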
- Q: Why is Fatbot downloading the same page on my site multiple times?
- A: In general, Fatbot downloads only one copy of each file from your site during a given crawl. However, if the crawler is stopped and restarted, Fatbot may recrawl pages that it has recently retrieved.
- Q: Who should I contact if I have more questions about Fatbot?
- A: Please contact us with questions. Include your site URL, a detailed description of your problem, and the portion of your web server log that shows Fatbot activity. This will help us track down your problem quickly.