Robots

Thunderstone Search Appliance Manual

Syntax: select Yes or No buttons

robots.txt

With this set to Yes, the Search Appliance first fetches /robots.txt from each site being indexed and respects its directives about which URL prefixes to ignore. Turning this setting off is generally not recommended. Supported directives in robots.txt include User-agent, Disallow, Allow, Sitemap, and Crawl-delay.
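As an illustration of how the five directives listed above are read, here is a minimal sketch using Python's standard urllib.robotparser module. It is not the Search Appliance's own parser, and the site www.example.com, its paths, and the helper name parse_robots are assumptions made up for the example. Note that Python's parser applies the first matching rule, so the Allow line is placed before the broader Disallow line.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site (not from the manual).
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public-report.html
Disallow: /private/
Crawl-delay: 5
Sitemap: https://www.example.com/sitemap.xml
"""


def parse_robots(text):
    """Parse robots.txt text with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)

# Disallow/Allow decide which URL prefixes a crawler should skip.
blocked = not parser.can_fetch("*", "https://www.example.com/private/secret.html")
allowed = parser.can_fetch("*", "https://www.example.com/private/public-report.html")

# Crawl-delay and Sitemap are exposed separately (site_maps needs Python 3.8+).
delay = parser.crawl_delay("*")
sitemaps = parser.site_maps()
```

Other parsers may resolve overlapping Allow and Disallow rules differently (for example by longest match), so rule order matters less there than it does in this sketch.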

Note that any Crawl-delay value will be adjusted to fit within the Robots Crawl-delay range (here), and overrides Walk Delay (here).
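The interaction described above can be sketched as a simple clamp. This is only one reading of the rule, not the appliance's implementation: the function name effective_delay, its parameters, and the range limits in the usage lines are hypothetical.

```python
def effective_delay(robots_delay, range_min, range_max, walk_delay):
    """Return the delay (in seconds) to use between fetches.

    robots_delay is the Crawl-delay from robots.txt, or None if absent.
    range_min and range_max stand in for the Robots Crawl-delay range.
    walk_delay stands in for the Walk Delay setting.
    """
    if robots_delay is None:
        # No Crawl-delay given: fall back to the configured Walk Delay.
        return walk_delay
    # Crawl-delay is clamped into the range and then overrides Walk Delay.
    return max(range_min, min(robots_delay, range_max))


# Assumed range of 1 to 30 seconds, chosen only for illustration.
too_long = effective_delay(120, 1, 30, walk_delay=0)
in_range = effective_delay(10, 1, 30, walk_delay=0)
missing = effective_delay(None, 1, 30, walk_delay=2)
```

Under these assumed limits, a site asking for a 120-second delay would be walked at the range maximum instead.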

Any Sitemap links in robots.txt will be walked as well, subject to normal exclusion settings. Sitemaps not in robots.txt may be added via Base URL(s) (here) or URL URL (here).

Meta

Respect the meta tag called robots. With this set to Yes, the Search Appliance will process and respect the robot control information within each retrieved HTML page.
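To show what "robot control information" in a page looks like, the following sketch extracts the content of a robots meta tag with Python's standard html.parser module. The class and function names are hypothetical, and the directive values used (noindex, follow) are the commonly seen ones; this page does not state which values the Search Appliance honors.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # The name attribute is matched case-insensitively.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, follow".
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta(html):
    """Return the set of robots meta directives in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="ROBOTS" content="NoIndex, follow"></head></html>'
found = robots_meta(page)
```

A crawler that respects the tag would then skip indexing the page when "noindex" is in the returned set.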

Placeholder

Whether to still put an (empty) entry - a placeholder - in the html search table for URLs that are excluded via <meta name="robots"> tags. Leaving a placeholder improves refresh walks, as the URL can then have its own individual refresh time like any other stored URL. Without a placeholder, the URL would be fetched every time a link to it is found, because no knowledge that it has been recently fetched would be stored.

The downside to placeholders is that if the URL is also being searched in queries - i.e. Url is part of Index Fields - then the excluded URL might be found in results. Placeholders have empty text fields (e.g. no body, meta, etc.) to avoid matches on text, but the URL field must remain.
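The refresh trade-off can be illustrated with a toy model in which a dictionary stands in for the html search table. Everything here is an assumption made for the example (the function names, the row layout, the timestamps and refresh interval); it only mirrors the behavior described above and is not the appliance's storage logic.

```python
def should_fetch(table, url, now, refresh_seconds):
    """Fetch if the URL is unknown or its stored fetch time is stale."""
    row = table.get(url)
    return row is None or now - row["fetched_at"] >= refresh_seconds


def store_result(table, url, now, body, excluded_by_meta, keep_placeholder):
    """Record a fetched page; excluded pages get an empty row or nothing."""
    if excluded_by_meta:
        if keep_placeholder:
            # Placeholder: the URL and fetch time are kept, text is empty.
            table[url] = {"fetched_at": now, "body": ""}
        return
    table[url] = {"fetched_at": now, "body": body}


url = "https://www.example.com/hidden.html"

# With a placeholder, the excluded URL is remembered until it goes stale.
with_placeholder = {}
store_result(with_placeholder, url, 100, "text", True, keep_placeholder=True)
refetch_with = should_fetch(with_placeholder, url, 150, refresh_seconds=3600)

# Without one, every new link to the URL triggers another fetch.
without_placeholder = {}
store_result(without_placeholder, url, 100, "text", True, keep_placeholder=False)
refetch_without = should_fetch(without_placeholder, url, 150, refresh_seconds=3600)
```

In this model the placeholder row still carries its URL as the dictionary key, which corresponds to the downside noted above: a query that searches the URL field could still match it.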