Syntax: select Yes or No buttons
robots.txt
With this set to Yes, the Search Appliance will first fetch
/robots.txt from each site being indexed and respect its
directives for which prefixes to ignore. Turning this setting off is
not generally recommended. Supported directives in robots.txt
include User-agent, Disallow, Allow, Sitemap, and Crawl-delay.
Note that any Crawl-delay value will be adjusted to fit within the
Robots Crawl-delay range (here), and overrides Walk Delay (here).
Any Sitemap links in robots.txt will be walked as
well, subject to normal exclusion settings. Sitemaps not in
robots.txt may be added via Base URL(s) (here) or URL URL (here).
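
As a rough illustration of the directives listed above, the following sketch reads a sample robots.txt with Python's standard urllib.robotparser module. It is not the Search Appliance's own parser: the example.com URLs, the clamp_crawl_delay helper, and the 0 to 10 second range are assumptions made up for this example, and the appliance's actual Robots Crawl-delay range is whatever that setting specifies.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt using each supported directive (hypothetical site).
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public-report.html
Disallow: /private/
Crawl-delay: 30
Sitemap: https://www.example.com/sitemap.xml
"""


def clamp_crawl_delay(delay, lowest, highest):
    """Fit a Crawl-delay value into an allowed [lowest, highest] range.

    Hypothetical helper: the range bounds are assumptions for this sketch.
    """
    return max(lowest, min(highest, delay))


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Which URLs may be fetched under the "*" user-agent rules.
allowed = parser.can_fetch("*", "https://www.example.com/index.html")
blocked = parser.can_fetch("*", "https://www.example.com/private/data.html")

# Crawl-delay as written in robots.txt, then fitted into an assumed range.
delay = parser.crawl_delay("*")
effective_delay = clamp_crawl_delay(delay, 0, 10)

# Sitemap links that a walker could go on to fetch (Python 3.8+).
sitemaps = parser.site_maps()

print(allowed, blocked, delay, effective_delay, sitemaps)
```

Note that Python's parser applies the first matching Allow or Disallow rule in file order, which is why Allow is listed first here; other robots.txt implementations may resolve overlapping rules differently.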
Meta
Respect the meta tag called robots. With this set to Yes,
the Search Appliance will process and respect the robot control information
within each retrieved HTML page.
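
To show what robot control information in a page looks like, here is a minimal sketch that extracts the robots meta tag with Python's standard html.parser module. The class and function names are hypothetical, and the interpretation of noindex, nofollow, and none follows common convention; exactly which values the Search Appliance honors is an assumption here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta(html):
    """Return the set of robots meta directives found in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body>Internal page</body></html>"
)
directives = robots_meta(page)

# Conventional reading: "none" is shorthand for "noindex, nofollow".
may_index = "noindex" not in directives and "none" not in directives
may_follow = "nofollow" not in directives and "none" not in directives
print(directives, may_index, may_follow)
```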
Placeholder
Whether to still put an (empty) entry, a placeholder, in the
html search table for URLs that are excluded via
<meta name="robots"> tags. Leaving a placeholder improves
refresh walks, as the URL can then have its own individual refresh
time like any other stored URL. Without a placeholder, the URL would
be fetched every time a link to it is found, because no knowledge that
it has been recently fetched would be stored.
The downside to placeholders is that if the URL is also being searched
in queries (i.e. Url is part of Index Fields), then
the excluded URL might be found in results. Placeholders have empty
text fields (e.g. no body, meta, etc.) to avoid matches on text, but
the URL field must remain.
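
The trade-off described above can be sketched with a toy in-memory table. This is only an illustration of the idea, not the Search Appliance's storage: the Row fields, the store and needs_fetch functions, and the one-day refresh interval are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Row:
    url: str
    body: str
    meta: str
    next_refresh: float  # time (in seconds) when the URL is next due


def store(table, url, body, meta, excluded, use_placeholder, now, interval):
    """Record a fetched page; excluded pages get an empty placeholder or nothing."""
    if excluded and not use_placeholder:
        return  # no row at all, so the fetch is not remembered
    if excluded:
        body, meta = "", ""  # empty text fields, but the URL field remains
    table[url] = Row(url, body, meta, now + interval)


def needs_fetch(table, url, now):
    """A URL is fetched if it is unknown or its refresh time has passed."""
    row = table.get(url)
    return row is None or now >= row.next_refresh


DAY = 86400.0
url = "https://www.example.com/excluded.html"

with_placeholder = {}
store(with_placeholder, url, "text", "desc", True, True, now=0.0, interval=DAY)

without_placeholder = {}
store(without_placeholder, url, "text", "desc", True, False, now=0.0, interval=DAY)

# An hour later a link to the excluded URL is found again.
print(needs_fetch(with_placeholder, url, now=3600.0))     # placeholder remembers the fetch
print(needs_fetch(without_placeholder, url, now=3600.0))  # no record, so it is fetched again
```

In this sketch the placeholder row still carries its URL, which is why a query that searches the Url field could match it even though its text fields are empty.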
See also Robots.txt (4.5).