The Search Appliance maintains a database that contains text from HTML pages, links to other pages, and a list of categories.
When the Search Appliance walker runs, it creates a new database under your specified data directory to hold the new walk. It then dispatches a separate process for each web site it needs to visit, plus one more to handle all of the "Single Pages". Each process retrieves every page in its base list, storing the text of each HTML page in the html table and its hyperlinks in the refs table. Any desirable URLs on a page that have not been seen before are placed in an internal "todo" list. After all of the base URLs have been processed, the process repeats with the internal todo list. When nothing is left in the todo list, processing is complete.
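The loop described above can be sketched in Python as follows. This is a minimal, single-process illustration and not the appliance's actual implementation: the `fetch` callable, the `is_desirable` filter, the `SITE` test data, and the in-memory `html` and `refs` containers are all hypothetical stand-ins for the walker's page retrieval, its URL rules, and the html and refs tables.

```python
def walk(base_urls, fetch, is_desirable):
    """Walk outward from base_urls and return (html, refs).

    fetch(url) is a hypothetical callable returning (text, links) or None.
    is_desirable(url) is a hypothetical filter for URLs worth visiting.
    """
    html = {}               # stand-in for the html table: url -> page text
    refs = []               # stand-in for the refs table: (from_url, to_url)
    seen = set(base_urls)   # URLs already queued or visited
    current = list(base_urls)

    while current:
        todo = []           # internal "todo" list built during this pass
        for url in current:
            page = fetch(url)
            if page is None:
                continue
            text, links = page
            html[url] = text
            for link in links:
                refs.append((url, link))
                if link not in seen and is_desirable(link):
                    seen.add(link)
                    todo.append(link)
        current = todo      # repeat with the todo list until it is empty
    return html, refs


# A tiny in-memory "site" used in place of real HTTP retrieval.
SITE = {
    "http://example.com/": ("home", ["http://example.com/a", "http://other.example/x"]),
    "http://example.com/a": ("page a", ["http://example.com/", "http://example.com/b"]),
    "http://example.com/b": ("page b", []),
}

html, refs = walk(
    ["http://example.com/"],
    fetch=SITE.get,
    is_desirable=lambda u: u.startswith("http://example.com/"),
)
```

In this sketch every hyperlink is recorded in `refs`, but only unseen, desirable URLs are queued for a later pass, which is why the loop terminates once no new URLs turn up.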
Once all of the walking is complete, the indices needed for searching are created on the data. The new database is then flagged as the "live" one and the old database is deleted. Your disk must therefore have enough space for two complete databases plus the temporary space used during the indexing step.
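As a rough worked example of that space requirement, the arithmetic might look like the sketch below. The function name and the sizes are hypothetical, and the sketch assumes the new database ends up about as large as the live one, which the appliance does not guarantee.

```python
def bytes_needed_for_new_walk(live_db_bytes, temp_index_bytes):
    """Estimate total disk space needed while a walk of type New is running.

    Assumption (not from the appliance): the new database is about the
    same size as the current live database.
    """
    new_db_bytes = live_db_bytes
    return live_db_bytes + new_db_bytes + temp_index_bytes


GIB = 1024 ** 3
# Hypothetical sizes: a 4 GiB live database and 1 GiB of indexing scratch space.
needed = bytes_needed_for_new_walk(live_db_bytes=4 * GIB, temp_index_bytes=1 * GIB)
```

With these example figures the estimate comes to 9 GiB: two 4 GiB databases plus 1 GiB of temporary space.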
The databases are called db1 and db2. The Search Appliance alternates between using these two names.
Note that the above applies to a walk type of New. During a walk type of Refresh, only one database, the "live" one, is used.
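The choice of database for each walk type can be illustrated with a short sketch. The function `target_database` is hypothetical and only models the rule stated here; it assumes a live database already exists and does not cover the very first walk.

```python
def target_database(live, walk_type):
    """Return the name of the database a walk writes to.

    live is the current live database name ("db1" or "db2").
    walk_type is "New" or "Refresh".
    """
    if live not in ("db1", "db2"):
        raise ValueError("live must be 'db1' or 'db2'")
    if walk_type == "Refresh":
        return live                               # only the live database is used
    if walk_type == "New":
        return "db2" if live == "db1" else "db1"  # alternate between the two names
    raise ValueError("walk_type must be 'New' or 'Refresh'")
```

For example, with db1 live, a New walk builds db2 and a Refresh walk works on db1 in place.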
The Search Appliance also maintains a file containing the detailed report for each walk. This file has the same name as the database with .long appended to the end. Also, a single file called summary is maintained with short summary information about the state of the database.
Given a data directory named .../default, the following may be present:

.../default/db1
.../default/db2
.../default/db1.long    Walk Status
.../default/db2.long    Walk Status
.../default/summary     Walk summary when viewing Walk Settings
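The layout above can be derived from the data directory name, as in the following sketch. The helper `walk_files` and the example directory `/data/default` are hypothetical; the real parent path is whatever your data directory is set to.

```python
from pathlib import PurePosixPath


def walk_files(data_dir):
    """Map each expected entry name to its path under data_dir."""
    base = PurePosixPath(data_dir)
    layout = {}
    for name in ("db1", "db2"):
        layout[name] = base / name                       # the database itself
        layout[name + ".long"] = base / (name + ".long")  # detailed walk report
    layout["summary"] = base / "summary"                  # short summary file
    return layout


# "/data/default" is an illustrative path, not a documented location.
paths = walk_files("/data/default")
```

Any of these entries may be absent at a given moment, for example the second database before the first New rewalk has run.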
Each setting has a record in the options table of the default database. See section 5.4 for the list of fields in the table.
At each complete rewalk, the current options settings are copied into an options table in the walk database. These copied options are not changed as settings are modified, and are not otherwise used unless a search is performed that sets the database with db instead of setting the profile with pr.