
February, 2000
For more than a year, various groups in the library have been involved in a planning process to rebuild the indexes in Acorn. This report summarizes the current status of this effort.
The current plan
We will re-index Acorn gradually by sending large batches of bibliographic records through the index regeneration programs. Sirsi has confirmed that this is a workable approach to rebuilding indexes.
According to our current estimates, we will complete this process by the first week in March.
This strategy has the main advantage of eliminating any down-time for either public access or staff searching of Acorn. The disadvantage is that it will take longer to complete and that there will be some inconsistencies in the indexes during the process. There are inconsistencies already, however.
What will be different
The purpose of this index rebuild is to bring in changes recommended by library workgroups. These changes will improve searching in Acorn and the way results are presented in browse lists.
Here are some specific changes:
MeSH searching has been more fully implemented. This change is largely in effect now. All the records from the Biomedical Library were sent off to Marcive, and we obtained from Marcive all the corresponding MeSH authority records necessary for us to have a full implementation of MeSH indexes in Acorn. The 18,523 medical bibliographic records yielded 12,570 MeSH authority records.
Changes in the GMD (General Material Designator). The GMD is stored in the subfield h of the 245 (title) field, and describes the type of media of the work for non-book materials.
The most conspicuous change--which has been implemented already--involves a global change of the text from [computer file] to [electronic resource]. This change was approved by the Information Services Advisory Group and the Cataloging and Authorities Advisory Group. This change was made globally for all existing records and is being made for current cataloging.
The GMD is currently included in the text of the browse indexes, but will be removed during re-indexing. The presence of the GMD in the indexes often caused entries for multiple instances of these titles to become separated from each other in index displays. The one exception to this rule involves Musical format materials. The GMD will continue to be included in the indexes for these materials at the request of the Music Library.
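As a rough illustration of why dropping the GMD keeps entries together, the sketch below builds a browse key from the subfields of a 245 field. The function name `title_browse_key`, the list-of-pairs record layout, and the `is_music` flag are all hypothetical; this is not how Unicorn's indexing policies are actually expressed.

```python
import re

def title_browse_key(subfields, is_music=False):
    """Build a normalized browse key from 245 subfields.

    subfields: list of (code, value) pairs in field order (assumed layout).
    is_music:  hypothetical flag standing in for the Music Library exception.
    """
    parts = []
    for code, value in subfields:
        if code == "h" and not is_music:
            # Skip the GMD so the same title in different media files together.
            continue
        parts.append(value)
    # Normalize: lowercase, replace punctuation with spaces, collapse whitespace.
    text = " ".join(parts).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

# Illustrative records only, not taken from Acorn.
print_edition = [("a", "Nature.")]
online_edition = [("a", "Nature"), ("h", "[electronic resource].")]

print(title_browse_key(print_edition))
print(title_browse_key(online_edition))
print(title_browse_key(online_edition, is_music=True))
```

With the GMD skipped, both editions produce the same key and so would sort next to each other in a browse list. With the music exception, the GMD text stays in the key.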
The 653 field will no longer be included in browse indexes. The 653 tag is used for non-controlled subject headings. This field was previously included in error--we want only authorized LCSH and MeSH entries to appear in our indexes. The contents of the 653 will continue to be searchable by keyword.
The 611 field was previously omitted from the LCNAMES browse index. This MARC tag is used for meeting names. It will now be included.
Searching by language. Unicorn currently bases its ability to search by language on the contents of the 041 field. This, unfortunately, causes items to be included when they are translated from a given language, even though the text itself is in another language. We believe that it would be more correct to base the language selection on positions 35-37 of the 008 control field of the MARC record, or on just subfield "a" of the 041, but this is not the way that Unicorn works. We have submitted this issue to Sirsi as an enhancement request. In the meantime, the search by language feature will continue to retrieve what we perceive as extraneous results.
Form/Genre headings. Changes were made to the Form/Genre headings that involve adding the subfield "v" to the index. The Form/Genre index is activated for WebCat searching. This will allow us to experiment with this new approach behind the scenes, without affecting the public catalog.
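The difference between the three approaches to language selection can be sketched as follows. The dict-based record, the function names, and the strategy labels are assumptions made for this example, and each 041 subfield is assumed to hold a single three-letter code. The "all_041" strategy is only our reading of the behavior described above, not a statement of Unicorn's internals.

```python
def languages_from_041(record, codes=None):
    """Return the language codes in the 041 field, optionally limited to some subfields."""
    return {value for code, value in record.get("041", [])
            if codes is None or code in codes}

def language_from_008(record):
    """Return positions 35-37 (zero-based) of the 008 field: the language of the item."""
    return record["008"][35:38]

def matches_language(record, lang, strategy):
    if strategy == "all_041":   # every 041 subfield counts (assumed current behavior)
        return lang in languages_from_041(record)
    if strategy == "041a":      # only subfield "a", the language of the text
        return lang in languages_from_041(record, {"a"})
    if strategy == "008":       # only the fixed-field language code
        return language_from_008(record) == lang
    raise ValueError(f"unknown strategy: {strategy}")

# Illustrative record: an English translation of a Russian work.
# The 008 is assembled piece by piece so the 40 positions are easy to check.
field_008 = "000101" + "s" + "1998" + "    " + "nyu" + " " * 17 + "eng" + " " + "d"
translation = {"008": field_008, "041": [("a", "eng"), ("h", "rus")]}

for strategy in ("all_041", "041a", "008"):
    print(strategy, matches_language(translation, "rus", strategy))
```

In this sketch a search for Russian retrieves the English translation only under the "all_041" strategy, which is the kind of extraneous result described above.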
Why we need to re-index
We last rebuilt the indexes in early November 1999. That re-indexing was necessary to fix some technical problems in the indexes that had crept in due to a system crash several months earlier. The policies that control the structure of the heading and keyword indexes did not yet reflect all the changes planned under the comprehensive index review process that was in place at that time. Although many of the features of that plan came into effect, many did not. We must rebuild the indexes now to ensure that they are correct and consistent.
There is also a problem in the current keyword indexes that affects authorities processing. The automatic authorities control integrated into Unicorn will not work until we regenerate the keyword index, which will be accomplished as part of this gradual re-indexing effort.
Problems encountered
On Saturday, February 19th, we encountered a problem with the index generation programs. The part of the process that re-organizes the indexes aborted because one of the files involved reached a size limit. As a result, the browse indexes were updated for 140,000 records but the keyword indexes were not. The main symptom of this problem was that users got the wrong record when they selected a term from a browse index. To resolve this problem, the keyword indexes had to be rebuilt. SIRSI staff created a new keyword index, which was copied into the system on Friday, February 25th. We still needed to finish the indexing process to update the browse indexes for the remaining records. These last records were indexed on Saturday, February 26th, completing the entire process.
Although we did run into this one problem in re-generating the indexes this way, we are confident that we can use this process in the future if we should need to rebuild indexes. SIRSI has made available to us a way of re-organizing the keyword indexes that will prevent the file overrun that we encountered. This time we started with relatively small batches of records, and gradually increased the numbers until we reached the maximum that our system can process. With this knowledge, we will also be able to start with these larger batches next time, greatly reducing the time it takes to re-index the entire database.
Schedule for Indexing
We plan to begin the re-indexing process in about the first week of February 2000.
The first sets of records to be re-indexed will be those that have the greatest impact on the public catalog. We will index all records that have a GMD entry in the 245 subfield "h" first. We will do these by February 5th. Once these records have been run through the indexing programs, the GMD information will no longer display in the browse indexes, except for Music. There are 158,417 records that fall into this category and we plan to index them all over a single weekend.
After that group of records is complete, we will begin processing the database in batches of about 40,000 records per weekday night, and about 100,000 per weekend. These numbers are estimates. The actual number of records processed per day will be adjusted up or down depending on the time available for batch processing and the general load on the system.
While we are re-indexing, staff might notice some side effects. There may be some mornings when reports run a little later than usual, but we generally plan a schedule where they are ready by the usual 8 a.m. time. Library staff should not schedule any reports to run between Friday afternoons and Monday mornings. Staff may continue to run reports during the day. During this period, there may be a slightly higher chance that staff will see "Record Locked" messages.
The following table describes the schedule for indexing. Numbers in bold are completed; the others reflect the numbers we plan for that date. We will update this schedule as the project progresses.
Acorn Indexing Schedule

| Date | Day | Number of Records | Records Remaining | Percent Complete |
|---|---|---|---|---|
| February 2, 2000 | Wednesday | 10,000 | 1,710,928 | 0.58% |
| February 3, 2000 | Thursday | 30,000 | 1,680,928 | 2.32% |
| February 4, 2000 | Friday | 158,417 | 1,522,511 | 11.53% |
| February 7, 2000 | Monday | 40,000 | 1,482,511 | 13.85% |
| February 8, 2000 | Tuesday | 50,000 | 1,432,511 | 16.76% |
| February 9, 2000 | Wednesday | 60,000 | 1,372,511 | 20.25% |
| February 10, 2000 | Thursday | 70,000 | 1,302,511 | 24.31% |
| February 11, 2000 | Friday | 150,000 | 1,152,511 | 33.03% |
| February 12, 2000 | Saturday | 150,000 | 1,002,511 | 41.75% |
| February 14, 2000 | Monday | 80,000 | 922,511 | 46.39% |
| February 15, 2000 | Tuesday | 80,000 | 842,511 | 51.04% |
| February 16, 2000 | Wednesday | 100,000 | 742,511 | 56.85% |
| February 17, 2000 | Thursday | 100,000 | 642,511 | 62.66% |
| February 18, 2000 | Friday | 150,000 | 492,511 | 71.38% |
| February 19, 2000 | Saturday | 220,000 | 272,511 | 84.17% |
| February 21, 2000 | Monday | 140,000 | 132,511 | 92.30% |
| February 26, 2000 | Saturday | 132,511 | 0 | 100.00% |
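
The Records Remaining and Percent Complete columns follow directly from the batch sizes. The short sketch below recomputes them as a cross-check. The total of 1,720,928 records is not stated in the report; it is derived from the first row (10,000 indexed plus 1,710,928 remaining), and the function name is made up for this example.

```python
# Derived from the first table row, not stated in the report.
TOTAL_RECORDS = 10_000 + 1_710_928

# Batch sizes from the "Number of Records" column, in date order.
BATCHES = [
    10_000, 30_000, 158_417, 40_000, 50_000, 60_000, 70_000, 150_000, 150_000,
    80_000, 80_000, 100_000, 100_000, 150_000, 220_000, 140_000, 132_511,
]

def running_progress(total, batches):
    """Return (batch, records_remaining, percent_complete) for each batch."""
    rows = []
    done = 0
    for batch in batches:
        done += batch
        rows.append((batch, total - done, round(100 * done / total, 2)))
    return rows

rows = running_progress(TOTAL_RECORDS, BATCHES)
for batch, remaining, percent in rows:
    print(f"{batch:>9,} {remaining:>11,} {percent:6.2f}%")
```

The batches sum to the derived total, so the final row reaches zero records remaining.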
Last updated February 18, 2000