Periodic PUI indexer sometimes fails to index archival objects in a newly-imported EAD

Description

When importing very large and complex EAD records (e.g. a single ~1.5 MB file or multiple ~500 KB files), the components (c01, c02, etc., or nested c elements) can sometimes fail to appear on the public user interface. This is permanent, not a temporary delay, and it is not because the components aren't set for publication. They can be found and edited in the staff interface, but cannot be viewed or searched for on the public web site. This also leaves a gap where the "Collection Navigator" panel should be on collection pages, and on-demand PDF generation fails (the PDF failure is the only thing that triggers any error logging for this issue).

This appears to be a synchronization issue between the MySQL database and the Solr index. The periodic PUI indexer continues to run while an import job is running, but the indexer cannot see the new records created by the importer until the importer's MySQL transaction commits. By that time, the timestamp text files recording the last check for new records have been set to a time later than the last-modified time on the new rows in the database, so those rows are never picked up. This happens despite having increased the solr_indexing_frequency_seconds and pui_indexing_frequency_seconds config settings to 60 seconds. Increasing those settings further would make the problem less likely, but then people editing records in the staff interface would have to wait longer before seeing their changes in the published version. Only archival objects seem to be affected, presumably because the importer triggers real-time indexing of collections, plus any new names, subjects, etc.
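The suspected race can be illustrated with a toy model. This is only a sketch of the mechanism described above, not ArchivesSpace's indexer code: the `Row` class, the `run_indexer` function and all the times are hypothetical, and it assumes the indexer selects rows whose modification time is later than its last check.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    mtime: int       # modification time stamped when the row is written
    visible_at: int  # time at which the enclosing transaction commits


def run_indexer(rows, poll_times):
    """Poll at each time in poll_times; return the ids that get indexed."""
    indexed = set()
    last_check = 0
    for now in poll_times:
        for row in rows:
            committed = row.visible_at <= now
            # Only committed rows modified since the last check are picked up.
            if committed and row.mtime > last_check:
                indexed.add(row.id)
        # The last-check timestamp advances even if nothing was visible.
        last_check = now
    return indexed


polls = [60, 120, 180]

# Long import: rows written at t=10..50, transaction commits at t=130.
import_rows = [Row(1, 10, 130), Row(2, 30, 130), Row(3, 50, 130)]

# Ordinary staff edits: each row commits as soon as it is written.
edit_rows = [Row(1, 10, 10), Row(2, 30, 30), Row(3, 50, 50)]

missed = run_indexer(import_rows, polls)
seen = run_indexer(edit_rows, polls)
```

In this model the imported rows are never indexed, because by the first poll after the commit (t=180) the last-check time is already 120, which is later than every row's modification time. The staff edits are all indexed at the first poll.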

Running a full re-index fixes the issue, but you need to know the problem has occurred in order to know to run one. For that purpose, the attached SQL script counts the number of archival objects flagged for publication, which can then be compared with the result of a Solr search (fq=primary_type:%22archival_object%22&fq=types:%22pui%22&q=*:*&rows=0&wt=xml) counting the number actually published.
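The comparison could be scripted along the following lines. This is a minimal sketch under stated assumptions: the function names and the Solr base URL are hypothetical, the query is assumed to be the match-all `*:*`, the database count is whatever the attached SQL script returns (the script itself is not reproduced here), and the XML is assumed to follow Solr's standard `wt=xml` response shape with a `numFound` attribute on the `result` element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_count_url(solr_base_url: str) -> str:
    """Build the Solr select URL counting PUI archival objects."""
    params = [
        ("fq", 'primary_type:"archival_object"'),
        ("fq", 'types:"pui"'),
        ("q", "*:*"),   # assumed match-all query
        ("rows", "0"),  # only the count is needed, not the documents
        ("wt", "xml"),
    ]
    return f"{solr_base_url.rstrip('/')}/select?{urlencode(params)}"


def parse_num_found(solr_xml: str) -> int:
    """Extract numFound from a Solr XML response body."""
    root = ET.fromstring(solr_xml)
    result = root.find("result[@name='response']")
    if result is None:
        raise ValueError("no <result name='response'> element in Solr output")
    return int(result.attrib["numFound"])


def missing_from_index(db_published_count: int, solr_count: int) -> int:
    """Number of published archival objects absent from the PUI index."""
    return max(db_published_count - solr_count, 0)


# Hypothetical base URL and a hand-written sample response for illustration.
url = build_count_url("http://localhost:8983/solr/archivesspace/")
sample = (
    '<response><lst name="responseHeader"/>'
    '<result name="response" numFound="4200" start="0"/></response>'
)
gap = missing_from_index(db_published_count=5000, solr_count=parse_num_found(sample))
```

A non-zero gap would suggest that a full re-index is needed. Fetching the URL and running the SQL are left out so that the sketch does not depend on a live Solr or MySQL instance.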

Ideally, though, the importer should either trigger indexing for the new archival objects it creates, or somehow signal the indexer to pause until the import job has finished.
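The second suggestion (pausing the indexer during an import) can be sketched with a toy model. Everything here is hypothetical, including the `poll_with_pause` function and the `import_running` callback; it is not how ArchivesSpace works today, only an illustration of why leaving the last-check timestamp untouched during an import would let the new rows be picked up afterwards.

```python
def poll_with_pause(rows, poll_times, import_running):
    """Toy indexer that skips polls while an import job is running.

    rows: iterable of (row_id, mtime, commit_time) tuples
    import_running: callable taking the current time and returning a bool
    """
    indexed = set()
    last_check = 0
    for now in poll_times:
        if import_running(now):
            # Skip this poll and, crucially, do not advance last_check.
            continue
        for row_id, mtime, commit_time in rows:
            if commit_time <= now and mtime > last_check:
                indexed.add(row_id)
        last_check = now
    return indexed


# Rows written at t=10..50 inside a transaction that commits at t=130.
rows = [(1, 10, 130), (2, 30, 130), (3, 50, 130)]

# The import job is assumed to run from t=5 until its commit at t=130.
indexed = poll_with_pause(rows, [60, 120, 180], lambda now: 5 <= now < 130)
```

In this model the polls at t=60 and t=120 are skipped, so the poll at t=180 still compares against a last-check time of 0 and indexes all three rows. The trade-off is that unrelated staff edits made during the import would also wait until the import finishes.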

Environment

None

Status

Assignee

Unassigned

Reporter

Andrew Morrison

Affects versions

2.5.1

Priority

Major