Details
Assignee: Unassigned
Reporter: Regine Heberlein
Priority: Minor
Status: Open
Created September 17, 2024 at 7:37 PM
Updated March 11, 2025 at 3:04 PM
Observation:
When I run Agent updates via API, I get false duplicate-record errors for records that, after all updates are applied, would have identical name fields BUT would differ in all other fields, including some that I would expect to be used in the deduplication validation:
record_identifier
entity_identifier
dates_of_existence
authority_id
Scenario:
Given Agent record 13167 (see attached)
And Agent record 13168 (see attached)
When I update Agent record 13168 via API with the following fields:
current_title:   Bayard, James A. (James Asheton), 1799-1880
uri:             /agents/people/13168
start:           1799
end:             1880
date_expression:
date_certainty:
last:            Bayard
first:           James A.
fuller_form:     James Asheton
Then I receive a false error
{"error":{"names":["Agent must be unique"],"conflicting_record":["/agents/people/13167"]
i.e., the updates to 13168 cannot be applied because doing so would duplicate record 13167.

Expected behavior: The update should be applied and record 13168 saved.
To Reproduce:
Sorry, this won’t be straightforward to reproduce unless you have similar records lying around. One (involved) way to reproduce this might go like this:
import records 13167 and 13168 into your instance
post a person update ([:POST] /agents/people/:id) to 13168, updating it with the field values shown in the scenario above (see the sketch below)
(the script we used is here in case that helps)
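For illustration, here is a minimal sketch of such an update call, assuming a local ArchivesSpace backend on port 8089 and admin/admin credentials (this is not the script linked above; it also assumes the ticket's last/first/fuller_form columns map to the JSONModel's primary_name/rest_of_name/fuller_form, so adjust to your instance and version):

# Minimal sketch, not the script linked above. Assumes a local ArchivesSpace
# backend on :8089 and admin/admin credentials; adjust to your instance.
require 'net/http'
require 'json'
require 'uri'

BACKEND = 'http://localhost:8089'

# Log in and grab a session token.
login = Net::HTTP.post_form(URI("#{BACKEND}/users/admin/login"),
                            'password' => 'admin')
session = JSON.parse(login.body)['session']
headers = { 'X-ArchivesSpace-Session' => session }

# Fetch the current record: the update endpoint replaces the whole record,
# so we modify the fetched JSON rather than sending a partial document.
uri = URI("#{BACKEND}/agents/people/13168")
get = Net::HTTP::Get.new(uri, headers)
agent = JSON.parse(Net::HTTP.start(uri.hostname, uri.port) { |h| h.request(get) }.body)

# Apply the name updates from the scenario above.
name = agent['names'].first
name['primary_name'] = 'Bayard'
name['rest_of_name'] = 'James A.'
name['fuller_form']  = 'James Asheton'

# Post the updated record back; with the bug present, this returns the
# false {"error":{"names":["Agent must be unique"], ...}} response.
post = Net::HTTP::Post.new(uri, headers.merge('Content-Type' => 'application/json'))
post.body = agent.to_json
puts Net::HTTP.start(uri.hostname, uri.port) { |h| h.request(post) }.body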
Why is this happening?
Many thanks to my colleague Max Kadel for stepping through the code with me!
The cause for this seems to be the shallow validation in agent_manager.rb.
The validation mechanism relies on comparison of two hash values, one (presumably) stored with existing records in the db, the other created when a user posts a record, prior to the record saving to the db. If the hash is the same, the incoming record is rejected.
Starting on line 115, the validate method calls validates_unique with input :agent_sha1. If the validation fails, the frozen string AGENT_MUST_BE_UNIQUE is returned, set to "Agent must be unique", followed by the db error indicating the record that's preventing the incoming record from saving (line 9).
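For readers unfamiliar with the pattern, here is a self-contained sketch of how a Sequel validates_unique check of this shape behaves. It illustrates the mechanism only and is not a copy of agent_manager.rb; the in-memory table is hypothetical:

# Illustration of the Sequel uniqueness-validation mechanism described
# above, not a copy of agent_manager.rb. Uses a hypothetical in-memory
# table with just the agent_sha1 column (requires the sqlite3 gem).
require 'sequel'

DB = Sequel.sqlite # in-memory database
DB.create_table(:agent_people) do
  primary_key :id
  String :agent_sha1
end

class AgentPerson < Sequel::Model(:agent_people)
  plugin :validation_helpers

  AGENT_MUST_BE_UNIQUE = 'Agent must be unique'.freeze

  def validate
    super
    # Rejects the save when another row already stores the same digest.
    validates_unique(:agent_sha1, message: AGENT_MUST_BE_UNIQUE)
  end
end

AgentPerson.create(agent_sha1: 'deadbeef')          # the existing record
incoming = AgentPerson.new(agent_sha1: 'deadbeef')  # same digest -> rejected
puts incoming.valid?                       # => false
puts incoming.errors[:agent_sha1].inspect  # => ["Agent must be unique"]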
find_matching (line 277) compares agent_sha1 to a hash calculated from the (incoming) json record via calculate_hash. calculate_hash is defined on line 356 and in turn calls assemble_hash_fields on line 316, and here is where it gets interesting!

assemble_hash_fields collects a bunch of fields that it then uses for calculating the hash. Those fields are:

name fields
date_type_structured
date_label
some fields from contact
some fields from external documents
some notes fields

And that's why James Bayard 13168 cannot save! Because:
date_type_structured has two possible values, "single" and "range". In this case, it's "range", the same as in 13167.
date_label is "existence" in both records--this is a constant when the field is populated.
the contact fields are blank in both cases, since in our environment the field only applies to Agents who are donors or dealers
the external documents fields are blank in both cases
there are no notes in either record
=> since the hash seems to be computed from only those fields, the calculated value is found to already exist in the database, and the incoming record is rejected.
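A simplified illustration of the collision (not the actual agent_manager.rb code): a digest computed from only the fields listed above never sees the dates, identifiers, or authority ids that actually differ between the two Bayard records. The record values below are schematic, not the real data:

# Toy digest over only the fields assemble_hash_fields is described as
# collecting. Record values are schematic stand-ins for 13167 and 13168.
require 'digest/sha1'

def toy_agent_sha1(agent)
  Digest::SHA1.hexdigest([
    agent[:names],                 # identical name strings in both records
    agent[:date_type_structured],  # "range" in both
    agent[:date_label],            # "existence" in both
    agent[:contacts],              # blank in both
    agent[:external_documents],    # blank in both
    agent[:notes]                  # blank in both
  ].flatten.compact.join("\0"))
end

shared = { names: ['Bayard', 'James A.', 'James Asheton'],
           date_type_structured: 'range', date_label: 'existence',
           contacts: [], external_documents: [], notes: [] }

# The two records differ in fields the digest ignores (hypothetical values).
record_13167 = shared.merge(authority_id: 'id-A', begin_date: '1799', end_date: '1880')
record_13168 = shared.merge(authority_id: 'id-B', begin_date: '1799', end_date: '1880')

# Distinct records, identical digest: the second save is rejected.
puts toy_agent_sha1(record_13167) == toy_agent_sha1(record_13168)  # => true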
This creates too many rejections of duplicates that aren’t, in fact, duplicates. All that needs to happen is for the name strings to be the same, the date type to be the same (while the dates can be different), and no additional notes, attachments, or contact information to be present.
It also explains why Henry Adams, a similar update, goes through: in this case, the existing record has a date_type_structured of "range" but the incoming record has a date_type_structured of "single", creating two different hash values.

Existing record (/agents/people/11936):
current_title:   Adams, Henry, 1838-1918.
uri:             /agents/people/11936
start:           1838
end:             1918
date_expression:
date_certainty:
last:            Adams
first:           Henry
fuller_form:     Adams, Henry

Incoming update (/agents/people/9969):
current_title:   Adams, Henry, 1944-
uri:             /agents/people/9969
start:           1944
end:
date_expression:
date_certainty:
last:            Adams
first:           Henry
fuller_form:     Adams, Henry
In other words, it appears that the hash on which the validating comparison is based is created from fields that are constant or near constant, or likely to be blank, resulting in the false duplicates.
It seems to have been introduced in this commit.
I would suggest including fields in the hash creation that are more likely to be both present and distinct, including e.g. authority_id, begin_date_standardized, or end_date_standardized.
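A toy sketch of what that might look like. The field names (authority_id, begin_date_standardized, end_date_standardized, structured_date_range) follow this ticket; the real assemble_hash_fields is more involved, and the record values below are hypothetical:

# Toy version of the suggested fix, not the actual agent_manager.rb code:
# fold authority_id and the standardized begin/end dates into the digest
# input so records with identical names but distinct identities hash apart.
require 'digest/sha1'

def improved_agent_sha1(json)
  fields = []
  json['names'].to_a.each do |name|
    fields << name['primary_name'] << name['rest_of_name']
    fields << name['authority_id']                # suggested addition
  end
  json['dates_of_existence'].to_a.each do |date|
    fields << date['date_type_structured'] << date['date_label']
    range = date['structured_date_range'] || {}
    fields << range['begin_date_standardized']    # suggested addition
    fields << range['end_date_standardized']      # suggested addition
  end
  Digest::SHA1.hexdigest(fields.compact.join("\0"))
end

base = { 'names' => [{ 'primary_name' => 'Bayard', 'rest_of_name' => 'James A.' }],
         'dates_of_existence' => [{ 'date_type_structured' => 'range',
                                    'date_label' => 'existence' }] }
# Hypothetical authority id on an otherwise identical record.
with_authority = base.merge(
  'names' => [base['names'].first.merge('authority_id' => 'n50002345')])

# Same names and date type, but a distinct authority_id now yields a
# distinct hash, so the false "Agent must be unique" rejection goes away.
puts improved_agent_sha1(base) != improved_agent_sha1(with_authority)  # => true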