When updating Agent records via API, validation rejects false duplicates

Description

 

Observation:

When I run Agent updates via the API, I get false duplicate-record errors for records that, after all updates are applied, would have identical name fields but would differ in all other fields, including some that I would expect the deduplication validation to use:

  • record_identifier

  • entity_identifier

  • dates_of_existence

  • authority_id

Scenario:

Given Agent record 13167 (see attached)

And Agent record 13168 (see attached)

When I update Agent record 13168 via API with the following fields:

  current_title:    Bayard, James A. (James Asheton), 1799-1880
  uri:              /agents/people/13168
  start:            1799
  end:              1880
  date_expression:  (blank)
  date_certainty:   (blank)
  last:             Bayard
  first:            James A.
  fuller_form:      James Asheton

Then I receive a false error {"error":{"names":["Agent must be unique"],"conflicting_record":["/agents/people/13167"]}}, indicating that the updates to 13168 cannot be applied because doing so would duplicate record 13167.

Expected behavior: The update should be applied and record 13168 saved.

To Reproduce:

Sorry, this won’t be straightforward to reproduce unless you have similar records lying around. One (involved) way to reproduce this might go like this:

  1. import records 13167 and 13168 into your instance

  2. post a person update (POST /agents/people/:id) to 13168, updating it with the fields and values shown in the scenario above (the script we used is here in case that helps; a rough sketch follows below)
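For orientation, here is a minimal sketch of the kind of update involved, assuming the standard ArchivesSpace backend API; the localhost URL, admin credentials, and exact JSON paths are assumptions, not the original script:

    require 'net/http'
    require 'json'

    # Assumed defaults: ArchivesSpace backend on port 8089, admin/admin login.
    Net::HTTP.start('localhost', 8089) do |http|
      # Log in and grab a session token.
      login = http.post('/users/admin/login', 'password=admin')
      headers = { 'X-ArchivesSpace-Session' => JSON.parse(login.body)['session'],
                  'Content-Type'            => 'application/json' }

      # Fetch the full record first; an update must post back a complete document.
      record = JSON.parse(http.get('/agents/people/13168', headers).body)

      # Apply the name updates from the scenario above (the JSON paths are
      # assumptions based on the published agent schema); the title and
      # structured-date updates would be applied the same way.
      record['names'][0].merge!('primary_name' => 'Bayard',
                                'rest_of_name' => 'James A.',
                                'fuller_form'  => 'James Asheton')

      # POST the update; with records like 13167/13168 this is where the
      # false "Agent must be unique" error comes back.
      puts http.post('/agents/people/13168', record.to_json, headers).body
    end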

Why is this happening?

Many thanks to my colleague Max Kadel for stepping through the code with me!

The cause for this seems to be the shallow validation in agent_manager.rb.

The validation mechanism relies on comparing two hash values: one (presumably) stored with each existing record in the db, the other calculated when a user posts a record, before it is saved to the db. If the hashes are the same, the incoming record is rejected.

Starting on line 115, the validate method calls validates_unique with input :agent_sha1. If the validation fails, the frozen string AGENT_MUST_BE_UNIQUE, set to "Agent must be unique", is returned, followed by the db error indicating the record that's preventing the incoming record from saving (line 9).
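For reference, the pattern described looks roughly like this (a simplified sketch using Sequel's validates_unique helper, not the verbatim source):

    AGENT_MUST_BE_UNIQUE = 'Agent must be unique'.freeze

    def validate
      super
      # Sequel's uniqueness helper: the save is rejected whenever another
      # row already holds the same agent_sha1 value.
      validates_unique(:agent_sha1, message: AGENT_MUST_BE_UNIQUE)
    end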

find_matching (line 277) compares agent_sha1 to a hash calculated from the (incoming) JSON record via calculate_hash.

calculate_hash is defined on line 356 and in turn calls assemble_hash_fields on line 316, and here is where it gets interesting!
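As a rough sketch of the shape of those two methods (simplified and illustrative, not the verbatim source; agent_model stands in for the relevant Sequel model):

    require 'digest/sha1'

    # Look up an existing agent whose stored digest matches the incoming
    # JSON; nothing else about the record participates in the match.
    def find_matching(json)
      agent_model.find(agent_sha1: calculate_hash(json))
    end

    # Digest only the assembled fields described below.
    def calculate_hash(json)
      Digest::SHA1.hexdigest(assemble_hash_fields(json).sort.join('-'))
    end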

assemble_hash_fields collects a bunch of fields that it then uses for calculating the hash. Those fields are:

  • name fields

  • date_type_structured

  • date_label

  • some fields from contact

  • some fields from external documents

  • some notes fields

And that’s why James Bayard 13168 cannot save! Because:

  • date_type_structured has two possible values, “single” and “range”. In this case, it’s “range”, the same as 13167.

  • date_label is "existence" in both records; it is effectively constant whenever the field is populated.

  • the contact fields are blank in both cases, since in our environment the field only applies to Agents who are donors or dealers

  • the external documents fields are blank in both cases

  • there are no notes in either record

=> since the hash seems to be computed from only those fields, the calculated value is found to already exist in the database, and the incoming record is rejected.
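To make the collision concrete, here is a self-contained illustration (a simplified stand-in for the real hashing; the authority ids and date strings are hypothetical placeholders):

    require 'digest/sha1'

    # Simplified stand-in for assemble_hash_fields + calculate_hash, using
    # only the fields listed above.
    def agent_sha1(record)
      fields = [record[:names], record[:date_type_structured],
                record[:date_label], record[:contact],
                record[:external_documents], record[:notes]]
      Digest::SHA1.hexdigest(fields.flatten.compact.join('-'))
    end

    existing = { names: ['Bayard', 'James A.', 'James Asheton'],
                 date_type_structured: 'range', date_label: 'existence',
                 authority_id: 'authority-A',    # these two differ between the
                 dates_of_existence: 'dates-A' } # records but never reach the digest

    incoming = existing.merge(authority_id: 'authority-B',
                              dates_of_existence: 'dates-B')

    puts agent_sha1(existing) == agent_sha1(incoming) # => true: false duplicate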

This produces far too many rejections of "duplicates" that aren't, in fact, duplicates: all it takes is for the name strings to be the same, the date type to be the same (while the dates themselves can differ), and for no notes, external documents, or contact information to be present.

It also explains why Henry Adams, a similar update, goes through: in this case, the existing record has a date_type_structured of "range" but the incoming record has a date_type_structured of "single", producing two different hash values:

Record 11936:

  current_title:    Adams, Henry, 1838-1918.
  uri:              /agents/people/11936
  start:            1838
  end:              1918
  date_expression:  (blank)
  date_certainty:   (blank)
  last:             Adams
  first:            Henry
  fuller_form:      Adams, Henry

Record 9969:

  current_title:    Adams, Henry, 1944-
  uri:              /agents/people/9969
  start:            1944
  end:              (blank)
  date_expression:  (blank)
  date_certainty:   (blank)
  last:             Adams
  first:            Henry
  fuller_form:      Adams, Henry
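Running the two Adams records through the same kind of illustrative digest shows why this one succeeds:

    require 'digest/sha1'

    # Same simplified digest as in the Bayard illustration above.
    def agent_sha1(record)
      fields = [record[:names], record[:date_type_structured], record[:date_label]]
      Digest::SHA1.hexdigest(fields.flatten.compact.join('-'))
    end

    adams_existing = { names: ['Adams', 'Henry', 'Adams, Henry'],
                       date_type_structured: 'range',  date_label: 'existence' }
    adams_incoming = { names: ['Adams', 'Henry', 'Adams, Henry'],
                       date_type_structured: 'single', date_label: 'existence' }

    # "range" vs "single" does feed the digest, so the hashes differ
    # and the update saves.
    puts agent_sha1(adams_existing) == agent_sha1(adams_incoming) # => false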

In other words, it appears that the hash on which the validating comparison is based is created from fields that are constant or near constant, or likely to be blank, resulting in the false duplicates.

It seems to have been introduced in this commit.

I would suggest including fields in the hash calculation that are more likely to be both present and distinct, e.g. authority_id, begin_date_standardized, or end_date_standardized.
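A hedged sketch of what that could look like (the helper name current_hash_fields and the JSON paths are illustrative assumptions, not a patch against agent_manager.rb):

    # Fold distinctive, usually-present fields into the digest input so
    # that genuinely different agents no longer collide.
    def assemble_hash_fields(json)
      fields = current_hash_fields(json) # hypothetical: the inputs gathered today
      Array(json['names']).each { |name| fields << name['authority_id'] }
      Array(json['dates_of_existence']).each do |date|
        fields << date['begin_date_standardized']
        fields << date['end_date_standardized']
      end
      fields.compact
    end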

Environment

None

Attachments: 3

Activity


Dalton Alves, March 5, 2025 at 5:03 PM

DevPri discussed this ticket at our March 2025 meeting. DevPri recommends passing. The solution proposed in the original ticket was also discussed. We tentatively endorse the proposed fields to be included in the hash calculation.

Christine Di Bella, January 13, 2025 at 6:01 PM

Even though this one is complicated and testing requires using the API, I’m passing this to Development Prioritization for discussion of better validation criteria than what’s currently there.

Details


Created September 17, 2024 at 7:37 PM
Updated March 11, 2025 at 3:04 PM