There is some strange, non-Unicode behavior with the EAD3 exports now (I'm guessing due to gem issues, since nothing else has changed). See the two examples below.

Description

Please see the following resource record:

http://test.archivesspace.org/staff/resources/228#tree::resource_228

If you export the EAD 2002 version, then everything is fine.

If you export the EAD3 version (see attached), then there are two immediate issues present (that were not present in version 2.2.0 of ASpace and earlier):

  1. Diacritics are mangled. See the author field (line 12 in the attached file). Landskröner --> Landskröner

  2. If you've got smart quotes or the like, then you get a big ole error in the export (see line 94 in the attached file). I won't paste the entire thing here since it's so long (it's all in the attached XML file), but I believe the relevant bit is this:

    incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) ["org/jruby/RubyString.java:2574:in `gsub'",
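For anyone trying to reproduce the two symptoms outside of ASpace, here is a minimal sketch in plain Ruby. It is not the exporter's code, and the variable names are made up for illustration. It assumes the underlying cause is the usual one for this class of problem: UTF-8 bytes that end up labeled with the wrong encoding somewhere along the way.

```ruby
# Symptom 1: mangled diacritics.
# Assumption: the UTF-8 bytes are being reinterpreted as a single-byte
# encoding (ISO-8859-1 here) and then re-encoded as UTF-8.
original = "Landskr\u00F6ner" # "Landskröner"
mangled  = original.dup.force_encoding("ISO-8859-1").encode("UTF-8")
# mangled is now "Landskr\u00C3\u00B6ner", the garbled form seen in the export

# Symptom 2: the gsub error.
# Assumption: a string holding smart quotes arrives labeled ASCII-8BIT
# (binary) and is then matched against a UTF-8 regexp.
binary = "\u201Csmart quotes\u201D".dup.force_encoding("ASCII-8BIT")
error = begin
  binary.gsub(/\u201C|\u201D/, '"')
  nil
rescue Encoding::CompatibilityError => e
  e # "incompatible encoding regexp match ..."
end

# Relabeling the same bytes as UTF-8 lets the identical gsub succeed.
fixed = binary.dup.force_encoding("UTF-8").gsub(/\u201C|\u201D/, '"')
```

In both cases the bytes themselves are fine; only the encoding label is wrong, which would be consistent with this being a gem-level regression rather than a data problem.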

To reiterate, I've only tested versions 2.2.0 and 2.4.1, but here's the gist:

In version 2.2.0, neither of the above two issues is present when exporting EAD2002 or EAD3.
In version 2.4.1, neither issue is present when exporting EAD2002, but BOTH issues are present when exporting EAD3.

We now utilize the EAD3 exports, so this is a deal breaker for us for upgrading (unless we can address it in a plugin as a stop-gap solution).

Environment

None

Activity

Christine Di Bella
September 7, 2018, 3:52 PM

Sounds like https://github.com/archivesspace/archivesspace/pull/1330 should fix this. It will be in the next release.

Mark Custer
August 2, 2018, 1:28 AM

All that said, I'm not sure why ASpace should need to scrub smart quotes and the like. Without the proper Unicode characters, you can't display those characters in other formats, such as the PDF finding aid (but maybe another option is to replace straight quotes and the like with curly quotes during that transformation, although then you couldn't have a mixture of the two). Using the Unicode characters should be preferable to encoding that information in EAD (e.g. */@render='doublequote'). Granted, I get the issue with smart quotes from programs like Word, but I believe/thought that's because those aren't the Unicode characters.

Mark Custer
August 2, 2018, 1:13 AM
Edited

I've just confirmed that upgrading nokogiri to version 1.8.4 solves this problem. We had a similar issue with nokogiri 1.8.2 and JRuby 9 when upgrading the ArchivesSpace Export Service plugin, and in that case (since a fix for 1.8.2 wasn't available yet), we wound up downgrading nokogiri. Anyhow, I think that things got patched with 1.8.3, but so far I've just tested 1.8.4.
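If a plugin needs to guard against this as a stop-gap, a version check along these lines might do. This is only a sketch: the constant and helper names are hypothetical, and treating everything from 1.8.4 onward as good is an assumption, since 1.8.4 is the only version actually tested here.

```ruby
require "rubygems"

# Hypothetical threshold: 1.8.4 is the only version confirmed in this ticket.
# 1.8.3 may also carry the fix, but that is untested, so it is excluded.
FIRST_CONFIRMED_GOOD = Gem::Version.new("1.8.4")

# Hypothetical helper: true when the given nokogiri version string is at or
# after the confirmed-good release. Gem::Version compares numerically, so
# "1.10.0" sorts after "1.8.4".
def nokogiri_confirmed_good?(version_string)
  Gem::Version.new(version_string) >= FIRST_CONFIRMED_GOOD
end
```

In a running application the argument would be Nokogiri::VERSION. Note that this check says nothing about releases older than 1.8.2, which is why downgrading also worked for the Export Service plugin.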

Done

Assignee

Mark Custer

Reporter

Mark Custer

Labels

Affects versions

Priority

Blocker