Anders Melander

Unicode normalization


I'm writing a shaper for complex text layout and for that, I need to do Unicode decomposition and composition (NFD and NFC normalization).

Does anyone know of a Pascal library that can do this?
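To illustrate, this is the kind of round-trip I need (a minimal sketch with placeholder names, not any particular library's API):

// Hypothetical API operating on UCS4/UTF-32 code point arrays.
type
  TCodePoints = array of Cardinal;   // UTF-32 code points

procedure Example;
var
  Decomposed, Recomposed: TCodePoints;
begin
  // U+00C5 (Å) decomposes canonically into U+0041 (A) + U+030A (COMBINING RING ABOVE)
  Decomposed := UnicodeDecompose(TCodePoints.Create($00C5)); // NFD: expect [$0041, $030A]
  Recomposed := UnicodeCompose(Decomposed);                  // NFC: expect [$00C5] again
end;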


I have the following basic requirements:

  • Open source with a workable license (i.e. not GPL).
  • Cross platform (i.e. not tied to Windows or whatever).
  • Operate on UCS4/UTF-32 strings.
  • Based on the standard Unicode data tables.
  • Must have the ability to update tables when new Unicode tables are published.
  • Must support both NFD decomposition and NFC composition.


So far I have found the following candidates:

Delphi Unicode libraries

  • PUCU Pascal UniCode Utils Libary
    🔵 Origin: Benjamin Rosseaux.
    ✅ Maintained: Yes.
    ✅ Maintained by author: Yes.
    ✅ License: Zlib.
    ⛔ Readability: Poor. Very bad formatting.
    ✅ Performance: The native string format is UCS4 (32-bit).
    ✅ Features: Supports decomposition and composition.
    ✅ Dependencies: None.
    ✅ Data source: Unicode tables are generated from official Unicode data files. Source for converter provided.
    ⛔ Table format: Generated inline arrays and code.
    ✅ Completeness: All Unicode tables are available.
    ✅ Hangul decomposition: Yes.
    ✅ Customizable: Required data and structures are exposed.
    ✅ Unicode level: Currently at Unicode v15.
    ⛔ Unicode normalization test suite: Fail/Crash

  • FreePascal RTL
    🔵 Origin: Based on code by Inoussa Ouedraogo.
    ✅ Maintained: Yes.
    🔵 Maintained by author: No.
    🔵 License: GPL with linking exception.
    ✅ Readability: Good. Code is clean.
    ✅ Performance: Code appears efficient.
    ⛔ Features: Only supports decomposition, not composition.
    ✅ Dependencies: None.
    ✅ Data source: Unicode tables are generated from official Unicode data files. Source for converter provided.
    ✅ Table format: Generated arrays in include files.
    ⛔ Completeness: Only some Unicode tables are available.
    ⛔ Hangul decomposition: No.
    ⛔ Customizable: Required data and structures are private.
    ✅ Unicode level: Currently at Unicode v14.
    ⛔ Unicode normalization test suite: N/A; composition not supported.

  • JEDI JCL
    🔵 Origin: Based on Mike Lischke's Unicode library.
    🔵 Maintained: Sporadically.
    🔵 Maintained by author: No.
    ✅ License: MPL.
    ✅ Readability: Good. Code is clean.
    ⛔ Performance: Very inefficient. String reallocations. The native string format is UCS4 (32-bit).
    ✅ Features: Supports decomposition and composition.
    ⛔ Dependencies: Has dependencies on a plethora of other JEDI units.
    ✅ Data source: Unicode tables are generated from official Unicode data files. Source for converter provided.
    ✅ Table format: Generated resource files.
    ✅ Completeness: All Unicode tables are available.
    ✅ Hangul decomposition: Yes.
    ⛔ Customizable: Required data and structures are private.
    🔵 Unicode level: Currently at Unicode v13.
    ⛔ Other: Requires installation (to generate the JEDI.inc file).
    🔵 Unicode normalization test suite: Unknown


The FPC implementation has had the composition part removed, so that immediately disqualifies it. The JEDI implementation, while based on an originally nice and clean implementation, has gotten the usual JEDI treatment, so it pulls in the rest of the JEDI JCL as dependencies. I could clean that up but it would amount to a fork of the code and I would prefer not to have to also maintain that piece of code.

That leaves the PUCU library and I currently have that integrated and working - or so I thought... Unfortunately, I have now found a number of severe defects in it and that has prompted me to search for alternatives again.


Here's the project I need it for: https://gitlab.com/anders.bo.melander/pascaltype2

20 hours ago, Anders Melander said:

I could clean that up but it would amount to a fork of the code and I would prefer not to have to also maintain that piece of code.

Why not extract only the necessary external requirements into custom low-fat units and leave the unit of interest as is? Or are there too many deps?

1 minute ago, Fr0sT.Brutal said:

Why not extract only the necessary external requirements into custom low-fat units and leave the unit of interest as is? Or are there too many deps?

I guess that's one way to remove the dependencies. I hadn't thought of that.


But I would still have to clean up the remaining source and rewrite parts of it, so again: a fork.
If I go that way, I would probably prefer to simply start from the original, pre-JEDI version of the source instead of trying to polish the turd it has become.


In the end, I had to abandon PUCU as the author never bothered to react to my bug reports.

Instead, I tried to adapt the JEDI implementation: I removed all dependencies and fixed the worst bugs and performance issues. Finally, I ran the code against my unit tests. The result was a big disappointment: while it didn't crash like PUCU, it failed even more of the test cases. The test suite uses the 19,000 NFC (compose) and NFD (decompose) normalization test cases published by the Unicode consortium.
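For reference, NormalizationTest.txt has five semicolon-separated columns per test case (c1..c5), each a list of space-separated hex code points, where c2 and c3 are the expected NFC and NFD forms of c1 (and c4/c5 the NFKC/NFKD forms). A rough sketch of what such a test loop can look like, checking only the c1 column, with NormalizeNFC/NormalizeNFD standing in for whatever implementation is under test:

// Rough sketch of a test loop over NormalizationTest.txt (simplified; only the
// NFC/NFD expectations for column 1 are checked here). NormalizeNFC/NormalizeNFD
// are placeholders for the implementation under test.
uses
  System.SysUtils, System.IOUtils;

type
  TCodePoints = array of Cardinal;

function ParseCodePoints(const Field: string): TCodePoints;
var
  Hex: string;
begin
  Result := nil;
  for Hex in Field.Split([' '], TStringSplitOptions.ExcludeEmpty) do
  begin
    SetLength(Result, Length(Result) + 1);
    Result[High(Result)] := StrToInt('$' + Hex);
  end;
end;

function SameCodePoints(const A, B: TCodePoints): Boolean;
var
  i: Integer;
begin
  Result := Length(A) = Length(B);
  if Result then
    for i := 0 to High(A) do
      if A[i] <> B[i] then
        Exit(False);
end;

procedure RunNormalizationTests(const FileName: string);
var
  Line: string;
  Columns: TArray<string>;
  Source, ExpectedNFC, ExpectedNFD: TCodePoints;
begin
  for Line in TFile.ReadAllLines(FileName) do
  begin
    if (Line = '') or (Line[1] = '#') or (Line[1] = '@') then
      Continue;                                 // skip comments and part headers
    Columns := Line.Split([';']);
    Source      := ParseCodePoints(Columns[0]); // c1
    ExpectedNFC := ParseCodePoints(Columns[1]); // c2 = NFC(c1)
    ExpectedNFD := ParseCodePoints(Columns[2]); // c3 = NFD(c1)
    Assert(SameCodePoints(NormalizeNFC(Source), ExpectedNFC), 'NFC failed: ' + Line);
    Assert(SameCodePoints(NormalizeNFD(Source), ExpectedNFD), 'NFD failed: ' + Line);
  end;
end;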


So back to square one again. Comparing the algorithms used by the JEDI library against countless other Unicode libraries (I've looked at over a hundred) didn't reveal the cause: they all used the same algorithms, and the blogs and articles that describe the algorithm also matched what was being done. I was beginning to suspect that Unicode's own test cases were wrong, but then I finally got around to reading the actual Unicode specification, where the rules and algorithms are described, and guess what: apart from Unicode's own reference library and a few others, they're all doing it wrong.
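For reference, this is roughly what the canonical composition step in the standard's Canonical Composition Algorithm (chapter 3) amounts to. A simplified sketch, not the actual code: TCodePoints is an array of Cardinal as above, and CombiningClass/PrimaryComposite stand in for the table lookups, where PrimaryComposite must also honor the composition exclusions and the algorithmic Hangul compositions, returning 0 when there is no primary composite:

// Sketch of the canonical composition pass as described by the Unicode Standard.
// The input is assumed to already be canonically decomposed and in canonical order.
procedure CanonicalCompose(var CodePoints: TCodePoints);
var
  i, j, StarterIndex, Count: Integer;
  C, Composite: Cardinal;
  Blocked: Boolean;
begin
  Count := Length(CodePoints);
  i := 1;                                       // start at the second character
  while i < Count do
  begin
    C := CodePoints[i];

    // Find the last starter (combining class 0) preceding C.
    StarterIndex := -1;
    for j := i - 1 downto 0 do
      if CombiningClass(CodePoints[j]) = 0 then
      begin
        StarterIndex := j;
        Break;
      end;

    // C is blocked from the starter if any character between them is itself a
    // starter or has a combining class >= that of C.
    Blocked := (StarterIndex < 0);
    if not Blocked then
      for j := StarterIndex + 1 to i - 1 do
        if (CombiningClass(CodePoints[j]) = 0) or
           (CombiningClass(CodePoints[j]) >= CombiningClass(C)) then
        begin
          Blocked := True;
          Break;
        end;

    if not Blocked then
    begin
      Composite := PrimaryComposite(CodePoints[StarterIndex], C);
      if Composite <> 0 then
      begin
        // Replace the starter with the composite and delete C from the sequence.
        CodePoints[StarterIndex] := Composite;
        for j := i to Count - 2 do
          CodePoints[j] := CodePoints[j + 1];
        Dec(Count);
        Continue;                               // do not advance; re-examine position i
      end;
    end;

    Inc(i);
  end;
  SetLength(CodePoints, Count);
end;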


I have now implemented the normalization functions from scratch based on the Unicode v15 specs and all tests now pass.

The functions can be found here, in case anyone needs them: https://gitlab.com/anders.bo.melander/pascaltype2/-/blob/master/Source/PascalType.Unicode.pas#L258

Note that while the functions implement both canonical (NFC/NFD) and compatibility (NFKC/NFKD) normalization, only the canonical variants have been tested, as they are the only ones I need.
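A concrete example of the difference: U+FB01 (LATIN SMALL LIGATURE FI) has a compatibility decomposition but no canonical one, so the canonical forms leave it alone while the compatibility forms expand it:

  NFD / NFC  : $FB01 -> $FB01 (unchanged)
  NFKD / NFKC: $FB01 -> $0066 $0069 ('f', 'i')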


