Leaderboard


Popular Content

Showing content with the highest reputation on 07/03/23 in all areas

  1. Anders Melander

    Unicode normalization

    In the end, I had to abandon PUCU as the author never bothered to react to my bug reports. Instead, I tried to adapt the JEDI implementation: I removed all dependencies and fixed the worst bugs and performance issues. Finally, I ran the code against my unit tests. The result was a big disappointment: while it didn't crash like PUCU, it failed even more of the test cases. The test suite uses the 19,000 NFC (compose) and NFD (decompose) normalization test cases published by the Unicode Consortium.

    So, back to square one again. Comparing the algorithms used by the JEDI library against countless (I've looked at over a hundred) other Unicode libraries didn't reveal the cause. They all used the same algorithms. Blogs and articles that described the algorithm also matched what was being done. I was beginning to suspect that Unicode's own test cases were wrong, but then I finally got around to reading the actual Unicode specification, where the rules and algorithms are described, and guess what: apart from Unicode's own reference library and a few others, they're all doing it wrong.

    I have now implemented the normalization functions from scratch based on the Unicode v15 specs, and all tests now pass. The functions can be found here, in case anyone needs them: https://gitlab.com/anders.bo.melander/pascaltype2/-/blob/master/Source/PascalType.Unicode.pas#L258

    Note that while the functions implement both canonical (NFC/NFD) and compatibility (NFKC/NFKD) normalization, only the canonical variants have been tested, as they are the only ones I need.
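For anyone who wants to run the same kind of conformance check against their own implementation, here is a minimal sketch of how the canonical test cases can be verified. It is written in Python and uses the standard `unicodedata` module as the normalizer under test, not the Pascal code linked above. It assumes the semicolon-separated five-column layout of the Unicode test file (source; NFC; NFD; NFKC; NFKD, each a list of hex code points). The sample lines are hand-written in that layout for illustration, and the names `parse_fields` and `check_canonical` are made up for this example.

```python
import unicodedata

# Hand-written sample lines in the assumed five-column layout:
# source; NFC; NFD; NFKC; NFKD (hex code points, space separated).
SAMPLE_LINES = [
    "1E0A;1E0A;0044 0307;1E0A;0044 0307; # D with dot above",
    "212B;00C5;0041 030A;00C5;0041 030A; # Angstrom sign (singleton)",
    "1E0A 0323;1E0C 0307;0044 0323 0307;1E0C 0307;0044 0323 0307; # reordering",
]


def parse_fields(line):
    """Return the five columns of one test line as Python strings."""
    data = line.split("#", 1)[0]      # drop the trailing comment
    columns = data.split(";")[:5]     # ignore the empty field after the last ';'
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in columns]


def check_canonical(line, normalize=unicodedata.normalize):
    """Check the NFC and NFD invariants for one test line."""
    c1, c2, c3, c4, c5 = parse_fields(line)

    # NFC: columns 1-3 must all compose to column 2, columns 4-5 to column 4.
    nfc_ok = (all(normalize("NFC", c) == c2 for c in (c1, c2, c3))
              and all(normalize("NFC", c) == c4 for c in (c4, c5)))

    # NFD: columns 1-3 must all decompose to column 3, columns 4-5 to column 5.
    nfd_ok = (all(normalize("NFD", c) == c3 for c in (c1, c2, c3))
              and all(normalize("NFD", c) == c5 for c in (c4, c5)))

    return nfc_ok and nfd_ok


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        status = "pass" if check_canonical(line) else "FAIL"
        print(status, line.split("#")[0].strip())
```

The third sample is the interesting kind: the combining marks have to be put into canonical order before composition, which is the sort of step an implementation can easily get wrong. To test a different normalizer, pass it as the `normalize` argument with the same `(form, text)` signature.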
  2. It's called "Natural" sort order. Explorer probably uses StrCmpLogicalW, but there's also CompareStringEx with SORT_DIGITSASNUMBERS. See also: Sorting for Humans: Natural Sort Order
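To illustrate the idea behind natural sort order, here is a minimal sketch in Python. It is not a reimplementation of StrCmpLogicalW or CompareStringEx, whose exact rules (locale handling, punctuation, leading zeros) are not reproduced here. It only shows the core trick of comparing runs of digits by numeric value. The function name `natural_key` is made up for this example.

```python
import re

_DIGIT_RUN = re.compile(r"(\d+)")


def natural_key(text):
    """Build a sort key that compares digit runs numerically."""
    key = []
    # re.split with a capturing group alternates text, digits, text, ...
    # so the digit runs always sit at the odd indexes.
    for index, part in enumerate(_DIGIT_RUN.split(text)):
        if index % 2 == 1:
            key.append((0, int(part)))          # numeric chunk
        else:
            key.append((1, part.casefold()))    # text chunk, case-insensitive
    return key


if __name__ == "__main__":
    names = ["file10.txt", "file2.txt", "File1.txt", "file20.txt"]
    print(sorted(names))                   # plain code point order
    print(sorted(names, key=natural_key))  # natural order
```

The tag in front of each chunk (0 for numbers, 1 for text) keeps the comparison well defined when a number in one name lines up with text in another, since Python would otherwise refuse to compare an `int` with a `str`.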