MarkShark

CompareString function for UTF8 strings or buffers?


Hi all! Is there a library routine or API function that will compare two UTF-8 strings or buffers without converting them to UTF-16? I'm looking for something analogous to AnsiCompareStr. Thanks!

 

-Mark

 

Posted (edited)
2 hours ago, MarkShark said:

I'm looking for something analogous to AnsiCompareStr.

The System.AnsiStrings.AnsiCompareStr() function uses the Win32 CompareStringA() function on Windows (the System.SysUtils.AnsiCompareStr() function uses CompareStringW() instead). However, CompareStringA() assumes the input strings are in the ANSI encoding of the specified locale, so it cannot be used with UTF-8. The MSDN documentation says:

Quote

If your application is calling the ANSI version of CompareString, the function converts parameters via the default code page of the supplied locale. Thus, an application can never use CompareString to handle UTF-8 text.

 

Edited by Remy Lebeau

Posted (edited)
6 minutes ago, Remy Lebeau said:

The System.AnsiStrings.AnsiCompareStr() function uses the Win32 CompareStringA() function on Windows.

Unfortunately (see CompareStringA function (winnls.h) - Win32 apps | Microsoft Learn):

Quote

If your application is calling the ANSI version of CompareString, the function converts parameters via the default code page of the supplied locale. Thus, an application can never use CompareString to handle UTF-8 text.

It also appears that System.AnsiStrings.AnsiCompareStr ignores the code page of the AnsiStrings it is given.

Edited by pyscripter


Under POSIX, System.AnsiStrings.AnsiCompareStr converts the AnsiStrings to UnicodeStrings and then compares those.

 

FPC has a function UTF8CompareStr, which sounds promising, but it also converts the strings to UTF-16 and then compares them.

 

Unfortunately, it appears that there is no good way to compare UTF-8 strings directly, without converting them. I hope I am wrong.

14 hours ago, MarkShark said:

Hi all! Is there a library routine or API function that will compare two UTF-8 strings or buffers without converting them to UTF-16? I'm looking for something analogous to AnsiCompareStr. Thanks!

 

-Mark

 

It depends on what you need the comparison for. Equality is easy: that is a simple memory comparison. Larger/smaller implies a collation sequence, which is language-specific. Here, converting to UTF-16 first is, in my opinion, the only practical way. The same applies if you need a case-insensitive comparison, since case conversion is also language-specific and may not be applicable to the language in question at all (e.g. Chinese).
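
To make the "simple memory comparison" concrete, here is a minimal sketch in C (chosen only for illustration; the function names utf8_equal and utf8_compare_ordinal are hypothetical and not part of any library mentioned in this thread). It assumes both buffers hold valid UTF-8 and that their byte lengths are known. The second function gives an ordinal ordering only: byte order of valid UTF-8 matches code point order, which is not a language-aware collation.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 when both UTF-8 buffers contain exactly the same bytes,
   0 otherwise. No normalization or case folding is applied. */
int utf8_equal(const char *a, size_t a_len, const char *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}

/* Ordinal comparison: returns <0, 0 or >0.
   memcmp compares bytes as unsigned values, and for valid UTF-8 that
   byte order is the same as Unicode code point order. This is NOT a
   locale collation: for example 'Z' sorts before 'a'. */
int utf8_compare_ordinal(const char *a, size_t a_len,
                         const char *b, size_t b_len)
{
    size_t n = a_len < b_len ? a_len : b_len;
    int r = (n != 0) ? memcmp(a, b, n) : 0;
    if (r != 0)
        return r;
    /* Common prefix is equal: the shorter buffer sorts first. */
    return (a_len > b_len) - (a_len < b_len);
}
```

A call such as utf8_compare_ordinal("Z", 1, "a", 1) returns a negative value, which shows why this kind of ordering is rarely what a user expects from a sorted list; for that, a real collation API is still needed.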

On 4/12/2024 at 6:22 PM, pyscripter said:

Unfortunately, it appears that there is no good way to compare UTF-8 strings directly, without converting them. I hope I am wrong.

If you know the UTF-8 strings have been normalized the same way and you want to test for equality, then do a simple memory comparison.

 

If you need to do anything more complex than that (case-insensitivity, ignoring diacritics, accounting for different normalizations, etc.), then there is no reason not to convert the strings anyway.
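
As a small illustration of why the "normalized the same way" precondition matters, the following C sketch compares two encodings of the same visible character "é". The constant and function names are hypothetical, picked for this example only. One string uses the precomposed code point U+00E9, the other uses "e" followed by the combining acute accent U+0301; both are valid UTF-8 and canonically equivalent, yet their bytes differ.

```c
#include <stddef.h>
#include <string.h>

/* "é" as the single precomposed code point U+00E9 (NFC form):
   UTF-8 bytes C3 A9. */
static const char E_ACUTE_NFC[] = "\xC3\xA9";

/* "é" as 'e' (U+0065) plus combining acute accent U+0301 (NFD form):
   UTF-8 bytes 65 CC 81. */
static const char E_ACUTE_NFD[] = "e\xCC\x81";

/* Plain byte-for-byte equality of two NUL-terminated UTF-8 strings. */
int same_bytes(const char *a, const char *b)
{
    size_t a_len = strlen(a);
    size_t b_len = strlen(b);
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

Here same_bytes(E_ACUTE_NFC, E_ACUTE_NFD) returns 0 even though both strings render identically, so a memory comparison is only a valid equality test once both inputs are known to share one normalization form.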

