alank2

Assigning UTF-8 to a control


So as a test I tried to set a TEdit to a UTF-8 sequence and it ended up as a series of characters and not the smile face I was testing.

 

If I convert it first to a wide string and then assign the wide string it works:

wchar_t ws1[1024];  // destination buffer; the last argument below is its size in wide characters
MultiByteToWideChar(CP_UTF8, 0, "\xf0\x9f\x98\x82", -1, ws1, 1024);

 

Is there a way to make it recognize a narrow string assignment as UTF-8, perhaps some sort of code page setting in the application? Or will it not do that?


If it's a char* then no. But if you held the data in a type that also carries an encoding, then the RTL would convert. Why not convert to UTF-16 as early as possible though?

Edited by David Heffernan


Thanks for the advice.  On Friday I discovered something unexpected about sprintf/swprintf: when they convert a string from narrow to wide or vice versa, they use a buffer that is limited to 512 bytes.  I traced this into the source code (vprinter.c) and confirmed that this is what it does.  I usually think of sprintf/swprintf as emitting data directly, so that no intermediate buffer size limits them, but clearly that isn't the case.  Trying to convert a larger string overruns the buffer and corrupts the program.  Why they didn't just code a simple bufferless conversion in place doesn't make sense to me, especially since they aren't really converting properly between UTF-8 and UTF-16 anyway, but it is what it is.

 

I found the Win32 API functions MultiByteToWideChar and WideCharToMultiByte, but at the same time I've been thinking about how I can better handle string conversion and variable-width strings in general to support UTF-8/UTF-16 better.
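For comparison, a conversion that writes into a growable string instead of a fixed scratch buffer could look roughly like the sketch below. It is portable C++ and is not how vprinter.c or the RTL does it; the name utf8_to_utf16 is made up for illustration. It assumes reasonably well-formed input: it rejects bad lead and continuation bytes and values above U+10FFFF, but it does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any RTL): decode UTF-8 and append UTF-16
// code units to a std::u16string, which grows as needed, so there is no
// fixed-size intermediate buffer to overrun.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    out.reserve(in.size());  // UTF-16 never needs more code units than UTF-8 has bytes

    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;

        // The lead byte gives the sequence length and the top bits of the code point.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += len;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With this sketch the four bytes "\xf0\x9f\x98\x82" decode to U+1F602 and come out as the surrogate pair 0xD83D 0xDE02, and the input length is bounded only by available memory.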

 


The standard approach is to decide what encoding to use internally. And then convert between that encoding and any other encoding when the data initially enters your program, or at the last moment before it leaves. 

 

Given that the native encoding of the frameworks that C++ Builder supports is UTF-16, that is the obvious choice for the internal encoding.

 

And then you can use TEncoding to perform all other conversions. 

 

Why do you want to go down to the low-level C runtime and Win32? Why not use the frameworks provided?

On 5/19/2023 at 1:56 PM, alank2 said:

So as a test I tried to set a TEdit to a UTF-8 sequence and it ended up as a series of characters and not the smile face I was testing.

That makes sense. By default, when a raw char* string is assigned to a UnicodeString, the RTL has no way of knowing that the char data is encoded as UTF-8, so it assumes the string is encoded as ANSI instead.
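To illustrate the effect, here is a small portable sketch of what happens when each byte is widened on its own, as a single-byte code page would do. It uses Latin-1 as a stand-in for the ANSI code page, which is a simplification: Windows-1252, for example, maps 0x9F, 0x98 and 0x82 to different code points. The function name widen_single_byte is made up for illustration.

```cpp
#include <string>

// Hypothetical helper: treat every byte as one character of a single-byte
// code page (Latin-1 here, as a simplified stand-in for ANSI) and widen it
// to one UTF-16 code unit. No multi-byte sequences are recognized.
std::u16string widen_single_byte(const std::string& in)
{
    std::u16string out;
    for (unsigned char b : in)
        out.push_back(static_cast<char16_t>(b));
    return out;
}
```

Passing "\xf0\x9f\x98\x82" through this yields four separate code units (0x00F0, 0x009F, 0x0098, 0x0082) instead of the single code point U+1F602, which matches the "series of characters" seen in the TEdit.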

Quote

Is there a way to make it recognize a narrow string assignment as UTF-8, perhaps some sort of code page setting in the application? Or will it not do that?

You might try using SetMultiByteConversionCodePage(), but it would be better to put the UTF-8 char sequence inside a UTF8String object instead, e.g.:

Edit1->Text = UTF8String("\xf0\x9f\x98\x82");

More generically, you can use RawByteString to handle any 'char'-based encoding:

RawByteString str("\xf0\x9f\x98\x82");
SetCodePage(str, CP_UTF8, false);
Edit1->Text = str;

The RTL knows how to convert a UTF8String/RawByteString to a UnicodeString, and vice versa.

Edited by Remy Lebeau
