Assigning UTF-8 to a control

alank2 · May 19, 2023

So as a test I tried to set a TEdit to a UTF-8 sequence and it ended up as a series of characters and not the smile face I was testing.

If I convert it first to a wide string and then assign the wide string it works:

MultiByteToWideChar(CP_UTF8, 0, "\xf0\x9f\x98\x82", -1, ws1, 1024);

Is there a way to make it recognize a narrow string assignment as UTF-8, some sort of code page setting in the application maybe? Or will it not do that?

David Heffernan · May 19, 2023

Where is the UTF8 text coming from?

alank2 · May 19, 2023

I suppose anywhere; I just wondered if you could assign a char* string that is UTF-8 and have it recognize it that way.

David Heffernan · May 20, 2023

If it's a char* then no. But if you held the data in a type that also carried an encoding, then the RTL would convert it. Why not convert to UTF-16 as early as possible though?

Edited May 20, 2023 by David Heffernan

alank2 · May 20, 2023

C++ Builder

alank2 · May 22, 2023

Thanks for the advice. On Friday I discovered something unexpected about sprintf/swprintf: the conversion of string arguments from narrow to wide, and vice versa, goes through a buffer that is limited to 512 bytes. I traced this to the source code (in vprinter.c) and confirmed that is what it does. I usually think of sprintf/swprintf as functions designed to emit data directly, so that they are not limited by a buffer size, but clearly that isn't the case. Trying to convert a larger string overruns the buffer and corrupts the program. Why they didn't just code a simple bufferless conversion in place does not make sense to me, especially since they aren't really converting properly between UTF-8 and UTF-16 anyway, but it is what it is.

I found the Win32 API functions MultiByteToWideChar and WideCharToMultiByte, but at the same time I've been thinking about how I can better handle string conversion and variable-width strings in general to support UTF-8/UTF-16 better.
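For illustration only, a conversion that is not tied to a fixed-size buffer might look like the sketch below. It is plain standard C++ rather than anything from the RTL or Win32, and utf8_to_utf16 is a hypothetical helper name. The output is a std::u16string that grows as needed, so the input length is not capped at 512 bytes or any other constant.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an RTL or Win32 function): decodes UTF-8 into
// UTF-16. The output string grows as needed, so there is no fixed buffer.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());  // UTF-16 never needs more code units than UTF-8 has bytes
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;      // code point being assembled
        std::size_t extra = 0;     // continuation bytes expected after the lead byte
        std::uint32_t min_cp = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min_cp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        // The lead byte plus its continuation bytes must all be present.
        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        // Reject overlong forms, surrogate code points, and out-of-range values.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Calling utf8_to_utf16("\xf0\x9f\x98\x82") should give the two code units 0xD83D 0xDE02, the surrogate pair for U+1F602. Throwing on malformed input is just one possible choice; substituting U+FFFD would be another.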

David Heffernan · May 22, 2023

The standard approach is to decide what encoding to use internally. And then convert between that encoding and any other encoding when the data initially enters your program, or at the last moment before it leaves.

Given that the native encoding of the frameworks that C++ Builder supports is UTF-16, that is the obvious choice for the internal encoding.

And then you can use TEncoding to perform all other conversions.

Why do you want to go low level, down to the C runtime and Win32? Why not use the frameworks provided?

Remy Lebeau · May 22, 2023

On 5/19/2023 at 1:56 PM, alank2 said:

So as a test I tried to set a TEdit to a UTF-8 sequence and it ended up as a series of characters and not the smile face I was testing.

Makes sense, because by default assigning a raw char* string to a UnicodeString has no way of knowing that the char string is encoded as UTF-8, so it assumes the string is encoded as ANSI instead.
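As a rough illustration of why four characters show up instead of one, the sketch below widens each byte to its own UTF-16 code unit. This is an assumption made for simplicity: it is a Latin-1 style widening standing in for the real ANSI conversion, which depends on the system code page and maps some bytes differently. widen_bytewise is a hypothetical name, not an RTL function.

```cpp
#include <string>

// Hypothetical helper: treats every byte as a separate character and widens
// it to one UTF-16 code unit (Latin-1 style). This only approximates an ANSI
// conversion; a real code page such as Windows-1252 maps some bytes differently.
std::u16string widen_bytewise(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));  // one byte in, one code unit out
    }
    return out;
}
```

With the input "\xf0\x9f\x98\x82" this produces four code units, one per byte, whereas a UTF-8 aware conversion would produce a single code point (stored as one surrogate pair in UTF-16).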

Quote

Is there a way to make it recognize a narrow string assignment as UTF-8, some sort of code page setting in the application maybe? Or will it not do that?

You might try using SetMultiByteConversionCodePage(), but it would be better to put the UTF-8 char sequence inside of a UTF8String object instead, eg:

Edit1->Text = UTF8String("\xf0\x9f\x98\x82");

More generically, you can use RawByteString to handle any 'char'-based encoding:

RawByteString str("\xf0\x9f\x98\x82");
SetCodePage(str, CP_UTF8, false);
Edit1->Text = str;

The RTL knows how to convert a UTF8String/RawByteString to a UnicodeString, and vice versa.

Edited May 23, 2023 by Remy Lebeau

alank2 · May 22, 2023

Thank you; I'll check it out!
