alank2 5 Posted May 19, 2023
So as a test I tried to set a TEdit to a UTF-8 sequence, and it ended up as a series of characters instead of the smiley face I was testing. If I convert it to a wide string first and then assign the wide string, it works:
    MultiByteToWideChar(CP_UTF8, 0, "\xf0\x9f\x98\x82", -1, ws1, 1024);
Is there a way to make it recognize a narrow string assignment as UTF-8, some sort of code page setting in the application maybe? Or will it not do that?
David Heffernan 2354 Posted May 19, 2023
Where is the UTF8 text coming from?
alank2 5 Posted May 19, 2023
I suppose anywhere; I just wondered if you could assign a char* string that is UTF-8 and have it recognized that way.
David Heffernan 2354 Posted May 20, 2023 (edited)
If it's a char* then no. But if you held the data in a type that also carried an encoding, then the RTL would convert. Why not convert to UTF-16 as early as possible though?
Edited May 20, 2023 by David Heffernan
alank2 5 Posted May 22, 2023
Thanks for the advice. On Friday I discovered something unexpected about sprintf/swprintf: string conversion from narrow to wide and vice versa uses a buffer that is limited to 512 bytes. I traced this down to the source code (in vprinter.c) and confirmed that is what it is doing. I typically think of sprintf/swprintf as functions designed to emit data directly, so that they are not limited by a buffer size, but clearly that isn't the case. Trying to convert a larger string will corrupt the program through a buffer overrun. Why they didn't just code it to do a simple bufferless conversion in place does not make sense to me, especially since they aren't really converting properly between UTF-8 and UTF-16 anyway, but it is what it is.
I found the Win32 API functions MultiByteToWideChar and WideCharToMultiByte, but at the same time I've been thinking about how I can better handle string conversion and variable-width strings in general to support UTF-8/UTF-16 better.
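As an illustration of a conversion that does not depend on a fixed intermediate buffer, here is a minimal portable sketch that decodes UTF-8 into UTF-16 and lets the output string grow as needed. The function name `utf8_to_utf16` is hypothetical; it is not part of the RTL, the C runtime, or Win32, and it is not how vprinter.c works. It replaces malformed sequences with U+FFFD but does not reject overlong encodings, so treat it as a sketch rather than a validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an RTL or Win32 function): decodes UTF-8 to UTF-16.
// The output grows dynamically, so input length is not capped by a fixed buffer.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;
        int extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(char16_t(0xFFFD)); continue; }  // stray byte

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(char16_t(0xFFFD));
            continue;
        }

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Calling `utf8_to_utf16("\xf0\x9f\x98\x82")` yields the two UTF-16 code units 0xD83D 0xDE02, the surrogate pair for U+1F602. On Windows, calling MultiByteToWideChar twice (once to query the required length, once to convert) achieves the same result without hand-written decoding.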
David Heffernan 2354 Posted May 22, 2023
The standard approach is to decide what encoding to use internally, and then convert between that encoding and any other encoding when the data first enters your program, or at the last moment before it leaves. Given that the native encoding of the frameworks that C++ Builder supports is UTF-16, that is the obvious choice for the internal encoding. You can then use TEncoding to perform all other conversions.
Why do you want to go low level to the C runtime and Win32? Why don't you use the frameworks provided?
Remy Lebeau 1441 Posted May 22, 2023 (edited)
On 5/19/2023 at 1:56 PM, alank2 said:
  So as a test I tried to set a TEdit to a UTF-8 sequence and it ended up as a series of characters and not the smile face I was testing.
That makes sense. By default, when you assign a raw char* string to a UnicodeString, the RTL has no way of knowing that the char string is encoded as UTF-8, so it assumes the string is encoded as ANSI instead.
Quote
  Is there a way to make it recognizing a narrow string assignment as UTF-8 - some sort of code page setting in the application maybe? or will it not do that?
You might try using SetMultiByteConversionCodePage(), but it would be better to put the UTF-8 char sequence inside a UTF8String object instead, e.g.:
    Edit1->Text = UTF8String("\xf0\x9f\x98\x82");
More generically, you can use RawByteString to handle any 'char'-based encoding:
    RawByteString str("\xf0\x9f\x98\x82");
    SetCodePage(str, CP_UTF8, false);
    Edit1->Text = str;
The RTL knows how to convert a UTF8String/RawByteString to a UnicodeString, and vice versa.
Edited May 23, 2023 by Remy Lebeau
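To make the "series of characters" symptom concrete, here is a small portable sketch, assuming the ANSI code page is a single-byte one such as Windows-1252. Both function names are hypothetical and exist only for illustration. `widen_bytewise` uses Latin-1 as a stand-in for the ANSI interpretation (Windows-1252 maps 0x80..0x9F differently, so the exact glyphs would differ), and `count_utf8_code_points` assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: widens every byte to its own UTF-16 unit, roughly what
// happens when the bytes are assumed to be in a single-byte ANSI code page.
// Latin-1 is used here as an approximation of such a code page.
std::u16string widen_bytewise(const std::string& bytes)
{
    std::u16string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    return out;
}

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
std::size_t count_utf8_code_points(const std::string& bytes)
{
    std::size_t n = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the bytes `"\xf0\x9f\x98\x82"`, `widen_bytewise` produces four separate UTF-16 units (four unrelated characters on screen), while `count_utf8_code_points` reports a single code point. Tagging the data as UTF-8, as with UTF8String above, is what lets the conversion treat the four bytes as one character.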