alank2

What is the best way to convert between UnicodeString and UTF-8


There is a whole list of functions here, but it seems some are deprecated and others are not:

 

https://docwiki.embarcadero.com/RADStudio/Alexandria/en/UTF-8_Conversion_Routines

 

Many VCL/FMX properties use UnicodeString, so when working with them, if you want to convert to UTF-8 and back, what do you use?

 

One is UTF8ToUnicodeString, but I don't see its reverse, which I would have expected to be something like UnicodeStringToUTF8?


That really is the question, isn't it?  What I've *been doing* is using wchar_t in cppbuilder, but looking at that now, I'm wondering whether that is the best approach.  Most of the text I work with is going to fit in 7-bit ASCII, but if wchar_t has to use surrogates to support all of Unicode's 17 planes anyway, why not just use UTF-8, which is perhaps more efficient as well?
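To make the size trade-off above concrete, here is a minimal sketch in portable C++ (not C++Builder-specific) that counts how many UTF-16 code units and UTF-8 bytes a single code point needs. The function names `utf16_units` and `utf8_bytes` are made up for illustration and are not part of any RTL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of 16-bit UTF-16 code units for one code point.
// Code points above U+FFFF (outside the BMP) need a surrogate pair.
std::size_t utf16_units(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // highest code point in Unicode's 17 planes
    return cp < 0x10000 ? 1 : 2;
}

// Hypothetical helper: number of UTF-8 bytes for one code point.
std::size_t utf8_bytes(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;  // 7-bit ASCII
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;  // rest of the BMP
    return 4;                    // supplementary planes
}
```

A UTF-16 code unit is two bytes, so ASCII text costs one byte per character in UTF-8 and two in UTF-16, while BMP code points from U+0800 upward cost three bytes in UTF-8 and two in UTF-16. Both encodings are variable-length once you leave their one-unit range.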

 

I found this site which is certainly pro UTF-8:

http://utf8everywhere.org/

 

My question is, for modern cppbuilder development, is it better to use wchar_t or go back to char and assume it is UTF-8?  Both have the issue that a single character may take a variable number of code units anyway.  If so, then are the conversions to and from the UnicodeStrings that VCL/FMX uses worth dealing with, or does it make more sense to just store them as wchar_t?  So many things have to be converted to char for the outside world anyway.

 

I know there may not be a one-size-fits-all answer to this, so I just wanted to get everyone's opinion.


Well, I cannot speak for C++-Builder, but in Delphi there is type UTF8String and you can just assign to and from string:

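var
  S: string;       // UnicodeString (UTF-16)
  u8: UTF8String;  // AnsiString with code page 65001
begin
  S := 'Hello World';
  u8 := S;         // implicit UTF-16 -> UTF-8 conversion

  u8 := 'Hello World';
  S := u8;         // implicit UTF-8 -> UTF-16 conversion
end;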

 

1 hour ago, Uwe Raabe said:

Well, I cannot speak for C++-Builder, but in Delphi there is type UTF8String and you can just assign to and from string

UTF8String exists in C++Builder too, and does the same implicit conversion to UTF-8 when assigned other string types.
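For anyone curious what such a conversion involves, here is a rough sketch in portable standard C++, using `std::u16string` and `std::string` as stand-ins for UnicodeString and UTF8String. It is not how the RTL implements it, and `utf16_to_utf8` and `utf8_to_utf16` are hypothetical names. In C++Builder itself the plain assignment described above should be all you need.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: UTF-16 -> UTF-8. Unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;                                   // consume the low surrogate
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     // stray low surrogate
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: UTF-8 -> UTF-16. Malformed sequences become U+FFFD.
// Simplification: overlong encodings are not rejected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i++]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;                         // continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out += char16_t(0xFFFD); continue; }    // invalid lead byte
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;                            // truncated or bad continuation
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out += char16_t(0xFFFD);
        } else if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {                                       // encode as a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Pure ASCII such as "Hello World" comes out byte-for-byte the same in UTF-8, which is part of why it is convenient for the "outside world" case.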

3 hours ago, alank2 said:

My question is, for modern cppbuilder development, is it better to use wchar_t or go back to char and assume it is UTF-8?

It really depends on what you are using the strings for.

 

If the strings are mostly for interacting with Embarcadero's RTL/VCL/FMX frameworks, then stick with UnicodeString/System::String, and convert to other string types only when needed.

 

If the strings are mostly for interacting with external libraries, then use whatever type is most suitable for those libs, and convert to/from UnicodeString only when needed.

 

