Remove non-utf8 characters from a utf8 string

Arnaud Bouchez · September 25, 2020

21 hours ago, borni69 said:
Thanks, looks like
TCharacter.IsValid(s[i]))
is not supported in 10.4

I don't get what TCharacter.IsValid() was supposed to mean. Sounds like a big confusion from the Embarcadero RTL people.
In UTF-16 you may need two WideChars to encode a single Unicode code point - this is called a surrogate pair.
So if you want to check the UTF-16 encoding validity, you have to work at the string level, or at least test two WideChars at once when needed.

I guess this may be the reason why it disappeared. Confused and confusing.
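
As a rough sketch of the string-level check described above: the function below walks a Delphi string and verifies that every high surrogate is immediately followed by a low surrogate. The name IsWellFormedUtf16 is made up for illustration and is not an RTL function; the sketch assumes a Unicode Delphi version where string is UTF-16.

```pascal
program CheckSurrogates;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not part of the RTL): returns True when every
// surrogate code unit in S belongs to a well-formed high/low pair.
function IsWellFormedUtf16(const S: string): Boolean;
var
  I: Integer;
  W: Word;
begin
  I := Low(S);
  while I <= High(S) do
  begin
    W := Ord(S[I]);
    if (W >= $D800) and (W <= $DBFF) then
    begin
      // A high surrogate must be followed by a low surrogate ($DC00..$DFFF).
      if (I = High(S)) or (Ord(S[I + 1]) < $DC00) or (Ord(S[I + 1]) > $DFFF) then
        Exit(False);
      Inc(I, 2);
    end
    else if (W >= $DC00) and (W <= $DFFF) then
      Exit(False) // low surrogate without a preceding high surrogate
    else
      Inc(I);
  end;
  Result := True;
end;

begin
  // U+1F600 encoded as the surrogate pair $D83D $DE00
  Writeln(IsWellFormedUtf16('a' + Char($D83D) + Char($DE00) + 'b'));
  // The same high surrogate on its own is not valid UTF-16
  Writeln(IsWellFormedUtf16('a' + Char($D83D) + 'b'));
end.
```

Testing one WideChar at a time cannot give this answer, because whether a surrogate is valid depends on its neighbour.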

Edited September 25, 2020 by Arnaud Bouchez

borni69 · September 25, 2020

I tested yesterday and I see that #127..#160 contains some valid characters, which also exist in our DB...

I will have another look..

Darian Miller · September 25, 2020

3 hours ago, borni69 said:

I tested yesterday and I see that #127..#160 contains some valid characters, which also exist in our DB...

I will have another look..

What encoding is utilized to store your data within your DB?

UTF-8 is typically used on any web front end, your Delphi tooling likely prefers UTF-16, while your DB may be set to the system default ANSI code page. You need to clearly define a proper strategy to obtain data (from each different source that feeds your system), properly transfer every byte received, and then store/search the textual data in a consistent format.

My assumption is that you are accepting defaults on inputs/transfers/storage and there are improper conversions going on.

borni69 · September 25, 2020

Some more information

We store everything in MySQL as UTF-8

We also use UTF-8 in our WebBroker app

The system runs on Linux as mod files connecting to the DB with FireDAC; all text is read and written via AsWideString.

The problem is:

Sometimes we get a character from the client that causes a JSON issue in Angular when it is sent back to the client; this is what we are trying to fix.

On 600 000 records we have approx 1 problem...

We could of course just make a fix for the problem we have today with this character #11, but we are trying to fix it for other characters that may come later.

The problem appears when a client copies and pastes text from other systems like words..

So now we are trying to figure out which characters to remove before we save text to the DB.

Not sure if this makes it any clearer..

B

Kas Ob. · September 25, 2020

Off topic.

@Darian Miller Would you consider writing a blog post or an article to clarify as much as you can about strings in Delphi?

Many Delphi developers confuse string types and, on top of that, abuse WideStrings without knowing it.

Such an article is overdue, and yet there is none. The Embarcadero documentation is missing many things, or does not focus on the things that really matter. It is just a suggestion to consider, as I remember you saying that you have the time, the patience, and of course the right knowledge to write such an article as reference documentation for everyone's benefit.

Darian Miller · September 25, 2020

57 minutes ago, Kas Ob. said:

Off topic.

@Darian Miller Would you consider writing a blog post or an article to clarify as much as you can about strings in Delphi?

Many Delphi developers confuse string types and, on top of that, abuse WideStrings without knowing it.

Such an article is overdue, and yet there is none. The Embarcadero documentation is missing many things, or does not focus on the things that really matter. It is just a suggestion to consider, as I remember you saying that you have the time, the patience, and of course the right knowledge to write such an article as reference documentation for everyone's benefit.

Thanks, but I'm not the right guy for that as I still want my box drawing characters above 127 for cool looking console menus. Besides, there are some references already, including:

Cary Jensen: https://www.embarcadero.com/images/dm/technical-papers/delphi-unicode-migration.pdf

Marco Cantu: https://www.embarcadero.com/images/dm/technical-papers/delphi-and-unicode-marco-cantu.pdf

Nick Hodge: https://www.embarcadero.com/images/dm/technical-papers/delphi-in-a-unicode-world-updated.pdf

There's also a help reference: http://docwiki.embarcadero.com/RADStudio/en/Unicode_in_RAD_Studio

I certainly agree that interop can be messy when dealing with third parties. Hopefully the world eventually says "Everything is now going to be UTF-8" but that will never happen. (Even if it did, there's a lot on disk that has to be converted.)

I started a TextEncodingDetector unit and did think about putting it on GitHub and doing an article on it, but I'm just not the expert. A public lashing by the real experts would be a learning experience for me I suppose. If you are dealing with a stream of bytes without a BOM or external encoding reference, and need to figure out if it's UTF-8, UTF-16 (LE/BE), or ANSI, then it simply isn't going to be 100% foolproof and the code isn't pretty. (At least mine is not.)

I think a TextEncodingDetector unit by someone like Arnaud Bouchez, François Piette, David Heffernan, Remy Lebeau, Andreas Rejbrand, etc. would be quite interesting to discuss. (Arnaud probably already has one.)

Darian Miller · September 25, 2020

1 hour ago, Kas Ob. said:

Off topic.

@Darian Miller Would you consider writing a blog post or an article to clarify as much as you can about strings in Delphi?

Many Delphi developers confuse string types and, on top of that, abuse WideStrings without knowing it.

Such an article is overdue, and yet there is none. The Embarcadero documentation is missing many things, or does not focus on the things that really matter. It is just a suggestion to consider, as I remember you saying that you have the time, the patience, and of course the right knowledge to write such an article as reference documentation for everyone's benefit.

Just saw this recent blog post today from LinkedIn:

https://developpeur-pascal.fr/p/_500e-les-pieges-de-l-encodage-lors-de-l-ouverture-de-fichiers-textes.html

It simply goes to show that there isn't a real clean solution. I don't particularly agree with this author's suggestion.

September 25, 2020

I think there is no really clean solution, in the sense of one solution that fits all, so I don't agree either.

But every use case has a clean or better approach, and it is up to the developer to choose the right tool and the right approach; for instance, I can't remember if I ever used TFile!

But to write the right code, developers need a good understanding of strings and chars, or at least good documentation or articles that discuss these things and clear up every detail, for novices and for experts. After all, in this era of the Internet (with or without Google) you don't need to remember everything or have a printed paper in your hand.

Remy Lebeau · September 25, 2020

8 hours ago, Arnaud Bouchez said:

I don't know where #127..#160 comes from.

Agreed, that is odd to allow that range but not allow 161..255 as well.

8 hours ago, Arnaud Bouchez said:

It is a valid set of chars, e.g. in Europe for accented characters like é à â.

Certainly in Unicode strings, characters in the 0..255 range are well-defined. But in ANSI/MBCS strings, characters in the 128..255 (Extended ASCII) range are locale-specific. The numeric values of characters in that range vary between what Unicode defines and what different locales/charsets define. Most locales/charsets agree with Unicode only for characters in the 0..127 (ASCII) range, then they define whatever they need for characters 128..255. Beyond character 255, you need Unicode instead.

8 hours ago, Arnaud Bouchez said:

You are confusing encodings. A Delphi string is UTF-16 encoded, so #127..#160 are valid UTF-16 characters.

Unicode codepoints #127..#160 (U+007F..U+00A0) are certainly valid in UTF-16, yes. But locale-based characters in the #128..#255 ($80..$FF) range in ANSI/MBCS strings usually map to different numeric values as Unicode codepoints.

Remy Lebeau · September 25, 2020

3 hours ago, borni69 said:

We store everything in MySQL as UTF-8

We also use UTF-8 in our WebBroker app

The system runs on Linux as mod files connecting to the DB with FireDAC; all text is read and written via AsWideString.

The problem is:

Sometimes we get a character from the client that causes a JSON issue in Angular when it is sent back to the client; this is what we are trying to fix.

Then it seems to me that you are approaching this from the wrong angle. If the frontend is UTF-8, then it doesn't matter what the client enters, it will get transmitted as UTF-8, and stored as-is in the DB as UTF-8, and converted between UTF-8<->UTF-16 where needed. Both UTFs handle the entire Unicode repertoire without any data loss. So your issue has to be somewhere else. Either you are not processing your JSON correctly, or you are not sending the JSON back to the client correctly.

3 hours ago, borni69 said:

We could of course just make a fix for the problem we have today with this character #11, but we are trying to fix it for other characters that may come later.

You really should not be filtering out ANY characters at all, especially since UTFs and JSON support ALL Unicode characters. So I suggest you take some more time to really debug the issue deeper and find out exactly where the REAL failure point is, because it is likely not what you think it is.
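
To illustrate why filtering should not be necessary: the JSON grammar (RFC 8259) requires control characters below U+0020 to be escaped inside a string, so a #11 can legally travel as \u000B. Below is a minimal sketch of that escaping rule. EscapeJsonString is a hypothetical helper written only for illustration; it is not part of System.JSON, and a real application would normally let its JSON library do this work.

```pascal
program JsonEscapeSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not part of System.JSON): returns S as a quoted
// JSON string literal with the escapes that RFC 8259 requires.
function EscapeJsonString(const S: string): string;
var
  Ch: Char;
begin
  Result := '"';
  for Ch in S do
    case Ch of
      '"': Result := Result + '\"';
      '\': Result := Result + '\\';
      #8:  Result := Result + '\b';
      #9:  Result := Result + '\t';
      #10: Result := Result + '\n';
      #12: Result := Result + '\f';
      #13: Result := Result + '\r';
    else
      if Ch < #32 then
        // Any other control character, e.g. #11 (vertical tab)
        Result := Result + '\u' + IntToHex(Integer(Ord(Ch)), 4)
      else
        Result := Result + Ch;
    end;
  Result := Result + '"';
end;

begin
  // A vertical tab between two letters, similar to the data in this thread
  Writeln(EscapeJsonString('a' + #11 + 'b'));
end.
```

If the JSON that reaches the client contains a raw $0B byte inside a string instead of such an escape, a strict parser is entitled to reject it, which is one place worth checking.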

3 hours ago, borni69 said:

The problem appears when a client copies and pastes text from other systems like words..

Which is perfectly fine, if the UI frontend and communication/DB backends are all using UTF-8 properly. The issue has to be somewhere else.

3 hours ago, borni69 said:

So now we are trying to figure out which characters to remove before we save text to the DB.

Don't remove any characters at all.

Anders Melander · September 25, 2020

5 hours ago, borni69 said:

Sometimes we get a character from the client that causes a JSON issue in Angular when it is sent back to the client; this is what we are trying to fix.

If you could define what "some json issue" is then we (and you) would have a chance of actually solving this problem. Right now we're all just guessing.

borni69 · September 26, 2020

Hi,

the JSON issue is this:

if the frontend gets a byte like 0b (#11) in the JSON, it crashes..

When we build the JSON in Delphi, before it is sent to the server, we use TJSONString.Create( delphi widestring ).

Maybe the issue is the frontend not handling the JSON correctly, or TJSONString.Create in Delphi not handling some control characters..

But now I am considering doing what was recommended in an earlier thread:

if TCharacter.IsControl(ch) then only accept #9 #10 #13, else block control characters.

All other characters will be sent out..

B

borni69 · September 26, 2020

// Keep tab (#9), LF (#10) and CR (#13); drop every other control character.
// TCharacter lives in the System.Character unit.
outtext := '';
str := memo1.Lines.Text;
for ch in str do
begin
  valid := (not TCharacter.IsControl(ch)) or CharInSet(ch, [#9, #10, #13]);
  if valid then
    outtext := outtext + ch;
end;

memo2.Lines.Text := outtext;

For me this test code seems to do the job.. and now all needed characters are sent out..

will test it a little more.

thanks..

David Heffernan · September 26, 2020

There's a lot of noise in here. It seems you don't really understand where these characters are coming from and are in trial and error programming mode.

The advice from the wise heads here is to understand what is going on, and then work out how to tackle it.

You don't seem to want to heed that advice. That's fine, it's your choice. But we don't need a blow by blow account of your trial and error coding. That's only meaningful to you.

borni69 · September 26, 2020

ok got it..

thanks

aehimself · September 26, 2020

7 hours ago, David Heffernan said:

[...] and are in trial and error programming mode. [...]

I didn't know that there is an actual expression for this. Sounds familiar though, I guess I did it lots of times as well.

David Heffernan · September 26, 2020

1 hour ago, aehimself said:

I didn't know that there is an actual expression for this. Sounds familiar though, I guess I did it lots of times as well.

We've all done it. It never works out.

Attila Kovacs · September 26, 2020

Unless it's a trial and triumph.

timfrost · September 26, 2020

On 9/23/2020 at 8:21 PM, borni69 said:

how can I remove non-utf8 characters from a utf8 string

Despite the name of this topic I cannot remember seeing any of your code which actually references a UTF-8 string. If you want an answer to the original question, the 'lazy programmer' solution (and so occasionally mine) is to pass the UTF-8 text to the Windows API MultiByteToWideChar function with source code page 65001 (CP_UTF8). With the normal options, any invalid UTF-8 sequences should be returned as U+FFFD in the returned Unicode string, and you can walk the result and drop them. This is not the 'proper' way to approach the problem, but it should work.
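
A minimal sketch of that 'lazy' approach, assuming a Windows target (it will not compile for the Linux deployment mentioned earlier in the thread) and Windows Vista or later, where passing 0 for the flags makes invalid sequences come back as U+FFFD. The function name Utf8ToStringDroppingInvalid is hypothetical. Note that it also drops any U+FFFD that was genuinely present in the input.

```pascal
program DropInvalidUtf8;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

// Hypothetical helper: decodes UTF-8 bytes with the Windows API and
// removes every U+FFFD replacement character from the result.
function Utf8ToStringDroppingInvalid(const Bytes: TBytes): string;
var
  Len: Integer;
begin
  Result := '';
  if Length(Bytes) = 0 then
    Exit;
  // First call: ask how many WideChars the converted text needs.
  Len := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(@Bytes[0]),
    Length(Bytes), nil, 0);
  SetLength(Result, Len);
  // Second call: convert. With dwFlags = 0, invalid sequences
  // are replaced by U+FFFD instead of failing the call.
  MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(@Bytes[0]),
    Length(Bytes), PWideChar(Result), Len);
  Result := StringReplace(Result, #$FFFD, '', [rfReplaceAll]);
end;

begin
  // $41 is 'A', $C3 $A9 is 'é'; the lone $FF byte is never valid in UTF-8.
  Writeln(Length(Utf8ToStringDroppingInvalid(TBytes.Create($41, $FF, $C3, $A9))));
end.
```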

David Heffernan · September 26, 2020

13 minutes ago, timfrost said:

With the normal options, any invalid UTF-8 sequences should be returned as U+FFFD in the returned Unicode string, and you can walk the result and drop them.

Asker seems to want to remove certain valid UTF8 sequences..... So this won't help.

Nobody can help with no clear spec.

Darian Miller · September 27, 2020

On 9/24/2020 at 4:10 AM, borni69 said:

I get this result...

11 - 11 - 69 - 45 - 75 - 32 - 98 - 108 - 101 -

I guess I could remove the 11, but will all non-UTF-8 characters be 11 ???

No. It's just a byte like any other single-byte character. In ASCII and UTF-8 this single byte represents a Vertical Tab, a control character rarely used today and originally used for advancing the print head when printing. This special character has seen other uses and this StackOverflow question has some history: https://stackoverflow.com/questions/3380538/what-is-a-vertical-tab

Apparently this is used as an alternate line feed character in various systems. So, depending on where your data is coming from, you may want to replace each character #11 that you come across with a CRLF pair. (Look at your data and see if that decision makes any sense.)

memo2.Lines.Text := StringReplace(memo1.lines.text, #11, sLineBreak, [rfReplaceAll]);

Not sure when sLineBreak was introduced...if you are using an old version of Delphi:

memo2.Lines.Text := StringReplace(memo1.lines.text, #11, #13#10, [rfReplaceAll]);

Curious... are you getting other odd characters, or is it just #11? (0b)

Remy Lebeau · September 28, 2020

On 9/26/2020 at 1:02 AM, borni69 said:

the JSON issue is this:

if the frontend gets a byte like 0b (#11) in the JSON, it crashes..

What kind of crash exactly? This implies to me that you are not creating your JSON properly to begin with.

On 9/26/2020 at 1:02 AM, borni69 said:

When we build the JSON in Delphi, before it is sent to the server, we use TJSONString.Create( delphi widestring ).

Can you please show your actual code that is taking in user input and producing JSON from it?

On 9/26/2020 at 1:02 AM, borni69 said:

Maybe the issue is the frontend not handling the JSON correctly, or TJSONString.Create in Delphi not handling some control characters..

There is no way for us to tell you that, because you have not shown what you are actually doing yet.

borni69 · September 28, 2020

Thanks all for helping.

We have now fixed the problem.

Sorry if I confused you guys at the start by not being clear about the problem. I did not understand it myself, and started at the wrong end.

Learned a lot about Unicode / UTF-8 these last days.

The problem was the Angular client app, which did not handle a few control characters in JSON. They were sent correctly from Delphi in TJSONString; the characters were #11, #3 and #5.

We have looped through all fields in the database and found a few places with these characters; none of them worked client side.

After a data review, we have removed them with a script in the DB.

All is working fine now.

The DB is about 20 GB and there was a total of 355 fields with a wrong character, not so many.

All are old registrations, more than 2 years old.

We will check for these characters when new registrations occur in the future.

The Angular team will also check why these characters make the client crash.

Thanks

B

Edited September 28, 2020 by borni69

Tommi Prami · October 1, 2020

I would think problem is that there is malformed data (Illegal byte sequence), and it should be fixed to have valid UTF8 encoding? Maybe?

We had recently such a situation in Delphi IDE when there was some garbage and IDE did not want to open it and complained about invalid unicode/UTF character or something like that.

But how to do it, don't know.

-Tee-

Anders Melander · October 1, 2020

14 minutes ago, Tommi Prami said:

I would think [...]? Maybe?
[...] or something like that.
But [...], don't know.

Are you sure?
