borni69

Remove non-utf8 characters from a utf8 string


21 hours ago, borni69 said:

Thanks, looks like

TCharacter.IsValid(s[i])

is not supported in 10.4

I don't get what TCharacter.IsValid() was supposed to mean. Sounds like a big confusion from the Embarcadero RTL people.
In UTF-16 you may need two WideChars to encode a single Unicode code point; this is called a surrogate pair.
So if you want to check the validity of the UTF-16 encoding, you have to work at the string level, or at least test two WideChars at once when needed.

I guess this may be the reason why it disappeared. Confused and confusing.
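
As a rough illustration of the string-level check described above, here is a minimal sketch that walks a UTF-16 string and verifies that every surrogate belongs to a well-formed pair. The names IsLeadUnit, IsTrailUnit and IsWellFormedUtf16 are made up for this example and are not part of the RTL.

```pascal
program Utf16Check;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// High (lead) surrogates occupy U+D800..U+DBFF
function IsLeadUnit(C: Char): Boolean;
begin
  Result := (C >= #$D800) and (C <= #$DBFF);
end;

// Low (trail) surrogates occupy U+DC00..U+DFFF
function IsTrailUnit(C: Char): Boolean;
begin
  Result := (C >= #$DC00) and (C <= #$DFFF);
end;

// Hypothetical helper: True when every surrogate in S is correctly paired
function IsWellFormedUtf16(const S: string): Boolean;
var
  I: Integer;
begin
  I := Low(S);
  while I <= High(S) do
  begin
    if IsLeadUnit(S[I]) then
    begin
      // A lead unit must be followed immediately by a trail unit
      if (I = High(S)) or not IsTrailUnit(S[I + 1]) then
        Exit(False);
      Inc(I, 2);
    end
    else if IsTrailUnit(S[I]) then
      Exit(False) // trail unit without a preceding lead unit
    else
      Inc(I);
  end;
  Result := True;
end;

begin
  Writeln(IsWellFormedUtf16('abc'));               // True: no surrogates at all
  Writeln(IsWellFormedUtf16('a' + #$D83D#$DE00));  // True: one complete pair
  Writeln(IsWellFormedUtf16('a' + #$D83D));        // False: lone lead unit
  Writeln(IsWellFormedUtf16(#$DE00 + 'a'));        // False: lone trail unit
end.
```

This only checks pairing, which is the part a per-character test cannot see; it says nothing about whether the code points are assigned or printable.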

Edited by Arnaud Bouchez


I tested yesterday. I see that #127..#160 contains some valid characters that also exist in our DB...

I will have another look..
3 hours ago, borni69 said:

I tested yesterday. I see that #127..#160 contains some valid characters that also exist in our DB...

I will have another look..

What encoding is utilized to store your data within your DB?  

 

UTF-8 is typically used on any web front end, your Delphi tooling likely prefers UTF-16, while your DB may be set to the system default ANSI code page.  You need to clearly define a proper strategy to obtain data (from each different source that feeds your system), properly transfer every byte received, and then store/search the textual data in a consistent format.

 

My assumption is that you are accepting defaults on inputs/transfers/storage and there are improper conversions going on.  

 

 


Some more information

We store everything in MySQL as UTF-8.

We also use UTF-8 in our WebBroker app.

The system runs on Linux mod files connecting to the DB with FireDAC; all reading and writing of text is done via AsWideString.

The problem is this:

Sometimes we get some character from the client that can cause some json issue in Angular when it is sent back to the client. This is what we are trying to fix.

Out of 600,000 records we have approximately 1 problem...

We could of course just fix the problem we have today with this character #11, but we are trying to fix it for other characters that may come later.

The problem appears when a client copies and pastes text from another system, like Word..

So now we are trying to figure out which characters to remove before we save text to the DB.

Not sure if this makes it clearer..

B

Off topic.

@Darian Miller Would you consider writing a blog post or an article to clarify as much as you can about strings in Delphi?

Many Delphi developers confuse the string types and, on top of that, misuse WideString without realizing it.

Such an article is overdue, and yet there is none. The Embarcadero documentation is missing many things or does not focus on what really matters. This is just a suggestion to consider, as I remember you saying that you have the time, the patience, and of course the right knowledge to write such an article as reference documentation for everyone's benefit.
57 minutes ago, Kas Ob. said:

Off topic.

@Darian Miller Would you consider writing a blog post or an article to clarify as much as you can about strings in Delphi?

Many Delphi developers confuse the string types and, on top of that, misuse WideString without realizing it.

Such an article is overdue, and yet there is none. The Embarcadero documentation is missing many things or does not focus on what really matters. This is just a suggestion to consider, as I remember you saying that you have the time, the patience, and of course the right knowledge to write such an article as reference documentation for everyone's benefit.

 

Thanks, but I'm not the right guy for that as I still want my box drawing characters above 127 for cool looking console menus.  Besides, there are some references already, including:

 

Cary Jensen:  https://www.embarcadero.com/images/dm/technical-papers/delphi-unicode-migration.pdf

Marco Cantu:  https://www.embarcadero.com/images/dm/technical-papers/delphi-and-unicode-marco-cantu.pdf

Nick Hodge: https://www.embarcadero.com/images/dm/technical-papers/delphi-in-a-unicode-world-updated.pdf

There's also a help reference: http://docwiki.embarcadero.com/RADStudio/en/Unicode_in_RAD_Studio

 

I certainly agree that interop can be messy when dealing with third parties.  Hopefully the world eventually says "Everything is now going to be UTF-8" but that will never happen.  (Even if it did, there's a lot on disk that has to be converted.)

 

I started a TextEncodingDetector unit and did think about putting it on GitHub and doing an article on it, but I'm just not the expert.  A public lashing by the real experts would be a learning experience for me I suppose.  If you are dealing with a stream of bytes without a BOM or external encoding reference, and need to figure out if it's UTF-8, UTF-16 (LE/BE), or ANSI, then it simply isn't going to be 100% foolproof and the code isn't pretty.  (At least mine is not.)

 

I think a TextEncodingDetector unit by someone like Arnaud Bouchez, François Piette, David Heffernan, Remy Lebeau, Andreas Rejbrand, etc. would be quite interesting to discuss.  (Arnaud probably already has one.)  
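
One small building block for a detector of this kind is a strict UTF-8 well-formedness check: if a BOM-less buffer contains bytes above $7F and still passes, UTF-8 is a very plausible guess, because ANSI text with accented characters rarely forms valid UTF-8 sequences by accident. The sketch below follows the byte ranges from RFC 3629; IsValidUtf8 is a hypothetical name, and a pass proves nothing on its own (pure ASCII passes too).

```pascal
program Utf8Check;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: True when Bytes is well-formed UTF-8 per RFC 3629
function IsValidUtf8(const Bytes: TBytes): Boolean;
var
  I, K, N, Need: Integer;
  Lo, Hi: Byte;
begin
  I := 0;
  N := Length(Bytes);
  while I < N do
  begin
    // Allowed range for the first continuation byte; narrowed below
    Lo := $80;
    Hi := $BF;
    case Bytes[I] of
      $00..$7F: Need := 0;
      $C2..$DF: Need := 1;
      $E0:
        begin
          Need := 2;
          Lo := $A0; // rejects overlong 3-byte forms
        end;
      $E1..$EC, $EE..$EF: Need := 2;
      $ED:
        begin
          Need := 2;
          Hi := $9F; // rejects encoded surrogates U+D800..U+DFFF
        end;
      $F0:
        begin
          Need := 3;
          Lo := $90; // rejects overlong 4-byte forms
        end;
      $F1..$F3: Need := 3;
      $F4:
        begin
          Need := 3;
          Hi := $8F; // rejects values above U+10FFFF
        end;
    else
      Exit(False); // $80..$C1 and $F5..$FF can never start a sequence
    end;
    // The sequence must not run past the end of the buffer
    if I + Need > N - 1 then
      Exit(False);
    if Need > 0 then
    begin
      if (Bytes[I + 1] < Lo) or (Bytes[I + 1] > Hi) then
        Exit(False);
      // Remaining continuation bytes must look like 10xxxxxx
      for K := 2 to Need do
        if (Bytes[I + K] and $C0) <> $80 then
          Exit(False);
    end;
    Inc(I, Need + 1);
  end;
  Result := True;
end;

begin
  Writeln(IsValidUtf8(TBytes.Create($C3, $A9)));           // True: U+00E9
  Writeln(IsValidUtf8(TBytes.Create($F0, $9F, $98, $80))); // True: U+1F600
  Writeln(IsValidUtf8(TBytes.Create($E9)));                // False: ANSI-style byte
  Writeln(IsValidUtf8(TBytes.Create($C0, $80)));           // False: overlong form
  Writeln(IsValidUtf8(TBytes.Create($E2, $82)));           // False: truncated
end.
```

Telling UTF-16 from ANSI without a BOM needs further heuristics (zero-byte patterns, statistics), which is where it stops being pretty.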

 

 

 

1 hour ago, Kas Ob. said:

Off topic.

@Darian Miller Would you consider writing a blog post or an article to clarify as much as you can about strings in Delphi?

Many Delphi developers confuse the string types and, on top of that, misuse WideString without realizing it.

Such an article is overdue, and yet there is none. The Embarcadero documentation is missing many things or does not focus on what really matters. This is just a suggestion to consider, as I remember you saying that you have the time, the patience, and of course the right knowledge to write such an article as reference documentation for everyone's benefit.

 

Just saw this recent blog post today from LinkedIn:

https://developpeur-pascal.fr/p/_500e-les-pieges-de-l-encodage-lors-de-l-ouverture-de-fichiers-textes.html

 

It simply goes to show that there isn't a really clean solution.  I don't particularly agree with this author's suggestion.

 


I think there is no truly clean solution in the sense of one solution that fits all, so I don't agree either.

But every use case has a clean, or at least better, approach, and it is up to the developer to choose the right tool and the right approach. For example, I can't remember if I ever used TFile!

But to write the right code, developers need a good understanding of strings and chars, or at least good documentation or articles that discuss these things and clarify every detail, for novices and experts alike. After all, in this era of the Internet (with or without Google) you don't need to remember everything or have a printed paper in your hand.
8 hours ago, Arnaud Bouchez said:

I don't know where #127..#160 comes from.

Agreed, that is odd to allow that range but not allow 161..255 as well.

8 hours ago, Arnaud Bouchez said:

It is a valid set of chars, e.g. in Europe for accented characters like é à â.

Certainly in Unicode strings, characters in the 0..255 range are well-defined.  But in ANSI/MBCS strings, characters in the 128..255 (Extended ASCII) range are locale-specific.  The numeric values of characters in that range vary between what Unicode defines and what different locales/charsets define.  Most locales/charsets agree with Unicode only for characters in the 0..127 (ASCII) range, then they define whatever they need for characters 128..255.  Beyond character 255, you need Unicode instead.

8 hours ago, Arnaud Bouchez said:

You are confusing encodings. A Delphi string is UTF-16 encoded, so #127..#160 are valid UTF-16 characters.

Unicode codepoints #127..#160 (U+007F..U+00A0) are certainly valid in UTF-16, yes.  But locale-based characters in the #128..#255 ($80..$FF) range in ANSI/MBCS strings usually map to different numeric values as Unicode codepoints.
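
To make the locale dependence concrete, here is a small sketch that decodes the same single byte under two ANSI code pages and prints the resulting Unicode code point. DecodeByte is a made-up helper; TEncoding.GetEncoding is the RTL call, and whether a given code page is available on non-Windows targets is an assumption you would need to verify.

```pascal
program AnsiByteMeaning;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: decodes one byte with the given ANSI code page and
// returns the resulting UTF-16 code unit, or -1 if nothing was decoded
function DecodeByte(CodePage: Integer; B: Byte): Integer;
var
  Enc: TEncoding;
  S: string;
begin
  Enc := TEncoding.GetEncoding(CodePage); // caller owns the returned instance
  try
    S := Enc.GetString(TBytes.Create(B));
    if S = '' then
      Exit(-1);
    Result := Ord(S[Low(S)]);
  finally
    Enc.Free;
  end;
end;

begin
  // Windows-1252 defines $E9 as U+00E9 (e with acute accent)
  Writeln(IntToHex(DecodeByte(1252, $E9), 4));
  // Windows-1251 defines the same byte as U+0439 (Cyrillic short i)
  Writeln(IntToHex(DecodeByte(1251, $E9), 4));
  // Windows-1252 defines $80 as U+20AC (euro sign), far outside 0..255
  Writeln(IntToHex(DecodeByte(1252, $80), 4));
end.
```

So a filter written against byte values in an ANSI string and a filter written against Char values in a Delphi string are not testing the same thing.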

3 hours ago, borni69 said:

We store everything in MySQL as UTF-8.

We also use UTF-8 in our WebBroker app.

The system runs on Linux mod files connecting to the DB with FireDAC; all reading and writing of text is done via AsWideString.

The problem is this:

Sometimes we get some character from the client that can cause some json issue in Angular when it is sent back to the client. This is what we are trying to fix.

Then it seems to me that you are approaching this from the wrong angle.  If the frontend is UTF-8, then it doesn't matter what the client enters, it will get transmitted as UTF-8, and stored as-is in the DB as UTF-8, and converted between UTF-8<->UTF-16 where needed.  Both UTFs handle the entire Unicode repertoire without any data loss.  So your issue has to be somewhere else.  Either you are not processing your JSON correctly, or you are not sending the JSON back to the client correctly.

3 hours ago, borni69 said:

We could of course just fix the problem we have today with this character #11, but we are trying to fix it for other characters that may come later.

You really should not be filtering out ANY characters at all, especially since UTFs and JSON support ALL Unicode characters.  So I suggest you take some more time to really debug the issue deeper and find out exactly where the REAL failure point is, because it is likely not what you think it is.

3 hours ago, borni69 said:

The problem appears when a client copies and pastes text from another system, like Word..

Which is perfectly fine, if the UI frontend and communication/DB backends are all using UTF-8 properly.  The issue has to be somewhere else.

3 hours ago, borni69 said:

So now we are trying to figure out which characters to remove before we save text to the DB.

Don't remove any characters at all.
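
One detail worth keeping in mind while debugging: JSON can carry every Unicode character, but RFC 8259 requires the control characters U+0000..U+001F to be escaped inside strings, so a raw #11 in the payload is not valid JSON while \u000B is. Below is a minimal sketch of that escaping rule. JsonQuote is a hand-written, made-up helper for illustration only; in real code the JSON library should do this, and comparing its output against the rule is a quick way to see which side is at fault.

```pascal
program JsonEscapeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper, not part of the RTL: returns S as a quoted JSON
// string literal with the escapes that RFC 8259 makes mandatory
function JsonQuote(const S: string): string;
var
  C: Char;
begin
  Result := '"';
  for C in S do
    case C of
      '"': Result := Result + '\"';
      '\': Result := Result + '\\';
      #8:  Result := Result + '\b';
      #9:  Result := Result + '\t';
      #10: Result := Result + '\n';
      #12: Result := Result + '\f';
      #13: Result := Result + '\r';
      // Remaining control characters have no short form: use \uXXXX
      #0..#7, #11, #14..#31:
        Result := Result + '\u' + IntToHex(Ord(C), 4);
    else
      Result := Result + C; // everything else may appear unescaped
    end;
  Result := Result + '"';
end;

begin
  // The vertical tab (#11) survives as an escape instead of being removed
  Writeln(JsonQuote('a' + #11 + 'b')); // prints "a\u000Bb"
end.
```

A conforming parser on the client turns \u000B back into the original character, so no data has to be dropped on the way.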

 

5 hours ago, borni69 said:

Sometimes we get some character from the client that can cause some json issue in Angular when it is sent back to the client. This is what we are trying to fix.

If you could define what "some json issue" is then we (and you) would have a chance of actually solving this problem. Right now we're all just guessing.


Hi,

some json issue is:

If the frontend gets a byte like 0b (#11) in JSON, it crashes..

When we build the JSON in Delphi before it is sent to the server, we use TJSONString.Create(delphi widestring).

Maybe the issue is the frontend not handling the JSON correctly, or TJSONString.Create in Delphi not handling some control characters..

But now I am considering, as recommended in an earlier thread:

if TCharacter.IsControl(ch) then accept only #9, #10 and #13, and block all other control characters.

All other characters will be sent out..

B
// Assumed declarations: str, outtext: string; ch: Char; valid: Boolean
outtext := '';
str := memo1.Lines.Text;
for ch in str do
begin
  // Keep every non-control character; of the control characters,
  // keep only tab (#9), line feed (#10) and carriage return (#13).
  valid := (not TCharacter.IsControl(ch)) or CharInSet(ch, [#9, #10, #13]);
  if valid then
    outtext := outtext + ch;
end;

memo2.Lines.Text := outtext;

For me this test code seems to do the job, and now all needed characters are sent out..

I will test it a little more.

Thanks..

There's a lot of noise in here. It seems you don't really understand where these characters are coming from and are in trial and error programming mode.

 

The advice from the wise heads here is to understand what is going on, and then work out how to tackle it. 

 

You don't seem to want to heed that advice. That's fine, it's your choice.  But we don't need a blow by blow account of your trial and error coding. That's only meaningful to you. 

7 hours ago, David Heffernan said:

[...] and are in trial and error programming mode. [...]

I didn't know that there is an actual expression for this. Sounds familiar though, I guess I did it lots of times as well.

1 hour ago, aehimself said:

I didn't know that there is an actual expression for this. Sounds familiar though, I guess I did it lots of times as well.

We've all done it. It never works out. 

On 9/23/2020 at 8:21 PM, borni69 said:

how can I remove non-utf8 characters from a utf8 string

Despite the name of this topic I cannot remember seeing any of your code which actually references a UTF-8 string.  If you want an answer to the original question, the 'lazy programmer' solution (and so occasionally mine) is to pass the UTF-8 text to the Windows API MultiByteToWideChar function with source code page 65001 (UTF-8).  With the normal options, any invalid UTF-8 sequences should be returned as U+FFFD in the returned Unicode string, and you can walk the result and drop them.  This is not the 'proper' way to approach the problem, but it should work.
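
A minimal sketch of that approach, assuming a Windows target (the asker's Linux modules would need a different decoder) and Windows Vista or later, where MultiByteToWideChar called without MB_ERR_INVALID_CHARS replaces illegal sequences with U+FFFD. DecodeUtf8DroppingInvalid is a made-up name.

```pascal
program DropInvalidUtf8;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

// Hypothetical helper: decodes UTF-8 bytes and drops whatever the API
// could not decode. Windows only.
function DecodeUtf8DroppingInvalid(const Bytes: TBytes): string;
var
  Len: Integer;
begin
  Result := '';
  if Length(Bytes) = 0 then
    Exit;
  // First call: ask how many UTF-16 code units the result needs
  Len := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(@Bytes[0]),
    Length(Bytes), nil, 0);
  if Len = 0 then
    RaiseLastOSError;
  SetLength(Result, Len);
  // Second call: do the conversion into the allocated string
  if MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(@Bytes[0]),
    Length(Bytes), PWideChar(Result), Len) = 0 then
    RaiseLastOSError;
  // Invalid input arrives as U+FFFD; remove every occurrence
  Result := StringReplace(Result, #$FFFD, '', [rfReplaceAll]);
end;

begin
  // $FF can never appear in well-formed UTF-8, so only 'A' and 'B' remain
  Writeln(DecodeUtf8DroppingInvalid(TBytes.Create($41, $FF, $42)));
end.
```

Note that this also removes any U+FFFD that was genuinely present in the input, which is one reason it is the lazy rather than the proper solution.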

13 minutes ago, timfrost said:

With the normal options, any invalid UTF-8 sequences should be returned as U+FFFD in the returned Unicode string, and you can walk the result and drop them.

Asker seems to want to remove certain valid UTF8 sequences..... So this won't help.

 

Nobody can help with no clear spec. 

On 9/24/2020 at 4:10 AM, borni69 said:

I get this result...

11 - 11 - 69 - 45 - 75 - 32 - 98 - 108 - 101 -

I guess I could remove the 11, but will all non-UTF-8 characters be 11?

No.  It's just a byte like any other single-byte character.  In ASCII and UTF-8 this single byte represents a Vertical Tab, a control character rarely used today and originally used for advancing the print head when printing.  This special character has seen other uses and this StackOverflow question has some history: https://stackoverflow.com/questions/3380538/what-is-a-vertical-tab   

 

Apparently this is used as an alternate line feed character in various systems.  So, depending on where your data is coming from, you may want to replace each character #11 that you come across with a CRLF pair.  (Look at your data and see if that decision makes any sense.)

 

memo2.Lines.Text := StringReplace(memo1.lines.text, #11, sLineBreak, [rfReplaceAll]);

Not sure when sLineBreak was introduced...if you are using an old version of Delphi:

memo2.Lines.Text := StringReplace(memo1.lines.text, #11, #13#10, [rfReplaceAll]);

 

Curious... are you getting other odd characters, or is it just #11?  (0b)

 

 

On 9/26/2020 at 1:02 AM, borni69 said:

some json issue is:

If the frontend gets a byte like 0b (#11) in JSON, it crashes..

What kind of crash exactly? This implies to me that you are not creating your JSON properly to begin with.

On 9/26/2020 at 1:02 AM, borni69 said:

When we build the JSON in Delphi before it is sent to the server, we use TJSONString.Create(delphi widestring).

Can you please show your actual code that is taking in user input and producing JSON from it?

On 9/26/2020 at 1:02 AM, borni69 said:

Maybe the issue is the frontend not handling the JSON correctly, or TJSONString.Create in Delphi not handling some control characters..

There is no way for us to tell you that, because you have not shown what you are actually doing yet.


 

Thanks all for helping.

We have now fixed the problem.

Sorry if I confused you guys at the start by not being clear about the problem. I did not understand it myself, and started at the wrong end.

I learned a lot about Unicode / UTF-8 over the last few days.

The problem was that the Angular client app did not handle a few control characters in JSON. They were sent correctly from Delphi in TJSONString; the characters were #11, #3 and #5.

We looped through all fields in the database and found a few places with these characters; none of them worked on the client side.

After a data review, we removed them with a script in the DB.

All is working fine now.

The DB is about 20 GB and there were a total of 355 fields with a wrong character, so not that many.

All are old registrations, more than 2 years old.

We will check for these characters when new registrations occur in the future.

The Angular team will also check why these characters make the client crash.

Thanks

B

Edited by borni69

I would think the problem is that there is malformed data (an illegal byte sequence), and it should be fixed to have valid UTF-8 encoding? Maybe?

We had recently such a situation in Delphi IDE when there was some garbage and IDE did not want to open it and complained about invalid unicode/UTF character or something like that.

But how to do it, don't know.

 

-Tee-

14 minutes ago, Tommi Prami said:

I would think [...]? Maybe?
[...] or something like that.
But [...], don't know.

Are you sure?
