borni69 1 Posted September 23, 2020
Hi, how can I remove non-UTF-8 characters from a UTF-8 string? Any examples?
Thanks, B
FPiette 383 Posted September 23, 2020
Do you mean just remove them, or replace them with something else?
David Heffernan 2345 Posted September 23, 2020
Define what you mean by a non-UTF-8 character, remembering that this is a variable-width encoding.
Lars Fosdal 1792 Posted September 24, 2020
Aren't single bytes in the #128-#255 range invalid in UTF-8 unless they are part of a multi-byte sequence? Are there, by definition, UTF-8 sequences that are invalid?
Uwe Raabe 2057 Posted September 24, 2020
36 minutes ago, Lars Fosdal said: Are there, by definition, UTF-8 sequences that are invalid?
Yes, there are. One can use this fact to distinguish between UTF-8 and ANSI encoding. An ANSI-encoded text file can throw an exception when read as UTF-8 if it contains certain characters. Multi-byte UTF-8 sequences start with a lead byte from one specific range, followed by one to three continuation bytes from a different range. Simply exchanging the first two bytes of such a sequence invalidates the file.
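As a rough illustration of the byte ranges described above, here is a minimal Delphi sketch of a well-formedness check. `IsValidUtf8` is a hypothetical helper written for this example, not an RTL function. It assumes the data is already available as a `TBytes` and follows the lead-byte and continuation-byte ranges from the Unicode standard.

```pascal
program Utf8Check;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not part of the RTL): returns True when B contains
// only well-formed UTF-8, i.e. no stray continuation bytes, no truncated
// sequences, no overlong forms, no surrogates, nothing above U+10FFFF.
function IsValidUtf8(const B: TBytes): Boolean;
var
  I, K, Need: Integer;
  MinCont, MaxCont: Byte;
begin
  Result := False;
  I := 0;
  while I < Length(B) do
  begin
    // Allowed range for the FIRST continuation byte; some lead bytes narrow it.
    MinCont := $80;
    MaxCont := $BF;
    case B[I] of
      $00..$7F: Need := 0;                               // one-byte character, e.g. $0B
      $C2..$DF: Need := 1;                               // $C0 and $C1 would be overlong
      $E0:      begin Need := 2; MinCont := $A0; end;    // rejects overlong forms
      $E1..$EC: Need := 2;
      $ED:      begin Need := 2; MaxCont := $9F; end;    // rejects UTF-16 surrogates
      $EE..$EF: Need := 2;
      $F0:      begin Need := 3; MinCont := $90; end;    // rejects overlong forms
      $F1..$F3: Need := 3;
      $F4:      begin Need := 3; MaxCont := $8F; end;    // stays at or below U+10FFFF
    else
      Exit; // $80..$BF used as a lead byte, or $C0, $C1, $F5..$FF
    end;
    if I + Need > High(B) then
      Exit; // sequence is cut off at the end of the buffer
    for K := 1 to Need do
    begin
      if (B[I + K] < MinCont) or (B[I + K] > MaxCont) then
        Exit;
      // Every continuation byte after the first uses the full $80..$BF range.
      MinCont := $80;
      MaxCont := $BF;
    end;
    Inc(I, Need + 1);
  end;
  Result := True;
end;

begin
  Writeln(IsValidUtf8(TBytes.Create($0B, $0B, $45, $2D, $4B))); // control chars plus ASCII
  Writeln(IsValidUtf8(TBytes.Create($C3, $A6)));                // UTF-8 encoding of U+00E6
  Writeln(IsValidUtf8(TBytes.Create($E6)));                     // lone ANSI byte, no continuation
end.
```

With these inputs the first two calls should report TRUE and the last FALSE: a byte such as $0B is a perfectly valid one-byte sequence, whereas a lone byte above $7F is not.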
Guest Posted September 24, 2020
To rephrase Uwe's answer: a UTF-8 character can be anywhere from 1 to 4 bytes long. https://en.wikipedia.org/wiki/UTF-8#Encoding
borni69 1 Posted September 24, 2020 (edited)
E-K ble
The line above has two characters at the start that I would like to remove (see attached image). Not sure if this makes any sense to you...
String : E-K ble
bytes : 0b 0b 45 2d 4b 20 62 6c 65
How can I programmatically remove all characters that are not UTF-8, if it makes sense to say it that way?
If I run this code
  procedure TForm1.Button1Click(Sender: TObject);
  var
    I: Integer;
  begin
    // The literal starts with the two 0b (#11) characters listed above.
    edit1.Text := 'E-K ble';
    edit2.Text := '';
    for I := 1 to Length(edit1.Text) do
    begin
      // Show the ordinal value of each character in turn.
      edit2.Text := edit2.Text + Ord(edit1.Text[I]).ToString + ' - ';
    end;
  end;
I get this result...
11 - 11 - 69 - 45 - 75 - 32 - 98 - 108 - 101 -
I guess I could remove the 11, but will all non-UTF-8 characters be 11?
B
Edited September 24, 2020 by borni69
Guest Posted September 24, 2020
1 hour ago, borni69 said: I guess I could remove the 11, but will all non-UTF-8 characters be 11?
Yes, you are right in this case. To make it clear: per the UTF-8 standard you can safely remove any single byte with a value <= 127 ($7F). All values below 128 represent one-byte characters, so removing one never breaks a neighbouring multi-byte sequence.
David Heffernan 2345 Posted September 24, 2020
The two 0b characters are valid UTF-8 characters. They are this character: https://www.fileformat.info/info/unicode/char/000b/index.htm
What you need to do is decide exactly what your specification is. The question you have asked doesn't seem to match the example you have provided.
borni69 1 Posted September 24, 2020
Thanks, I think I understand what you are saying. After this discussion I am no longer sure what my spec is, but here is our problem.
We have an HTML text box in an Angular web client that people add text to. The text is then posted to the server, which uses a WebBroker ISAPI backend. Sometimes they copy text from Word, email, etc., and then we get characters we don't want. I am not sure what they are. We would like to keep line breaks, tabs, etc., but not the characters that show up as a ?, as seen in the image above.
Not sure if this question makes sense.
B
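One possible reading of that requirement (keep tabs and line breaks, drop the other control characters) can be sketched as follows. The exact set of characters to drop is an assumption made for this example, and both `IsUnwantedControl` and `StripControlChars` are hypothetical names. The code works on a normal Delphi `string`, which is UTF-16, so it is applied after the request body has been decoded.

```pascal
program StripControls;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Assumption for this sketch: "unwanted" means the C0 controls (U+0000..U+001F)
// except tab, LF and CR, plus DEL and the C1 controls (U+007F..U+009F).
function IsUnwantedControl(Ch: Char): Boolean;
begin
  case Ord(Ch) of
    0..8, 11, 12, 14..31: Result := True;   // C0 controls without #9, #10, #13
    127..159:             Result := True;   // DEL and C1 controls
  else
    Result := False;                        // everything else is kept
  end;
end;

// Copies S into Result, skipping the characters flagged above.
function StripControlChars(const S: string): string;
var
  Ch: Char;
  N: Integer;
begin
  SetLength(Result, Length(S));
  N := 0;
  for Ch in S do
    if not IsUnwantedControl(Ch) then
    begin
      Inc(N);
      Result[N] := Ch;
    end;
  SetLength(Result, N);
end;

var
  Input, Output: string;
begin
  // Two vertical tabs (#11) in front, and a CR/LF pair that must survive.
  Input := #11#11'E-K ble'#13#10'next';
  Output := StripControlChars(Input);
  Writeln(Length(Input));   // 15 characters in
  Writeln(Length(Output));  // 13 characters out: only the two #11 are dropped
end.
```

Whether this particular set matches what users actually paste from Word or email is something only real samples can confirm, so the ranges in `IsUnwantedControl` are the part to adjust.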
Anders Melander 1782 Posted September 24, 2020 (edited)
I think you might be confusing UTF-8 and Unicode, because your example is using Unicode strings and not UTF-8 strings.
If what you have is in fact Unicode and you just want to remove non-printable characters, then you can use the TCharacter class:
  // Walk backwards so that deleting a character does not shift the ones still to visit.
  for var i := Length(s) downto 1 do
    if (not TCharacter.IsValid(s[i])) or (TCharacter.IsControl(s[i])) then
      Delete(s, i, 1);
Edited September 24, 2020 by Anders Melander (typo)
borni69 1 Posted September 24, 2020
Thanks. It looks like TCharacter.IsValid(s[i]) is not supported in 10.4.
David Heffernan 2345 Posted September 24, 2020
50 minutes ago, borni69 said: Sometimes they copy text from Word, email, etc., and then we get characters we don't want. I am not sure what they are. We would like to keep line breaks, tabs, etc., but not the characters that show up as a ?, as seen in the image above.
Your problem is not how to remove characters; it is to work out which characters are to be removed. As is so often the case, the hardest part of any programming task is determining the correct specification.
borni69 1 Posted September 24, 2020 (edited)
I agree with you. After the back-and-forth discussion in this thread I think I have a solution: I will only send out characters and commands that are handled by TJSONString.Create(). So I think my code will be something like this.
PS: there is 0b 0b before E-K in rawText.
  rawText := 'E-K ble æøå Test // 98';
  ajson := TJSONObject.Create;
  try
    ajson.AddPair('text', TJSONString.Create(rawText));
    RawtextOut := ajson.ToString;
  finally
    ajson.Free;
  end;
  // Keep only characters from space (#32) upwards.
  textOut := '';
  for ch in RawtextOut do
  begin
    if (ch >= #32) then
      textOut := textOut + ch;
  end;
  memo1.Lines.Text := textOut;
Result:
  {"text":"E-K ble æøå Test \/\/ 98"}
The characters I don't want are removed. Line breaks and tabs will also be handled correctly (\r\n).
Thanks for all your help.
B
Edited September 24, 2020 by borni69
Anders Melander 1782 Posted September 24, 2020
14 minutes ago, borni69 said: not supported in 10.4
No. Seems like it's gone. Strange.
David Heffernan 2345 Posted September 24, 2020
25 minutes ago, borni69 said: [...] for ch in RawtextOut do begin if (ch >= #32) then textOut := textOut+ch end; [...] characters I don't want are removed... also line breaks and tabs will be handled correctly.
Better hope that there are no line breaks...
borni69 1 Posted September 24, 2020
Line breaks will work, because:
1) TJSONString.Create('') first changes them to \r\n, so they are now valid.
2) Only then are the unwanted characters removed.
Bernt
Lars Fosdal 1792 Posted September 24, 2020
1 hour ago, Anders Melander said: No. Seems like it's gone. Strange.
It's not in Rio either.
Anders Melander 1782 Posted September 24, 2020
1 minute ago, Lars Fosdal said: It's not in Rio either.
Ah. I didn't check the source. I just assumed the documentation was correct. Silly me.
Bill Meyer 337 Posted September 24, 2020
2 hours ago, Lars Fosdal said: It's not in Rio either.
Nor Tokyo. Anyone know when it appeared?
Remy Lebeau 1394 Posted September 24, 2020
4 hours ago, borni69 said: I will only send out characters and commands that are handled by TJSONString.Create()
That makes no sense, as JSON handles the entire Unicode repertoire. ANY Unicode character can appear in JSON. ALL Unicode characters except for the first 32 ASCII control characters (U+0000..U+001F), " (U+0022), and \ (U+005C) can appear in an unescaped form. ALL Unicode characters can appear in an escaped form.
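To make the escaping rule concrete, here is a small hand-written sketch. `JsonEscape` is a hypothetical helper for illustration only; it is not how System.JSON is implemented, and real code would normally rely on the JSON library instead. It escapes exactly the characters that the JSON specification requires to be escaped and passes everything else through.

```pascal
program JsonEscapeSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: escapes only what JSON requires inside a string value,
// namely U+0000..U+001F, the quotation mark and the backslash.
function JsonEscape(const S: string): string;
var
  Ch: Char;
begin
  Result := '';
  for Ch in S do
    case Ch of
      #0..#31: Result := Result + '\u' + IntToHex(Ord(Ch), 4); // control characters
      '"':     Result := Result + '\"';
      '\':     Result := Result + '\\';
    else
      Result := Result + Ch; // all other characters may appear unescaped
    end;
end;

begin
  // A vertical tab (#11) followed by ordinary text containing quotes.
  Writeln(JsonEscape(#11'E-K "ble"'));
end.
```

The point of the sketch is that a control character such as #11 is not rejected or dropped by JSON encoding; it is carried through in escaped form, so JSON encoding on its own does not filter anything out.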
borni69 1 Posted September 24, 2020
Will do it like this...
  function StripCharsInSet(s: string; c: CharSet): string;
  var
    i, j: Integer;
  begin
    SetLength(Result, Length(s));
    j := 0;
    for i := 1 to Length(s) do
      if not (s[i] in c) then
      begin
        Inc(j);
        Result[j] := s[i];
      end;
    SetLength(Result, j);
  end;

  begin
    // test
    memo2.Lines.Text := StripCharsInSet(memo1.Lines.Text, [#0..#9, #11, #12, #14..#31, #127..#160]);
    ...
Found the snippet on this page:
https://stackoverflow.com/questions/5650532/delphi-strip-out-all-non-standard-text-characers-from-string
https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec
B
Arnaud Bouchez 407 Posted September 25, 2020
I don't know where #127..#160 comes from. It is a valid set of chars, e.g. in Europe for accented characters like é à â.
You are confusing encodings. A Delphi string is UTF-16 encoded, so #127..#160 are valid UTF-16 characters.
What you call a "character" is confusing. #11 is a valid character in terms of both UTF-8 and UTF-16, as David wrote.