Remove non-utf8 characters from a utf8 string

borni69 · September 23, 2020

Hi,

How can I remove non-UTF-8 characters from a UTF-8 string?

Any examples?

Thanks

B

FPiette · September 23, 2020

Do you mean just remove them, or replace them with something else?

borni69 · September 23, 2020

Just remove 🙂

David Heffernan · September 23, 2020

Define what you mean by a non UTF8 character, remembering that this is a variable width encoding.

Lars Fosdal · September 24, 2020

Aren't single characters in the #128-#255 range invalid in UTF-8 unless they are part of a multi-byte sequence?

Are there, by definition, UTF-8 sequences that are invalid?

Uwe Raabe · September 24, 2020

36 minutes ago, Lars Fosdal said:

Are there, by definition, UTF-8 sequences that are invalid?

Yes, there are. One can use this fact to distinguish between UTF8 and ANSI encoding. An ANSI encoded text file can throw an exception when read as UTF8 if it contains certain characters. UTF8 character sequences start with a specific byte range followed by one to three bytes from a different range. Simply exchanging the first two bytes of a UTF8 sequence invalidates the file.
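As an illustration of the lead-byte and continuation-byte ranges described above, here is a minimal sketch of a well-formedness check over raw bytes. IsValidUtf8 is a hypothetical helper written for this example, not an RTL function, and it assumes the data is available as TBytes.

```pascal
program Utf8ValidateSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not an RTL function): True when B is well-formed UTF-8.
function IsValidUtf8(const B: TBytes): Boolean;
var
  I, K, Need: Integer;
  Lead: Byte;
  CodePoint, MinValue: Cardinal;
begin
  Result := False;
  I := 0;
  while I < Length(B) do
  begin
    Lead := B[I];
    // The lead byte decides how many continuation bytes must follow.
    if Lead < $80 then
    begin
      Need := 0; CodePoint := Lead; MinValue := 0;
    end
    else if (Lead and $E0) = $C0 then
    begin
      Need := 1; CodePoint := Lead and $1F; MinValue := $80;
    end
    else if (Lead and $F0) = $E0 then
    begin
      Need := 2; CodePoint := Lead and $0F; MinValue := $800;
    end
    else if (Lead and $F8) = $F0 then
    begin
      Need := 3; CodePoint := Lead and $07; MinValue := $10000;
    end
    else
      Exit; // stray continuation byte or a byte that can never start a sequence

    if I + Need >= Length(B) then
      Exit; // sequence is cut off at the end of the buffer

    for K := 1 to Need do
    begin
      if (B[I + K] and $C0) <> $80 then
        Exit; // continuation bytes must look like 10xxxxxx
      CodePoint := (CodePoint shl 6) or (B[I + K] and $3F);
    end;

    // Reject overlong forms, surrogates and values beyond U+10FFFF.
    if (CodePoint < MinValue) or (CodePoint > $10FFFF) or
       ((CodePoint >= $D800) and (CodePoint <= $DFFF)) then
      Exit;

    Inc(I, Need + 1);
  end;
  Result := True;
end;

begin
  // Well-formed: every byte below $80 is a complete one-byte character.
  Writeln(IsValidUtf8(TBytes.Create($0B, $0B, $45, $2D, $4B)));
  // Well-formed: C3 A9 is a two-byte sequence.
  Writeln(IsValidUtf8(TBytes.Create($C3, $A9)));
  // Malformed: the same two bytes exchanged.
  Writeln(IsValidUtf8(TBytes.Create($A9, $C3)));
end.
```

The last call shows the point about exchanging the first two bytes of a sequence: A9 cannot start a sequence, so the check fails immediately.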

Guest · September 24, 2020

To rephrase Uwe's answer:

A UTF-8 character can be anywhere from 1 to 4 bytes long.

https://en.wikipedia.org/wiki/UTF-8#Encoding

borni69 · September 24, 2020

E-K ble

The line above has two characters at the start that I would like to remove.

See attached image.

Not sure if this makes any sense to you...

String : E-K ble

bytes : 0b 0b 45 2d 4b 20 62 6c 65

How can I programmatically remove all characters that are not UTF-8, if it makes sense to put it that way?

If I run this code

procedure TForm1.Button1Click(Sender: TObject);
var
  I: Integer;
begin
  edit1.Text := 'E-K ble';
  edit2.Text := '';
  // Append the ordinal value of each character of edit1.Text.
  for I := 1 to Length(edit1.Text) do
    edit2.Text := edit2.Text + (Ord(edit1.Text[I])).ToString + ' - ';
end;

I get this result...

11 - 11 - 69 - 45 - 75 - 32 - 98 - 108 - 101 -

I guess I could remove the 11s, but will all non-UTF-8 characters be 11?

B

Edited September 24, 2020 by borni69

Guest · September 24, 2020

1 hour ago, borni69 said:

I guess I could remove the 11s, but will all non-UTF-8 characters be 11?

Yes, you are right in this case.

To make it clear: per the UTF-8 standard, any byte with a value <= 127 ($7F) is a complete one-byte character. Such a byte never occurs inside a multi-byte sequence, so removing it cannot corrupt the characters around it.
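A minimal sketch of that idea, using the bytes from the post above. RemoveAsciiByte is a hypothetical helper name, and the code assumes the text is held as raw UTF-8 in a TBytes.

```pascal
program RemoveAsciiByteSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns a copy of B without any byte equal to Value.
// Only values below $80 are accepted, because only those are guaranteed to be
// complete one-byte characters in UTF-8.
function RemoveAsciiByte(const B: TBytes; Value: Byte): TBytes;
var
  I, Count: Integer;
begin
  Assert(Value < $80, 'only single-byte (ASCII) values are safe to drop');
  SetLength(Result, Length(B));
  Count := 0;
  for I := 0 to High(B) do
    if B[I] <> Value then
    begin
      Result[Count] := B[I];
      Inc(Count);
    end;
  SetLength(Result, Count);
end;

var
  Raw, Cleaned: TBytes;
begin
  // The bytes quoted in the thread: 0b 0b 45 2d 4b 20 62 6c 65
  Raw := TBytes.Create($0B, $0B, $45, $2D, $4B, $20, $62, $6C, $65);
  Cleaned := RemoveAsciiByte(Raw, $0B);
  Writeln(TEncoding.UTF8.GetString(Cleaned));
end.
```

This only shows that dropping such a byte is safe at the encoding level. Whether U+000B should be dropped at all is a separate decision.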

David Heffernan · September 24, 2020

The two 0b characters are valid UTF-8 characters. They are this character: https://www.fileformat.info/info/unicode/char/000b/index.htm

What you need to do is decide exactly what your specification is. The question you have asked doesn't seem to match the example you have provided.

borni69 · September 24, 2020

Thanks, I think I understand what you are saying.

And after this discussion I am not sure what my spec is.

But our problem is this:

We have an HTML text box on an Angular web client that people add text to. After the text is added, it is posted to the server, which uses a WebBroker ISAPI backend.

Sometimes they copy text from Word, email, etc., and then we get characters we don't want. I am not sure what they are. We would like to keep line breaks, tabs, etc., but not this character, which shows as a ? or a as seen in the image above.

Not sure if this question makes sense.

b

Anders Melander · September 24, 2020

I think you might be confusing UTF-8 and Unicode, because your example is using Unicode strings and not UTF-8 strings.

If what you have is in fact Unicode and you just want to remove non-printable characters, then you can use the TCharacter class:

// Walk backwards so that deleting a character does not shift the ones still to be checked.
for var i := Length(s) downto 1 do
  if (not TCharacter.IsValid(s[i])) or (TCharacter.IsControl(s[i])) then
    Delete(s, i, 1);
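A minimal sketch of the same idea that relies only on the Char helper's IsControl from System.Character. StripControlChars is a hypothetical name, and keeping tab, line feed and carriage return is an assumption made for this example.

```pascal
program StripControlsSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Character;

// Hypothetical helper: removes Unicode control characters from S, but keeps
// tab (#9), line feed (#10) and carriage return (#13). That exception list is
// an assumption for this sketch; adjust it to the actual specification.
function StripControlChars(const S: string): string;
var
  I, J: Integer;
  C: Char;
begin
  SetLength(Result, Length(S));
  J := 0;
  for I := 1 to Length(S) do
  begin
    C := S[I];
    if C.IsControl and (C <> #9) and (C <> #10) and (C <> #13) then
      Continue; // skip unwanted control character
    Inc(J);
    Result[J] := C;
  end;
  SetLength(Result, J);
end;

begin
  // Two U+000B characters in front of the text, as in the example string.
  Writeln(StripControlChars(#11#11'E-K ble'));
end.
```

Building the result in one pass avoids calling Delete repeatedly, which copies the tail of the string each time.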

Edited September 24, 2020 by Anders Melander
typo

borni69 · September 24, 2020

Thanks. It looks like

TCharacter.IsValid(s[i])

is not supported in 10.4.

David Heffernan · September 24, 2020

50 minutes ago, borni69 said:

Sometimes they copy text from Word, email, etc., and then we get characters we don't want. I am not sure what they are. We would like to keep line breaks, tabs, etc., but not this character, which shows as a ? or a as seen in the image above.

Your problem is not how to remove characters; it is working out which characters are to be removed. As is so often the case, the hardest part of any programming task is determining the correct specification.

borni69 · September 24, 2020

I agree with you...

And after the back-and-forth discussion in this thread, I think I have a solution.

I will only send out characters and commands that are handled by TJSONString.Create().

So I think my code will be something like this.

PS: there is 0b 0b before E-K in rawText.

rawText := 'E-K ble æøå  Test // 98';
ajson := TJSONObject.Create;
try
  // The TJSONObject takes ownership of the TJSONString added here.
  ajson.AddPair('text', TJSONString.Create(rawText));
  RawtextOut := ajson.ToString;
finally
  ajson.Free;
end;

// Keep only characters from #32 (space) upwards.
textOut := '';
for ch in RawtextOut do
  if ch >= #32 then
    textOut := textOut + ch;

memo1.Lines.Text := textOut;

Result:

{"text":"E-K ble æøå Test \/\/ 98"}

The characters I don't want are removed.

Line breaks and tabs will also be handled correctly (\r\n).

Thanks for all your help.

B

Edited September 24, 2020 by borni69

Anders Melander · September 24, 2020

14 minutes ago, borni69 said:

not supported in 10.4

No. Seems like it's gone. Strange.

David Heffernan · September 24, 2020

25 minutes ago, borni69 said:
I agree with you...

And after the back-and-forth discussion in this thread, I think I have a solution.

I will only send out characters and commands that are handled by TJSONString.Create().

So I think my code will be something like this.

PS: there is 0b 0b before E-K in rawText.

rawText := 'E-K ble æøå  Test // 98';
ajson := TJSONObject.Create;
try
  // The TJSONObject takes ownership of the TJSONString added here.
  ajson.AddPair('text', TJSONString.Create(rawText));
  RawtextOut := ajson.ToString;
finally
  ajson.Free;
end;

// Keep only characters from #32 (space) upwards.
textOut := '';
for ch in RawtextOut do
  if ch >= #32 then
    textOut := textOut + ch;

memo1.Lines.Text := textOut;

Result:

{"text":"E-K ble æøå Test \/\/ 98"}

The characters I don't want are removed.

Line breaks and tabs will also be handled correctly (\r\n).

Thanks for all your help.

B

Better hope that there are no line breaks .....

borni69 · September 24, 2020

Line breaks will work, since:

1) TJSONString.Create('') first changes them to \r\n, and now they are valid.

2) Then the unwanted characters are removed.

Bernt

Lars Fosdal · September 24, 2020

1 hour ago, Anders Melander said:

No. Seems like it's gone. Strange.

It's not in Rio either.

Anders Melander · September 24, 2020

1 minute ago, Lars Fosdal said:

It's not in Rio either.

Ah. I didn't check the source. I just assumed the documentation was correct. Silly me

Bill Meyer · September 24, 2020

2 hours ago, Lars Fosdal said:

It's not in Rio either.

Nor Tokyo. Anyone know when it appeared?

Remy Lebeau · September 24, 2020

4 hours ago, borni69 said:

I will only send out characters and commands that are handled by TJSONString.Create().

That makes no sense, as JSON handles the entire Unicode repertoire. ANY Unicode character can appear in JSON. ALL Unicode characters except for the 32 ASCII control characters (U+0000..U+001F), " (U+0022), and \ (U+005C) can appear in an unescaped form. ALL Unicode characters can appear in an escaped form.
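To make the escaping rule concrete, here is a minimal sketch of a string escaper that follows exactly the three cases listed above. JsonEscape is a hypothetical helper written for this example; it is not how TJSONString is implemented, and it always uses the \uXXXX form for control characters rather than short forms such as \n.

```pascal
program JsonEscapeSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: escapes only what JSON requires to be escaped inside a
// string: the quotation mark, the backslash, and U+0000..U+001F.
// Every other character is copied through unchanged.
function JsonEscape(const S: string): string;
var
  C: Char;
begin
  Result := '';
  for C in S do
    if C = '"' then
      Result := Result + '\"'
    else if C = '\' then
      Result := Result + '\\'
    else if C < #$20 then
      Result := Result + '\u' + IntToHex(Integer(Ord(C)), 4)
    else
      Result := Result + C;
end;

begin
  // U+000B in front of the text is escaped, not removed.
  Writeln(JsonEscape(#11'E-K "ble"'));
end.
```

Escaping keeps the character in the data. A control character such as U+000B survives the round trip through JSON, so JSON encoding by itself does not filter anything out.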

borni69 · September 24, 2020

OK, thanks. I will reconsider.

Bernt

borni69 · September 24, 2020

Will do it like this...

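// Needs System.SysUtils for TSysCharSet and CharInSet.
function StripCharsInSet(const s: string; const c: TSysCharSet): string;
var
  i, j: Integer;
begin
  // Pre-size the result, then shrink it to the number of characters kept.
  SetLength(Result, Length(s));
  j := 0;
  for i := 1 to Length(s) do
    // CharInSet is False for any character above #255, so those are always kept.
    if not CharInSet(s[i], c) then
    begin
      Inc(j);
      Result[j] := s[i];
    end;
  SetLength(Result, j);
end;

begin
  // test
  memo2.Lines.Text := StripCharsInSet(memo1.Lines.Text, [#0..#9, #11, #12, #14..#31, #127..#160]);

...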

Found the snippet on this page

https://stackoverflow.com/questions/5650532/delphi-strip-out-all-non-standard-text-characers-from-string

https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec

B

Arnaud Bouchez · September 25, 2020

I don't know where #127..#160 comes from. It is a valid set of chars, e.g. in Europe for accented characters like é à â.

You are confusing encodings. A Delphi string is UTF-16 encoded, so #127..#160 are valid UTF-16 characters.

What you call a "character" is confusing. #11 is a valid character in terms of both UTF-8 and UTF-16, as David wrote.
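A minimal sketch for inspecting the difference between the UTF-16 code units in a Delphi string and the bytes of the same text in UTF-8. Utf8Hex is a hypothetical helper name. The characters are written as #$.... constants so the example does not depend on the encoding of the source file.

```pascal
program Utf16VsUtf8Sketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: hex dump of the UTF-8 encoding of a Delphi string.
function Utf8Hex(const S: string): string;
var
  B: Byte;
begin
  Result := '';
  for B in TEncoding.UTF8.GetBytes(S) do
    Result := Result + IntToHex(B, 2) + ' ';
  Result := Trim(Result);
end;

var
  S: string;
begin
  S := #$00E9; // U+00E9, LATIN SMALL LETTER E WITH ACUTE
  // One UTF-16 code unit in the Delphi string, two bytes once encoded as UTF-8.
  Writeln('UTF-16 code units: ', Length(S), ', UTF-8 bytes: ', Utf8Hex(S));

  S := #$000B; // U+000B, the character from the example string
  // One code unit in UTF-16 and one byte in UTF-8.
  Writeln('UTF-16 code units: ', Length(S), ', UTF-8 bytes: ', Utf8Hex(S));
end.
```

A set such as [#127..#160] applied to a Delphi string is therefore compared against UTF-16 code units, not against UTF-8 bytes.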
