borni69

Remove non-utf8 characters from a utf8 string

Aren't single bytes in the #128-#255 range invalid in UTF-8 unless they are part of a multi-byte sequence?

Are there, by definition, UTF-8 sequences that are invalid?

36 minutes ago, Lars Fosdal said:

Are there, by definition, UTF-8 sequences that are invalid?

Yes, there are. One can use this fact to distinguish between UTF-8 and ANSI encoding: an ANSI-encoded text file can raise an exception when read as UTF-8 if it contains certain characters. A multi-byte UTF-8 sequence starts with a lead byte from one specific range ($C2..$F4), followed by one to three continuation bytes from a different range ($80..$BF). Simply exchanging the first two bytes of such a sequence invalidates the file.
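To make the lead-byte and continuation-byte rule concrete, here is a minimal hand-written validator. It is only a sketch: the name IsValidUtf8 is made up for this example, and it is not how the Delphi RTL checks UTF-8. It accepts a byte array only if every sequence is well formed.

```pascal
program Utf8Check;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: True when B contains only well-formed UTF-8 sequences.
function IsValidUtf8(const B: TBytes): Boolean;
var
  I, Need: Integer;
  Lo, Hi: Byte;
begin
  I := 0;
  while I < Length(B) do
  begin
    // Lo..Hi is the range allowed for the first continuation byte.
    Lo := $80;
    Hi := $BF;
    case B[I] of
      $00..$7F: Need := 0;                      // one-byte character
      $C2..$DF: Need := 1;                      // two-byte sequence
      $E0: begin Need := 2; Lo := $A0; end;     // rejects overlong forms
      $E1..$EC, $EE..$EF: Need := 2;
      $ED: begin Need := 2; Hi := $9F; end;     // rejects surrogates
      $F0: begin Need := 3; Lo := $90; end;     // rejects overlong forms
      $F1..$F3: Need := 3;
      $F4: begin Need := 3; Hi := $8F; end;     // nothing above U+10FFFF
    else
      Exit(False);                              // $80..$C1 and $F5..$FF cannot start a sequence
    end;
    Inc(I);
    while Need > 0 do
    begin
      if (I >= Length(B)) or (B[I] < Lo) or (B[I] > Hi) then
        Exit(False);
      // Later continuation bytes always use the plain $80..$BF range.
      Lo := $80;
      Hi := $BF;
      Inc(I);
      Dec(Need);
    end;
  end;
  Result := True;
end;

begin
  Writeln(IsValidUtf8(TBytes.Create($0B, $0B, $45, $2D, $4B))); // plain ASCII bytes
  Writeln(IsValidUtf8(TBytes.Create($C3, $A6)));                // lead byte + continuation byte
  Writeln(IsValidUtf8(TBytes.Create($E6)));                     // lone byte above $7F
  Writeln(IsValidUtf8(TBytes.Create($A6, $C3)));                // the same two bytes exchanged
end.
```

The first two calls are valid UTF-8 and the last two are not, which matches the point about exchanging the first two bytes of a sequence.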


  E-K ble

 

The line above has two characters at the start that I would like to remove (see the attached image, Capture.PNG).

Not sure if this makes any sense to you...

String:   E-K ble
Bytes:   0b 0b 45 2d 4b 20 62 6c 65

How can I programmatically remove all characters that are not UTF-8, if it makes sense to put it that way?

If I run this code:

 

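procedure TForm1.Button1Click(Sender: TObject);
var
  I: Integer;
begin
  // The test string starts with two #11 (vertical tab) characters
  edit1.Text := #11#11'E-K ble';
  edit2.Text := '';
  // Append the ordinal value of every character to edit2
  for I := 1 to Length(edit1.Text) do
  begin
    edit2.Text := edit2.Text + (Ord(edit1.Text[I])).ToString + ' - ';
  end;
end;

I get this result: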

11 - 11 - 69 - 45 - 75 - 32 - 98 - 108 - 101 - 

 

I guess I could remove the 11, but will every non-UTF-8 character be 11?

B

Guest
1 hour ago, borni69 said:

I guess I could remove the 11, but will every non-UTF-8 character be 11?

Yes, you are right in this case.

To make it clear: per the UTF-8 standard, every byte with a value <= 127 ($7F) represents a complete one-byte character, so you can safely remove any such byte without damaging the rest of the string.
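As a rough illustration of why that is safe, the sketch below works directly on the UTF-8 bytes and drops every byte below $20. Keeping tab, LF and CR is my assumption, and StripAsciiControls is a made-up name. Every byte of a multi-byte sequence is $80 or higher, so no sequence can be cut in half.

```pascal
program StripBytes;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: removes ASCII control bytes (< $20) from UTF-8 data,
// keeping tab ($09), line feed ($0A) and carriage return ($0D).
function StripAsciiControls(const B: TBytes): TBytes;
var
  I, N: Integer;
begin
  SetLength(Result, Length(B));
  N := 0;
  for I := 0 to High(B) do
    if (B[I] >= $20) or (B[I] in [$09, $0A, $0D]) then
    begin
      Result[N] := B[I];
      Inc(N);
    end;
  SetLength(Result, N);
end;

var
  Raw, Clean: TBytes;
begin
  // The bytes from the question: 0b 0b 45 2d 4b 20 62 6c 65
  Raw := TBytes.Create($0B, $0B, $45, $2D, $4B, $20, $62, $6C, $65);
  Clean := StripAsciiControls(Raw);
  Writeln(Length(Raw), ' bytes in, ', Length(Clean), ' bytes out');
  Writeln(TEncoding.UTF8.GetString(Clean));
end.
```

With the bytes from the question, the two $0B bytes are dropped and the remaining seven bytes decode to E-K ble.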


Thanks, I think I understand what you are saying.

After this discussion I am not sure what my spec is, but our problem is this:

We have an HTML text box on an Angular web client that people add text to. After the text is added, it is posted to the server, which uses a WebBroker ISAPI backend.

Sometimes they copy text from Word, email, etc., and then we get characters we don't want. I am not sure what they are. We would like to keep line breaks, tabs, etc., but not this character that shows up as a ? or as seen in the image above.

Not sure if this question makes sense.

b


I think you might be confusing UTF-8 and Unicode, because your example is using Unicode strings and not UTF-8 strings.

If what you have is in fact Unicode and you just want to remove non-printable characters, then you can use the TCharacter class:

// Iterate backwards so that Delete does not shift the characters still to be checked
for var i := Length(s) downto 1 do
  if (not TCharacter.IsValid(s[i])) or (TCharacter.IsControl(s[i])) then
    Delete(s, i, 1);
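For a self-contained variant of the same idea, here is a sketch that uses the Char record helper from System.Character instead of the TCharacter class. It assumes Delphi XE4 or later. The function name RemoveControlChars and the decision to keep tab, LF and CR are my assumptions, based on the wish to keep line breaks and tabs.

```pascal
program RemoveControls;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.Character;

// Hypothetical helper: drops control characters from a UTF-16 Delphi string,
// except tab (#9), line feed (#10) and carriage return (#13).
function RemoveControlChars(const S: string): string;
var
  Ch: Char;
begin
  Result := '';
  for Ch in S do
    if (not Ch.IsControl) or CharInSet(Ch, [#9, #10, #13]) then
      Result := Result + Ch;
end;

begin
  // Two vertical tabs (#11) in front of the text, as in the question
  Writeln(RemoveControlChars(#11#11'E-K ble'));
  // The tab between a and b is kept
  Writeln(RemoveControlChars('a'#9'b'));
end.
```

Building a new string avoids the index bookkeeping that Delete needs, at the cost of some reallocation on long inputs.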

 


50 minutes ago, borni69 said:

Sometimes they copy text from Word, email, etc., and then we get characters we don't want. I am not sure what they are. We would like to keep line breaks, tabs, etc., but not this character that shows up as a ? or as seen in the image above.

Your problem is not how to remove characters; it is to work out which characters are to be removed. As is so often the case, the hardest part of any programming task is determining the correct specification.


I agree with you...

And after the back-and-forth discussion in this thread, I think I have a solution.

I will only send out characters and commands that are handled by TJSONString.Create().

So I think my code will be something like this.

PS: there is 0b 0b before E-K in rawText.

  rawText := 'E-K ble æøå  Test // 98';
  ajson := TJSONObject.Create;
  try
    ajson.AddPair('text', TJSONString.Create(rawText));
    RawtextOut := ajson.ToString;
  finally
    ajson.Free;
  end;

  // Drop every character below #32 that is still present in the JSON text
  textOut := '';
  for ch in RawtextOut do
  begin
    if ch >= #32 then
      textOut := textOut + ch;
  end;

  memo1.Lines.Text := textOut;

Result:

{"text":"E-K ble æøå  Test \/\/ 98"}

The characters I don't want are removed.

Line breaks and tabs will also be handled correctly, because they are escaped first (for example as \r\n).

Thanks for all your help.

B

25 minutes ago, borni69 said:

So I think my code will be something like this

   if (ch >= #32)  then
   textOut := textOut+ch


Better hope that there are no line breaks .....


Line breaks will work, since:

1) TJSONString.Create() first changes them to \r\n, which makes them valid.

2) Only then are the unwanted characters removed.

Bernt

1 minute ago, Lars Fosdal said:

It's not in Rio either.

Ah. I didn't check the source. I just assumed the documentation was correct. Silly me :)

4 hours ago, borni69 said:

I will only send out characters  and commands that are handled by TjsonString.create() 

That makes no sense, as JSON handles the entire Unicode repertoire. ANY Unicode character can appear in JSON. ALL Unicode characters except for the first 32 ASCII control characters (U+0000..U+001F), " (U+0022), and \ (U+005C) can appear in an unescaped form. ALL Unicode characters can appear in an escaped form.
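To illustrate those rules, here is a minimal hand-rolled escaper. It is a sketch only and not how TJSONString works internally; JsonEscape is a made-up name. It escapes only what JSON requires (control characters, the quote and the backslash) and passes every other character through unchanged.

```pascal
program JsonEscapeDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: escapes only the characters that JSON does not
// allow unescaped inside a string: U+0000..U+001F, the quote and the backslash.
function JsonEscape(const S: string): string;
var
  Ch: Char;
begin
  Result := '';
  for Ch in S do
    case Ch of
      '"': Result := Result + '\"';
      '\': Result := Result + '\\';
      #0..#31: Result := Result + '\u' + IntToHex(Ord(Ch), 4);
    else
      Result := Result + Ch; // everything else may appear as-is
    end;
end;

begin
  // A vertical tab (#11) followed by text that contains quotes
  Writeln(JsonEscape(#11'E-K "ble"'));
end.
```

The vertical tab is not dropped, it comes out as \u000B. Real JSON libraries usually prefer the short forms such as \r, \n and \t where they exist, and this sketch uses \uXXXX throughout only to stay short.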


I will do it like this:

// TSysCharSet (System.SysUtils) is a set of AnsiChar, so it can only hold #0..#255
function StripCharsInSet(const s: string; const c: TSysCharSet): string;
var
  i, j: Integer;
begin
  SetLength(Result, Length(s));
  j := 0;
  for i := 1 to Length(s) do
    if not CharInSet(s[i], c) then
    begin
      // Keep this character: copy it to the next free slot
      Inc(j);
      Result[j] := s[i];
    end;
  SetLength(Result, j);
end;

begin
  // test
  memo2.Lines.Text := StripCharsInSet(memo1.Lines.Text, [#0..#9, #11, #12, #14..#31, #127..#160]);

...

I found the snippet on this page:

https://stackoverflow.com/questions/5650532/delphi-strip-out-all-non-standard-text-characers-from-string

https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec

B


I don't know where #127..#160 comes from. It is a valid set of characters, e.g. in Europe for accented characters like é à â.

You are confusing encodings. A Delphi string is UTF-16 encoded, so #127..#160 are valid UTF-16 characters.

What you call a "character" is confusing. #11 is a valid character in terms of both UTF-8 and UTF-16, as David wrote.
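A small sketch may help keep the two encodings apart. It dumps the bytes that the same Delphi string produces as UTF-16 and as UTF-8, using TEncoding from System.SysUtils. HexDump is a made-up helper for this example.

```pascal
program EncodingBytes;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: formats a byte array as space-separated hex pairs.
function HexDump(const B: TBytes): string;
var
  V: Byte;
begin
  Result := '';
  for V in B do
    Result := Result + IntToHex(V, 2) + ' ';
  Result := Trim(Result);
end;

var
  S: string;
begin
  S := #$00E6; // U+00E6, the first letter of the thread's test text after "ble"
  Writeln('UTF-16: ', HexDump(TEncoding.Unicode.GetBytes(S)));
  Writeln('UTF-8:  ', HexDump(TEncoding.UTF8.GetBytes(S)));

  S := #11;    // the vertical tab from the question
  Writeln('UTF-16: ', HexDump(TEncoding.Unicode.GetBytes(S)));
  Writeln('UTF-8:  ', HexDump(TEncoding.UTF8.GetBytes(S)));
end.
```

U+00E6 is one UTF-16 code unit (bytes E6 00 in little-endian order) but two UTF-8 bytes (C3 A6), while #11 is the single byte 0B in UTF-8 and 0B 00 in UTF-16. So #11 is valid in both encodings, it is just a control character.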

 
