Fons N

Delete unicode non-breaking space


Hi,

 

I am trying to import some text via the clipboard, which then has to be processed. I am not a professional coder, it's just a hobby, but I do use some of my applications at work (administration department).

 

Below is data from a web application. In Notepad it looks like this.

 

image.png.25c45af0f0566740e56a8e3d381bf211.png

 

In EditPad Pro it looks like this.

 

image.png.fcbf2dd60fa651db3cd012f3df073f3b.png

 

Using the hex view it looks like this.

 

image.png.d628f8cc62b953a503cc9c2ca415357d.png

 

The "white" space after the first Transaction seems to be 2 characters. The Â is C2 and the space after is A0. According to a web search C2A0 is a non-breaking space. But you all probably know this :classic_biggrin:

 

I am trying to delete all spaces including this "special" unicode space. But I cannot get it to work.

 

  S := StringReplace(S, #$C2#$A0, '', [rfReplaceAll]);
  S := StringReplace(S, #$C2A0, '', [rfReplaceAll]);
 

Delphi does not complain about any syntax errors, but the result is that this non-breaking space is not deleted - as in replaced by an empty string.

 

I am at a loss. Delphi does include a function IsWhiteSpace that can detect this character, but I need something to delete it.

 

Thanks in advance.

 

Best regards,

Fons

 

 


I have figured it out... after trying and searching for about an hour before posting my question... I suddenly have the answer.

 

At first I tried the decimal value of C2A0 which is 49824. But that did not work either.

 

Then I found this:

 

image.thumb.png.7a96f092eddd7f96afa05ae7eddac2f6.png

 

When I use #160 it works :classic_biggrin:
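For reference, a minimal sketch of the working replacement, assuming a Unicode Delphi version (2009 or later) where string is UTF-16. The sample text and the NBSP constant name are made up for illustration. The four-digit form #$00A0 is used on purpose: as far as I know, decimal and two-digit hex literals above #127 (such as #160) are parsed as ANSI characters unless {$HIGHCHARUNICODE ON} is set. On most Windows code pages that still converts to U+00A0, which would explain why #160 works here.

```delphi
program RemoveNbsp;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  // U+00A0 NO-BREAK SPACE as a single UTF-16 code unit (constant name is illustrative)
  NBSP = #$00A0;

var
  S: string;
begin
  // Hypothetical sample input: a no-break space followed by an ordinary space
  S := 'Transaction' + NBSP + ' 1';

  S := StringReplace(S, NBSP, '', [rfReplaceAll]); // delete no-break spaces
  S := StringReplace(S, ' ', '', [rfReplaceAll]);  // delete ordinary spaces

  Writeln(S); // prints Transaction1
end.
```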

 

Greetings,

Fons

 

56 minutes ago, Fons N said:

Hi,

 

I am trying to import some text via the clipboard, which then has to be processed. I am not a professional coder, it's just a hobby, but I do use some of my applications at work (administration department).

 

Below is data from a web application. In Notepad it looks like this.

 

image.png.25c45af0f0566740e56a8e3d381bf211.png

 

 

You are approaching this from the wrong angle. The data you showed pasted into notepad looks like a semicolon-separated CSV format. To dissect this you can use TStringlist. Something like this:

 

var
  LText, LLine: TStringlist;
  i: integer;
begin
  LText := TStringlist.Create;
  try
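    // Assigning to the Text property splits the data into lines.
    // Clipboard is declared in the Vcl.Clipbrd unit (VCL applications).
    LText.Text := Clipboard.AsText;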

    LLine := TStringlist.Create;
    try
      LLine.StrictDelimiter := true;
      LLine.Delimiter := ';';
     
      for i:= 0 to LText.Count - 1 do begin
        LLine.DelimitedText := LText[i];
        if i = 0 then
          ProcessHeaderLine(LLine)
        else
          ProcessDataLine(LLine); 
      end;
    finally
      LLine.Free;
    end;
  finally
    LText.Free;
  end;
end;
       
     

Untested, just typed into the post directly.

 

The two Process routines are something you would write yourself. For each call, the passed stringlist should hold 5 items: the column captions for the header line and the column values for a data line.

If you really want to replace a non-breaking Unicode space it is a single character with the code #$00A0, not a two-character string. Your hex viewer probably pastes the clipboard content as ANSI text, while Notepad pastes it as Unicode (UTF-16).
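If the goal is to delete every kind of white space rather than one specific character, the IsWhiteSpace check mentioned in the first post can drive the removal. The following is only a sketch, assuming a Delphi version that has the Char helper in System.Character (XE4 or later, as far as I know). StripWhiteSpace is a hypothetical helper name and the sample string is made up.

```delphi
program StripSpaces;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Character;

// Hypothetical helper: returns S without any character that
// Char.IsWhiteSpace reports as white space (this includes U+00A0).
function StripWhiteSpace(const S: string): string;
var
  C: Char;
  SB: TStringBuilder;
begin
  SB := TStringBuilder.Create(Length(S));
  try
    for C in S do
      if not C.IsWhiteSpace then
        SB.Append(C);
    Result := SB.ToString;
  finally
    SB.Free;
  end;
end;

begin
  // Made-up sample: no-break space, ordinary space and a tab
  Writeln(StripWhiteSpace('12' + #$00A0 + '345 ,' + #9 + '67'));
end.
```

Note that tabs and line breaks also count as white space, so a routine like this is better applied to a single field or line than to the whole clipboard text.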

 

Edited by PeterBelow
On 4/10/2022 at 9:07 AM, Fons N said:

When I use #160 it works :classic_biggrin:

That is because the original data is encoded in UTF-8, but once it is loaded into your string, it is no longer encoded in UTF-8; it is encoded in UTF-16 instead. $C2 $A0 are the UTF-8 bytes for the non-breaking space character, whereas $00A0 (decimal 160) is the UTF-16 value of that same character.
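The relationship between the two encodings can be checked with a small console program. This is a sketch only, assuming a Unicode Delphi version with TEncoding in System.SysUtils; the program and variable names are illustrative.

```delphi
program NbspEncodings;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;
  Bytes: TBytes;
  B: Byte;
begin
  // One character in a Delphi (UTF-16) string
  S := #$00A0;
  Writeln('UTF-16 code unit: $', IntToHex(Ord(S[1]), 4));

  // The same character encoded as UTF-8 takes two bytes: C2 A0
  Bytes := TEncoding.UTF8.GetBytes(S);
  Write('UTF-8 bytes:');
  for B in Bytes do
    Write(' $', IntToHex(B, 2));
  Writeln;

  // Decoding the UTF-8 bytes yields a single character again
  S := TEncoding.UTF8.GetString(TBytes.Create($C2, $A0));
  Writeln('Decoded length: ', Length(S), ', value: $', IntToHex(Ord(S[1]), 4));
end.
```

This also shows why searching the string for #$C2#$A0 finds nothing: after decoding, those two bytes no longer exist as separate characters.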

On 4/10/2022 at 6:44 PM, PeterBelow said:

If you really want to replace a non-breaking Unicode

Peter,

 

Thanks for your help. Sorry for the elaborate introduction to my question. I wasn't sure of it all, thus the long introduction. And yes, the actual splitting is done similarly to your example.

 

Greetings,

Fons 

On 4/11/2022 at 10:47 PM, Remy Lebeau said:

but once it is loaded into your string, it is no longer encoded in UTF-8, it is encoded in UTF-16 instead.

Remy,

 

Thanks. Reading your reply, yes, it does make sense to me. I know there is UTF-8, 16 and 32, but didn't realize the "code point" (not sure about the name) would be different, just that the storage would be. But that is of course not the case, quite logically I suppose; not having had to deal with that issue thus far, it just hadn't occurred to me yet.

 

Greetings,

Fons
