David Schwartz

Unicode weirdness


I have some PDF files that I ran through a text extractor to get simple text files (.txt). I assumed they were ASCII text, but it appears not.

 

The files have lots of things like â€™ and â€“ and â€¦ scattered throughout.

 

I found a table that shows what they're supposed to be and wrote this to convert them (strs points to a memo.Lines property):
 

  var ln := '';
  strs.BeginUpdate;
  for var n := 0 to strs.Count-1 do
  begin
    ln := StringReplace( strs[n], 'âž¤', '>', [rfReplaceAll] ); // '➤'
    ln := StringReplace( ln, 'â€™', '''', [rfReplaceAll] );     // '’'
    ln := StringReplace( ln, 'â€œ', '"', [rfReplaceAll] );      // '“'
    ln := StringReplace( ln, 'â€', '"', [rfReplaceAll] );       // '”'
    ln := StringReplace( ln, 'â€¦', '...', [rfReplaceAll] );    // '…'
    ln := StringReplace( ln, 'â€”', '--', [rfReplaceAll] );     // '—'
    ln := StringReplace( ln, 'â€“', '--', [rfReplaceAll] );     // '–'
    strs[n] := ln;
  end;
  strs.EndUpdate;

This worked for a little while, until the Delphi IDE (10.4.2) unexpectedly converted all of the string literals into actual Unicode characters, at which point it stopped working, since StringReplace no longer found any of them in the text. Ugh.

 

I corrected it here before pasting this code, and hopefully it won't get changed here as well.

 

For my purposes, these characters are irrelevant. I'm replacing them with ASCII characters so they make sense if you're reading the text. But whether they're ASCII or Unicode doesn't matter.

 

I found a table here: https://www.i18nqa.com/debug/utf8-debug.html

 

and it says an apostrophe can be represented in several ways:

 

[image: table row from the i18nqa debug chart showing the different encodings of the apostrophe / right single quotation mark]

 

How can I replace a 2- or 3-char literal like 'â€™' with one of the codes above, so the compiler doesn't change them back to Unicode representations?

 

Is there a simpler way to do this?

 

Depending on what I'm using to look at the text data files, they may appear as their "real" Unicode representation, or they may appear as 2- or 3-char gibberish.

 

I just need ASCII text that comes close to what they represent.

 


I don't want or need them in Unicode -- I want plain ASCII or Ansi Strings. That's what I'm trying to do here -- the problem is the IDE is changing the MBCS back to Unicode, so the StringReplace isn't doing what I want it to do.

 

I'm wondering how to rewrite the StringReplace calls so they match the actual text, without the IDE translating them into Unicode.


@David Schwartz What type is ln, AnsiString or String? Try

ln := StringReplace( strs[n], AnsiString('âž¤'), '>', [rfReplaceAll] );

 


Can you not find a better 'text extractor' which produces more useful output?

22 minutes ago, Kryvich said:

@David Schwartz What type is ln, AnsiString or String? Try


ln := StringReplace( strs[n], AnsiString('âž¤'), '>', [rfReplaceAll] );

 

That would be quite pointless, given that strs is a TStrings, as David wrote ("strs points to a memo.Lines property").

 

35 minutes ago, David Schwartz said:

I don't want or need them in Unicode -- I want plain ASCII or Ansi Strings.

Then don't use a Memo and its Lines property, I would say - they are Unicode.


You guys are totally missing the point here.

 

The code in the Delphi IDE looks like this:

 

    ln := StringReplace( strs[n], 'âž¤', '>', [rfReplaceAll] ); // '➤'
    ln := StringReplace( ln, 'â€™', '''', [rfReplaceAll] );     // '’'
    ln := StringReplace( ln, 'â€œ', '"', [rfReplaceAll] );      // '“'
    ln := StringReplace( ln, 'â€', '"', [rfReplaceAll] );       // '”'
    ln := StringReplace( ln, 'â€¦', '...', [rfReplaceAll] );    // '…'
    ln := StringReplace( ln, 'â€”', '--', [rfReplaceAll] );     // '—'
    ln := StringReplace( ln, 'â€“', '--', [rfReplaceAll] );     // '–'

 

This code works fine.

 

And at one point I opened Delphi and the same code now looks like this:

 

    ln := StringReplace( strs[n], '➤', '>', [rfReplaceAll] ); // '➤'
    ln := StringReplace( ln, '’', '''', [rfReplaceAll] );     // '’'
    ln := StringReplace( ln, '“', '"', [rfReplaceAll] );      // '“'
    ln := StringReplace( ln, '”', '"', [rfReplaceAll] );       // '”'
    ln := StringReplace( ln, '…', '...', [rfReplaceAll] );    // '…'
    ln := StringReplace( ln, '—', '--', [rfReplaceAll] );     // '—'
    ln := StringReplace( ln, '–', '--', [rfReplaceAll] );     // '–'


This code DOES NOT WORK!

 

My text does NOT contain these Unicode characters! It contains 2- and 3-char representations.

 

I even tried something like this:

 

 

    ln := StringReplace( ln, 'â'+'€œ', '"', [rfReplaceAll] );      // '“'

 

It did not work either.

 

 

 

Edited by David Schwartz


Seems Project Options > Codepage has been set to 65001.

Just define the fragments to replace with numeric codes, or try defining them char by char ('a'+'b'+'c').
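
For example, a minimal sketch of the numeric-codes idea (constant and function names are illustrative; the #$.. codes spell the cp1252 mojibake forms of ’, “, … and – as individual UTF-16 chars):

uses System.SysUtils;

const
  // Mojibake search patterns written as character codes, so the IDE
  // never sees a non-ASCII literal it could re-encode on save.
  MJ_RSQUO  = #$00E2#$20AC#$2122;   // 'â€™' (garbled ’)
  MJ_LDQUO  = #$00E2#$20AC#$0153;   // 'â€œ' (garbled “)
  MJ_HELLIP = #$00E2#$20AC#$00A6;   // 'â€¦' (garbled …)
  MJ_NDASH  = #$00E2#$20AC#$201C;   // 'â€“' (garbled –)

function CleanLine(const S: string): string;
begin
  Result := StringReplace(S, MJ_RSQUO, '''', [rfReplaceAll]);
  Result := StringReplace(Result, MJ_LDQUO, '"', [rfReplaceAll]);
  Result := StringReplace(Result, MJ_HELLIP, '...', [rfReplaceAll]);
  Result := StringReplace(Result, MJ_NDASH, '--', [rfReplaceAll]);
end;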

Edited by Fr0sT.Brutal


Isn't the real problem that you have interpreted UTF-8 encoded data as though it were ANSI?

 

Quote

I have some PDF files that I ran through a text extractor to get simple text files (.txt). I assumed they were ASCII text, but it appears not.

 

I mean, it's clearly not ASCII because none of the characters in your code are in the ASCII set.

 

You can actually delete all of these StringReplace calls by simply using the correct encoding for your extracted data.
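
A minimal sketch of that (assuming the extractor's .txt output and a memo control; the names are illustrative):

uses System.Classes, System.SysUtils, Vcl.StdCtrls;

procedure LoadExtracted(Memo: TMemo; const FileName: string);
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    // Decode the bytes as UTF-8 up front; the 'â€™'-style sequences
    // then never come into existence.
    Lines.LoadFromFile(FileName, TEncoding.UTF8);
    Memo.Lines.Assign(Lines);
  finally
    Lines.Free;
  end;
end;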

Edited by David Heffernan
6 hours ago, David Schwartz said:

I'm replacing them with ASCII characters so they make sense if you're reading the text.

Wouldn't converting the chars to Unicode solve that problem?

All strings in modern Delphi components use Unicode.

I don't understand why you don't want to handle the text as what it is.

Once you have the text as Unicode, you also get all the nice TCharHelper functions to understand what kind of character you are looking at, in case you want to do more manipulations.

A lot better and more robust than string replacements.
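
For instance, a small sketch of what the Char helper from System.Character gives you (DumpPunctuation is a hypothetical helper):

uses System.SysUtils, System.Character;

procedure DumpPunctuation(const S: string);
begin
  // Once the text is decoded correctly, the Char helper can classify
  // each character instead of you pattern-matching byte salad.
  for var Ch in S do
    if Ch.IsPunctuation then
      Writeln(Format('%s = U+%.4x', [Ch, Ord(Ch)]));
end;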

On 4/4/2023 at 4:55 AM, David Heffernan said:

You can actually delete all of these StringReplace calls by simply using the correct encoding for your extracted data.

That was not an option on the extraction tool; you upload the file, it processes the file, then you click a Download button. 

 

21 hours ago, Lars Fosdal said:

Wouldn't converting the chars to Unicode solve that problem?

 

I don't understand why you don't want to handle the text as what it is.

 

Yes, but how can I do that?

 

Because they are unnecessary. Deleting them would make no difference to the result I'm after. But having them as weird text IS a problem.

 

30 minutes ago, David Schwartz said:

That was not an option on the extraction tool; you upload the file, it processes the file, then you click a Download button. 

It's when you read the output into Delphi that there's a problem. Your tool is emitting UTF-8 encoded text, but you are interpreting it as ANSI. The tool is fine. Your code is not.

Edited by David Heffernan


Just open the extracted *.txt files in Notepad++ and try out the different encoding options that this program offers until the files display correctly.

Then save them as "Unicode with BOM". TStringList.LoadFromFile will load the files correctly even if they contain foreign characters.

 

 


You see this most often when ASCII is automatically cleaned up typographically for printing, with conversions like dash to em dash. If that text is then put back / interpreted as ANSI bytes, the UTF8 encodings of the typographical replacements each end up as multiple characters - a minus/dash that was converted to an em dash, for instance, ends up as 'â€”'. You need to find where the problem is - the data in the PDF itself could already be corrupted this way, or it can happen at some other stage, including in the PDF -> text conversion or in how you load the text.

 

Often, even if you do interpret the encoding correctly, so there is a — (em dash) instead of 'â€”', the equivalent replacements might still be worthwhile to convert the text back to plain ASCII.
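
A sketch of that post-decode cleanup (ToPlainAscii is a hypothetical helper; the pairs are the characters discussed in this thread):

uses System.SysUtils;

// Fallback pass: map real typographic characters, present after a
// correct UTF-8 decode, to plain-ASCII stand-ins.
function ToPlainAscii(const S: string): string;
const
  Map: array [0..5, 0..1] of string = (
    ('’', ''''), ('“', '"'), ('”', '"'),
    ('…', '...'), ('—', '--'), ('–', '--'));
begin
  Result := S;
  for var I := Low(Map) to High(Map) do
    Result := StringReplace(Result, Map[I, 0], Map[I, 1], [rfReplaceAll]);
end;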

4 hours ago, A.M. Hoornweg said:

Just open the extracted *.txt files in Notepad++ and try out the different encoding options that this program offers until the files display correctly. 

 

Then save them as "Unicode with BOM". TStringList.LoadFromFile will load the files correctly even if they contain foreign characters.

 

 

No. We know the text is UTF8 encoded, so just load it specifying that encoding. No point adding an extra step.

4 hours ago, Brian Evans said:

You see this most often when ASCII is automatically cleaned up typographically for printing, with conversions like dash to em dash. If that text is then put back / interpreted as ANSI bytes, the UTF8 encodings of the typographical replacements each end up as multiple characters - a minus/dash that was converted to an em dash, for instance, ends up as 'â€”'. You need to find where the problem is - the data in the PDF itself could already be corrupted this way, or it can happen at some other stage, including in the PDF -> text conversion or in how you load the text.

 

Often, even if you do interpret the encoding correctly, so there is a — (em dash) instead of 'â€”', the equivalent replacements might still be worthwhile to convert the text back to plain ASCII.

The text is clearly UTF8 encoded. That much we already know. 

On 4/4/2023 at 1:56 PM, David Schwartz said:

The code in the Delphi IDE looks like this:

 


    ln := StringReplace( strs[n], 'âž¤', '>', [rfReplaceAll] ); // '➤'
    ln := StringReplace( ln, 'â€™', '''', [rfReplaceAll] );     // '’'
    ln := StringReplace( ln, 'â€œ', '"', [rfReplaceAll] );      // '“'
    ln := StringReplace( ln, 'â€', '"', [rfReplaceAll] );       // '”'
    ln := StringReplace( ln, 'â€¦', '...', [rfReplaceAll] );    // '…'
    ln := StringReplace( ln, 'â€”', '--', [rfReplaceAll] );     // '—'
    ln := StringReplace( ln, 'â€“', '--', [rfReplaceAll] );     // '–'

 

This code works fine.

 

And at one point I opened Delphi and the same code now looks like this:

 


    ln := StringReplace( strs[n], '➤', '>', [rfReplaceAll] ); // '➤'
    ln := StringReplace( ln, '’', '''', [rfReplaceAll] );     // '’'
    ln := StringReplace( ln, '“', '"', [rfReplaceAll] );      // '“'
    ln := StringReplace( ln, '”', '"', [rfReplaceAll] );       // '”'
    ln := StringReplace( ln, '…', '...', [rfReplaceAll] );    // '…'
    ln := StringReplace( ln, '—', '--', [rfReplaceAll] );     // '—'
    ln := StringReplace( ln, '–', '--', [rfReplaceAll] );     // '–'


This code DOES NOT WORK! 

So you had the file interpreted as ANSI and converted into UTF16, with all the "weird" chars just widened ($AB => $00AB). And you had your UTF16-encoded literals defined in the same way, because the IDE thought the source file was ANSI. Then, in the new version, the option changed to UTF8, and your literals, which together form a valid UTF8 compound char, turned into a single UTF16 char which is not contained in the source string.

That's my version.

5 hours ago, Lars Fosdal said:

@David Schwartz - You wouldn't happen to have an original file uploaded as an attachment to a post here, so that we can try some conversions?

It's UTF8. We don't need to check any more. And you don't need any more information than is in the original post. 

On 4/4/2023 at 10:57 AM, David Schwartz said:

The files have lots of things like â€™ and â€“ and â€¦ scattered throughout.

 

I found a table that shows what they're supposed to be and wrote this to convert them (strs points to a memo.Lines property):

 

Run these oddly-looking Mojibake character sequences through the ftfy ("fixes text for you") analyser, which lists the encoding/decoding steps needed to get from the oddly-looking text to proper text, then repeat those steps in Delphi code (for instance using the TEncoding class).

This is way better than using a conversion table, because that table will likely be incomplete.

It also solves your problem where your Delphi source code apparently got mangled, undoing your table-based conversion workaround.

That code mangling can have lots of causes, including hard-to-reproduce bugs in the Delphi IDE itself or in plugins used by the IDE.

 

BTW: if you install poppler (for instance through Chocolatey), the included pdftotext console executable can extract text from PDF files for you.
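
For instance (file names illustrative; the -enc switch pins the output encoding):

pdftotext -enc UTF-8 extracted.pdf extracted.txt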

Edited by jeroenp

5 hours ago, jeroenp said:

Run these oddly-looking Mojibake character sequences through the ftfy ("fixes text for you") analyser, which lists the encoding/decoding steps needed to get from the oddly-looking text to proper text, then repeat those steps in Delphi code (for instance using the TEncoding class).

I'd just read them using the UTF8 encoding in the first place and so never ever see these characters. I'm sure you would too. 


This entire thread blows my mind. The number of people who think it's normal to read UTF8 as though it were ANSI. 

15 hours ago, David Heffernan said:

I'd just read them using the UTF8 encoding in the first place and so never ever see these characters. I'm sure you would too. 

That would be my first try too.

It could just as well be that the on-line PDF-to-text exporter makes an odd encoding error (it wouldn't be the first tool or site doing strange encoding stuff, hence the series of blog posts at https://wiert.me/category/mojibake/), which is why I mentioned ftfy: it's a great tool for figuring out encoding issues.

 

Looking at https://ftfy.vercel.app/?s=… (and hoping this forum does not mangle that URL), two encode/decode round-trips are required for the fix, so it does not look like a plain "read using UTF8" case:

 

s = s.encode('latin-1')
s = s.decode('utf-8')
s = s.encode('sloppy-windows-1252')
s = s.decode('utf-8')
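
In Delphi terms, those four steps might look roughly like this (a sketch: UnMojibake is a hypothetical helper, TEncoding.GetEncoding instances must be freed, and strict Windows-1252 only approximates ftfy's sloppy-windows-1252):

uses System.SysUtils;

function UnMojibake(const S: string): string;
var
  Latin1, CP1252: TEncoding;
begin
  Latin1 := TEncoding.GetEncoding(28591); // ISO-8859-1
  CP1252 := TEncoding.GetEncoding(1252);  // Windows-1252
  try
    // s.encode('latin-1') followed by s.decode('utf-8')
    Result := TEncoding.UTF8.GetString(Latin1.GetBytes(S));
    // s.encode('sloppy-windows-1252') followed by s.decode('utf-8')
    Result := TEncoding.UTF8.GetString(CP1252.GetBytes(Result));
  finally
    CP1252.Free;
    Latin1.Free;
  end;
end;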

 

Edited by jeroenp (ftfy example)
