David Schwartz

Unicode weirdness


I have some PDF files that I ran through a text extractor to get simple text files (.txt). I assumed they were ASCII text, but it appears not.

 

The files have lots of things like â€™ and â€“ and â€¦ scattered throughout.

 

I found a table that shows what they're supposed to be and wrote this to convert them (strs points to a memo.Lines property):
 

  var ln := '';
  strs.BeginUpdate;
  for var n := 0 to strs.Count-1 do
  begin
    ln := StringReplace( strs[n], 'âž¤', '>', [rfReplaceAll] ); // '➤'
    ln := StringReplace( ln, 'â€™', '''', [rfReplaceAll] );     // '’'
    ln := StringReplace( ln, 'â€œ', '"', [rfReplaceAll] );      // '“'
    ln := StringReplace( ln, 'â€', '"', [rfReplaceAll] );       // '”'
    ln := StringReplace( ln, 'â€¦', '...', [rfReplaceAll] );    // '…'
    ln := StringReplace( ln, 'â€”', '--', [rfReplaceAll] );     // '—'
    ln := StringReplace( ln, 'â€“', '--', [rfReplaceAll] );     // '–'
    strs[n] := ln;
  end;
  strs.EndUpdate;

This worked for a little while, until the Delphi IDE (10.4.2) unexpectedly converted all of the string literals into actual Unicode characters, at which point it stopped working, since StringReplace no longer found any of them in the text. Ugh.

 

I corrected it here before pasting this code, and hopefully it won't get changed here as well.

 

For my purposes, these characters are irrelevant. I'm replacing them with ASCII characters so they make sense if you're reading the text. But whether they're ASCII or Unicode doesn't matter.

 

I found a table here: https://www.i18nqa.com/debug/utf8-debug.html

 

and it says an apostrophe can be represented in several ways:

 

[image: table row from the i18nqa debug chart showing the different encodings of the apostrophe / right single quotation mark]

 

How can I replace a 2- or 3-char literal like 'â€™' with one of the codes above, so the compiler doesn't change them back to Unicode representations?

 

Is there a simpler way to do this?

 

Depending on what I'm using to look at the text data files, they may appear as their "real" Unicode representation, or they may appear as 2- or 3-char gibberish.

 

I just need ASCII text that comes close to what they represent.

 


I don't want or need them in Unicode -- I want plain ASCII or Ansi Strings. That's what I'm trying to do here -- the problem is the IDE is changing the MBCS back to Unicode, so the StringReplace isn't doing what I want it to do.

 

I'm wondering how to rewrite the StringReplace calls so they match the actual text, without the IDE translating them into Unicode.


@David Schwartz What type is ln, AnsiString or String? Try

ln := StringReplace( strs[n], AnsiString('âž¤'), '>', [rfReplaceAll] );

 


Can you not find a better 'text extractor' which produces more useful output?

22 minutes ago, Kryvich said:

@David Schwartz What type is ln, AnsiString or String? Try


ln := StringReplace( strs[n], AnsiString('âž¤'), '>', [rfReplaceAll] );

 

That would be quite pointless, given that strs is a TStrings, as David wrote ("strs points to a memo.Lines property").

 

35 minutes ago, David Schwartz said:

I don't want or need them in Unicode -- I want plain ASCII or Ansi Strings.

Then don't use a Memo and its Lines property, I would say - they are Unicode.


You guys are totally missing the point here.

 

The code in the Delphi IDE looks like this:

 

    ln := StringReplace( strs[n], 'âž¤', '>', [rfReplaceAll] ); // '➤'
    ln := StringReplace( ln, 'â€™', '''', [rfReplaceAll] );     // '’'
    ln := StringReplace( ln, 'â€œ', '"', [rfReplaceAll] );      // '“'
    ln := StringReplace( ln, 'â€', '"', [rfReplaceAll] );       // '”'
    ln := StringReplace( ln, 'â€¦', '...', [rfReplaceAll] );    // '…'
    ln := StringReplace( ln, 'â€”', '--', [rfReplaceAll] );     // '—'
    ln := StringReplace( ln, 'â€“', '--', [rfReplaceAll] );     // '–'

 

This code works fine.

 

And at one point I opened Delphi and the same code now looks like this:

 

    ln := StringReplace( strs[n], '➤', '>', [rfReplaceAll] ); // '➤'
    ln := StringReplace( ln, '’', '''', [rfReplaceAll] );     // '’'
    ln := StringReplace( ln, '“', '"', [rfReplaceAll] );      // '“'
    ln := StringReplace( ln, '”', '"', [rfReplaceAll] );       // '”'
    ln := StringReplace( ln, '…', '...', [rfReplaceAll] );    // '…'
    ln := StringReplace( ln, '—', '--', [rfReplaceAll] );     // '—'
    ln := StringReplace( ln, '–', '--', [rfReplaceAll] );     // '–'


This code DOES NOT WORK!

 

My text does NOT contain these Unicode characters! It contains 2- and 3-char representations.

 

I even tried something like this:

 

 

    ln := StringReplace( ln, 'â'+'€œ', '"', [rfReplaceAll] );      // '“'

 

It did not work either.

 

 

 

Edited by David Schwartz


Seems Project Options > Codepage has been set to 65001.

Just define the fragments to replace with numeric codes, or try defining them char by char ('a'+'b'+'c').
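
For example, a minimal sketch of the numeric-codes idea (constant and function names are illustrative; the #$.. codes spell the cp1252 mojibake forms of ’, “, … and – as individual UTF-16 chars):

uses System.SysUtils;

const
  // Mojibake search patterns written as character codes, so the IDE
  // never sees a non-ASCII literal it could re-encode on save.
  MJ_RSQUO  = #$00E2#$20AC#$2122;   // 'â€™' (garbled ’)
  MJ_LDQUO  = #$00E2#$20AC#$0153;   // 'â€œ' (garbled “)
  MJ_HELLIP = #$00E2#$20AC#$00A6;   // 'â€¦' (garbled …)
  MJ_NDASH  = #$00E2#$20AC#$201C;   // 'â€“' (garbled –)

function CleanLine(const S: string): string;
begin
  Result := StringReplace(S, MJ_RSQUO, '''', [rfReplaceAll]);
  Result := StringReplace(Result, MJ_LDQUO, '"', [rfReplaceAll]);
  Result := StringReplace(Result, MJ_HELLIP, '...', [rfReplaceAll]);
  Result := StringReplace(Result, MJ_NDASH, '--', [rfReplaceAll]);
end;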

Edited by Fr0sT.Brutal


Isn't the real problem that you have interpreted UTF-8 encoded data as though it were ANSI?

 

Quote

I have some PDF files that I ran through a text extractor to get simple text files (.txt). I assumed they were ASCII text, but it appears not.

 

I mean, it's clearly not ASCII because none of the characters in your code are in the ASCII set.

 

You can actually delete all of these StringReplace calls by simply using the correct encoding for your extracted data.
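
A minimal sketch of that (assuming the extractor's .txt output and a memo control; the names are illustrative):

uses System.Classes, System.SysUtils, Vcl.StdCtrls;

procedure LoadExtracted(Memo: TMemo; const FileName: string);
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    // Decode the bytes as UTF-8 up front; the 'â€™'-style sequences
    // then never come into existence.
    Lines.LoadFromFile(FileName, TEncoding.UTF8);
    Memo.Lines.Assign(Lines);
  finally
    Lines.Free;
  end;
end;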

Edited by David Heffernan
6 hours ago, David Schwartz said:

I'm replacing them with ASCII characters so they make sense if you're reading the text.

Wouldn't converting the chars to Unicode solve that problem?

All strings in modern Delphi components use Unicode.

I don't understand why you don't want to handle the text as what it is.

Once you have the text as Unicode, you also get all the nice TCharHelper functions to understand what kind of character you are looking at, in case you want to do more manipulations.

A lot better and more robust than string replacements.
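
For instance, a small sketch of what the Char helper from System.Character gives you (DumpPunctuation is a hypothetical helper):

uses System.SysUtils, System.Character;

procedure DumpPunctuation(const S: string);
begin
  // Once the text is decoded correctly, the Char helper can classify
  // each character instead of you pattern-matching byte salad.
  for var Ch in S do
    if Ch.IsPunctuation then
      Writeln(Format('%s = U+%.4x', [Ch, Ord(Ch)]));
end;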

On 4/4/2023 at 4:55 AM, David Heffernan said:

You can actually delete all of these StringReplace calls by simply using the correct encoding for your extracted data.

That was not an option on the extraction tool; you upload the file, it processes the file, then you click a Download button. 

 

21 hours ago, Lars Fosdal said:

Wouldn't converting the chars to Unicode solve that problem?

 

I don't understand why you don't want to handle the text as what it is.

 

Yes, but how can I do that?

 

Because they are unnecessary. Deleting them would make no difference to the result I'm after. But having them as weird text IS a problem.

 

30 minutes ago, David Schwartz said:

That was not an option on the extraction tool; you upload the file, it processes the file, then you click a Download button. 

It's when you read the output into Delphi that there's a problem. Your tool is emitting UTF-8 encoded text, but you are interpreting it as ANSI. The tool is fine. Your code is not.

Edited by David Heffernan


Just open the extracted *.txt files in Notepad++ and try out the different encoding options that this program offers until the files display correctly.

Then save them as "Unicode with BOM". TStringList.LoadFromFile will load the files correctly even if they contain foreign characters.

 

 


You see this most often when ASCII is automatically cleaned up typographically for printing, with conversions like dash to em dash. If that text is then put back / interpreted as ANSI bytes, the UTF8 encodings of the typographical replacements each end up as multiple characters - a minus/dash that was converted to an em dash, for instance, ends up as 'â€”'. You need to find where the problem is - the data in the PDF itself could already be corrupted this way, or it can happen at some other stage, including in the PDF -> text conversion or in how you load the text.

 

Often, even if you do interpret the encoding correctly, so there is a — (em dash) instead of 'â€”', the equivalent replacements might still be worthwhile to convert the text back to plain ASCII.
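
A sketch of that post-decode cleanup (ToPlainAscii is a hypothetical helper; the pairs are the characters discussed in this thread):

uses System.SysUtils;

// Fallback pass: map real typographic characters, present after a
// correct UTF-8 decode, to plain-ASCII stand-ins.
function ToPlainAscii(const S: string): string;
const
  Map: array [0..5, 0..1] of string = (
    ('’', ''''), ('“', '"'), ('”', '"'),
    ('…', '...'), ('—', '--'), ('–', '--'));
begin
  Result := S;
  for var I := Low(Map) to High(Map) do
    Result := StringReplace(Result, Map[I, 0], Map[I, 1], [rfReplaceAll]);
end;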

4 hours ago, A.M. Hoornweg said:

Just open the extracted *.txt files in Notepad++ and try out the different encoding options that this program offers until the files display correctly. 

 

Then save them as "Unicode with BOM". TStringList.LoadFromFile will load the files correctly even if they contain foreign characters.

 

 

No. We know the text is UTF8 encoded, so just load it specifying that encoding. No point adding an extra step.

4 hours ago, Brian Evans said:

You see this most often when ASCII is automatically cleaned up typographically for printing, with conversions like dash to em dash. If that text is then put back / interpreted as ANSI bytes, the UTF8 encodings of the typographical replacements each end up as multiple characters - a minus/dash that was converted to an em dash, for instance, ends up as 'â€”'. You need to find where the problem is - the data in the PDF itself could already be corrupted this way, or it can happen at some other stage, including in the PDF -> text conversion or in how you load the text.

 

Often, even if you do interpret the encoding correctly, so there is a — (em dash) instead of 'â€”', the equivalent replacements might still be worthwhile to convert the text back to plain ASCII.

The text is clearly UTF8 encoded. That much we already know. 

On 4/4/2023 at 1:56 PM, David Schwartz said:

The code in the Delphi IDE looks like this:

 


    ln := StringReplace( strs[n], 'âž¤', '>', [rfReplaceAll] ); // '➤'
    ln := StringReplace( ln, 'â€™', '''', [rfReplaceAll] );     // '’'
    ln := StringReplace( ln, 'â€œ', '"', [rfReplaceAll] );      // '“'
    ln := StringReplace( ln, 'â€', '"', [rfReplaceAll] );       // '”'
    ln := StringReplace( ln, 'â€¦', '...', [rfReplaceAll] );    // '…'
    ln := StringReplace( ln, 'â€”', '--', [rfReplaceAll] );     // '—'
    ln := StringReplace( ln, 'â€“', '--', [rfReplaceAll] );     // '–'

 

This code works fine.

 

And at one point I opened Delphi and the same code now looks like this:

 


    ln := StringReplace( strs[n], '➤', '>', [rfReplaceAll] ); // '➤'
    ln := StringReplace( ln, '’', '''', [rfReplaceAll] );     // '’'
    ln := StringReplace( ln, '“', '"', [rfReplaceAll] );      // '“'
    ln := StringReplace( ln, '”', '"', [rfReplaceAll] );       // '”'
    ln := StringReplace( ln, '…', '...', [rfReplaceAll] );    // '…'
    ln := StringReplace( ln, '—', '--', [rfReplaceAll] );     // '—'
    ln := StringReplace( ln, '–', '--', [rfReplaceAll] );     // '–'


This code DOES NOT WORK! 

So you had the file interpreted as ANSI and converted into UTF16, with all the "weird" chars just widened ($AB => $00AB). And you had your UTF16-encoded literals defined in the same way, because the IDE thought the source file was ANSI. Then, in the new version, the option changed to UTF8, and your literals, which together form a valid UTF8 compound char, turned into a single UTF16 char which is not contained in the source string.

That's my version.

5 hours ago, Lars Fosdal said:

@David Schwartz - You wouldn't happen to have an original file uploaded as an attachment to a post here, so that we can try some conversions?

It's UTF8. We don't need to check any more. And you don't need any more information than is in the original post. 

On 4/4/2023 at 10:57 AM, David Schwartz said:

The files have lots of things like â€™ and â€“ and â€¦ scattered throughout.

 

I found a table that shows what they're supposed to be and wrote this to convert them (strs points to a memo.Lines property):

 

Run these oddly-looking Mojibake character sequences through the ftfy ("fixes text for you") analyser, which lists the encoding/decoding steps needed to get from the oddly-looking text to proper text, then repeat those steps in Delphi code (for instance using the TEncoding class).

This is way better than using a conversion table, because that table will likely be incomplete.

It also solves your problem where your Delphi source code apparently got mangled, undoing your table-based conversion workaround.

That code mangling can have lots of causes, including hard-to-reproduce bugs in the Delphi IDE itself or in plugins used by the IDE.

 

BTW: if you install poppler (for instance through Chocolatey), the included pdftotext console executable can extract text from PDF files for you.
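
For instance (file names illustrative; the -enc switch pins the output encoding):

pdftotext -enc UTF-8 extracted.pdf extracted.txt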

Edited by jeroenp

5 hours ago, jeroenp said:

Run these oddly-looking Mojibake character sequences through the ftfy ("fixes text for you") analyser, which lists the encoding/decoding steps needed to get from the oddly-looking text to proper text, then repeat those steps in Delphi code (for instance using the TEncoding class).

I'd just read them using the UTF8 encoding in the first place and so never ever see these characters. I'm sure you would too. 


This entire thread blows my mind. The number of people who think it's normal to read UTF8 as though it were ANSI. 

15 hours ago, David Heffernan said:

I'd just read them using the UTF8 encoding in the first place and so never ever see these characters. I'm sure you would too. 

That would be my first try too.

It could just as well be that the on-line PDF-to-text exporter makes an odd encoding error (it wouldn't be the first tool or site doing strange encoding stuff, hence the series of blog posts at https://wiert.me/category/mojibake/), which is why I mentioned ftfy: it's a great tool for figuring out encoding issues.

 

Looking at https://ftfy.vercel.app/?s=… (and hoping this forum does not mangle that URL), two encode/decode round-trips are required for the fix, so it does not look like a plain "read using UTF8" case:

 

s = s.encode('latin-1')
s = s.decode('utf-8')
s = s.encode('sloppy-windows-1252')
s = s.decode('utf-8')
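
In Delphi terms, those four steps might look roughly like this (a sketch: UnMojibake is a hypothetical helper, TEncoding.GetEncoding instances must be freed, and strict Windows-1252 only approximates ftfy's sloppy-windows-1252):

uses System.SysUtils;

function UnMojibake(const S: string): string;
var
  Latin1, CP1252: TEncoding;
begin
  Latin1 := TEncoding.GetEncoding(28591); // ISO-8859-1
  CP1252 := TEncoding.GetEncoding(1252);  // Windows-1252
  try
    // s.encode('latin-1') followed by s.decode('utf-8')
    Result := TEncoding.UTF8.GetString(Latin1.GetBytes(S));
    // s.encode('sloppy-windows-1252') followed by s.decode('utf-8')
    Result := TEncoding.UTF8.GetString(CP1252.GetBytes(Result));
  finally
    CP1252.Free;
    Latin1.Free;
  end;
end;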

 

Edited by jeroenp (ftfy example)
