David Schwartz Posted April 4, 2023

I have some PDF files that I ran through a text extractor to get simple text files (.txt). I assumed they were ASCII text, but it appears not. The files have lots of things like ’ and – and … scattered throughout. I found a table that shows what they're supposed to be and wrote this to convert them (strs points to a memo.Lines property):

var ln := '';
strs.BeginUpdate;
for var n := 0 to strs.Count-1 do
begin
  ln := StringReplace( strs[n], '➤', '>', [rfReplaceAll] );  // '➤'
  ln := StringReplace( ln, '’', '''', [rfReplaceAll] );    // '’'
  ln := StringReplace( ln, '“', '"', [rfReplaceAll] );     // '“'
  ln := StringReplace( ln, 'â€', '"', [rfReplaceAll] );       // 'â€'
  ln := StringReplace( ln, '…', '...', [rfReplaceAll] );   // '…'
  ln := StringReplace( ln, 'â€"', '--', [rfReplaceAll] );     // 'â€"'
  ln := StringReplace( ln, '–', '--', [rfReplaceAll] );      // '–'
  strs[n] := ln;
end;
strs.EndUpdate;

This worked for a little while, until the Delphi IDE (10.4.2) unexpectedly decided to convert all of the string literals into actual Unicode characters, and then it stopped working since StringReplace didn't find any of them in the text. Ugh. I corrected it here before pasting this code, and hopefully it won't get changed here as well.

For my purposes, these characters are irrelevant. I'm replacing them with ASCII characters so they make sense if you're reading the text; whether they're ASCII or Unicode doesn't matter. I found a table here: https://www.i18nqa.com/debug/utf8-debug.html and it says an apostrophe can be represented in several ways.

How can I replace a 2- or 3-char literal like â € ™ with one of these codes so the compiler doesn't change them back to Unicode representations? Is there a simpler way to do this? Depending on what I'm using to look at the text data files, they may appear as their "real" Unicode representation, or they may appear as 2- or 3-char gibberish. I just need ASCII text that comes close to what they represent.
Lars Fosdal Posted April 4, 2023

@David Schwartz - This looks like MBCS encoding - the old ANSI multibyte character set encoding scheme in Windows. The ANSI routines should be capable of converting the strings to Unicode, but they depend on knowing the appropriate code page. https://docwiki.embarcadero.com/RADStudio/Alexandria/en/Commonly_Used_Routines_for_AnsiStrings
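[Editor's note: a minimal sketch of what that code-page-aware conversion could look like, assuming the file's raw bytes have been loaded into a RawByteString and that Windows-1252 is the right code page - neither of which the thread confirms:]

var
  Raw: RawByteString;
  S: string;
begin
  // Raw holds the bytes read from the file. Tag them with a code page
  // without converting anything (Convert = False just sets the tag):
  SetCodePage(Raw, 1252, False);
  // The implicit cast now converts 1252 -> UTF-16 using the tagged code page.
  S := string(Raw);
end;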
David Schwartz Posted April 4, 2023

I don't want or need them in Unicode -- I want plain ASCII or AnsiStrings. That's what I'm trying to do here -- the problem is that the IDE is changing the MBCS sequences back to Unicode, so the StringReplace isn't doing what I want it to do. I'm wondering how to rewrite the StringReplace calls so they match the actual text, without the IDE translating them into Unicode.
Kryvich Posted April 4, 2023

@David Schwartz What type is ln, AnsiString or String? Try

ln := StringReplace( strs[n], AnsiString('➤'), '>', [rfReplaceAll] );
timfrost Posted April 4, 2023

Can you not find a better 'text extractor' which produces more useful output?
Stefan Glienke Posted April 4, 2023

22 minutes ago, Kryvich said:
@David Schwartz What type is ln, AnsiString or String? Try ln := StringReplace( strs[n], AnsiString('➤'), '>', [rfReplaceAll] );

That would be quite nonsensical, given that strs is TStrings, as David wrote ("strs points to a memo.Lines property").

35 minutes ago, David Schwartz said:
I don't want or need them in Unicode -- I want plain ASCII or Ansi Strings.

Then don't use a Memo and its Lines property, I would say - they are Unicode.
David Schwartz Posted April 4, 2023

You guys are totally missing the point here. The code in the Delphi IDE looks like this:

ln := StringReplace( strs[n], '➤', '>', [rfReplaceAll] );  // '➤'
ln := StringReplace( ln, '’', '''', [rfReplaceAll] );    // '’'
ln := StringReplace( ln, '“', '"', [rfReplaceAll] );     // '“'
ln := StringReplace( ln, 'â€', '"', [rfReplaceAll] );       // 'â€'
ln := StringReplace( ln, '…', '...', [rfReplaceAll] );   // '…'
ln := StringReplace( ln, 'â€"', '--', [rfReplaceAll] );     // 'â€"'
ln := StringReplace( ln, '–', '--', [rfReplaceAll] );      // '–'

This code works fine. And at one point I opened Delphi and the same code now looked like this:

ln := StringReplace( strs[n], '➤', '>', [rfReplaceAll] ); // '➤'
ln := StringReplace( ln, '’', '''', [rfReplaceAll] );     // '’'
ln := StringReplace( ln, '“', '"', [rfReplaceAll] );      // '“'
ln := StringReplace( ln, '”', '"', [rfReplaceAll] );      // '”'
ln := StringReplace( ln, '…', '...', [rfReplaceAll] );    // '…'
ln := StringReplace( ln, '—', '--', [rfReplaceAll] );     // '—'
ln := StringReplace( ln, '—', '--', [rfReplaceAll] );     // '—'

This code DOES NOT WORK! My text does NOT contain these Unicode characters! It contains 2- and 3-char representations. I even tried something like this:

ln := StringReplace( ln, 'â'+'€œ', '"', [rfReplaceAll] ); // '“'

It did not work either.
Fr0sT.Brutal Posted April 4, 2023

It seems Project options > Codepage has been set to 65001 (UTF-8). Just define the fragments to replace with numeric character codes, or try defining them char by char ('a'+'b'+'c').
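[Editor's note: for example, a sketch of the numeric-code approach. It assumes the text was loaded as Windows-1252, so each UTF-8 byte was widened to the corresponding UTF-16 char (bytes $E2 $80 $99 become #$00E2 #$20AC #$2122); the exact code points depend on which ANSI code page actually did the loading:]

// Numeric literals survive any re-encoding of the source file by the IDE.
ln := StringReplace( ln, #$00E2#$20AC#$2122, '''', [rfReplaceAll] );  // mojibake for ’  -> '
ln := StringReplace( ln, #$00E2#$20AC#$0153, '"',  [rfReplaceAll] );  // mojibake for “  -> "
ln := StringReplace( ln, #$00E2#$20AC#$201D, '--', [rfReplaceAll] );  // mojibake for —  -> --
ln := StringReplace( ln, #$00E2#$20AC#$00A6, '...', [rfReplaceAll] ); // mojibake for …  -> ...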
David Heffernan Posted April 4, 2023

Isn't the real problem that you have interpreted UTF-8 encoded data as though it were ANSI?

Quote
I have some PDF files that I ran through a text extractor to get simple text files (.txt). I assumed they were ASCII text, but it appears not.

I mean, it's clearly not ASCII, because none of the characters in your code are in the ASCII set. You can actually delete all of these StringReplace calls by simply using the correct encoding for your extracted data.
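[Editor's note: a sketch of what that looks like, assuming the extractor's output really is UTF-8; Memo1 and the file name are placeholders, not from the thread:]

// Load the extracted text, decoding it as UTF-8 instead of ANSI.
Memo1.Lines.LoadFromFile('extracted.txt', TEncoding.UTF8);
// The lines now contain real ’ “ ” — … characters; no mojibake to clean up.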
Lars Fosdal Posted April 4, 2023

6 hours ago, David Schwartz said:
I'm replacing them with ASCII characters so they make sense if you're reading the text.

Wouldn't converting the chars to Unicode solve that problem? All strings in modern Delphi components use Unicode. I don't understand why you don't want to handle the text as what it is. Once you have the text as Unicode, you also get all the nice TCharHelper functions to understand what kind of character you are looking at, in case you want to do more manipulations - a lot better and more robust than string replacements.
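[Editor's note: a minimal sketch of that kind of character-level pass, assuming the text has already been decoded correctly and that a crude '?' fallback is acceptable; a real mapping table for known characters would be better:]

uses
  System.SysUtils, System.Character;

function Asciify(const S: string): string;
var
  SB: TStringBuilder;
  Ch: Char;
begin
  SB := TStringBuilder.Create;
  try
    for Ch in S do
      if Ord(Ch) < 128 then
        SB.Append(Ch)       // plain ASCII passes through unchanged
      else if Ch.IsWhiteSpace then
        SB.Append(' ')      // exotic spaces become a plain space
      else
        SB.Append('?');     // everything else needs an explicit mapping
    Result := SB.ToString;
  finally
    SB.Free;
  end;
end;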
David Schwartz Posted April 5, 2023

On 4/4/2023 at 4:55 AM, David Heffernan said:
You can actually delete all of these StringReplace calls by simply using the correct encoding for your extracted data.

That was not an option on the extraction tool; you upload the file, it processes the file, then you click a Download button.

21 hours ago, Lars Fosdal said:
Wouldn't converting the chars to Unicode solve that problem? I don't understand why you don't want to handle the text as what it is.

Yes, but how can I do that? As for why: because these characters are unnecessary. Deleting them would make no difference to the result I'm after, but having them as weird text IS a problem.
David Heffernan Posted April 5, 2023

30 minutes ago, David Schwartz said:
That was not an option on the extraction tool; you upload the file, it processes the file, then you click a Download button.

It's when you read the output into Delphi that there's a problem. Your tool is emitting UTF-8 encoded text, but you are interpreting it as ANSI. The tool is fine. Your code is not.
A.M. Hoornweg Posted April 5, 2023

Just open the extracted *.txt files in Notepad++ and try out the different encoding options that this program offers until the files display correctly. Then save them as "Unicode with BOM". TStringList.LoadFromFile will load the files correctly even if they contain foreign characters.
Brian Evans Posted April 5, 2023

You see this most often when ASCII text is automatically cleaned up typographically for printing, with conversions like dash to em dash. If that text is then interpreted as ASCII/ANSI bytes, the various UTF-8 encodings of the typographical replacements end up as multiple characters each - like a minus/dash that was converted to em dash ending up as †".

You need to find where the problem is: the data in the PDF itself could already be corrupted this way, or it can happen at some other stage, including in the PDF -> text conversion or in how you load the text.

Often, even if you do interpret the encodings correctly - so there is a — (em dash) instead of †" - the equivalent replacements might be worthwhile to convert the text back to plain ASCII.
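[Editor's note: a sketch of those equivalent replacements applied to correctly decoded text, i.e. matching the real Unicode characters rather than their mojibake forms; the code points are standard Unicode values, not taken from the thread:]

ln := StringReplace( ln, #$2019, '''', [rfReplaceAll] ); // ’ right single quote
ln := StringReplace( ln, #$201C, '"',  [rfReplaceAll] ); // “ left double quote
ln := StringReplace( ln, #$201D, '"',  [rfReplaceAll] ); // ” right double quote
ln := StringReplace( ln, #$2013, '-',  [rfReplaceAll] ); // – en dash
ln := StringReplace( ln, #$2014, '--', [rfReplaceAll] ); // — em dash
ln := StringReplace( ln, #$2026, '...', [rfReplaceAll] ); // … ellipsis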
David Heffernan Posted April 5, 2023

4 hours ago, A.M. Hoornweg said:
Just open the extracted *.txt files in Notepad++ and try out the different encoding options that this program offers until the files display correctly. Then save them as "Unicode with BOM". TStringList.LoadFromFile will load the files correctly even if they contain foreign characters.

No. We know the text is UTF-8 encoded, so just load it specifying that encoding. No point adding an extra step.
David Heffernan Posted April 5, 2023

4 hours ago, Brian Evans said:
You need to find where the problem is: the data in the PDF itself could already be corrupted this way, or it can happen at some other stage, including in the PDF -> text conversion or in how you load the text.

The text is clearly UTF-8 encoded. That much we already know.
Lars Fosdal Posted April 6, 2023

@David Schwartz - You wouldn't happen to have an original file uploaded as an attachment to a post here, so that we can try some conversions?
Fr0sT.Brutal Posted April 6, 2023

On 4/4/2023 at 1:56 PM, David Schwartz said:
The code in the Delphi IDE looks like this: [...] This code works fine. And at one point I opened Delphi and the same code now looked like this: [...] This code DOES NOT WORK!

So you had the file interpreted as ANSI and converted into UTF-16 with all the "weird" chars just widened ($AB => $00AB), and your UTF-16 string literals were defined the same way, because the IDE thought the source file was ANSI. Then, in the new version, the option changed to UTF-8, and the literal chars that together form a valid UTF-8 multi-byte sequence turned into a single UTF-16 char, which is not contained in the source string. That's my version.
David Heffernan Posted April 6, 2023

5 hours ago, Lars Fosdal said:
@David Schwartz - You wouldn't happen to have an original file uploaded as an attachment to a post here, so that we can try some conversions?

It's UTF8. We don't need to check any more. And you don't need any more information than is in the original post.
jeroenp Posted April 7, 2023

On 4/4/2023 at 10:57 AM, David Schwartz said:
The files have lots of things like ’ and – and … scattered throughout. I found a table that shows what they're supposed to be and wrote this to convert them (strs points to a memo.Lines property):

Run these oddly looking Mojibake character sequences through the ftfy ("fixes text for you") analyser, which lists the encoding/decoding steps needed to get from the oddly looking text to proper text, then repeat those encoding steps in Delphi code (using, for instance, the TEncoding class). This is way better than using a conversion table, because that table will likely be incomplete. It also solves your problem where your Delphi source code apparently got mangled, undoing your table-based conversion workaround. That code mangling can have lots of causes, including hard-to-reproduce bugs in the Delphi IDE itself or plugins used by the IDE.

BTW: if you install poppler (for instance through Chocolatey), the included pdftotext console executable can extract text from PDF files for you.
David Heffernan Posted April 7, 2023

5 hours ago, jeroenp said:
Run these oddly looking Mojibake character sequences through the ftfy ("fixes text for you") analyser, which lists the encoding/decoding steps needed to get from the oddly looking text to proper text, then repeat those encoding steps in Delphi code (using, for instance, the TEncoding class).

I'd just read them using the UTF8 encoding in the first place and so never ever see these characters. I'm sure you would too.
Javier Tarí Posted April 8, 2023

I would just substitute the original characters for their equivalent codes, in the #99 or #$ab format.
renna Posted April 8, 2023

It's UTF8, so there's no need to check any more.
David Heffernan Posted April 8, 2023

This entire thread blows my mind. The number of people who think it's normal to read UTF8 as though it were ANSI.
jeroenp Posted April 8, 2023

15 hours ago, David Heffernan said:
I'd just read them using the UTF8 encoding in the first place and so never ever see these characters. I'm sure you would too.

That would be my first try too, since it could just as well be the odd on-line PDF-to-text exporter making an encoding error (it wouldn't be the first tool or site doing strange encoding stuff, hence the series of blog posts at https://wiert.me/category/mojibake/ ), which is why I mentioned ftfy: it's a great tool for helping to figure out encoding issues.

Looking at https://ftfy.vercel.app/?s=… (and hoping this forum does not mangle that URL), two encode/decode steps are required for the fix, so it does not look like a plain "read using UTF8" solution:

s = s.encode('latin-1')
s = s.decode('utf-8')
s = s.encode('sloppy-windows-1252')
s = s.decode('utf-8')
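[Editor's note: a rough Delphi sketch of one such encode/decode pair, assuming TEncoding.GetEncoding(28591) for Latin-1; ftfy's 'sloppy-windows-1252' has no exact TEncoding counterpart, code page 1252 being the closest approximation for the second pass:]

var
  Latin1: TEncoding;
  Bytes: TBytes;
  S: string;
begin
  // S holds the mis-decoded text.
  Latin1 := TEncoding.GetEncoding(28591);  // ISO-8859-1 (Latin-1)
  try
    Bytes := Latin1.GetBytes(S);           // undo the wrong Latin-1 decode
    S := TEncoding.UTF8.GetString(Bytes);  // redo the decode as UTF-8
  finally
    Latin1.Free;                           // custom encodings must be freed
  end;
end;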