dummzeuch Posted March 7, 2020
Recently I came across a problem with INI files: some editors (including recent versions of Windows Notepad) add a byte order mark (BOM) to the files they save. In particular, the BOM for UTF-8 kept appearing in INI files, which were then read incorrectly by the Delphi 2007 implementation of TMemIniFile (I guess the same applies to all pre-Unicode versions of Delphi). This was especially a problem with programs that used TJvAppIniStorage for streaming application settings to disk. (TJvAppIniStorage internally uses TMemIniFile.) So I tried to fix this ... read on in my blog post
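To make the failure mode concrete, here is a minimal sketch in Python rather than Delphi. It is not the fix from the blog post and says nothing about TMemIniFile internals; it only shows how a leading UTF-8 BOM can hide the first section header from an INI parser, and how stripping it on decode avoids that. The file content and the helper names `parse_ini` and `bom_breaks_parsing` are made up for illustration.

```python
import configparser

UTF8_BOM = b"\xef\xbb\xbf"

# Hypothetical INI content as saved by an editor that prepends a UTF-8 BOM.
RAW_INI = UTF8_BOM + "[Settings]\nName=Müller\n".encode("utf-8")


def parse_ini(data: bytes, encoding: str) -> configparser.ConfigParser:
    """Decode raw bytes with the given codec and parse them as INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(data.decode(encoding))
    return parser


def bom_breaks_parsing(data: bytes) -> bool:
    """Return True if decoding as plain UTF-8 makes the parser reject the file."""
    try:
        # Plain "utf-8" keeps the BOM as U+FEFF in front of "[Settings]",
        # so the first line no longer looks like a section header.
        parse_ini(data, "utf-8")
    except configparser.MissingSectionHeaderError:
        return True
    return False


if __name__ == "__main__":
    print(bom_breaks_parsing(RAW_INI))
    # "utf-8-sig" drops a leading BOM if present and otherwise acts like "utf-8".
    print(parse_ini(RAW_INI, "utf-8-sig")["Settings"]["Name"])
```

The same `utf-8-sig` decoding also accepts files without a BOM, so a reader written this way copes with both variants.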
Fr0sT.Brutal Posted March 10, 2020
IMHO the BOM causes more trouble than it's worth. I just remove it from every file I have.
dummzeuch Posted March 10, 2020
If you only ever handle files created with Windows programs, that's probably fine. Unfortunately I also get files from other systems, where the BOM might be useful.
Lars Fosdal Posted March 10, 2020
I try to ensure that all my text files have a BOM and are encoded as UTF-8.
timfrost Posted March 10, 2020
Exactly. Stripping out every UTF-8 BOM is not viable unless all users of the files will only ever use a single national language and its encoding.
Fr0sT.Brutal Posted March 11, 2020
Without a BOM, every file-handling utility remains usable, even one that knows nothing about UTF-8. I went through some shamanic dances with XE2 and the resource compiler trying to make them handle UTF-8 properly, and they all disliked the BOM. Now 10.1 seems able to handle UTF-8 with a BOM, which is good, but old habits remain for a while.
A.M. Hoornweg Posted March 11, 2020
23 hours ago, Fr0sT.Brutal said:
"IMHO the BOM causes more trouble than it's worth. I just remove it from every file I have."
That's only safe if your character set is limited to ASCII. Once the file contains any character > #127, it is unclear which character it is, because how is the application to know which code page the file is using? My code / annotations contain lots of umlauts and maths symbols.
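The ambiguity described here can be sketched in a few lines. This is only an illustration in Python with a made-up sample word; the two code pages (Windows-1252 and Windows-1251) are arbitrary picks, and `interpret` is a hypothetical helper name.

```python
# Sample bytes: the word "Größe" as an editor using Windows-1252 would save it,
# with no BOM and no other hint about the encoding.
RAW = b"Gr\xf6\xdfe"


def interpret(data: bytes, code_pages=("cp1252", "cp1251")) -> dict:
    """Decode the same bytes under several single-byte code pages."""
    return {cp: data.decode(cp) for cp in code_pages}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Bytes above 127 map to different characters depending on the code page.
    for code_page, text in interpret(RAW).items():
        print(code_page, text)
    print("valid UTF-8:", is_valid_utf8(RAW))
```

Every single-byte code page accepts these bytes without complaint, which is why the reader cannot tell from the file alone which interpretation was intended.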
Rollo62 Posted March 11, 2020
16 hours ago, Lars Fosdal said:
"I try to ensure that all my text files have a BOM and are encoded as UTF-8."
I do create my files with BOM too, and additionally try to keep existing status when loading alien documents. That way, all parties should be happy. I try to follow these rules:
File has BOM --> Keep BOM
File without BOM --> Keep without BOM
Create my own file --> Always use BOM
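The three rules above can be sketched as a small round-trip helper. This is an illustration in Python, not code from the thread; the function names `split_bom`, `join_bom`, `edit_existing` and `create_new` are hypothetical.

```python
import codecs


def split_bom(data: bytes) -> tuple:
    """Return (had_bom, text) for UTF-8 data with or without a leading BOM."""
    had_bom = data.startswith(codecs.BOM_UTF8)
    payload = data[len(codecs.BOM_UTF8):] if had_bom else data
    return had_bom, payload.decode("utf-8")


def join_bom(text: str, with_bom: bool) -> bytes:
    """Encode text as UTF-8, optionally prefixed with the UTF-8 BOM."""
    payload = text.encode("utf-8")
    return codecs.BOM_UTF8 + payload if with_bom else payload


def edit_existing(data: bytes, transform) -> bytes:
    """Rules 1 and 2: an existing file keeps whatever BOM status it had."""
    had_bom, text = split_bom(data)
    return join_bom(transform(text), had_bom)


def create_new(text: str) -> bytes:
    """Rule 3: files created by this program always get a BOM."""
    return join_bom(text, with_bom=True)


if __name__ == "__main__":
    foreign = "Größe=1\n".encode("utf-8")  # arrived without a BOM
    print(edit_existing(foreign, str.upper))
    print(create_new("Größe=1\n"))
```

Keeping the BOM decision separate from the text content means the editing logic never has to care about it.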
Fr0sT.Brutal Posted March 11, 2020
3 hours ago, A.M. Hoornweg said:
"That's only safe if your character set is limited to ASCII. Once the file contains any character > #127, it is unclear which character it is, because how is the application to know which code page the file is using? My code / annotations contain lots of umlauts and maths symbols."
I just try to use UTF-8 everywhere and get rid of all ANSI encodings completely.
A.M. Hoornweg Posted March 11, 2020
1 hour ago, Fr0sT.Brutal said:
"I just try to use UTF-8 everywhere and get rid of all ANSI encodings completely."
I prefer that, too. The only question is whether we should write a BOM or not. In regular Windows INI files it won't work. In TMemIniFile it will. In XML it's optional. In Linux it is frowned upon because everything is supposed to be UTF-8. In Windows every text is some flavor of ANSI unless it has a BOM.
A.M. Hoornweg Posted March 11, 2020
5 hours ago, Fr0sT.Brutal said:
"Without a BOM, every file-handling utility remains usable, even one that knows nothing about UTF-8. I went through some shamanic dances with XE2 and the resource compiler trying to make them handle UTF-8 properly, and they all disliked the BOM. Now 10.1 seems able to handle UTF-8 with a BOM, which is good, but old habits remain for a while."
Could you give us a short explanation of how to compile a UTF-8 RC file containing Unicode strings?
Fr0sT.Brutal Posted March 12, 2020
18 hours ago, A.M. Hoornweg said:
"Could you give us a short explanation of how to compile a UTF-8 RC file containing Unicode strings?"
Just as usual, but in Project options > Resource compiler set Code page=65001 and Multi-byte=True. I also set Resource compiler to Windows SDK Resource Compiler because my dialog designer generates RC that Borland's compiler doesn't understand. But AFAICT brcc32 works well too for files included as RCDATA.
Fr0sT.Brutal Posted March 19, 2020
Regarding BOM support: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#Programming_platforms
Quote: Microsoft's compilers often fail at producing UTF-8 string constants from UTF-8 source files. The most reliable method is to turn off UNICODE, not mark the input file as being UTF-8 (i.e. do not use a BOM), and arrange the string constants to have the UTF-8 bytes. If a BOM was added, a Microsoft compiler will interpret the strings as UTF-8, convert them to UTF-16, then convert them back into the current locale, thus destroying the UTF-8.[13] Without a BOM and using a single-byte locale, Microsoft compilers will leave the bytes in a quoted string unchanged.
Lars Fosdal Posted March 19, 2020
https://docs.microsoft.com/en-us/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8?view=vs-2019
Joseph MItzen Posted March 22, 2020
On 3/10/2020 at 12:03 PM, dummzeuch said:
"If you only ever handle files created with Windows programs, that's probably fine. Unfortunately I also get files from other systems, where the BOM might be useful."
I thought it was only Windows programs that ever added the BOM in the first place. At least in the Linux development community they tend to consider it a Windows thing. Wikipedia says...
Quote: The Unicode Standard permits the BOM in UTF-8 but does not require or recommend its use. Byte order has no meaning in UTF-8.... Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.
So basically no one should be expecting a BOM in a UTF-8 file (and it's not required), so I don't believe you'd run into problems omitting it.
Joseph MItzen Posted March 22, 2020
On 3/10/2020 at 3:45 PM, Lars Fosdal said:
"I try to ensure that all my text files have a BOM and are encoded as UTF-8."
A BOM is not required by the UTF-8 standard, and it is not recommended either.
Quote: Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature
Joseph MItzen Posted March 22, 2020
On 3/10/2020 at 6:35 PM, timfrost said:
"Exactly. Stripping out every UTF-8 BOM is not viable unless all users of the files will only ever use a single national language and its encoding."
UTF-8 doesn't need a BOM regardless of encoding.
Quote: Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM.
A.M. Hoornweg Posted March 23, 2020
On 3/22/2020 at 7:07 AM, Joseph MItzen said:
"UTF-8 doesn't need a BOM regardless of encoding."
But a program (such as an editor) needs to know if the file is UTF-8 in the first place. Without a BOM, it must guess.
Fr0sT.Brutal Posted March 24, 2020
22 hours ago, A.M. Hoornweg said:
"But a program (such as an editor) needs to know if the file is UTF-8 in the first place. Without a BOM, it must guess."
If that's a text editor, it should have some encoding-guessing heuristics or a manual setting. Anyway, editors should just assume UTF-8 by default so that these tricky ANSI code pages become history. UTF-8 was designed so that, for the basic Latin character set, it is 100% identical to ASCII, and most older tools that take single-byte encodings can work without modification. (Of course, those which do char-by-char tokenization or process string content will need modification anyway, but that's a minor part; at least in all my projects I almost never had a requirement to know whether a string is UTF-8 or whatever.) A BOM obviously breaks this compatibility; moreover, it turns a "plain text" file into something with an invisible header, so that "empty file" <> "file of 0 bytes" anymore.
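One simple guessing heuristic of the kind mentioned here can be sketched as follows. This is an assumption about what such a heuristic might look like, not how any particular editor works: honour a BOM if present, otherwise try strict UTF-8, and only then fall back to an ANSI code page. The function name `guess_decode` and the `cp1252` fallback are illustrative choices.

```python
import codecs


def guess_decode(data: bytes, fallback: str = "cp1252") -> tuple:
    """Return (label, text), guessing the encoding of raw file bytes."""
    if data.startswith(codecs.BOM_UTF8):
        # The BOM is an explicit signature; strip it before decoding.
        return "utf-8-sig", data[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        # Text in a legacy code page with bytes above 127 is rarely valid UTF-8,
        # so a clean strict decode is a strong (not perfect) hint.
        return "utf-8", data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback code page; a real tool would let the user choose.
        return fallback, data.decode(fallback, errors="replace")


if __name__ == "__main__":
    print(guess_decode("Größe".encode("utf-8")))
    print(guess_decode("Größe".encode("cp1252")))
    # A file holding only a BOM has no text, yet it is 3 bytes long.
    print(guess_decode(codecs.BOM_UTF8), len(codecs.BOM_UTF8))
```

The heuristic cannot distinguish encodings for pure ASCII content, but in that case every candidate decodes to the same text anyway.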
Gustav Schubert Posted March 24, 2020
On 3/23/2020 at 10:54 AM, A.M. Hoornweg said:
"But a program (such as an editor) needs to know if the file is UTF-8"
Such as the Delphi IDE when loading Pascal files. Visual Studio Code, which I use to do git, will assume UTF-8. Sharing a file between the two environments only works if a BOM is present. And my merge tool (Araxis) also needs to be configured to copy with BOM. So for me it is three programs which need to be fine-tuned as a set. (When I work on a Jekyll project, the BOM is forbidden, and I have to switch the configuration of the merge tool.)
Fr0sT.Brutal Posted March 24, 2020
38 minutes ago, Gustav Schubert said:
"Such as the Delphi IDE when loading Pascal files"
I find the "Project options > Compiler > Code page" setting very useful. Though newer versions seem to not need it anymore.
Gustav Schubert Posted March 24, 2020
20 minutes ago, Fr0sT.Brutal said:
"newer versions seem to not need it anymore"
I am using the latest CE. If I open a file without a BOM in Delphi, which was saved from VS Code, it is wrongly interpreted as ANSI. Changes to project options would go into the .dproj file, which I do not want to rely on. And further, I do not want to rely on the configuration of the IDE when I come to another machine and check out a project. It is a good thing if code reading INI files can cope with a BOM. Whether you want your files to have a BOM may be determined by factors that have nothing to do with INI files.