dummzeuch

Skipping the UTF-8 BOM with TMemIniFile in Delphi 2007

Recommended Posts

Recently I came across a problem with INI files: some editors (including recent versions of Windows Notepad) add a byte order mark (BOM) to the files they save. The UTF-8 BOM in particular kept appearing in INI files, which were then read incorrectly by the Delphi 2007 implementation of TMemIniFile (I guess the same applies to all pre-Unicode versions of Delphi). This was especially a problem with programs that use TJvAppIniStorage for streaming application settings to disk, since TJvAppIniStorage internally uses TMemIniFile. So I tried to fix this ...

read on in my blog post


IMHO the BOM causes more trouble than benefit. I just remove it from every file I have.


If you only ever handle files created with Windows programs, that's probably fine. Unfortunately I also get files from other systems, where the BOM might be useful.


I try to ensure that all my text files have a BOM and are encoded as UTF-8.

 


Exactly.  Stripping out every UTF-8 BOM is not viable unless all users of the files will only ever use a single national language and its encoding.


Without a BOM, every file-handling utility is usable, even those that know nothing about UTF-8. I had to do some shaman dances with XE2 and the resource compiler to make them handle UTF-8 properly, and they all disliked the BOM. Now 10.1 seems able to handle UTF-8 with a BOM, which is good, but old habits remain for a while.

23 hours ago, Fr0sT.Brutal said:

IMHO the BOM causes more trouble than benefit. I just remove it from every file I have.

That's only safe if your character set is limited to ASCII.

Once it contains any character > #127, it is unclear which character it is, because how is the application to know the code page the file is using? My code / annotations contain lots of umlauts and maths symbols.

16 hours ago, Lars Fosdal said:

I try to ensure that all my text files have a BOM and are encoded as UTF-8.

 

I create my files with a BOM too, and additionally try to keep the existing status when loading alien documents.

That way, all parties should be happy.

 

I try to follow these rules:

File has a BOM --> keep the BOM

File has no BOM --> keep it without a BOM

Creating my own file --> always write a BOM

 

3 hours ago, A.M. Hoornweg said:

That's only safe if your character set is limited to ASCII.

Once it contains any character > #127, it is unclear which character it is, because how is the application to know the code page the file is using? My code / annotations contain lots of umlauts and maths symbols.

I just try to use UTF-8 everywhere and get rid of all the ANSI encodings completely.

1 hour ago, Fr0sT.Brutal said:

I just try to use UTF-8 everywhere and get rid of all the ANSI encodings completely.

I prefer that, too. The question is just whether we should write a BOM or not.

In regular Windows INI files it won't work.

In TMemIniFile it will.

In XML it's optional.

On Linux it is frowned upon because everything is supposed to be UTF-8.

On Windows, every text file is treated as some flavor of ANSI unless it has a BOM.

 

5 hours ago, Fr0sT.Brutal said:

Without a BOM, every file-handling utility is usable, even those that know nothing about UTF-8. I had to do some shaman dances with XE2 and the resource compiler to make them handle UTF-8 properly, and they all disliked the BOM. Now 10.1 seems able to handle UTF-8 with a BOM, which is good, but old habits remain for a while.

Could you give us a short explanation of how to compile a UTF-8 RC file containing Unicode strings?

18 hours ago, A.M. Hoornweg said:

Could you give us a short explanation of how to compile a UTF-8 RC file containing Unicode strings?

Just as usual, but in Project options > Resource compiler set Code page = 65001 and Multi-byte = True. I also set the resource compiler to the Windows SDK Resource Compiler, because my dialog designer generates RC files that Borland's compiler doesn't understand. But AFAICT brcc32 works well too for files included as RCDATA.

Edited by Fr0sT.Brutal


Regarding BOM support

https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#Programming_platforms

Quote

Microsoft's compilers often fail at producing UTF-8 string constants from UTF-8 source files. The most reliable method is to turn off UNICODE, not mark the input file as being UTF-8 (i.e. do not use a BOM), and arrange the string constants to have the UTF-8 bytes. If a BOM was added, a Microsoft compiler will interpret the strings as UTF-8, convert them to UTF-16, then convert them back into the current locale, thus destroying the UTF-8.[13] Without a BOM and using a single-byte locale, Microsoft compilers will leave the bytes in a quoted string unchanged.

 

On 3/10/2020 at 12:03 PM, dummzeuch said:

If you only ever handle files created with Windows programs, that's probably fine. Unfortunately I also get files from other systems, where the BOM might be useful.

I thought it was only Windows programs that ever added the BOM in the first place; at least the Linux development community tends to consider it a Windows thing.
Wikipedia says...

Quote


The Unicode Standard permits the BOM in UTF-8 but does not require or recommend its use. Byte order has no meaning in UTF-8.... Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.

 

So basically no one should be expecting a BOM in a UTF-8 file (and it's not required), so I don't believe you'd run into problems omitting it.

On 3/10/2020 at 3:45 PM, Lars Fosdal said:

I try to ensure that all my text files have a BOM and are encoded as UTF-8.

 

A BOM is both not required by the UTF-8 standard and discouraged. Quoting the standard: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

On 3/10/2020 at 6:35 PM, timfrost said:

Exactly.  Stripping out every UTF-8 BOM is not viable unless all users of the files will only ever use a single national language and its encoding.

UTF8 doesn't need a BOM regardless of encoding.

Quote

Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM.

 

On 3/22/2020 at 7:07 AM, Joseph MItzen said:

UTF8 doesn't need a BOM regardless of encoding.

 

But a program (such as an editor) needs to know if the file is UTF8 in the first place. Without a BOM, it must guess.

 

22 hours ago, A.M. Hoornweg said:

But a program (such as an editor) needs to know if the file is UTF8 in the first place. Without a BOM, it must guess.

If that's a text editor, it should have some encoding-guessing heuristics or a manual setting. Anyway, editors should just assume UTF-8 by default so that these tricky ANSI code pages go into history.

 

UTF-8 was designed so that for the Latin character set it is 100% identical to the ASCII encoding, and most older tools that take single-byte encodings can work without modification (of course, those which do char-by-char tokenization or string content processing will need modification anyway, but that's a minor part; at least in all my projects I almost never had a requirement to know whether a string is UTF-8 or something else). The BOM obviously breaks this compatibility; moreover, it turns a "plain text" file into something with an invisible header, so that an "empty file" is no longer the same as a file of 0 bytes.
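Both halves of that argument are easy to demonstrate: strict UTF-8 decoding doubles as a common guessing heuristic (random ANSI bytes above #127 rarely form valid UTF-8 sequences), and a file holding nothing but a BOM is already three bytes long. A quick illustrative sketch in Python:

```python
# Two quick demonstrations of the points above.

def looks_like_utf8(data: bytes) -> bool:
    """A common guessing heuristic: bytes that decode strictly as
    UTF-8 almost certainly are UTF-8 (ASCII is a trivial subset)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Valid UTF-8 (German umlauts) passes; the same text in ANSI/Latin-1
# contains a lone 0xFC byte, which is not a valid UTF-8 sequence.
assert looks_like_utf8("Grüße".encode("utf-8"))
assert not looks_like_utf8(b"Gr\xfc\xdfe")

# And the "empty file" point: an otherwise empty file with a UTF-8 BOM
# is 3 bytes, not 0.
assert len(b"\xef\xbb\xbf") == 3
```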

On 3/23/2020 at 10:54 AM, A.M. Hoornweg said:

But a program (such as an editor) needs to know if the file is UTF8

Such as the Delphi IDE when loading pascal files. Visual Studio Code, which I use for git, will assume UTF-8. Using a file between the two environments only works if the BOM is present. And my merge tool (Araxis) also needs to be configured to cope with the BOM. So for me it is three programs which need to be fine-tuned as a set.

 

(When I work with a Jekyll project, the BOM is forbidden, and I have to switch the configuration of the merge tool.)

38 minutes ago, Gustav Schubert said:

Such as the Delphi IDE when loading pascal files

I find the "Project options > Compiler > Code page" setting very useful. Though newer versions seem to not need it anymore.

20 minutes ago, Fr0sT.Brutal said:

newer versions seem to not need it anymore

I am using the latest CE. If I open a file without a BOM in Delphi which was saved from VS Code, it is wrongly interpreted as ANSI.

 

Changes to project options would go into the .dproj file, which I do not want to rely on. Furthermore, I do not want to rely on the configuration of the IDE when I come to another machine and check out a project.

 

It is a good thing if code reading INI files can cope with a BOM. Whether you want your files to have a BOM may be determined by factors that have nothing to do with INI files.

