luebbe

How to calculate Unicode text width?


Hi everybody,

 

The task at hand is to write column-aligned text to an output file. The good old reliable Format function handles this fine unless CJK characters are involved. These generally occupy two Latin character columns, but I cannot blindly assume that a string to be formatted contains only CJK characters and simply multiply by two.

Example:
 

en   : "Coding:" - seven characters occupy seven columns
zh_CN: "编码:"   - three characters occupy five columns

I have found different solutions for different programming languages on SO, GitHub and other places, for example a wcwidth function in C, but nothing in Pascal (yet). Maybe I haven't uttered the proper incantations.

System.Character.TCharHelper doesn't have a property like UnicodeWidth (Full/Half).
Does the RTL offer any helper for this problem or do I have to roll my own?
Hints are greatly appreciated.


When I faced this problem, I used to write to the canvas and check the width afterwards. Now with FMX this is a bit easier through the TextLayout unit. If you are using FMX, check out http://docwiki.embarcadero.com/RADStudio/Alexandria/en/FireMonkey_Text_Layout. In a nutshell, you create a TextLayout object via a TextLayoutManager, which knows the canvas you are writing on. Then you set the properties you want the text to have (font, size, color, etc.), enter the text, and retrieve the TextWidth as a property of the TextLayout object.


I can't see how it can be done.

 

Even if you roll your own, which you would have to, and even with a monospaced font, the width will depend on how the particular font composes Unicode code points into graphemes (individual graphical units) and then glyphs (the visual representation the font produces).

3 minutes ago, Sherlock said:

write to the canvas and checked the width afterwards

That doesn't help with aligning text in a text file.


@Anders Melander You are right, I totally overlooked that small caveat. Sorry about the digression. The exception would be if the output file is a PDF or some other graphical print format; then my method should still be at least a base from which to explore further.

2 minutes ago, Anders Melander said:

You got me thinking... That is the way I used to do it. I'm unsure why I had to switch to this unwieldy TextLayout method. I think it is because of HiDPI, scaling and such, which might be easier to account for with a TextLayout.


I was thinking along the lines of TCanvas.TextWidth as well, but since I'm "just" dealing with text file output I thought this might be too much.
Delphi's characters.inc seems to contain all Unicode characters, and TCharHelper derives a lot of info from it.
There is also an official Unicode property, "East Asian Width" (https://www.unicode.org/reports/tr11/tr11-8.html), with a corresponding lookup table (https://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt), but this doesn't seem to be used in TCharHelper.
I'd probably import this table and do something similar to TCharHelper.
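
As a rough illustration of the table-import idea, here is a minimal sketch in Python rather than Delphi (chosen only because it is easy to run). It parses lines in the EastAsianWidth.txt layout into sorted ranges and looks a character up by binary search. The `SAMPLE` text is hand-written in the file's format, not a verbatim excerpt, and the names `parse_east_asian_width` and `lookup` are made up for this example. Treating code points that are not listed as `N` is an assumption of this sketch.

```python
import bisect

# Hand-written sample in the EastAsianWidth.txt layout (not a verbatim excerpt).
SAMPLE = """\
# code point or range ; property  # comment
0041..005A;Na  # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
3000;F         # IDEOGRAPHIC SPACE
4E00..9FFF;W   # CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FFF
FF1A..FF1B;F   # FULLWIDTH COLON..FULLWIDTH SEMICOLON
"""


def parse_east_asian_width(text):
    """Return a sorted list of (first, last, property) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        codes, prop = (part.strip() for part in line.split(";", 1))
        first, _, last = codes.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point: range of one
        ranges.append((lo, hi, prop))
    ranges.sort()
    return ranges


def lookup(ranges, ch, default="N"):
    """Return the width property of one character (assumed default: N)."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect.bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default


ranges = parse_east_asian_width(SAMPLE)
print(lookup(ranges, "编"), lookup(ranges, "A"))
```

A real import would read the complete file from the URL above; a Delphi version could generate a constant array from it in the same spirit as characters.inc.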

3 hours ago, luebbe said:

Hi everybody,

 

The task at hand is to write column aligned text to an output file. The good old reliable format function can handle this fine unless CJK characters are involved. Generally these occupy two latin character columns, but I cannot blindly assume that a string which has to be formatted only contains CJK characters and multiply by two.

Example:

en   : "Coding:" - seven characters occupy seven columns
zh_CN: "编码:"   - three characters occupy five columns

This does not make much sense IMO, since how the text looks when viewing the file depends on the font the viewing application uses in the window that displays the file content, i.e. the width of what you consider a "column" depends on the font. If you cannot control that, you may be better served by using tab characters instead of spaces for alignment. That format also imports more easily into Excel etc.
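
A tab-separated file of that kind could be produced roughly as follows. This is a Python sketch rather than Delphi, using the standard `csv` module with a tab delimiter; the sample rows are taken from the example in the first post, and the in-memory buffer stands in for a real output file.

```python
import csv
import io

rows = [("en", "Coding:"), ("zh_CN", "编码:")]

buf = io.StringIO()  # stands in for open("out.tsv", "w", encoding="utf-8", newline="")
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

print(buf.getvalue())
```

Each field is separated by a single tab, so the consumer (a spreadsheet, or a viewer with tab stops) decides how wide the columns are.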

Posted (edited)
5 hours ago, Anders Melander said:

That doesn't help with aligning text in a text file.

One needs to lay out the text to determine which code points occupy the same character. There is no other way. Uniscribe on Windows contains functions for doing this kind of work (look at the functions designed to move the caret to the next character, etc). Core Text on Mac does it. I have not looked closely at the FMX text layout encapsulation, it may not expose enough data to determine this for outputting to a plain text file.

 

One may be able to cheat by examining the Unicode block of each character and making some assumptions about how it will be laid out by a text viewer, but that is not accurate. It could be accurate enough, depending on the use case, I guess.

Edited by Brandon Staggs

6 hours ago, luebbe said:

The good old reliable format function can handle this fine unless CJK characters are involved.

I'm surprised that it can work properly outside of the ASCII range. Does it work properly if the text contains combining diacritical marks? 
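
To make the concern concrete, here is a small Python sketch (not Delphi; Python counts code points where Delphi's Length counts UTF-16 code units, but the mismatch is of the same kind). Padding is computed from the element count, so a base letter followed by a combining mark counts as two even though it is displayed as one column.

```python
import unicodedata

decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # the pair becomes U+00E9

# Both strings are displayed as four columns, but their lengths differ.
print(len(decomposed), len(composed))

# Padding to 6 is based on length, so the decomposed form gets one space fewer
# and the closing bar ends up one column to the left.
row_a = "{:<6}|".format(decomposed)
row_b = "{:<6}|".format(composed)
print(row_a)
print(row_b)
```

Normalizing to NFC before formatting helps only where a precomposed character exists, so it is not a general fix.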

6 hours ago, Brandon Staggs said:

One needs to lay out the text to determine which code points occupy the same character. There is no other way. Uniscribe on Windows contains functions for doing this kind of work (look at the functions designed to move the caret to the next character, etc). Core Text on Mac does it. I have not looked closely at the FMX text layout encapsulation, it may not expose enough data to determine this for outputting to a plain text file.

You are talking about shaping which is not relevant to the problem here.  Shaping translates a stream of Unicode code points into glyph IDs and the result 100% depends on the font used to do the shaping. That is the whole point of shaping. FWIW, I've written a shaper so I know a bit about these things...

 

Anyway I just realized that I forgot to explain why it is that...

three characters occupy five columns

...even though the font is monospaced.

 

The reason is that the font doesn't contain a mapping for the three Unicode code points (U+7F16, U+7801, U+FF1A). So instead of just giving up and displaying � or □, Windows (or whatever) searches its fallback font list for a font that 1) supports the Unicode script of the characters (Han in this case) and 2) contains a mapping for the characters (i.e. can map code point to glyph ID), and then uses that font to display the glyphs. Since the fallback font isn't monospaced, or at least isn't monospaced with the same character width, you get a different glyph width.
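
The fallback behaviour can be modelled with a toy example. The Python sketch below is not how Windows implements it; the two fonts, their coverage tests and their advance widths are all invented for illustration, and real engines also split runs by script. It only shows how per-character font selection produces runs with different widths.

```python
from itertools import groupby

# Hypothetical fonts in fallback order: (name, coverage test, advance width in pixels).
# Real coverage comes from each font's cmap table; these values are invented.
FONTS = [
    ("MonoLatin", lambda cp: cp < 0x0250, 8),
    ("FallbackHan", lambda cp: 0x4E00 <= cp <= 0x9FFF or 0xFF00 <= cp <= 0xFFEF, 13),
]


def pick_font(ch):
    """Return the index of the first font that can map the character, or None."""
    cp = ord(ch)
    for index, (_, covers, _) in enumerate(FONTS):
        if covers(cp):
            return index
    return None


def split_runs(text):
    """Split text into runs of consecutive characters served by the same font."""
    return [(font, "".join(chars)) for font, chars in groupby(text, key=pick_font)]


def pixel_width(text):
    """Sum the advance widths; unmapped characters use the primary font's width."""
    total = 0
    for font, run in split_runs(text):
        advance = FONTS[font][2] if font is not None else FONTS[0][2]
        total += advance * len(run)
    return total


print(split_runs("zh: 编码："))
print(pixel_width("Coding:"), pixel_width("编码："))
```

With these made-up numbers the Han run is 39 pixels wide, which is not a whole number of 8-pixel columns, so everything after it drifts off the grid.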

Posted (edited)
On 5/3/2024 at 4:58 PM, Anders Melander said:

You are talking about shaping which is not relevant to the problem here.  Shaping translates a stream of Unicode code points into glyph IDs and the result 100% depends on the font used to do the shaping. That is the whole point of shaping. FWIW, I've written a shaper so I know a bit about these things...

I have written Unicode text layout and rendering engines on Windows and Mac which are used in live video production environments. I doubt we are far apart on this. I was thinking about the original problem in the thread. If you know you are going for a monospaced layout, you do not really need to worry about the differences in fonts so much. Of course, there may be an outlier. Uniscribe will give you the logical cluster data you need to know which code points are part of a visual character, and you could then account for that in a plain text output to (hopefully) end up with the columns you expect.

 

Of course this may end up being a hopeless exercise and I think it would be much easier, if one wants formatted columns in their output, to use a text format that can actually do that for real, like HTML.

Edited by Brandon Staggs

15 minutes ago, Brandon Staggs said:

If you know you are going for a monospaced layout, you do not really need to worry about the differences in fonts so much. Of course, there may be an outlier. Uniscribe will give you the logical cluster data you need to know which code points are part of a visual character.

I think the result in the first post shows that you do have to worry about the font used.

The clusters a shaper, such as the one in Uniscribe, produces are part of a run of characters with a single script and a single specific font. If a string of text contains different scripts, or characters which can not be represented by the font, then the layer above the shaper will have to split it up into multiple runs and process each run individually. The reason is that different scripts have different shaping rules and the font (assuming it's an OpenType font) dictates many of the rules.

It may very well be that Uniscribe hides most of these things, but in the end the result will still be that multiple fonts are used and that the glyphs shown, and their size/position, depend on the fonts.

Posted (edited)
9 minutes ago, Anders Melander said:

I think the result in the first post shows that you do have to worry about the font used.

The clusters a shaper, such as the one in Uniscribe, produces are part of a run of characters with a single script and a single specific font. If a string of text contains different scripts, or characters which can not be represented by the font, then the layer above the shaper will have to split it up into multiple runs and process each run individually. The reason is that different scripts have different shaping rules and the font (assuming it's an OpenType font) dictates many of the rules.

It may very well be that Uniscribe hides most of these things, but in the end the result will still be that multiple fonts are used and that the glyphs shown, and their size/position, depend on the fonts.

Yes, even with English plain text you could end up with a font rendering with ligatures that throws off your calculations. But presumably the monospaced requirement minimizes such edge cases.

 

Regardless, I think I agree that the bottom line is you can't actually guarantee the output is going to be rendered in the way you want it to, when you don't have control over the font and you are just outputting a plain text file padding columns with spaces hoping they will line up the same way on another system. Still, if someone was insisting I make this work, I think I could get very close to the expected result simply based on the script itemization and shaping stages of Uniscribe. Special stylized fonts notwithstanding, the analysis provided by Uniscribe API will tell you which code points in your Unicode script ended up in which logical cluster. 

 

But it's still all in vain if your text viewer can't handle the text in the first place and can't do proper composition. It's not so bad now, but many editors I have used would choke on monospaced Hebrew text that contains vowel points. It's just too much for their layout engine to handle because of all the assumptions they make about text.

 

Maybe just use a CSV file? :classic_cool:

Edited by Brandon Staggs

1 minute ago, Brandon Staggs said:

I think I could get very close to the expected result simply based on the script itemization and shaping stages of Uniscribe.

So how would you handle the case where the specified font can't map a character causing a fallback font to be used?

Posted (edited)
5 minutes ago, Anders Melander said:

So how would you handle the case where the specified font can't map a character causing a fallback font to be used?

The question comes down to what the OP has control over. If he is writing the function that does this, he can choose the fallback fonts as needed. (That's what I do -- the API has all you need to re-run items in the script with fallbacks. How you choose those is another topic.)

 

Of course, that raises the question: when his text file is loaded on someone else's machine, will their text viewer have a similar rendering path (primary and fallback fonts)? Probably not.

 

Anyway, I'd do a best-effort attempt to get the logical cluster data based on a font with excellent block coverage.

 

https://learn.microsoft.com/en-us/windows/win32/api/usp10/nf-usp10-scriptshape

 

Quote

Character array, where c<n>u<m> means cluster n, Unicode code point m:

  • | c1u1 | c2u1 | c3u1 c3u2 c3u3 | c4u1 c4u2 |

Glyph array, where c<n>g<m> means cluster n, glyph m:

  • | c1g1 | c2g1 c2g2 c2g3 | c3g1 | c4g1 c4g2 c4g3 |

Cluster array, that is, the offset (in glyphs) to the cluster containing the character:

  • | 0 | 1 | 4 4 4 | 5 5 |

and then I would provide a CSV file of the column data so someone who doesn't like my text file can still get the data formatted properly.
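
To show how the cluster array from the quoted documentation can be read, here is a small Python sketch. The array is hard-coded from the quote; in real code it would come from the pwLogClust output of ScriptShape. The function name is made up for this example. Consecutive characters that share a value belong to the same cluster, so each group is one visual unit.

```python
from itertools import groupby


def clusters_from_log_clust(log_clust):
    """Return one (first_char_index, last_char_index) pair per cluster."""
    result = []
    position = 0
    for _, group in groupby(log_clust):
        count = len(list(group))  # characters sharing this cluster value
        result.append((position, position + count - 1))
        position += count
    return result


# Cluster array from the quoted example: | 0 | 1 | 4 4 4 | 5 5 |
log_clust = [0, 1, 4, 4, 4, 5, 5]
clusters = clusters_from_log_clust(log_clust)
print(len(clusters), clusters)
```

Seven characters collapse into four clusters here, which is the count a column calculation would start from before applying any width rule per cluster.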

Edited by Brandon Staggs

Posted (edited)
2 hours ago, Brandon Staggs said:

Regardless, I think I agree that the bottom line is you can't actually guarantee the output is going to be rendered in the way you want it to, when you don't have control over the font and you are just outputting a plain text file padding columns with spaces hoping they will line up the same way on another system

I agree. The key point is that a text file (including a Unicode text file) does not contain information about how the characters are displayed, so any attempt at alignment can never be guaranteed to succeed.

Thinking outside the box: what about the plain old TAB character? Using it as a separator between your columns might result in any reasonable file viewer lining them up for you. Always good to try a simple solution first.

Edited by Roger Cigol
typo


First of all my apologies for not replying sooner. That will teach me (not) to ask a question on a Friday before going on vacation...

We have no control over the font that is used to display the text file. It may be any standard console font or a font selected in the editor of the user's choice. So probably the best thing we can come up with is to count how many full width/half width CJK characters are in the string to be written and pad this with a matching number of half width spaces.
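
The padding idea could look roughly like this in Python (a sketch, not Delphi). It uses the standard `unicodedata` module, counts the East Asian Width classes `W` and `F` as two columns, and treats nonspacing and enclosing marks as zero columns. Counting everything else as one column, including the ambiguous class `A`, is an assumption of this sketch, and the function names are made up. As discussed above, the result still depends on the font the viewer picks.

```python
import unicodedata


def display_width(text):
    """Estimate the number of terminal-style columns a string occupies."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # combining marks sit on the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1  # assumption: everything else takes one column
    return width


def pad_right(text, columns):
    """Append half-width spaces until the estimated width reaches `columns`."""
    return text + " " * max(0, columns - display_width(text))


for label in ("Coding:", "编码:"):
    print(pad_right(label, 10) + "|", display_width(label))
```

For the two strings from the first post this estimates seven and five columns, matching the counts given there.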

I tried using tabs, but this only works if the "jiggle" is less than a tab's width, which cannot be guaranteed. I'll check if I can come up with something based on the table I linked above. But since it's only a cosmetic problem, it's not worth that much effort.

