David Schwartz

Unicode weirdness

Ok, thanks for all of this juicy info. I've learned a bit, but I'm still not sure what to tell someone who sends emails or posts things that have words with apostrophes in them that look like: I’m  can’t  won’t and so on.

 

I've solved my problem, but that’s not it. 🙂

 

I've also found that if you load things that look like this

<break time='5s' />

into TMS WEB Core visual controls, they get eaten up because the browser apparently thinks they're XML tags or something like that. So I changed them to this

«break time='5s' /»

which works great.  And, yes, I tried &lt; / &gt; but they got automatically converted and THEN the whole "tag" got eaten up.

 

<sigh>

2 hours ago, jeroenp said:

That would be my first try too.

It could just as well be that the online PDF-to-text exporter is making an encoding error (it wouldn't be the first tool or site doing strange encoding stuff, hence the series of blog posts at https://wiert.me/category/mojibake/ ). That is why I mentioned ftfy: it's a great tool for figuring out encoding issues.

 

Looking at https://ftfy.vercel.app/?s=… (and hoping this forum does not mangle that URL) two encode/decode steps are required to fix, so it does not look like a plain "read using UTF8" solution:

 


s = s.encode('latin-1')
s = s.decode('utf-8')
s = s.encode('sloppy-windows-1252')
s = s.decode('utf-8')
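
A self-contained sketch of that two-round repair, for anyone who wants to try it without ftfy. Assumptions: the sample string and the helper names mangle_twice and repair_twice are made up for illustration, and the standard cp1252 codec stands in for ftfy's sloppy-windows-1252, so this only works when every byte involved is defined in Windows-1252.

```python
def mangle_twice(text: str) -> str:
    """Simulate the damage: UTF-8 read as Windows-1252, then UTF-8 read as Latin-1."""
    once = text.encode("utf-8").decode("cp1252")
    return once.encode("utf-8").decode("latin-1")


def repair_twice(garbled: str) -> str:
    """Undo both rounds, mirroring the four encode/decode calls quoted above."""
    s = garbled.encode("latin-1").decode("utf-8")
    # Assumption: plain cp1252 instead of ftfy's sloppy-windows-1252.
    return s.encode("cp1252").decode("utf-8")


original = "I\u2019m"  # "I'm" written with a typographic apostrophe (U+2019)
garbled = mangle_twice(original)
repaired = repair_twice(garbled)
print(repr(garbled))
print(repaired == original)
```

A single "read it as UTF-8" pass would undo only one of the two rounds, which matches the point that a plain UTF-8 read is not enough here.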

 

This is fair. I looked at the first couple which are UTF-8, and then assumed they all were. But a couple of them aren't. Not implausible that the Delphi code in the OP is wrong though. 

1 hour ago, David Schwartz said:

Ok, thanks for all of this juicy info. I've learned a bit, but I'm still not sure what to tell someone who sends emails or posts things that have words with apostrophes in them that look like: I’m  can’t  won’t and so on.

Point them to the "Problems in different writing systems" section of the Wikipedia Mojibake page.
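
If it helps to show rather than tell, here is a minimal sketch of how that exact garbling arises, assuming the usual cause: text saved as UTF-8 but read as Windows-1252. The sample word and variable names are mine.

```python
text = "can\u2019t"  # "can't" written with a typographic apostrophe (U+2019)

utf8_bytes = text.encode("utf-8")       # the apostrophe becomes bytes E2 80 99
misread = utf8_bytes.decode("cp1252")   # each byte is shown as its own character
print(misread)                          # can’t

# Reversing the wrong step recovers the original text.
recovered = misread.encode("cp1252").decode("utf-8")
print(recovered == text)                # True
```

So the sender's text is usually fine; it is the reading side that picked the wrong encoding.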

1 hour ago, David Schwartz said:

 

I've solved my problem, but that’s not it. 🙂

Did you really solve it, or did you work around it with the table-based approach? Because if you did the latter, your table is bound to be incomplete.
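
To illustrate why a table falls short, here is a small sketch comparing a lookup table with re-decoding. The table contents and function names are hypothetical, not taken from the original Delphi code, and it assumes the damage is UTF-8 read as Windows-1252.

```python
# Hypothetical lookup table: it only knows the sequences someone thought to add.
TABLE = {"\u00e2\u20ac\u2122": "\u2019"}  # garbled apostrophe -> U+2019


def fix_with_table(s: str) -> str:
    """Replace only the garbled sequences listed in TABLE."""
    for bad, good in TABLE.items():
        s = s.replace(bad, good)
    return s


def fix_by_redecoding(s: str) -> str:
    """Undo the wrong decode for every character at once.

    Raises UnicodeError if the text is not mojibake of this kind.
    """
    return s.encode("cp1252").decode("utf-8")


sample = "caf\u00c3\u00a9 don\u00e2\u20ac\u2122t"  # garbled "café don't"
print(fix_with_table(sample))      # the apostrophe is fixed, the é is not
print(fix_by_redecoding(sample))   # both are fixed
```

Note that Python's cp1252 codec rejects the few byte values Windows-1252 leaves undefined; as far as I know, that is the gap ftfy's sloppy-windows-1252 codec is meant to cover.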

1 hour ago, David Schwartz said:

 

I've also found that if you load things that look like this


<break time='5s' />

into TMS WEB Core visual controls, they get eaten up because the browser apparently thinks they're XML tags or something like that. So I changed them to this

That's something many web tools do. You should reproduce this and report it as an issue in the TMS WEB Core bug category.

 

The classic WordPress editor suffers from the same issue (and a whole lot more: search my blog for examples), but they marked it as "legacy" while forcing the very a11y* unfriendly (and pretentiously named) Gutenberg editor (which has other issues, but I digress) into people's faces.

 

1 hour ago, David Schwartz said:

«break time='5s' /»

which works great.  And, yes, I tried &lt; / &gt; but they got automatically converted and THEN the whole "tag" got eaten up.

The WordPress classic editor, trying to be smart, does that too, especially when you switch between preview and HTML text modes a few times. TMS WEB Core might fall into a similar trap. Be sure to report it as a bug to the TMS people.
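
For the bug report, a small sketch of the suspected mechanism may help. It uses Python's html module only to illustrate escaping and unescaping; whether TMS WEB Core actually behaves this way is an assumption, and double escaping is not a confirmed fix.

```python
import html

tag = "<break time='5s' />"

escaped_once = html.escape(tag, quote=False)
print(escaped_once)                   # &lt;break time='5s' /&gt;

# A component that unescapes entities before rendering gets the raw tag back,
# which the browser may then treat as markup and hide.
print(html.unescape(escaped_once))    # <break time='5s' />

# Escaping twice survives one round of unescaping as visible text.
escaped_twice = html.escape(escaped_once, quote=False)
print(html.unescape(escaped_twice))   # &lt;break time='5s' /&gt;
```

If the control unescapes more than once, even the double-escaped form would be swallowed, which is worth checking when reproducing the issue.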

 

--jeroen

* a11y: accessibility

On 4/8/2023 at 9:34 AM, David Heffernan said:

This entire thread blows my mind. The number of people who think it's normal to read UTF8 as though it were ANSI. 

Is this a one-time problem for OP, and is this the only encoding he will encounter? 

 

 
