It feels like unifying it with the ASCII i is the mistake here. There should have just been 4 Turkish characters in 2 pairs, rather than trying to reuse I/i.
It's not like we insist Α (Greek) is the same as A (Latin) or А (Cyrillic) just because they're visually identical.
But even with separate characters, you aren't safe, because the ASCII "unification" isn't just Unicode's fault to begin with; in some cases it is historic/cultural in its own right. German ß has distinct upper- and lowercase forms, but also a complicated history: sometimes, depending on locale, the uppercase form of ß is "SS" rather than a capital ß. In many of those same locales the lowercase form of "SS" is "ss", not ß. It doesn't even try to round-trip, and that's sort of intentional/cultural.
Uppercase ẞ has only existed officially since 2017, so before that, using SS as a replacement was the correct way of doing things. That is relatively recent when it comes to that kind of change.
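Python's built-in string methods use Unicode's default, locale-independent case mappings, which makes the ß asymmetry easy to see; a minimal sketch:

```python
# Unicode's default case mappings for German ß (locale-independent).
s = "ß"
print(s.upper())           # "SS": uppercasing expands to two letters
print(s.upper().lower())   # "ss": the round trip does not return ß
print("ẞ".lower())         # "ß": capital ẞ (U+1E9E) does lowercase to ß
print(s.casefold())        # "ss": full casefolding also picks the two-letter form
```

So ß → SS → ss is lossy by design, exactly as described above.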
This stems from the earlier Turkish 8-bit character sets like IBM code page 857, which Unicode was designed to be roundtrip-compatible with.
Aside from that, it's unlikely that authors writing both Turkish and non-Turkish words would properly switch their input method or language setting between the two, so the characters would get mixed up in practice anyway.
There is no escape from knowing (or best-guessing) which language you are performing transformations on, or else just leave the text as-is.
When Unicode was being specced out originally I guess. There was more interest in unifying characters at that stage (see also the far more controversial Han unification)
Uh-huh. At that time roundtrip compatibility with all widely used 8-bit encodings was a major design criterion. Roundtrip meaning that you could take an input string in e.g. ISO 8859-9, convert it to Unicode, convert it back, and get the same string, still usable for purposes like database lookups. Would you have argued to break database lookups at the time?
There's nothing about the ability to round-trip through Unicode that required byte 0x49 ("I") in ISO-8859-9 to be assigned the same Unicode code point as 0x49 in ISO-8859-1 just because they happen to be visually identical.
There is a reason: ISO-8859-9 is an extended ASCII character set. The shared characters are not an accident, they are by definition the same characters. Most ISO character sets follow a specific template with fixed ranges for shared and custom characters. Interpreting that i as anything special would violate the spec.
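To make the shared-template point concrete, here's a small sketch using Python's built-in codecs: ISO-8859-9 (Latin-5) keeps the ASCII range and most of Latin-1 byte-for-byte, and only the Turkish-specific letters in the high range differ:

```python
# The ASCII range is identical by definition across the ISO 8859 family.
b = bytes([0x49])                                     # "I" in both encodings
assert b.decode("iso8859_9") == b.decode("iso8859_1") == "I"

# The Turkish-specific letters live in the high range, e.g. 0xFD is ı:
assert bytes([0xFD]).decode("iso8859_9") == "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I
assert "ı".encode("iso8859_9") == bytes([0xFD])       # and it round-trips through Unicode
```

Giving the shared "i" a special Turkish code point would have broken that first equality.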
Back in those days, people would store a mixture of ASCII and other data in the same database, e.g. ASCII in some rows, ISO-8859-9 in others. (My bank at the time did that, some customers had all-ASCII names, some had names with ø and so on.) If unicode were only mostly compatible with the combination, it wouldn't have been safe to start migrating software that accessed databases/servers/… For example, using UTF8 for display and a database's encoding to access a DBMS would have had difficult-to-understand limitations.
You can fix all kinds of bugs if you're able to disregard compatibility with old data or old systems. But you can't. And that's why unicode is constrained by e.g. the combination of a decision made in Sweden hundreds of years ago with one made in Germany around the same time. Compatibility with both leads to nontrivial choices and complexity, incompatibility leads to the scrap heap of software.
So, uh, is this actually desirable per the Turkish language? Or is it more-or-less a bug?
I'm having trouble imagining a scenario where you wouldn't want uppercase and lowercase to map 1-to-1, unless the entire concept of "uppercase" and "lowercase" means something very different in that language, in which case maybe we shouldn't be calling them by those terms at all.
My understanding is it's a bug that the case changes don't round trip correctly, in part due to questionable Unicode design that made the upper and lower case operations language dependent.
This Stack Overflow answer has more details, but apparently the Turkish dotted i and dotless I don't all get their own Unicode code points; the common forms are shared with ASCII i and I, which is why this ends up gnarly.
• Lowercase dotted I ("i") maps to uppercase dotted I ("İ")
• Lowercase dotless I ("ı") maps to uppercase dotless I ("I")
In English, uppercase dotless I ("I") maps to lowercase dotted I ("i"), because those are the only kinds we have.
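The mappings above clash with what Unicode's default (language-independent) case operations do; a sketch using Python, whose str methods implement those defaults:

```python
# Unicode's *default* (non-Turkish) case mappings:
assert "i".upper() == "I"          # English round trip works...
assert "I".lower() == "i"
assert "ı".upper() == "I"          # ...but dotless ı also uppercases to I,
assert "ı".upper().lower() == "i"  # so the round trip turns ı into i.

# Lowercasing İ with the default mapping doesn't even give a plain i:
assert "İ".lower() == "i\u0307"    # "i" followed by U+0307 COMBINING DOT ABOVE
```

With the Turkish-specific mappings instead, I would lower to ı and İ would lower to i, which is exactly the language dependence being discussed.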
Ew! So it's a conflict of language behavior. There's no "correct" way to handle this unless you know which language is currently in use!
Even if you were to start over, I'm not convinced that using different Unicode code points would have been the right solution, since the rest of the alphabet is the same.
yup. lowercase and uppercase operations depend on language. It's rough.
In some APIs this distinction shows through; e.g. JavaScript's Intl.Collator is a language-aware sorting interface.
In practice, the best bet is usually to avoid casing conversions entirely and let users handle uppercase vs. lowercase on their own. But if you have to do case-insensitive operations, there are lots more questions about which normalization to use, and if you want to match user intuition you're going to have to take the language of the text into consideration.
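For the language-independent baseline, a minimal sketch of case-insensitive matching in Python (`caseless_eq` is a hypothetical helper; it normalizes to NFC, then applies Unicode full casefolding):

```python
import unicodedata

def caseless_eq(a: str, b: str) -> bool:
    """Language-independent caseless comparison: NFC-normalize, then casefold."""
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return nfc(a).casefold() == nfc(b).casefold()

assert caseless_eq("Straße", "STRASSE")               # ß casefolds to "ss"
assert not caseless_eq("İstanbul", "istanbul")        # default folding isn't Turkish-aware
```

The second assertion is the whole thread in miniature: without knowing the text is Turkish, no default folding will match İstanbul against istanbul.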
Yeah, making a specific "Turkish lowercase dotted i" character which looks and behaves exactly like the regular i except for uppercasing feels like introducing even more unexpected situations (and also invites the next homograph attack)
I guess it's a general situation: If you have some data structure which works correctly for 99.99% of all cases, but there is one edge case that cannot be represented correctly, do you really want to throw out the whole data structure?