A corollary to this is that if we had a simpler function for converting between UTF-8 and UTF-16LE, then I could remove all uses of iconv from my code, since I only use it to convert to/from MS Windows formats. (iconv's API is ugly and difficult to use correctly.)
Ugh, I forgot about Windows. Windows used UCS-2 some decades ago; from there, UCS-2 got standardized into the Microsoft Extensible Firmware Initiative FAT spec, and that document got incorporated into the EFI spec, which is now the mechanism by which x86 boots.
There’s more fun stuff like that: because Microsoft wrote the spec for the Language Server Protocol, the offsets in it are in UTF-16 code units even though the transport format is UTF-8. (This is less awful than it sounds, because a code point takes two UTF-16 code units iff it takes the maximum allowed four UTF-8 ones, i.e. if its UTF-8 encoding starts with a byte ≥ 0xF0. But it’s still pretty awful.) And of course UCS-2 had also been baked into Java and JavaScript/ECMAScript at about the same time (circa 1995) and only afterwards was it half-brokenly extended to UTF-16.
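The correspondence described above means you can compute an LSP-style UTF-16 offset by scanning UTF-8 bytes without fully decoding. A rough sketch (the function name `utf16_len` is mine, not from any LSP implementation, and it assumes the input is already valid UTF-8):

```python
def utf16_len(data: bytes) -> int:
    """Count UTF-16 code units for a valid UTF-8 byte string.

    Every byte that is not a continuation byte (10xxxxxx) starts a
    code point, i.e. at least one UTF-16 code unit. Lead bytes >= 0xF0
    start a 4-byte sequence, i.e. a code point outside the BMP, which
    needs a second UTF-16 code unit (a surrogate pair).
    """
    units = 0
    for b in data:
        if b & 0xC0 != 0x80:   # not a continuation byte: new code point
            units += 1
        if b >= 0xF0:          # 4-byte sequence: surrogate pair in UTF-16
            units += 1
    return units
```

For example, "a𝄞b" is four UTF-16 code units (U+1D11E is outside the BMP), and this function agrees without ever decoding.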
In defense of all those usages (except the LSP one, which is indefensible), the original pitches for Unicode[1] literally said that it was intended to be fixed-width, an international ASCII of sorts; that was to be achieved by restricting it to “commercially-relevant” text (and Han unification). Then it turned out there are plenty of very rare Han characters people really, really want to see encoded (for place and personal names, etc.). Of course, in hindsight an encoding with nontrivial string equivalences (e.g. combining diacritics) was never going to be as simple to handle as ASCII.
`wcstombs` and `mbstowcs` sound like they might do this?
They're C99 standard functions and should convert between "wide strings" and "multibyte strings", which should be native UTF-16 and UTF-8 if your current locale is a UTF-8 locale. (That's on Windows, where wchar_t is 16 bits; on most Unix systems wchar_t is 32 bits, so wide strings are UTF-32 there.)
Apparently this works on Windows since Windows 10 version 1803 (April 2018).
There are also "restartable" variants `wcsrtombs` and `mbsrtowcs` where the conversion state is explicitly stored, instead of (presumably) a thread-local variable.
C11 added "secure" variants (with an `_s` suffix) of all these which check the destination buffer size and have different return values.
I'd encourage you to just write the function. It's ultimately just two encodings of the same data. You can figure things out from the wikipedia pages for utf-8 and utf-16.
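For a flavor of what "just write the function" looks like, here's a sketch of a hand-rolled UTF-8 to UTF-16LE converter (in Python for brevity; the bit manipulation translates directly to C). It does minimal validation only; a production version would also need to reject overlong sequences and surrogate code points in the input:

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        # Decode one code point from UTF-8.
        if b < 0x80:                       # 1 byte: ASCII
            cp, n = b, 1
        elif b & 0xE0 == 0xC0:             # 2-byte sequence
            cp, n = b & 0x1F, 2
        elif b & 0xF0 == 0xE0:             # 3-byte sequence
            cp, n = b & 0x0F, 3
        elif b & 0xF8 == 0xF0:             # 4-byte sequence
            cp, n = b & 0x07, 4
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        for j in range(i + 1, i + n):
            if j >= len(data) or data[j] & 0xC0 != 0x80:
                raise ValueError(f"truncated sequence at offset {i}")
            cp = (cp << 6) | (data[j] & 0x3F)
        i += n
        # Encode the code point as UTF-16LE.
        if cp < 0x10000:
            out += cp.to_bytes(2, "little")
        else:                              # needs a surrogate pair
            cp -= 0x10000
            out += (0xD800 | (cp >> 10)).to_bytes(2, "little")
            out += (0xDC00 | (cp & 0x3FF)).to_bytes(2, "little")
    return bytes(out)
```

It agrees with the built-in codecs on mixed ASCII/BMP/astral input, which is a decent sanity check while writing such a thing.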
But if you are calling Win32 APIs anyway, you can use MultiByteToWideChar/WideCharToMultiByte to convert between UTF-8 and UTF-16. No need for iconv.
Either way, I suggest that readers who might feel upset by this statement explore something outside of C and C++; liking them, when it comes to strings, is nothing short of Stockholm syndrome.
I'm working on a UTF-8 string library for C#, and over the last 6-8 months I've explored string design in Rust, Swift, Go, C, C++ and a little in other languages. C and C++ were, by far, the most horrifying in the amount of footguns as well as the average effort required to perform trivial operations (including the transcoding discussed here).
Strings are not easy. But that does not mean their complexity has to be unjustified or unreasonable, which it is in C++ and C (for reasons somewhat different although overlapping). The problem comes from the fact that C and C++ did not enjoy the benefit of hindsight that Rust had when designing its string type around being exclusively UTF-8, with special types to express opaque, ANSI or UTF-16 encodings for situations where UTF-8 won't do.
But I assure you, there will be a strong negative correlation here between complaining about string complexity and using Rust, or C#/Java, or even Go. Keep in mind that Go's strings are still a poor design that lets you arbitrarily tear code points and forgoes the richness and safety of Rust strings. The same, to an extent, applies to C# and Java strings, though they are also mostly safe through a quirk of UTF-16: you can only ever tear non-BMP code points, which occur infrequently at the edges of substrings or string slices, since the offsets are produced by scanning or from known-good constants.
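The UTF-16 tearing in question can be simulated from Python by slicing at a code-unit offset inside a surrogate pair (a sketch for illustration; Python's own str can't be torn this way, since it indexes whole code points):

```python
# U+1F642 is outside the BMP, so it takes two UTF-16 code units.
units = "🙂".encode("utf-16-le")
assert len(units) == 4  # two 16-bit code units

# Slicing after the first code unit leaves a lone high surrogate,
# which is exactly what a naive UTF-16 substring can produce.
torn = units[:2].decode("utf-16-le", errors="replace")
assert torn == "\ufffd"  # U+FFFD replacement character
```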
If, at your own peril, you still wish to stay with C++, then you may want to look at QString from Qt, which is what a decent string-type UX should look like.
Go's strings aren't poor design. The only difference between a Go string and a Rust &str/String is that the latter is required to be valid UTF-8. In Go, a string is only conventionally valid UTF-8. It is permitted to contain invalid UTF-8. This is a feature, not a bug, because it more closely represents the reality of data encoded in a file on Unix. Of course, this feature comes with a trade-off, because Rust's guarantee that &str/String is valid UTF-8 is also a feature and not a bug.
I mention gecko as an example repository that contains data that isn't valid UTF-8. But it isn't unique. The cpython repository does too. When you make your string type have the invariant that it must be valid UTF-8, you're giving up something when it comes to writing tools that process the contents of arbitrary files.
Go strings aren't necessarily text. Rust strings are text, as long as you consider things like emoji or Egyptian hieroglyphics to be text. Lots of confusion has come from the imprecise meaning of "string", whether it's referring to arbitrary byte sequences, restricted byte sequences (e.g. not containing 0x00), arbitrary sequences of characters with some encoding, restricted sequences of characters with some encoding, or something else is often unclear. And when it's a restricted sequence what those restrictions are is also often unclear.
You sometimes need a way to operate on entirely arbitrary sequences of bytes. This is mostly easy, it's been a long time since non-octet bytes were relevant in most situations, so the vast majority of the time you can just assume they're all octets.
You sometimes need a way to operate on arbitrary text. This inherently requires knowing how that text is encoded, but as long as you know that it's mostly easy.
You sometimes need a way to operate on text-like things that aren't necessarily text, like the output of old CLI programs that used the BEL character to alert the user to events. Or POSIX filenames. Or text where you don't know the encoding. This is where the bugs lie, where we make unchecked assumptions about the data that turn out to be invalid.
You didn't really respond directly to anything I said, nor anything I said in the blog I linked (that I also wrote). You also seem to be speaking to me as if I'm some spring chicken. I'm not. I'm on the Rust libs-api team and I'm in favor of the &str/String API design (including its UTF-8 requirement). I wrote ripgrep. I've spent 10 years working on regex engines. I understand text encodings and the design space of string data types. I've implemented string data types. It might help to understand things a little better by perusing the bstr crate API[1]. Notice that it doesn't require valid UTF-8, yet assumes by convention that the string is UTF-8. And this assumption provides a path to implementing things like "iterate over all grapheme clusters" with sensible semantics when invalid UTF-8 is seen.
You'll notice that I didn't say "Go's string design is good and we should all use it." I made an argument that Go's string design is not poor and explained why. In particular, I described trade-offs and a particular pragmatic point on which abdicating the UTF-8 requirement makes for a more seamless experience when dealing with arbitrary file content.
> but as long as you know that it's mostly easy. [..snip..] Or text where you don't know the encoding.
You don't know. That was my whole point! I gave real-world concrete examples of popular things (Mozilla and CPython repositories) that contain text files that aren't entirely valid UTF-8. They are only mostly valid UTF-8. If I instead treated them as malformed and refused to process them in my command line utilities or libraries, I would get instant bug reports.
> Go strings aren't necessarily text.
I would generally consider this to be an incorrect statement. The more precise statement is that Go strings may contain invalid UTF-8. But the operations defined on strings treat strings as text. For example, if you iterate over the codepoints in a Go string, you'll get U+FFFD for bytes that are invalid UTF-8. By your own reasoning, U+FFFD must be considered text because it can also appear in a Rust &str/String. Despite the fact that a Go string and a []byte can represent arbitrary sequences of bytes, a Go string is not the same thing as a []byte. Aside from mutability and growability, the operations on them (both those provided as a library and those provided by the language definition itself) are what distinguish them. They are what make a `string` text, even when it contains invalid UTF-8.
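Go's for-range-over-string behavior can be approximated in Python with the 'replace' error handler: invalid bytes come out as U+FFFD instead of aborting (a sketch of the semantics, not of how Go implements it):

```python
data = b"ok\xffgo"  # 0xFF can never appear in valid UTF-8

# Strict decoding refuses the data outright...
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decoding failed:", e.reason)

# ...while Go-style iteration substitutes U+FFFD and carries on.
text = data.decode("utf-8", errors="replace")
assert text == "ok\ufffdgo"
```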
There are deep trade offs here, but the UTF-8-is-required does have downsides that UTF-8-by-convention does not have. And of course, vice versa.
A lot of what makes C string handling hard to use is the decision that APIs should write into a user-supplied buffer rather than allocate one for you.
Not any that I'm aware of. You have your choice of programming languages where they got it wrong (Python 3, Ruby), languages which are incredibly nit-picky (Rust), languages which are full of footguns (C, C++), languages which pass on the issue (OCaml), languages which assume everything is UTF8 (Perl), languages which assume everything is UTF16 except when they forgot about planes and assume UCS-2 (Java, everything Windows), languages which are mad (APL), ...
Of these outcomes I like the choice to pick nits best. I felt the same way before I learned any Rust (i.e. when I wrote C and had to manually groom strings), so this went in my pile of things to like about Rust. Apparently Rust chose only relatively narrowly to have &str (the string slice type, which is guaranteed to be UTF-8 text) at all, rather than just &[u8] (a slice of bytes) everywhere, and to my mind that's a pretty serious benefit.
The choice to have this in a language with the safe/ unsafe distinction works very nicely because in so many languages you'd have this promised UTF-8 type and then in practice everybody and their dog uses the unsafe assumed conversion because it's easier, but in Rust you're pulled up short because that conversion needs an unsafe block, your local style may require (and good practice certainly does) that you address this with a safety comment explaining why it's OK and... it just isn't, in most cases. So you write the safe conversion instead unless you really need not to. This is a really nice nudge, you can do the Wrong Thing™, but it's just easier not to.
Rust still has its warts. When dealing with an archive, say, you can find yourself needing to deal with Windows strings on Unix, or vice versa, but Rust only provides the string type for the platform you are running on. It could do with UnixString and WindowsString in addition to OsString.
MFC had a similar problem for years. It had a CString class which was ANSI or Unicode, depending on how you compiled your app, but moderately often you needed the other one, so it should have had a CStringA and CStringW too, with nice conversions between them.
OsString isn't "the native string type", it's a container for whatever was convenient for Rust's internals on that OS, there are convenience functions so it probably feels like it's "UnixString" or "WindowsString" but it's neither.
In most of these file format cases what you've got is &[u8] or &[u16] and maybe it's the NonZero variant instead, so I think it's fine to be explicit that's what is going on and maybe in the process remind you to check - is this data UTF16LE? UTF16 with a BOM? UCS2 with a nod and a wink? Just arbitrary 16-bit integers and good luck?
But like I said, I favoured "picking nits" long before I learned Rust, so mileage may vary.
IMHO Rust got OSString wrong – it indirectly promised that UTF-8 can always be cast to it without any copying or allocations, so on Windows it has to use WTF-8 rather than UCS2/UTF-16. Instead of being OS’s preferred string type, it’s merely a trick to preserve unpaired surrogates.
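The distinction is visible from Python, whose 'surrogatepass' error handler produces the same generalization WTF-8 relies on: UTF-8-style bytes for a lone surrogate, which strict UTF-8 must reject (a sketch for illustration):

```python
lone = "\ud83d"  # a lone high surrogate: forbidden in well-formed UTF-8

try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 rejects lone surrogates")

# WTF-8 encodes it anyway, using the ordinary 3-byte UTF-8 pattern,
# and the bytes round-trip losslessly.
wtf8 = lone.encode("utf-8", errors="surrogatepass")
assert wtf8 == b"\xed\xa0\xbd"
assert wtf8.decode("utf-8", errors="surrogatepass") == lone
```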
I wouldn't say it's necessarily wrong; it may be (accidental) foresight. Windows has added a UTF-8 code page, which means you can get as near enough as makes no difference full support with the A functions instead of the W functions.
That said, even now in 2024, it's not clear how much of a bet Windows is making on UTF-8 versus UTF-16.
I would be surprised if the UTF-8 support in Windows was anything deeper than the A functions creating a W string and calling W functions, which is what Rust is doing already.
The question is whether that's all they ever do, or whether gradually some parts of Windows begin to decide to be UTF-8 first and their W functions now translate the opposite way. This must be tempting for some networking stuff, where you're otherwise taking a perf hit compared to Linux.
Huh. Where is the "indirect promise"? Is there an `as_` function (conventionally hinted as "free", so it's not good if there's one that isn't very cheap)? Like as_os_string or something?
It's because str implements AsRef<OsStr> [0]. The function signature promises that whenever you have a borrowed &'a str, the standard library can give you a borrowed &'a OsStr with the same data.
Since references can't have destructors (they don't own the data like an OsString does), it means that the standard library can't give you a newly-allocated string without leaking it. Since obviously it isn't going to do that, the &OsStr must instead just act as a view into the underlying &str. And the conversion can't enforce any extra restrictions on the input string without breaking backward compatibility.
The overall effect is that whatever format OsStr uses, it has to be a superset of UTF-8.
It was a sly dig at APL for using non-ASCII characters as regular operators. Actually I have no idea how those were implemented in the language. Presumably not as Unicode since APL predates Unicode by quite a considerable number of years. Does anyone know?
In modern APLs a character scalar is just a Unicode code point, which you might consider UTF-32. It's no trouble to work with. Although interfacing with things like filenames that can be invalid UTF-8 is a bit of a mess; Dyalog encodes these using the character range that is reserved for UTF-16 surrogates and therefore unused. If you know you want to work with a byte sequence instead, you can use Unicode characters 0-255, and an array of these will be optimized to use one byte per character in dialects that care about performance.
Your program will crash at runtime if you pass any string that isn't valid UTF-8 and you forgot to use bytes, and using .decode requires that your input has a valid, known encoding which is rarely true in the real world of messy data.
It's a problem with Unix filenames where the encoding is just a convention. A Python program that doesn't take great care can crash on a parameter that takes a filename even if that filename is just passed to a function like 'open' so no sanitisation or conversion is necessary.
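Python's escape hatch here is the 'surrogateescape' error handler (what os.fsdecode/os.fsencode use for filenames), which smuggles unmappable bytes through str and restores them losslessly on the way back. A sketch:

```python
raw = b"caf\xe9.txt"  # Latin-1 'é': not valid UTF-8

# The bad byte becomes a lone low surrogate instead of raising...
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "caf\udce9.txt"

# ...and encoding with the same handler restores the bytes exactly.
assert name.encode("utf-8", errors="surrogateescape") == raw
```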
This is a very real problem in hivex, our Windows registry library, where the Python bindings don't really work well. The Windows registry is a hodgepodge of random encodings, essentially whatever the program that wrote the registry key thought it was using at the time. When parsed through Python as a string, this means you'll get Unicode decoding errors all over the place.
> It's a problem with Unix filenames where the encoding is just a convention
UNIX filenames are NOT necessarily printable text, they're byte strings. Don't treat them as printable text. They're sequences of bytes not containing 0x00 or 0x2F, but with no encoding.
Same for Windows registry keys. Don't mistake byte strings for text.
Text is a byte string with an encoding which describes which byte values are valid and which characters each byte (or sequence of bytes) corresponds to. If the encoding information is discarded, it stops being text and becomes just a byte string.
OTOH, if "stop whining about how hard the system is to use correctly, just pay attention, be careful, and don't write bugs" worked, we wouldn't have issues like the one this thread is about, and no-one would have needed to invent Rust.
Oh, 100%. It's critical to wrap the crappy APIs into good ones. But that means you have to write the wrapper, and do so very carefully. Then hope you can get your wrapper adopted widely enough to be standardized, and have the old crappy API deprecated & removed (unlikely). In practice, the crappy API will be kept for backwards compatibility, and will forever remain a footgun. You have to be aware of it, even if only in a "here be dragons" manner.
The POSIX filename rules are a "here be dragons" sign. If you can avoid dealing with them, it's best to do so. If you can't avoid it, you'll need to parse your inputs into valid text, and fall back to a byte string handling path (or just fail with error) if they're not text. You can't safely just assume they're text & treat them as such, it doesn't end well.
To be honest, I disagree. Unix filenames are printable text, they're just text that the OS chooses not to enforce any validation on. Especially since the charset is implied by your choice of locale, which is a user decision, not a kernel decision. But we've transitioned into a world where pretty much all systems have settled on UTF-8 as the charset of choice, and the OS's refusal to even permit kernel options to forcibly validate filenames as UTF-8 is starting to look like a poor decision.
(IIRC, at least one of the BSDs has actually moved to forcing filenames to be UTF-8 and refusing path names that aren't UTF-8. Would only that Linux moved down that path as well so that we could be done with this farce.)
Kernel developers need to consider backwards compatibility. You won't want to see some users lose their data because they upgraded the kernel. Therefore it is very hard to "force" something.
A filename consisting of nothing but ASCII Bell characters (0x07) is valid. Those are non-printable characters that (used to) make a sound from the PC's speaker. POSIX filenames can be sound, not text.
I'd agree it'd be nice if we could restrict filenames to valid UTF-8. But that's not the API that existing filesystems provide, nor what (most of) the existing OSes enforce.
That's an important way of looking at it, and is correct as far as traditional UNIX operations like fopen go. But it isn't the whole story because many other operations require treating paths as text. For example, converting them to URIs, putting them in ZIP files, or looking up a file on a filesystem which internally stores filenames in UCS-2.
Threading both of these needles at once basically requires viewing paths as potentially-invalid encoded text.
You can tell .decode what to do with errors so it won't throw an exception; the default is 'strict'. I think Python 3.6+ did a pretty good job with it overall.
Don't get me wrong it's still painful and annoying and bug prone, but the point is, it's encodings, it's always going to suck no matter what.
>using .decode requires that your input has a valid, known encoding which is rarely true in the real world of messy data.
How else would you decode a string without knowing its encoding? You can either guess (and risk invalid result/decode errors) or store this information somewhere. This is universal and true in every language. In most cases today people choose to guess utf-8.
>>> 'abc' + 'def'[1]
'abce'
>>> b'abc' + b'def'[1]
TypeError: can't concat int to bytes
It's been responsible for a lot of bugs in my code. They copied the "bytes is an array of integers" thing from Java. Big mistake. Python is not Java.
Them treating filenames as strings is another bug factory on Linux. Almost no one unit-tests filenames with invalid encodings, so the result is that a whole pile of Python 3 programs will fail when given perfectly valid input, whereas those same programs in Python 2 were fine.
It's a very odd outcome given that Python hails from the Linux world.
It must be very difficult to use one of the dozens of packages that let you detect and pick correct encoding to wrap a file stream before reading into a string.
Empirically yes! Nearly everything we tried would do a passable job on European texts, and misconvert short texts in less common character sets. Because the problem is ambiguous, we had the best luck with a library that allowed us to disable possible encodings. We thought we could just remove encodings that were not widely in use in the countries we served, and that mostly fixed it, but it still came up a couple times a year. Eventually we had a large enough corpus of sample documents that we could test any changes to the character set bitmap for impact to at least some of each better supported character set.
They're getting there: starting from later builds of Windows 10, there's a manifest flag you can set on your executables which makes all of the legacy "ANSI" (A) interfaces accept and return UTF-8 instead. Windows is probably just converting to and from UTF-16 internally, but that's not your problem anymore.
I believe it silently does the wrong thing so you should probably enforce a minimum Windows version by some other means. The last version to not support that UTF-8 flag has been officially EOL for years so cutting it off completely is on the table, depending on your audience.
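For reference, the flag in question is the activeCodePage setting in the application manifest; a minimal sketch (element names per Microsoft's documented manifest schema, hedged in that your build system may generate the surrounding assembly element for you):

```xml
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```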
Man, if English were the only human language in this world, who would need UTF-8?
The other encodings exist because they are more efficient for the other languages, especially for Chinese, Japanese, and Korean. UTF-8 takes 50% more space than the alternatives. Too bad modern Linux systems only support UTF-8 locales.
That's 183 non-UTF-8 locales that are available on my system. OK, I don't have any non-UTF-8 locales currently configured for use, but I don't have to install anything extra for them to be available. Just uncomment some configuration lines and re-run `locale-gen`.
But the reality is: most glibc functions like `dirname` cannot handle non-UTF-8 encodings, because some encodings (like GBK) overlap with ASCII, which means that when you search for an ASCII char (like '\') in a char array, you may accidentally hit half of a non-English character. Therefore, people in Asia usually do not use the non-UTF-8 locales.
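The classic demonstration of this class of problem actually uses Shift-JIS rather than GBK: the second byte of some characters is 0x5C, the backslash, so a naive byte scan finds path separators that aren't there. A sketch:

```python
# In Shift-JIS, '表' is encoded as 0x95 0x5C; the second byte is '\'.
name = "表.txt".encode("shift_jis")
assert b"\\" in name  # a naive byte scan finds a bogus backslash

# Splitting on that byte, as a Windows-path-aware dirname might,
# tears the first character in half.
head, _sep, tail = name.partition(b"\\")
assert head == b"\x95"  # half of '表'
assert tail == b".txt"
```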
But the byte '/' can never be part of any filename/dirname under a UNIX filesystem. Which kinda sucks generally for anyone wanting to use a charset like that, but doesn't it also mean that this should never be a problem for `dirname()`?
I'm struggling to imagine how this failure would manifest. Can you give an example of how dirname() would fail? What combination of existing file/directory name, and usage of that function, would not work as expected?
Edit: I'm also a bit confused how this counts as being a problem for "modern Linux systems" - wouldn't it have always been a problem for all Unix-based OSs?
> Man, i wish everything was UTF-8 so we iconv would not be needed anymore. Too bad its defined in POSIX.
I wish nothing was in UTF-8 and UTF-8 was relegated to properties files. There are codebases out there with complete i18n and l10n in more languages than most here have ever worked with, where there are zero Unicode characters allowed in source code files (with pre-commit hooks preventing such files from being committed).
Bruce Schneier was right all along in 1998 or whatever the date was when he said: "Unicode is too complex to ever be secure".
We've seen countless exploits based on Unicode. The latest (re)posted here on HN was a few days ago: a Unicode parsing bug affecting OpenSSL. Why? To allow support for internationalized domain names and/or internationalized emails.
Something that should never have been authorized.
We don't need more of what brings countless security exploits: we need less of it.
Relegated Unicode to translation/properties file, where it belongs.
Sure, Unicode is great for documents, chat, etc.
But everything in UTF-8? emails? domain names? source code? This is madness.
I don't understand how anyone can regard the fact that HANGUL fillers are valid in source code as somehow a great win for our industry.
This is the obvious complaint, but ASCII is the only common subset of the various encoding schemes. For some things like programming languages and protocols I think it makes sense to have an ASCII constraint. You can express pretty much any sound phonetically using the Latin alphabet.
The last part is so obviously wrong as to be silly; off the top of my head, voiced lateral fricatives are used in some of my home languages and would not be possible to explain in simple Latin-alphabet terms.
How about a tonal language? Or a whistling language? Or a clicking language?
I agree that grandparent is wrong in their statement about the ability to express sounds.
I don't, however, understand why you would need support for Unicode chars in source code. To support Unicode identifiers? Why would we want that?
I am non-native English speaker. Almost always we work in an international setup, so I already have to reeducate graduates from university not to use localized variable names and not to put in localized comments.
English is the de facto standard in the programs and it's mostly due to technical constraints and historical leadership in programming languages (well, there was French version of Pascal for a while, don't get me started on that ;) ). I think this made us communicate across the borders better. We already have problems of properly naming things, even limited to English, so why would we want to increase the misunderstandings?
Why would you want to increase misunderstandings by forcing people to communicate exclusively in a language that they are strictly less proficient in, when everybody on the team is much more proficient in a different language?
This is blatant North American and Western European bias. It makes sense as a policy in those regions, but it is desperately necessary that computing standards do not prevent the rest of the world from participating on as equal terms as possible.
No, it is not blatant "North American and Western European bias".
I am Bulgarian, living in Bulgaria, a small country in Eastern Europe. My native language is Bulgarian. I also know English, Russian, and a little bit of French and Turkish.
I do not want to write, or read source code written in Bulgarian, or Russian or French, or Kiswahili, or Japanese etc. I want to use one single language for that. It currently happens to be English for historical reasons.
Supporting Unicode in source code will lead to language fragmentation, and will limit my ability to participate and to understand everything already built by others.
Supporting Unicode in itself does not nudge anyone towards localizing their codebase.
Please consider that as a (presumably young or young-ish) European, you have access to different educational resources, and are exposed to a cultural context where English is the lingua franca. As a Bulgarian, or any other European nationality, the only foreign language you truly need to learn to get by almost anywhere on the continent is English.
There are many, many countries and places on Earth where English is a significant barrier, due to being much further away, geographically and culturally.
While English is unavoidable after a certain level (just like French is unavoidable for a chef, or Italian is for a musician), it is crucially important that the barrier to entry does not also necessarily come with a huge language barrier up front.
Most people in Tibet only speak Tibetan. They also need to use smart phones. They type texts on their phones to communicate with their friends. They simply cannot use Latin alphabet for doing that.
That's the other direction (legacy charset conversion to UCS-4 or UTF-8). This other direction is often reachable using the charset parameter in the Content-Type header and similar MIME contexts.
HTTP theoretically supports Accept-Charset, but it's deprecated:
The charset in question does not have a locale associated with it (it's not even ASCII-transparent), so I don't think it's usable in a local context together with SUID/SGID/AT_SECURE programs.
I seriously doubt you can make PHP convert anything to that exotic charset automatically even with creative configuration, and pretty much sure it wouldn't do any of the sort in common configuration. What I suspect is going on is that the author is interested in exploiting PHP engine and is assuming PHP code using iconv() and wants to talk about how to get from there to full scale RCE. It is indeed a fascinating and non-trivial topic, though the relationship between a particular CVE and the PHP angle is rather coincidental - any buffer overflow would do, it's just the author happened to have one in a reasonably common function.
My guess is it's application-specific: PHP applications that use the iconv function in some specific way, in some specific context, will be vulnerable.
As a test on various distros, I ran
iconv -l | grep 2022 | grep -i cn
and it listed ISO-2022-CN-EXT// and ISO2022CNEXT// before I made any changes. After editing the modules and running iconvconfig the command no longer showed those charsets.
This was handy since alma8 has a /usr/lib64/gconv/gconv-modules file, but the file to edit was /usr/lib64/gconv/gconv-modules.d/gconv-modules-extra.conf
Thanks a bunch to you and thenickdude for the test command and config to change.
I have an old VPS that isn't worth trying to update to a newer OS image, because I'm already (slowly) migrating things off of it before the current paid-up term expires, but it definitely won't get the newer glibc. Disabling the vulnerable character encoding works for me, since no legitimate user of the server will need these conversion pairs.
There's a CVE for it - CVE-2024-2961 - and fixed glibc versions have already been released (and reached Ubuntu LTS for one thing) so I think it's fine.
Responsible disclosure generally means you tell the maintainers first before doing your splashy talk. Which appears to be what happened here, since there is a CVE and we know mostly what was fixed. The talk would probably just go into the nitty-gritty details of how it was found and how it's exploitable, stuff a skilled researcher would already be able to figure out based on what has already been publicly released.
If people found out about this from an abstract, the fixed version wouldn't have already been pushed to distros, cves issued, etc by the time it was public. People would also be a lot more angsty about it.
Nothing about this suggests that it wasn't first privately disclosed to the glibc maintainers or that anything else improper happened.