> "Arguably the biggest improvement around UTF8 in .NET 7 is the new C# 11 support for UTF8 literals."
> "UTF8 literals enables the compiler to perform the UTF8 encoding into bytes at compile-time. Rather than writing a normal string, e.g. "hello", a developer simply appends the new u8 suffix onto the string literal, e.g. "hello"u8. At that point, this is no longer a string. Rather, the natural type of this expression is a ReadOnlySpan<byte>. If you write:"
> ```csharp
> public static ReadOnlySpan<byte> Text => "hello"u8;
> ```
>
> the compiler will compile that as if you had written:
>
> ```csharp
> public static ReadOnlySpan<byte> Text =>
>     new ReadOnlySpan<byte>(new byte[] { (byte)'h', (byte)'e', (byte)'l', (byte)'l', (byte)'o', (byte)'\0' }, 0, 5);
> ```
No, not unless you can pass a `ReadOnlySpan<byte>` into every API that expects a String. The change you referenced lets folks work around the fact that String is UTF-16; it doesn't transparently store ASCII text as one byte per character the way some other languages do.
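A minimal sketch of that distinction (names here are just for illustration): the u8 literal's natural type is `ReadOnlySpan<byte>` holding UTF-8 bytes, and handing it to a String-taking API still requires an explicit, allocating transcode back to UTF-16:

```csharp
using System;
using System.Text;

class Utf8LiteralDemo
{
    static void Main()
    {
        // The u8 suffix yields a ReadOnlySpan<byte> of UTF-8 bytes,
        // encoded at compile time -- no runtime Encoding.UTF8.GetBytes call.
        ReadOnlySpan<byte> utf8 = "hello"u8;
        Console.WriteLine(utf8.Length); // 5 bytes

        // It is not a string: APIs that take String still need an explicit,
        // allocating conversion back to UTF-16.
        string s = Encoding.UTF8.GetString(utf8);
        Console.WriteLine(s.Length);    // 5 UTF-16 chars
    }
}
```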
Doing magic with the internal string format was considered by the .NET team and rejected. There are multiple operations that are currently O(1) and non-allocating, but would sometimes become O(n) and allocating if the runtime did that sort of magic.
These include: P/Invoking Windows APIs that expect WCHARs, converting a string to a `ReadOnlySpan<char>`, and using `unsafe` to pin a string and access its contents as a `char*`.
Turning code that users may have relied on being O(1) and non-allocating into something possibly O(n) and allocating was deemed too disruptive.
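As a hypothetical demo of why those operations are O(1) today (not from the original discussion; requires compiling with unsafe enabled): both `AsSpan` and `fixed` hand out direct views over the string's internal UTF-16 buffer, so an internal ASCII representation would force a transcoding copy first:

```csharp
using System;

class Utf16InternalsDemo
{
    static unsafe void Main()
    {
        string s = "hi";

        // O(1), non-allocating today: a span view directly over the
        // string's internal UTF-16 buffer.
        ReadOnlySpan<char> span = s.AsSpan();
        Console.WriteLine(span.Length); // 2

        // Also O(1), non-allocating: pin the string and read its raw
        // UTF-16 code units through a char*.
        fixed (char* p = s)
        {
            Console.WriteLine((int)p[0]); // 104, the code unit for 'h'
        }

        // If strings were stored as one-byte ASCII internally, both views
        // would first need an O(n) allocating expansion to UTF-16.
    }
}
```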
Plus, because strings store their character data inline within the object itself, expanding an ASCII-stored string to UTF-16 on demand would require allocating a new object and updating every pointer to the old object to refer to the new one, which is currently an operation only the garbage collector performs. If expanding the string on demand required running a full garbage collection to handle this, then going from O(1) and non-allocating to O(n) plus a full garbage collection is a total non-starter.
Java, on the other hand, I think did not expose many places where a string could be observed to be a UTF-16 character array under the hood (perhaps only as part of JNI marshaling?), which made its ASCII-only compact-strings optimization more feasible. JavaScript never made a string's internal encoding visible at all, requiring only that the indexer return a UTF-16 code unit in O(1), so the ASCII optimization is simple there.
You can work around it, but it's nice in languages where common text doesn't take double the memory (thanks to UTF-8 strings or compact strings).