Is there a possibility when calling .ToUpper() that the new string requires more memory?

Question

I want to use the following method from the MemoryExtensions class:

public static int ToUpper(this ReadOnlySpan<char> source, Span<char> destination, CultureInfo? culture)

My question is: am I always safe when the destination span has the same length as the source span? For example:

destination = stackalloc char[source.Length];

If not, can someone provide an example of a string that converts to a longer string when calling ToUpper on it (including the culture)?

I want to say yes, but I know some character sets/cultures do weird things when certain letters are capitalized. I'd expect you'd be fine 99.99%+ of the time, but that extra 0.01% could be killer. Or maybe you are perfectly fine… I'm not an expert here, and I'm interested to see if anyone could prove otherwise. — 8 hours ago
The one example I could think of that might do this was the German "viel spaß", possibly turning the ß character into SS, but it turns out that it is left alone by string.ToUpper(new CultureInfo("de-DE")). — 8 hours ago
In UTF-8, characters can range from 1 to 4 bytes, with mostly international chars taking up the 4 bytes. You should look into .GetByteCount(). This site has a table of the different UTF-8 1-4 byte characters. That being said, .NET uses UTF-16 encoding by default, which is 2 or 4 bytes per character, so I am guessing that the space needed would be source.Length * 4 at most. — 7 hours ago
ß could also turn into ẞ, the former taking 2 UTF-8 bytes and the latter 3, though as @500-InternalServerError pointed out, the German culture appears to just leave ß alone. — 7 hours ago
@Narish stackalloc will allocate that many elements of <sizeInBytesOfYourType> each, which in this case is 2 * source.Length bytes. — 6 hours ago

score 2 · Accepted Answer · 2023-09-20 21:38:30Z

MemoryExtensions.ToUpper returns -1 if the destination is too small.

The source code for ToUpper has this gem:

            // Assuming that changing case does not affect length
            if (destination.Length < source.Length)
                return -1;

There is no other point where -1 is returned; the function finishes with return source.Length.

So they’ve assumed it can’t happen. Whether they’re right is another question: if you find a counter-example I suggest you file a bug report on GitHub.
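Since -1 is the only failure signal, a caller can check the return value instead of relying on the length assumption. The following is a minimal sketch, not code from the runtime: UpperViaSpan and the 256-character stack limit are made-up names and values for illustration, and it assumes a .NET version where the span-based ToUpper overload and top-level statements are available.

```csharp
#nullable enable
using System;
using System.Globalization;

// Hypothetical helper (not part of .NET): upper-cases `source` into a buffer
// of the same length and treats a negative return value as a failure.
static string UpperViaSpan(ReadOnlySpan<char> source, CultureInfo? culture)
{
    // Assumed limit to keep stack usage bounded; longer inputs use the heap.
    const int MaxStackChars = 256;
    Span<char> destination = source.Length <= MaxStackChars
        ? stackalloc char[source.Length]
        : new char[source.Length];

    // MemoryExtensions.ToUpper returns -1 when the destination is too small.
    int written = source.ToUpper(destination, culture);
    if (written < 0)
        throw new InvalidOperationException("Destination buffer too small.");

    // Only the first `written` characters are valid output.
    return new string(destination.Slice(0, written));
}

Console.WriteLine(UpperViaSpan("viel spaß", CultureInfo.InvariantCulture));
```

With the current implementation the exception branch should never be hit for a same-length buffer, but checking costs little and keeps the code correct if the behavior ever changes.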

The docs for TextInfo (used later on in the code) say:

The returned string might differ in length from the input string. For more information on casing, refer to the Unicode Technical Report #21 "Case Mappings," published by the Unicode Consortium (https://www.unicode.org/). The current implementation preserves the length of the string. However, this behavior is not guaranteed and could change in future implementations.
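One way to probe the "preserves the length" statement on a given runtime is to upper-case inputs that would grow under Unicode's full case mappings (ß to SS, ŉ to ʼN, ﬁ to FI) and compare lengths. This is only a sketch: PreservesLength is a hypothetical helper, the sample list is nowhere near exhaustive, and it assumes the de-DE and tr-TR cultures are available (they may not be when the app runs in globalization-invariant mode).

```csharp
using System;
using System.Globalization;

// Hypothetical helper: true if upper-casing with the culture's TextInfo
// leaves the number of UTF-16 code units unchanged.
static bool PreservesLength(string s, CultureInfo culture) =>
    culture.TextInfo.ToUpper(s).Length == s.Length;

// Inputs whose upper-case form is longer under Unicode's full case mappings:
// ß (U+00DF), ŉ (U+0149), ﬁ (U+FB01).
string[] samples = { "viel spaß", "\u0149", "\uFB01" };

// Assumed to be installed on the machine running the check.
CultureInfo[] cultures =
{
    CultureInfo.InvariantCulture,
    new CultureInfo("de-DE"),
    new CultureInfo("tr-TR"),
};

foreach (var culture in cultures)
    foreach (var s in samples)
        Console.WriteLine($"{culture.Name,-6} {s,-10} length preserved: {PreservesLength(s, culture)}");
```

A result of false for any combination would be the counter-example the question asks for.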

What the doc says about it: The returned string might differ in length from the input string. For more information on casing, refer to the Unicode Technical Report #21 "Case Mappings," published by the Unicode Consortium (unicode.org). The current implementation preserves the length of the string. However, this behavior is not guaranteed and could change in future implementations. — 6 hours ago
True, but I think at that point there would have to be a new API surface that returns the required length, as this assumption is made in a number of places. There isn't even such a function buried privately in TextInfo and associated classes. — 5 hours ago

bpellett · 11 bronze badge · Accepted Answer · 2023-09-20 22:34:04Z

-2

You did not make it clear why you are writing assembler code in C#.

If you want to translate it into C#, it will look like this:

string source = GetSourceString(...);
string destination = source.ToUpper();

