Strings using special characters

Discostew

So I've been wanting to use special characters in strings for my project as a kind of array (honestly, unless this is an actual problem that will get fixed, I may just stick with arrays), but it seems that accessing strings in certain ways does not behave well when special characters are used.

See, in FUZE, if you limit yourself to the first 128 characters, there's no problem, as each one takes 1 byte. But once you go beyond that, an individual character takes 2 bytes, then 3, then up to 4. This is because the characters are more or less encoded as Unicode. So what's wrong with this?

Well, you could make a string that's as such...

string test = "abcd" + chr(255) + "efgh"

And when you print it, it prints "abcdÿefgh", which is correct by Unicode standards. How many characters is that? 9, yes? Unfortunately, when you use the len() function, it returns 10, because "ÿ" is composed of 2 bytes whereas the rest are just 1 byte each. The function is documented as returning the number of characters, but it actually returns the number of bytes. Not only that, but you can't even access the character normally. The character is at index 4 (0-based), but accessing it via "test[4]" returns " Ì ", which is number 204 in Unicode. However, using chrVal() on that character returns 192, and when you convert that back using chr(), you get " À ". On top of all this, the character still takes 2 bytes, so what happens when I access test[5] to get the other part? I get the character " ' ", but using chrVal() on that returns -1.
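To make the character-versus-byte difference concrete, here is a minimal sketch in Python (not FUZE), assuming the string is stored as UTF-8. It only shows what the UTF-8 bytes of this string are; what FUZE displays for a lone byte such as test[4] may differ.

```python
# Python sketch (not FUZE): character count versus UTF-8 byte count.
test = "abcd" + chr(255) + "efgh"   # the same text: "abcdÿefgh"
raw = test.encode("utf-8")          # the bytes a UTF-8 string would store

print(len(test))                    # 9 characters (code points)
print(len(raw))                     # 10 bytes
print(raw[4], raw[5])               # 195 191, i.e. 0xC3 0xBF, the two bytes of "ÿ"
print(raw[4:6].decode("utf-8"))     # ÿ (both bytes decoded together)
```

Note that Python slices exclude the end index, so raw[4:6] covers the same two bytes that test[4:5] appears to cover in FUZE.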

Now, this is all on what I'd consider a per-character basis, but if I use test[4:5], which grabs both bytes at the same time, I get the correct character and the value it represents. Functions like strFind() are fine when you look up these specific characters too. The issue is when special characters are used but you don't know where they are located in the string.
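One possible workaround when the positions are unknown is to walk the bytes and read each character's length from its first byte. The sketch below is Python rather than FUZE, assumes the data is valid UTF-8, and uses a hypothetical helper name (utf8_char_spans); the same idea could be ported using byte access and slices.

```python
# Python sketch (not FUZE): find where each character starts in UTF-8 data.
def utf8_char_spans(raw: bytes):
    """Yield (start, length) for each encoded character in raw.

    Hypothetical helper; assumes raw is valid UTF-8.
    """
    i = 0
    while i < len(raw):
        lead = raw[i]
        if lead < 0x80:                # 0xxxxxxx: 1 byte (ASCII range)
            n = 1
        elif lead >> 5 == 0b110:       # 110xxxxx: 2 bytes
            n = 2
        elif lead >> 4 == 0b1110:      # 1110xxxx: 3 bytes
            n = 3
        elif lead >> 3 == 0b11110:     # 11110xxx: 4 bytes
            n = 4
        else:                          # 10xxxxxx cannot start a character
            raise ValueError(f"unexpected byte at offset {i}")
        yield i, n
        i += n

raw = ("abcd" + chr(255) + "efgh").encode("utf-8")
spans = list(utf8_char_spans(raw))

print(len(spans))                             # 9 characters
start, n = spans[4]
print(start, n)                               # 4 2
print(raw[start:start + n].decode("utf-8"))   # ÿ
```

Counting the spans gives the character length, and the span at index 4 gives the byte range to slice for the fifth character.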

Is this whole thing a bug? Should len() have only returned 9? Should accessing via test[4] rather than test[4:5] retrieve the correct character?

12Me21

It's encoded in UTF-8.

Nisse5

@12Me21 Yes, but that doesn't answer the concerns. A Unicode string should still return the length of the string, not the storage size.

Personally, I think it's odd to use UTF-8 on a character level, rather than using 32-bit fixed Unicode values.

jacobmph

@Nisse5 said in Strings using special characters:

Personally, I think it's odd to use UTF-8 on a character level, rather than using 32-bit fixed Unicode values.

I kind of agree, but the whole of the internet is against us :-) Backward compatibility with ASCII wins out (even though I guess it’s no real advantage for FUZE).

Nisse5

@jacobmph said in Strings using special characters:

@Nisse5 said in Strings using special characters:

Personally, I think it's odd to use UTF-8 on a character level, rather than using 32-bit fixed Unicode values.

I kind of agree, but the whole of the internet is against us :-) Backward compatibility with ASCII wins out (even though I guess it’s no real advantage for FUZE).

7-bit ASCII (the printable range at least) is a subset of 32-bit fixed Unicode, so in that regard the compatibility would still be intact.

12Me21

Yeah, it really doesn't make sense to use UTF-8 here. It's good for reducing file sizes, but for strings it would make much more sense to use fixed-width 16- or 32-bit characters.
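As a rough illustration of the fixed-width idea, here is a Python sketch (not FUZE) using UTF-32, where character i always sits at byte offset 4 * i. One caveat: 16-bit units are only fixed-width for characters in the Basic Multilingual Plane, since UTF-16 encodes anything above U+FFFF as two units.

```python
# Python sketch (not FUZE): direct indexing with a fixed-width encoding.
text = "abcd" + chr(255) + "efgh"
raw32 = text.encode("utf-32-le")     # 4 bytes per character, no byte-order mark

i = 4
unit = raw32[4 * i:4 * i + 4]        # the character's bytes, found without scanning

print(len(raw32) // 4)               # 9 characters
print(int.from_bytes(unit, "little"))  # 255, the code point of "ÿ"

# UTF-16 is not truly fixed-width: U+1F600 needs two 16-bit units.
print(len("\U0001F600".encode("utf-16-le")))  # 4 bytes
```

The trade-off is storage: this 9-character string takes 36 bytes in UTF-32 versus 10 in UTF-8.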

niconii

@Nisse5 said in Strings using special characters:

@jacobmph said in Strings using special characters:

I kind of agree, but the whole of the internet is against us :-) Backward compatibility with ASCII wins out (even though I guess it’s no real advantage for FUZE).

7-bit ASCII (the printable range at least) is a subset of 32-bit fixed Unicode, so in that regard the compatibility would still be intact.

What they mean by "compatibility" is that ASCII and UTF-8 have the same exact representation for characters in the ASCII range. This isn't true for UTF-32, because each character will take up four bytes instead of one. In other words:

Text: Hello, world!
ASCII: 48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21
UTF-8: 48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21
UTF-16LE: 48 00 65 00 6C 00 6C 00 6F 00 2C 00 20 00 77 00 6F 00 72 00 6C 00 64 00 21 00
UTF-32LE: 48 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00 2C 00 00 00 20 00 00 00 77 00 00 00 6F 00 00 00 72 00 00 00 6C 00 00 00 64 00 00 00 21 00 00 00
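For anyone who wants to reproduce the byte listings above, a short Python sketch follows. The hex_dump helper is a made-up name for this example; the codec names are the standard Python ones.

```python
# Python sketch: print the bytes of the same text under each encoding.
text = "Hello, world!"

def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

encodings = [
    ("ASCII", "ascii"),
    ("UTF-8", "utf-8"),
    ("UTF-16LE", "utf-16-le"),
    ("UTF-32LE", "utf-32-le"),
]

for label, codec in encodings:
    print(f"{label}: {hex_dump(text.encode(codec))}")
```

The ASCII and UTF-8 lines come out identical, which is the compatibility being described.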