What sort of characters can be stored in a string?
-
So I'm working on my MML player, but first I'm making a parser to validate the MML strings. Initially I was just going to parse it for validation, then during playback, I'd process the strings as they are, but then I got to thinking that I could reduce playback processing by taking the different elements of the MML string, and reducing them to key values. In other words, if I have different elements like "@E" (for Envelope), "@ER" (for deactivating an Envelope), "V" (for Volume), etc, instead of comparing string characters, I have those converted to numbers ("@E" becomes 1, "@ER" becomes 2, "V" becomes 3, etc), then I convert those numbers to characters to be placed in the string instead, so during playback, I simply read a character, convert to a number, and process that way. Now, I assume converting the number 0 to a character is out of the question because typically, a value of 0 is a NULL character, and ends a string.
Other than that, so long as I'm not printing the string, can I safely store any converted number as a character? Are strings stored as 2 bytes per character (Unicode)? That would help, because 1 byte each would cause an issue. I want to have at least 256 values in a row to work with.
-
Having the character '0' in a string is no problem at all! Don't worry about null terminators, that's all taken care of internally. Strings are stored as one byte per character currently.
-
Hmm, just did some testing. Using chr(0) to force a NULL character between two strings seems to do nothing, as if it is discarded. I imagine this is done on purpose? I imagine this is one reason why we can't make assignments to individual elements in a string (as trying to do so causes an error).
Anyways, I began testing string length, and it would seem it isn't quite one-byte per character. If I use chr to convert individual numbers, I get the following lengths...
0 = 0
1 ~ 127 = 1
128 ~ 2047 = 2
2048 ~ 65535 = 3
65536 ~ 1114111 = 4
1114112 ~ .... = 0
Anyways, at least this gives me something to work with. I can encode my MML commands as single-byte characters (127 is enough for that), then use two-byte characters for their parameters when needed (which can then be reduced to a range of 0 ~ whatever by subtracting 128 from the value).
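For illustration, here is a minimal sketch of that scheme in Python rather than the language discussed in this thread, so chr and ord below are Python's built-ins standing in for chr and chrVal. The names utf8_len, pack, unpack and PARAM_OFFSET are made up for this example. One difference to be aware of: Python keeps code point 0 as an ordinary one-byte character, unlike the test results above.

```python
PARAM_OFFSET = 128  # parameters are shifted up so they never collide with commands


def utf8_len(code_point):
    """Number of bytes the code point occupies once encoded as UTF-8."""
    return len(chr(code_point).encode("utf-8"))


def pack(command, param=None):
    """Encode a command (1..127) and an optional parameter (0..1919) as text."""
    if not 1 <= command <= 127:
        raise ValueError("command must fit in a single-byte character")
    out = chr(command)
    if param is not None:
        if not 0 <= param <= 2047 - PARAM_OFFSET:
            raise ValueError("parameter must fit in a two-byte character")
        out += chr(param + PARAM_OFFSET)
    return out


def unpack(text):
    """Decode text produced by pack() back into (command, param) pairs."""
    pairs = []
    i = 0
    while i < len(text):
        command = ord(text[i])
        i += 1
        param = None
        # A following character >= 128 is a parameter, not a new command.
        if i < len(text) and ord(text[i]) >= PARAM_OFFSET:
            param = ord(text[i]) - PARAM_OFFSET
            i += 1
        pairs.append((command, param))
    return pairs
```

With these helpers, unpack(pack(3, 100) + pack(2)) returns [(3, 100), (2, None)]. Keeping parameters in the two-byte range limits them to 0 ~ 1919, which still clears the 256 values asked for.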
-
Sorry for the confusion. If you're using the chr function to work with characters, the result is encoded in UTF-8, which uses 1-4 bytes to store information. chrVal will decode them from UTF-8 back to an integer.
-
From what I can tell, certain characters are not possible to obtain in strings, due to chr outputting UTF-8:
0-127 (0xxxxxxx) - direct from chr
128-191 (10xxxxxx) - 2nd/3rd/4th byte of multi-byte chr (>127)
192-223 (110xxxxx) - 1st byte of 11 bit chr
224-239 (1110xxxx) - 1st byte of 16 bit chr
240-247 (11110xxx) - 1st byte of 21 bit chr
248-255 (11111xxx) - unobtainable
And even if you can create these chars, you won't be able to read their values very easily, because chrVal will probably fail with invalid UTF-8 data.
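The byte ranges above can be checked outside this thread's language. Here is a small Python sketch, where classify_byte and decodes_as_utf8 are hypothetical helpers written for this example. The first sorts a byte value into the categories listed, and the second shows how Python's own UTF-8 decoder treats a given byte sequence. Whether chrVal behaves the same way on bad input is an assumption.

```python
def classify_byte(b):
    """Return the role a byte value (0..255) can play in UTF-8."""
    if not 0 <= b <= 255:
        raise ValueError("not a byte")
    if b < 0x80:
        return "single"        # 0xxxxxxx
    if b < 0xC0:
        return "continuation"  # 10xxxxxx
    if b < 0xE0:
        return "lead-2"        # 110xxxxx
    if b < 0xF0:
        return "lead-3"        # 1110xxxx
    if b < 0xF8:
        return "lead-4"        # 11110xxx
    return "invalid"           # 11111xxx


def decodes_as_utf8(raw):
    """True if the bytes form valid UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, decodes_as_utf8(bytes([0xF8])) is False in Python. Strict decoders also reject 192, 193 and 245-247 as lead bytes, so the set of byte values that never appear is slightly larger than the 248-255 row suggests.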
You should probably just use an array.
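A rough sketch of the array approach, written in Python and assuming the command numbering from the first post ("@E" = 1, "@ER" = 2, "V" = 3). The COMMANDS table and the tokenize function are hypothetical, and the rule that a command may be followed by a decimal parameter is an assumption about the MML dialect.

```python
COMMANDS = {"@E": 1, "@ER": 2, "V": 3}  # numbering taken from the first post
# Try longer names first so "@ER" is not mistaken for "@E" followed by "R".
NAMES_BY_LENGTH = sorted(COMMANDS, key=len, reverse=True)
DIGITS = "0123456789"


def tokenize(mml):
    """Turn an MML string into a list of (opcode, param) tuples."""
    tokens = []
    i = 0
    while i < len(mml):
        for name in NAMES_BY_LENGTH:
            if mml.startswith(name, i):
                i += len(name)
                break
        else:
            raise ValueError(f"unknown command at position {i}")
        # Collect an optional run of decimal digits as the parameter.
        start = i
        while i < len(mml) and mml[i] in DIGITS:
            i += 1
        param = int(mml[start:i]) if i > start else None
        tokens.append((COMMANDS[name], param))
    return tokens
```

Here tokenize("@E2V100@ER") returns [(1, 2), (3, 100), (2, None)]. Because the values are stored as plain integers, 0 and anything above 255 need no special handling, and playback never has to decode UTF-8.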