Official documentation for "100k character limit"?

I’d never run into the source code character limit until playing the Squash Pi challenge. I have found references to a 100k character limit in various forum threads and in the Help Center:

The code size for any game is limited to 100k characters.

Having spent a lot of time in C with char, I assumed this meant a 100-kilobyte limit, which would make the Squash Pi puzzle significantly harder. Every new programmer knows what a character is, but eventually you learn some of The Absolute Minimum Every Software Developer Must Know About Unicode and realize the term isn’t very exact.

I did some testing of the current limit using various code points, recording the count of accepted characters and their byte sizes in the different UTF encodings (a small Python sketch to reproduce these follows below):

100,000 x  chr(61440)    utf-8 ef-80-80    utf-16 f0-00    utf-32 00-00-f0-00

 50,000 x 󰀀 chr(983040)    utf-8 f3-b0-80-80    utf-16 db-80-dc-00    utf-32 00-0f-00-00

 14,285 x 🤦🏼‍♂️ (Face Palm with Light Skin Tone, Male Sign)
Notice that 100,000 / 7 ≈ 14,285.
Made up of 5 code points, totalling 7 UTF-16 code units:
🤦 chr(129318)    utf-8 f0-9f-a4-a6    utf-16 d8-3e-dd-26    utf-32 00-01-f9-26
🏼 chr(127996)    utf-8 f0-9f-8f-bc    utf-16 d8-3c-df-fc    utf-32 00-01-f3-fc
‍ chr(8205)    utf-8 e2-80-8d    utf-16 20-0d    utf-32 00-00-20-0d
♂ chr(9794)    utf-8 e2-99-82    utf-16 26-42    utf-32 00-00-26-42
️ chr(65039)    utf-8 ef-b8-8f    utf-16 fe-0f    utf-32 00-00-fe-0f

The only measure I found that has a consistent limit is UTF-16 code units.
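For anyone who wants to reproduce these measurements, something like the following Python 3 sketch should print the same byte sequences and code-unit counts (dump and utf16_units are just made-up helper names; the accepted-character counts above came from trial and error in the CG IDE):

    # Show a code point's value and its bytes in each UTF encoding.
    def dump(ch):
        utf8 = "-".join(f"{b:02x}" for b in ch.encode("utf-8"))
        utf16 = "-".join(f"{b:02x}" for b in ch.encode("utf-16-be"))
        utf32 = "-".join(f"{b:02x}" for b in ch.encode("utf-32-be"))
        print(f"chr({ord(ch)})  utf-8 {utf8}  utf-16 {utf16}  utf-32 {utf32}")

    # The facepalm emoji: five code points, seven UTF-16 code units in total.
    for ch in "\U0001F926\U0001F3FC\u200D\u2642\uFE0F":
        dump(ch)

    # The measure the limit appears to track: UTF-16 code units per string.
    def utf16_units(s):
        return len(s.encode("utf-16-be")) // 2

    print(utf16_units(chr(61440) * 100_000))   # 100000
    print(utf16_units(chr(983040) * 50_000))   # 100000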
Question: is this guaranteed, or just an implementation detail that could change if, e.g., JS switched to UTF-8 strings? It would be great to communicate this in language-agnostic terms.


TL;DR: Can we update the official Help Center to say

The code size for any game is limited to 100k characters (measured in UTF-16 code-units).

1 Like

I’m fighting with this problem as well. I think 20 bits out of 32 can be usably stored as UTF-16 in the CG IDE. The best means I have to view the UTF encoding for copy/pasting is a web browser, though I’m having trouble getting usable info into it and then out again. My understanding is that d000-dfff is no good. I can insert UTF-8 codes, but it also seems to mangle 0000-000a as well as anything in d000-dfff. So I think you can pack 3 decimal digits into 12 bits … write that to a file, copy-paste it, then multi-byte decode the string in the IDE. But I can’t make it work just yet. Any luck?
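A rough Python sketch of that pack-3-digits-per-code-point idea might look like this (pack3, unpack3 and the 0x3000 offset are only illustrative; the offset just assumes a range that avoids the control characters and the surrogates, as discussed further down the thread):

    BASE = 0x3000  # assumed-safe offset: skips controls, stays below the surrogates

    def pack3(digits):
        # 3 decimal digits (0..999) per code point
        return "".join(chr(BASE + int(digits[i:i + 3].ljust(3, "0")))
                       for i in range(0, len(digits), 3))

    def unpack3(s, ndigits):
        return "".join(f"{ord(c) - BASE:03d}" for c in s)[:ndigits]

    packed = pack3("141592653589793")
    assert unpack3(packed, 15) == "141592653589793"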

1 Like

Your strategy sounds good. Maybe your IDE is messing up the copy-paste – try different tools? It also helps to learn to use your programming language to convert Unicode characters to hex, given your solution will need to do this too.

Using 20 bits out of 32, you have 2^20 = 1,048,576 options, which can hold 6 decimal digits per 2 UTF-16 code units. Your source is limited to 100k code units (200 kB), so this scheme would manage 300k digits – workable for storing 295k digits if your decompression code fits in the last 5k.

I found it was more efficient to stick with code points below 0xD800 = 55296, which gives 4 digits per single code unit.
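A sketch of that denser packing, with illustrative names and an offset picked to stay clear of control characters, the direction-control marks mentioned in the next reply, and the surrogate range:

    OFFSET = 0x3000  # 0x3000 + 9999 = 0x570F, comfortably below 0xD800

    def pack4(digits):
        # 4 decimal digits (0..9999) per UTF-16 code unit
        return "".join(chr(OFFSET + int(digits[i:i + 4].ljust(4, "0")))
                       for i in range(0, len(digits), 4))

    def unpack4(s, ndigits):
        return "".join(f"{ord(c) - OFFSET:04d}" for c in s)[:ndigits]

    assert unpack4(pack4("31415926535897932384"), 20) == "31415926535897932384"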

2 Likes

6 decimal digits per 2 UTF-16 code units is fine, but there is a much simpler solution (at least on CG): use only the Unicode chars that are 2 bytes in UTF-16.
Some code points cannot be used (surrogates U+D800…DFFF, control chars U+0000…001F, U+007F, U+0080…009F). I also had some problems in the IDE with some special “control” code points above U+2000 messing up left-to-right and right-to-left writing.
Still, you have over 50,000 valid code points, more than enough for 4 decimal digits (if we are talking about the Squash Pi puzzle). For example, a contiguous valid range is U+3000 to U+CFFF. You can even encode ~15.5 bits of information per single UTF-16 code unit, although that puzzle does not need to go to that extreme.
So even if your source file is over 200 kB in UTF-8, CG will still accept it.
This is because U+0800…U+FFFF codepoints are 3 bytes in UTF-8 but only 2 bytes in UTF-16.
Intuitively, using U+010000…U+10FFFF (4 bytes in both UTF-8 and UTF-16) would be the most effective for encoding data, but not on CG, because of the UTF-16-based counting method they use. (Most likely they read the source file into a Java String and check its length; Java strings use UTF-16 internally.)
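For completeness, a sketch of squeezing raw bits into that contiguous U+3000…U+CFFF range (base-40960, roughly 15.3 bits per code unit; using every valid 2-byte code point as described above gets you closer to ~15.5 bits). Names are made up, and the big-integer conversion is quadratic, so a real solution would want to chunk the payload:

    LO, HI = 0x3000, 0xCFFF
    RADIX = HI - LO + 1            # 40960 symbols per code unit, ~15.3 bits

    def encode_bytes(data):
        # Treat the payload as one big integer and write it out in base 40960.
        n = int.from_bytes(data, "big")
        out = []
        while n:
            n, r = divmod(n, RADIX)
            out.append(chr(LO + r))
        return "".join(reversed(out)) or chr(LO)

    def decode_bytes(s, nbytes):
        n = 0
        for c in s:
            n = n * RADIX + (ord(c) - LO)
        return n.to_bytes(nbytes, "big")

    payload = b"3.14159265358979"
    assert decode_bytes(encode_bytes(payload), len(payload)) == payload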

2 Likes