Multibyte characters in tests

Two community puzzles were recently added which use multibyte characters (é, £, €, …) in some of the validators. I’m talking about “7-segment display” (validator #2) and “Snake encoding” (validator #4).

This makes the puzzles more difficult to solve for some languages, and I don’t think that it’s very interesting.

Nevertheless, when such characters are used, please:

  • add at least one visible test case with multibyte characters;
  • maybe say clearly in the problem statement that such characters are allowed;
  • certainly don’t say that “the input text is composed of ASCII characters” (Snake encoding), since it’s not true.

So that’s what’s wrong with Snake encoding! Definitely a violation of constraints.

Update: the encoding is UTF-8. I hope this helps.

I would say: it’s totally uninteresting.

Sorry. I voted to approve both puzzles, and because of my language choice when solving the problem, I didn’t notice the issue. I will try to pay more attention to this in the future. For now, I agree with you that something should be done to rectify the situation for these two puzzles.

  • danBhentschel
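#include <iostream>
#include <locale>
#include <string>

using namespace std;

// Switch everything to wide characters in one place.
#define CHAR    wchar_t
#define STRING  wstring
#define CIN     wcin
#define COUT    wcout

int main()
{
    auto loc = locale("");   // the environment's preferred locale
    // ...
}

This is how I finally solved it in C++.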

Not only this, but the “Auto-generated code [that] aims at helping you parse the standard input according to the problem statement” in C uses “char LINE[21]; fgets(LINE, 21, stdin);” which is obviously wrong if multibyte characters are allowed.

Here is what I used in C:

And in Javascript:

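// Assumes readline() returns one character per input byte;
// escape() + decodeURIComponent() reinterprets those bytes as UTF-8.
function fromUTF8(s) {
    return decodeURIComponent(escape(s));
}

var line = fromUTF8(readline());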

Yes, I did try that, but for some reason it didn’t work (I used getwchar instead of scanf, and it kept returning -1, I don’t know why).
I have now solved the problem by writing a getutf8 and a pututf8 function, returning and taking an unsigned int, so I only have to #include <stdio.h>. Now I have seen your code, and you can see mine :wink:

The difference is that scanf reads multibyte chars (utf8) and makes the conversion to wide chars, while getwchar tries to read wide chars directly.

Yes, a getmbchar() should have been used, if such a routine existed.

Hopefully someone from CG will fix these puzzles soon.

For Snake encoding, this is the only line that needs to change:

it._If_S¤mebody_find ----> it._If_S0mebody_find

And the corresponding line in the output:

nn_ir.aI__e¤te.o_yhf ----> nn_ir.aI__e0te.o_yhf

For 7-segment display, Validator 2 uses the ´ character. It should probably be changed to ' instead.

  • danBhentschel

That’s a rather subtle difference… and I’m not even sure it is correct. With the input string “héhé” in UTF-8, getwchar is called 6 times (two times for each accented letter). If I call scanf on a wchar_t buffer with that same input string, wcslen also returns 6, and I can print its contents character by character with printf ("%lc", …), in which case printf is again called 6 times. Likewise I can iterate through that buffer and print its contents with putwchar, that must be called 6 times again.

Got it. Actually you should use setlocale(LC_CTYPE, "en_US.UTF-8") instead of setlocale(LC_CTYPE, ""), unless your default encoding happens to be UTF-8. Otherwise scanf, wcslen and friends have no way to magically detect that the input is UTF-8.

A last reply to myself: actually setlocale(LC_CTYPE, "en_US.UTF-8") doesn’t even work as I would have expected: you have to use either scanf and printf or getwchar and putwchar, you cannot mix them. At least it doesn’t work reliably on my computer, whose local encoding is ISO-8859-1: for example, putwchar-printf does not output the characters of the printf, and printf-putwchar-printf outputs the characters of both.

You actually can’t solve this in Javascript.

Every multibyte character printed with print() or printErr() comes out as the string “-60”, which obviously doesn’t validate.

Yes you can, but the input string should be correctly decoded. See my function fromUTF8 above.

Yeah, tried that but test 2 for 7-Segment display still won’t pass. I’m guessing emoticon.
As I said, it doesn’t matter what the input is: the print() function can’t output multibyte characters.

Thanks for the tip, very useful.

AFAIU, setlocale(LC_CTYPE, "") makes sense in the CG IDE, since the code is executed on the server side, probably with the same locale as the puzzle text.
On your local machine, the locale might of course be different.

Bump! This thread has become a bit more relevant now that “7-segment display” was made puzzle of the week, and its issue is still not fixed.

Offtopic: Is the choice of puzzle of the week automated?