Multibyte characters in tests

Plopx · April 13, 2016, 11:28am

Two community puzzles were recently added which use multibyte characters (é, £, €, …) in some of their validators. I’m talking about “7-segment display” (validator #2) and “Snake encoding” (validator #4).

This makes the puzzles more difficult to solve for some languages, and I don’t think that it’s very interesting.

Nevertheless, when such characters are used, please:

add at least one visible test case with multibyte characters;
maybe state clearly in the problem statement that such characters are allowed;
certainly don’t say that “the input text is composed of ASCII characters” (Snake encoding), since it’s not true.

magaiti · April 13, 2016, 12:10pm

So that’s what’s wrong with Snake encoding! Definitely a violation of constraints.

update: the encoding is UTF-8. I hope this helps.

durango · April 13, 2016, 2:28pm

I would say: it’s totally uninteresting.

player_one · April 13, 2016, 2:40pm

Sorry. I voted to approve both puzzles, and because of my language choice when solving the problem, I didn’t notice the issue. I will try to pay more attention to this in the future. For now, I agree with you that something should be done to rectify the situation for these two puzzles.

danBhentschel

magaiti · April 13, 2016, 3:54pm

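#include <iostream>
#include <locale>
#include <string>

using namespace std;

// Wide-character aliases, so the rest of the solution only uses these names
#define CHAR    wchar_t
#define STRING  wstring
#define CIN     wcin
#define COUT    wcout

int main()
{
    ios_base::sync_with_stdio(false);
    auto loc = locale("");   // the environment's default locale
    CIN.imbue(loc);
    COUT.imbue(loc);

    // ... read with CIN, solve, write with COUT ...
}

This is how I finally solved it in C++.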

durango · April 13, 2016, 4:00pm

Not only this, but the “Auto-generated code [that] aims at helping you parse the standard input according to the problem statement” in C uses “char LINE[21]; fgets(LINE, 21, stdin);” which is obviously wrong if multibyte characters are allowed.
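To make the buffer problem concrete, here is a minimal sketch in C, assuming the input is UTF-8 (as reported earlier in the thread) and that a line holds at most 20 characters, which is only inferred from the LINE[21] in the quoted stub. The names MAX_CHARS, LINE_BYTES, utf8_length and read_utf8_line are made up for this illustration and are not part of the generated code.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_CHARS  20                    /* assumed limit, inferred from LINE[21] */
#define LINE_BYTES (MAX_CHARS * 4 + 2)   /* 4 bytes per code point + '\n' + '\0' */

/* Count code points: every byte except a continuation byte (10xxxxxx) starts one. */
size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* Read one line into buf (at least LINE_BYTES long) and strip the line ending.
   Returns the number of code points read, or -1 at end of input. */
long read_utf8_line(char *buf, FILE *in)
{
    if (fgets(buf, LINE_BYTES, in) == NULL)
        return -1;
    buf[strcspn(buf, "\r\n")] = '\0';
    return (long)utf8_length(buf);
}
```

With a buffer declared as char line[LINE_BYTES], a call such as read_utf8_line(line, stdin) leaves room for 20 characters even if each of them takes 4 bytes, whereas a 21-byte buffer would cut such a line short.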

Plopx · April 13, 2016, 8:34pm

Here is what I used in C:
https://forum.codingame.com/t/snake-encoding-community-puzzle/1423/8?u=plopx

And in Javascript:

function fromUTF8(s) {
    return decodeURIComponent(escape(s));
}
var line = fromUTF8(readline());

durango · April 13, 2016, 9:28pm

Yes, I did try that, but for some reason it didn’t work (I used getwchar instead of scanf, and it kept returning -1, I don’t know why).
I have now solved the problem by writing a getutf8 and a pututf8 function, returning and taking an unsigned int, so I only have to #include stdio.h. Now that I have seen your code, you can see mine.
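For readers who want to try the same approach, here is a rough sketch of what such a pair of helpers could look like. It is not the code referred to above: utf8_decode and utf8_encode are hypothetical names, and they work on memory buffers rather than on stdin and stdout. The decoder checks the bit patterns only, so it does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Decode one code point starting at s and store it in *cp.
   Returns the number of bytes consumed (1 to 4), or 0 if the
   sequence is malformed or truncated. */
size_t utf8_decode(const char *s, unsigned int *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned int c = p[0];
    size_t n, i;

    if (c < 0x80) { *cp = c; return 1; }                 /* 0xxxxxxx */
    else if ((c & 0xE0) == 0xC0) { n = 2; c &= 0x1F; }   /* 110xxxxx */
    else if ((c & 0xF0) == 0xE0) { n = 3; c &= 0x0F; }   /* 1110xxxx */
    else if ((c & 0xF8) == 0xF0) { n = 4; c &= 0x07; }   /* 11110xxx */
    else return 0;                                       /* stray continuation byte */

    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)   /* also stops at the terminating '\0' */
            return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }
    *cp = c;
    return n;
}

/* Encode cp into out (room for 4 bytes, no terminator is written).
   Returns the number of bytes written, or 0 if cp is above U+10FFFF. */
size_t utf8_encode(unsigned int cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Working on code points stored as unsigned int keeps the solution independent of wchar_t and of the locale, which seems to be the appeal of this approach.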

Plopx · April 14, 2016, 7:26am

The difference is that scanf reads multibyte chars (utf8) and makes the conversion to wide chars, while getwchar tries to read wide chars directly.

magaiti · April 14, 2016, 8:42am

Yes, a getmbchar() should have been used, if such a routine even exists.

player_one · April 14, 2016, 4:12pm

Hopefully someone from CG will fix these puzzles soon.

For Snake encoding, this is the only line that needs to change:

it._If_S¤mebody_find ----> it._If_S0mebody_find

And the corresponding line in the output:

nn_ir.aI__e¤te.o_yhf ----> nn_ir.aI__e0te.o_yhf

For 7-segment display, Validator 2 uses the ´ character. It should probably be changed to ' instead.

danBhentschel

durango · April 14, 2016, 9:09pm

That’s a rather subtle difference… and I’m not even sure it is correct. With the input string “héhé” in UTF-8, getwchar is called 6 times (two times for each accented letter). If I call scanf on a wchar_t buffer with that same input string, wcslen also returns 6, and I can print its contents character by character with printf ("%lc", …), in which case printf is again called 6 times. Likewise I can iterate through that buffer and print its contents with putwchar, that must be called 6 times again.

durango · April 15, 2016, 7:26am

Got it. Actually you should use setlocale(LC_CTYPE, "en_US.UTF-8") instead of setlocale(LC_CTYPE, ""), unless your default encoding happens to be UTF-8. Otherwise scanf, wcslen and friends have no way to magically detect that the input is UTF-8.
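The effect of the locale on the character count can be checked with a small sketch like the following. The helper name wide_length_in_locale is made up for this example, and which locale names are installed depends on the system, so the function reports a missing locale instead of assuming one.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Count the wide characters that `bytes` converts to under the given locale.
   Returns -2 if the locale is not available on this system and -1 if the
   bytes are not valid in that locale's encoding.
   Note: this changes the LC_CTYPE locale of the whole process. */
long wide_length_in_locale(const char *locale_name, const char *bytes)
{
    mbstate_t state;
    size_t n;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -2;

    memset(&state, 0, sizeof state);           /* initial conversion state */
    n = mbsrtowcs(NULL, &bytes, 0, &state);    /* NULL dst: count only */
    return n == (size_t)-1 ? -1 : (long)n;
}
```

Where a UTF-8 locale such as "en_US.UTF-8" is installed, the 6 bytes of "héhé" should be counted as 4 wide characters. Under a single-byte locale such as ISO-8859-1, each byte becomes its own wide character, which would explain the count of 6 reported above.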

durango · April 15, 2016, 9:00am

A last reply to myself: actually setlocale(LC_CTYPE, "en_US.UTF-8") doesn’t even work as I would have expected: you have to use either scanf and printf or getwchar and putwchar, you cannot mix them. At least it doesn’t work reliably on my computer, whose local encoding is ISO-8859-1: for example, putwchar-printf does not output the characters of the printf, and printf-putwchar-printf outputs the characters of both.

Solid3 · April 15, 2016, 12:30pm

You actually can’t solve this in Javascript.

All multibyte characters that are printed with print() or printErr() come out as the string “-60”, which obviously doesn’t validate.

Plopx · April 15, 2016, 12:33pm

Yes you can, but the input string should be correctly decoded. See my function fromUTF8 above.

Solid3 · April 15, 2016, 12:52pm

Yeah, tried that but test 2 for 7-Segment display still won’t pass. I’m guessing emoticon.
As I said, it doesn’t matter what the input is: the print() function can’t output multibyte characters.

Djoums · April 15, 2016, 6:13pm

Thanks for the tip, very useful.

magaiti · April 15, 2016, 9:59pm

AFAIU, setlocale(LC_CTYPE, "") makes sense in the CG IDE, since the code is executed on the server side, probably using the same locale as the puzzle text.
On your local machine, the locale might of course be different.

Visual · May 24, 2016, 4:05pm

Bump! This thread has become a bit more relevant now that “7-segment display” was made puzzle of the week, and its issue is still not fixed.

Offtopic: Is the choice of puzzle of the week automated?