Review of existing puzzles

Hi All,

I’ve been somewhat active with community puzzles recently, and I’m concerned that CodinGame may not be providing the best user experience, especially for newcomers who may stumble upon puzzles that are misleading, miscategorized, or not on par with the rest of the site’s offerings. Puzzles are already removed when their rating drops below 3, but there are also good puzzle ideas whose problematic implementations could be a whole lot better.

What I’d like to suggest is establishing additional criteria for reviewing existing puzzles, a low bar if you will, or applying a triple-nomination scheme or something similar, so that critical feedback provided beyond the initial 3 approvals can be taken into account. My specific suggestion is to identify such puzzles by completion success rate and occasionally take one through the approval process again, to at least identify weaknesses and decide whether there is a pressing enough need to do anything about it.

In some but not all cases, the best course of action may be to change the puzzle. Of course we would want this to be rare, since it can be extremely disruptive, but with procedures in place the negatives can be mitigated. For instance, it may make sense to recreate a puzzle under a similar name and deprecate the old one. I’d like to start talking about doing that specifically with Parse SQL Queries, an easy puzzle with a 12.5% success rate, less than half that of the next lowest easy puzzle and far below the 65% bottom-quartile mark for easy puzzles. With an average rating of 3.84 submitted by over 500 users who were able to solve it, the puzzle is in no danger of being removed.

I’m raising this as a discussion topic first because I don’t want to be too bold. Sometimes there are changes to be made that appear obvious to me but apparently not so to others, so I’d appreciate any feedback to align with community expectations.

If it weren’t for the rule about originality, my next course of action would be to replicate Parse SQL Queries and fix any design issues myself. That’s because getting permission to edit is cumbersome even when changes are relatively minor, and I have no reason to believe they would be minor here. With 500 solvers, furthermore, most edits are out of the question. Adjusting the difficulty level may be an exception, but the success rate is lower than that of any medium puzzle, so that feels like a band-aid.

Any thoughts?

Cheers!
David Villa

2 Likes

Just a slight clarification first: a player can rate a puzzle after submitting their code once, even if the code doesn’t pass any validators! However, I assume most ratings are submitted by players who have successfully solved the puzzle being rated.

Parse SQL Queries is an interesting case because the author has removed their account, making it potentially impossible to obtain their permission for you to edit the puzzle. However, if you feel there are fundamental issues with the puzzle that need fixing, please proceed with the edits and leave a comment in the contribution area detailing what you’ve edited and why (I’ve done so myself, though usually for minor fixes). Alternatively, if you prefer to get feedback from fellow players before editing, you can post in the discussion thread of the puzzle here.

I’m not so sure about the idea of creating a new formal process to review / deprecate and recreate existing puzzles.

I fully agree that the difficulty level is miscategorised for some puzzles, but I’m hesitant to change it since difficulty can be highly subjective. I’d prefer a new system where difficulty is adjusted automatically based on success rate and the number of people who have solved the puzzle. This issue alone warrants starting a new forum topic.

1 Like

Ah, thanks for the clarification. The 3.84 stars for Parse SQL Queries is based on 188 ratings, presumably most submitted after completing the puzzle, though we can’t really tell from the stats. Importantly, it is not at risk of removal: it would take 26 more 1-star ratings, as many again as it already has (and consecutive at that), to bring it below the threshold of 3. So in this case there is a failure that’s very apparent and that the automated system has not caught and is not going to catch.

The idea of adjusting the difficulty level automatically is entirely relevant here. I would like to establish a process for handling these cases, and an automated system is still a process that needs defining. My worry about making it automatic is that we don’t yet know specifically why there are issues, or at least I don’t; you have more experience. If it turns out all we’re lacking is a clear delineation between difficulty levels, then I’d be all for automation.

Let me summarize what I’ve seen in the comments. Only in one case so far does it appear to be solely a matter of the puzzle being misclassified. That may be so for Parse SQL Queries as well, if it truly deserves to be all the way up at the hard level, which is doubtful. I will push forward with the investigation on the puzzle’s thread and summarize the results here. Yet there is a trend in the data that leads me to think this change in isolation would just be masking some underlying issue.

It may seem somewhat arbitrary, but the stats fit fairly well:

Most easy puzzles have a success rate better than 1/2. Only 5.6% of active puzzles do not.
Most medium puzzles have a success rate better than 1/4. Only 2.5% of active puzzles do not.
Most hard puzzles have a success rate better than 1/8. Only 2.9% of active puzzles do not.
Most very hard puzzles have a success rate better than 1/16. Only 6.5% of active puzzles do not.

If we exclude the Netflix contest, which is intended to have a low success rate, and 2 puzzles that I would have already bumped up from easy were it not for the issue of permissions, then these figures fall below 5% at all levels. One potential reason medium and hard puzzles clear their thresholds more often is that there are misclassifications in the other direction, whereas very hard is more guarded. It could also be that, on the whole, easy puzzles tend to be written by less experienced authors who let more flaws slip in. Regardless, it should be very instructive to have some guideline to follow.
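To make that concrete, here’s a minimal sketch of what an automated check against these thresholds might look like. The threshold table, the `flag_for_review` function, and the attempt counts are my own illustration, not anything CodinGame actually exposes:

```python
# Hypothetical success-rate floors per difficulty level, taken from the
# stats above: a puzzle is flagged when it falls below its floor.
SUCCESS_FLOOR = {
    "easy": 1 / 2,
    "medium": 1 / 4,
    "hard": 1 / 8,
    "very hard": 1 / 16,
}

def flag_for_review(difficulty: str, attempts: int, completions: int,
                    min_attempts: int = 100) -> bool:
    """Flag a puzzle whose success rate is below the floor for its
    assigned difficulty, once there is enough data to judge."""
    if attempts < min_attempts:
        return False  # too few attempts to draw conclusions
    return completions / attempts < SUCCESS_FLOOR[difficulty]

# Parse SQL Queries: easy, ~500 solvers at a 12.5% success rate
# implies roughly 4000 attempts, well under the 50% floor.
print(flag_for_review("easy", attempts=4000, completions=500))  # True
```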

Just one minor thought in this interesting thread: I think success rates are heavily influenced by which players try which puzzles. Many codingamers avoid medium/hard/very hard puzzles (I myself regularly solve an easy puzzle, but only start looking at medium and above when I very much feel like it). So higher success rates amongst medium and hard puzzles might simply be because those puzzles are only started by the more experienced codingamers.

Related to this, I’m not sure about basing difficulty levels on success rates. While beginning codingamers can currently try puzzles in the easy category safely and have fun, with a success-rate-based system it would regularly happen that beginning players have the negative experience of not being able to solve a puzzle (for example, a puzzle rated hard on the basis of a 10% success rate means 90% of the players who attempt it have a negative experience).

1 Like

I don’t think that automatic difficulty is a good idea. The difficulty of a task is generally very subjective. And it strongly depends on the richness of the standard library of the programming language.

But there are problems with the quality of tasks, that’s obvious. Sometimes they are simply impossible to understand without studying the hints on the forum.

1 Like

Parse SQL Queries was not a difficult puzzle. I changed it to Medium because that’s typical for parsing, and also because there seems to be an unwritten understanding that it wasn’t actually Easy, as it was classified and as the statement itself claimed. When a new user was having trouble, one of the discussion comments was to “try an easier puzzle”. So there’s definitely a disconnect between the documented puzzle difficulty and what the community believes. In my opinion it would be better to determine this by consensus rather than leaving it up to the author. In the case of Parse SQL Queries, however, that’s a minor point, as there were much bigger issues. And from this we should conclude that automatically adjusting difficulty based on completion rates would not be entirely accurate: while it’s true that 12% is an exceedingly low success rate, Parse SQL Queries is not a Very Hard problem just because it had design issues.

It’s too early to know whether my changes will have a significant impact, but I believe the fundamental problem with the way the puzzle was written is that it was too open-ended. This is illustrated by a seemingly innocuous link to the W3Schools documentation for SQL queries, which would be overwhelming even for someone knowledgeable about databases, let alone anyone new to them. It is not good introductory material, nor is it appropriate reference material for this very restricted puzzle, which did not even have multiple conditions using AND and OR despite referring to them as if it did. There were also superfluous, unexplained references to row indexing, dropping tables, and numerical columns. Concerning the last point in particular, I was able to simplify significantly with only minor changes to one test case, because the validators did not actually test certain situations. It would have been a better puzzle had it also included sorting on string values and comparisons on numerical values, but because the puzzle was so lightly documented, and furthermore not already sufficiently tested, I could not introduce or extend these in an existing puzzle. In short, I spent a lot of time tidying up little details to better guide the user, approaching the puzzle from the standpoint of someone trying to solve it rather than someone trying to learn SQL.

While hopefully all of these changes will be beneficial, I do not believe most of them to be critical, aside from one or two. However, there was one very simple change that I’m sure could have had a massive impact on the perception of the puzzle, and therefore on players’ determination to complete it once started. For some reason the validators had been given names suggesting they tested more advanced features such as “ternary operators”, which of course were not present in the test cases and not described in the statement. In truth they were not tested in the validators either, but since validators are hidden, users really couldn’t be expected to know that. It’s quite possible that many people quit after failing validators for reasons they could not anticipate, or gave up upon reading that the last test case was gargantuan when it was actually not much larger, and in fact simpler, than the one before it. Furthermore, five tests/validators is not enough coverage even for a single aspect of the puzzle like sorting, let alone for the open-ended problem as presented. Things not tested for include different spacing, a different order of WHERE and ORDER BY, different casing, multiple WHERE clauses, multiple ORDER BY clauses, and so on. It certainly feels like a lot of motivation could be lost programming features whose extent is not clearly defined; seeing those validator names would then have been enough to throw in the towel.
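To illustrate the coverage gap, here are a few hypothetical inputs of the kinds listed above that the five existing tests never exercise. I’m writing them from memory of the puzzle’s SQL subset, so treat the exact syntax as illustrative rather than taken from the actual test cases:

```python
# Hypothetical query variants that additional validators could cover.
untested_variants = [
    "SELECT  name ,age FROM people",                        # irregular spacing
    "select name from people where age > 30",               # different casing
    "SELECT name FROM people ORDER BY age WHERE age > 30",  # clauses swapped
    "SELECT name FROM people WHERE age > 30 WHERE id > 5",  # multiple WHERE
]
```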

My experience editing this puzzle is that it is too difficult to plan without just jumping in and going for it. There are simply too many small changes to enumerate: things that are clearly broken, like the formatting; more subjective decisions, like which term to use when aiming for consistency across the puzzle; and dozens of other minuscule changes where it’s too much trouble to stop and ask whether something actually needs to change for the better or would be fine left slightly awkward, as the author originally intended. Of course I did hold back from rewriting most of the text, but I’m guessing I wouldn’t have uncovered many of the issues that were resolved had I held back too much. On the one hand, I feel I could have improved the puzzle much more given the freedom to change the problem itself instead of being restrained by existing test cases; for instance, I could have created a new puzzle in the same vein. Yet I know that would also have taken much longer, and at the same time, the most pressing issues I can count on one hand. So I’m really at a loss for recommendations on how to move forward with problematic puzzles like these, and I will not repeat the exercise. Except that I do think moderators should continue to make alterations when it is apparent that they are beneficial, and that being beneficial is probably a sufficient condition, rather than requiring the puzzle to actually be broken in some way. In particular, the mischaracterized difficulty level and incorrect validator names did not actually break the puzzle, but it was apparent from the comments that fixing them would be beneficial. At least, if these came up in any of my puzzles, I hope I’d have the resolve to correct them immediately.

1 Like