[Community Puzzle] 8086 Assembler, part I - Puzzle discussion

Coding Games and Programming Challenges to Code Better

Created by @JDOnline,validated by @JeZzElLutin,@Eulero314 and @UnknownException.
If you have any issues, feel free to ping them.

Hello I got stuck on one test.

Can I get some help with test case 09 Strong Parser?

Failure:
Found: 0000 | 83 DA FF | add bx, ‘;’-‘,’+‘'’-‘\’ ; ‘this, is, a, comment’
Expected: 0000 | 83 C3 DA | add bx, ‘;’-‘,’+‘'’-‘\’ ; ‘this, is, a, comment’

ASCII codes values used in expression:

‘;’ = 59
‘,’ = 44
‘'’ (single quote) = 39
‘\’ (backslash) = 92

Calculation:
59 - 44 + 39 - 92 = -38

Steps I followed:
Calculated the expression value as -38.
Converted -38 to two’s complement in 16-bit resulting in hexadecimal: FFDA
Stored the result in little-endian format: DA FF

The expected output seems to use a different encoding? C3 DA.

I confirmed the conversion at Decimal to Hexadecimal Converter

Is there’s a specific encoding rule I’m missing?

Found: 0000 | 83 DA FF | add bx, ‘;’-‘,’+‘‘’-‘\’ ; ‘this, is, a, comment’
Expected: 0000 | 83 C3 DA | add bx, ‘;’-‘,’+‘’’-‘\’ ; ‘this, is, a, comment’

The “DA” part of your answer is correct, but the instruction is of the type `ADD R16, I8`, and the converted code should be in the form of `83 C0+T data`. `C3` corresponds to the part `C0+T`.

1 Like

Great that solution worked and I missed that bit.

I was able to pass that test and get 83 C3 DA.

However, I still don’t properly understand the data bits (I don’t have a good knowledge of this).

R16, I8: 83 C0+T data (I8 will be sign-extended)

It says I8 to be sign extended. This means the data will become 16-bit right?

Data:
-38 = 0xDA
16-bit = 0xFFDA (sign extended)
In little endian = 0xDAFF

To add the data to 83 C3, I would get 83 C3 DA FF.

I’m just not entirely clear how the sign-extended data works to get 83 C3 DA.

When an x86 CPU actually executes the instruction 83 C3 DA (`add bx, -38`), the 8-bit immediate value (0xDA = -38) will be sign extended to a 16-bit value (0xFFDA = -38) which will then be added to the 16-bit register bx.

Sign extension refers preserving the value of a signed number while increasing the number of bits. For 2’s complement numbers, this is done by copying the most significant bit (MSB). For example, 0x03 will be sign extended to 0x0003 while 0xFD (= -3) will be sign extended to 0xFFFD (= -3).

When encoding an instruction, this matters for determining if a value can fit in an 8-bit immediate or if it requires a 16-bit immediate. The `ADD R16, I8` encoding can be used for 8-bit signed values [-128, 127]. Values outside of that range (and not equivalent to a value within that range) will need to use the `ADD R16, I16` encoding instead.

Hope that helps.

1 Like

It does help thank you for the detail

I’m running into a similar problem around this now on test case 13

Failure
Found: 0018 | 83 C3 89 | add bx, 0x89
Expected: 0018 | 81 C3 89 00 | add bx, 0x89

0x89 in decimal is 137
137 is in the range of unsigned I8: 0…255
BX is an R16

Instructions says:
4.2/ IMM is a 16-bit (I16 ) or 8-bit (I8) value which is unsigned (I16: 0…65535/I8: 0…255) or signed (I16: -32768…32767/I8: -128…127).

So this would be:
R16, I8: 83 C0+T data (I8 will be sign-extended)

but this isn’t the expected?

Failure
Found: 0018 | 83 C3 89 | add bx, 0x89
Expected: 0018 | 81 C3 89 00 | add bx, 0x89

But you said The `ADD R16, I8` encoding can be used for 8-bit signed values [-128, 127]. Values outside of that range (and not equivalent to a value within that range) will need to use the `ADD R16, I16` encoding instead.

Using this example: 0x89 in decimal is 137
So does 0 to 255 unsigned need to use `ADD R16, I16` then not `ADD R16, I8`

I’m still confused but trying to understand it.

For the ‘ADD R16, I8’ instruction encoding, the CPU has no way to know whether the immediate value is a signed or unsigned number, only that it should be sign-extended before computing the addition.

When the CPU sees 83 C3 89, it will sign extend the 8-bit immediate value 0x89 (-119 signed or 137 unsigned) to the 16-bit value 0xFF89 (-119 signed or 65417 unsigned) before performing the addition. This means that 83 C3 89 is a valid encoding for both `add bx, -119` and `add bx, 65417` but not `add bx, 137`.

For an instruction like `ADD R8, I8`, there is no need for sign extension. Thus any 8-bit immediate value [-128, 255] will be correctly handled by the CPU. The result of an addition/subtraction is the same whether the values are interpreted as 2’s complement signed values or unsigned values. The CPU will handle both cases by calculating the result and then updating both the signed overflow flag and the unsigned carry flag.

1 Like

I just finished the final test case (15) and program passes all the test cases.

Now to clean up all the code smells

This one bothers me a lot:

``````# Command specific encoding
if cmd == 'add' and imm_size == 'I16' and res_neg == -128:
imm_size = 'I8'
``````

res_neg is the negative decimal from signed two’s complement.

I needed that to pass line 7 test case 15:

`0013 | 83 C0 80 | add ax, 0xff80`

0xff80 in decimal is 65408 unsigned `I16` and 0xff80 is -128 signed `I8` (two’s complement).

Expected result is `I8` so I used specific encoding to use `I8` if the res_neg = 128.

Passes the test case but doesn’t seem a great way to do it. I’m not really clear on how the logic should work.

Individual immediate values should not be handled separately. Any value from 0xFF80 (65408) to 0xFFFF (65535) should use the `ADD R16, I8` encoding.

If you’re curious, these are two different approaches for nicely handling the detection of an I8 value that I tested when solving the puzzle:

Method 1 (likely similar to what you are currently doing)

Always convert the immediate value to a 16-bit signed number. This allows a single check of whether the value is within the range of an 8-bit signed number to correctly handle both signed and unsigned immediate values.

Method 2

Convert the immediate value to a 2 byte representation, then check if the more significant byte is unnecessary.

1 Like

Great puzzle! I made the parser, but I am quite confused with the encodings of the ADD instruction, especially on the AL/AX registers. My code passes local tests but fails remote tests “13 ADD more” and “15 ADD again”.

I understand that ADD AX, IMM is encoded as:

• R16, I8: 83 C0+T data (I8 will be sign-extended) ← `-2**7 <= IMM < 2**7` and `2**16-2**7 <= IMM < 2**16`
• ACC, IMM: 04+w data ← all other cases `-2**15 <= IMM < 2**16`
• R16, I16: 81 C0+T data ← never used for ADD AX, IMM

I understant that ADD AL, IMM is encoded as:

• ACC, IMM: 04+w data ← `-2**7 <= IMM < 2**8`
• R8, I8: 80 C0+T data ← never used for ADD AL, IMM

Some confusing points from the puzzle instructions:

• \$(encoded using the same size as the destination register). Is there really something special about the size of expressions that use the \$ sign?
• AX and AL are treated as primary accumulators (ACC) by some commands when the IMM command argument matches the number of bits (I16 for AX, I8 for AL). Does that mean `-2**15 <= IMM < 2**16` and `-2**7 <= IMM < 2**8`?

You seem to correctly understand the encodings for the `add ax/al, IMM` instructions. It is possible that your issue is with something else such as the `add R16, I8` encoding, parsing certain hex values, or properly handling expressions involving `\$`.

I don’t see anything obvious about those particular validation test cases but here is a simple test to try:

``````4
add dx, 0x0b + 2*\$*2 + 0dh + 1b

0000 | 83 C0 80          | add ax, -128
0003 | 81 C3 7F FF       | AdD bX, -129
0007 | 83 C1 AB          | add cx, 0xffab
000A | 81 C2 41 00       | add dx, 0x0b + 2*\$*2 + 0dh + 1b
``````

If that works fine then there is probably some off-by-one error that only applies to certain registers (al/ax vs others).

Assemblers such as NASM and MASM always encode expressions involving an address using a full-width immediate value. Both allow expressions adding/subtracting offsets to/from `\$` and specially mark the immediate values in the produced listing files. NASM also allows negating `\$` and/or scaling it by a constant while MASM gives an error. I haven’t seen anywhere that provides an explanation for using full-width immediate values but I believe it is to support easy relocation. If the program is moved to a different memory location, only the immediate values need to be changed and the rest of the machine code can remain as is. If full-width immediate values were not used, the instruction encodings could change and everything would need to be reassembled. Supporting relocation is not really relevant to this puzzle but we are trying to keep the output consistent with other assemblers.

The `add ACC, IMM` format is only used for `add ax, I16` and `add al, I8` to reduce the size of the encoded instruction by 1 byte. If the immediate value is small enough that it does not require a 16-bit immediate, that special format does not apply and the `add R16, I8` format is used instead (e.g. `add ax, -128`).

1 Like

I enjoyed working on solving this puzzle… until I didn’t.
I find the part of statement about dealing with signed/unsigned I16 and I8 values, and when to use `ADD ACC, IMM` and when `ADD Rx, Ix` very unclear.
I still don’t understand why these results are expected:

``````001E | 83 C0 00          | add ax, 0x12 * 2h - '\$'
0011 | 83 C0 7F          | add ax, 127
``````

but in a different test:

``````001C | 05 1C 00          | add ax, \$
``````

So the value 0x1C shall use the `ADD ACC, IMM` but the 0x00 and 0x7F not???

I pass all other tests and validators but not these 3 cases, at least not the same time.

EDIT: I finaly figured out how to pass 100%. \$ had to be treated a bit differently than other numbers. But my rant stands, it was hours of guesswork for me, which could have been avoided with some additional details in the statement.

There are a lot of details and explanations necessary for this puzzle and fitting them all in is definitely a challenge. Currently, the problem statement is only 9 characters under the limit. This is also the reason why the author only has 3 instructions in Part 1 and even included some additional explanations within one of the test cases.

I agree that the special handling of `\$` is not super obvious in the statement. I just tweaked the wording slightly to hopefully make it a little clearer. If you have any suggestions for further changes to the problem statement, feel free to leave a comment on the contribution.

Regarding the different encodings for the `add REGISTER, IMMEDIATE` instructions, the general goal is to reduce the size of the resulting executable (which can also improve performance). The `add ACC, IMM` and `add R16, I8` encoding formats are used instead of the `add R16, I16` format to reduce the size of the encoded instruction wherever possible:

• For `add al, I8`, the `add ACC, IMM` format is used since it results in a 2 byte instruction instead of a 3 byte instruction using `add R8, I8`.

• For `add ax, I16`, the `add ACC, IMM` format also results in a 1 byte saving (3 bytes vs 4 bytes).

• The `add R16, I8` format is always used whenever possible. For bx/cx/dx, this results in a 1 byte saving (3 bytes vs 4 bytes). For ax, the `add ACC, IMM` and `add R16, I8` formats both result in 3 byte instructions and the more general `add R16, I8` format is preferred (matching the results of other assemblers, such as NASM).

Any expression resulting in an address (e.g. `\$ + 2`) should be encoded using the largest immediate possible. For example, `add bx, \$` will treat `\$` as an I16 and use the `add R16, I16` format. Similarly, `add ax, \$ + 4` will use the `add ACC, IMM` format. For an explanation of the reasoning behind this requirement, see my previous post above.

1 Like