
C51: Is this a compiler bug?

... or a misunderstanding on my part?

From string "\x0CTUV", the compiler generates 0x0C 0x55 0x56 0x57 0x00

From string "\x0CABC", the compiler generates 0xCA 0x42 0x43 0x00 ... rather than the expected 0x0C 0x41 0x42 0x43 0x00

I thought the \x escape sequence in a string instructed the compiler to encode the very next two characters as a hexadecimal byte.

Am I missing something?
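
For reference, here is a compiler-neutral sketch for dumping the bytes that actually land in the array; it assumes a hosted environment with printf, which a bare 8051 target would not have, so treat it as an illustration only.

    #include <stdio.h>

    int main(void)
    {
        /* 'T' is not a hex digit, so the \x escape ends after 0C and the
           array should hold 0C 54 55 56 00.  In "\x0CABC", the A, B and C
           are themselves hex digits, which is where the trouble starts. */
        const unsigned char s[] = "\x0CTUV";
        size_t i;

        for (i = 0; i < sizeof s; i++)
            printf("%02X ", (unsigned)s[i]);
        printf("\n");
        return 0;
    }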

  • but do you think it was an unreasonable assumption?

    Yes, it was unreasonable, because it contradicts what the language definition says about \x. There are specific examples in the C99 standard demonstrating exactly this pitfall (6.4.4.4 paragraph 14, 6.4.5 paragraph 7).

    The deeper reason for this is that C doesn't assume chars are always 8 bits wide.

  • You just assumed that it accepts two non-zero characters :)

    It will continue to eat characters until the first invalid character is found.

    gcc would consume every single character in your second string, and then complain about "hex escape sequence out of range".
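
    For example, feeding gcc the literal from the original post produces a diagnostic along these lines (a sketch; the exact wording may vary between versions):

        /* greedy.c -- compile with: gcc -c greedy.c
           A, B and C are all hex digits, so the escape runs to the end of
           the literal and overflows an 8-bit char; gcc complains with
           something like "hex escape sequence out of range". */
        const char s[] = "\x0CABC";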

  • "There are specific examples in the C99 standard"

    I don't have the C99 standard, but C90 specifically states:

    "Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence" (my emphasis).

    So C51 is behaving exactly as specified!

  • "Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence" (my emphasis).

    OK, but ...
    1) In Keil C51, what exactly is the \x escape sequence supposed to encode? A single character? Multiple characters?
    2) In the case of \x, is "the longest sequence of characters that can constitute the escape sequence" all hexadecimal characters?

  • The deeper reason for this is that C doesn't assume chars are always 8 bits wide.

    But C51 (the MCU specified in this thread) DOES specifically state chars are 8 bits wide. Please refer to the table of Data Types on page 95 of the Users Guide (09.2001).

  • \x will consume any combination of 0..9, a..f, A..F, for any number of characters, up to whatever compiler-specific limit applies to a string. If the compiler can handle 2000-character strings, then it can consume a very long run of hexadecimal digits.

    However, most compilers will issue a warning if the hexadecimal constant overflows the maximum range of a character. I haven't checked whether the standard allows it, but there are compilers that treat over-long hex constants as an error. Note that a character does not have to be 8 bits wide. Wide characters are normally 16 bits, but some architectures can have completely different character sizes.

    I leave it to C51 users to discuss specific limits for the Keil compiler. The important thing for you is to make sure that the compiler can not consume more characters than you specifically want to belong to the hex constant. My first post shows what I normally do.
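
    One common way to force that break (not necessarily what was shown earlier in the thread) is to rely on adjacent string literal concatenation: escape sequences are processed per literal, before the literals are glued together, so splitting the string ends the escape exactly where you want it. A sketch:

        /* The escape in the first literal cannot reach into the second,
           so this reliably yields 0C 41 42 43 00. */
        const unsigned char msg[] = "\x0C" "ABC";

        /* Or avoid the escape altogether and spell the bytes out: */
        const unsigned char msg2[] = { 0x0C, 'A', 'B', 'C', '\0' };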

  • The compiler may have a fixed number of bits in a character, but that is irrelevant.

    Is 0x0 or 0x00 or 0x000 or 0x0000 or 0x00000 or 0x000000 or 0x0000000 or 0x00000000, ... different numbers? All represent the value 0 - a value that fits in a single bit, a char, a short, an int, a long int, a long long int, ...

    So it should be obvious that the compiler can not be allowed to count number of digits in a number and stop parsing after a specific number of digits, and assume that the following characters are part of the next symbol in the source code. There is no difference whether we are talking about an integer symbol or an inlined hexadecimal constant in a text string. The compiler reacts to the characters, not to the number of characters. An integer symbol is a non-zero count of digits, followed by a character that is not a digit.
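
    To put the leading-zero point in escape-sequence terms, this is what the standard's value-based rule requires (a sketch; whether C51 actually honours it is exactly what is being questioned in this thread):

        /* Leading zeros change the digit count but not the value, and the
           escape still ends only at the first non-hex character ('T'), so
           per the standard all three arrays hold the same bytes: 0C 54 00. */
        const unsigned char a[] = "\x0CT";
        const unsigned char b[] = "\x00CT";
        const unsigned char c[] = "\x0000000CT";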

  • The important thing for you is to make sure that the compiler can not consume more characters than you specifically want to belong to the hex constant. My first post shows what I normally do.

    I agree. What's odd, however, is why the compiler seemingly ignores leading zeros rather than consuming all valid hex characters following the \x. Referring back to my original post, you'll see that "\x0CABC" does NOT consume the B and the C in the string ... both of which are by definition valid hexadecimal digits. It apparently ignores the leading zero, then encodes the next TWO digits (the C and A in this case).

    Now, given that this forum is primarily a venue for debate rather than assistance, I will accept the obligatory ongoing debate, but you do have to admit that the behavior of the \x conversion is a bit odd. No?

  • So it should be obvious that the compiler can not be allowed to count number of digits in a number and stop parsing after a specific number of digits

    But it DOES. See my reply just above.

    Again, it did NOT include the B and C in the "\x0CABC". It stopped after the A.

  • An 8-bit processor with 8-bit characters can not swallow more than 8 bits of data. Because of this, you will get into undefined country as soon as you try to specify a hexadecimal value larger than is meaningful on the compiler.

    In short: You do not know exactly what will happen. And because of this, you should not make any assumptions about the number of characters the compiler will process, but should force a break after the end of the constant.

    There is no rule that says the compiler should consume any leading zero digits, then consume exactly bits/4 hex digits, and then break. It can consume all digits, just doing n = n*16 + digit, and emit the least significant 8 bits. Or it can consume all digits but emit the first 8 bits. Or it can decide to break as soon as it detects an overflow. That is why assumptions are bad. You test on one compiler and assume that you have found a magic rule that is generally applicable.

    If there is no hard rule saying that a compiler _must_ behave in a specific way, then you should do your best to stay away from this implementation-specific zone. It will bite. It may bite when you switch to a different compiler. But it may just as well bite if you change a compilation flag or update to the next release of a compiler.

    You should get the language standard and spend some time with it. The standard itself says in section 6.4.4.4 (my emphasis):

    Point 6:
    The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape sequence are taken to be part of the construction of a single character for an integer character constant or of a single wide character for a wide character constant. The numerical value of the hexadecimal integer so formed specifies the value of the desired character or wide character.

    Point 7:
    Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.

    From the above, it could be "assumed" that the Keil compiler is buggy and should have consumed all characters. But note point 9:
    The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the type unsigned char for an integer character constant, or the unsigned type corresponding to wchar_t for a wide character constant.

    I.e. it is up to you to make sure that you do not feed the compiler more digits than what will fit in a character. You made an invalid assumption and broke a constraint specified in the language standard. That left you in limbo land.
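
    To make the "consume every hex digit" rule concrete, a scanner for the escape body might look roughly like this (a sketch of the general idea, assuming 8-bit chars; it is not Keil's actual implementation):

        #include <ctype.h>

        /* Parse the digits that follow "\x", starting at *p.  The loop only
           stops at the first non-hex character; the range check is separate. */
        unsigned parse_hex_escape(const char **p, int *out_of_range)
        {
            unsigned value = 0;
            *out_of_range = 0;
            while (isxdigit((unsigned char)**p)) {
                int c = tolower((unsigned char)**p);
                int digit = isdigit((unsigned char)c) ? c - '0' : c - 'a' + 10;
                value = value * 16 + digit;   /* the n = n*16 + digit step */
                if (value > 0xFF)             /* more than an 8-bit char can hold */
                    *out_of_range = 1;        /* where a diagnostic would be issued */
                ++*p;
            }
            return value;
        }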

  • Yes, but see the quoted part of the standard below - your part of the contract is to make sure that the hexadecimal constant will fit in an unsigned 8-bit character if the compiler consumes all valid characters. In your case, the number was too large, and when you violate a requirement in the standard, the compiler is no longer obliged to perform in a specific way.

  • From the above, it could be "assumed" that the Keil compiler is buggy and should have consumed all characters. But note point 9:
    The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the type unsigned char for an integer character constant, or the unsigned type corresponding to wchar_t for a wide character constant.

    So perhaps the compiler's logic in this matter is, "consume all characters up to, but not beyond, the range of an unsigned char".

    I guess I can live with that and I agree that no assumptions should be made in this area with different compilers.

    I appreciate the feedback.

  • The point is that you should not concern yourself with the compiler's logic - you should concern yourself with ensuring that your source text is completely unambiguous and, therefore, not subject to any misinterpretation by any compiler logic!

  • But C51 (the MCU specified in this thread) DOES specifically state chars are 8 bits wide.

    Of course it does. But that doesn't affect the parsing of character and string literals. Those are defined by the language standard.