
Cyrillic scripts

Hi,

How is it possible to add Cyrillic script to one of my C files (working with MDK version 3.4)?

const char text_cyrillic[] = "Cyrillic script";

best regards
Arne

  • I didn't ask what the uvision editor would show. Only an editor using UTF8 would correctly show (or allow you to enter) text constants containing UTF8.

    The question is: what happens with the HTML output if you do write the strings in UTF8? The presentation in uvision may show question marks, but the UTF8 strings just contain 8-bit data that does not collide with anything in 7-bit ASCII, so the compiler should be able to pick up the " character, and then process a number of don't-care characters until it sees another " that doesn't have a \ before it.

    In short: if the uvision EDITOR doesn't support UTF8, which it doesn't seem to do, then you will not be able to type cyrillic text with your keyboard, and when you open a source file containing UTF8-encoded cyrillic, you will not be able to read the cyrillic. But focus on the goal. The goal isn't that you can read cyrillic in the uvision editor. The goal is that the web browser receives cyrillic text when processing the received data as UTF8.

    Download the Crimson editor. Open the source file, set the document encoding to UTF8 w/o BOM, and write your cyrillic text. Save, tab back to uvision, build and test.

  • Download the Crimson editor. Open the source file and then set document encoding to UTF8 w/o BOM and write your cyrillic text. Save and tab back to uvision build and test.

    I've downloaded this editor, but with the setting UTF8 w/o BOM, I can't write cyrillic text. The keyboard language is Russian.

    I wrote some cyrillic text with the normal editor in Windows - which works. After that I opened the file in the Crimson editor (with UTF8) and all the cyrillic text shows as question marks... Of course I stored the file as UTF-8 - not as ASCII.

  • Strange. I have done a lot of work with uncommon characters with Crimson, but it does not like Cyrillic for me either.

    Notepad can handle cyrillic, but only with a BOM. That is probably not good, unless the Keil compiler explicitly supports UTF8 and can accept the BOM. The editor in Code::Blocks can also handle Cyrillic, but creates UTF8 files with a BOM.

    Unless the compiler (and the uvision editor) can correctly read a UTF8 file with a BOM, you will just have to look around for an editor that can save to UTF8 without a BOM, or write a little tool that can add/remove the BOM from the file. That would allow you to use Notepad (OK if you move the constants to a separate file, in which case the lack of syntax highlighting etc. doesn't really matter) or Code::Blocks or similar for editing the text.

  • A source file can't mix different encodings, but the compiler doesn't really care much about what characters you have in a text string.

    That's not exactly true. A multibyte character in a string can have a byte matching the closing quote of the string. That would give you unexpected results. Same with comments: it should be possible to construct a UTF8 comment where a compiler that only understands ASCII will find a comment terminator within the multibyte characters.
    So it is important that the compiler understands the encoding of your source files.
    Another issue is finding a text editor able to produce source files with the necessary encoding. We all know it can't be uVision. I can't really recommend a suitable editor.
    At the end of the day, it might be necessary to use encoding-converting tools. If the compiler doesn't understand UTF8, strings should be converted to octal or hexadecimal notation (\012 or \x12). If the editor of choice doesn't produce the exact required encoding, its output could be post-processed prior to compilation.

  • UTF8 only makes use of safe break characters. It doesn't interfere with the original 7-bit ASCII standard. To the compiler, it will just be a sequence of single-byte characters to copy from source into string constants. If the high bit is zero, it's a one-byte character. If the two high bits are 11, it's a break character, signalling that two, three or four bytes are needed. If the two high bits are 10, it's one of the following bytes in a multi-byte character. In no way will this interfere with the compiler, as long as the compiler supports 8-bit data.

    It is not possible to create a UTF8 text string where any character but the " will have a byte with the numerical value of the ". Same with all other critical characters in C, since all tokens in C are within the 7-bit ASCII set.

    I don't think I have seen a compiler that doesn't handle 8-bit data in a long, long time, since such a compiler would not even be able to handle the 8-bit code pages used in text mode on a PC.

    Generic multibyte character sets, on the other hand, may not be safe, since there is no guarantee that the follow-up bytes stay out of the old 7-bit ASCII range.

  • It seems that uvision is not able to understand UTF-8, which is very disappointing.

    Babelstone is a free editor which is able to store text in UTF-8 (with or without BOM). But uvision shows only a few strange characters...

    This editor is also able to store the cyrillic text in the octal or hexadecimal notation you mentioned (\012 or \x12). The only problem is that the code size for the webpage will be enlarged.

  • Maybe the problem is that uvision only supports 8-bit characters, and all cyrillic letters have Unicode code points above 0xFF, so you need more than 8 bits to show the correct letters.

  • Hasn't that been the conclusion from post one?

    But the nice thing with UTF8 is that a program that doesn't support UTF8 can normally work as a safe container for UTF8 text. You can't edit characters outside 7-bit ASCII, but you can load them and save them without destroying them. Opening a UTF8 file without a BOM in a plain 8-bit text editor will show all 7-bit characters as expected. For any extended character, you will get two, three or four "noise" characters displayed.

    An example is what happens if I write national characters in this post - the Keil forum claims UTF8 support but doesn't.

    Opening an UTF8 file in uvision could look like:

    const char str[] = "This is a string with extended characters: åäöÅÄÖюÑиСЎεθώϊϊÃØÞßÅ'Å Å'ÈŸÉ³Î£Î˜Ê„Ï Óá¿·âˆâˆ­â–¤â—⡽��";
    

    As long as the compiler is 8-bit safe, this really doesn't matter. You will get a perfect UTF8 text stored in the character constant. The only thing that will not work is that strlen() will return the number of non-zero bytes instead of the number of characters. Not important for sending out text to a web browser.

  • maybe the problem is, that uvision only supports the 8bit unicode

    No, I think the problem is that the uVision creators never thought that someone might want to use their text editor with different encodings.
    I keep wondering: why does everyone keep reinventing their own text editor and/or IDE? There must be a better way, and I don't mean Eclipse...

  • >Opening an UTF8 file in uvision could look like:
    Correct - your example is very similar to what I see.

    As long as the compiler is 8-bit safe, this really doesn't matter. You will get a perfect UTF8 text stored in the character constant. The only thing that will not work is that strlen() will return the number of non-zero bytes instead of the number of characters. Not important for sending out text to a web browser.

    I know what you mean: uvision doesn't erase any kind of information, but it is not able to interpret the text correctly (as Unicode).

    But the web browser also shows these characters (from your example - not the correct unicode). I've tested the page in IE7 and Firefox. Other pages on the web using cyrillic scripts are shown in the correct way. And of course I'm using the content header with charset utf-8 (HTTP header).

    So the only thing I don't understand is why the web browser won't show the correct unicode.

  • Have you made really sure that the web pages that get sent out specify the UTF8 encoding? If they don't, then the web browser will not know that there are any UTF8 multi-byte characters to display. It may default to the ISO-8859-1 character set instead.

    It really is important to note that a byte is just a binary storage cell capable of storing a value between 0 and 255. To display one or more bytes as specific characters, you must make sure that the renderer is informed about what character set to use, and also supports it.

    The support for different character sets in uvision is irrelevant in relation to your possibilities of selecting character sets for use in the web browser.

    In short: You must make sure that
    1) The web page data contains UTF8 data.
    2) The web page mentions that it is using UTF8 data.

  • I don't think I really understand this thread/subject. Just want to provide some information. (might be totally useless.)

    ====================================
    RealView Compiler Reference Guide
    Character sets and identifiers

    www.keil.com/.../armccref_cihdigag.htm

    # Source files are compiled according to the currently selected locale. You might have to select a different locale, with the --locale command-line option, if the source file contains non-ASCII characters. See Invoking the ARM compiler in the Compiler User Guide for more information.

    # The ARM compiler supports multibyte character sets, such as Unicode.

    # Other properties of the source character set are host-specific.
    ====================================
    --multibyte_chars, --no_multibyte_chars

    www.keil.com/.../armccref_CHDCECBH.htm

    This option enables or disables processing for multibyte character sequences in comments, string literals, and character constants.
    ====================================

    Cyrillic characters are basically the ISO 8859-5 characters moved upward by 864 positions.
    ( en.wikipedia.org/.../Cyrillic_characters_in_Unicode )

    I guess that the host supports ISO 8859-5, but does not support Cyrillic in Unicode; so Keil may support Unicode in general, yet still not be able to handle Cyrillic.