
Velocity Reviews > Newsgroups > Programming > C Programming > Non latin characters in string literals


Non latin characters in string literals

 
 
Ioannis Vranos
      01-03-2010
I am asking so as to be sure:


AFAIK non-Latin, other-language characters produce undefined behaviour
when used with standard library facilities expecting char strings, like
printf(), and when used in string literals.


Is this correct?


The C99 standard mentions:


"5.2.1 Character sets

1 Two sets of characters and their associated collating sequences shall be
defined: the set in which source files are written (the source character set),
and the set interpreted in the execution environment (the execution character
set). Each set is further divided into a basic character set, whose contents
are given by this subclause, and a set of zero or more locale-specific members
(which are not members of the basic character set) called extended characters.
The combined set is also called the extended character set. The values of the
members of the execution character set are implementation-defined.

2 In a character constant or string literal, members of the execution
character set shall be represented by corresponding members of the source
character set or by escape sequences consisting of the backslash \ followed
by one or more characters. A byte with all bits set to 0, called the null
character, shall exist in the basic execution character set; it is used to
terminate a character string.

3 Both the basic source and basic execution character sets shall have the
following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab,
vertical tab, and form feed. The representation of each member of the source
and execution basic character sets shall fit in a byte. In both the source and
execution basic character sets, the value of each character after 0 in the
above list of decimal digits shall be one greater than the value of the
previous. In source files, there shall be some way of indicating the end of
each line of text; this International Standard treats such an end-of-line
indicator as if it were a single new-line character. In the basic execution
character set, there shall be control characters representing alert,
backspace, carriage return, and new line. If any other characters are
encountered in a source file (except in an identifier, a character constant,
a string literal, a header name, a comment, or a preprocessing token that is
never converted to a token), the behavior is undefined.

4 A letter is an uppercase letter or a lowercase letter as defined above; in
this International Standard the term does not include other characters that
are letters in other alphabets".




Thanks a lot,

--
Ioannis Vranos

C95 / C++03 Software Developer

http://www.cpp-software.net
 
Ioannis Vranos
      01-03-2010
Ioannis Vranos wrote:

> I am asking so as to be sure:
>
>
> AFAIK non-Latin, other-language characters produce undefined behaviour
> when used with standard library facilities expecting char strings, like
> printf(), and when used in string literals.



I mean that for other-language characters you should instead use the
wchar_t character type, wchar_t strings/string literals, wchar_t
pointers, and wchar_t facilities like wprintf(), wscanf(), etc., along
with the required locales.





Thanks a lot,

--
Ioannis Vranos

C95 / C++03 Software Developer

http://www.cpp-software.net
 
Nick
      01-03-2010
Ioannis Vranos <(E-Mail Removed)> writes:

> Ioannis Vranos wrote:
>
>> I am asking so as to be sure:
>>
>>
>> AFAIK non-Latin, other-language characters produce undefined behaviour
>> when used with standard library facilities expecting char strings, like
>> printf(), and when used in string literals.

>
>
> I mean that for other-language characters you should instead use the
> wchar_t character type, wchar_t strings/string literals, wchar_t
> pointers, and wchar_t facilities like wprintf(), wscanf(), etc., along
> with the required locales.


I don't think so. I can't see any problems with using standard C
strings for UTF-8 characters (indeed, I've been doing it for a while with
no problems).

A few things to note:

- UTF-8 uses no zero bytes (except to encode the null character itself),
so C strings will not terminate prematurely.

- it's risky, to say the least, to stick accented characters etc. in
string literals; you need to use the appropriate hex escapes to be safe
(which also avoids your editor doing horrible things if you aren't
careful).

- strlen will come back with the number of bytes in the string, not the
number of characters. However as often as not you are using strlen to
work out how much storage space you need anyway.

- you need to be careful to meticulously cast to unsigned char: in
particular before passing a char to a ctype macro, but also before any
time you assign one to an integer (unless you want enormous negative
numbers flying around).

But with all that in mind, it works fine. I only do it because I
already had a mountain of code that I wanted to make work with accented
characters - UTF-8 proved a remarkably pain-free way to do it, certainly
easier than learning all the w* features (which I've never used) and
editing all the code.

Nick, cheerfully expecting to find out he's wrong and what he thought
was a bad cold was something a lot worse.
--
Online waterways route planner | http://canalplan.eu
Plan trips, see photos, check facilities | http://canalplan.org.uk
 
Eric Sosman
      01-03-2010
On 1/3/2010 10:40 AM, Ioannis Vranos wrote:
> I am asking so as to be sure:
>
>
> AFAIK non-Latin, other-language characters produce undefined behaviour
> when used with standard library facilities expecting char strings, like
> printf(), and when used in string literals.
>
>
> Is this correct?


Not undefined behavior, but implementation-defined behavior.
If the implementation supports "extended" characters beyond those
specifically required by the Standard, you can use them. If it
doesn't, you can't. Also, they're implementation-defined rather
than unspecified, because of

> "5.2.1 Character sets
> [...] The values of the members of the execution character set are
> implementation-defined."


Since the implementation must define (i.e., document) the values
of all the supported characters, it must document the characters
themselves, en passant as it were, and thus define them.

In a follow-up post you mention using wchar_t in connection
with exotic glyphs. Those aren't "characters" in the sense of
5.2.1, but "wide characters." If the implementation provides
wide characters outside the repertoire of ordinary characters,
they, too, are implementation-defined (6.4.4.4p11), and using
them has implementation-defined behavior.

Summary: Not "portable," but not "undefined."

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)lid
 
Nick
      01-03-2010
Joe Wright <(E-Mail Removed)> writes:
>
> I love the idea of UTF-8 but I don't know how to use it. Code points
> 0..127 are single byte ASCII characters and offer no problem. But what
> do we do with the multi-byte characters?


We simply stick them into our C strings, byte by byte. There are no
zero bytes in UTF-8 (other than the null character itself), so they are
still valid strings (and since CHAR_BIT is always at least 8, a char can
always hold one UTF-8 byte).

For example, try this.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char s[] = "\xc2\xa1ol\xc3\xa9! This string costs \xc2\xa3""25.20";
    printf("The string is \"%s\" - 30 symbols, but is %zu bytes long\n",
           s, strlen(s));
    return EXIT_SUCCESS;
}

In a bash shell on my Ubuntu box that prints out perfectly.

This is my favourite reference page for the characters:
http://www.utf8-chartable.de/unicode-utf8-table.pl

BTW, if anyone can tell me how to avoid having the string concatenation in
the assignment line (in other words, how to end a hex string where the
next character is a digit or A-F) I'd be grateful.
--
Online waterways route planner | http://canalplan.eu
Plan trips, see photos, check facilities | http://canalplan.org.uk
 
Ioannis Vranos
      01-04-2010
Nick wrote:

> Joe Wright <(E-Mail Removed)> writes:
>>
>> I love the idea of UTF-8 but I don't know how to use it. Code points
>> 0..127 are single byte ASCII characters and offer no problem. But what
>> do we do with the multi-byte characters?

>
> We simply stick them into our C strings, byte by byte. There are no
> zeros in UTF-8, so they are still valid strings (as CHAR_BIT is always
> at least 8, a char can always hold at least 1 byte).
>
> For example, try this.
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(void) {
> char s[] = "\xc2\xa1ol\xc3\xa9! This string costs \xc2\xa3""25.20";
> printf("The string is \"%s\" - 30 symbols, but is %zu bytes long\n",
> s, strlen(s));
> return EXIT_SUCCESS;
> }
>
> In a bash shell on my Ubuntu box that prints out perfectly.
>
> This is my favourite reference page for the characters:
> http://www.utf8-chartable.de/unicode-utf8-table.pl
>
> BTW, if anyone can tell me how to avoid having the string concatenation in
> the assignment line (in other words, how to end a hex string where the
> next character is a digit or A-F) I'd be grateful.




Isn't multibyte-character usage messier and more restricted than using
wchar_t characters?




--
Ioannis Vranos

C95 / C++03 Software Developer

http://www.cpp-software.net
 
Eric Sosman
      01-04-2010
On 1/4/2010 1:49 AM, Ioannis Vranos wrote:
>
> Isn't multibyte-character usage messier and more restricted than using
> wchar_t characters?


"Messier," yes, in the sense that it's more complicated
to navigate in a string of multibyte characters than in an
ordinary string where all characters are the same size. If
`p' points to a char in an ordinary string, `p+1' points to
the next char. But in a multibyte string, the char at `p+1'
might be a continuation of the "character" starting at `p'
rather than an independent "character" on its own. Going
backwards is, if possible at all, even worse.

"More restricted" -- I honestly don't know. It's my
impression that there's supposed to be a wchar_t value for
every (valid) multibyte sequence, and at least one multibyte
sequence for every wchar_t value, but I can't find a guarantee
to that effect. (Since the mapping is locale-dependent, and
since locales are implementation-defined, such a guarantee
might be impossible -- consider converting one way, changing
locales, and trying to convert back again).

If you're handling the multibyte strings as "pass-through"
data, you can probably leave them in multibyte form and not
worry about their internal structure. If you're going to
analyze and/or manipulate the strings' characters, it may be
best to convert from multibyte to wchar_t, do the work, and
(if needed) convert back again. That is, treat the multibyte
string as the "external encoding" of a wchar_t string that
you work with internally.

--
Eric Sosman
(E-Mail Removed)lid
 
Nobody
      01-04-2010
On Sun, 03 Jan 2010 17:40:16 +0200, Ioannis Vranos wrote:

> I am asking so as to be sure:
>
>
> AFAIK non-Latin, other-language characters produce undefined behaviour
> when used with standard library facilities expecting char strings, like
> printf(), and when used in string literals.
>
>
> Is this correct?


Using non-ASCII characters in string literals is problematic.

Passing non-ASCII strings to library functions isn't a problem, although
some of them will expect the strings to be valid according to the
encoding of the current locale.

If the library functions only accepted ASCII strings, there wouldn't be
much point in having locales.

 
Nobody
      01-04-2010
On Mon, 04 Jan 2010 08:49:15 +0200, Ioannis Vranos wrote:

> Isn't multibyte-character usage messier and more restricted than using
> wchar_t characters?


It's messier, but you typically have to convert from/to the multibyte
representation for input and output anyway (less so on Windows, where
the OS APIs use wchar_t, although you still need to convert for
non-Microsoft file formats and network protocols).

Generally, if you're just passing strings around without processing them,
it's easier to keep the "char" representation.

If you need to do non-trivial processing, wide characters are easier (e.g.
indexing a wchar_t array indexes characters rather than bytes).

OTOH, Windows uses a 16-bit wchar_t, so if you're using characters outside
of the basic multilingual plane (BMP), you end up with a multi-wchar_t
representation, which is the worst of both worlds.

 