Multicharacter literals

 
 
Richard Smith
      08-22-2012
I recently encountered some C++ code that made use of multicharacter
literals -- that is, something that looks like a character literal,
but contains more than one character:

int i = 'foo';

I must admit, I hadn't realised that C++ still allowed these and had
assumed they went the way of implicit int and K&R-style function
declarations. The standard tells me that, unsurprisingly, their
representation is implementation-defined (and so does the C standard), so
my questions here are not about what the standard requires (nor
whether I should be using them), but rather what implementations
commonly do and why.

Using GCC on i386, I find that

'foo' == ('f' << 16 | 'o' << 8 | 'o');

Because i386 is little-endian, this implies it lays out the literal as
"oof\0", and this is confirmed if I look at the object code
generated. I must admit, this surprised me. Certainly this choice is
permitted, and it's easiest for the compiler to parse as it's just a
base-256 integer. But the only sensible reason I can think of for
using multicharacter literals is when doing binary I/O. Short strings
the length of the machine word exist in a number of binary formats --
e.g. "\x7FELF" at the start of an ELF binary, or labels like "RIFF"
and "WAVE" in the WAV audio format. If I were writing in assembly, I
might well convert these manually to 32-bit integers and then simply
dump them; and I can possibly imagine wanting to do that in C or C++
when writing low-level code. But if I do that with GCC's
multicharacter literals, they have the wrong byte order: I would have
to dump 'EVAW' instead of 'WAVE'.
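The byte-order observation above can be checked with a minimal sketch. It assumes the base-256 value reported for GCC in this post; `pack_base256`, `memory_image` and `host_is_little_endian` are hypothetical helper names, and the value is built by hand rather than with a multicharacter literal because the literal's value is implementation-defined.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: folds up to four characters into a base-256 integer,
// first character most significant, mirroring the GCC result quoted above.
constexpr std::uint32_t pack_base256(const char* s) {
    std::uint32_t v = 0;
    while (*s != '\0') {
        v = (v << 8) | static_cast<unsigned char>(*s++);
    }
    return v;
}

// Returns the bytes of v in the order the host stores them in memory.
std::string memory_image(std::uint32_t v) {
    char bytes[sizeof v];
    std::memcpy(bytes, &v, sizeof v);
    return std::string(bytes, sizeof v);
}

// True when the host stores the least significant byte first.
bool host_is_little_endian() {
    const std::uint32_t one = 1;
    unsigned char first = 0;
    std::memcpy(&first, &one, 1);
    return first == 1;
}
```

On a little-endian host, `memory_image(pack_base256("WAVE"))` yields the bytes "EVAW", which is why dumping the integer directly writes the tag backwards; on a big-endian host the same call yields "WAVE".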

It seems unlikely that GCC would make an inconvenient implementation
choice for no good reason, so presumably, then, there is (or once was)
another use for these that's eluding me. Can anyone suggest what it
is?

Richard
 
Öö Tiib
      08-22-2012
On Wednesday, August 22, 2012 7:35:57 PM UTC+3, Richard Smith wrote:
> I recently encountered some C++ code that made use of multicharacter
> literals -- that is, something that looks like a character literal,
> but contains more than one character:
>
> int i = 'foo';
>
> I must admit, I hadn't realised that C++ still allowed these and had
> assumed they went the way of implicit int and K&R-style function
> declarations. The standard tells me that, unsurprisingly, their
> representation implementation-defined (and so does the C standard), so
> my questions here are not about what the standard requires (nor
> whether I should be using them), but rather what implementations
> commonly do and why.


....

That you found out yourself further on. There are really no other reasons to use it (and never have been) than to confuse the heck out of a novice maintainer.

Newer versions of compilers compile it into the same data and instructions as they always did. Some compilers may issue warnings, and that is all, because there may be legacy code that uses it for whatever reason.

The same thing likely happens with implicit int and K&R function declarations: although they have been kicked out of the standards, at least some compilers still compile them and issue warnings. Legacy code is too sacred to touch.

It is left up to the development process (with its coding standards, tools and code reviews) to decide how to address the use of all such features.

 
Richard Smith
      08-22-2012
On Aug 22, 6:40 pm, Öö Tiib <oot...@hot.ee> wrote:
> On Wednesday, August 22, 2012 7:35:57 PM UTC+3, Richard Smith wrote:
> > I recently encountered some C++ code that made use of multicharacter
> > literals -- that is, something that looks like a character literal,
> > but contains more than one character:
> >
> >   int i = 'foo';
> >
> > I must admit, I hadn't realised that C++ still allowed these and had
> > assumed they went the way of implicit int and K&R-style function
> > declarations. The standard tells me that, unsurprisingly, their
> > representation implementation-defined (and so does the C standard), so
> > my questions here are not about what the standard requires (nor
> > whether I should be using them), but rather what implementations
> > commonly do and why.
> >
> ...
>
> That you further found out. There are really no other reasons to use it (and have never been) but to confuse the heck out of a novice maintainer.


Well, clearly that's not true. Compiler writers don't decide to add
functionality simply "to confuse the heck out of a novice
maintainer."

Multicharacter literals go back to the late 1960s in the B language; C
inherited them from B, and C++ from C. It's easy to see why they
existed in B. For one thing, there was no char type: everything was
an int, even if you only cared about the lowest 8 bits. Optimising
for code size was also far more important than today, and if you were
used to writing in assembler, you'd be used to putting small strings
as immediates. If you look in the B manual, you'll see examples of
multicharacter literals used in this way: effectively, optimised very
short strings.

However, GCC's (perfectly legal) implementation choice doesn't allow
that usage. As you point out, compiler writers don't break
compatibility with old code for no reason, yet here, somewhere along
the line, a compiler vendor evidently decided to implement
multicharacter literals in a way that broke their use as small
strings. It would have been trivial to have implemented them on a
little-endian machine so that they worked as short strings. So I can
only assume there was some other use of multicharacter literals that
was more important to keep working. I am curious as to what that
other, more important use was.

Richard
 
Öö Tiib
      08-22-2012
On Wednesday, August 22, 2012 9:18:53 PM UTC+3, Richard Smith wrote:
> On Aug 22, 6:40 pm, Öö Tiib <oot...@hot.ee> wrote:
> > On Wednesday, August 22, 2012 7:35:57 PM UTC+3, Richard Smith wrote:
> > > I recently encountered some C++ code that made use of multicharacter
> > > literals -- that is, something that looks like a character literal,
> > > but contains more than one character:
> > >
> > >   int i = 'foo';
> > >
> > > I must admit, I hadn't realised that C++ still allowed these and had
> > > assumed they went the way of implicit int and K&R-style function
> > > declarations. The standard tells me that, unsurprisingly, their
> > > representation implementation-defined (and so does the C standard), so
> > > my questions here are not about what the standard requires (nor
> > > whether I should be using them), but rather what implementations
> > > commonly do and why.
> > >
> > ...
> >
> > That you further found out. There are really no other reasons to use it (and have never been) but to confuse the heck out of a novice maintainer.
>
> Well, clearly that's not true. Compiler writers don't decide to add
> functionality simply "to confuse the heck out of a novice
> maintainer."


You likely misunderstood what I meant. I meant that I do not see any other reason "to use it" [in modern C++ code]. Why it is (and possibly forever will be) in the C++ language I discussed further on in my post. Thanks for adding the history about B etc.

> As you point out, compiler writers don't break
> compatibility with old code for no reason, yet here, somewhere along
> the line, a compiler vendor evidently decided to implement
> multicharacter literals in a way that broke their use as small
> strings.


I somehow do not believe that there is an ultra-profitable way to use those strange literals. In every case there must be less obscure, more elegant and more portable code that gives exactly the same compiled binary.
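One less obscure alternative might look like the sketch below, assuming the goal is simply to write or recognise a four-byte tag such as "WAVE". The names `Tag4`, `kWave`, `put_tag` and `has_tag` are hypothetical, and no claim is made here about what object code any particular compiler generates for it.

```cpp
#include <cstddef>
#include <cstring>

// A four-byte tag kept as bytes, so its file order never depends on the host.
struct Tag4 {
    char bytes[4];
};

constexpr Tag4 kWave = {{'W', 'A', 'V', 'E'}};

// Copies the tag to out in file order and returns the number of bytes written.
std::size_t put_tag(const Tag4& tag, unsigned char* out) {
    std::memcpy(out, tag.bytes, sizeof tag.bytes);
    return sizeof tag.bytes;
}

// Checks whether the first four bytes at p spell the given tag.
bool has_tag(const unsigned char* p, const Tag4& tag) {
    return std::memcmp(p, tag.bytes, sizeof tag.bytes) == 0;
}
```

Because the tag is never converted to an integer, neither the host byte order nor the implementation-defined value of a multicharacter literal comes into play.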

 
Richard Damon
      08-25-2012
On 8/22/12 12:35 PM, Richard Smith wrote:
> I recently encountered some C++ code that made use of multicharacter
> literals -- that is, something that looks like a character literal,
> but contains more than one character:
>
> int i = 'foo';
>
> I must admit, I hadn't realised that C++ still allowed these and had
> assumed they went the way of implicit int and K&R-style function
> declarations. The standard tells me that, unsurprisingly, their
> representation implementation-defined (and so does the C standard), so
> my questions here are not about what the standard requires (nor
> whether I should be using them), but rather what implementations
> commonly do and why.
>
> Using GCC on i386, I find that
>
> 'foo' == ('f' << 16 | 'o' << 8 | 'o');
>
> Because i386 is little-endian, this implies it lays out the literal as
> "oof\0", and this is confirmed if I look at the object code
> generated. I must admit, this surprised me. Certainly this choice is
> permitted, and it's easiest for the compiler to parse as it's just a
> base-256 integer. But the only sensible reason I can think of for
> using multicharacter literals is when doing binary I/O. Short strings
> the length of the machine word exist in a number of binary formats --
> e.g. "\x7FELF" at the start of an ELF binary, or labels like "RIFF"
> and "WAVE" in the WAV audio format. If I were writing in assembly, I
> might well convert these manually to 32-bit integers and then simply
> dump them; and I can possibly imagine wanting to do that in C or C++
> when writing low-level code. But if I do that with GCC's
> multicharacter literals, they have the wrong byte order: I would have
> to dump 'EVAW' instead of 'WAVE'.
>
> It seems unlikely that GCC would make an inconvenient implementation
> choice for no good reason, so presumably, then, there is (or once was)
> another use for these that's eluding me. Can anyone suggest what it
> is?
>
> Richard
>


It looks like GCC has decided to implement multicharacter literals as a
form of "Base 256" numbers, which is actually a common use for this sort
of thing. This makes 't' the same as '\0\0\0t' instead of 't\0\0\0', which,
if you think about it, is the required meaning for a single character
literal. Since a single character literal MUST place its value in the
bottom of the value, it makes sense to keep this up. It also means that
two character literals are 16 bit values, 4 character literals are 32
bit values, and if you want to allow them, 8 character literals are 64
bit values.

Any encoding of string that puts them in memory order on little endian
machines breaks this very useful property.

Since any program which directly reads binary files with multi-byte
fields needs to worry about endian issues, it shouldn't be THAT hard to
have the program consider this header code as a "big endian int" and
thus do the byte reversal on fetching, thus allowing the comparison to
be done to the "natural" format for multicharacter literals.
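The byte reversal on fetching could be sketched as follows. This is only an illustration under the assumption of a four-byte header tag; `load_be32`, `tag_value` and `is_riff` are hypothetical names, and `tag_value` builds the base-256 number by hand instead of relying on the implementation-defined value of a literal such as 'RIFF'.

```cpp
#include <cstdint>

// Assembles four bytes into an integer, treating the first byte as the most
// significant one, whatever the host byte order is.
constexpr std::uint32_t load_be32(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

// Base-256 value of a four-character tag, first character most significant.
constexpr std::uint32_t tag_value(char a, char b, char c, char d) {
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(a)) << 24) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 16) |
           (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 8) |
           static_cast<std::uint32_t>(static_cast<unsigned char>(d));
}

// Compares a header read from a file against the "RIFF" tag.
constexpr bool is_riff(const unsigned char* header) {
    return load_be32(header) == tag_value('R', 'I', 'F', 'F');
}
```

Since the header is assembled with shifts rather than copied into an int, the comparison gives the same answer on little-endian and big-endian hosts.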
 