char_traits<char>::compare

 
 
Earl Purple
 
      08-10-2005
On VC++.NET it is implemented like this

static int __cdecl compare
(
    const _Elem *_First1,
    const _Elem *_First2,
    size_t _Count
)
{   // compare [_First1, _First1 + _Count) with [_First2, ...)
    return (::memcmp(_First1, _First2, _Count));
}

i.e. using memcmp. But memcmp is an unsigned comparison, whereas char
is a signed character.

Therefore if I declare a std::string as "\x80" and another std::string
as "\x7f" and do a comparison, the one that is "\x7f" is "lower",
although if I compared their first characters then the first character
of the "\x80" string is "lower".
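A minimal sketch of the mismatch described above. The names `OrderingResult` and `compare_both_ways` are made up for illustration, and the element-wise result depends on whether plain `char` is signed on the compiler in use.

```cpp
#include <cassert>
#include <limits>
#include <string>

// Hypothetical helper type: records how the two orderings come out.
struct OrderingResult {
    bool string_less;     // "\x7f" < "\x80" as std::string (char_traits)
    bool char_less;       // '\x7f' < '\x80' with the built-in operator<
    bool char_is_signed;  // whether plain char is signed on this compiler
};

inline OrderingResult compare_both_ways() {
    const std::string s80("\x80");
    const std::string s7f("\x7f");

    OrderingResult r;
    // Goes through std::char_traits<char>::compare.
    r.string_less = s7f < s80;
    // Plain char comparison; 0x80 is negative when char is signed.
    r.char_less = s7f[0] < s80[0];
    r.char_is_signed = std::numeric_limits<char>::is_signed;
    return r;
}
```

With a signed `char`, `string_less` and `char_less` are expected to disagree; with an unsigned `char` they agree.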

Is this behaviour standard? Is it correct? Is there a formal definition
of what the result of a std::string comparison should return if one or
more of the characters in one or other of the strings is "negative".

 
Victor Bazarov
 
      08-10-2005
Earl Purple wrote:
> On VC++.NET it is implemented like this
>
> static int __cdecl compare
> (
> const _Elem *_First1,
> const _Elem *_First2,
> size_t _Count
> )
> { // compare [_First1, _First1 + _Count) with [_First2, ...)
> return (::memcmp(_First1, _First2, _Count));
> }
>
> i.e. using memcmp. But memcmp is an unsigned comparison, whereas char
> is a signed character.


Whether 'char' is signed is implementation-defined. You can usually
change it with a compiler command-line switch.

> Therefore if I declare a std::string as "\x80" and another std::string
> as "\x7f" and do a comparison, the one that is "\x7f" is "lower",
> although if I compared their first characters then the first character
> of the "\x80" string is "lower".
>
> Is this behaviour standard?


Reading the requirements for char_traits, 'compare' should yield 0 if for
every i in the range [0,_Count) 'eq(_First1[i], _First2[i])' is true; a
negative value if there exists a j for which 'lt(_First1[j], _First2[j])'
is true and 'eq' is true for all preceding chars; and a positive value
otherwise.

There is no requirement in the Standard as to how to implement those.
The traits essentially govern the sorting, not operator< or operator==,
which you were probably using when you "compared their first characters".
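As a rough sketch, a 'compare' written directly in terms of the traits' own 'eq' and 'lt' could look like the following. The function name `compare_via_traits` is hypothetical, and this is not how any particular library implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical reference implementation: walks both ranges and lets the
// traits class decide equality and ordering for each pair of characters.
template <class Traits>
int compare_via_traits(const typename Traits::char_type* first1,
                       const typename Traits::char_type* first2,
                       std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (!Traits::eq(first1[i], first2[i])) {
            // First mismatch decides the result.
            return Traits::lt(first1[i], first2[i]) ? -1 : 1;
        }
    }
    return 0;  // every pair compared equal
}
```

Any implementation whose 'compare' disagrees in sign with this loop is using a different ordering in 'compare' than in 'lt'.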

> Is it correct? Is there a formal definition
> of what the result of a std::string comparison should return if one or
> more of the characters in one or other of the strings is "negative".


There is no "negative" or "positive" in there. Those are just characters
for which there are traits, which in turn say how the strings compare.

V
 
Earl Purple
 
      08-10-2005

Victor Bazarov wrote:
> Reading the requirements for char_traits, 'compare' should yield 0 if for
> any i in the range [0,_Count) 'eq(_First[i], _Second[i])' is true, and
> yield -1 if exists j for which 'lt(_First[j], _Second[j])' is true and
> 'eq' is true for all preceding chars, and 1 otherwise.
>
> There is no requirement in the Standard as to how to implement those.
> The traits essentially govern the sorting, not operator< or operator==,
> which you were probably using when you "compared their first characters".


from char_traits<char> (on VC .NET)

static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
{
    // test if _Left precedes _Right

    return (_Left < _Right);
}

but '\x80' < '\x7f' because char is signed here. Thus when I have my
strings

std::string s128( "\x80" );
std::string s127( "\x7f" );

s127 < s128 but s128[0] < s127[0]

As basic_string (correctly) uses char_traits to do the comparison
(that's what it's there for isn't it?) the inconsistency is in
char_traits.

VC .NET provides no specialisation for char_traits<unsigned char> and I
have actually implemented my own traits class for unsigned char (but
not char_traits because I'm not supposed to extend namespace std),
which for me guarantees I will get consistent behaviour.

I just wanted to know if this inconsistency is part of the standard,
and by your quoting of the standard it is not - it is against the
standard rule for char_traits::compare.




> > Is it correct? Is there a formal definition
> > of what the result of a std::string comparison should return if one or
> > more of the characters in one or other of the strings is "negative".

>
> There is no "negative" or "positive" in there. Those are just characters
> for which there are traits, which in turn say how the strings compare.
>
> V


 
Victor Bazarov
 
      08-10-2005
Earl Purple wrote:
> Victor Bazarov wrote:
>
>>Reading the requirements for char_traits, 'compare' should yield 0 if for
>>any i in the range [0,_Count) 'eq(_First[i], _Second[i])' is true, and
>>yield -1 if exists j for which 'lt(_First[j], _Second[j])' is true and
>>'eq' is true for all preceding chars, and 1 otherwise.
>>
>>There is no requirement in the Standard as to how to implement those.
>>The traits essentially govern the sorting, not operator< or operator==,
>>which you were probably using when you "compared their first characters".

>
>
> from char_traits<char> (on VC .NET)
>
> static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
> {
> // test if _Left precedes _Right
>
> return (_Left < _Right);
> }
>
> [...]
> I just wanted to know if this inconsistency is part of the standard,
> and by your quoting of the standard it is not - it is against the
> standard rule for char_traits::compare.
>


Yes, it certainly seems so. You should perhaps contact Dinkumware (the
implementors of the standard library Microsoft ships along with VC++
compilers) and let them know...

V
 
P.J. Plauger
 
      08-10-2005
Earl Purple wrote:

> Victor Bazarov wrote:
>> Reading the requirements for char_traits, 'compare' should yield 0 if for
>> any i in the range [0,_Count) 'eq(_First[i], _Second[i])' is true, and
>> yield -1 if exists j for which 'lt(_First[j], _Second[j])' is true and
>> 'eq' is true for all preceding chars, and 1 otherwise.
>>
>> There is no requirement in the Standard as to how to implement those.
>> The traits essentially govern the sorting, not operator< or operator==,
>> which you were probably using when you "compared their first characters".

>
> from char_traits<char> (on VC .NET)
>
> static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
> {
> // test if _Left precedes _Right
>
> return (_Left < _Right);
> }
>
> but 0x80 < 0x7f because char is signed. Thus when I have my strings
>
> std::string s128( "\x80" );
> std::string s127 ("\x7f" );
>
> s127 < s128 but s128[0] < s127[0]
>
> As basic_string (correctly) uses char_traits to do the comparison
> (that's what it's there for isn't it?) the inconsistency is in
> char_traits.
>
> VC .NET provides no specialisation for char_traits<unsigned char> and I
> have actually implemented my own traits class for unsigned char (but
> not char_traits because I'm not supposed to extend namespace std),
> which for me guarantees I will get consistent behaviour.


The template definition works fine for unsigned char. You don't
need to explicitly specialize it.

> I just wanted to know if this inconsistency is part of the standard,
> and by your quoting of the standard it is not - it is against the
> standard rule for char_traits::compare.


Once upon a time, the draft C++ Standard spelled out that memcmp
should be used for char_traits<char>::compare. That got lost
along the way. Most (or possibly all) implementations still use
memcmp as a result. I know there has been discussion on the
C++ library committee reflector about this. IIRC, the consensus
is that memcmp is the right way to go. Whether there's a Defect
Report on this topic I don't recall.
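For what it's worth, my understanding is that this was later settled in favour of the memcmp ordering (library issue 467, reflected in C++11), with 'lt' for char_traits<char> defined to compare as unsigned char. A small check along those lines, with a made-up helper name, assuming a reasonably current standard library:

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: reports whether char_traits<char>::lt orders two
// characters the same way a one-byte memcmp does.
inline bool lt_agrees_with_memcmp(char a, char b) {
    typedef std::char_traits<char> traits;
    const bool by_lt = traits::lt(a, b);
    const bool by_memcmp = std::memcmp(&a, &b, 1) < 0;
    return by_lt == by_memcmp;
}
```

On the VC7.1 library discussed in this thread the check would be expected to fail for a pair such as '\x7f' and '\x80', since its 'lt' uses the signed built-in operator<.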

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


 
Earl Purple
 
      08-10-2005

P.J. Plauger wrote:
>
> The template definition works fine for unsigned char. You don't
> need to explicitly specialize it.


Actually it does not work fine when using it for basic_ofstream to
write binary, but this is caused by another issue. If the character at
position 0 or any multiple of 8192 happens to be 0xff it rips it out as
an EOF.

The templated version for compare "works" but does not take advantage
of the nature of unsigned char such that memcmp and memcpy can be
safely used for comparison/copying and are probably more efficient than
the byte-by-byte versions.

> Once upon a time, the draft C++ Standard spelled out that memcmp
> should be used for char_traits<char>::compare. That got lost
> along the way. Most (or possibly all) implementations still use
> memcmp as a result. I know there has been discussion on the
> C++ library committee reflector about this. IIRC, the consensus
> is that memcmp is the right way to go. Whether there's a Defect
> Report on this topic I don't recall.


Thank you for clearing that up. So effectively, if you want consistent
results across all compilers, it's better not to rely on it when any
character in your string has the high (sign) bit set.

 
P.J. Plauger
 
      08-10-2005
Earl Purple wrote:

> P.J. Plauger wrote:
>>
>> The template definition works fine for unsigned char. You don't
>> need to explicitly specialize it.

>
> Actually it does not work fine when using it for basic_ofstream to
> write binary, but this is caused by another issue. If the character at
> position 0 or any multiple of 8192 happens to be 0xff it rips it out as
> an EOF.


I'm assuming that's a lower-level C issue. No reason why it should
happen in the C++ buffering.

> The templated version for compare "works" but does not take advantage
> of the nature of unsigned char such that memcmp and memcpy can be
> safely used for comparison/copying and are probably more efficient than
> the byte-by-byte versions.


Until you can demonstrate that your program runs too slow because
this optimization is missing, it's safe to say that the templated
version works, period.

>> Once upon a time, the draft C++ Standard spelled out that memcmp
>> should be used for char_traits<char>::compare. That got lost
>> along the way. Most (or possibly all) implementations still use
>> memcmp as a result. I know there has been discussion on the
>> C++ library committee reflector about this. IIRC, the consensus
>> is that memcmp is the right way to go. Whether there's a Defect
>> Report on this topic I don't recall.

>
> Thank you for clearing that up. So effectively it's better not to use
> it if you are going to have any characters in your string that have the
> negative bit set if you want consistent results across all compilers.


The only real issue is the ordering rule used for comparisons. If
you don't like what you get by default, you can always make your
own.
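A minimal sketch of "making your own", assuming you want strings ordered by signed byte value. The names `signed_order_traits` and `signed_order_string` are hypothetical; everything except 'lt' and 'compare' is inherited from std::char_traits<char>.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical traits class: same as char_traits<char>, but orders
// characters as signed values and keeps compare consistent with lt.
struct signed_order_traits : std::char_traits<char> {
    static bool lt(char a, char b) {
        return static_cast<signed char>(a) < static_cast<signed char>(b);
    }
    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
};

typedef std::basic_string<char, signed_order_traits> signed_order_string;
```

With this type, signed_order_string("\x80") sorts before signed_order_string("\x7f"), the opposite of what std::string gives with a memcmp-based compare.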

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


 
Earl Purple
 
      08-11-2005

P.J. Plauger wrote:
> > Actually it does not work fine when using it for basic_ofstream to
> > write binary, but this is caused by another issue. If the character at
> > position 0 or any multiple of 8192 happens to be 0xff it rips it out as
> > an EOF.

>
> I'm assuming that's a lower-level C issue. No reason why it should
> happen in the C++ buffering.



No, the error comes from this function in basic_streambuf: (I have
formatted it to make it a bit easier to read)

virtual streamsize xsputn
(const _Elem *_Ptr, streamsize _Count)
{   // put _Count characters to stream
    streamsize _Size, _Copied;

    for (_Copied = 0; 0 < _Count; )
    {
        if
        (
            ( pptr() != 0 ) &&
            ( 0 < (_Size = (streamsize)(epptr() - pptr())) )
        )
        {   // copy to write buffer
            if (_Count < _Size)
            {
                _Size = _Count;
            }
            _Traits::copy(pptr(), _Ptr, _Size);
            _Ptr += _Size;
            _Copied += _Size;
            _Count -= _Size;
            pbump((int)_Size);
        }
        else if // ** ERROR IN THIS SECTION **
        (
            _Traits::eq_int_type
            (
                _Traits::eof(), overflow(_Traits::to_int_type(*_Ptr) )
            )
        )
        {
            break; // single character put failed, quit
        }
        else
        {   // count character successfully put
            ++_Ptr;
            ++_Copied;
            --_Count;
        }
    }
    return (_Copied);
}

thus you have assumed that if the first character in our buffer happens
to be 0xff it is an end of file. (For a binary file this is not the
case). to_int_type for 0xff (unsigned) produces 0x000000ff which is not
equal to 0xffffffff.

My "fix" in my own version was to make int_type an int so eq always
fails.
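To illustrate why a wider int_type avoids the collision, here is a small sketch. The helper names are made up, and the second function assumes EOF is -1, which is typical but not guaranteed.

```cpp
#include <cassert>
#include <string>

// With std::char_traits<char>, int_type is int, so every byte value
// (including 0xff) maps to something different from eof().
inline bool byte_is_distinct_from_eof(char c) {
    typedef std::char_traits<char> traits;
    return !traits::eq_int_type(traits::to_int_type(c), traits::eof());
}

// Hypothetical narrow case: if int_type were unsigned char itself, an
// EOF of -1 (assumed) would be squeezed to 0xff and collide with data.
inline bool narrow_int_type_collides() {
    const unsigned char eof_narrow = static_cast<unsigned char>(-1);
    const unsigned char data = 0xff;
    return eof_narrow == data;
}
```

The first function returns true for every char, which is what makes the stream binary transparent; the second returns true, which is the failure mode in the xsputn code above.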

Here is a test to reproduce the bug.

#include <fstream>
#include <string>

int main()
{
    std::basic_ofstream< unsigned char > outFile
    (
        "test.dat",
        std::ios_base::binary | std::ios_base::trunc
    );

    std::basic_string<unsigned char> data( 16, '\xff' );
    for ( int iters=0; iters<8192; ++iters )
    {
        outFile.write( data.c_str(), 17 );
    }
}

So we are writing 17 characters, 16 of 0xff followed by the 0
terminator, 8192 times. That should give us a file length of
17 * 8192 = 139264 bytes, or 0x22000 in hex. On mine (VC7.1.308 it is
49 bytes short.
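For comparison, a sketch of the same write pattern using plain char and std::ofstream, which measures the resulting file size. The function name `write_and_measure` and the file path passed to it are made up; this is only a way to check the expected length on a given library, not a fix for the unsigned char stream.

```cpp
#include <cassert>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: writes a 17-byte record (16 x 0xff plus the
// terminating 0) 8192 times and returns the file size, or -1 on error.
inline long write_and_measure(const char* path) {
    {
        std::ofstream out(path,
                          std::ios_base::binary | std::ios_base::trunc);
        if (!out) return -1;
        const std::string data(16, '\xff');
        for (int iters = 0; iters < 8192; ++iters) {
            out.write(data.c_str(), 17);
        }
        if (!out) return -1;
    }   // stream is flushed and closed here

    // Reopen positioned at the end to read back the size.
    std::ifstream in(path, std::ios_base::binary | std::ios_base::ate);
    if (!in) return -1;
    const std::streamoff size = in.tellg();
    return static_cast<long>(size);
}
```

A binary-transparent stream should report 139264 here.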

 
P.J. Plauger
 
      08-11-2005
Earl Purple wrote:

> P.J. Plauger wrote:
>> > Actually it does not work fine when using it for basic_ofstream to
>> > write binary, but this is caused by another issue. If the character at
>> > position 0 or any multiple of 8192 happens to be 0xff it rips it out as
>> > an EOF.

>>
>> I'm assuming that's a lower-level C issue. No reason why it should
>> happen in the C++ buffering.

>
> No, the error comes from this function in basic_streambuf: (I have
> formatted it to make it a bit easier to read)
>
> virtual streamsize xsputn
> (const _Elem *_Ptr, streamsize _Count)
> { // put _Count characters to stream
> streamsize _Size, _Copied;
>
> for (_Copied = 0; 0 < _Count; )
> {
> if
> (
> ( pptr() != 0 ) &&
> ( 0 < (_Size = (streamsize)(epptr() - pptr())) )
> )
> { // copy to write buffer
> if (_Count < _Size)
> {
> _Size = _Count;
> }
> _Traits::copy(pptr(), _Ptr, _Size);
> _Ptr += _Size;
> _Copied += _Size;
> _Count -= _Size;
> pbump((int)_Size);
> }
> else if // ** ERROR IN THIS SECTION **
> (
> _Traits::eq_int_type
> (
> _Traits::eof(), overflow(_Traits::to_int_type(*_Ptr) )
> )
> )
> {
> break; // single character put failed, quit
> }
> else
> { // count character successfully put
> ++_Ptr;
> ++_Copied;
> --_Count;
> }
> }
> return (_Copied);
> }
>
> thus you have assumed that if the first character in our buffer happens
> to be 0xff it is an end of file. (For a binary file this is not the
> case). to_int_type for 0xff (unsigned) produces 0x000000ff which is not
> equal to 0xffffffff.
>
> My "fix" in my own version was to make int_type an int so eq always
> fails.


Ah, now I see the problem. We've long since changed the default type
for the template version of basic_streambuf to long, which is essentially
the same as your fix. That happened after we delivered the V7.1 library
to Microsoft. The old default, having int_type the same as char_type,
is not binary transparent, as you've observed.

It's fixed in the library we currently license from our web site (thus
my confusion). Should also work fine in Whidbey (VC++ V.

Thanks for the clarification.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


 