wtf is happening here @ bitwise comparison

 
 
tschmittldk
 
      12-22-2010
Hey guys... I ran into a problem at university today which I really don't
understand:

char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....

Now I tried several comparisons:
if(c == '\xc3')
if((unsigned int)c == 0xc3)
if((int)c == 0xc3)
if((unsigned int)c == (unsigned int)0xc3)

All of them evaluate to false and execution goes on. But when I do a very
stupid bitwise comparison first, it works:

if(((unsigned int)c & 0xc3) == 0xc3)

Can anyone explain that to me? I really don't get the difference
between if(((unsigned int)c & 0xc3) == 0xc3) and if((unsigned int)c ==
0xc3).

Best regards
Tobias
 
Victor Bazarov
 
      12-22-2010
On 12/22/2010 7:59 AM, tschmittldk wrote:
> Hey guys... I had an issue today in the university which i really dont
> understand:
>
> char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
>
> now i tried to compare several times:
> if(c == '\xc3')
> if((unsigned int)c == 0xc3)
> if((int)c == 0xc3)
> if((unsigned int)c == (unsigned int)0xc3)
>
> All of them negate and go on.


Really? Please post the entire program. I'm asking because I just tried

#include <cassert>

int main()
{
char c = '\xc3';

assert(c == '\xc3');
}

And it passed with flying colors (as it should). So, you're either
mistaken about your first case or you're lying intentionally to make
your point. I don't like the latter, and hopefully it's not true.

> But when i do a very stupid bitwise
> comparison before it works:
>
> if(((unsigned int)c& 0xc3) == 0xc3)
>
> can anyone explain that to me? I really don't get the difference
> betweet if(((unsigned int)c& 0xc3) == 0xc3) and if((unsigned int)c ==
> 0xc3).


The trick with the other three equality comparisons is that their
right-hand side is the int 0xC3 rather than the char '\xc3', so the
promotions and conversions no longer produce the same value on both sides.

The value of 'c' (which is likely only 8 bits long) is *negative*
according to your initialization (and is -61). The value 0xC3 (an
implicit int) is positive (+195). Convert -61 (which undergoes an
implicit conversion to int first) to unsigned, and you get 0xFFFFFFC3
(with a 32-bit unsigned int; 0xFFC3 with a 16-bit one), which
is definitely not equal to 0xC3. Converting to int (your third
comparison) just makes explicit the usual implicit promotion. In the fourth
comparison, casting 0xC3 to unsigned int makes no difference; the
value does not change.

The problem you have is that your 'c' is *signed* and *negative*.
Please study explicit and implicit integral promotions and arithmetic
conversions to get to the bottom of what's happening.
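
For reference, the conversions described above can be reproduced in a small sketch. It assumes 8-bit bytes and uses signed char to stand in for a plain char that happens to be signed on the poster's platform; kLead and the cmp_* function names are made up for illustration.

```cpp
#include <climits>

// Assumption: 8-bit bytes. signed char models a plain char that is signed.
static_assert(CHAR_BIT == 8, "sketch assumes 8-bit bytes");

// Bit pattern 0xC3 stored in a signed 8-bit type: the value is -61.
constexpr signed char kLead = static_cast<signed char>(0xC3);

// c == '\xc3': both operands hold -61 and promote to the same int.
constexpr bool cmp_char(signed char c) { return c == kLead; }

// (int)c == 0xc3: compares -61 with +195.
constexpr bool cmp_int(signed char c) { return static_cast<int>(c) == 0xC3; }

// (unsigned int)c == 0xc3: -61 wraps around to UINT_MAX - 60
// (0xFFFFFFC3 when unsigned int has 32 bits).
constexpr bool cmp_unsigned(signed char c) {
    return static_cast<unsigned int>(c) == 0xC3u;
}

// Converting to unsigned char first keeps only the low 8 bits: 195.
constexpr bool cmp_via_uchar(signed char c) {
    return static_cast<unsigned char>(c) == 0xC3;
}

static_assert(kLead == -61, "0xC3 as a signed 8-bit value");
static_assert(static_cast<unsigned int>(kLead) == UINT_MAX - 60u,
              "signed-to-unsigned conversion wraps modulo 2^N");
```

Only cmp_char and cmp_via_uchar return true for kLead; the other two compare -61 (or its wrapped-around unsigned value) against 195.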

V
--
I do not respond to top-posted replies, please don't ask
 
SG
 
      12-22-2010
On 22 Dez., 13:59, tschmittldk wrote:
>
> char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
>
> now i tried to compare several times:
> if(c == '\xc3')


Really? This fails? Weird...

> if((unsigned int)c == 0xc3)
> if((int)c == 0xc3)
> if((unsigned int)c == (unsigned int)0xc3)
>
> All of them negate and go on. But when i do a very stupid bitwise
> comparison before it works:
>
> if(((unsigned int)c & 0xc3) == 0xc3)
>
> can anyone explain that to me? I really don't get the difference


A couple of hints:
- integral promotion
- 'char' appears to be a signed type in your case

Before the comparison operator is applied, the usual arithmetic
conversions (which begin with integral promotion) convert both operands
to a common type that's at least 'int'. Assuming 'char' is a signed 8-bit
type, '\xc3' represents a negative number. Assuming the popular two's
complement, its value is -61. Even (unsigned int)c gives you a value like
0xF...FC3 due to the rules about converting signed to unsigned values.

Btw, your bit mask trick is neither portable (w.r.t. signed value
representations) nor correct (false positives).

I'd simply use unsigned char and unsigned types. The C++ standard
allows you to use a pointer of type "unsigned char*" to point to a
char array.
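
To make the "false positives" point concrete, here is a minimal sketch. The names mask_trick, is_byte and count_byte are hypothetical and not from the original program; the sketch only contrasts the bit-mask test with a comparison done on unsigned char.

```cpp
#include <cstddef>

// The workaround from the question: true whenever all bits of 0xC3 are set,
// which also holds for other bytes such as 0xC7 or 0xFF.
constexpr bool mask_trick(char c) {
    return (static_cast<unsigned int>(c) & 0xC3u) == 0xC3u;
}

// Comparing through unsigned char matches exactly one byte value,
// regardless of whether plain char is signed.
constexpr bool is_byte(char c, unsigned char value) {
    return static_cast<unsigned char>(c) == value;
}

// Viewing a char array through an unsigned char pointer, as suggested above.
// Counts how many bytes of the NUL-terminated string s equal value.
inline std::size_t count_byte(const char* s, unsigned char value) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    std::size_t n = 0;
    for (; *p != 0; ++p) {
        if (*p == value) ++n;
    }
    return n;
}
```

With these definitions, mask_trick accepts the byte 0xC7 although it is not 0xC3, while is_byte rejects it.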

Cheers!
SG
 
 
tschmittldk
 
      12-22-2010
Okay, thanks for all your answers. I'll try it tomorrow and post the code
then (I left my notebook in my student flat...). But it seems
clearer to me now, thanks!

 
tschmittldk
 
      12-23-2010
On 22 Dez., 19:45, tschmittldk <(E-Mail Removed)> wrote:
> Okay thanks for all your answers. I try it tomorrow and post the code
> then (I left my notebook in my student flat...). But it seems more
> clearly to me now, thanks!


Okay, now here's the code:

void codevert(char *ArrayToTransform)
{
    int j = 0;
    char *ptr = ArrayToTransform;
    while (*ptr != '\0') {
        if((*ptr & 0xC0) > 0xbf)
        {
            if(*ptr == '\xc3')
                simplifier_correct(3, ptr++);
            else if(*ptr == '\xc4')
                simplifier_correct(3, ptr++);
            else if(*ptr == '\xc4')
                simplifier_correct(3, ptr++);
            else
                std::cout << "E01";
        }
        ptr++;
    }
}

It runs through the array, checks whether each byte is a lead byte, and
passes it to different map functions, which replace the byte with a plain
ASCII letter. For example, turning an accented o into o, or an accented A into A.

Now I just need to kill the lead byte and it's done.
 
tschmittldk
 
      12-23-2010
> This is all very brittle.
Sorry, I'm new to C++.

> I would rewrite this code about like this:
>
>     const unsigned char *ptr = reinterpret_cast<unsigned char*>
> (ArrayToTransform);
>     while (*ptr) {
>         if((*ptr & 0xC0) > 0xbf)
>         {
>             if(*ptr == 0xc3)
>             // ...


I mostly fixed my program with your code. The only thing: I cannot declare
*ptr as const, because simplifier_correct receives ptr and writes to the
byte it points to.

We have this now:

void unicodevert(char *ArrayToTransform) // works
{
    int j = 0;
    unsigned char *ptr = reinterpret_cast<unsigned char*>(ArrayToTransform);

    //char *ptr = ArrayToTransform;
    while (*ptr)
    {
        if((*ptr & 0xC0) > 0xbf) // is Leadbyte?!
        {
            // Check which Leadbyte and give the right information to simplifier_correct...
            if(*ptr == 0xc3)
                simplifier_correct(3,(ptr+1));
            //....




And

void simplifier_correct(int j, unsigned char *search) // non-const: the function writes through search
{
    unsigned char *buff = search;
    if(j == 4)
    {
        for(int i=0; i<3 ;i++) {
            buff = _mbspbrk(gsC4UCHAR_CONVMAP[i].MAP, search);
            if(buff != NULL)
                *search = gsC4UCHAR_CONVMAP[i].REPLACER;
        }
    }
    //... with other cases, but it's all the same code with other maps.


Another thing... I tried to use "memmove" to overwrite the lead byte in
the char array, like:
"helloworld" should become "hellworld" if o was a lead byte. But I got
access violation errors all the time. So I coded:

unsigned char *ptr3 = ptr;
unsigned char *ptr2 = (ptr+1);
while(*ptr2)
{
    *ptr3 = *ptr2;
    ptr3++;
    ptr2++;
}
*ptr3 = '\0';

I know it only works "to the left", but that's all I need. Do
you think that is okay? I mean... it does mostly the same as memmove
does.
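
For comparison, a memmove-based version could look roughly like the sketch below. remove_byte is a hypothetical helper, not part of the program above. It assumes the buffer is a writable, NUL-terminated array and that pos is a valid index; passing a string literal or a wrong length are possible causes of an access violation, though the thread does not show which one applied here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Removes the byte at index pos from the NUL-terminated string s, in place.
// Assumption: s points to writable memory and pos < strlen(s).
inline void remove_byte(char* s, std::size_t pos) {
    assert(pos < std::strlen(s));
    // Move the tail, including the terminating '\0', one position left.
    // memmove is used because source and destination overlap.
    std::memmove(s + pos, s + pos + 1, std::strlen(s + pos + 1) + 1);
}
```

Called on a local array such as char buf[] = "helloworld"; with pos 4, this leaves "hellworld" in buf. The number of bytes moved is the length of the tail plus one, so the terminator travels with it.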


Thanks for help
best regards
Tobias
 
Paul N
 
      12-23-2010
On Dec 22, 12:59 pm, tschmittldk <(E-Mail Removed)> wrote:
> Hey guys... I had an issue today in the university which i really dont
> understand:
>
> char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
>
> now i tried to compare several times:
> if(c == '\xc3')
> if((unsigned int)c == 0xc3)
> if((int)c == 0xc3)
> if((unsigned int)c == (unsigned int)0xc3)
>
> All of them negate and go on. But when i do a very stupid bitwise
> comparison before it works:
>
> if(((unsigned int)c & 0xc3) == 0xc3)
>
> can anyone explain that to me? I really don't get the difference
> betweet if(((unsigned int)c & 0xc3) == 0xc3) and if((unsigned int)c ==
> 0xc3).


Other people have gone into the detail of this but there is one detail
that *might* be causing problems.

In the language C, '\xc3' has type int. In the language C++, '\xc3'
has type char. So the exact same code can give different results,
depending on whether you feed it into a C compiler or a C++ compiler.

For good measure, many C++ compilers actually include a C compiler
which, if told to compile a C program, will compile the code as if it
is a C program. So you need to be sure you are driving the compiler
correctly. It might be a useful test to include something in your
program which is valid C++ but not valid C, just to make sure you are
using the right language.
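
One way to build such a check into the source is sketched below. It relies on the standard __cplusplus macro and on the fact that a character literal has type char in C++; the function name literal_has_char_size is made up for illustration.

```cpp
// Stop immediately if this file is fed to a C compiler.
#ifndef __cplusplus
#error "compiled as C: character literals would have type int here"
#endif

#include <type_traits>

// In C++ the literal '\xc3' has type char; in C it would have type int.
static_assert(std::is_same<decltype('\xc3'), char>::value,
              "character literal should have type char in C++");

// sizeof(char) is 1 by definition, so this holds only when the
// literal really is a char.
constexpr bool literal_has_char_size() {
    return sizeof('\xc3') == sizeof(char);
}
```

If the file is accidentally compiled as C, the #error directive stops the build before any comparison can behave differently.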

Hope that helps.
Paul.

 
James Kanze
 
      12-26-2010
On Dec 23, 9:03 am, Paavo Helde <(E-Mail Removed)> wrote:
> tschmittldk <(E-Mail Removed)> wrote in news:9139bceb-5be4-
> (E-Mail Removed):
> > On 22 Dez., 19:45, tschmittldk <(E-Mail Removed)> wrote:
> >> Okay thanks for all your answers. I try it tomorrow and
> >> post the code then (I left my notebook in my student
> >> flat...). But it seems more clearly to me now, thanks!


> > Okay, now here's the code:


> > void codevert(char *ArrayToTransform)
> > {
> >     int j = 0;
> >     char *ptr = ArrayToTransform;
> >     while (*ptr != '\0') {
> >         if((*ptr & 0xC0) > 0xbf)
> >         {
> >             if(*ptr == '\xc3')
> >                 simplifier_correct(3, ptr++);
> >             else if(*ptr == '\xc4')
> >                 simplifier_correct(3, ptr++);
> >             else if(*ptr == '\xc4')
> >                 simplifier_correct(3, ptr++);
> >             else
> >                 std::cout << "E01";
> >         }
> >         ptr++;
> >     }
> > }


> This is all very brittle.


Yes, but not for the reasons you imply. It's brittle because
it only handles a very small subset of UTF-8. But presumably,
the poster knows that, and accepts that any but a few specific
two byte sequences will result in "E01". Not to mention the
typo: the last two else if test exactly the same thing.

There's nothing brittle about it at the C++ level.

> *ptr is char, which is most probably a signed
> type and can be negative.


And is probably 8 bits.

> (*ptr & 0xC0) is int and appears to be positive


Not only appears to be: is.

The intermediate values will be unexpected, of course, but the
final result should be correct. (The expression *ptr might be
negative.)

> and of the desired value even if *ptr is negative, this is
> more by chance and not very portable.


Could you name an architecture where it wouldn't work? And
explain why, and what you'd get. (There is, perhaps, a brittle
part in filling the char[]. Formally, at least, it's possible
that the iostream library rejects negative chars. In
practice, a compiler whose iostream library didn't support this
kind of thing won't be used, so you don't have to worry about it.)

> 0xbf is int and positive, '\xc3' is char and
> negative.


And? In all cases, integral promotion occurs. And when the &
is present, it ensures that the results must be positive.

> I would rewrite this code about like this:


> const unsigned char *ptr = reinterpret_cast<unsigned char*>
> (ArrayToTransform);
> while (*ptr) {
> if((*ptr & 0xC0) > 0xbf)
> {
> if(*ptr == 0xc3)
> // ...


Why bother?

Actually, I'd rewrite the code more fundamentally, to make it
clear what is actually being tested; if nothing else >= 0xC0,
rather than > 0xBF, but more likely with a switch on the results
of *ptr & 0xC0 (with four cases clearly delimiting the
possibilities).
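
A rough sketch of what such a switch might look like follows. The enum ByteKind and the function classify are invented for this illustration, and the classification is deliberately coarse: it only looks at the top two bits and does not reject byte values that can never appear in valid UTF-8. It needs C++14 for the switch inside a constexpr function.

```cpp
// Coarse classification of one byte by its two most significant bits.
enum class ByteKind { Ascii, Continuation, Lead };

constexpr ByteKind classify(char c) {
    // Going through unsigned char avoids any dependence on whether
    // plain char is signed.
    switch (static_cast<unsigned char>(c) & 0xC0) {
    case 0x00:                                 // 00xxxxxx
    case 0x40: return ByteKind::Ascii;         // 01xxxxxx
    case 0x80: return ByteKind::Continuation;  // 10xxxxxx
    case 0xC0: return ByteKind::Lead;          // 11xxxxxx
    }
    return ByteKind::Ascii;  // not reached: the four cases are exhaustive
}
```

The four case labels are the only values the expression can take, which makes it visible that the original test (*ptr & 0xC0) > 0xbf is really an equality test against 0xC0.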

--
James Kanze
 
Jorgen Grahn
 
      12-29-2010
On Wed, 2010-12-22, Victor Bazarov wrote:
> On 12/22/2010 7:59 AM, tschmittldk wrote:
>> Hey guys... I had an issue today in the university which i really dont
>> understand:
>>
>> char c = '\xc3' or '\xc4' ect... its about lead bytes in UTF8....
>>
>> now i tried to compare several times:
>> if(c == '\xc3')
>> if((unsigned int)c == 0xc3)
>> if((int)c == 0xc3)
>> if((unsigned int)c == (unsigned int)0xc3)
>>
>> All of them negate and go on.

>
> Really? Please post the entire program. I'm asking because I just tried
>
> #include <cassert>
>
> int main()
> {
> char c = '\xc3';
>
> assert(c == '\xc3');
> }
>
> And it passed with flying colors (as it should). So, you're either
> mistaken about your first case or you're lying intentionally to make
> your point. I don't like the latter, and hopefully it's not true.


Interesting. I read his first line

char c = '\xc3' or '\xc4' ect...

as actually containing the token 'or', the synonym for ||. Then his
problems would make perfect sense.

The later postings showed this wasn't what he really meant, though ...
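
For the curious, that reading can be sketched as follows. The function name or_initialized is hypothetical. The sketch relies on "or" being a standard alternative token for "||"; some compilers only accept it in their conforming mode.

```cpp
// '\xc3' or '\xc4' parses as '\xc3' || '\xc4'. Both operands are non-zero,
// so the expression is the bool true, which converts to the char value 1.
constexpr char or_initialized() {
    return '\xc3' or '\xc4';
}

static_assert(or_initialized() == 1, "logical OR yields true, stored as 1");
```

A variable initialized this way holds 1, so every comparison against '\xc3' or 0xc3 would fail, whatever the signedness of char.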

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
 