strncmp and unsigned char

 
 
me
 
      05-19-2011
Hi guys,

I'm using a UTF-8 state machine I made to check and handle Unicode
strings, and was wondering if strncmp could be used for comparing the
after check or if I should roll my own?

Its prototype accepts const char * and (on Linux at least) internally
uses unsigned char.

What should I do?

Regards,

 
Shao Miller
 
      05-19-2011
On 5/19/2011 4:04 PM, me wrote:
> I'm using an utf8 state-machine I made to check and handle unicode
> strings, and was wondering if strncmp could be used for comparing the
> after check or if I should roll my own?
>
> It's prototype accepts const char and (on linux at least) internally
> uses unsigned char.
>
> What should I do?


Might you be interested in 'wcsncmp()'?
 
Ben Bacarisse
 
      05-19-2011
me <> writes:

> I'm using an utf8 state-machine I made to check and handle unicode
> strings, and was wondering if strncmp could be used for comparing the
> after check or if I should roll my own?


This confused me until I decided that a "strings" was missing:

| if strncmp could be used for comparing the [strings] after check[ing]

is that what you meant? If so, you certainly could use strncmp but the
result would be much less useful than a proper Unicode compare. As has
been suggested, you could convert to a wide string and use wcsncmp (or
wcscmp).

However, if all you want is a rather arbitrary ordering (say for a
binary search) then the byte comparison of the UTF-8 encoded strings
would do.
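A minimal sketch of that kind of use, assuming the strings have already passed the UTF-8 check. The names cmp_utf8_bytes and find_key are made up for illustration; only strcmp, qsort and bsearch come from the standard library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort/bsearch comparator: each element is a `const char *`. */
static int cmp_utf8_bytes(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);   /* plain byte order, no collation */
}

/* Sorts `keys` in place by byte order, then returns the index of
   `needle` in the sorted array, or -1 if it is not present. */
static int find_key(const char **keys, size_t n, const char *needle)
{
    qsort(keys, n, sizeof *keys, cmp_utf8_bytes);
    const char **hit = bsearch(&needle, keys, n, sizeof *keys, cmp_utf8_bytes);
    return hit ? (int)(hit - keys) : -1;
}
```

The ordering is arbitrary from a reader's point of view (for instance "zebra" sorts before "z\xC3\xA9"), but it is consistent, which is all a binary search needs.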

> It's prototype accepts const char and (on linux at least) internally
> uses unsigned char.


That's not an issue. All of C's compare functions treat the bytes as if
they were unsigned char, despite the prototypes. If you don't like the
look of the prototype, memcmp takes const void *.

--
Ben.
 
 
Angel
 
      05-19-2011
On 2011-05-19, Ben Bacarisse <> wrote:
>
>> It's prototype accepts const char and (on linux at least) internally
>> uses unsigned char.

>
> That's not an issue. All of C's compare functions treat the bytes as if
> they were unsigned char, despite the prototypes. If you don't like the
> look of the prototype, memcmp uses void *.


Unlike the str*cmp() functions, memcmp() doesn't stop at a null byte, so
if you use it you might end up comparing garbage past the terminator
(or reading out of bounds) when the strings are shorter than the given
size.
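One way around that, as a sketch only: clamp the length to what both strings actually hold before calling memcmp. The helper name bounded_memcmp is hypothetical, and it assumes both arguments are NUL-terminated.

```c
#include <assert.h>
#include <string.h>

/* Compares at most n bytes of two NUL-terminated strings with memcmp,
   never reading past either terminator. */
static int bounded_memcmp(const char *a, const char *b, size_t n)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    /* Include the terminator when a string is shorter than n, so the
       shorter string compares as smaller, as it would with strncmp. */
    size_t len_a = la < n ? la + 1 : n;
    size_t len_b = lb < n ? lb + 1 : n;
    size_t len = len_a < len_b ? len_a : len_b;
    return memcmp(a, b, len);
}
```

This walks each string once with strlen before comparing, so strncmp is still the simpler choice when the data really is a string.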


--
"C provides a programmer with more than enough rope to hang himself.
C++ provides a firing squad, blindfold and last cigarette."
- seen in comp.lang.c
 
 
Keith Thompson
 
      05-20-2011
"christian.bau" <> writes:
>> I'm using an utf8 state-machine I made to check and handle unicode
>> strings, and was wondering if strncmp could be used for comparing the
>> after check or if I should roll my own?

>
> strcmp will compare strings and return a result assuming that the data
> is signed char.


No, it won't.

strcmp's arguments are of type const char*; plain char may be either
signed or unsigned. But even if plain char is signed, C99 7.21.4p1 says:

    The sign of a nonzero value returned by the comparison functions
    memcmp, strcmp, and strncmp is determined by the sign of the
    difference between the values of the first pair of characters
    (both interpreted as unsigned char) that differ in the objects
    being compared.
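A small illustration of that paragraph, under the assumption of a conforming library. The helper names sign_of_strcmp and sign_by_hand are made up; the second one spells out the unsigned char comparison by hand so the two can be checked against each other.

```c
#include <assert.h>
#include <string.h>

/* Sign (-1, 0 or 1) of strcmp's result. */
static int sign_of_strcmp(const char *a, const char *b)
{
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}

/* The same comparison written out through unsigned char. */
static int sign_by_hand(const char *a, const char *b)
{
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;
    while (*ua != 0 && *ua == *ub) {
        ua++;
        ub++;
    }
    return (*ua > *ub) - (*ua < *ub);
}
```

With "\xC3\xA9" (the UTF-8 encoding of U+00E9) against "z", both helpers should report the first string as greater, even where plain char is signed and 0xC3 would be negative as a char.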

[...]

> The main problem is that with Unicode, just comparing code points
> isn't very meaningful. You'd have to put the code points into a
> canonical order at least to get any meaningful result. And when you do
> that, using strcmp is quite pointless.


I *think* that strcmp() returns correctly ordered results for UTF-8
strings. UTF-8 was carefully designed to make this work.

--
Keith Thompson (The_Other_Keith) kst- <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
 
Ben Bacarisse
 
      05-20-2011
Keith Thompson <kst-> writes:

> "christian.bau" <> writes:

<snip>
>> The main problem is that with Unicode, just comparing code points
>> isn't very meaningful. You'd have to put the code points into a
>> canonical order at least to get any meaningful result. And when you do
>> that, using strcmp is quite pointless.

>
> I *think* that strcmp() returns correctly ordered results for UTF-8
> strings. UTF-8 was carefully designed to make this work.


It all depends on "correctly ordered" of course. A byte-by-byte compare
of correctly encoded UTF-8 strings preserves the ordering on the
code points the strings represent. To put it another way, converting to
wide strings and using wcscmp will give the same result as strcmp will
when passed the originals. The encoded strings must not contain any
over-long representations (nor any other forbidden bytes or byte
combinations) but I think the OP has covered that since they talked
about checking the strings first.
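A rough sketch of that equivalence, decoding by hand rather than going through wchar_t. The names next_code_point, cmp_code_points and sign are hypothetical, and the decoder assumes its input has already been validated, so it does no error checking of its own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decodes one code point from validated UTF-8 and advances *p.
   Assumes no over-long forms, stray bytes or truncated sequences. */
static uint32_t next_code_point(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;                       /* number of continuation bytes */
    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    s++;
    while (extra-- > 0)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
    *p = s;
    return cp;
}

/* Compares two validated UTF-8 strings by code point: -1, 0 or 1. */
static int cmp_code_points(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    for (;;) {
        uint32_t ca = next_code_point(&pa);
        uint32_t cb = next_code_point(&pb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;                /* both strings ended together */
    }
}

/* Reduces a strcmp-style result to -1, 0 or 1. */
static int sign(int r)
{
    return (r > 0) - (r < 0);
}
```

For validated input, sign(strcmp(a, b)) should agree with cmp_code_points(a, b). One caveat on the wcscmp route: where wchar_t is 16 bits and wide strings hold UTF-16, code units do not sort in code point order (U+FFFD against U+1F600 is one such pair), so the match with strcmp is only expected where wchar_t holds whole code points.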

However, because Unicode says so much about the characters, one could
argue that a truly correct ordering should be rather more than this.
For example, "fine" with an fi ligature should compare equal to "fine"
without one and so on. If that seems too much like a detail, in some
scripts the code points are not in the correct collating sequence for
even the most basic ordering. That's what Christian Bau is saying, I
think.
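To make the ligature case concrete, here is a tiny sketch. The identifiers fine_ligature, fine_plain and bytes_equal are made up; U+FB01 LATIN SMALL LIGATURE FI encodes in UTF-8 as EF AC 81.

```c
#include <assert.h>
#include <string.h>

/* "fine" spelled with U+FB01 LATIN SMALL LIGATURE FI, then "ne". */
static const char fine_ligature[] = "\xEF\xAC\x81" "ne";
/* "fine" spelled with the two separate letters f and i. */
static const char fine_plain[] = "fine";

/* Byte-for-byte equality, which is all strcmp can offer. */
static int bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

bytes_equal(fine_ligature, fine_plain) is false, since the first bytes are 0xEF and 'f'. Treating the two as equal needs Unicode compatibility normalization or a collation library, neither of which standard C provides.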

--
Ben.
 