toupper UTF8 string

 
 
David RF
 
      09-24-2009
Hi friends, here I am trying to avoid wchar_t in UTF8 strings.
I'd be glad to hear some criticism.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
Return a new allocate string
Upper from (a - z) and (ÿþýüûúùø÷öõôóòñðïîíìëêéèçæåäãâáà)
-61 = first byte
-65 = ÿ
-96 = à
*/
char *stoupper(const char *s)
{
    size_t len;
    char *p = NULL;
    int c = 0;

    if (s) {
        len = strlen(s);
        p = malloc(len + 1);
        if (p) {
            while (*s) {
                if ((*s >= 'a') && (*s <= 'z')) {
                    c = *p = *s - 'a' + 'A';
                } else if ((c == -61) && ((*s <= -65) && (*s >= -96))) {
                    c = *p = *s - 32;
                } else {
                    c = *p = *s;
                }
                p++;
                s++;
            }
            *p = '\0';
            p -= len;
        }
    }
    return p;
}

int main(void)
{
    char *s = "María tiene moño, Ramón tiene un camión.";

    s = stoupper(s);
    printf("%s\n", s);
    return 0;
}
 
Ben Bacarisse
 
      09-24-2009
David RF <(E-Mail Removed)> writes:

> Hi friends, here I am trying to avoid wchar_t in UTF8 strings.


Why? Without knowing why, it is almost impossible to comment on the
code. It relies on a set of assumptions that might be acceptable but
I can't tell without knowing why you are not using C's multi-byte
string functions.

For example you assume char is signed.

> I'd be glad to hear some criticism.
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> /*
> Return a new allocate string
> Upper from (a - z) and (ÿþýüûúùø÷öõôóòñðïîíìëêéèçæåäãâáà)
> -61 = first byte
> -65 = ÿ
> -96 = à
> */


It can't work for ÿ (there is a Ÿ but it is not where your code
expects it to be) and upper-casing ÷ to × is just odd!

<snip>
--
Ben.
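
A minimal sketch of the byte-level approach with the points above taken into account, assuming the input is valid UTF-8 and the source character set is ASCII-compatible. Bytes are compared as unsigned char so the result does not depend on whether plain char is signed, ÷ (U+00F7) is skipped, and ÿ (U+00FF) is left alone because Ÿ (U+0178) has a different lead byte. The name upper_latin1_utf8 is made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from the thread: upper-cases ASCII a-z and the
   two-byte UTF-8 sequences for U+00E0..U+00FE, except U+00F7 (division sign).
   Returns a malloc'd copy, or NULL on NULL input or allocation failure. */
char *upper_latin1_utf8(const char *s)
{
    size_t len, i;
    unsigned char prev = 0;          /* previous input byte */
    char *out;

    if (s == NULL)
        return NULL;
    len = strlen(s);
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        /* 0..255 regardless of the signedness of plain char */
        unsigned char c = (unsigned char)s[i];
        unsigned char u = c;

        if (c >= 'a' && c <= 'z')
            u = (unsigned char)(c - 'a' + 'A');
        else if (prev == 0xC3 && c >= 0xA0 && c <= 0xBE && c != 0xB7)
            u = (unsigned char)(c - 0x20);   /* e.g. C3 B1 -> C3 91 */
        out[i] = (char)u;
        prev = c;
    }
    out[len] = '\0';
    return out;
}
```

This still covers only one small block of Unicode; anything outside ASCII and U+00E0..U+00FE passes through unchanged.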
 
David RF
 
      09-24-2009
On 24 sep, 16:09, Ben Bacarisse <(E-Mail Removed)> wrote:

> It can't work for ÿ (there is a Ÿ but it is not where your code
> expects it to be) and upper-casing ÷ to × is just odd!


You're right

> I can't tell without knowing why you are not using C's multi-byte
> string functions.


Perhaps it is time to take a look at those libraries.


 
 
David RF
 
      09-24-2009
On 24 sep, 16:09, Ben Bacarisse <(E-Mail Removed)> wrote:

> Why? Without knowing why, it is almost impossible to comment on the
> code. It relies on a set of assumptions that might be acceptable but
> I can't tell without knowing why you are not using C's multi-byte
> string functions.


Another way to do this? I am a rookie using wchars

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

char *stoupper(const char *s)
{
    char *p = NULL;
    wchar_t wc;
    size_t len;
    int mblen;

    if (s) {
        len = strlen(s);
        p = malloc(len + 1);
        if (p) {
            while (*s) {
                mbtowc(&wc, s, MB_CUR_MAX);
                wc = towupper(wc);
                mblen = wctomb(p, wc);
                p += mblen;
                s += mblen;
            }
            *p = '\0';
            p -= len;
        }
    }
    return p;
}

int main(void)
{
    char *s = "María tiene moño, Ramón tiene un camión.";

    setlocale(LC_CTYPE, "");
    s = stoupper(s);
    if (s) {
        printf("%s\n", s);
        free(s);
    }
    return 0;
}

Thanks again Ben
 
 
Ben Bacarisse
 
      09-25-2009
David RF <(E-Mail Removed)> writes:
<snip>
> Another way to do this? I am a rookie using wchars
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <locale.h>
> #include <wchar.h>
> #include <wctype.h>
>
> char *stoupper(const char *s)
> {
>     char *p = NULL;
>     wchar_t wc;
>     size_t len;
>     int mblen;
>
>     if (s) {
>         len = strlen(s);
>         p = malloc(len + 1);
>         if (p) {
>             while (*s) {
>                 mbtowc(&wc, s, MB_CUR_MAX);


I'd make a few small changes here. (1) mbtowc tells you how many chars
it used to make the wide one. You can use this later on to confirm
your assumption that the overall length is not changed by
upper-casing. (2) You can pass len instead of MB_CUR_MAX so long as
you update it using the return from mbtowc. This means there is no
possibility of ever looking past the end of s even with an ill-formed
UTF-8 string. (3) mbtowc might fail (and it can tell you when the
string has run out) so you can put the call in the while loop test:

while ((mblen = mbtowc(&wc, s, len)) > 0) ...

>                 wc = towupper(wc);
>                 mblen = wctomb(p, wc);


I'd use a new variable so that...

>                 p += mblen;
>                 s += mblen;


... here you can put the brakes on if you find the two lengths are not
the same.

>             }
>             *p = '\0';
>             p -= len;
>         }
>     }
>     return p;
> }


<snip>
--
Ben.
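
The three suggestions above could be put together roughly as follows. This is only a sketch: the name stoupper_strict, the temporary buffer and the choice to return NULL on a length mismatch are assumptions, not something spelled out in the post. The caller is expected to have selected a suitable locale with setlocale() beforehand.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* malloc, free, mbtowc, wctomb */
#include <string.h>   /* strlen, memcpy */
#include <wchar.h>
#include <wctype.h>   /* towupper */

/* Hypothetical variant of stoupper(): returns a malloc'd string, or NULL
   on allocation failure, on an invalid multibyte sequence, or when
   upper-casing changes the encoded length of a character. */
char *stoupper_strict(const char *s)
{
    char tmp[MB_LEN_MAX];   /* wctomb never writes more than MB_CUR_MAX bytes */
    char *out, *p;
    size_t left;            /* bytes of s not yet consumed */
    wchar_t wc;
    int inlen, outlen;

    if (s == NULL)
        return NULL;
    left = strlen(s);
    out = p = malloc(left + 1);
    if (out == NULL)
        return NULL;

    (void)mbtowc(NULL, NULL, 0);   /* reset any shift state */
    (void)wctomb(NULL, L'\0');

    /* (2) and (3): bound the scan by the bytes left and test the result */
    while (left > 0 && (inlen = mbtowc(&wc, s, left)) > 0) {
        /* a separate variable for the output length, as suggested */
        outlen = wctomb(tmp, (wchar_t)towupper((wint_t)wc));
        if (outlen != inlen) {     /* (1): put the brakes on */
            free(out);
            return NULL;
        }
        memcpy(p, tmp, (size_t)outlen);
        p += outlen;
        s += inlen;
        left -= (size_t)inlen;
    }
    if (left > 0) {                /* mbtowc failed before the end of s */
        free(out);
        return NULL;
    }
    *p = '\0';
    return out;
}
```

Converting into tmp first means a longer upper-case form can never overrun the output buffer before the mismatch is noticed.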
 
 
David RF
 
      09-25-2009
On 25 sep, 02:29, Ben Bacarisse <(E-Mail Removed)> wrote:
> I'd make a few small changes here. (1) mbtowc tells you how many chars
> it used to make the wide one. You can use this later on to confirm
> your assumption that the overall length is not changed by
> upper-casing. (2) You can pass len instead of MB_CUR_MAX so long as
> you update it using the return from mbtowc. This means there is no
> possibility of ever looking past the end of s even with an ill-formed
> UTF-8 string. (3) mbtowc might fail (and it can tell you when the
> string has run out) so you can put the call in the while loop test:
>
>     while ((mblen = mbtowc(&wc, s, len)) > 0) ...
>
> >                 wc = towupper(wc);
> >                 mblen = wctomb(p, wc);
>
> I'd use a new variable so that...
>
> >                 p += mblen;
> >                 s += mblen;
>
> ... here you can put the brakes on if you find the two lengths are not
> the same.
>
> >             }
> >             *p = '\0';
> >             p -= len;
> >         }
> >     }
> >     return p;
> > }


Thanks again Ben, I miss Pascal (a lot)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

static char *stoupper(const char *s)
{
    char *p = NULL, *oldp;
    size_t len;
    wchar_t wc;
    int wclen, mclen;

    if (s) {
        len = strlen(s);
        oldp = p = malloc(len + MB_CUR_MAX + 1);
        if (p) {
            while ((wclen = mbtowc(&wc, s, len)) > 0) {
                /* I know, too many casts, but makes -Wconversion flag happy */
                mclen = wctomb(p, (wchar_t)towupper((wint_t)wc));
                /* Strange ... but I always trust Ben */
                if (mclen > wclen) {
                    len += (size_t)(mclen - wclen);
                    mclen = (int)(p - oldp);
                    /* realloc it's a pain, but what else can I do? */
                    p = realloc(oldp, len);
                    if (!p) {
                        free(oldp);
                        return NULL;
                    }
                    oldp = p;
                }
                p += mclen;
                s += wclen;
            }
            *p = '\0';
            p -= len;
        }
    }
    return p;
}

int main(void)
{
    char *s = "María tiene moño, Ramón tiene un camión.";

    setlocale(LC_CTYPE, "");
    s = stoupper(s);
    if (s) {
        printf("%s\n", s);
        free(s);
    }
    return 0;
}
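
One possible way to sidestep the realloc bookkeeping is to size the buffer for the worst case up front: every input character uses at least one byte and wctomb writes at most MB_CUR_MAX bytes, so strlen(s) * MB_CUR_MAX + 1 bytes are always enough. The sketch below assumes that approach. The name stoupper_worstcase is hypothetical, and the function returns the saved start of the buffer instead of rewinding with p -= len, which only lands on the start when exactly len bytes were written.

```c
#include <stdint.h>   /* SIZE_MAX */
#include <stdlib.h>   /* malloc, free, mbtowc, wctomb, MB_CUR_MAX */
#include <string.h>   /* strlen */
#include <wchar.h>
#include <wctype.h>   /* towupper */

/* Hypothetical variant: returns a malloc'd upper-cased copy of s in the
   current locale's encoding, or NULL on failure or invalid input. */
char *stoupper_worstcase(const char *s)
{
    char *out, *p;
    size_t left, max;
    wchar_t wc;
    int inlen, outlen;

    if (s == NULL)
        return NULL;
    left = strlen(s);
    max = MB_CUR_MAX;                       /* at least 1 */
    if (left > (SIZE_MAX - 1) / max)        /* size computation would overflow */
        return NULL;
    out = p = malloc(left * max + 1);
    if (out == NULL)
        return NULL;

    (void)mbtowc(NULL, NULL, 0);            /* reset any shift state */
    (void)wctomb(NULL, L'\0');

    while (left > 0 && (inlen = mbtowc(&wc, s, left)) > 0) {
        /* at least MB_CUR_MAX + 1 bytes are still free at p here */
        outlen = wctomb(p, (wchar_t)towupper((wint_t)wc));
        if (outlen < 0)
            break;                          /* not representable: fail below */
        p += outlen;
        s += inlen;
        left -= (size_t)inlen;
    }
    if (left > 0) {                         /* stopped before the end of s */
        free(out);
        return NULL;
    }
    *p = '\0';
    return out;                             /* saved start, no rewinding */
}
```

The price is memory: with a UTF-8 locale the buffer can be several times larger than needed, so a caller that keeps the result around may want to shrink it with realloc afterwards.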
 
 
Nobody
 
      09-25-2009
On Thu, 24 Sep 2009 05:33:19 -0700, David RF wrote:

> Hi friends, here I am trying to avoid wchar_t in UTF8 strings.
> I'd be glad to hear some criticism.


Convert to wchar_t[], use towupper(), convert back to UTF-8.

Note: the C standard doesn't guarantee that wchar_t is Unicode, nor does
it provide any function which can reliably convert between a specific
encoding and wchar_t (mbstowcs/wcstombs use the locale's encoding, and the
details of locales are implementation-defined).

Also, note that converting a string to upper-case isn't quite as simple as
replacing each character with another character. For some characters, the
upper-case equivalent consists of multiple characters; e.g. the upper-case
equivalent of "ß" (German sharp s) is "SS".
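
A rough sketch of the convert, map, convert-back route, assuming the caller has already selected a UTF-8 locale with setlocale(). The name stoupper_wide is made up for illustration. Because towupper() maps one wide character to one wide character, a case such as ß to "SS" is not handled here.

```c
#include <stdlib.h>   /* malloc, free, mbstowcs, wcstombs, MB_CUR_MAX */
#include <string.h>   /* strlen */
#include <wchar.h>
#include <wctype.h>   /* towupper */

/* Hypothetical helper: multibyte string -> wchar_t[] -> towupper() ->
   multibyte string. Returns a malloc'd string, or NULL on failure or
   if s is not valid in the current locale's encoding. */
char *stoupper_wide(const char *s)
{
    size_t len, n, i, cap;
    wchar_t *w;
    char *out;

    if (s == NULL)
        return NULL;
    len = strlen(s);
    /* at most one wide character per input byte, plus the terminator */
    w = malloc((len + 1) * sizeof *w);
    if (w == NULL)
        return NULL;
    n = mbstowcs(w, s, len + 1);
    if (n == (size_t)-1) {            /* invalid multibyte sequence */
        free(w);
        return NULL;
    }
    for (i = 0; i < n; i++)
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    /* each wide character needs at most MB_CUR_MAX bytes */
    cap = n * MB_CUR_MAX + 1;
    out = malloc(cap);
    if (out != NULL && wcstombs(out, w, cap) == (size_t)-1) {
        free(out);
        out = NULL;
    }
    free(w);
    return out;
}
```

The buffer sizes are deliberately generous so that neither conversion needs a separate counting pass.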

 
 
James Kuyper
 
      09-26-2009
Joe Wright wrote:
....
> Nor does the C Standard know anything at all about Unicode.


It may not know enough about Unicode, but it does know something: see
6.4.3 and Annex D.
 
 
Nobody
 
      09-27-2009
On Sat, 26 Sep 2009 13:25:23 -0400, James Kuyper wrote:

> Joe Wright wrote:
> ...
>> Nor does the C Standard know anything at all about Unicode.

>
> It may not know enough about Unicode, but it does know something: see
> 6.4.3 and Annex D.


Also 6.10.8p2:

    __STDC_ISO_10646__  A decimal constant of the form yyyymmL (for
                        example, 199712L), intended to indicate that
                        values of type wchar_t are the coded
                        representations of the characters defined by
                        ISO/IEC 10646, along with all amendments and
                        technical corrigenda as of the specified year
                        and month.

So wchar_t *might* be Unicode, and if it is, the implementation will state
this. But it isn't required to be.

If it isn't, then you have to either:

a) figure out how to convert UTF-8 to/from wchar_t, in which case you can
then use towupper(), or

b) convert UTF-8 to/from Unicode codepoints yourself (easy enough), but
then you need to write your own towupper() equivalent (which
basically means that you need to get the tables).
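
To give an idea of what option (b) involves, here is a minimal sketch of decoding and encoding UTF-8 by hand. The function names are hypothetical, and toy_upper() is only a stand-in covering ASCII and U+00E0..U+00FF; a real replacement for towupper() would need the full Unicode case-mapping tables.

```c
#include <stddef.h>

/* Decode one code point from the n bytes at s. Returns the number of bytes
   consumed (1..4), or 0 if the sequence is truncated, overlong, a surrogate
   or otherwise invalid. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    unsigned long c, min;
    size_t need, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                          /* stray continuation or bad lead */
    }
    if (n < need)
        return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return need;
}

/* Encode a valid code point (<= 0x10FFFF, not a surrogate) into out.
   Returns the number of bytes written (1..4). */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Toy stand-in for the real tables: ASCII plus U+00E0..U+00FF only. */
unsigned long toy_upper(unsigned long cp)
{
    if (cp >= 0x61 && cp <= 0x7A)
        return cp - 0x20;
    if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7)   /* skip the division sign */
        return cp - 0x20;
    if (cp == 0xFF)
        return 0x178;                             /* ÿ -> Ÿ */
    return cp;
}
```

Note that ÿ (two bytes, C3 BF) maps to Ÿ (C5 B8), so even this tiny table changes a lead byte, and other mappings change the encoded length as well.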

 
 
Keith Thompson
 
      09-27-2009
Nobody <(E-Mail Removed)> writes:
> On Sat, 26 Sep 2009 13:25:23 -0400, James Kuyper wrote:
>
>> Joe Wright wrote:
>> ...
>>> Nor does the C Standard know anything at all about Unicode.

>>
>> It may not know enough about Unicode, but it does know something: see
>> 6.4.3 and Annex D.

>
> Also 6.10.8p2:
>
>     __STDC_ISO_10646__  A decimal constant of the form yyyymmL (for
>                         example, 199712L), intended to indicate that
>                         values of type wchar_t are the coded
>                         representations of the characters defined by
>                         ISO/IEC 10646, along with all amendments and
>                         technical corrigenda as of the specified year
>                         and month.
>
> So wchar_t *might* be Unicode, and if it is, the implementation will state
> this. But it isn't required to be.
>
> If it isn't, then you have to either:
>
> a) figure out how to convert UTF-8 to/from wchar_t, in which case you can
> then use towupper(), or


If wchar_t values don't represent Unicode code points, then converting
from UTF-8 to wchar_t might not be possible. For example, wchar_t
might be only 16 bits.
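
A small sketch of how a program might check both conditions before relying on towupper() for Unicode text. The function name wchar_holds_unicode is an assumption for illustration. It tests the macro quoted above and also whether wchar_t is wide enough for the largest code point, U+10FFFF.

```c
#include <wchar.h>   /* WCHAR_MAX */

/* Hypothetical check: returns 1 if the implementation states that wchar_t
   values are ISO/IEC 10646 code points and wchar_t can hold U+10FFFF,
   otherwise 0. */
int wchar_holds_unicode(void)
{
#if defined(__STDC_ISO_10646__)
    return WCHAR_MAX >= 0x10FFFF;
#else
    return 0;
#endif
}
```

When this returns 0, the wchar_t route cannot be assumed to cover all of Unicode on that implementation.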

> b) convert UTF-8 to/from Unicode codepoints yourself (easy enough), but
> then you need to write your own towupper() equivalent (which
> basically means that you need to get the tables).


--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 