Re: The nature of string and char[] in .NET

 
 
Michael (michka) Kaplan [MS]
      06-04-2004
"Lau Lei Cheong" <> wrote...

> I'm trying to write a converter for converting between Big5 and UTF-8,
> but I want to make sure of a few facts before writing.
>
> 1) I know that by default .NET stores strings in Unicode. Would there
> be any problem if I store Big5 characters in the string?


This will not work.

> Or could I set the codepage setting for an individual string?


No.

> 2) There are basically three types of Unicode scheme - UTF-7, UTF-8
> and UCS-2. Which one does the default Unicode setting refer to?


The third is UTF-16, not UCS-2. You could use any of them for your data, but
String is UTF-16 only, so it is much easier to convert between UTF-16 and
anything else. To convert between two other encodings, you have to go
through UTF-16.

> 3) Same as 1) but this time is for char[].


Same answer.

> I'm writing this because the webpage I'm writing is in Unicode. It
> stores data in a MySQL database that holds the data in Big5, and we also
> have a backend written in VB6 that would need to be nearly rewritten if we
> had to change to Unicode. So I plan to translate the data immediately when
> it is read from the database, and vice versa, so that no other existing
> part needs to change. I'm using LibEx with MyODBC for accessing MySQL.


Well, for this you may want to see what the database access will support so
you know what your choices are, and then also to see if the database layer
will handle the conversion from UTF-16 to UTF-8 or whatever automatically.

> This post will be crossposted to
> microsoft.public.dotnet.framework.aspnet. Any advice would be greatly
> appreciated, whether on the questions or on a better way to fetch the data
> so that no manual translation is needed.


Technically, you mean it was multiposted (crossposting is posting one
message in multiple places at the same time, multiposting is posting one
message to multiple places separately). Generally crossposting is preferred,
since that will avoid having more than one thread and will avoid duplication
of effort by others.


--
MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

This posting is provided "AS IS" with
no warranties, and confers no rights.


 
Lau Lei Cheong
      06-05-2004
Thanks for the information. But I still believe there must be some way to
handle non-Unicode string/char[], because there are components on the market
like "Chilkat Chinese Character Encoding .NET Assembly" which handle not only
Unicode but also Big5 and GB2312 conversion.
If there were no way to handle non-Unicode strings, end users of these
components would have no way to input, output, or use the converted string.

"Michael (michka) Kaplan [MS]" <> wrote in message
news: ...
> "Lau Lei Cheong" <> wrote...
>
> > I'm trying to write a converter for converting between Big5 and UTF-8,
> > but I want to make sure of a few facts before writing.
> >
> > 1) I know that by default .NET stores strings in Unicode. Would there
> > be any problem if I store Big5 characters in the string?
>
> This will not work.
>
> > Or could I set the codepage setting for an individual string?
>
> No.
>
> Technically, you mean it was multiposted (crossposting is posting one
> message in multiple places at the same time, multiposting is posting one
> message to multiple places separately). Generally crossposting is preferred,
> since that will avoid having more than one thread and will avoid duplication
> of effort by others.

I had not realized the difference between multiposting and crossposting,
because I used to post through a BBS where user-defined crossposting is not
possible. On a BBS we crosspost by placing a note at the end saying where else
the message has been posted, to avoid redundant discussion. Thanks
for the reminder; I'll do real crossposting next time.


 
Michael (michka) Kaplan [MS]
      06-05-2004
Actually, you can just convert the GB2312 via code page 936 or the Big5
via code page 950, and then they will be Unicode. But char[] and string
are only ever UTF-16 Unicode in C#.
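
A minimal sketch of that two-step route (legacy code page to Unicode, then Unicode to UTF-8). It is written in Python rather than C# purely for illustration; Python's str is also Unicode internally, although not UTF-16. The codec names cp950 and cp936 are Python's names for those code pages, and the sample text and the big5_to_utf8 helper are made up for this example.

```python
def big5_to_utf8(raw: bytes) -> bytes:
    """Decode Big5 (code page 950) bytes to Unicode, then encode as UTF-8."""
    text = raw.decode("cp950")     # legacy bytes -> Unicode string
    return text.encode("utf-8")    # Unicode string -> UTF-8 bytes


# Hypothetical sample data: the two characters U+4E2D U+6587.
sample = "\u4e2d\u6587"
big5_bytes = sample.encode("cp950")   # as the Big5 database would store it
gb_bytes = sample.encode("cp936")     # the same text in GB2312/GBK

# The byte sequences differ between the two legacy encodings...
print(big5_bytes.hex(), gb_bytes.hex())

# ...but decoding either one with the right code page gives the same
# Unicode string, which can then be written out as UTF-8.
assert big5_bytes.decode("cp950") == gb_bytes.decode("cp936") == sample
print(big5_to_utf8(big5_bytes).hex())
```

In .NET the analogous steps would be Encoding.GetEncoding(950).GetString(...) to get a String, followed by Encoding.UTF8.GetBytes(...); the String in the middle is always UTF-16.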



Mihai N.
      06-05-2004
"Michael (michka) Kaplan [MS]" <> wrote in
news::

>> 2) There are basically three types of Unicode scheme - UTF-7, UTF-8
>> and UCS-2. Which one does the default Unicode setting refer to?


> The third is UTF-16. Well, you could use either, but String is UTF-16
> only -- so it is much easier to convert to/from anything to UTF-16. To
> convert to/from anything else, you have to go through UTF-16.


Just making sure: UTF-16 is almost the same as UCS-2.
One of the differences is that UTF-16 is aware of surrogates.
Also, I think the .NET string is not aware of surrogates.
Implication: the .NET string is not UTF-16, but rather UCS-2.

Am I wrong?

--
Mihai
-------------------------
Replace _year_ with _ to get the real email
 
Michael (michka) Kaplan [MS]
      06-05-2004
"Mihai N." <> wrote...
> "Michael (michka) Kaplan [MS]" <> wrote:
>
> >> 2) There are basically three types of Unicode scheme - UTF-7, UTF-8
> >> and UCS-2. Which one does the default Unicode setting refer to?

>
> > The third is UTF-16. Well, you could use either, but String is UTF-16
> > only -- so it is much easier to convert to/from anything to UTF-16. To
> > convert to/from anything else, you have to go through UTF-16.

>
> Just making sure: UTF-16 is almost the same with UCS2.
> Among the differences is the fact that UTF-16 is aware of surrogates.
> Also, I think the .NET string is not aware of surrogates.
> Implication -> the .NET string is not UTF-16, but rather UCS2.


Actually, this is incorrect. GDI+ knows about surrogate pairs at the
rendering level (it uses them for cursor movement, etc.), so there is
knowledge of supplementary characters. A true UCS-2 string is one from
before supplementary characters even existed: surrogate code units were not
yet present as a mechanism, and methods like ParseCombiningCharacters
would not exist.
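
For readers unfamiliar with the mechanism, here is a small sketch of how UTF-16 represents a supplementary character as a surrogate pair. It is Python, used only for illustration; the helper names to_surrogate_pair and utf16_code_units are invented for this example. The arithmetic is the standard UTF-16 one: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
from typing import List, Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into
    its UTF-16 high and low surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def utf16_code_units(text: str) -> List[int]:
    """Return the UTF-16 code units of a string as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # prints: 0xd800 0xdc00

# Cross-check against Python's own UTF-16 encoder.
assert utf16_code_units("\U00010000") == [high, low]
```

A pure UCS-2 system would treat 0xD800 and 0xDC00 as two unrelated 16-bit values; a UTF-16-aware one treats the pair as the single character U+10000.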

> Am I wrong?


Technically, yes. Consider the following two strings:

U+0065 U+0065 U+0301

U+0065 U+d800 U+dc00

Each has two answers to the question "how long is the string?": you can
count the raw number of UTF-16 code units that make up the string (3) or the
number of what a user would think of as characters (2). Both answers are
entirely and equally valid, and which one you use depends on context. Both
have value in programmatic situations, although (and this is where the rub
lies) in most circumstances the programmer needs answer #1 rather than
answer #2, even when rendering (buffer sizes and such). The main exceptions
are cursor movement (which most programmers do not need if the rendering
method does the work) and truncation (the most common reason a typical
programmer needs a method that supports answer #2).
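
The two answers can be sketched for exactly these two strings. This is Python, for illustration only, and the function names are invented. Python's len() counts code points rather than UTF-16 code units, so answer #1 is computed from the UTF-16 encoding. Answer #2 is only approximated here by skipping combining marks, which is an assumption that holds for these samples but is not a full text-element algorithm.

```python
import unicodedata


def code_unit_count(text: str) -> int:
    """Answer #1: number of UTF-16 code units (2 bytes each)."""
    return len(text.encode("utf-16-le")) // 2


def text_element_count(text: str) -> int:
    """Answer #2, approximated: count code points that are not
    combining marks. Real text-element segmentation has more rules."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


s1 = "\u0065\u0065\u0301"   # e, e, COMBINING ACUTE ACCENT
s2 = "\u0065\U00010000"     # e, U+10000 (stored as D800 DC00 in UTF-16)

for s in (s1, s2):
    print(code_unit_count(s), text_element_count(s))  # prints "3 2" twice
```

Buffer sizing would use code_unit_count; truncating for display would use something like text_element_count, so that neither a combining mark nor half of a surrogate pair gets cut off.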

With that said, I think that the tools for answer #2 (the
ParseCombiningCharacters method and the TextElement* classes) are not the
most intuitive in the world, and there is a technical/philosophical
problem when the customer's most intuitive notion of what a character is has
only non-intuitive methods of usage. There are developers, testers, and
program managers who are aware of this problem who would like to be able to
determine the best way to solve it both in documentation and in code, so
hopefully over time that problem will be able to be addressed in a
satisfactory way.




 
Mihai N.
      06-05-2004
>> Am I wrong?

> Technically, yes. Because the following two strings:

....
> only non-intuitive methods of usage. There are developers, testers, and
> program managers who are aware of this problem who would like to be able to
> determine the best way to solve it both in documentation and in code, so
> hopefully over time that problem will be able to be addressed in a
> satisfactory way.


Thank you!
Even an old dog can learn new tricks from you!


 