Re: xml, windows, utf-8, and httpclient

 
 
Chris Uppal
      12-20-2005
(E-Mail Removed) wrote:

> In MS Windows XP Notepad I created an xml file. I pasted a single
> tibetan character '\u0F40' as the text part of a certain element:
> <body>?</body>


I was amazed to find that the Ka letter in that paragraph is rendered correctly
by my newsreader !


> I saved this file using notepad's Save as... -> encoding = UTF-8.


Check whether Notepad has added a Byte Order Mark. It shouldn't (for UTF-8)
but I seem to remember that it usually does anyway.
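
A minimal sketch of that check in Java, assuming the file is small enough to read whole. The class name `BomCheck` and the default file name `test.xml` are placeholders for illustration, not names from the thread. A UTF-8 BOM is the three bytes EF BB BF at the very start of the file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BomCheck {
    /** True if the data begins with EF BB BF, the UTF-8 encoding of U+FEFF. */
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        // "test.xml" is a hypothetical file name; pass the real path as an argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "test.xml");
        byte[] data = Files.readAllBytes(file);
        System.out.println(startsWithUtf8Bom(data) ? "UTF-8 BOM present" : "no UTF-8 BOM");
    }
}
```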


> If I use a hex editor and view the document I can see that the
> character is stored as I would expect in utf-8 encoding '\u0F40' -> E0
> BD 80.
>
> Next, I use dom4j to read in and parse the file. dom4j should be using
> the xerces parser. I assume that the parser knows how to read the utf-8
> file. After all, it prepends the xml file with:
> <?xml version="1.0" encoding="UTF-8"?>


It's not clear at this point whether you mean that the file you created has
such a charset declaration ?


> Question 1:
> At this point, is the character stored in memory as '\u0f40'?


Why don't you try printing out the integer value of the character(s) ? If it
is 0x0F40 then all is well so far, if not then something has already gone wrong
(presumably the parser didn't realise that it was parsing UTF-8).
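
Something along these lines would do for the printing, as a rough sketch. `CodePointDump` and `hexCodePoints` are made-up names; in real code the string would be whatever text the parser handed back for the element. Because the output is plain ASCII hex, it does not depend on how the console encodes characters.

```java
public class CodePointDump {
    /** Formats each code point of s as U+XXXX, separated by spaces. */
    static String hexCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A correctly decoded Ka is a single code point.
        System.out.println(hexCodePoints("\u0F40"));             // U+0F40
        // A mis-decoded one shows up as several unrelated characters.
        System.out.println(hexCodePoints("\u00E0\u00BD\u20AC")); // U+00E0 U+00BD U+20AC
    }
}
```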


> Maybe not, because if I print my xml as a string and view it in hex I
> can see my utf-8 characters in there 'E0 BD 80'.


The problem with that is that you don't know how the process of printing the
string is converting characters into binary.
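
To illustrate (a sketch only, and it assumes the printing went through windows-1252, a common default on Windows XP, which the original poster did not state): a correctly decoded string written out as UTF-8 and a mis-decoded string written out as windows-1252 produce exactly the same bytes, so seeing E0 BD 80 in the output proves nothing about what is in memory.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PrintAmbiguity {
    /** Lower-case hex dump of a byte array, bytes separated by spaces. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed output encoding

        String correct = "\u0F40";               // one char: Tibetan Ka
        String misread = "\u00E0\u00BD\u20AC";   // three chars: a-grave, one-half, euro sign

        System.out.println(toHex(correct.getBytes(StandardCharsets.UTF_8))); // e0 bd 80
        System.out.println(toHex(misread.getBytes(cp1252)));                 // e0 bd 80
    }
}
```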


> Next, I want to post my xml to a webserver using jakarta commons
> httpclient. I add a header declaring the encoding as utf-8:
> content-type=text/xml; charset=UTF-8. This action has the same effect
> as taking my xml string and using the String.getBytes("UTF-8")
> function. The bytes are pushed through the utf-8 encoding algorithm
> again and are sent as 'c3 a0 c2 bd e2 82 ac'.
>
> Question 2:
> Is that how it should be done?
>
> Question 3:
> 'c3 a0 c2 bd' translates back to 'E0 BD', but I have no idea where 'e2
> 82 ac' comes from... Any ideas?


It sounds as if the Ka character's UTF-8 representation hasn't been de-UTF-8-ed
as it was read in by the parser, thus resulting in a String containing the
chars 0x00E0 0x00BD 0x0080. Which has then been encoded as UTF-8 /again/
resulting in the gibberish you see.
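
The effect can be reproduced in a few lines. This is a sketch under an assumption: the wrong charset used for the first decode is a guess, not something the thread confirms. With ISO-8859-1 the byte 0x80 becomes the char 0x0080, which re-encodes as c2 80. With windows-1252 the byte 0x80 becomes the euro sign U+20AC, which re-encodes as e2 82 ac, and that matches the trailing bytes reported in Question 3.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    /** Lower-case hex dump of a byte array, bytes separated by spaces. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes UTF-8 bytes with the wrong charset, then encodes the result as UTF-8 again. */
    static String misdecodeThenEncode(byte[] utf8, Charset wrong) {
        String misread = new String(utf8, wrong);
        return toHex(misread.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] ka = {(byte) 0xE0, (byte) 0xBD, (byte) 0x80}; // UTF-8 for U+0F40

        System.out.println(misdecodeThenEncode(ka, StandardCharsets.ISO_8859_1));
        // c3 a0 c2 bd c2 80
        System.out.println(misdecodeThenEncode(ka, Charset.forName("windows-1252")));
        // c3 a0 c2 bd e2 82 ac
    }
}
```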

I don't know much about dom4j (or Xerces, come to that), but it might be
worth posting the code you use to open the XML file. I suspect it's not
decoding the UTF-8.
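
One common way this goes wrong, shown here only as a guess at the cause since the poster's code is not in the thread: handing the parser a Reader that has already decoded the bytes with the platform default charset. A parser given a character stream ignores the encoding declaration, whereas a parser given the raw bytes honours it. The sketch uses the JDK's built-in DOM parser rather than dom4j so that it runs without extra libraries, and the class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class ParseEncodingDemo {
    // The document from the thread, as the UTF-8 bytes Notepad would write (without a BOM).
    static final byte[] XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><body>\u0F40</body>"
                    .getBytes(StandardCharsets.UTF_8);

    /** Parses the source and returns the text of the root element. */
    static String rootText(InputSource source) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(source).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    /** Gives the parser raw bytes, so it reads the encoding declaration itself. */
    static String parseFromBytes() {
        return rootText(new InputSource(new ByteArrayInputStream(XML)));
    }

    /** Gives the parser chars that were already decoded with the given charset. */
    static String parseFromReader(Charset charset) {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(XML), charset);
        return rootText(new InputSource(reader));
    }

    public static void main(String[] args) {
        System.out.println(parseFromBytes().length());                                  // 1
        System.out.println(parseFromReader(Charset.forName("windows-1252")).length());  // 3
    }
}
```

If the real code opens the file with something like a FileReader, switching to an InputStream (or passing the File itself to the parser) would be the first thing to try.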

-- chris




 
Alex Buell
      12-20-2005
On Tue, 20 Dec 2005 12:13:11 -0000 "Chris Uppal"
<(E-Mail Removed)-THIS.org> wibbled:


> > In MS Windows XP Notepad I created an xml file. I pasted a single
> > tibetan character '\u0F40' as the text part of a certain element:
> > <body>?</body>

>
> I was amazed to find that the Ka letter in that paragraph is rendered correctly
> by my newsreader !


No it isn't. It is shown as a ? in your post, but I can do this: ཀ.
Perfect.

--
http://www.munted.org.uk

Anyone that thinks an imaginary deity is going to protect them against
earthquakes and hurricanes needs psychiatric help.
 
Chris Uppal
      12-20-2005
Alex Buell wrote:

> > > In MS Windows XP Notepad I created an xml file. I pasted a single
> > > tibetan character '\u0F40' as the text part of a certain element:
> > > <body>?</body>

> >
> > I was amazed to find that the Ka letter in that paragraph is rendered
> > correctly by my newsreader !

>
> No it isn't. It is shown as a ? in your post, but I can do this: ?.
> Perfect.


Well, it was /rendered/ correctly (even in the reply composition window), it's
just that it throws the character away before actually sending the post...

-- chris


 
Alex Buell
      12-20-2005
On Tue, 20 Dec 2005 13:14:49 -0000 "Chris Uppal"
<(E-Mail Removed)-THIS.org> wibbled:

> Alex Buell wrote:
>
> > > > In MS Windows XP Notepad I created an xml file. I pasted a single
> > > > tibetan character '\u0F40' as the text part of a certain element:
> > > > <body>?</body>
> > >
> > > I was amazed to find that the Ka letter in that paragraph is rendered
> > > correctly by my newsreader !

> >
> > No it isn't. It is shown as a ? in your post, but I can do this: ?.
> > Perfect.

>
> Well, it was /rendered/ correctly (even in the reply composition window), it's
> just that it throws the character away before actually sending the post...


I actually posted it as a UTF-8-enabled message, which might be why I
can do ཀ. I strongly suggest you have a look at Sylpheed; there's a
version for Windows (http://www.sylpheed.good-day.net). The author is
Japanese and very much aware of these issues, which is why it's
excellent.


 
Chris Uppal
      12-21-2005
Alex Buell wrote:

> I actually posted it as an UTF-8 enabled message which might be why I
> can do ?. I strongly suggest you have a look at Sylpheed, there's a
> version for Windows (http://www.sylpheed.good-day.net). The author is
> Japanese and very much aware of those issues and that's why it's
> excellent.


The URL seems to be:
http://www.sylpheed.good-day.je/

Looks interesting. I'll probably try it out when the Windows version leaves
beta. (I'd rather not use Outlook Express, but -- for all its many defects --
I still haven't found anything like an acceptable replacement.)

-- chris



 
Alex Buell
      12-21-2005
On Wed, 21 Dec 2005 09:37:04 -0000 "Chris Uppal"
<(E-Mail Removed)-THIS.org> waved a wand and this message
magically appeared:

> Alex Buell wrote:
>
> > I actually posted it as an UTF-8 enabled message which might be why I
> > can do ?. I strongly suggest you have a look at Sylpheed, there's a
> > version for Windows (http://www.sylpheed.good-day.net). The author is
> > Japanese and very much aware of those issues and that's why it's
> > excellent.

>
> The URL seems to be:
> http://www.sylpheed.good-day.je/


Correction: http://sylpheed.good-day.net

> Looks interesting. I'll probably try it out when the Windows version leaves
> beta. (I'd rather not use Outlook Express, but -- for all its many defects --
> I still haven't found anything like an acceptable replacement.)


Anything but Outlook, please ;o)

 