Extended Characters in XML

 
 
barthome1@comcast.net
      03-18-2005
Hello,

My company collects data from non-US sources. We are starting projects
where this data will be output in an XML document and passed around to
our applications and third party tools.

The data includes some of the extended characters. We get strange
accent marks, italics and the like. These characters have decimal
value in the 200+ range.

So how do you handle these in XML with the assurance that you won't
lose content and the off-the-shelf XML technologies will interpret them
correctly and not simply reject the document as flawed?

We know about the special escape sequences for the reserved XML
characters like '>' and '<'. Are there standard escape sequences for
the extended characters?

Thanks ahead of time for any help.

Bart

 
Andreas Prilop
      03-18-2005
On 18 Mar 2005 (E-Mail Removed) wrote:

> The data includes some of the extended characters. We get strange
> accent marks, italics


Italics??

> and the like. These characters have decimal
> value in the 200+ range.
> So how do you handle these in XML with the assurance that you won't
> lose content and the off-the-shelf XML technologies will interpret them
> correctly and not simply reject the document as flawed?


One possibility is to write all of them in the form &#number;
where number is the decimal code position in Unicode.

> Are there standard escape sequences for the extended characters?


&#number; , which is the same as in SGML/HTML. See
http://www.unics.uni-hannover.de/nht...ilingual2.html
for examples in various scripts.
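
A minimal sketch of the &#number; approach in Python, assuming the data is already available as decoded text. The function name to_decimal_refs is made up for this illustration; the xmlcharrefreplace error handler is part of Python's standard codecs. Note that this only rewrites non-ASCII characters; "<" and "&" still need their usual escapes.

```python
def to_decimal_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference (&#number;)."""
    # Encoding to ASCII with this error handler turns e.g. U+00E9 into "&#233;".
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "caf\u00e9 na\u00efve"
print(to_decimal_refs(sample))  # prints: caf&#233; na&#239;ve
```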

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

 
Martin Honnen
      03-19-2005


(E-Mail Removed) wrote:


> My company collects data from non-US sources. We are starting projects
> where this data will be output in an XML document and passed around to
> our applications and third party tools.
>
> The data includes some of the extended characters. We get strange
> accent marks, italics and the like. These characters have decimal
> value in the 200+ range.


Every XML parser is required to support the UTF-8 encoding, so you can
encode your XML documents as UTF-8 and then use all the characters
Unicode supports directly in your document. You only need to make sure
you use an editor that can create UTF-8 encoded documents. Or you could,
as already suggested, escape characters with a numeric reference to the
Unicode code point, e.g. &#8364; for the Euro sign €.
<http://www.unicode.org/>
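
As a rough illustration of the UTF-8 route, here is a sketch using Python's standard xml.etree.ElementTree module. The element name "price" and the sample text are invented for the example; the point is only that the characters are stored directly as UTF-8 bytes and survive a round trip through a parser.

```python
import xml.etree.ElementTree as ET

# Build a small document whose text contains non-ASCII characters.
root = ET.Element("price")
root.text = "caf\u00e9 cr\u00e8me: 3 \u20ac"

# Serialize to UTF-8 bytes; the characters are written directly, not escaped.
utf8_bytes = ET.tostring(root, encoding="utf-8")

# Parsing the bytes back yields the original text without loss.
parsed = ET.fromstring(utf8_bytes)
print(parsed.text == root.text)
```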


--

Martin Honnen
http://JavaScript.FAQTs.com/
 
Jukka K. Korpela
      03-19-2005
Andreas Prilop <(E-Mail Removed)-hannover.de> wrote:

> On 18 Mar 2005 (E-Mail Removed) wrote:
>
>> The data includes some of the extended characters. We get strange
>> accent marks, italics

>
> Italics??


That sounds somewhat strange indeed, since normally the font style is
expressed at a level other than character level, e.g. in markup.
(Contrary to populistic propaganda, XML markup is not inherently
"logical"; nothing prevents you from using XML markup for purely
presentational purposes. If you need to store information in a manner
that preserves formatting information, that might be a good idea.
Using <i> for italics as in HTML would be natural then.)

But there _are_ characters in Unicode that are italicized variants of
other characters. Many of them are compatibility characters that have
been included just because they exist as characters in other standards.
There are other cases as well. If this topic is relevant, then the
document "Unicode in XML and other Markup Languages"
http://www.w3.org/TR/unicode-xml/ should be studied.
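
One way to see such an italic variant is with Python's standard unicodedata module. This is only a sketch: U+1D44E is picked here as one example of an italic-looking compatibility character, and whether folding it to a plain letter is appropriate depends entirely on the data.

```python
import unicodedata

italic_a = "\U0001D44E"  # a single code point that renders as an italic "a"
print(unicodedata.name(italic_a))

# Compatibility normalization (NFKC) maps it to the plain letter,
# which discards the italic distinction.
plain = unicodedata.normalize("NFKC", italic_a)
print(plain)
```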

>> and the like. These characters have decimal
>> value in the 200+ range.
>> So how do you handle these in XML with the assurance that you
>> won't lose content and the off-the-shelf XML technologies will
>> interpret them correctly and not simply reject the document as
>> flawed?

>
> One possibility is to write all of them in the form &#number;
> where number is the decimal code position in Unicode.


That's certainly a way to represent them in XML, and this might be useful
to protect against problems with encodings (and transcoding). However
it normally wins nothing and loses a lot in readability of the text in
XML source. (In XML it might be better to use &#xhhhh; where hhhh is
the code in hexadecimal, since character code standards and references
generally use hex.)
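
A small sketch of the hexadecimal form, again in Python. The helper name to_hex_refs is hypothetical; the parsing step uses the standard xml.etree.ElementTree module to show that decimal and hexadecimal references denote the same character.

```python
import xml.etree.ElementTree as ET


def to_hex_refs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal reference (&#xhhhh;)."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


escaped = to_hex_refs("5 \u20ac")  # "5 &#x20AC;"

# A parser resolves both notations to the same character.
from_hex = ET.fromstring(f"<p>{escaped}</p>").text
from_dec = ET.fromstring("<p>5 &#8364;</p>").text
print(from_hex == from_dec)
```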

If the data needs to be processed using old software too, then all
kinds of problems may arise. If you need to be prepared for _anything_,
then only the invariant subset of ASCII is safe, or mostly safe. But it
would be a mistake to convert data to ASCII using some simplifications
and transmogrifications, unless you _know_ there will be serious and
unsolvable problems otherwise.

Anything that you can use XML technology even in the feeblest sense
_must_ be able to accept data in UTF-8 encoding and at least store and
forward it unmodified, even if it is incapable of rendering all the
characters or recognizing them in a useful way. So the first step
should be to convert the arriving data into UTF-8 in a safe way.
Normally you should get information about the encoding of the data and
do the conversions automatically, but at early phases you might wish to
do some occasional checks to verify the sensibility of the data. It is
not uncommon for text data to arrive incorrectly labelled (as regards
its encoding), or unlabelled (so that the recipient must guess or
deduce what encoding has been used).
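
The conversion step could be sketched as follows, assuming each piece of incoming data arrives as raw bytes together with a declared encoding name. The function name to_utf8 and the sample bytes are made up for the illustration.

```python
def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Transcode raw bytes from their declared encoding to UTF-8.

    Decoding is strict, so bytes that are invalid in the declared encoding
    raise UnicodeDecodeError instead of being silently altered.
    """
    return raw.decode(declared_encoding, errors="strict").encode("utf-8")


latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # b"caf\xe9"

converted = to_utf8(latin1_bytes, "iso-8859-1")  # correctly labelled

try:
    to_utf8(latin1_bytes, "utf-8")  # incorrectly labelled
    mislabel_detected = False
except UnicodeDecodeError:
    mislabel_detected = True
```

Strict decoding only catches some labelling errors. An 8-bit encoding such as ISO-8859-1 accepts any byte sequence, so data wrongly labelled that way decodes without an error and has to be caught by checking the content.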

Quite apart from this, we cannot realistically expect that all Unicode
characters will be adequately processed and rendered. So it's very
relevant what characters there will be in the input data and how it
should be processed. For example, we can probably expect that if some
software is advertised as reading XML data and storing it into a
database and supporting some searching and retrieval, then it will
accept and store any Unicode data in UTF-8 format. But it might fail to
display the data when retrieved, its sorting routines might not work by
Unicode rules, its case-insensitive search might be something rather
trivial that really works for basic Latin letters only, and it might
even fail to display characters properly right to left according to
their directionality.
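
To illustrate the case-insensitive search point, here is a sketch comparing a deliberately naive fold that handles only A-Z (the hypothetical ascii_only_lower below) with Python's built-in str.casefold, which applies Unicode case folding.

```python
def ascii_only_lower(text: str) -> str:
    """Lowercase only A-Z, the way a trivial case-insensitive search might."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


upper, lower = "\u00c9COLE", "\u00e9cole"  # "ÉCOLE" and "école"

# The naive fold leaves the accented capital untouched, so the match fails.
naive_match = ascii_only_lower(upper) == ascii_only_lower(lower)

# Unicode case folding treats the two spellings as equal.
unicode_match = upper.casefold() == lower.casefold()
```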

--
Yucca, http://www.cs.tut.fi/~jkorpela/
 
Shmuel (Seymour J.) Metz
      03-20-2005
In <(E-Mail Removed) .com>, on
03/18/2005
at 08:17 AM, (E-Mail Removed) said:

>So how do you handle these in XML with the assurance that you won't
>lose content and the off-the-shelf XML technologies will interpret
>them correctly and not simply reject the document as flawed?


You can't really guarantee anything, but your best bet is probably to
use UTF-8, which is a transform of Unicode into 8-bit bytes. Note that
there are standard entity names for many Unicode characters.
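
What "a transform of Unicode into 8-bit bytes" means in practice can be shown with a short Python sketch. The helper name utf8_hex is invented for the example; ASCII characters stay one byte, while other characters become sequences of two or more bytes.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as a hex string."""
    return ch.encode("utf-8").hex()


for ch in ("A", "\u00e9", "\u20ac"):
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
# U+0041 -> 41
# U+00E9 -> c3a9
# U+20AC -> e282ac
```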

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to (E-Mail Removed)

 
Reply With Quote
 
Jukka K. Korpela
      03-20-2005
"Shmuel (Seymour J.) Metz" <(E-Mail Removed)>
wrote:

> You can't really guarantee anything, but your best bet is probably
> to use UTF-8, which is a transform of Unicode into 8-bit bytes.


Indeed.

> Note that there are standard entity names for many Unicode
> characters.


No, there aren't - in XML. In XML, the only predefined entity names
are &lt;, &gt;, &amp;, &quot;, and &apos;.

There are "standard entity names" in the sense that the SGML standard
contains a large number of entity declarations as samples, and some of
them have been copied to HTML. But from the XML viewpoint, there is
nothing standard about them; XML is logically independent of the SGML
standard. One might argue that if you declare entities that denote
Unicode characters, it would be advisable to use the same names as in
the SGML standard if possible. But even this is far from clear; the
SGML names are partly ridiculously and obscurely truncated (quickly,
guess what the "mnemonic" &lang; means!). Besides, you don't _need_ the
entities (except &lt; and &amp;) when you use UTF-8.
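
The difference between predefined, undeclared, and declared entities can be checked with Python's standard xml.etree.ElementTree parser. This is a sketch only; the element name "t" and the entity name "eacute" (borrowed from the SGML/HTML sets) are chosen just for the example.

```python
import xml.etree.ElementTree as ET

# The five predefined entities work without any declaration.
predefined = ET.fromstring("<t>&lt;&gt;&amp;&quot;&apos;</t>").text

# An HTML/SGML-style name is undefined in plain XML, so parsing fails.
try:
    ET.fromstring("<t>caf&eacute;</t>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Declaring the entity in the document's internal DTD subset makes it usable.
declared = ET.fromstring(
    '<!DOCTYPE t [<!ENTITY eacute "&#233;">]><t>caf&eacute;</t>'
).text
```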

--
Yucca, http://www.cs.tut.fi/~jkorpela/
 
Peter Flynn
      03-22-2005
(E-Mail Removed) wrote:

> Hello,
>
> My company collects data from non-US sources. We are starting projects
> where this data will be output in an XML document and passed around to
> our applications and third party tools.
>
> The data includes some of the extended characters. We get strange
> accent marks, italics and the like. These characters have decimal
> value in the 200+ range.


Accents are normal in many non-English languages, so they probably
aren't "strange" to the originators. As Jukka has pointed out, what
look like italics are probably variant characters which happen to
be sloping.

> So how do you handle these in XML with the assurance that you won't
> lose content and the off-the-shelf XML technologies will interpret them
> correctly and not simply reject the document as flawed?


If you use XML software which conforms to the standards then it will handle
all the characters correctly (provided you also conform to the same
standards). If you need to be able to accept pretty much any character
from any source, use the UTF-8 encoding.

> We know about the special escape sequences for the reserved XML
> characters like '>' and '<'. Are there standard escape sequences for
> the extended characters?


">" is not a reserved character, it's just a character. It only has a
special meaning when it's used to close a start-tag or end tag. The
only two reserved characters are "<" and "&". The latter is the one you
want for the named or numeric codes for non-ASCII characters, but if you
use UTF-8 then you won't need it at all except for escaping "<" and "&",
as has already been pointed out.
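
A short sketch of that escaping step, using the escape function from Python's standard xml.sax.saxutils module. The sample string and the element name "t" are invented; note that this library function also rewrites ">", which is harmless.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "R&D: 1 < 2, caf\u00e9"

# escape() rewrites "&" and "<" (and also ">"); the accented letter passes through.
safe_text = escape(raw_text)
print(safe_text)  # prints: R&amp;D: 1 &lt; 2, café

# The escaped text parses back to the original string.
round_trip = ET.fromstring(f"<t>{safe_text}</t>").text
```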

///Peter
--
sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
&;top"

 