xml entity problem

 
 
Jos van Uden
 
      08-26-2004
Can somebody explain why the following file
produces the wrong output:

<?xml version="1.0" encoding="iso-8859-1"?>
<test>
<elem>&#8216;bla bla bla&#8217;</elem>
</test>

Expected: ‘bla bla bla’
output: ?bla bla bla?

It's not caused by the browser, but
by the (expat) xml parser.

Thanks.


test script:


<?php

$file = "test.xml";
$testdata = "";  // text collected from the <elem> element
$tagname = "";   // name of the element currently being parsed

// Remember the current element name (names are upper-cased by default).
function startElement($parser, $name, $attrs) {
    global $tagname;
    $tagname = $name;
}

function endElement($parser, $name) {
}

// Append non-blank character data found inside ELEM.
function characterData($parser, $data) {
    global $testdata, $tagname;
    if (trim($data) != "") {
        switch ($tagname) {
            case 'ELEM':
                $testdata .= $data;
                break;
        }
    }
}

// No encoding argument, so the parser uses its default encodings.
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, 'characterData');
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

// Feed the file to the parser in 4 KB chunks.
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);

print "output : " . $testdata;


?>
 
Derek Harmon
 
      08-27-2004
Jos van Uden wrote:
> Can somebody explain why the following file
> produces the wrong output:
>
> <?xml version="1.0" encoding="iso-8859-1"?>
> <test>
> <elem>&#8216;bla bla bla&#8217;</elem>
> </test>
>
> Expected: ‘bla bla bla’
> output: ?bla bla bla?


The output is correct. The file has the wrong encoding.
ISO-8859-1 is a single-byte encoding with only 256 code points;
it has no code point 8,217 for that character.

Try UTF-16 encoding instead (and then, as it sounds like
you're aware, user agents can introduce '?' as well if the
text is not displayed with an appropriate Unicode code page
and font).


Derek Harmon


 
Jos van Uden
 
      08-27-2004
Derek Harmon wrote:
> Jos van Uden wrote:
>
>>Can somebody explain why the following file
>>produces the wrong output:
>>
>><?xml version="1.0" encoding="iso-8859-1"?>
>><test>
>> <elem>&#8216;bla bla bla&#8217;</elem>
>></test>
>>
>>Expected: ‘bla bla bla’
>>output: ?bla bla bla?


> The output is correct. The file has the wrong encoding.
> ISO-8859-1 is a single-byte encoding with only 256 code points;
> it has no code point 8,217 for that character.


> Try UTF-16 encoding instead


Unfortunately, the xml_parser doesn't support UTF-16.
The supported encodings are ISO-8859-1, US-ASCII
and UTF-8, so I can't try this.

>(and then, as it sounds like
> you're aware, user agents can introduce '?' as well if they
> are not displayed with an appropriate Unicode code page
> and font.)


I've tested this, of course.

Thanks for your response.

Jos
 
David Carlisle
 
      08-27-2004

> Unfortunately, the xml_parser doesn't support UTF-16.
> The supported encodings are ISO-8859-1, US-ASCII
> and UTF-8, so I can't try this.

Then it's not an XML parser, as UTF-8 and UTF-16 are both required
encodings in any conformant XML parser.

8217 is RIGHT SINGLE QUOTATION MARK (8216 is the left one). You say you
expect the left quote to be output as byte octal 221 (decimal 145), but you
haven't said what you were using to output your parsed file, so I don't know
what output encoding you have requested. There is no character with such a
byte encoding in ISO-8859-1 or UTF-8. Perhaps you want some Microsoft code
page. (The character in question in your posting is displayed as \221 in my
mail reader, which defaults to showing octal codes for unknown bytes.)
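
The byte values discussed above can be checked with a short script. This is only a sketch: it assumes PHP 7 or later (for the \u{...} escapes) with the mbstring extension loaded, neither of which is part of the original setup, and the variable names are illustrative.

```php
<?php
// U+2018 and U+2019 (decimal 8216 and 8217) as a UTF-8 string.
$quotes = "\u{2018}\u{2019}";

// ISO-8859-1 has no slot for either character, so mbstring falls back to
// its substitute character (a question mark unless reconfigured).
$latin1 = mb_convert_encoding($quotes, 'ISO-8859-1', 'UTF-8');

// windows-1252 stores them at bytes 0x91 and 0x92 (octal 221 and 222).
$cp1252 = mb_convert_encoding($quotes, 'Windows-1252', 'UTF-8');

// Print the raw bytes as hex so no terminal encoding gets in the way.
echo "ISO-8859-1:   ", bin2hex($latin1), "\n";
echo "windows-1252: ", bin2hex($cp1252), "\n";
```

Printing hex rather than the characters themselves avoids the display problem described in this post, where the reader's own charset decides what appears on screen.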

David
 
Jos van Uden
 
      08-27-2004
David Carlisle wrote:
>> Unfortunately, the xml_parser doesn't support UTF-16.
>> The supported encodings are ISO-8859-1, US-ASCII
>> and UTF-8, so I can't try this.
>
> Then it's not an XML parser as UTF8 and UTF16 are both required
> encodings in any conformant XML parser.


I see. From the Php 4 manual:

"This PHP extension implements support for James Clark's expat™ in PHP.
(...) It supports three source character encodings also provided by PHP:
US-ASCII, ISO-8859-1 and UTF-8. UTF-16 is not supported."

> 8217 is RIGHT SINGLE QUOTATION MARK (8216 is the left one). You say you
> expect the left quote to be output as byte octal 221 (decimal 145), but you
> haven't said what you were using to output your parsed file, so I don't know
> what output encoding you have requested. There is no character with such a
> byte encoding in ISO-8859-1 or UTF-8. Perhaps you want some Microsoft code
> page. (The character in question in your posting is displayed as \221 in my
> mail reader, which defaults to showing octal codes for unknown bytes.)


Also:

"(...)There are two types of character encodings, source encoding and
target encoding. PHP's internal representation of the document is always
encoded with UTF-8.

(...) The default source encoding used by PHP is ISO-8859-1.

(...) When an XML parser is created, the target encoding is set to the
same as the source encoding, (...)

If PHP encounters characters in the parsed XML document that can not be
represented in the chosen target encoding, the problem characters will
be "demoted". Currently, this means that such characters are replaced by
a question mark. "
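
The "demotion" described in that passage can be reproduced with a small script. This is a sketch under assumptions: the XML string is a hypothetical stand-in for test.xml with the quotes written as the numeric references &#8216; and &#8217;, and the target encoding is set explicitly because newer PHP versions no longer default to ISO-8859-1.

```php
<?php
// Hypothetical stand-in for test.xml; the quotes are numeric references.
$xml = '<?xml version="1.0" encoding="iso-8859-1"?>'
     . '<test><elem>&#8216;bla bla bla&#8217;</elem></test>';

$collected = '';

$parser = xml_parser_create('ISO-8859-1');
// Ask for ISO-8859-1 output explicitly, mirroring the PHP 4 default.
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'ISO-8859-1');

// The handler may be called several times for one element, so append.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$collected) {
    $collected .= $data;
});

if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}

// U+2018 and U+2019 do not exist in ISO-8859-1, so both are demoted.
echo "output : ", $collected, "\n";
```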

So it seems it's PHP 4 that's the limiting factor here. Strange thing
is: if I replace the encoded references with the original characters, it
shows up fine (using ISO-8859-1 as charset).

We're having this problem with an RSS feeder called zfeeder. We've
already contacted the author, but haven't received any response. So
I thought I'd try and fix it myself.

Thanks for your help.

Jos
 
Martin Honnen
 
      08-27-2004


Jos van Uden wrote:

> Can somebody explain why the following file
> produces the wrong output:
>
> <?xml version="1.0" encoding="iso-8859-1"?>
> <test>
> <elem>&#8216;bla bla bla&#8217;</elem>
> </test>
>
> Expected: ‘bla bla bla’
> output: ?bla bla bla?
>
> It's not caused by the browser, but
> by the (expat) xml parser.
>
> Thanks.
>
>
> test script:
>
>
> <?php
>
> $file = "test.xml";
> $testdata;
> $tagname;
>
> function startElement($parser, $name, $attrs) {
> global $tagname;
> $tagname = $name;
> }
>
> function endElement($parser, $name) {
> }
>
> function characterData(&$parser, $data) {
> global $testdata, $tagname;
> if(trim($data) != "") {
> switch($tagname) {
> case 'ELEM' :
> $testdata .= $data;
> break;
> }
> }
> }
>
> $xml_parser = xml_parser_create();
> xml_set_element_handler($xml_parser, "startElement", "endElement");
> xml_set_character_data_handler($xml_parser, 'characterData');
> if (!($fp = fopen($file, "r"))) {
> die("could not open XML input");
> }
>
> while ($data = fread($fp, 4096)) {
> if (!xml_parse($xml_parser, $data, feof($fp))) {
> die(sprintf("XML error: %s at line %d",
> xml_error_string(xml_get_error_code($xml_parser)),
> xml_get_current_line_number($xml_parser)));
> }
> }
> xml_parser_free($xml_parser);
>
> print "output : " .$testdata;


Have you tried here to use
print "output : " . utf8_decode($testdata);
?
Or try
header('Content-Type: text/plain; charset=UTF-8');
before you print out the data assembled with the XML parser.



--

Martin Honnen
http://JavaScript.FAQTs.com/
 
David Carlisle
 
      08-27-2004

> So it seems it's PHP 4 that's the limiting factor here. Strange thing
> is: if I replace the encoded references with the original characters, it
> shows up fine (using ISO-8859-1 as charset).

ISO-8859-1 doesn't have any such quotes and has nothing at all in that
position, which is why the characters in your posting don't display for
me. For example,

> James Clark's expat™ in PHP

comes out as

James Clark's expat\231 in PHP

because my news reader (emacs) is defaulting to Latin-1 (ISO-8859-1), while
your posting is in the Microsoft-specific encoding
charset=windows-1252
which is properly declared in the headers, but that doesn't help me as my
system apparently doesn't know (or at least can't display in) that encoding.

So if you ask for ISO-8859-1 output, the Unicode character for a
left quote is likely to map to some kind of missing-glyph marker, as the
specified encoding doesn't have such a character. On the other hand, if
the system just assumes that the first 256 Unicode slots map
straight to Latin-1 (which is more or less true) and doesn't raise an
error on "non-characters" in the control positions, then you may find
that bytes corresponding to Microsoft-encoded quote characters do happen
to be output, but that is more by luck and lack of error detection than
anything else.

Of the encodings you list as supported,

> It supports three source character encodings also provided by PHP:
> US-ASCII, ISO-8859-1 and UTF-8. UTF-16 is not supported."

only UTF-8 has the left and right quotes, so you would have to output as
UTF-8.

David
 
Jos van Uden
 
      08-27-2004
Martin Honnen wrote:

(...)

> Have you tried here to use
> print "output : " . utf8_decode($testdata);
> ?
> Or try
> header('Content-Type: text/plain; charset=UTF-8');
> before you print out the data assembled with the XML parser.


Ok, it works now.

1) I removed the iso-8859-1 encoding declaration from the XML file, which
makes it default to UTF-8.
2) I set the encoding of the parser to UTF-8 explicitly; otherwise
it defaults to iso-8859-1. The target encoding follows the source
encoding.
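
For anyone finding this later, the two steps might look roughly like the sketch below. It is not the exact change made to zfeeder: the XML string is a hypothetical stand-in for test.xml with the quotes assumed to be numeric references, and the Content-Type header is Martin's suggestion rather than part of the two steps.

```php
<?php
// Step 1: no encoding declaration, so the document defaults to UTF-8.
// Hypothetical stand-in for test.xml.
$xml = '<?xml version="1.0"?>'
     . '<test><elem>&#8216;bla bla bla&#8217;</elem></test>';

$collected = '';

// Step 2: request UTF-8 explicitly as the source encoding...
$parser = xml_parser_create('UTF-8');
// ...and as the target encoding, so the handler receives UTF-8 bytes.
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

xml_set_character_data_handler($parser, function ($parser, $data) use (&$collected) {
    $collected .= $data;
});

if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}

// Declare the charset to the client so the UTF-8 bytes are decoded correctly.
if (!headers_sent()) {
    header('Content-Type: text/plain; charset=UTF-8');
}
print "output : " . $collected;
```

With UTF-8 as the target encoding the parser has no reason to demote the quotes; whether they then display correctly depends on the client honouring the declared charset.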

Thanks David, Martin and Derek

What's with this iso-8859-1? Why are we still using that? I use
it because it seems to be common practice, and I figure there
must be a reason for it. But what is that reason? Backward
compatibility?

If I start using UTF-8 as charset in my meta tags will there be
undesirable side-effects? Currently I use iso-8859-1 and simply
convert special characters to entities. My xhtml 1.0 transitional
always validates (eventually). So I guess no harm is done.
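
As a rough illustration of the entity approach mentioned above, the sketch below turns every non-ASCII character of a UTF-8 string into a numeric character reference, so the result is plain ASCII and safe to embed in a page served as iso-8859-1. It assumes PHP 7 or later with the mbstring extension; the variable names are illustrative.

```php
<?php
// UTF-8 input containing the two curly quotes from this thread.
$text = "\u{2018}bla bla bla\u{2019}";

// Convert map: [first code point, last code point, offset, mask].
// Everything from U+0080 upward becomes a decimal reference.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$ascii = mb_encode_numericentity($text, $convmap, 'UTF-8');

echo $ascii, "\n";
```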

I'm not even sure if the meta tag does anything in a valid xhtml
file.

Anyway, I guess I'll have some googling ahead of me

Thanks again

 