Velocity Reviews


Todor Vachkov 08-21-2007 10:23 PM

UTF-8 problem
 
Hello all,

I'm trying to convert an exported XML file into a Perl data structure with the XML::LibXML module.
I got this error message:

>Entity: line 315442: parser error : Input is not proper UTF-8, indicate
>encoding !
>Bytes: 0xE2 0x26 0x6C 0x74


I thought the solution would be:

>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');
>my $parser = XML::LibXML->new();
>my $dom = $parser->parse_fh($fh);
>my $root = $dom->getDocumentElement;


but this produces a very long list of error messages (maybe one for each parsed character in the XML file):

>utf8 "\xE2" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.

..
..
..
>utf8 "\xE4" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.
>Segmentation fault


The segmentation fault always occurs at the same \xE4 character, but that's a secondary problem.
I just want the module to parse the XML file, which is really large (over 20MB)
and was exported from another piece of software, so I have no influence over what goes into it.

I hope you can help me! Thanks in advance!

Greetings Todor


A. Sinan Unur 08-22-2007 02:19 AM

Re: UTF-8 problem
 
Todor Vachkov <vachkov@math.tu-berlin.de> wrote in
news:5j16ulF3saghfU1@mid.dfncis.de:

> Hello all,
>
> I'm trying to convert an exported xml file into a perl data structre
> with the XML::LibXML modul. Thus I got this error message:
>
>>Entity: line 315442: parser error : Input is not proper UTF-8,
>>indicate encoding !
>>Bytes: 0xE2 0x26 0x6C 0x74

>
> I thought the solution would be:
>
>>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');


The file contents are not UTF-8. Specify the real encoding.

Sinan

PS: Avoid RxParse

--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
clpmisc guidelines: <URL:http://www.augustmail.com/~tadmc/clpmisc.shtml>


Ted Zlatanov 08-22-2007 02:11 PM

Re: UTF-8 problem
 
On Wed, 22 Aug 2007 00:23:17 +0200 Todor Vachkov <vachkov@math.tu-berlin.de> wrote:

TV> Hello all,
TV> I'm trying to convert an exported xml file into a perl data structre with the XML::LibXML modul.
TV> Thus I got this error message:

>> Entity: line 315442: parser error : Input is not proper UTF-8, indicate
>> encoding !
>> Bytes: 0xE2 0x26 0x6C 0x74


TV> I thought the solution would be:

>> open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');
>> my $parser = XML::LibXML->new();
>> my $dom = $parser->parse_fh($fh);
>> my $root = $dom->getDocumentElement;


TV> but this produce a long long list (maybe for each parsed character in the xml file) of error messages :

>> utf8 "\xE2" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.

TV> .
TV> .
TV> .
>> utf8 "\xE4" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.
>> Segmentation fault


TV> The segmentaion fail always at the same \xE4 character, but it's a secondary problem.
TV> I just want to let the modul to parse the xml file, which is really large (over 20MB)
TV> and has being exported from another software. Thus I haven't any influence what comes into it.

Can you post the first 50 lines of the file, or put a smaller
complete version of it online somewhere so we can examine it? Your post
doesn't help at all with finding the problem (we can only guess that
your input file is not valid).

Ted

Todor Vachkov 08-22-2007 03:55 PM

Re: UTF-8 problem
 
Thanks for your replies!

The XML file is really huge: it has 666,025 lines and is the result of an export from another piece of software.

It contains:
- the meta description of the software itself (I am pretty sure that it conforms to UTF-8)
- form inputs made by users, who fill the software with information about several
databases. The goal is to have a distributed search engine. (Again, I assume that the software
also saves the inputs in UTF-8.)
- Perl scripts for each database, written by various programmers. The scripts are
the interfaces between the databases and the software (the UTF-8 encoding of the scripts is not guaranteed).
All of this is contained in the huge XML file.

Parsing the file with XML::LibXML gives:

>Entity: line 315442: parser error : Input is not proper UTF-8, indicate
>encoding !
>Bytes: 0xE2 0x26 0x6C 0x74


I've figured out that these are the characters:

* U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
â (Â)

* U+0026 AMPERSAND
&

* U+006C LATIN SMALL LETTER L
l (L)

* U+0074 LATIN SMALL LETTER T
t (T)

Line 315442 looks like this:
><line>&lt;refpt id=&quot;bafn1&quot;/&gt;&lt;lk refid=&quot;afn1&quot;&gt;&lt;sup&gt;â&lt;/sup&gt;&lt;/lk&gt;</line>

^

The element <line></line> contains a single line from a Perl script, as mentioned above. The character 0xE2 is the point
where the parser stopped, at line 315442. It got quite far, almost halfway through the file.

It seems that the embedded Perl scripts are my problem. I'm wondering why the parser treats this single character
as a non-UTF-8 code point. Could I somehow tell the parser to ignore it?

Thanks for your help!

Greetings, Todor



Martijn Lievaart 08-22-2007 07:08 PM

Re: UTF-8 problem
 
On Wed, 22 Aug 2007 17:55:50 +0200, Todor Vachkov wrote:

> Parsing the file with XML::LibXML gives:
>
> >Entity: line 315442: parser error : Input is not proper UTF-8,
> >indicate encoding !
> >Bytes: 0xE2 0x26 0x6C 0x74

>
> I've figured out that this are the characters :
>
> * U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
> â (Â)


U+00E2 is a Unicode code point. In UTF-8 encoding this would be a two-byte
sequence (0xC3 0xA2), not a single 0xE2. So your input is not proper UTF-8.
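
To make the difference concrete, here is a minimal sketch using the core Encode module. The helper name hex_bytes is made up for this illustration; it only prints how the single character U+00E2 comes out as bytes under each encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# U+00E2 as a Perl character string (one character).
my $char = "\x{E2}";

# The same character needs two bytes in UTF-8 but only one in ISO-8859-1.
print 'UTF-8:      ', hex_bytes(encode('UTF-8', $char)), "\n";       # C3 A2
print 'ISO-8859-1: ', hex_bytes(encode('ISO-8859-1', $char)), "\n";  # E2
```

A lone 0xE2 in the file is therefore what ISO-8859-1 would produce, not what UTF-8 would.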

HTH,
M4

Todor Vachkov 08-22-2007 07:52 PM

Re: UTF-8 problem
 
Martijn Lievaart wrote:

>> Parsing the file with XML::LibXML gives:
>>
>> >Entity: line 315442: parser error : Input is not proper UTF-8,
>> >indicate encoding !
>> >Bytes: 0xE2 0x26 0x6C 0x74

>>
>> I've figured out that this are the characters :
>>
>> * U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
>> â (Â)

>
> U+00E2 is Unicode. In utf-8 encoding this would be a two character
> sequence. So your input is not proper utf-8.


Thanks for your posting!

The parser says:
>Bytes: 0xE2 0x26 0x6C 0x74

So 0xE2 is meant to be the problematic character.

U+00E2 was not in the error message. I just pasted the output of my check on Linux with:
user@timemashine:~$ unicode 0xe2
U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
UTF-8: c3 a2 UTF-16BE: 00e2 Decimal: &#226;
â (Â)
Uppercase: U+00C2
Category: Ll (Letter, Lowercase)
Bidi: L (Left-to-Right)
Decomposition: 0061 0302

Greetings Todor

Martijn Lievaart 08-22-2007 08:32 PM

Re: UTF-8 problem
 
On Wed, 22 Aug 2007 21:52:16 +0200, Todor Vachkov wrote:

> Martijn Lievaart wrote:
>
>>> Parsing the file with XML::LibXML gives:
>>>
>>> >Entity: line 315442: parser error : Input is not proper
>>> >UTF-8, indicate encoding !
>>> >Bytes: 0xE2 0x26 0x6C 0x74
>>>
>>> I've figured out that this are the characters :
>>>
>>> * U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
>>> â (Â)

>>
>> U+00E2 is Unicode. In utf-8 encoding this would be a two character
>> sequence. So your input is not proper utf-8.

>
> Thanks for your posting!
>
> The parser says:
> >Bytes: 0xE2 0x26 0x6C 0x74

> So 0xE2 is meant to be the problematic character.
>
> U+00E2 was not in the error message, I've just pasted the output of my
> check on linux with:
> user@timemashine:~$ unicode 0xe2
> U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
> UTF-8: c3 a2 UTF-16BE: 00e2 Decimal: &#226;
> â (Â)
> Uppercase: U+00C2
> Category: Ll (Letter, Lowercase)
> Bidi: L (Left-to-Right)
> Decomposition: 0061 0302


But a lone 0xE2 byte is exactly the problem. It is not valid UTF-8! Your
input file is most probably encoded in Latin-1 (ISO-8859-1) or Latin-9
(ISO-8859-15), not UTF-8.
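
A rough way to confirm this and to locate every offending line is to read the file as raw bytes and attempt a strict UTF-8 decode per line. The following is only a sketch with the core Encode module; the sub names is_valid_utf8 and invalid_utf8_lines are invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# True if $bytes is well-formed UTF-8 (strict decode, dies on bad input).
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Return the line numbers of a file that are not valid UTF-8.
# Splitting on "\n" is safe: 0x0A never occurs inside a multi-byte sequence.
sub invalid_utf8_lines {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my @bad;
    while (my $line = <$fh>) {
        push @bad, $. unless is_valid_utf8($line);
    }
    close $fh;
    return @bad;
}

# The bytes from the parser error: 0xE2 announces a three-byte sequence,
# but 0x26 ("&") is not a continuation byte (0x80-0xBF).
print is_valid_utf8("\xE2\x26\x6C\x74")     ? "valid\n" : "invalid\n";  # invalid
print is_valid_utf8("\xC3\xA2\x26\x6C\x74") ? "valid\n" : "invalid\n";  # valid
```

Calling invalid_utf8_lines('/foodir/export.xml') should then list line 315442 among the results, assuming the file is otherwise as described.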

M4

Peter J. Holzer 08-25-2007 09:04 PM

Re: UTF-8 problem
 
On 2007-08-21 22:23, Todor Vachkov <vachkov@math.tu-berlin.de> wrote:
> Hello all,
>
> I'm trying to convert an exported xml file into a perl data structre with the XML::LibXML modul.
> Thus I got this error message:
>
>>Entity: line 315442: parser error : Input is not proper UTF-8, indicate
>>encoding !
>>Bytes: 0xE2 0x26 0x6C 0x74

>
> I thought the solution would be:
>
>>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');


Don't do this. XML files contain an indication of their encoding, so you
should treat them as binary files

open(my $fh, '<:raw', '/foodir/export.xml') or die "Cannot open: $!";

and let the XML parser do the rest.

If that doesn't work, the encoding stored in the file is probably
wrong, either because the generating software was buggy or because
someone already incorrectly converted the file. You may have luck
fixing the encoding declaration. It should be in the first line, which
looks like this:

<?xml version="1.0" encoding="UTF-8" ?>

If the encoding is missing, UTF-8 is assumed.
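
One way to test whether the declaration matches the actual bytes, sketched with the core Encode module only. The sub names declared_encoding and declaration_is_honest are hypothetical, and the regex is a simplification: it assumes an ASCII-compatible encoding and ignores byte order marks and UTF-16.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_CROAK);

# Extract the declared encoding from the start of raw XML bytes.
# Falls back to UTF-8 when the declaration has no encoding attribute.
sub declared_encoding {
    my ($bytes) = @_;
    if ($bytes =~ /\A<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/) {
        return $1;
    }
    return 'UTF-8';
}

# True if the raw bytes really decode under the encoding they declare.
sub declaration_is_honest {
    my ($bytes) = @_;
    my $name = declared_encoding($bytes);
    return 0 unless find_encoding($name);   # unknown encoding name
    my $copy = $bytes;                      # decode may modify its input
    return eval { decode($name, $copy, FB_CROAK); 1 } ? 1 : 0;
}

# A tiny document that claims UTF-8 but contains a lone 0xE2 byte.
my $doc = qq{<?xml version="1.0" encoding="UTF-8" ?>\n<line>\xE2&lt;/sup&gt;</line>\n};
print declared_encoding($doc), "\n";
print declaration_is_honest($doc) ? "ok\n" : "declaration does not match the bytes\n";
```

For a real file, the bytes would be read through a handle opened with '<:raw' so that Perl does no decoding of its own.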

--
_ | Peter J. Holzer | I know I'd be respectful of a pirate
|_|_) | Sysadmin WSR | with an emu on his shoulder.
| | | hjp@hjp.at |
__/ | http://www.hjp.at/ | -- Sam in "Freefall"

Todor Vachkov 08-25-2007 10:59 PM

Re: UTF-8 problem
 
Peter J. Holzer wrote:

> On 2007-08-21 22:23, Todor Vachkov <vachkov@math.tu-berlin.de> wrote:
>> Hello all,
>>
>> I'm trying to convert an exported xml file into a perl data structre with
>> the XML::LibXML modul. Thus I got this error message:
>>
>>>Entity: line 315442: parser error : Input is not proper UTF-8, indicate
>>>encoding !
>>>Bytes: 0xE2 0x26 0x6C 0x74

>>
>> I thought the solution would be:
>>
>>>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');

>
> Don't do this. XML-files contain an indication of their encoding, you
> should treat them as binary files
>
> open(my $fh, "< :raw" ,'/foodir/export.xml');
>
> and let the XML parser do the rest.
>
> It that doesn't work, the encoding stored in the file is probably
> wrong, either because the generating software was buggy or because
> someone already incorrectly converted the file. You may have luck by
> fixing the encoding (it should be in the first line which looks like
> this:
>
> <?xml version="1.0" encoding="UTF-8" ?>
>
> If the encoding is missing, UTF-8 is assumed).
>

Thanks for your reply Peter!

I'm now using XML::Smart, so I don't have the UTF-8 problem anymore.
The file has the declaration
<?xml version="1.0" encoding="UTF-8" ?>
As I already mentioned, it contains source code from Perl scripts, and I
found out that some of them are ISO-8859-1 encoded. The German umlauts in particular caused trouble, as you know ;)
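
For a file like this, where most lines are UTF-8 and some are ISO-8859-1, one common workaround is to normalize it before parsing: try a strict UTF-8 decode of each line and fall back to ISO-8859-1 when that fails. The sketch below assumes that every line uses one encoding throughout; the sub names and the output path are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Decode one line of bytes: strict UTF-8 first, ISO-8859-1 as the fallback.
# ISO-8859-1 maps every byte to a character, so the fallback cannot fail.
sub decode_mixed_line {
    my ($bytes) = @_;
    my $copy = $bytes;                      # decode may modify its input
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# Re-encode a whole file as consistent UTF-8, line by line.
sub normalize_file {
    my ($in_path, $out_path) = @_;
    open(my $in,  '<:raw', $in_path)  or die "Cannot open $in_path: $!";
    open(my $out, '>:raw', $out_path) or die "Cannot open $out_path: $!";
    while (my $line = <$in>) {
        print {$out} encode('UTF-8', decode_mixed_line($line));
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
}

# Example call (the output file name is hypothetical):
# normalize_file('/foodir/export.xml', '/foodir/export.utf8.xml');
```

The heuristic has limits: a line that mixes both encodings is decoded entirely as ISO-8859-1, and ISO-8859-1 text that happens to form valid UTF-8 is taken as UTF-8, so the result is worth spot-checking.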

Greetings,
Todor

