Nokogiri SAX parser encoding problem

 
 
Michel Demazure
Guest
Posts: n/a
 
      08-24-2010
According to Nokogiri's doc, it works internally in UTF-8.
Running this :

# encoding: utf-8

require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  # Called by the parser for each chunk of character data.
  def characters(string)
    puts string.encoding
    puts string
  end
end

puts RUBY_VERSION
puts Encoding.default_external

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new, 'UTF-8')
parser.parse('<foo>épée</foo>')

gives :

1.9.2
UTF-8
UTF-8
épée

Why ?
_md
--
Posted via http://www.ruby-forum.com/.

 
Michel Demazure
Guest
Posts: n/a
 
      08-24-2010
Ryan Davis wrote:
>
> What if you redirect nokogiri's output to a file and view it in whatever
> you entered the above string in?
>
> Chances are it is your terminal, not ruby.


Yes, Ryan, you are right : writing to a utf-8 file gives the good
answer.
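
One way to take the terminal out of the picture is to look at the bytes rather than at the rendered text. A minimal sketch, assuming a UTF-8 source file; `describe` is a hypothetical helper of mine, not part of Ruby or Nokogiri:

```ruby
# encoding: utf-8

# Hypothetical helper: report what a string really contains,
# independently of how the console chooses to display it.
def describe(string)
  {
    encoding: string.encoding.name,
    valid: string.valid_encoding?,
    bytes: string.bytes.map { |b| format('%02X', b) }.join(' ')
  }
end

info = describe('épée')
puts info[:encoding]
puts info[:valid]
puts info[:bytes]   # C3 A9 70 C3 A9 65 when the string is UTF-8
```

If the bytes are right and only the display is wrong, the console code page is the likely suspect rather than the parser.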

Actually, in my project, I use the SAX parser to build complex ruby
objects, which are marshaled to a file, and then used by a Shoes app.
This app gets the wrong answer. The culprit may therefore be Marshal.
I'll shift to YAML and report.
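
Before switching, a quick round trip can show whether Marshal is really at fault. A minimal sketch, assuming Ruby 1.9 or later, where the dump records the string's encoding; the variable names are only for illustration:

```ruby
# encoding: utf-8

original = 'deuxième'

# Dump to a binary string and load it back, all in memory.
restored = Marshal.load(Marshal.dump(original))

puts restored.encoding                 # encoding is stored in the dump
puts restored == original
puts restored.bytes == original.bytes
```

If this round trip is clean, the problem is probably elsewhere, for example in how the dump file is written or read (opening it in binary mode, 'wb' and 'rb', is the usual precaution) or in how the reading app displays the text.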

_md
--
Posted via http://www.ruby-forum.com/.

 
Michel Demazure
Guest
Posts: n/a
 
      08-24-2010
Michel Demazure wrote:
> Ryan Davis wrote:
>>
>> What if you redirect nokogiri's output to a file and view it in whatever
>> you entered the above string in?
>>
>> Chances are it is your terminal, not ruby.

>
> Yes, Ryan, you are right : writing to a utf-8 file gives the good
> answer.
>


Alas, no !

This is strange : when writing to a file :
1. by luck, for the example I gave ("épée"), I get back "épée"
correctly,
2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
initial bug I discovered in my app).

This is not the first time I see the "grave accented e" giving trouble
when scanning or parsing in ruby, whatever tool is used...

_md


--
Posted via http://www.ruby-forum.com/.

 
Michel Demazure
Guest
Posts: n/a
 
      08-24-2010
Michel Demazure wrote:
> Michel Demazure wrote:


> 2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
> initial bug I discovered in my app).
>
> This is not the first time I see the "grave accented e" giving trouble
> when scanning or parsing in ruby, whatever tool is used...
>

Sorry for posting again. Actually, in this last example, 'characters' is
called twice, the first call giving "deuxi", the second one "ème".
Strange feature, still a bug (?), but one can do with...

_md


--
Posted via http://www.ruby-forum.com/.

 
Ryan Davis
Guest
Posts: n/a
 
      08-24-2010

On Aug 24, 2010, at 06:49 , Michel Demazure wrote:

> Michel Demazure wrote:
>> Michel Demazure wrote:
>
>> 2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
>> initial bug I discovered in my app).
>>
>> This is not the first time I see the "grave accented e" giving trouble
>> when scanning or parsing in ruby, whatever tool is used...
>>
>
> Sorry for posting again. Actually, in this last example, 'characters' is
> called twice, the first call giving "deuxi", the second one "ème".
> Strange feature, still a bug (?), but one can do with...


Yeah, that last part sounds like a bug. Unfortunately, Aaron Patterson
is on an airplane for the next 12ish hours as he flies to RubyKaigi.
Mike may be able to help out here... otherwise I suggest you email the
nokogiri mailing list with a minimal reproduction of the bug.



 
Bob Hutchison
Guest
Posts: n/a
 
      08-24-2010

Hi,

On 2010-08-24, at 9:49 AM, Michel Demazure wrote:

> Michel Demazure wrote:
>> Michel Demazure wrote:
>
>> 2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
>> initial bug I discovered in my app).
>>
>> This is not the first time I see the "grave accented e" giving trouble
>> when scanning or parsing in ruby, whatever tool is used...
>>
>
> Sorry for posting again. Actually, in this last example, 'characters' is
> called twice, the first call giving "deuxi", the second one "ème".
> Strange feature, still a bug (?), but one can do with...


Actually this is allowed by the XML spec, annoying as it is. Many
parsers do this when encountering an entity (e.g. &apos;) in the input
stream (you get three strings: before, entity character, after). Some
XML parsers have a parameter that tells them to join adjacent strings
together before reporting a single string. I don't know if Nokogiri
provides this functionality, but it might be worth a quick peek.
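
Failing such an option, the joining can be done in the handler itself. A minimal sketch, not Nokogiri's own API: `BufferingDoc` and `texts` are hypothetical names, and the class is left standalone so it runs without the gem. With Nokogiri it would inherit from Nokogiri::XML::SAX::Document, which uses the same callback names. The two `characters` calls below simulate the split reported in this thread.

```ruby
# encoding: utf-8

# Hypothetical handler that collects every chunk passed to `characters`
# and only reports the joined text when the element closes.
class BufferingDoc
  attr_reader :texts

  def initialize
    @texts = []
    @buffer = ''.dup
  end

  # A new element starts: begin with an empty buffer.
  # (Simplification: text in mixed content before a child element is dropped.)
  def start_element(name, attrs = [])
    @buffer = ''.dup
  end

  # May be called several times for one contiguous run of text.
  def characters(string)
    @buffer << string
  end

  # The element is complete: record the joined text.
  def end_element(name)
    @texts << @buffer
    @buffer = ''.dup
  end
end

doc = BufferingDoc.new
doc.start_element('foo')
doc.characters('deuxi')
doc.characters('ème')
doc.end_element('foo')
puts doc.texts.inspect
```

The handler no longer cares how many pieces the parser delivers, since it only looks at the text once the closing tag arrives.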

Cheers,
Bob

>
> _md
>
>
> --
> Posted via http://www.ruby-forum.com/.
>


----
Bob Hutchison
Recursive Design Inc.
http://www.recursive.ca/
weblog: http://xampl.com/so





 
Michel Demazure
Guest
Posts: n/a
 
      08-25-2010
Bob Hutchison wrote:
>
> Actually this is allowed by the XML spec, annoying as it is. Many
> parsers do this when encountering an entity (e.g. &apos;) in the input
> stream (you get three strings: before, entity character, after). Some
> XML parsers have a parameter that tells them to join adjacent strings
> together before reporting a single string. I don't know if Nokogiri
> provides this functionality, but it might be worth a quick peek.
>


@Bob : Yes, it is allowed.

From the nokogiri doc for the 'characters' method :

"This method might be called multiple times given one contiguous string
of characters."

@Ryan : strange as it is, it's a feature. So, IMHO, no bug report.

Actually, it is very strange. Parsing 'deuxième', you get two calls
'deuxi' + 'ème', but parsing the more complex 'épée deuxième', you get
only one ...

Thanks to both of you.
_md
--
Posted via http://www.ruby-forum.com/.

 