Velocity Reviews


Michel Demazure 08-24-2010 09:26 AM

Nokogiri SAX parser encoding problem
 
According to Nokogiri's doc, it works internally in UTF-8.
Running this :

# encoding: utf-8

require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  def characters(string)
    puts string.encoding
    puts string
  end
end

puts RUBY_VERSION
puts Encoding.default_external

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new, 'UTF-8')
parser.parse('<foo>épée</foo>')

gives :

1.9.2
UTF-8
UTF-8
épée

Why ?
_md
--
Posted via http://www.ruby-forum.com/.


Michel Demazure 08-24-2010 01:13 PM

Re: Nokogiri SAX parser encoding problem
 
Ryan Davis wrote:
>
> What if you redirect nokogiri's output to a file and view it in whatever
> you entered the above string in?
>
> Chances are it is your terminal, not ruby.


Yes, Ryan, you are right : writing to a utf-8 file gives the correct
answer.
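
One way to take the terminal out of the picture is to look at the raw
bytes instead of the printed glyphs. This is only a minimal sketch: the
helper name `describe` is made up for illustration, and the literal uses
\u escapes so that it is UTF-8 whatever the source file encoding is.

```ruby
# "\u00E9p\u00E9e" is "épée"; in UTF-8 each "é" is the two bytes 195, 169.
s = "\u00E9p\u00E9e"

# Hypothetical helper: summarise a string without relying on how the
# terminal renders it.
def describe(string)
  {
    encoding: string.encoding.name,   # what Ruby thinks the encoding is
    valid: string.valid_encoding?,    # are the bytes legal in that encoding?
    bytes: string.bytes.to_a          # the actual bytes held by the string
  }
end

report = describe(s)
p report
```

If the bytes are 195, 169, 112, 195, 169, 101 and the encoding is UTF-8,
the string itself is fine and any garbling comes from whatever displays it.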

Actually, in my project, I use the SAX parser to build complex ruby
objects, which are marshaled to a file, and then used by a Shoes app.
This app gets the wrong answer. The culprit may therefore be Marshal.
I'll shift to YAML and report.
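
Before blaming Marshal, a round trip through a file can be checked in
isolation. The sketch below assumes Ruby 1.9 or later and opens the file
in binary mode on both sides, since Marshal data is binary; the hash and
the temp file name are invented for the test, and nothing here shows what
actually went wrong in the Shoes app.

```ruby
require 'tempfile'

# "deuxi\u00E8me" is "deuxième", written with an escape so it is UTF-8.
original = { "word" => "deuxi\u00E8me" }

# Dump to a temporary file in binary mode.
file = Tempfile.new('marshal_check')
file.binmode
file.write(Marshal.dump(original))
file.close

# Read it back in binary mode and rebuild the object.
restored = File.open(file.path, 'rb') { |f| Marshal.load(f.read) }
file.unlink

word = restored["word"]
puts word.encoding
puts word == original["word"]
```

If the encoding and the equality check come out as expected here, the
problem is more likely in how the file is written or read elsewhere than
in Marshal itself.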

_md


Michel Demazure 08-24-2010 01:35 PM

Re: Nokogiri SAX parser encoding problem
 
Michel Demazure wrote:
> Ryan Davis wrote:
>>
>> What if you redirect nokogiri's output to a file and view it in whatever
>> you entered the above string in?
>>
>> Chances are it is your terminal, not ruby.

>
> Yes, Ryan, you are right : writing to a utf-8 file gives the correct
> answer.
>


Alas, no !

This is strange : when writing to a file :
1. by luck, for the example I gave ("épée"), I get back "épée"
correctly,
2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
initial bug I discovered in my app).

This is not the first time I have seen the "grave accented e" give
trouble when scanning or parsing in ruby, whatever tool is used...

_md




Michel Demazure 08-24-2010 01:49 PM

Re: Nokogiri SAX parser encoding problem
 
Michel Demazure wrote:
> Michel Demazure wrote:


> 2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
> initial bug I discovered in my app).
>
> This is not the first time I see the "grave accented e" giving trouble
> when scanning or parsing in ruby, whatever tool is used...
>

Sorry for posting again. Actually, in this last example, 'characters' is
called twice, the first call giving "deuxi", the second one "ème".
Strange feature, still a bug (?), but one can live with it...

_md




Ryan Davis 08-24-2010 06:35 PM

Re: Nokogiri SAX parser encoding problem
 

On Aug 24, 2010, at 06:49 , Michel Demazure wrote:

> Michel Demazure wrote:
>> Michel Demazure wrote:
>
>> 2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
>> initial bug I discovered in my app).
>>
>> This is not the first time I see the "grave accented e" giving trouble
>> when scanning or parsing in ruby, whatever tool is used...
>>
>
> Sorry for posting again. Actually, in this last example, 'characters' is
> called twice, the first call giving "deuxi", the second one "ème".
> Strange feature, still a bug (?), but one can do with...


Yeah, that last part sounds like a bug. Unfortunately, Aaron Patterson
is on an airplane for the next 12ish hours as he flies to RubyKaigi.
Mike may be able to help out here... otherwise I suggest you email the
nokogiri mailing list with a minimal reproduction of the bug.




Bob Hutchison 08-24-2010 11:54 PM

Re: Nokogiri SAX parser encoding problem
 

Hi,

On 2010-08-24, at 9:49 AM, Michel Demazure wrote:

> Michel Demazure wrote:
>> Michel Demazure wrote:
>
>> 2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
>> initial bug I discovered in my app).
>>
>> This is not the first time I see the "grave accented e" giving trouble
>> when scanning or parsing in ruby, whatever tool is used...
>>
>
> Sorry for posting again. Actually, in this last example, 'characters' is
> called twice, the first call giving "deuxi", the second one "ème".
> Strange feature, still a bug (?), but one can do with...


Actually this is allowed by the XML spec, annoying as it is. Many
parsers do this when encountering an entity (e.g. &apos;) in the input
stream (you get three strings, before, entity character, after). Some
XML parsers have a parameter that tells it to join adjacent strings
together before reporting a single string. I don't know if Nokogiri
provides this functionality, but it might be worth a quick peek.

Cheers,
Bob



----
Bob Hutchison
Recursive Design Inc.
http://www.recursive.ca/
weblog: http://xampl.com/so






Michel Demazure 08-25-2010 06:29 AM

Re: Nokogiri SAX parser encoding problem
 
Bob Hutchison wrote:
>
> Actually this is allowed by the XML spec, annoying as it is. Many
> parsers do this when encountering an entity (e.g. &apos;) in the input
> stream (you get three strings, before, entity character, after). Some
> XML parsers have a parameter that tells it to join adjacent strings
> together before reporting a single string. I don't know if Nokogiri
> provides this functionality, but it might be worth a quick peek.
>


@Bob : Yes, it is allowed.

From the nokogiri doc for the 'characters' method :

"This method might be called multiple times given one contiguous string
of characters."

@Ryan : strange as it is, it's a feature. So, IMHO, no bug report.

Actually, it is very strange. Parsing 'deuxième', you get two calls
'deuxi' + 'ème', but parsing the more complex 'épée deuxième', you get
only one ...
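
Since the number of 'characters' calls cannot be relied on, the usual
workaround is to collect the pieces and join them when the element
closes. The sketch below is only an illustration: `BufferingHandler` is
a made-up name, it is a plain class driven by hand so that it runs
without the gem, and it handles flat elements only (no nesting). In a
real parser it would inherit from Nokogiri::XML::SAX::Document, whose
callbacks have these same names.

```ruby
# Hypothetical handler: buffers text between start_element and end_element.
class BufferingHandler
  attr_reader :texts

  def initialize
    @texts  = []   # one joined string per closed element
    @chunks = nil  # pieces received for the element currently open
  end

  def start_element(name, attrs = [])
    @chunks = []
  end

  # May be called several times for one contiguous run of text.
  def characters(string)
    @chunks << string if @chunks
  end

  def end_element(name)
    @texts << @chunks.join if @chunks
    @chunks = nil
  end
end

# Simulate the split delivery seen in this thread: "deuxi" then "ème".
handler = BufferingHandler.new
handler.start_element('foo')
handler.characters("deuxi")
handler.characters("\u00E8me")
handler.end_element('foo')
p handler.texts
```

With this approach it no longer matters whether the parser delivers the
text in one call or in several.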

Thanks to both of you.
_md


