Velocity Reviews - Computer Hardware Reviews

Extended ASCII character handling

 
 
Don Norcott
Guest
Posts: n/a
 
      11-17-2010
"200 Millionen Jahre später # 17.39
\n",
"200 Millionen Jahre später # 9.87
3404211707 \n",
"A l'assaut de l'invisible 1977 # 4.91
\n",
"A l'assaut de l'invisible 1990 # 5.18
226603779 \n",

The above 4 lines are data I was attempting to load into an array to
test some code. I was getting what I thought were strange results until
I realized not all characters were being loaded into the element
resulting in column alignment problems.

The data above was cut from a file that had been manipulated a dozen
times in ruby arrays before being written to a file. So it appears the
default way ruby handles extended ASCII(?) is fine.

I have two questions:
1) Should I ever have to worry about data scraped from web pages
not being handled correctly by Ruby?

2) How do I flag this data so that I can manipulate it properly, that is,
load it into an array or write it to a file?

Tried playing with the following but even if the code below is correct
the extended ascii characters are lost by the time it gets to IRB

str = String.new
str.encode(("US-ASCII")
str = "Millionen Jahre später"

Any suggestions on where I might find some insight?

Thanks Don

--
Posted via http://www.ruby-forum.com/.

 
Robert Klemme
Guest
Posts: n/a
 
      11-17-2010
On 17.11.2010 17:01, Don Norcott wrote:
> "200 Millionen Jahre später # 17.39
> \n",
> "200 Millionen Jahre später # 9.87
> 3404211707 \n",
> "A l'assaut de l'invisible 1977 # 4.91
> \n",
> "A l'assaut de l'invisible 1990 # 5.18
> 226603779 \n",
>
> The above 4 lines are data I was attempting to load into an array to
> test some code. I was getting what I thought were strange results until
> I realized not all characters were being loaded into the element
> resulting in column alignment problems.
>
> The data above was cut from a file that had been manipulated a dozen
> times in ruby arrays before being written to a file. So it appears the
> default way ruby handles extended ASCII(?) is fine.
>
> I have two questions
> 1) Should I ever have to worry about data being scraped from web pages
> not being handled correctly by ruby.


Depends how you read the data from webpages.

> 2)How do I flag this data to allow me to manipulate it properly. That is
> load it into an array or write to a file.


You need to set encodings properly. You can do that when opening the
file. Example:

irb(main):001:0> io = File.open "x","r"
=> #<File>
irb(main):002:0> io.external_encoding
=> #<Encoding:UTF-8>
irb(main):003:0> io.internal_encoding
=> nil
irb(main):004:0> io.read.encoding
=> #<Encoding:UTF-8>
irb(main):005:0> io.close
=> nil

irb(main):006:0> io = File.open "x","r:ASCII"
=> #<File>
irb(main):007:0> io.external_encoding
=> #<Encoding:US-ASCII>
irb(main):008:0> io.internal_encoding
=> nil
irb(main):009:0> io.read.encoding
=> #<Encoding:US-ASCII>
irb(main):010:0> io.close
=> nil

See http://blog.grayproductions.net/arti...rstanding_m17n
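If the file really holds ISO-8859-1 bytes, the mode string can also name an internal encoding so that Ruby transcodes while reading. The following is only a minimal sketch: the file name "titles.txt" and its one-line content are made up for illustration, and ISO-8859-1 is an assumption about the data.

```ruby
require "tmpdir"

# Hypothetical sample data: "später\n" stored as ISO-8859-1,
# where "ä" is the single byte 0xE4 (228).
text = Dir.mktmpdir do |dir|
  path = File.join(dir, "titles.txt")
  File.binwrite(path, [115, 112, 228, 116, 101, 114, 10].pack("C*"))

  # "r:EXTERNAL:INTERNAL" reads ISO-8859-1 bytes and hands back
  # strings transcoded to UTF-8.
  File.open(path, "r:ISO-8859-1:UTF-8") { |io| io.read }
end

puts text.encoding   # the internal encoding named in the mode string
puts text.bytesize   # "ä" now occupies two bytes in UTF-8
```

With only an external encoding ("r:ISO-8859-1") the bytes are left alone and the strings are merely tagged, which is what the sessions above show.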

> Tried playing with the following but even if the code below is correct
> the extended ascii characters are lost by the time it gets to IRB
>
> str = String.new
> str.encode(("US-ASCII")
> str = "Millionen Jahre später"


This won't work - ever. String#encode returns a new string and leaves the
receiver untouched, and you then reassign str to point to yet another
instance, so all your encoding settings are lost. Also, there is no "ä" or
"ü" in ASCII, which is 7-bit!

irb(main):011:0> s="a"
=> "a"
irb(main):012:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):013:0> t = s.encode "ASCII"
=> "a"
irb(main):014:0> t.encoding
=> #<Encoding:US-ASCII>

Now with "ü":

irb(main):015:0> s="ü"
=> "ü"
irb(main):016:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):017:0> t = s.encode "ASCII"
Encoding::UndefinedConversionError: "\xC3\xBC" from UTF-8 to US-ASCII
from (irb):17:in `encode'
from (irb):17
from /usr/local/bin/irb19:12:in `<main>'
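For comparison, here is a small sketch of the two usual ways around that error: pick a target encoding that actually contains the character, or tell encode what to substitute. The "?" replacement is just an illustrative choice.

```ruby
s = "sp\u00E4ter"   # "später", tagged UTF-8

# ISO-8859-1 contains "ä", so the conversion succeeds and
# "ä" becomes the single byte 0xE4 (228).
latin1 = s.encode("ISO-8859-1")
latin1_bytes = latin1.bytes

# US-ASCII does not contain "ä"; ask encode to substitute instead of raising.
ascii_lossy = s.encode("US-ASCII", undef: :replace, replace: "?")

# Without the options, the conversion raises, as in the session above.
begin
  s.encode("US-ASCII")
  strict_failed = false
rescue Encoding::UndefinedConversionError
  strict_failed = true
end

p latin1_bytes
p ascii_lossy
p strict_failed
```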

Kind regards

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

 
Don Norcott
Guest
Posts: n/a
 
      11-17-2010
I am using Nokogiri (with Mechanize) to scrape the data, and the data I
am concerned with is extracted only from displayable fields: <table
class="result"> .... </table>

The code set/language references I see are
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
which is, I believe, what I am calling Extended ASCII (8 bit, 0 - 255)

AND

//<![CDATA[ var awsDomain = 'xxxxxxxx.xxx';
var surveyLink = "sm=93_2fjk6BaUHEqrn2qpdbknQ_3d_d"
var twoLetterISOCode = 'en'; //]]>

The scraped data has never caused a problem within the Ruby program
(it would have been very obvious). Can I safely assume that code sets will
never present a problem for this specific application as long as the
retrieval methods do not change?
=========================
=========================
That being said, when I open the file with io it reports
#<Encoding:IBM437>, which does contain the byte values giving problems
(but not their correct representation). That is to say, the IBM437
character at E4 is a different symbol (Σ, the Greek capital sigma), not the
umlauted 'a' (ä) in "später". That symbol is also what is being displayed
in the IRB console.

I have gone through most of the Shades of Gray link and the only thing that
I thought might have been of value is LC_CTYPE, but either UTF-8 or
ISO-8859-1 works identically in my situation. I have removed
LC_CTYPE since there is no problem with internal data and it might cause
a problem down the line when I have forgotten about it.

Also tried saving code & data to a file and running the file (ruby
xxx.rb) and still reports a multibyte error.

Played with the ruby command line encoding settings (ruby -E XXX) and still
received errors regardless of the code set I picked - may be related to
LC_CTYPE, as I did not reboot so it may still be in effect??

Error is
CodeSet.rb:4: invalid multibyte char (US-ASCII) which is 7 bit.

Extended ASCII code sets ISO-8859 & IBM437 are 8 bit but can not seem to
set this.
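For what it is worth, in Ruby 1.9 "invalid multibyte char (US-ASCII)" refers to the encoding of the source file itself, which is declared with a magic comment on its first line. A minimal sketch, assuming the script is saved as UTF-8 (a file saved as ISO-8859-1 would name that encoding in the comment instead):

```ruby
# encoding: utf-8
# The magic comment above has to be the first line of the file (or the
# second, after a shebang). It tells Ruby how to read the literals below.

title = "200 Millionen Jahre später"

puts __ENCODING__      # the source encoding Ruby is using for this file
puts title.encoding    # string literals inherit the source encoding
puts title.length      # counted in characters
puts title.bytesize    # one more than length, since "ä" takes two bytes
```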


=======================
=======================

I can edit the data file externally and read the data into an array
without problems.
So will assume no need to pursue the code set settings at this time.

Will not update unless I have a revelation.

By the way, the recommended link was excellent; I will save the URL as a resource.

Brian Candler
Guest
Posts: n/a
 
      11-18-2010
Don Norcott wrote in post #962171:
> I have two questions
> 1) Should I ever have to worry about data being scraped from web pages
> not being handled correctly by ruby.


In ruby 1.9, you have to worry about this very much.

Strings in ruby 1.9 are two-dimensional: they have a sequence of bytes,
and they have an encoding. There are additional 'dimensions' based on
the string's content - empty, ascii_compatible, valid_encoding.
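A rough illustration of those two dimensions, using the byte 0xE4 from the original data: force_encoding changes only the tag, while encode changes the bytes.

```ruby
# Six raw bytes, tagged as binary; 0xE4 is "ä" in ISO-8859-1.
raw = "sp\xE4ter".dup.force_encoding("BINARY")

# Same bytes, two different tags.
as_latin1 = raw.dup.force_encoding("ISO-8859-1")
as_utf8   = raw.dup.force_encoding("UTF-8")

p as_latin1.bytes == as_utf8.bytes   # identical byte sequences
p as_latin1.valid_encoding?          # 0xE4 is a legal ISO-8859-1 byte
p as_utf8.valid_encoding?            # a lone 0xE4 is not legal UTF-8

# encode actually rewrites the bytes for the target encoding.
converted = as_latin1.encode("UTF-8")
p converted.bytes
```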

If your scraper library doesn't document how it chooses the encodings to
tag each string it returns, and doesn't document how it handles invalid
encodings if it comes across them, then you have to test its behaviour
for all the various edge cases.
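One defensive approach is to normalise whatever the library hands back before using it. This is only a sketch: to_utf8 is a hypothetical helper, not part of any library, and falling back to ISO-8859-1 is an assumption that suits pages declaring that charset.

```ruby
# Hypothetical helper: return a UTF-8 copy of str.
# Strings tagged binary, or tagged with an encoding their bytes do not
# satisfy, are assumed to really be in `assumed` (ISO-8859-1 by default).
def to_utf8(str, assumed = "ISO-8859-1")
  s = str.dup
  if s.encoding == Encoding::BINARY || !s.valid_encoding?
    s.force_encoding(assumed)
  end
  # Anything still unconvertible is replaced rather than raising.
  s.encode("UTF-8", invalid: :replace, undef: :replace)
end

mis_tagged = "sp\xE4ter"       # ISO-8859-1 bytes wrongly tagged UTF-8
p to_utf8(mis_tagged).valid_encoding?
p to_utf8("sp\u00E4ter") == "sp\u00E4ter"   # already valid UTF-8, unchanged
```

The obvious weakness is that a wrong guess for the fallback encoding silently produces the wrong characters, so the guess should come from the page's declared charset where possible.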

You never have this issue with ruby 1.8, because a string is just a
string of bytes. Of course, the "garbage in, garbage out" principle
still applies; you just don't choke on the garbage.

> 2)How do I flag this data to allow me to manipulate it properly. That is
> load it into an array or write to a file.


That's a short question with a long answer, and I'm afraid my own
attempt to answer it is incomplete:
https://github.com/candlerb/string19...er/string19.rb

If you're reading stuff from a file or a socket yourself, you can
control the process. If you're trusting a third-party library to fetch
data from somewhere, then you have to trust that library to do the right
thing in the situations you're interested in.

> Tried playing with the following but even if the code below is correct
> the extended ascii characters are lost by the time it gets to IRB


irb is not a good predictor of encoding behaviour for ruby 1.9, and
you'd be better off writing standalone .rb scripts that you run.

Note that it's one of the 1.9 language inconsistencies that transcoding
is *not* done on output by default. So if you have read a string from
a file, and carefully tagged it as, say, UTF-8, but your terminal is IBM437,
then

puts my_string

will just squirt the UTF-8 bytes to the terminal and they'll display
wrongly. You can try something like this:

STDOUT.set_encoding "IBM437"
or
STDOUT.set_encoding "locale"
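
The effect of set_encoding on an output stream can be checked without a terminal by writing to a file and inspecting the bytes. A minimal sketch; the temporary file name and the choice of ISO-8859-1 are only for illustration:

```ruby
require "tmpdir"

text = "sp\u00E4ter"   # "später", tagged UTF-8 (two bytes for "ä")

written = Dir.mktmpdir do |dir|
  path = File.join(dir, "out.txt")
  File.open(path, "w") do |io|
    # Once an external encoding is set, strings are transcoded on write.
    io.set_encoding("ISO-8859-1")
    io.write(text)
  end
  File.binread(path).bytes
end

p written   # "ä" arrives as the single byte 228 (0xE4)
```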

Regards,

Brian.
