Random Access using IO#pos in code blocks

 
 
Arun Kumar
04-28-2009
Hello everyone,
I'm 20 days new to Ruby, so please forgive any mistakes. I'm working
on a project that indexes certain words in a text document, and I also
want to store the file position where each word occurs. The problem is
that IO#pos always points to the end of the file. Below is the code
I'm working on:

File.open(file_name) do |f|
  f.readlines("\r\n\r\n").each do |para|
    para.scan(/\b\w+\b/).each do |word|
      word = word.downcase.stem
      # excludes empty and frequent words
      if (!stoplist.include? word) && (!word.empty?)
        unless freq.has_key?(word)
          # freq is a hash that stores an array containing index,
          # position of word (THE PROBLEM)..
          freq[word] = [1, f.pos, file_name]
        else
          freq[word].to_a[0] += 1
          freq[word].to_a << f.pos << file_name
        end

        unless wfreq.has_key?(word)
          wfreq[word] = [1, f.pos, file_name]
        else
          wfreq[word].to_a[0] += 1
          wfreq[word].to_a << f.pos << file_name
        end
      end
    end
  end
end

File.open(file_name + ".yaml", "w") { |f| YAML.dump(freq, f) }

Also, it would be great if someone could tell me the replacement for
the deprecated 'to_a' method used above.

Any help is greatly appreciated.


---------------



--
|| श्री जानकीरघुनाथोविजयते ||

 
Robert Klemme
04-29-2009
On 28.04.2009 22:32, Arun Kumar wrote:
> Hello everyone,
> I'm 20 days new to Ruby, please forgive if I make any mistakes. I'm
> on a project where I'm indexing certain words in a text document. So
> I'm also storing the file position where the word occurs. But the
> Problem is:
> The IO#pos points to the end of the file all the while... Below is
> the code I'm working on:
>
> File.open(file_name) do |f|
>
> f.readlines("\r\n\r\n").each do |para|


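The reason is in the line above: readlines reads the whole file before
the block runs, so f.pos is already at the end of the file for every word.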
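
A minimal sketch of what readlines does to the position, using StringIO as a stand-in for the file (the sample text and variable names are made up for illustration). It contrasts readlines, which consumes everything up front, with each_line, which reads one paragraph per iteration:

```ruby
require "stringio"

text = "first para\r\n\r\nsecond para\r\n\r\n"
io = StringIO.new(text)

# readlines reads every paragraph up front, so pos is at the end afterwards
paras = io.readlines("\r\n\r\n")
pos_after_readlines = io.pos

# each_line reads one paragraph per iteration, so pos moves step by step
io.rewind
pos_per_para = []
io.each_line("\r\n\r\n") { |_para| pos_per_para << io.pos }

puts "after readlines: #{pos_after_readlines} of #{text.bytesize} bytes"
puts "with each_line:  #{pos_per_para.inspect}"
```

Even with each_line, pos is the offset just past the paragraph that was read, not the offset of a word inside it.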

> para.scan(/\b\w+\b/).each do |word|
>
> word = word.downcase.stem
> if (!stoplist.include? word) && (!word.empty?) #excludes empty
> and frequent words
>
> unless freq.has_key?(word)
> freq[word] = [1,f.pos,file_name] # freq is a hash, that
> stores an array containing index, position of word (THE PROBLEM)..
> else
> freq[word].to_a[0] += 1
> freq[word].to_a<< f.pos << file_name
> end
>
> unless wfreq.has_key?(word)
> wfreq[word] = [1,f.pos,file_name]
> else
> wfreq[word].to_a[0] += 1
> wfreq[word].to_a<< f.pos << file_name
> end
>
> end
> end
> end
>
>
> File.open(file_name+".yaml","w"){|f| YAML.dump(freq,f)}
>
> Also it would be great if someone told me the replacement for the
> deprecated 'to_a' method used above


Why do you convert an Array into an Array?
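
A small sketch of why the to_a calls can probably just be dropped, assuming the values stored in freq are always plain Arrays as in the quoted code (the word, offsets and file name below are made up):

```ruby
freq = { "ruby" => [1, 120, "a.txt"] }
entry = freq["ruby"]

# For a plain Array, to_a hands back the very same object
same_object = entry.to_a.equal?(entry)

# So the array can be updated directly, without to_a
entry[0] += 1
entry << 245 << "a.txt"

puts same_object
puts freq.inspect
```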

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
Brian Candler
04-29-2009
Arun Kumar wrote:

> unless freq.has_key?(word)
> freq[word] = [1,f.pos,file_name] # freq is a hash, that
> stores an array containing index, position of word (THE PROBLEM)..
> else
> freq[word].to_a[0] += 1
> freq[word].to_a<< f.pos << file_name
> end


BTW, you can replace all that by:

freq[word] ||= [0]
freq[word][0] += 1
freq[word] << f.pos << file_name

As for the pos, since you've already slurped in the data you'll need to
remember where you are within your buffer. Your outer loop could become
something like this:

para_pos = 0
f.readlines("\r\n\r\n").each do |para|
  ...
  para_pos += para.size   # para still ends with its "\r\n\r\n" separator
end
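
A sketch of that bookkeeping, with StringIO standing in for the file and made-up sample data. One thing to watch: the strings returned by readlines still include the separator, so the sketch adds only each paragraph's own size. It uses bytesize (an assumption on my part, to keep the result in bytes regardless of encoding):

```ruby
require "stringio"

data = "one two\r\n\r\nthree\r\n\r\nfour"
para_pos = 0
starts = []

StringIO.new(data).readlines("\r\n\r\n").each do |para|
  starts << para_pos          # offset of this paragraph in the data
  para_pos += para.bytesize   # para already includes its trailing separator
end

puts starts.inspect
```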

Unfortunately, I don't think String#scan will give you the offsets of
the matches it finds.

In ruby 1.8 you can write this:

pos = 0
while md = /\b\w+\b/.match(para[pos..-1])
  word = md[0]
  puts "Match #{word} at #{para_pos+pos+md.begin(0)}"
  pos += md.end(0)
  ...
end

In ruby 1.9 (but not 1.8.6/1.8.7), Regexp#match takes a start pos, so
you could optimise it to this:

pos = 0
while md = /\b\w+\b/.match(para, pos)
  word = md[0]
  puts "Match #{word} at #{para_pos+md.begin(0)}"
  pos = md.end(0)
  ...
end
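
A runnable sketch of the second loop, assuming Ruby 1.9 or later and a made-up paragraph. With a start position, md.begin(0) is already relative to the whole string, so nothing has to be added back:

```ruby
para = "foo bar baz"
offsets = []
pos = 0

# Regexp#match(str, pos) starts searching at pos but reports
# offsets relative to the start of str
while (md = /\b\w+\b/.match(para, pos))
  offsets << [md[0], md.begin(0)]
  pos = md.end(0)
end

puts offsets.inspect
```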

However in ruby 1.9 the offsets used will be in terms of number of
characters, not number of bytes. It would be up to you to convert this
back into byte offsets into the file, if that's what you're after.
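
One possible way to do that conversion, sketched under the assumption of a UTF-8 string: take the prefix up to the character offset and ask for its bytesize. The sample text is made up, and slicing the prefix allocates a new string per match, so this is for illustration rather than speed:

```ruby
para = "h\u00E9llo w\u00F6rld"   # "héllo wörld" as UTF-8
char_offsets = []
byte_offsets = []

para.scan(/\S+/) do
  start = Regexp.last_match.begin(0)        # offset in characters
  char_offsets << start
  byte_offsets << para[0, start].bytesize   # same position in bytes
end

puts char_offsets.inspect
puts byte_offsets.inspect
```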
--
Posted via http://www.ruby-forum.com/.

 
Robert Klemme
04-29-2009
2009/4/29 Brian Candler:
> Unfortunately, I don't think string#scan will give you offsets into the
> strings found.
>
> In ruby 1.8 you can write this:
>
>  pos = 0
>  while md = /\b\w+\b/.match(para[pos..-1])
>    word = md[0]
>    puts "Match #{word} at #{para_pos+pos+md.begin(0)}"
>    pos += md.end(0)
>    ...
>  end
>
> In ruby 1.9 (but not 1.8.6/1.8.7), Regexp.match takes a start pos, so
> you could optimise it to this:
>
>  pos = 0
>  while md = /\b\w+\b/.match(para, pos)
>    word = md[0]
>    puts "Match #{word} at #{para_pos+md.begin(0)}"
>    pos = md.end(0)
>    ...
>  end


String#scan is likely faster than manually matching portions with
#match. In both versions of Ruby you can do this to get the
/character/ offset:

irb(main):001:0> s=%{foo bar baz}
=> "foo bar baz"
irb(main):002:0> s.scan(/\w+/) { p $`.length }
0
4
8
=> "foo bar baz"

> However in ruby 1.9 the offsets used will be in terms of number of
> characters, not number of bytes. It would be up to you to convert this
> back into byte offsets into the file, if that's what you're after.


This is an important point to remember!

Kind regards

robert


 
Arun Kumar
04-29-2009
Thank you so much for the kind responses! I'm pleased to be part of such a
kind community.


 
Arun Kumar
04-29-2009

> > unless freq.has_key?(word)
> > freq[word] = [1,f.pos,file_name] # freq is a hash, that
> > stores an array containing index, position of word (THE PROBLEM)..
> > else
> > freq[word].to_a[0] += 1
> > freq[word].to_a<< f.pos << file_name
> > end

>
> BTW, you can replace all that by:
>
> freq[word] ||= [0]
> freq[word][0] += 1
> freq[word] << f.pos << file_name
>
>

I've tried doing it but since 'freq' is a hash it gives the following error:

preprocessor.rb:32:in `calc_frequency_word_list': undefined method `[]=' for 0:Fixnum (NoMethodError)
  from copy of preprocessor.rb:25:in `scan'
  from copy of preprocessor.rb:25:in `calc_frequency_word_list'
  from copy of preprocessor.rb:23:in `each'
  from copy of preprocessor.rb:23:in `calc_frequency_word_list'
  from copy of preprocessor.rb:61

 
Brian Candler
04-29-2009
Arun Kumar wrote:
>> freq[word] ||= [0]
>> freq[word][0] += 1
>> freq[word] << f.pos << file_name
>>
>>

> I've tried doing it but since 'freq' is a hash it gives the following
> error:


Show your actual code. The following code works just fine:

freq = {}
%w{foo bar baz bar}.each do |word|
  freq[word] ||= [0]
  freq[word][0] += 1
  freq[word] << "pos" << "name"
end
puts freq.inspect

The error suggests that you have initialized freq[word] to 0, not to
[0].

Or perhaps you set freq = Hash.new(0), which is wrong in this case,
because the default element needs to be [0] not 0.

An alternative is to auto-initialize each hash element like this:

freq = Hash.new { |h,k| h[k] = [0] }
%w{foo bar baz bar}.each do |word|
  freq[word][0] += 1
  freq[word] << "pos" << "name"
end
puts freq.inspect
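
For comparison, a sketch that reproduces the reported error with Hash.new(0) next to the block form (the `counts` and `error` names are only for illustration). On current Rubies the message names Integer rather than Fixnum, but the cause is the same:

```ruby
# Hash.new(0): a missing key yields the Integer 0, and
# 0[0] += 1 then needs Integer#[]=, which does not exist
counts = Hash.new(0)
error = begin
  counts["foo"][0] += 1
  nil
rescue NoMethodError => e
  e
end

# Block form: every missing key gets its own fresh [0]
freq = Hash.new { |h, k| h[k] = [0] }
%w{foo bar baz bar}.each do |word|
  freq[word][0] += 1
  freq[word] << "pos" << "name"
end

puts error.class
puts freq.inspect
```

Note that Hash.new([0]) would not be a fix either: it hands the same single array to every missing key, so all words would end up sharing one counter.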

 
Arun Kumar
04-29-2009
> Or perhaps you set freq = Hash.new(0), which is wrong in this case,
> because the default element needs to be [0] not 0.

Exactly the mistake I had made! So silly of me! Thank you SO MUCH 



 
Robert Klemme
04-29-2009
2009/4/29 Arun Kumar:
>> Or perhaps you set freq = Hash.new(0), which is wrong in this case,
>> because the default element needs to be [0] not 0.
>>
>> An alternative is to auto-initialize each hash element like this:
>>
>> freq = Hash.new { |h,k| h[k] = [0] }
>> %w{foo bar baz bar}.each do |word|
>>  freq[word][0] += 1
>>  freq[word] << "pos" << "name"
>> end
>> puts freq.inspect


This is a typical case where I would introduce a separate class or
even multiple classes, because it makes the code so much more readable.

WordPosition = Struct.new :file, :pos

WordStats = Struct.new :word, :positions do
  def count; positions.size; end
end

freq = Hash.new {|h,word| h[word.freeze] = WordStats.new(word, [])}
...
freq[word].positions << WordPosition.new(file_name, pos)
...

Then you can do

freq.sort_by {|w,stat| stat.count}
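
A self-contained sketch of the Struct approach, filling in the elided parts with assumptions: the sample text, the file name "sample.txt", and the use of Regexp.last_match for the offset are made up for illustration. The sort is negated so the most frequent word comes first:

```ruby
WordPosition = Struct.new(:file, :pos)

WordStats = Struct.new(:word, :positions) do
  # Number of occurrences recorded for this word
  def count
    positions.size
  end
end

freq = Hash.new { |h, word| h[word] = WordStats.new(word, []) }

text = "foo bar baz bar"
text.scan(/\w+/) do |word|
  freq[word].positions << WordPosition.new("sample.txt", Regexp.last_match.begin(0))
end

ranked = freq.sort_by { |_word, stat| -stat.count }
ranked.each { |word, stat| puts "#{word}: #{stat.count} at #{stat.positions.map(&:pos).inspect}" }
```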

Kind regards

robert


 
Brian Candler
04-30-2009
Robert Klemme wrote:
> String#scan is likely faster than manually matching portions with
> #match. In both versions of Ruby you can do this to get the
> /character/ offset:
>
> irb(main):001:0> s=%{foo bar baz}
> => "foo bar baz"
> irb(main):002:0> s.scan(/\w+/) { p $`.length }
> 0
> 4
> 8
> => "foo bar baz"


Well, my guess is that would be *less* efficient for large paragraphs,
since $` forces allocation of a new string containing all the text from
the start to the current point. But that reminds me, there is a global
variable containing a MatchData object: $~

So you can write:

irb(main):001:0> s=%{foo bar baz}
=> "foo bar baz"
irb(main):002:0> s.scan(/\w+/) { p $~.begin(0) }
0
4
8
=> "foo bar baz"
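
Combining this with a running paragraph offset gives a rough sketch of the index the original question was after. StringIO stands in for the file, the sample text is made up, and Regexp.last_match is the same object as $~. The offsets are character offsets, which equal byte offsets only for single-byte text like this sample:

```ruby
require "stringio"

text = "foo bar\r\n\r\nbaz foo\r\n\r\n"
index = Hash.new { |h, k| h[k] = [] }   # word => list of offsets
para_pos = 0

StringIO.new(text).each_line("\r\n\r\n") do |para|
  para.scan(/\w+/) do |word|
    # offset of the paragraph plus offset of the match within it
    index[word] << para_pos + Regexp.last_match.begin(0)
  end
  para_pos += para.size   # para includes its trailing separator
end

puts index.inspect
```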

Regards,

Brian.