Is there any way to mark an object as "always in use" (specifically,in a C extension)?

 
 
Harry Ohlsen
02-06-2004
Some background ...

I have an application where there are many identical strings (the data consists of huge chunks of XML, with a lot of duplication in both the tag names and the CDATA content).

I've written a tiny XML parser in C, because trying to load these documents using REXML ran all night and was still running the next day, presumably due to the size (hundreds of thousands of tags).

Anyway, to reduce the memory used, given the repetitive nature of a lot of the data, I decided to store the strings as a (C coded) hash table of VALUE objects.

Changes to the data are very rare; when one does happen, I just create a new Ruby string, so the values in the hash table never change.

Now, to my questions ...

I found that when I played with particularly large documents, my code fell over with what looked like some kind of memory corruption. I eventually twigged to the fact that Ruby might be garbage collecting some of the strings I'd constructed, because my C code wasn't doing any rb_gc_mark() calls. That definitely seemed to be the story, because when I wrote one that just went through the entire hashtable and marked each value, the corruption disappeared.
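
For illustration, a minimal sketch of that kind of mark callback, assuming a hypothetical chained hash table (the struct layout and names below are not the actual extension code). The callback is hooked up via Data_Wrap_Struct, which is what makes the GC call it whenever it marks the wrapping object:

#include "ruby.h"

/* Hypothetical layout: a chained hash table whose entries hold Ruby
 * Strings created with rb_str_new2(). */
typedef struct str_entry {
    VALUE str;                  /* interned Ruby String */
    struct str_entry *next;
} str_entry;

typedef struct str_table {
    str_entry **buckets;
    long nbuckets;
} str_table;

/* GC mark callback: walk every bucket and mark each stored VALUE so the
 * garbage collector never frees the interned strings while the table
 * (and the object wrapping it) is alive. */
static void
str_table_mark(void *ptr)
{
    str_table *tbl = ptr;
    long i;
    str_entry *e;

    for (i = 0; i < tbl->nbuckets; i++) {
        for (e = tbl->buckets[i]; e; e = e->next) {
            rb_gc_mark(e->str);
        }
    }
}

/* Wrapping the table with Data_Wrap_Struct is what hooks the mark
 * callback into the GC; without it, rb_gc_mark() is never called and
 * the strings can be collected out from under the C code. */
static VALUE
wrap_str_table(VALUE klass, str_table *tbl)
{
    return Data_Wrap_Struct(klass, str_table_mark, free, tbl);
}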

So, I guess my questions are: (1) is this likely to be what was really going wrong, or did adding the rb_gc_mark() calls fix the problem by pure luck and it's waiting to bite me again, further down the track; (2) is there some way I can mark all of those objects as always being in use, so that they'll never be considered for garbage collection; and more importantly (3) is there a better way to achieve this?

Thanks in advance,

Harry O.



 
nobu.nokada@softhome.net
02-06-2004
Hi,

At Fri, 6 Feb 2004 10:29:43 +0900,
Harry Ohlsen wrote in [ruby-talk:91665]:
> So, I guess my questions are:
> (1) is this likely to be what was really going wrong, or did
> adding the rb_gc_mark() calls fix the problem by pure luck
> and it's waiting to bite me again, further down the track;


Seems correct.

> (2) is there some way I can mark all of those objects as
> always being in use, so that they'll never be considered for
> garbage collection;


You may want to use rb_gc_register_address()?

> and more importantly (3) is there a better way to do achieve
> this?


Is it the single big hash in process, but not per instance,
right?

2 ways:
1. rb_gc_register_address(),

2. make the hash a hidden instance variable of any class, if
exists.
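
A minimal sketch of option 1, assuming the extension keeps the shared table in a file-static VALUE; the names string_table and Init_myparser are only illustrative:

#include "ruby.h"

/* A file-static VALUE holding the shared table of strings.  Registering
 * its address tells the GC to treat it like a global variable, so the
 * Hash and every string reachable from it is never collected. */
static VALUE string_table = Qnil;

void
Init_myparser(void)          /* illustrative init function name */
{
    string_table = rb_hash_new();
    rb_gc_register_address(&string_table);

    /* rb_gc_unregister_address(&string_table) would drop the root
     * again if the extension ever wanted the table collected. */
}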

--
Nobu Nakada


 
Harry Ohlsen
02-06-2004
nobu.nokada@softhome.net wrote:
>>(2) is there some way I can mark all of those objects as
>>always being in use, so that they'll never be considered for
>>garbage collection;

>
>
> You may want to use rb_gc_register_address()?


Thanks. I'll look that up!

> Is it the single big hash in process, but not per instance,
> right?


There's a single hash table, that's not registered as a Ruby object. However, the data stored in the table are all VALUE objects obtained from rb_str_new2().

There is a single instance of an object implemented in C that Ruby *does* know about, and that object holds many references to the strings in the hash table. For one document, for example, there were 2.7 million references to around 87,000 strings, totalling just under 267,000 bytes of text. It's the ..._mark() function of that class where I currently mark all of the strings.

While it doesn't *seem* to be taking a huge amount of time to do that each time, I'd like to try to avoid it, just to see whether that really is the case ... i.e., see whether the time actually being used is significant. In any case, if it's easy to fix, it just doesn't make sense to keep marking them over and over.

> 2. make the hash a hidden instance variable of any class, if
> exists.


That might work, but as things are currently, Ruby doesn't know anything about the hash table. It's simply an implementation detail of the extension. However, if I can't work out how to get it working with rb_gc_register_address(), I'll see if I can do something along these lines.
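
A sketch of what that second suggestion could look like, with illustrative names (TinyParser, string_table). The class object is pinned by its constant, so an instance variable on it keeps the table alive, and a name without a leading '@' is not reachable from Ruby code:

#include "ruby.h"

static VALUE cTinyParser;    /* illustrative class name */

void
Init_tinyparser(void)
{
    VALUE table = rb_hash_new();

    cTinyParser = rb_define_class("TinyParser", rb_cObject);

    /* Stash the Hash in an instance variable of the class object itself.
     * The class is referenced by its constant, so it is never collected,
     * and the instance variable keeps the Hash (and the strings in it)
     * alive.  A name without a leading '@' cannot be reached from Ruby
     * code, so the table stays an implementation detail. */
    rb_iv_set(cTinyParser, "string_table", table);
}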

Thanks for the suggestions!

Harry O.


 
nobu.nokada@softhome.net
02-06-2004
Hi,

At Fri, 6 Feb 2004 11:49:59 +0900,
Harry Ohlsen wrote in [ruby-talk:91669]:
> > Is it the single big hash in process, but not per instance,
> > right?

>
> There's a single hash table, that's not registered as a Ruby
> object. However, the data stored in the table are all VALUE
> objects obtained from rb_str_new2().


You mean struct st_table? If so, you may register a Hash
instance to GC and use its tbl member directly.
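
A sketch along those lines, using the public rb_hash_aref()/rb_hash_aset() calls rather than the tbl member; string_table is assumed to be a Hash registered as a GC root with rb_gc_register_address(), and intern_string() is an illustrative helper, not existing code:

#include "ruby.h"

static VALUE string_table;   /* Hash registered with rb_gc_register_address() */

/* Return a shared Ruby String for `text', creating it on first use.
 * Later calls with equal text return the same VALUE, so duplicated tag
 * names and CDATA are stored only once. */
static VALUE
intern_string(const char *text)
{
    VALUE key = rb_str_new2(text);
    VALUE cached = rb_hash_aref(string_table, key);

    if (NIL_P(cached)) {
        rb_hash_aset(string_table, key, key);
        return key;
    }
    return cached;
}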

> There is a single instance of an object implemented in C that
> Ruby *does* know about, and that object holds many references
> to the strings in the hash table. For one document, for
> example, there were 2.7 million references to around 87,000
> strings, totalling just under 267,000 bytes of text. It's
> the ..._mark() function of that class where I currently mark
> all of the strings.


Current ruby's GC is weak for large amounts of live objects.

> While it doesn't *seem* to be taking a huge amount of time to
> do that each time, I'd like to try to avoid it, just to see
> whether that really is the case ... ie, see whether the
> actual time being used is significant. In any case, if it's
> easy to fix, it just doesn't make sense to keep marking them
> over and over.


Generational GC may help you, but it still isn't incorporated.

--
Nobu Nakada


 
Harry Ohlsen
02-06-2004
nobu.nokada@softhome.net wrote:
>>There's a single hash table, that's not registered as a Ruby
>>object. However, the data stored in the table are all VALUE
>>objects obtained from rb_str_new2().

>
>
> You mean struct st_table?


Sorry, I should have been more specific. The hash table is just some C code I wrote to implement one. It's only the data held in it that are Ruby objects (String). It's an interesting point, though. Maybe I could save myself some code by changing the C to use a Ruby hash.

I've only learned enough about C extensions to get done what I needed. I plan to do some serious study when I get a chance. I must say, it was pretty easy to get started ... as I would expect from anything related to Ruby, of course!

> If so, you may register a Hash
> instance to GC and use its tbl member directly.


That definitely sounds simpler.

> Current ruby's GC is weak for large amounts of live objects.


I have a feeling this is why REXML had a problem loading the document, because it probably needs to create quite a few other (sub-)objects for each XML tag, hence it would *really* be working hard!

> Generational GC may help you, but it still isn't incorporated.


I've seen mention of GGC a number of times on the list. Is there a plan to add it to Ruby 1.X.Y, or will we have to wait until version 2?

Cheers,

Harry O.



 
Ralf Horstmann
02-06-2004
At Fri, 06 Feb 2004 10:29:43 +0900, Harry Ohlsen wrote:

> Some background ...
>
> I have an application where there are many identical strings
> (the data consists of huge chunks of XML, with a lot of
> duplication in both the tag names and the CDATA content).
>
> I've written a tiny XML parser in C, because trying to load these
> documents using REXML ran all night and was still running the next day,
> presumably due to the size (hundreds of thousands of tags).


Have you already tried xmlparser (wrapper around
expat)? It's quite fast. I use it for huge XML documents where
rexml and nqxml are way too slow.

Ralf.

 
Harry Ohlsen
02-06-2004
Ralf Horstmann wrote:

>Have you already tried xmlparser (wrapper around
>expat)?
>

Back when I originally wrote it, I didn't have control of the box and
hence couldn't get expat installed easily, so I didn't look any further
at the time. However, I might give it a go by installing it in my own
account. I also didn't have a lot of time to get this up and running
back then.

Since I already had some C code that did what I wanted (and nothing
more), I figured it would be faster to wrap it ... plus, in the back of
my mind, I'm sure I was thinking "what a great opportunity to learn how
to do C extensions".

Nobu's suggestion worked fine, although I've not benchmarked yet to see
whether the change has made a significant difference ... this thing
takes quite a while to run, so it's hard to tell unless you think to
look at the clock, or print some timestamps out, which is what I'll do
when I get back to work on Monday.

>It's quite fast. I use it for huge XML documents where
>rexml and nqxml are way too slow.
>
>

Just out of interest, how large was your "huge"? Some of my documents
are (literally) hundreds of megabytes.

The other point I should make is that this application has to be able to
make fairly arbitrary changes to the DOM, like moving whole subtrees
around. The changes are user-defined, so I can't even do some kind of
smart housekeeping, which means an event-driven parser won't work for me.

Cheers,

Harry O.




 
Ralf Horstmann
02-07-2004
At Sat, 07 Feb 2004 07:46:33 +0900, Harry Ohlsen wrote:

>>It's quite fast. I use it for huge XML documents where
>>rexml and nqxml are way too slow.
>>

> Just out of interest, how large was your "huge"? Some of my documents
> are (literally) hundreds of megabytes.


I just checked and found it to be about 10 megabytes. So actually not that
much data. But it was already enough to let rexml run for hours.

Regards,
Ralf.

 
Zachary P. Landau
02-07-2004

> >>It's quite fast. I use it for huge XML documents where
> >>rexml and nqxml are way too slow.
> >>

> > Just out of interest, how large was your "huge"? Some of my documents
> > are (literally) hundreds of megabytes.
>
> I just checked and found it to be about 10 megabytes. So actually not that
> much data. But it was already enough to let rexml run for hours.


There seems to be a problem/bug/whatever with the current version of
REXML that makes large files take extra long to process. It reads the
entire file in before it starts processing, which kills performance. Try
adding this code to your program:


module REXML
  class IOSource
    alias_method :_initialize, :initialize

    def initialize(arg, block_size=500)
      @er_source = @source = arg
      @to_utf = false
      @line_break = '>'
      super @source.readline(@line_break)
      @line_break = encode( '>' )
    end
  end
end

That seems to fix the problem for other people.

--
Zachary P. Landau
GPG: gpg --recv-key 0x24E5AD99 | http://kapheine.hypa.net/kapheine.asc



 
Harry Ohlsen
02-08-2004
Zachary P. Landau wrote:

>There seems to be a problem/bug/whatever with the current version of
>REXML that makes large files take extra long to process.
>

Is this a new problem introduced in a recent version? If so, it's
probably not the cause of the slowness I was seeing, because I tried it
about five or six months ago.

However, it's definitely worth knowing about that patch for the next
time I want to do some XML processing, because REXML is just so nice to
use that it would normally be my first choice!

Cheers,

Harry O.




 