Velocity Reviews - Computer Hardware Reviews

dynamic allocation file buffer

 
 
Aaron "Castironpi" Brady
Guest
Posts: n/a
 
      09-12-2008
On Sep 12, 1:30 am, Steven D'Aprano
<(E-Mail Removed)> wrote:
> On Thu, 11 Sep 2008 22:40:01 -0700, Dennis Lee Bieber wrote:
> > On 12 Sep 2008 03:37:51 GMT, Steven D'Aprano
> > <(E-Mail Removed)> declaimed the following in
> > comp.lang.python:

>
> >> I'm pretty sure you're wrong. XML can be used for serialization, but
> >> that doesn't mean it is only sequential data. XML is suitable for
> >> hierarchical data too. To quote Wikipedia:

>
> > There is a difference between the format of the data content, and
> > the processing of that data... Regardless of the content, one
> > essentially has to process the XML /file/ sequentially, and translate
> > into an in-memory model that allows for accessing said data. To reach
> > the nth subelement of the mth element requires reading all 1..m-1
> > elements, followed by all 1..n-1 subelements in m. Modifying any element
> > requires rewriting the entire file.

>
> Which is why I previously said that XML was not well suited for random
> access.
>
> I think we're starting to be sucked into a vortex of obtuse and opaque
> communication. We agree that XML can store hierarchical data, and that it
> has to be read and written sequentially, and that whatever the merits of
> castironpi's software, his original use-case of random access to a 4GB
> XML file isn't workable. Yes?
>
> --
> Steven


By 'isn't workable' do you mean, "no one ever uses 4GB of XML", or "no
one ever uses 4GB of hierarchical data, period"?
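
To make the sequential-access point quoted above concrete, here is a minimal sketch using the standard library's `xml.etree.ElementTree.iterparse`. The helper name `nth_sub_of_mth` and the sample document are made up for illustration; the point is only that reaching the n-th subelement of the m-th element means parsing elements 1..m-1 first.

```python
import io
import xml.etree.ElementTree as ET

def nth_sub_of_mth(xml_stream, m, n):
    """Return the text of the n-th child of the m-th top-level element
    (both 1-based), plus how many top-level elements had to be parsed."""
    depth = 0
    seen = 0
    for event, elem in ET.iterparse(xml_stream, events=("start", "end")):
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 1:      # a direct child of the root just closed
                seen += 1
                if seen == m:
                    return elem[n - 1].text, seen
                elem.clear()    # drop content we have already passed
    raise IndexError("element not found")

# Hypothetical sample data: five <item> elements with two children each.
doc = b"<root>" + b"".join(
    b"<item><a>%d</a><b>%d</b></item>" % (i, i * 10) for i in range(1, 6)
) + b"</root>"

text, parsed = nth_sub_of_mth(io.BytesIO(doc), m=4, n=2)
print(text, parsed)
```

There is no way to seek straight to item 4 here: the parser has to consume items 1 to 3 to find out where item 4 starts.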
 
 
 
 
 
Paul Boddie
Guest
Posts: n/a
 
      09-12-2008
On 12 Sep, 08:30, Steven D'Aprano
<(E-Mail Removed)> wrote:
>
> Which is why I previously said that XML was not well suited for random
> access.


Maybe not. A consideration of other storage formats such as HDF5 might
be appropriate:

http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html

There are, of course, HDF5 tools available for Python.

> I think we're starting to be sucked into a vortex of obtuse and opaque
> communication.


I don't know about that. I'm managing to keep up with the discussion.

> We agree that XML can store hierarchical data, and that it
> has to be read and written sequentially, and that whatever the merits of
> castironpi's software, his original use-case of random access to a 4GB
> XML file isn't workable. Yes?


Again, XML specifically might not be workable for random access in a
serialised form, despite people's best efforts at processing it in
various unconventional ways, but that doesn't mean that random access
to a 4GB file containing hierarchical data isn't possible, so I
suppose it depends on whether he is wedded to the idea of using
vanilla XML or not. It's always worth exploring the available
alternatives before embarking on a challenging project, unless one
wants to pursue the exercise as a learning experience, and I therefore
suggest investigating whether HDF5 doesn't already solve at least some
of the problems or use-cases stated in this discussion.

Paul
 
 
 
 
 
Aaron "Castironpi" Brady
Guest
Posts: n/a
 
      09-12-2008
On Sep 12, 4:34 am, Paul Boddie <(E-Mail Removed)> wrote:
> On 12 Sep, 08:30, Steven D'Aprano
>
> <(E-Mail Removed)> wrote:
>
> > Which is why I previously said that XML was not well suited for random
> > access.

>
> Maybe not.


No, it's not. Element trees are, which is what I should have said
originally...

> A consideration of other storage formats such as HDF5 might
> be appropriate:
>
> http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html
>
> There are, of course, HDF5 tools available for Python.


PyTables came up within the past few weeks on the list.

"When the file is created, the metadata in the object tree is updated
in memory while the actual data is saved to disk. When you close the
file the object tree is no longer available. However, when you reopen
this file the object tree will be reconstructed in memory from the
metadata on disk...."

This is different from what I had in mind, but the extremity depends
on how slow the 'reconstructed in memory' step is. (From
http://www.pytables.org/docs/manual/ch01.html#id2506782 ). The
counterexample would be needing random access into multiple data
files, which don't all fit in memory at once, but the maturity of the
package might outweigh that. Reconstruction will form a bottleneck
anyway.

> > I think we're starting to be sucked into a vortex of obtuse and opaque
> > communication.

>
> I don't know about that. I'm managing to keep up with the discussion.
>
> > We agree that XML can store hierarchical data, and that it
> > has to be read and written sequentially, and that whatever the merits of
> > castironpi's software, his original use-case of random access to a 4GB
> > XML file isn't workable. Yes?


I could renege that bid and talk about a 4MB file, where recopying is
prohibitively expensive and so random access is needed, thereby
requiring an alternative to XML.
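
As a rough illustration of what "random access instead of recopying" can look like, here is a sketch using the standard `mmap` and `struct` modules. The fixed-size record layout (`RECORD`) and the file name are assumptions made up for the example, not anything from the project; the point is that one record is overwritten in place and the rest of the file is never rewritten.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: little-endian int32 id + 16-byte name.
RECORD = struct.Struct("<i16s")

def write_record(mm, index, rec_id, name):
    # Overwrite exactly one record at its computed offset.
    RECORD.pack_into(mm, index * RECORD.size, rec_id, name)

def read_record(mm, index):
    rec_id, name = RECORD.unpack_from(mm, index * RECORD.size)
    return rec_id, name.rstrip(b"\0")

path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    f.write(b"\0" * RECORD.size * 1000)    # room for 1000 empty records

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        write_record(mm, 742, 7, b"seven")  # touches 20 bytes only
        mm.flush()

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        result = read_record(mm, 742)

print(result, os.path.getsize(path))
```

This only works because every record has the same size. Variable-size nodes are where an alloc/free layer on top of the buffer comes in.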

> Again, XML specifically might not be workable for random access in a
> serialised form, despite people's best efforts at processing it in
> various unconventional ways, but that doesn't mean that random access
> to a 4GB file containing hierarchical data isn't possible, so I
> suppose it depends on whether he is wedded to the idea of using
> vanilla XML or not.


No. It is always nice to be able to scroll through your data, but
it's much less common to be able to scroll through a data -structure-.
(Which is part of the reason data structures are hard to design.)

> It's always worth exploring the available
> alternatives before embarking on a challenging project, unless one
> wants to pursue the exercise as a learning experience, and I therefore
> suggest investigating whether HDF5 doesn't already solve at least some
> of the problems or use-cases stated in this discussion.


The potential for concurrency is definitely one benefit of raw alloc/
free management, and a requirement I was setting out to program
directly for. There is a multi-threaded version of HDF5 but
interprocess communication is unsupported.

"This version serializes the API suitable for use in a multi-threaded
application but does not provide any level of concurrency."

From: http://www.hdfgroup.uiuc.edu/papers/features/mthdf/

(It is always appreciated to find a statement of what a product does
not do.)
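
For readers wondering what "serializes the API" amounts to in practice, here is a toy sketch in Python. It is not how HDF5 is implemented; `SerializedStore` is a hypothetical class that just wraps every call in one global lock. Calls from many threads are safe, but only one is ever inside the API at a time, so there is no concurrency.

```python
import threading

class SerializedStore:
    """Hypothetical store whose every public call holds one global lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}
        self._inside = 0       # callers currently inside the API
        self.max_inside = 0    # highest value _inside ever reached

    def add(self, key, delta):
        with self._lock:
            self._inside += 1
            self.max_inside = max(self.max_inside, self._inside)
            self._data[key] = self._data.get(key, 0) + delta
            self._inside -= 1

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)

def worker(store, n):
    for _ in range(n):
        store.add("count", 1)

store = SerializedStore()
threads = [threading.Thread(target=worker, args=(store, 1000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = store.get("count")
print(total, store.max_inside)
```

A lock like this lives inside one process, so it says nothing about two separate processes opening the same file.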

> Paul


There is an updated statement of the problem on the project website:

http://code.google.com/p/pymmapstruc...mmapstruct.txt

I don't have numbers for my claim that the abstraction layers in SQL,
including string construction and parsing, are ever a bottleneck or
limiting factor, despite that it's sort of intuitive. Until I get
those, maybe I should leave that allegation out.

Compared to the complexity of all these other packages (ZOPE,
memcached, HDF5/PyTables), alloc and free are almost looking like they
should become methods on a subclass of the builtin buffer type. Ha!
(Ducks.) They're beyond dangerous compared to the snuggly feeling of
Python though, so maybe they could belong in ctypes.
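
For concreteness, a toy sketch of what alloc and free over a buffer could look like. This is not the pymmapstruct code; `BufferAllocator` is a made-up name, it uses a simple first-fit strategy, and it keeps its bookkeeping in ordinary Python objects rather than inside the buffer, which a real file-backed version would have to do.

```python
class BufferAllocator:
    """Toy first-fit allocator over a bytearray (could equally be an mmap)."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.free_list = [(0, size)]   # sorted (offset, length) free blocks
        self.used = {}                 # offset -> length of live allocations

    def alloc(self, length):
        for i, (off, size) in enumerate(self.free_list):
            if size >= length:
                if size == length:
                    del self.free_list[i]
                else:
                    # Carve the request off the front of the free block.
                    self.free_list[i] = (off + length, size - length)
                self.used[off] = length
                return off
        raise MemoryError("no free block large enough")

    def free(self, off):
        length = self.used.pop(off)    # KeyError on double free / bad offset
        self.free_list.append((off, length))
        self.free_list.sort()
        # Coalesce neighbouring free blocks so large requests can succeed.
        merged = [self.free_list[0]]
        for o, n in self.free_list[1:]:
            po, pn = merged[-1]
            if po + pn == o:
                merged[-1] = (po, pn + n)
            else:
                merged.append((o, n))
        self.free_list = merged

heap = BufferAllocator(64)
a = heap.alloc(16)
b = heap.alloc(16)
c = heap.alloc(32)
heap.buf[b:b + 5] = b"hello"   # callers write through the returned offset
heap.free(a)
heap.free(b)
d = heap.alloc(32)             # fits only because a and b were coalesced
print(a, b, c, d, heap.free_list)
```

Nothing stops a caller from writing past the end of its block or using an offset after freeing it, which is the "beyond dangerous" part.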

Aaron
 
 
Francesc
Guest
Posts: n/a
 
      09-15-2008
On 12 Sep, 14:39, "Aaron \"Castironpi\" Brady" <(E-Mail Removed)>
wrote:
> > A consideration of other storage formats such as HDF5 might
> > be appropriate:

>
> >http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html

>
> > There are, of course, HDF5 tools available for Python.

>
> PyTables came up within the past few weeks on the list.
>
> "When the file is created, the metadata in the object tree is updated
> in memory while the actual data is saved to disk. When you close the
> file the object tree is no longer available. However, when you reopen
> this file the object tree will be reconstructed in memory from the
> metadata on disk...."
>
> This is different from what I had in mind, but the extremity depends
> on how slow the 'reconstructed in memory' step is. (From http://www.pytables.org/docs/manual/ch01.html#id2506782). The
> counterexample would be needing random access into multiple data
> files, which don't all fit in memory at once, but the maturity of the
> package might outweigh that. Reconstruction will form a bottleneck
> anyway.


Hmm, this was a part of the documentation that needed to be updated.
Now, the object tree is reconstructed in a lazy way (i.e. on-demand),
in order to avoid the bottleneck that you mentioned. I have corrected
the docs in:

http://www.pytables.org/trac/changeset/3714/trunk
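
To illustrate the general idea of lazy (on-demand) reconstruction, here is a small sketch. It is not PyTables code; `LazyNode` and the nested dict standing in for on-disk metadata are assumptions made up for the example. Each node builds its children only the first time they are asked for, so opening the file costs nothing and a lookup only touches the nodes along its path.

```python
class LazyNode:
    """Tree node whose children are built only when first accessed."""

    loads = 0  # how many nodes have had their children materialised

    def __init__(self, name, metadata):
        self.name = name
        self._metadata = metadata      # stand-in for metadata kept on disk
        self._children = None          # nothing loaded yet

    @property
    def children(self):
        if self._children is None:     # first access: build one level only
            LazyNode.loads += 1
            self._children = {
                key: LazyNode(key, sub)
                for key, sub in self._metadata.items()
            }
        return self._children

    def __getitem__(self, path):
        node = self
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node

# Hypothetical hierarchy: three groups, some with leaf tables.
on_disk = {
    "run1": {"temps": {}, "pressures": {}},
    "run2": {"temps": {}},
    "run3": {},
}
root = LazyNode("/", on_disk)
node = root["/run1/temps"]
print(node.name, LazyNode.loads)
```

Only the root and `run1` were expanded by that lookup; `run2` and `run3` stay untouched until someone asks for them.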

Thanks for (indirectly) bringing this to my attention,

Francesc
 
 
Aaron "Castironpi" Brady
Guest
Posts: n/a
 
      09-15-2008
On Sep 15, 4:34 am, Francesc <(E-Mail Removed)> wrote:
> On 12 Sep, 14:39, "Aaron \"Castironpi\" Brady" <(E-Mail Removed)>
> wrote:
>
>
>
> > > A consideration of other storage formats such as HDF5 might
> > > be appropriate:

>
> > >http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html

>
> > > There are, of course, HDF5 tools available for Python.

>
> > PyTables came up within the past few weeks on the list.

>
> > "When the file is created, the metadata in the object tree is updated
> > in memory while the actual data is saved to disk. When you close the
> > file the object tree is no longer available. However, when you reopen
> > this file the object tree will be reconstructed in memory from the
> > metadata on disk...."

>
> > This is different from what I had in mind, but the extremity depends
> > on how slow the 'reconstructed in memory' step is. (From http://www.pytables.org/docs/manual/ch01.html#id2506782). The
> > counterexample would be needing random access into multiple data
> > files, which don't all fit in memory at once, but the maturity of the
> > package might outweigh that. Reconstruction will form a bottleneck
> > anyway.

>
> Hmm, this was a part of the documentation that needed to be updated.
> Now, the object tree is reconstructed in a lazy way (i.e. on-demand),
> in order to avoid the bottleneck that you mentioned. I have corrected
> the docs in:
>
> http://www.pytables.org/trac/changeset/3714/trunk
>
> Thanks for (indirectly) bringing this to my attention,
>
> Francesc


Depending on how lazy the reconstruction is, would it be possible to
modify separate tables from separate processes concurrently?
 
 
Francesc
Guest
Posts: n/a
 
      09-16-2008
On 15 Sep, 22:09, "Aaron \"Castironpi\" Brady" <(E-Mail Removed)>
wrote:
> On Sep 15, 4:34 am, Francesc <(E-Mail Removed)> wrote:
>
>
>
> > On 12 Sep, 14:39, "Aaron \"Castironpi\" Brady" <(E-Mail Removed)>
> > wrote:

>
> > > > A consideration of other storage formats such as HDF5 might
> > > > be appropriate:

>
> > > >http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html

>
> > > > There are, of course, HDF5 tools available for Python.

>
> > > PyTables came up within the past few weeks on the list.

>
> > > "When the file is created, the metadata in the object tree is updated
> > > in memory while the actual data is saved to disk. When you close the
> > > file the object tree is no longer available. However, when you reopen
> > > this file the object tree will be reconstructed in memory from the
> > > metadata on disk...."

>
> > > This is different from what I had in mind, but the extremity depends
> > > on how slow the 'reconstructed in memory' step is. (From http://www.pytables.org/docs/manual/ch01.html#id2506782). The
> > > counterexample would be needing random access into multiple data
> > > files, which don't all fit in memory at once, but the maturity of the
> > > package might outweigh that. Reconstruction will form a bottleneck
> > > anyway.

>
> > Hmm, this was a part of the documentation that needed to be updated.
> > Now, the object tree is reconstructed in a lazy way (i.e. on-demand),
> > in order to avoid the bottleneck that you mentioned. I have corrected
> > the docs in:

>
> >http://www.pytables.org/trac/changeset/3714/trunk

>
> > Thanks for (indirectly) bringing this to my attention,

>
> > Francesc

>
> Depending on how lazy the reconstruction is, would it be possible to
> modify separate tables from separate processes concurrently?


No, modification of different tables in the same file simultaneously
is not supported yet. This is a limitation of the HDF5 library
itself. The HDF Group said that they have plans to address this, but
this is probably a long-term task.

Francesc
 
 
 
 
 