Re: regex over files

 
 
Robin Becker
 
      04-25-2005
Gerald Klix wrote:
> Map the file into RAM by using the mmap module.
> The file's contents are then available as a searchable string.
>


That's a good idea, but I wonder if it actually saves on memory? I just tried
regexing through a 25MB file and ended up with 40MB as working set (it rose
linearly as the loop progressed through the file). Am I actually saving anything
by not letting the normal VM do its thing?
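
For readers following along, here is a minimal sketch of the mmap suggestion in present-day Python 3 (the thread itself predates it). The `re` module accepts bytes-like objects, so a bytes pattern can be run directly over an `mmap` object. The helper name `match_spans` and the blank-line pattern in the usage note are hypothetical, chosen only for illustration.

```python
import mmap
import re


def match_spans(path, pattern):
    """Return (start, end) byte offsets of every regex match in a file."""
    # pattern must be a bytes pattern, because the mapped file is bytes.
    regex = re.compile(pattern)
    with open(path, "rb") as f:
        # Length 0 maps the whole file. An empty file cannot be mapped.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only offsets are collected, so no copy of the file is made.
            return [m.span() for m in regex.finditer(mm)]
```

Calling `match_spans("big.log", rb"\n\n+")` would list where each run of blank lines sits, and the pieces between those offsets can then be sliced out of a fresh mapping one at a time. Whether this lowers the working set depends on how much spare RAM the OS has, as discussed in the replies.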

> HTH,
> Gerald
>
> Robin Becker schrieb:
>
>> Is there any way to get regexes to work on non-string/unicode objects.
>> I would like to split large files by regex and it seems relatively
>> hard to do so without having the whole file in memory. Even with
>> buffers it seems hard to get regexes to indicate that they failed
>> because of buffer termination and getting a partial match to be
>> resumable seems out of the question.
>>
>> What interface does re actually need for its src objects?

>
>



--
Robin Becker

 
 
Richard Brodie
 
      04-26-2005

"Robin Becker" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> Gerald Klix wrote:
> > Map the file into RAM by using the mmap module.
> > The file's contents are then available as a searchable string.
> >

>
> That's a good idea, but I wonder if it actually saves on memory? I just tried
> regexing through a 25MB file and ended up with 40MB as working set (it rose
> linearly as the loop progressed through the file). Am I actually saving anything
> by not letting the normal VM do its thing?


You aren't saving memory in that sense, no. If you have any RAM spare, the
file will end up in it. However, if you are short on memory, mmapping the
file gives the VM the opportunity to discard pages from the file instead of paging
them out. Try again with a 25GB file and watch the difference. YMMV.


 
 
Robin Becker
 
      04-26-2005
Richard Brodie wrote:
> "Robin Becker" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
>
>>Gerald Klix wrote:
>>
>>>Map the file into RAM by using the mmap module.
>>>The file's contents are then available as a searchable string.
>>>

>>
>>That's a good idea, but I wonder if it actually saves on memory? I just tried
>>regexing through a 25MB file and ended up with 40MB as working set (it rose
>>linearly as the loop progressed through the file). Am I actually saving anything
>>by not letting the normal VM do its thing?

>
>
> You aren't saving memory in that sense, no. If you have any RAM spare, the
> file will end up in it. However, if you are short on memory, mmapping the
> file gives the VM the opportunity to discard pages from the file instead of paging
> them out. Try again with a 25GB file and watch the difference. YMMV.
>
>




So we avoid dirty page writes etc etc. However, I still think I could
get away with a small window into the file which would be more efficient.
--
Robin Becker
 
 
Steve Holden
 
      04-26-2005
Robin Becker wrote:
> Richard Brodie wrote:
>
>> "Robin Becker" <(E-Mail Removed)> wrote in message
>> news:(E-Mail Removed)...
>>
>>> Gerald Klix wrote:
>>>
>>>> Map the file into RAM by using the mmap module.
>>>> The file's contents are then available as a searchable string.
>>>>
>>>
>>> That's a good idea, but I wonder if it actually saves on memory? I just
>>> tried regexing through a 25MB file and ended up with 40MB as working set
>>> (it rose linearly as the loop progressed through the file). Am I actually
>>> saving anything by not letting the normal VM do its thing?

>>
>>
>>
>> You aren't saving memory in that sense, no. If you have any RAM spare, the
>> file will end up in it. However, if you are short on memory, mmapping the
>> file gives the VM the opportunity to discard pages from the file instead of
>> paging them out. Try again with a 25GB file and watch the difference. YMMV.
>>
>>

>
>
>
> So we avoid dirty page writes etc etc. However, I still think I could
> get away with a small window into the file which would be more efficient.


I seem to remember that the Medusa code contains a fairly good
overlapped search for a terminator string, if you want to chunk the file.

Take a look at the handle_read() method of class async_chat in the
standard library's asynchat.py.
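
This is not the asynchat code itself, only a rough sketch of the same idea in Python 3: when a chunk ends with what might be the start of the terminator, hold that tail back and prepend it to the next chunk. The names `prefix_at_end` and `scan` and the event tuples are hypothetical.

```python
def prefix_at_end(data, terminator):
    """Length of the longest proper prefix of terminator that data ends with."""
    n = min(len(terminator) - 1, len(data))
    while n > 0 and not data.endswith(terminator[:n]):
        n -= 1
    return n


def scan(chunks, terminator):
    """Yield ("data", bytes) and ("end", b"") events from an iterable of chunks."""
    if not terminator:
        raise ValueError("terminator must not be empty")
    held = b""  # tail of the previous chunk that may begin a terminator
    for chunk in chunks:
        buf = held + chunk
        while True:
            i = buf.find(terminator)
            if i < 0:
                break
            if i:
                yield ("data", buf[:i])
            yield ("end", b"")
            buf = buf[i + len(terminator):]
        # No complete terminator is left; keep only a possible partial one.
        keep = prefix_at_end(buf, terminator)
        if len(buf) > keep:
            yield ("data", buf[:len(buf) - keep])
        held = buf[len(buf) - keep:]
    if held:
        # Input ended in the middle of something that only looked like a terminator.
        yield ("data", held)
```

The point of the design is that at most `len(terminator) - 1` bytes are ever carried between chunks, so memory use does not depend on how long a record is. The price is that a record may arrive as several "data" events.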

regards
Steve
--
Steve Holden +1 703 861 4237 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/

 
 
Robin Becker
 
      04-26-2005
Steve Holden wrote:
.......
>
> I seem to remember that the Medusa code contains a fairly good
> overlapped search for a terminator string, if you want to chunk the file.
>
> Take a look at the handle_read() method of class async_chat in the
> standard library's asynchat.py.

......
thanks I'll give it a whirl

--
Robin Becker

 
 
Steve Holden
 
      04-26-2005
Robin Becker wrote:
> Steve Holden wrote:
> ......
>
>>
>> I seem to remember that the Medusa code contains a fairly good
>> overlapped search for a terminator string, if you want to chunk the file.
>>
>> Take a look at the handle_read() method of class async_chat in the
>> standard library's asynchat.py.

>
> .....
> thanks I'll give it a whirl
>

Whoops, I don't think it's a regex search.

You should be able to adapt the logic fairly easily, I hope.

regards
Steve
--
Steve Holden +1 703 861 4237 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/

 
 
Robin Becker
 
      04-26-2005
Steve Holden wrote:
.....
>> .....
>> thanks I'll give it a whirl
>>

> Whoops, I don't think it's a regex search
>
> You should be able to adapt the logic fairly easily, I hope.
>

.....

The buffering logic is half the problem; doing it quickly is the other half.
The third half of the problem is getting re to co-operate, and there are probably
halves 4, 5, ...
--
Robin Becker

 
 
Skip Montanaro
 
      04-26-2005

Robin> So we avoid dirty page writes etc etc. However, I still think I
Robin> could get away with a small window into the file which would be
Robin> more efficient.

It's hard to imagine how sliding a small window onto a file within Python
would be more efficient than the operating system's paging system.

Skip
 
 
Robin Becker
 
      04-26-2005
Skip Montanaro wrote:
> Robin> So we avoid dirty page writes etc etc. However, I still think I
> Robin> could get away with a small window into the file which would be
> Robin> more efficient.
>
> It's hard to imagine how sliding a small window onto a file within Python
> would be more efficient than the operating system's paging system.
>
> Skip

Well, it might be if I only want to scan forward through the file (think lexical
analysis). Most lexical analyzers use a buffer and produce a stream of tokens, i.e.
a compressed version of the input. There are problems crossing buffers etc., but
we never normally need the whole file in memory.

If the lexical analyzer reads the whole file into memory then we need more
pages. The mmap thing might help as we need only read pages (for a lexical scanner).

Scanners work by detecting the transitions between tokens, so even if the tokens
are very long we don't need to store them twice (in the input stream and token
accumulator). I suppose that could be true of regex pattern matchers, but it
doesn't seem to be for re, i.e. the entire text matching the pattern must be in
the input before we can match and extract to an accumulator.
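
To make the buffer-crossing problem concrete, here is a rough Python 3 sketch of a forward-only regex split that keeps just the unconsumed tail in memory. Since re cannot report a partial match at the end of a buffer, the sketch leans on an assumption that is not part of re: the caller supplies `max_sep`, an upper bound on separator length, and the pattern uses no lookaround or anchors and never matches the empty string. The name `split_stream` and its parameters are hypothetical.

```python
import re


def split_stream(f, pattern, max_sep, chunk_size=64 * 1024):
    """Yield the pieces of binary file f separated by regex pattern."""
    regex = re.compile(pattern)
    buf = b""
    eof = False
    while not eof:
        chunk = f.read(chunk_size)
        eof = not chunk
        buf += chunk
        pos = 0  # end of the last accepted separator within buf
        for m in regex.finditer(buf):
            # A match that starts within max_sep bytes of the buffer end
            # might still grow or change once more data arrives, so wait.
            if not eof and m.start() + max_sep > len(buf):
                break
            yield buf[pos:m.start()]
            pos = m.end()
        buf = buf[pos:]  # keep only the unfinished piece
    yield buf
```

Memory use is bounded by the longest piece plus one chunk rather than by the file size. The sketch rescans the pending piece on every read, so it addresses the buffering half of the problem but not the speed half.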
--
Robin Becker

 
 
Jeremy Bowers
 
      04-26-2005
On Tue, 26 Apr 2005 19:32:29 +0100, Robin Becker wrote:

> Skip Montanaro wrote:
>> Robin> So we avoid dirty page writes etc etc. However, I still think I
>> Robin> could get away with a small window into the file which would be
>> Robin> more efficient.
>>
>> It's hard to imagine how sliding a small window onto a file within Python
>> would be more efficient than the operating system's paging system.
>>
>> Skip

> well it might be if I only want to scan forward through the file (think lexical
> analysis). Most lexical analyzers use a buffer and produce a stream of tokens ie
> a compressed version of the input. There are problems crossing buffers etc, but
> we never normally need the whole file in memory.


I think you might have a misunderstanding here. mmap puts a file into
*virtual* memory. It does *not* read the whole thing into physical memory;
if it did, there would be no purpose to mmap support in the OS in the
first place, as a thin wrapper around existing file calls would work.

> If the lexical analyzer reads the whole file into memory then we need more
> pages. The mmap thing might help as we need only read pages (for a lexical scanner).


The read-write status of the pages is not why mmap is an advantage; the
advantage is that the OS naturally and transparently takes care of
loading just the portions you want, and intelligently discarding them when
you are done (more intelligently than you could, even in theory, since it
can take advantage of knowing the entire state of the system, which your
program can't).

In other words, as Skip was trying to tell you, mmap *already
does* what you are saying might be better, and it does it better than you
can, even in theory, from inside a process (as the OS will not reveal to
you the data structures it has that you would need to match that
performance).

As you try to understand mmap, make sure your mental model can take into
account the fact that it is easy and quite common to mmap a file several
times larger than your physical memory, and it does not even *try* to read
the whole thing in at any given time. You may benefit from
reviewing/studying the difference between virtual memory and physical
memory.
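
As a small illustration of that mental model, the Python 3 sketch below maps a whole file but touches only one slice of it. Mapping reserves address space; the OS should only need to bring in the pages that back the slice actually read. The helper name `read_window` is hypothetical, and the sketch assumes the address space is large enough to map the file (which on a 32-bit system may not hold for very large files).

```python
import mmap


def read_window(path, offset, size):
    """Return size bytes starting at offset, without reading the whole file."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file, whatever its size.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies just these bytes out of the mapping.
            return mm[offset:offset + size]
```

The same applies to a regex scan over the mapping: pages are faulted in as the scan reaches them, and the OS is free to drop the ones already passed if memory gets tight.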
 