Re: Regex on a huge text

 
 
Medardo Rodriguez
      08-22-2008
On Fri, Aug 22, 2008 at 11:24 AM, Dan <(E-Mail Removed)> wrote:
> I'm looking for a way to apply a regex to a pretty huge input text (a file
> that's a couple of gigabytes). I found finditer, which returns results
> iteratively, which is good, but it looks like I still need to pass it a string,
> which would be bigger than my RAM. Is there a way to apply a regex directly
> to a file?
>
> Any help would be appreciated.



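You can call the POSIX *grep* utility.
But if each match can only occur within a single line of the
file, you can process it line by line:
#<code>
import re

res = []
with open(filename) as f:
    for line in f:
        # Iterating over the file yields one line at a time,
        # so only the current line has to fit in memory.
        res.extend(re.findall(regex, line))
# Of course re.findall stands in for any "getmatches"-style helper.
#</code>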
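
For the *grep* route, a minimal sketch of calling the utility from Python follows. The helper name `grep_lines` is made up for illustration, and the sketch assumes a `grep` executable is on the PATH. Note that `-E` selects POSIX extended regular expressions, whose syntax differs from Python's `re` in places, and that `text=True` decodes the output with the locale encoding, which may not suit every file.

```python
import subprocess


def grep_lines(pattern, path):
    """Yield the lines of `path` that match `pattern`, using the external grep.

    Hypothetical helper: the output is streamed through a pipe, so neither
    the file nor the full result set has to fit in memory at once.
    """
    # "-e" marks the pattern explicitly; "--" ends option parsing before the path.
    cmd = ["grep", "-E", "-e", pattern, "--", path]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # grep exits with 0 if any line matched, 1 if none did, and >1 on error.
    if proc.returncode > 1:
        raise RuntimeError("grep failed with exit status %d" % proc.returncode)
```

Something like `for line in grep_lines(r"ERROR [0-9]+", "big.log"): ...` would then iterate over the matching lines (the pattern and file name here are placeholders).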

Regards
 
John Machin
      08-22-2008
On Aug 23, 6:19 am, "Medardo Rodriguez" <(E-Mail Removed)> wrote:
> You can call the POSIX *grep* utility.
> But if each match can only occur within a single line of the
> file, you can process it line by line:
> #<code>
> (snip)
> #</code>


Docs:
"""
mmap — Memory-mapped file support

Memory-mapped file objects behave like both strings and like file
objects. Unlike normal string objects, however, these are mutable. You
can use mmap objects in most places where strings are expected; for
example, you can use the re module to search through a memory-mapped
file.
"""
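
A minimal sketch of what the docs describe, written for Python 3, where the pattern has to be a bytes pattern because the mapping exposes bytes. The helper name `iter_matches_mmap` is an assumption for illustration, not something from the docs.

```python
import mmap
import re


def iter_matches_mmap(path, pattern):
    """Yield (offset, matched bytes) for each match of `pattern` in the file.

    Hypothetical helper. `pattern` must be a bytes pattern such as rb"\\d+".
    """
    regex = re.compile(pattern)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The OS pages the file in on demand, so the whole file is not
            # read into RAM up front.
            for m in regex.finditer(mm):
                yield m.start(), m.group()
```

Unlike the line-by-line loop, this can find matches that span line boundaries, since the regex sees the file as one contiguous buffer.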

 
Gabriel Genellina
      08-24-2008
En Fri, 22 Aug 2008 18:56:51 -0300, John Machin <(E-Mail Removed)> escribió:
> Docs:
> """
> mmap — Memory-mapped file support
>
> Memory-mapped file objects behave like both strings and like file
> objects. Unlike normal string objects, however, these are mutable. You
> can use mmap objects in most places where strings are expected; for
> example, you can use the re module to search through a memory-mapped
> file.
> """


That is still limited by the virtual address space available to a user process: 2 GB or 3 GB depending on the OS (assuming a 32-bit OS).
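
One way around an address-space limit, not proposed in this thread, is to scan the file in fixed-size chunks and carry a small tail over from one chunk to the next. The sketch below rests on loud assumptions: the caller knows an upper bound `max_len` on the length of any match, and the pattern uses no anchors or lookaround (those could behave differently at a chunk edge). The name `iter_matches_chunked` and its parameters are made up for illustration.

```python
import re


def iter_matches_chunked(path, pattern, max_len, chunk_size=1 << 20):
    """Yield (file offset, matched bytes) while holding only about one chunk in memory.

    Hypothetical helper. Assumes no match is longer than `max_len` bytes and
    that the bytes pattern has no anchors or lookaround.
    """
    regex = re.compile(pattern)
    buf = b""
    base = 0  # file offset of buf[0]
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            eof = not chunk
            buf += chunk
            # A match starting at or after `limit` could be cut off by the end
            # of the buffer, so it is deferred until more data has been read.
            limit = len(buf) if eof else max(0, len(buf) - max_len + 1)
            pos = 0
            while True:
                m = regex.search(buf, pos)
                if m is None or m.start() >= limit:
                    break
                yield base + m.start(), m.group()
                # Always advance, even if the pattern matched an empty string.
                pos = max(m.end(), m.start() + 1)
            # Drop everything already searched; keep the unsearched tail.
            keep_from = max(pos, limit)
            base += keep_from
            buf = buf[keep_from:]
            if eof:
                break
```

The trade-off is that correctness depends on `max_len` being a true upper bound; a pattern with unbounded matches (for example one that can span the whole file) does not fit this scheme.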

--
Gabriel Genellina

 
Paddy
      08-24-2008
On Aug 22, 9:19 pm, "Medardo Rodriguez" <(E-Mail Removed)> wrote:
> You can call the POSIX *grep* utility.
> But if each match can only occur within a single line of the
> file, you can process it line by line:
> #<code>
> import re
>
> res = []
> with open(filename) as f:
>     for line in f:
>         res.extend(re.findall(regex, line))
> # Of course re.findall stands in for any "getmatches"-style helper.
> #</code>
>
> Regards


Try to pre-filter your file line by line to cut it down, then
apply a further filter to the result.

For example, if you were looking for consecutive SPAM records with the
same Name field, you might first extract only the SPAM records
from the gigabytes of input, leaving something more manageable in which
to search for consecutive Name fields.
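
The two-stage idea can be sketched with generators, so that neither stage holds the whole file in memory. The record layout here is purely an assumption for illustration (one record per line, starting with its kind and carrying a `Name=<value>` field), as are the function names.

```python
import re
from itertools import groupby

# Hypothetical record layout, not from the thread: "<KIND> Name=<name> ..." per line.
SPAM_LINE = re.compile(r"^SPAM\b")
NAME_FIELD = re.compile(r"\bName=(\S+)")


def spam_names(lines):
    """Stage 1: keep only SPAM records and reduce each one to its Name field."""
    for line in lines:
        if SPAM_LINE.match(line):
            m = NAME_FIELD.search(line)
            if m:
                yield m.group(1)


def repeated_names(names):
    """Stage 2: yield (name, run length) for runs of consecutive equal names."""
    for name, run in groupby(names):
        count = sum(1 for _ in run)
        if count > 1:
            yield name, count


def find_repeated_spam_names(path):
    """Chain both stages over a file, reading it one line at a time."""
    with open(path) as f:
        yield from repeated_names(spam_names(f))
```

Here "consecutive" means consecutive among the SPAM records that survive the first stage, which is how the suggestion above reads; records of other kinds in between do not break a run.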

- Paddy.
 