need help extracting data from a text file

 
 
nephish@xit.net
 
      11-07-2005
Hey there,
i have a text file with a bunch of values scattered throughout it.
i need to pull out a value that is in parentheses right after a
certain word,
like the first time the word 'foo' is found, retrieve the value in the
next set of parentheses (bar) and it would return 'bar'

i think i can use re to do this, but is there some easier way?
thanks

 
Iain King
 
      11-07-2005

(E-Mail Removed) wrote:
> Hey there,
> i have a text file with a bunch of values scattered throughout it.
> i am needing to pull out a value that is in parenthesis right after a
> certain word,
> like the first time the word 'foo' is found, retrieve the values in the
> next set of parenthesis (bar) and it would return 'bar'
>
> i think i can use re to do this, but is there some easier way?
> thanks


well, you can use string.find with offsets, but an re is probably a
cleaner way to go. I'm not sure which way is faster - it'll depend on
how many times you're searching compared to the overhead of setting up
an re.

start = textfile.find("foo(") + 4 # 4 being how long 'foo(' is
end = textfile.find(")", start)
value = textfile[start:end]
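
On the "which is faster" question, a rough way to check is to time both on your own data. This is only a sketch: the sample text, the function names and the repeat count are made up for illustration, and the numbers will vary by machine and input.

```python
import re
import timeit

# Made-up sample text: filler on both sides of a single foo(bar).
sample = "x " * 1000 + "foo(bar)" + " y" * 1000

def with_find(text):
    # Same approach as above: locate "foo(" and slice up to the next ")".
    start = text.find("foo(") + 4
    end = text.find(")", start)
    return text[start:end]

# Compiled once, so the setup overhead is paid a single time.
FOO_RE = re.compile(r"foo\(([^)]*)\)")

def with_re(text):
    return FOO_RE.search(text).group(1)

def compare(number=1000):
    """Return total seconds taken by each approach over `number` calls."""
    return {fn.__name__: timeit.timeit(lambda: fn(sample), number=number)
            for fn in (with_find, with_re)}

print(compare())
```

For something that runs a handful of times a day, either approach should be fast enough that readability matters more than the timing.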

Iain

 
nephish@xit.net
      11-07-2005
this is cool, it is only going to run about 10 times a day,

the text is not written out like foo(bar), it's more like
foo blah blah blah (bar)

the thing is, every few days the structure of the textfile may change,
which is one of the reasons i wanted to avoid the re.

thanks for the tip,

 
Iain King
      11-07-2005

(E-Mail Removed) wrote:
> this is cool, it is only going to run about 10 times a day,
>
> the text is not written out like foo(bar) its more like
> foo blah blah blah (bar)
>


then I guess you worked this out, but just for completeness:

keywordPos = textfile.find("foo")
start = textfile.find("(", keywordPos) + 1 # + 1 to skip past the "(" itself
end = textfile.find(")", start)
value = textfile[start:end]
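
One thing to watch: str.find returns -1 when it finds nothing, so if the file layout changes the slice can quietly return the wrong text. A minimal sketch of the same idea with those cases checked (the function name value_after is just made up for this example):

```python
def value_after(text, keyword):
    """Return the text inside the first (...) after the first `keyword`.

    Returns None if the keyword or either parenthesis is missing.
    """
    keyword_pos = text.find(keyword)
    if keyword_pos == -1:
        return None
    # Start looking for "(" only after the keyword itself.
    open_pos = text.find("(", keyword_pos + len(keyword))
    if open_pos == -1:
        return None
    close_pos = text.find(")", open_pos + 1)
    if close_pos == -1:
        return None
    # Slice between the parentheses, excluding both of them.
    return text[open_pos + 1:close_pos]

print(value_after("foo blah blah blah (bar)", "foo"))
```

Returning None instead of a bad slice makes it easier to notice when the file structure has changed.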


Iain

 
nephish@xit.net
      11-07-2005
um, wait. what you are doing here is easier than what i was doing after
your first post.
thanks a lot. this is going to work out ok.

thanks again.
sk

 
Paul McGuire
      11-07-2005
(E-Mail Removed) wrote:
> Hey there,
> i have a text file with a bunch of values scattered throughout it.
> i am needing to pull out a value that is in parenthesis right after a
> certain word,
> like the first time the word 'foo' is found, retrieve the values in the
> next set of parenthesis (bar) and it would return 'bar'
>
> i think i can use re to do this, but is there some easier way?
> thanks
>

Using string methods to locate the 'foo' instances is by far the fastest way
to go.

If your requirements get more complicated, look into using pyparsing
(http://pyparsing.sourceforge.net). Here is a pyparsing rendition of this
problem. This does three scans through some sample data - the first lists
all matches, the second ignores matches if they are found inside a quoted
string, and the third reports only the third match. This kind of
context-sensitive matching gets trickier with basic string and re tools.

-- Paul

data = """
i have a text file with a bunch of foo(bar1) values scattered throughout it.
i am needing to pull out a value that foo(bar2) is in parenthesis right
after a
certain word,
like the foo(bar3) first time the word 'foo' is found, retrieve the values
in the
next set of parenthesis foo(bar4) and it would return 'bar'
do we want to skip things in quotes, such as 'foo(barInQuotes)'?
"""

from pyparsing import Literal, SkipTo, quotedString, ParseException

pattern = Literal("foo") + "(" + SkipTo(")").setResultsName("payload") + ")"

# report all occurrences of xxx found in "foo(xxx)"
for tokens, start, end in pattern.scanString(data):
    print(tokens.payload, "at location", start)
print()

# ignore quoted strings
pattern.ignore(quotedString)
for tokens, start, end in pattern.scanString(data):
    print(tokens.payload, "at location", start)
print()

# only report 3rd occurrence
tokenMatch = {'foo': 0}
def thirdTimeOnly(strg, loc, tokens):
    # count each 'foo' match and reject every one except the third
    word = tokens[0]
    if word in tokenMatch:
        tokenMatch[word] += 1
        if tokenMatch[word] != 3:
            raise ParseException(strg, loc, "wrong occurrence of token")

pattern.setParseAction(thirdTimeOnly)
for tokens, start, end in pattern.scanString(data):
    print(tokens.payload, "at location", start)
print()

Prints:
bar1 at location 36
bar2 at location 116
bar3 at location 181
bar4 at location 278
barInQuotes at location 360

bar1 at location 36
bar2 at location 116
bar3 at location 181
bar4 at location 278

bar3 at location 181
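
For comparison, here is a rough sketch of the same three scans using only the re module, to show where the extra bookkeeping creeps in. It is not equivalent to the pyparsing version: the names SCAN_RE, scan_payloads and nth_payload are invented for this example, the sample text is made up, and a stray apostrophe (as in "don't") would be treated as the start of a quoted string.

```python
import re

# Quoted strings are listed first so that any foo(...) inside them is
# consumed and skipped; only the last alternative captures a payload.
SCAN_RE = re.compile(r"""'[^']*'|"[^"]*"|foo\((?P<payload>[^)]*)\)""")

def scan_payloads(text):
    """Return (payload, start) pairs for foo(...) outside quoted strings."""
    return [(m.group("payload"), m.start())
            for m in SCAN_RE.finditer(text)
            if m.group("payload") is not None]

def nth_payload(text, n):
    """Return the payload of the n-th unquoted match (1-based), or None."""
    found = scan_payloads(text)
    return found[n - 1][0] if len(found) >= n else None

sample = "foo(a) 'foo(q)' foo(b) foo(c)"
print(scan_payloads(sample))
print(nth_payload(sample, 3))
```

The counting and the quote handling both have to be written by hand here, which is the kind of thing that gets harder to maintain as the rules grow.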


 
Kent Johnson
      11-07-2005
(E-Mail Removed) wrote:
> Hey there,
> i have a text file with a bunch of values scattered throughout it.
> i am needing to pull out a value that is in parenthesis right after a
> certain word,
> like the first time the word 'foo' is found, retrieve the values in the
> next set of parenthesis (bar) and it would return 'bar'
>
> i think i can use re to do this, but is there some easier way?


It's pretty easy with an re:

>>> import re
>>> fooRe = re.compile(r'foo.*?\((.*?)\)')
>>> fooRe.search('foo(bar)').group(1)
'bar'
>>> fooRe.search('This is a foo bar baz blah blah (bar)').group(1)
'bar'
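
Note that search returns None when nothing matches, so calling .group(1) directly raises AttributeError in that case. A small wrapper sketch (the name first_paren_after is made up here) that also escapes the keyword so it is matched literally:

```python
import re

def first_paren_after(text, keyword):
    """Return the contents of the first (...) after `keyword`, or None."""
    # re.escape keeps characters such as "." or "+" in the keyword literal.
    pattern = re.compile(re.escape(keyword) + r".*?\((.*?)\)")
    match = pattern.search(text)
    return match.group(1) if match else None

print(first_paren_after("This is a foo bar baz blah blah (bar)", "foo"))
```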

Kent
 
Tom Anderson
      11-09-2005
On Mon, 7 Nov 2005, Kent Johnson wrote:

> (E-Mail Removed) wrote:
>
>> i have a text file with a bunch of values scattered throughout it. i am
>> needing to pull out a value that is in parenthesis right after a
>> certain word, like the first time the word 'foo' is found, retrieve the
>> values in the next set of parenthesis (bar) and it would return 'bar'
>
> It's pretty easy with an re:
>
> >>> import re
> >>> fooRe = re.compile(r'foo.*?\((.*?)\)')


Just out of interest, i've never really got into using non-greedy
quantifiers (i use them from time to time, but hardly ever feel the need
for them), so my instinct would have been to write this as:

>>> fooRe = re.compile(r"foo[^(]*\(([^)]*)\)")


Is there any reason to use one over the other?

> >>> fooRe.search('foo(bar)').group(1)
> 'bar'
> >>> fooRe.search('This is a foo bar baz blah blah (bar)').group(1)
> 'bar'


Ditto.
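
One difference that can be checked directly: by default "." does not match a newline, so the non-greedy version will not find parentheses on a later line than 'foo', while a negated character class like [^(] happily crosses line breaks. A small sketch, with invented sample strings and a made-up helper name:

```python
import re

lazy = re.compile(r"foo.*?\((.*?)\)")
negated = re.compile(r"foo[^(]*\(([^)]*)\)")

def group_or_none(pattern, text):
    """Return group(1) of the first match, or None if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None

one_line = "This is a foo bar baz blah blah (bar)"
two_lines = "foo bar baz\nblah blah (bar)"

for name, pattern in (("lazy", lazy), ("negated", negated)):
    print(name, group_or_none(pattern, one_line),
          group_or_none(pattern, two_lines))

# With re.DOTALL, "." matches newlines too and the lazy form crosses lines.
lazy_dotall = re.compile(lazy.pattern, re.DOTALL)
print("lazy+DOTALL", group_or_none(lazy_dotall, two_lines))
```

On single-line input the two patterns return the same thing, so for that case the choice looks like mostly a matter of taste.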

tom

--
[of Muholland Drive] Cancer is pretty ingenious too, but its best to
avoid. -- Tex
 