need help extracting data from a text file
Hey there,
I have a text file with a bunch of values scattered throughout it. I need to pull out a value that is in parentheses right after a certain word. For example, the first time the word 'foo' is found, retrieve the value in the next set of parentheses, (bar), and return 'bar'.

I think I can use re to do this, but is there some easier way? Thanks
Re: need help extracting data from a text file
nephish@xit.net wrote:
> Hey there,
> i have a text file with a bunch of values scattered throughout it.
> i am needing to pull out a value that is in parenthesis right after a
> certain word,
> like the first time the word 'foo' is found, retrieve the values in the
> next set of parenthesis (bar) and it would return 'bar'
>
> i think i can use re to do this, but is there some easier way?
> thanks

Well, you can use string.find with offsets, but an re is probably a cleaner way to go. I'm not sure which way is faster - it'll depend on how many times you're searching compared to the overhead of setting up an re.

    start = textfile.find("foo(") + 4  # 4 being how long 'foo(' is
    end = textfile.find(")", start)
    value = textfile[start:end]

Iain
Re: need help extracting data from a text file
This is cool. It is only going to run about 10 times a day.

The text is not written out like foo(bar); it's more like foo blah blah blah (bar). The thing is, every few days the structure of the text file may change, which is one of the reasons I wanted to avoid the re. Thanks for the tip.
Re: need help extracting data from a text file
nephish@xit.net wrote:
> this is cool, it is only going to run about 10 times a day,
>
> the text is not written out like foo(bar) its more like
> foo blah blah blah (bar)
>

Then I guess you worked this out, but just for completeness:

    keywordPos = textfile.find("foo")
    start = textfile.find("(", keywordPos) + 1  # + 1 to skip past the "(" itself
    end = textfile.find(")", start)
    value = textfile[start:end]

Iain
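A minimal sketch of the same find-based approach wrapped in a function, assuming the whole file has already been read into one string. The name `value_after` and the choice to return `None` on failure are illustrative assumptions, not something from the thread. `str.find` returns -1 when nothing is found, so each step is checked before moving on.

```python
def value_after(text, keyword):
    """Return the text inside the first (...) that follows keyword, or None.

    Hypothetical helper: the name and the None return value are assumptions.
    """
    keyword_pos = text.find(keyword)
    if keyword_pos == -1:
        return None  # keyword not present
    # Start looking for "(" only after the keyword itself.
    open_pos = text.find("(", keyword_pos + len(keyword))
    if open_pos == -1:
        return None  # no opening parenthesis after the keyword
    close_pos = text.find(")", open_pos + 1)
    if close_pos == -1:
        return None  # parenthesis never closed
    return text[open_pos + 1:close_pos]


print(value_after("foo blah blah blah (bar)", "foo"))  # bar
print(value_after("foo(bar)", "foo"))                  # bar
print(value_after("no keyword here (bar)", "foo"))     # None
```

Without the -1 checks, a missing keyword or parenthesis would silently produce a wrong slice instead of an obvious failure.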
Re: need help extracting data from a text file
Um, wait. What you are doing here is easier than what I was doing after your first post. Thanks a lot. This is going to work out OK. Thanks again. sk
Re: need help extracting data from a text file
<nephish@xit.net> wrote in message
news:1131375863.977379.120620@f14g2000cwb.googlegroups.com...
> Hey there,
> i have a text file with a bunch of values scattered throughout it.
> i am needing to pull out a value that is in parenthesis right after a
> certain word,
> like the first time the word 'foo' is found, retrieve the values in the
> next set of parenthesis (bar) and it would return 'bar'
>
> i think i can use re to do this, but is there some easier way?
> thanks
>

Using string methods to locate the 'foo' instances is by far the fastest way to go.

If your requirements get more complicated, look into using pyparsing (http://pyparsing.sourceforge.net). Here is a pyparsing rendition of this problem. This does three scans through some sample data - the first lists all matches, the second ignores matches if they are found inside a quoted string, and the third reports only the third match. This kind of context-sensitive matching gets trickier with basic string and re tools.

-- Paul

    data = """
    i have a text file with a bunch of foo(bar1) values scattered throughout it.
    i am needing to pull out a value that foo(bar2) is in parenthesis right after a
    certain word,
    like the foo(bar3) first time the word 'foo' is found, retrieve the values in the
    next set of parenthesis foo(bar4) and it would return 'bar'
    do we want to skip things in quotes, such as 'foo(barInQuotes)'?
    """

    # ParseException is raised in the parse action below, so it must be imported too
    from pyparsing import Literal, SkipTo, quotedString, ParseException

    pattern = Literal("foo") + "(" + SkipTo(")").setResultsName("payload") + ")"

    # report all occurrences of xxx found in "foo(xxx)"
    for tokens, start, end in pattern.scanString(data):
        print(tokens.payload, "at location", start)

    # ignore quoted strings
    pattern.ignore(quotedString)
    for tokens, start, end in pattern.scanString(data):
        print(tokens.payload, "at location", start)

    # only report 3rd occurrence
    tokenMatch = {'foo': 0}
    def thirdTimeOnly(strg, loc, tokens):
        word = tokens[0]
        if word in tokenMatch:
            tokenMatch[word] += 1
            if tokenMatch[word] != 3:
                raise ParseException(strg, loc, "wrong occurrence of token")

    pattern.setParseAction(thirdTimeOnly)
    for tokens, start, end in pattern.scanString(data):
        print(tokens.payload, "at location", start)

Prints:

    bar1 at location 36
    bar2 at location 116
    bar3 at location 181
    bar4 at location 278
    barInQuotes at location 360
    bar1 at location 36
    bar2 at location 116
    bar3 at location 181
    bar4 at location 278
    bar3 at location 181
Re: need help extracting data from a text file
nephish@xit.net wrote:
> Hey there,
> i have a text file with a bunch of values scattered throughout it.
> i am needing to pull out a value that is in parenthesis right after a
> certain word,
> like the first time the word 'foo' is found, retrieve the values in the
> next set of parenthesis (bar) and it would return 'bar'
>
> i think i can use re to do this, but is there some easier way?

It's pretty easy with an re:

    >>> import re
    >>> fooRe = re.compile(r'foo.*?\((.*?)\)')
    >>> fooRe.search('foo(bar)').group(1)
    'bar'
    >>> fooRe.search('This is a foo bar baz blah blah (bar)').group(1)
    'bar'

Kent
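A rough sketch of how the same pattern could be built for an arbitrary keyword rather than a hard-coded 'foo'. The function name `first_paren_after` is hypothetical, and returning `None` when nothing matches is an assumption; calling `.group(1)` directly on a failed `search` would raise an `AttributeError`, because `search` returns `None` in that case.

```python
import re


def first_paren_after(text, keyword):
    """Return the contents of the first (...) after keyword, or None.

    Hypothetical helper: the name and the None return value are assumptions.
    """
    # re.escape guards against keywords that contain regex metacharacters.
    pattern = re.compile(re.escape(keyword) + r".*?\((.*?)\)")
    match = pattern.search(text)
    return match.group(1) if match else None


print(first_paren_after("This is a foo bar baz blah blah (bar)", "foo"))  # bar
print(first_paren_after("nothing to see here", "foo"))                    # None
```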
Re: need help extracting data from a text file
On Mon, 7 Nov 2005, Kent Johnson wrote:
> nephish@xit.net wrote:
>
>> i have a text file with a bunch of values scattered throughout it. i am
>> needing to pull out a value that is in parenthesis right after a
>> certain word, like the first time the word 'foo' is found, retrieve the
>> values in the next set of parenthesis (bar) and it would return 'bar'
>
> It's pretty easy with an re:
>
> >>> import re
> >>> fooRe = re.compile(r'foo.*?\((.*?)\)')

Just out of interest, i've never really got into using non-greedy quantifiers (i use them from time to time, but hardly ever feel the need for them), so my instinct would have been to write this as:

    >>> fooRe = re.compile(r"foo[^(]*\(([^)]*)\)")

Is there any reason to use one over the other?

> >>> fooRe.search('foo(bar)').group(1)
> 'bar'
> >>> fooRe.search('This is a foo bar baz blah blah (bar)').group(1)
> 'bar'

Ditto.

tom

-- 
[of Muholland Drive] Cancer is pretty ingenious too, but its best to avoid. -- Tex
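One practical difference between the two patterns, sketched below with made-up sample strings: in the standard `re` module, `.` does not match a newline unless `re.DOTALL` is given, while a negated class such as `[^(]` does. For a file read in as a single string, that can matter when the keyword and the parenthesis sit on different lines. The variable names here are illustrative only.

```python
import re

lazy = re.compile(r"foo.*?\((.*?)\)")         # non-greedy version
negated = re.compile(r"foo[^(]*\(([^)]*)\)")  # negated-character-class version

same_line = "foo blah blah (bar)"
split_lines = "foo blah\nblah (bar)"

# On a single line both patterns agree.
print(lazy.search(same_line).group(1))     # bar
print(negated.search(same_line).group(1))  # bar

# Across a line break, '.' stops at the newline but '[^(]' does not.
print(lazy.search(split_lines))               # None
print(negated.search(split_lines).group(1))   # bar

# re.DOTALL lets '.' match newlines, so the non-greedy form works again.
lazy_dotall = re.compile(r"foo.*?\((.*?)\)", re.DOTALL)
print(lazy_dotall.search(split_lines).group(1))  # bar
```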