converting a sed / grep / awk / . . . bash pipe line into python
Hi,
Something I have to do very often is filtering / transforming line-based file contents and storing the result in an array or a dictionary.

Very often the functionality already exists in the form of a shell script with sed / awk / grep , . . . and I would like to have the same implementation in my script.

What's a compact, efficient (no intermediate arrays generated / regexps compiled only once) way in python for this kind of 'pipe line'?

Example 1 (in bash): (annotated with comments, thus not working if copied / pasted)
#-------------------------------------------------------------------------------------------
cat file \                              ### read from file
  | sed 's/\.\..*//' \                  ### remove '//' comments
  | sed 's/#.*//' \                     ### remove '#' comments
  | grep -v '^\s*$' \                   ### get rid of empty lines
  | awk '{ print $1 + $2 " " $2 }' \    ### knowing, that all remaining lines contain always at least
                                   \    ### two integers calculate sum and 'keep' second number
  | grep '^42 '                         ### keep lines for which sum is 42
  | awk '{ print $2 }'                  ### print number

Same example in perl:
# I guess (but didn't try), that the perl example will create more intermediate
# data structures than necessary.
# Ideally the python implementation shouldn't do this, but just 'chain' iterators.
#-------------------------------------------------------------------------------------------
my $filename= "file";
open(my $fh,$filename) or die "failed opening file $filename";

# order of 'pipeline' is syntactically reversed (if compared to shell script)
my @numbers =
    map  { $_->[1] }                        # extract num 2
    grep { $_->[0] == 42 }                  # keep lines with result 42
    map  { [ $_->[0]+$_->[1],$_->[1] ] }    # calculate sum of first two nums and keep second num
    map  { [ split(' ',$_,3) ] }            # split by white space
    grep { ! ($_ =~ /^\s*$/) }              # remove empty lines
    map  { $_ =~ s/#.*// ; $_}              # strip '#' comments
    map  { $_ =~ s/\/\/.*// ; $_}           # strip '//' comments
    <$fh>;
print "Numbers are:\n",join("\n",@numbers),"\n";

thanks in advance for any suggestions of how to code this (keeping the comments)

H
Re: converting a sed / grep / awk / . . . bash pipe line into python
On Tue, 02 Sep 2008 10:36:50 -0700, hofer wrote:
> sed 's/\.\..*//' \ ### remove '//' comments
> | sed 's/#.*//'

Comment does not match the code. Or vice versa. :-)

Untested:

from __future__ import with_statement
from itertools import ifilterfalse, imap


def is_junk(line):
    line = line.rstrip()
    return not line or line.startswith('//') or line.startswith('#')


def extract_numbers(line):
    result = map(int, line.split()[:2])
    assert len(result) == 2
    return result


def main():
    with open('test.txt') as lines:
        clean_lines = ifilterfalse(is_junk, lines)
        pairs = imap(extract_numbers, clean_lines)
        # join() needs strings, so convert each number first
        print '\n'.join(str(b) for a, b in pairs if a + b == 42)


if __name__ == '__main__':
    main()

Ciao,
Marc 'BlackJack' Rintsch
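For readers on Python 3, where `ifilterfalse` and `imap` no longer exist, the same chaining idea can be sketched with generators. This is only a sketch under assumptions: the function names and the sample lines are made up for illustration, and, unlike the code above, it strips trailing '//' and '#' comments the way the sed steps in the question were meant to. The regular expressions are compiled once at module level, and nothing is materialised until the final `list()` call.

```python
import re

# Compiled once, mirroring the two sed steps of the shell pipeline.
SLASH_COMMENT = re.compile(r"//.*")
HASH_COMMENT = re.compile(r"#.*")


def strip_comments(lines):
    # Remove '//' comments, then '#' comments, from each line.
    for line in lines:
        line = SLASH_COMMENT.sub("", line)
        yield HASH_COMMENT.sub("", line)


def drop_blank(lines):
    # Skip lines that are empty or contain only whitespace.
    return (line for line in lines if line.strip())


def first_two_ints(lines):
    # Yield the first two whitespace-separated fields as integers.
    for line in lines:
        a, b = line.split()[:2]
        yield int(a), int(b)


def numbers_summing_to(lines, target=42):
    # Chain the stages lazily; yield the second number when a + b == target.
    pairs = first_two_ints(drop_blank(strip_comments(lines)))
    return (b for a, b in pairs if a + b == target)


# Hypothetical sample input; an open file object would work the same way.
sample = [
    "40 2 // sums to 42\n",
    "# a comment line\n",
    "\n",
    "1 2 3\n",
    "19 23 # also 42\n",
]
print(list(numbers_summing_to(sample)))  # [2, 23]
```

Each stage takes any iterable of lines and returns an iterator, so stages can be added, removed or reordered much like the commands in the shell pipeline.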
Re: converting a sed / grep / awk / . . . bash pipe line into python
On Sep 2, 12:36 pm, hofer <bla...@dungeon.de> wrote:
> Hi,
>
> Something I have to do very often is filtering / transforming line
> based file contents and storing the result in an array or a
> dictionary.
>
> Very often the functionality exists already in form of a shell script
> with sed / awk / grep , . . .
> and I would like to have the same implementation in my script
>

All that sed'ing, grep'ing and awk'ing, you might want to take a look at pyparsing. Here is a pyparsing take on your posted problem:

from pyparsing import LineEnd, Word, nums, LineStart, OneOrMore, restOfLine

test = """
1 2 3
47 23
// this will never match
# blank lines are not of any interest

91 26
23 19
41 1 97 26 // extra numbers don't matter
"""

# define pyparsing expressions to match a line of integers
EOL = LineEnd()
integer = Word(nums)

# by default, pyparsing will implicitly skip over whitespace and
# newlines, so EOL is skipped over by default - this would mix together
# integers on consecutive lines - we only want OneOrMore integers as long
# as they are on the same line, that is, integers with no intervening
# EOL's
line_of_integers = (LineStart() + integer + OneOrMore(~EOL + integer))

# use a parse action to identify the target lines
def select_significant_values(t):
    v1, v2 = map(int, t[:2])
    if v1+v2 == 42:
        print v2
line_of_integers.setParseAction(select_significant_values)

# skip over comments, wherever they are
line_of_integers.ignore( '//' + restOfLine )
line_of_integers.ignore( '#' + restOfLine )

# use the line_of_integers expression to search through the test text
# the parse action will print the matching values
line_of_integers.searchString(test)

-- Paul
Re: converting a sed / grep / awk / . . . bash pipe line into python
hofer wrote:
> Something I have to do very often is filtering / transforming line
> based file contents and storing the result in an array or a
> dictionary.
>
> Very often the functionality exists already in form of a shell script
> with sed / awk / grep , . . .
> and I would like to have the same implementation in my script
>
> What's a compact, efficient (no intermediate arrays generated /
> regexps compiled only once) way in python
> for such kind of 'pipe line'
>
> Example 1 (in bash): (annotated with comment (thus not working) if
> copied / pasted
> cat file \ ### read from file
> | sed 's/\.\..*//' \ ### remove '//' comments
> | sed 's/#.*//' \ ### remove '#' comments
> | grep -v '^\s*$' \ ### get rid of empty lines
> | awk '{ print $1 + $2 " " $2 }' \ ### knowing, that all remaining
> lines contain always at least
> \ ### two integers calculate
> sum and 'keep' second number
> | grep '^42 ' ### keep lines for which sum is 42
> | awk '{ print $2 }' ### print number

> thanks in advance for any suggestions of how to code this (keeping the
> comments)

for line in open("file"): # read from file
    try:
        a, b = map(int, line.split(None, 2)[:2]) # remove extra columns,
                                                 # convert to integer
    except ValueError:
        pass # remove comments, get rid of empty lines,
             # skip lines with less than two integers
    else:
        # line did start with two integers
        if a + b == 42: # keep lines for which the sum is 42
            print b # print number

The hard part was keeping the comments ;)

Without them it looks better:

import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b

Peter
Re: converting a sed / grep / awk / . . . bash pipe line into python
In article <g9ldi5$2ea$03$1@news.t-online.com>,
Peter Otten <__peter__@web.de> wrote:

> Without them it looks better:
>
> import sys
> for line in sys.stdin:
>     try:
>         a, b = map(int, line.split(None, 2)[:2])
>     except ValueError:
>         pass
>     else:
>         if a + b == 42:
>             print b

I'm philosophically opposed to one-liners like:

> a, b = map(int, line.split(None, 2)[:2])

because they're difficult to understand at a glance. You need to visually parse it and work your way out from the inside to figure out what's going on. Better to keep it longer and simpler.

Now that I've got my head around it, I realized there's no reason to make the split part so complicated. No reason to limit how many splits get done if you're explicitly going to slice the first two. And since you don't need to supply the second argument, the first one can be defaulted as well. So, you immediately get down to:

> a, b = map(int, line.split()[:2])

which isn't too bad. I might take it one step further, however, and do:

> fields = line.split()[:2]
> a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually overkill, use of map() and just write:

> a, b = line.split()[:2]
> a = int(a)
> b = int(b)
Re: converting a sed / grep / awk / . . . bash pipe line into python
Roy Smith wrote:
> In article <g9ldi5$2ea$03$1@news.t-online.com>,
> Peter Otten <__peter__@web.de> wrote:
>
>> Without them it looks better:
>>
>> import sys
>> for line in sys.stdin:
>>     try:
>>         a, b = map(int, line.split(None, 2)[:2])
>>     except ValueError:
>>         pass
>>     else:
>>         if a + b == 42:
>>             print b
>
> I'm philosophically opposed to one-liners

I'm not, as long as you don't /force/ the code into one line.

> like:
>
>> a, b = map(int, line.split(None, 2)[:2])
>
> because they're difficult to understand at a glance. You need to visually
> parse it and work your way out from the inside to figure out what's going
> on. Better to keep it longer and simpler.
>
> Now that I've got my head around it, I realized there's no reason to make
> the split part so complicated. No reason to limit how many splits get done
> if you're explicitly going to slice the first two. And since you don't
> need to supply the second argument, the first one can be defaulted as
> well. So, you immediately get down to:
>
>> a, b = map(int, line.split()[:2])

I agree that the above is an improvement.

> which isn't too bad. I might take it one step further, however, and do:
>
>> fields = line.split()[:2]
>> a, b = map(int, fields)
>
> in fact, I might even get rid of the very generic, but conceptually
> overkill, use of map() and just write:
>
>> a, b = line.split()[:2]
>> a = int(a)
>> b = int(b)

If you go that route your next step is to introduce another try...except, one for the unpacking and another for the integer conversion...

Peter
Re: converting a sed / grep / awk / . . . bash pipe line into python
Roy Smith:
> No reason to limit how many splits get done if you're
> explicitly going to slice the first two.

You are probably right for this problem, because most lines are 2 items long, but in scripts that have to process lines potentially composed of many parts, setting a max number of parts speeds up your script and reduces memory used, because you have fewer parts at the end.

Bye,
bearophile
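The difference between the two forms of `split()` is easiest to see on a line with many fields. A small sketch (the sample line is invented for illustration): both forms give the same first two fields, but the limited split stops after two cuts and leaves the rest of the line as one unsplit string.

```python
line = "41 1 97 26 and many more fields\n"

# Unlimited split: every field becomes a list element before slicing.
all_fields = line.split()

# Limited split: at most two cuts, so at most three elements; the
# unsplit remainder of the line is the last element.
limited = line.split(None, 2)

print(all_fields[:2])                 # ['41', '1']
print(limited[:2])                    # ['41', '1']
print(len(all_fields), len(limited))  # 8 3
```

Whether the saving matters depends on how long the lines really are; for short lines such as those in the original question the two forms are interchangeable.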
Re: converting a sed / grep / awk / . . . bash pipe line into python
In article <g9lvc5$8qq$03$1@news.t-online.com>,
Peter Otten <__peter__@web.de> wrote:

> > I might take it one step further, however, and do:
> >
> >> fields = line.split()[:2]
> >> a, b = map(int, fields)
> >
> > in fact, I might even get rid of the very generic, but conceptually
> > overkill, use of map() and just write:
> >
> >> a, b = line.split()[:2]
> >> a = int(a)
> >> b = int(b)
>
> If you go that route your next step is to introduce another try...except,
> one for the unpacking and another for the integer conversion...

Why another try/except? The potential unpack and conversion errors exist in both versions, and the existing try block catches them all. Splitting the one line up into three with some intermediate variables doesn't change that.
Re: converting a sed / grep / awk / . . . bash pipe line into python
In article
<7f2d4b4a-bc97-4b46-a31e-63f98e9fee73@34g2000hsh.googlegroups.com>,
bearophileHUGS@lycos.com wrote:

> Roy Smith:
> > No reason to limit how many splits get done if you're
> > explicitly going to slice the first two.
>
> You are probably right for this problem, because most lines are 2
> items long, but in scripts that have to process lines potentially
> composed of many parts, setting a max number of parts speeds up your
> script and reduces memory used, because you have fewer parts at the
> end.
>
> Bye,
> bearophile

Sounds like premature optimization to me. Make it work and be easy to understand first. Then worry about how fast it is.

But, along those lines, I've often thought that split() needed a way to not just limit the number of splits, but to also throw away the extra stuff. Getting the first N fields of a string is something I've done often enough that refactoring the slicing operation right into the split() code seems worthwhile. And, it would be even faster :-)
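Such a helper can be written on top of the existing `split()`. A minimal sketch, assuming a hypothetical function name `first_fields` (nothing by that name exists in the standard library); it combines the limited split with the slice, so the caller never sees the remainder:

```python
def first_fields(line, n):
    """Return up to the first n whitespace-separated fields of line.

    Hypothetical helper, not part of the standard library.
    """
    # maxsplit=n yields at most n + 1 parts; the slice drops the remainder.
    return line.split(None, n)[:n]


print(first_fields("41 1 97 26 // extra numbers\n", 2))  # ['41', '1']
print(first_fields("7\n", 2))                            # ['7']
```

A line with fewer than n fields simply returns a shorter list, so code that unpacks the result still has to handle that case.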
Re: converting a sed / grep / awk / . . . bash pipe line into python
Roy Smith wrote:
> In article <g9lvc5$8qq$03$1@news.t-online.com>,
> Peter Otten <__peter__@web.de> wrote:
>
>> > I might take it one step further, however, and do:
>> >
>> >> fields = line.split()[:2]
>> >> a, b = map(int, fields)
>> >
>> > in fact, I might even get rid of the very generic, but conceptually
>> > overkill, use of map() and just write:
>> >
>> >> a, b = line.split()[:2]
>> >> a = int(a)
>> >> b = int(b)
>>
>> If you go that route your next step is to introduce another try...except,
>> one for the unpacking and another for the integer conversion...
>
> Why another try/except? The potential unpack and conversion errors exist
> in both versions, and the existing try block catches them all. Splitting
> the one line up into three with some intermediate variables doesn't change
> that.

As I understood it you didn't just split a line of code into three, but wanted two processing steps. These logical steps are then somewhat remixed by the shared error handling. You lose the information which step failed. In the general case you may even mask a bug.

Peter
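Both a failed unpacking and a failed `int()` conversion raise `ValueError`, which is why a single shared handler cannot tell them apart. One way to keep the two steps separate is sketched below; the function name `parse_pair` and the error messages are invented for illustration, and the code is written for Python 3.

```python
def parse_pair(line):
    """Return (a, b) from the first two fields of line.

    Raises ValueError with a message that says which step failed.
    Hypothetical helper for illustration only.
    """
    fields = line.split()[:2]
    try:
        a, b = fields                      # step 1: unpacking
    except ValueError:
        raise ValueError("expected two fields, got %d" % len(fields))
    try:
        return int(a), int(b)              # step 2: integer conversion
    except ValueError:
        raise ValueError("non-integer field in %r" % (fields,))


for text in ("40 2 extra", "40", "a b"):
    try:
        print(parse_pair(text))
    except ValueError as exc:
        print("skipped %r: %s" % (text, exc))
```

The caller can still skip every bad line with one `except ValueError`, but the message now records whether the line was too short or contained something that is not an integer.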