Velocity Reviews


hofer 09-02-2008 05:36 PM

converting a sed / grep / awk / . . . bash pipe line into python
 
Hi,

Something I have to do very often is filtering / transforming
line-based file contents and storing the result in an array or a
dictionary.

Very often the functionality already exists in the form of a shell script
with sed / awk / grep , . . .
and I would like to have the same implementation in my script.

What's a compact, efficient (no intermediate arrays generated /
regexps compiled only once) way in python
for this kind of 'pipeline'?

Example 1 (in bash), annotated with comments (and thus not working if
copied / pasted):
#-------------------------------------------------------------------------------------------
cat file \ ### read from file
| sed 's/\.\..*//' \ ### remove '//' comments
| sed 's/#.*//' \ ### remove '#' comments
| grep -v '^\s*$' \ ### get rid of empty lines
| awk '{ print $1 + $2 " " $2 }' \ ### knowing that all remaining lines always contain at least
\ ### two integers, calculate sum and 'keep' second number
| grep '^42 ' ### keep lines for which sum is 42
| awk '{ print $2 }' ### print number

Same example in perl:
# I guess (but didn't try) that the perl example will create more
# intermediate data structures than necessary.
# Ideally the python implementation shouldn't do this, but just
# 'chain' iterators.
#-------------------------------------------------------------------------------------------
my $filename= "file";
open(my $fh,$filename) or die "failed opening file $filename";

# order of 'pipeline' is syntactically reversed (if compared to shell script)
my @numbers =
    map { $_->[1] }                      # extract num 2
    grep { $_->[0] == 42 }               # keep lines with result 42
    map { [ $_->[0]+$_->[1],$_->[1] ] }  # calculate sum of first two nums and keep second num
    map { [ split(' ',$_,3) ] }          # split by white space
    grep { ! ($_ =~ /^\s*$/) }           # remove empty lines
    map { $_ =~ s/#.*// ; $_}            # strip '#' comments
    map { $_ =~ s/\/\/.*// ; $_}         # strip '//' comments
    <$fh>;
print "Numbers are:\n",join("\n",@numbers),"\n";

thanks in advance for any suggestions of how to code this (keeping the
comments)


H





Marc 'BlackJack' Rintsch 09-02-2008 06:26 PM

Re: converting a sed / grep / awk / . . . bash pipe line into python
 
On Tue, 02 Sep 2008 10:36:50 -0700, hofer wrote:

> sed 's/\.\..*//' \ ### remove '//' comments | sed 's/#.*//'


Comment does not match the code. Or vice versa. :-)

Untested:

from __future__ import with_statement
from itertools import ifilterfalse, imap


def is_junk(line):
    line = line.rstrip()
    return not line or line.startswith('//') or line.startswith('#')


def extract_numbers(line):
    result = map(int, line.split()[:2])
    assert len(result) == 2
    return result


def main():
    with open('test.txt') as lines:
        clean_lines = ifilterfalse(is_junk, lines)
        pairs = imap(extract_numbers, clean_lines)
        # str() is needed because join() only accepts strings
        print '\n'.join(str(b) for a, b in pairs if a + b == 42)


if __name__ == '__main__':
    main()

Ciao,
Marc 'BlackJack' Rintsch
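
For readers on Python 3, where itertools.imap and ifilterfalse no longer exist, the iterator-chaining idea might be sketched with generator expressions instead. This is a minimal sketch, not code from the thread: the name second_numbers and the sample input are made up for illustration. Unlike the code above it also strips trailing '//' and '#' comments, using regexes that are compiled once, and it relies on the original assumption that every remaining line holds at least two integers.

```python
import re

# Compiled once at import time and reused for every line.
SLASH_COMMENT = re.compile(r'//.*')
HASH_COMMENT = re.compile(r'#.*')


def second_numbers(lines, target=42):
    """Lazily yield the second integer of each line whose first two sum to target."""
    no_slash = (SLASH_COMMENT.sub('', line) for line in lines)   # strip '//' comments
    no_hash = (HASH_COMMENT.sub('', line) for line in no_slash)  # strip '#' comments
    non_empty = (line for line in no_hash if line.strip())       # drop blank lines
    pairs = (tuple(map(int, line.split()[:2])) for line in non_empty)
    return (b for a, b in pairs if a + b == target)


sample = [
    "1 2 3\n",
    "47 23 // never matches\n",
    "# a comment line\n",
    "\n",
    "23 19\n",
    "41 1 97 26 // extra numbers don't matter\n",
]
print(list(second_numbers(sample)))  # prints [19, 1]
```

Nothing is read or computed until the final generator is consumed, so no intermediate list is built for any stage.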

Paul McGuire 09-03-2008 05:43 AM

Re: converting a sed / grep / awk / . . . bash pipe line into python
 
On Sep 2, 12:36 pm, hofer <bla...@dungeon.de> wrote:
> Hi,
>
> Something I have to do very often is filtering / transforming line
> based file contents and storing the result in an array or a
> dictionary.
>
> Very often the functionality already exists in the form of a shell script
> with sed / awk / grep , . . .
> and I would like to have the same implementation in my script.
>


All that sed'ing, grep'ing and awk'ing, you might want to take a look
at pyparsing. Here is a pyparsing take on your posted problem:

from pyparsing import (LineEnd, Word, nums, LineStart, OneOrMore,
                       restOfLine)

test = """

1 2 3
47 23 // this will never match
# blank lines are not of any interest
91 26

23 19

41 1 97 26 // extra numbers don't matter
"""

# define pyparsing expressions to match a line of integers
EOL = LineEnd()
integer = Word(nums)

# by default, pyparsing will implicitly skip over whitespace and
# newlines, so EOL is skipped over by default - this would mix together
# integers on consecutive lines - we only want OneOrMore integers as long
# as they are on the same line, that is, integers with no intervening
# EOL's
line_of_integers = (LineStart() + integer + OneOrMore(~EOL + integer))

# use a parse action to identify the target lines
def select_significant_values(t):
    v1, v2 = map(int, t[:2])
    if v1+v2 == 42:
        print v2
line_of_integers.setParseAction(select_significant_values)

# skip over comments, wherever they are
line_of_integers.ignore( '//' + restOfLine )
line_of_integers.ignore( '#' + restOfLine )

# use the line_of_integers expression to search through the test text
# the parse action will print the matching values
line_of_integers.searchString(test)


-- Paul


Peter Otten 09-03-2008 07:15 AM

Re: converting a sed / grep / awk / . . . bash pipe line into python
 
hofer wrote:

> Something I have to do very often is filtering / transforming line
> based file contents and storing the result in an array or a
> dictionary.
>
> Very often the functionality already exists in the form of a shell script
> with sed / awk / grep , . . .
> and I would like to have the same implementation in my script.
>
> What's a compact, efficient (no intermediate arrays generated /
> regexps compiled only once) way in python
> for this kind of 'pipeline'?
>
> Example 1 (in bash), annotated with comments (and thus not working if
> copied / pasted):


> cat file \ ### read from file
> | sed 's/\.\..*//' \ ### remove '//' comments
> | sed 's/#.*//' \ ### remove '#' comments
> | grep -v '^\s*$' \ ### get rid of empty lines
> | awk '{ print $1 + $2 " " $2 }' \ ### knowing that all remaining lines always contain at least
> \ ### two integers, calculate sum and 'keep' second number
> | grep '^42 ' ### keep lines for which sum is 42
> | awk '{ print $2 }' ### print number
> thanks in advance for any suggestions of how to code this (keeping the
> comments)


for line in open("file"):  # read from file
    try:
        a, b = map(int, line.split(None, 2)[:2])  # remove extra columns,
                                                  # convert to integer
    except ValueError:
        pass  # remove comments, get rid of empty lines,
              # skip lines with less than two integers
    else:
        # line did start with two integers
        if a + b == 42:  # keep lines for which the sum is 42
            print b  # print number

The hard part was keeping the comments ;)

Without them it looks better:

import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b

Peter
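
Since the original question asked for the result to end up in an array or a dictionary rather than on stdout, the loop above could be wrapped in a generator. A minimal sketch in Python 3 syntax; matching_pairs is a hypothetical name, not something from the thread, and the sample lines are invented.

```python
def matching_pairs(lines, target=42):
    """Yield (a, b) for lines that start with two integers summing to target."""
    for line in lines:
        try:
            a, b = map(int, line.split(None, 2)[:2])
        except ValueError:
            continue  # comment, blank line, or fewer than two integers
        if a + b == target:
            yield a, b


sample = ["# comment\n", "\n", "23 19 // trailing\n", "7\n", "41 1 97\n"]
numbers = [b for a, b in matching_pairs(sample)]  # list of second numbers
by_first = dict(matching_pairs(sample))           # first number -> second number
print(numbers, by_first)  # prints [19, 1] {23: 19, 41: 1}
```

Because the generator accepts any iterable of lines, the same function works on an open file, sys.stdin, or a list in a test.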

Roy Smith 09-03-2008 11:54 AM

Re: converting a sed / grep / awk / . . . bash pipe line into python
 
In article <g9ldi5$2ea$03$1@news.t-online.com>,
Peter Otten <__peter__@web.de> wrote:

> Without them it looks better:
>
> import sys
> for line in sys.stdin:
>     try:
>         a, b = map(int, line.split(None, 2)[:2])
>     except ValueError:
>         pass
>     else:
>         if a + b == 42:
>             print b


I'm philosophically opposed to one-liners like:

> a, b = map(int, line.split(None, 2)[:2])


because they're difficult to understand at a glance. You need to visually
parse it and work your way out from the inside to figure out what's going
on. Better to keep it longer and simpler.

Now that I've got my head around it, I realized there's no reason to make
the split part so complicated. No reason to limit how many splits get done
if you're explicitly going to slice the first two. And since you don't
need to supply the second argument, the first one can be defaulted as well.
So, you immediately get down to:

> a, b = map(int, line.split()[:2])


which isn't too bad. I might take it one step further, however, and do:

> fields = line.split()[:2]
> a, b = map(int, fields)


in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

> a, b = line.split()[:2]
> a = int(a)
> b = int(b)
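
The simplification argued for here can be checked directly: once the result is sliced to two items, the maxsplit argument does not change what comes out. A quick sketch with made-up sample lines:

```python
samples = ["23 19\n", "41 1 97 26 // extra\n", "  7   35\tx\n", "\n", "5\n"]

for line in samples:
    limited = line.split(None, 2)[:2]  # at most two splits, then slice
    plain = line.split()[:2]           # split everything, then slice
    assert limited == plain

print([line.split()[:2] for line in samples])
# prints [['23', '19'], ['41', '1'], ['7', '35'], [], ['5']]
```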


Peter Otten 09-03-2008 12:19 PM

Re: converting a sed / grep / awk / . . . bash pipe line into python
 
Roy Smith wrote:

> In article <g9ldi5$2ea$03$1@news.t-online.com>,
> Peter Otten <__peter__@web.de> wrote:
>
>> Without them it looks better:
>>
>> import sys
>> for line in sys.stdin:
>>     try:
>>         a, b = map(int, line.split(None, 2)[:2])
>>     except ValueError:
>>         pass
>>     else:
>>         if a + b == 42:
>>             print b

>
> I'm philosophically opposed to one-liners


I'm not, as long as you don't /force/ the code into one line.

> like:
>
>> a, b = map(int, line.split(None, 2)[:2])

>
> because they're difficult to understand at a glance. You need to visually
> parse it and work your way out from the inside to figure out what's going
> on. Better to keep it longer and simpler.
>
> Now that I've got my head around it, I realized there's no reason to make
> the split part so complicated. No reason to limit how many splits get
> done
> if you're explicitly going to slice the first two. And since you don't
> need to supply the second argument, the first one can be defaulted as
> well. So, you immediately get down to:
>
>> a, b = map(int, line.split()[:2])


I agree that the above is an improvement.

> which isn't too bad. I might take it one step further, however, and do:
>
>> fields = line.split()[:2]
>> a, b = map(int, fields)

>
> in fact, I might even get rid of the very generic, but conceptually
> overkill, use of map() and just write:
>
>> a, b = line.split()[:2]
>> a = int(a)
>> b = int(b)


If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Peter

bearophileHUGS@lycos.com 09-03-2008 12:23 PM

Re: converting a sed / grep / awk / . . . bash pipe line into python
 
Roy Smith:
> No reason to limit how many splits get done if you're
> explicitly going to slice the first two.


You are probably right for this problem, because most lines are 2
items long, but in scripts that have to process lines potentially
composed of many parts, setting a maximum number of splits speeds up your
script and reduces the memory used, because you end up with fewer parts.

Bye,
bearophile
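
To make the point about fewer parts concrete: with a split limit, everything after the last split stays in one unsplit remainder instead of becoming separate strings. A small sketch with a made-up long line; whether the difference matters in practice would have to be measured, for example with timeit.

```python
line = " ".join(str(n) for n in range(1000)) + "\n"

all_parts = line.split()         # one string object per field
few_parts = line.split(None, 2)  # two fields plus one unsplit remainder

print(len(all_parts), len(few_parts))  # prints 1000 3
print(few_parts[:2])                   # prints ['0', '1']
```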

Roy Smith 09-03-2008 01:35 PM

Re: converting a sed / grep / awk / . . . bash pipe line into python
 
In article <g9lvc5$8qq$03$1@news.t-online.com>,
Peter Otten <__peter__@web.de> wrote:

> > I might take it one step further, however, and do:
> >
> >> fields = line.split()[:2]
> >> a, b = map(int, fields)

> >
> > in fact, I might even get rid of the very generic, but conceptually
> > overkill, use of map() and just write:
> >
> >> a, b = line.split()[:2]
> >> a = int(a)
> >> b = int(b)

>
> If you go that route your next step is to introduce another try...except,
> one for the unpacking and another for the integer conversion...


Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.
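
Both failure modes mentioned here do surface as ValueError, which is why a single except clause covers them. A minimal sketch in Python 3 syntax; the helper name parse_pair is made up for illustration.

```python
def parse_pair(line):
    a, b = line.split()[:2]  # too few fields -> ValueError from unpacking
    a = int(a)               # non-numeric field -> ValueError from int()
    b = int(b)
    return a, b


for bad in ["7\n", "# comment\n"]:
    try:
        parse_pair(bad)
    except ValueError as exc:
        print(repr(bad), "->", type(exc).__name__)
```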

Roy Smith 09-03-2008 01:41 PM

Re: converting a sed / grep / awk / . . . bash pipe line into python
 
In article
<7f2d4b4a-bc97-4b46-a31e-63f98e9fee73@34g2000hsh.googlegroups.com>,
bearophileHUGS@lycos.com wrote:

> Roy Smith:
> > No reason to limit how many splits get done if you're
> > explicitly going to slice the first two.

>
> You are probably right for this problem, because most lines are 2
> items long, but in scripts that have to process lines potentially
> composed of many parts, setting a maximum number of splits speeds up your
> script and reduces the memory used, because you end up with fewer parts.
>
> Bye,
> bearophile


Sounds like premature optimization to me. Make it work and be easy to
understand first. Then worry about how fast it is.

But, along those lines, I've often thought that split() needed a way to not
just limit the number of splits, but to also throw away the extra stuff.
Getting the first N fields of a string is something I've done often enough
that refactoring the slicing operation right into the split() code seems
worthwhile. And, it would be even faster :-)
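
No such option exists on str.split(), but the behaviour described can be approximated with a small wrapper. A sketch only; first_fields is a hypothetical helper, not part of the standard library.

```python
def first_fields(line, n):
    """Return the first n whitespace-separated fields, discarding the rest."""
    # maxsplit=n yields up to n fields plus one remainder; the slice drops it.
    return line.split(None, n)[:n]


print(first_fields("41 1 97 26 // extra numbers\n", 2))  # prints ['41', '1']
print(first_fields("5\n", 2))                            # prints ['5']
```

A line with fewer than n fields simply returns what it has, so the caller still has to check the length or rely on unpacking to fail.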

Peter Otten 09-03-2008 02:18 PM

Re: converting a sed / grep / awk / . . . bash pipe line into python
 
Roy Smith wrote:

> In article <g9lvc5$8qq$03$1@news.t-online.com>,
> Peter Otten <__peter__@web.de> wrote:
>
>> > I might take it one step further, however, and do:
>> >
>> >> fields = line.split()[:2]
>> >> a, b = map(int, fields)
>> >
>> > in fact, I might even get rid of the very generic, but conceptually
>> > overkill, use of map() and just write:
>> >
>> >> a, b = line.split()[:2]
>> >> a = int(a)
>> >> b = int(b)

>>
>> If you go that route your next step is to introduce another try...except,
>> one for the unpacking and another for the integer conversion...

>
> Why another try/except? The potential unpack and conversion errors exist
> in both versions, and the existing try block catches them all. Splitting
> the one line up into three with some intermediate variables doesn't change
> that.


As I understood it you didn't just split a line of code into three, but
wanted two processing steps. These logical steps are then somewhat remixed
by the shared error handling. You lose the information which step failed.
In the general case you may even mask a bug.

Peter
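
The distinction drawn here, one handler per logical step, might look like the following in Python 3 syntax. This is only a sketch of the idea; the function name classify and the returned reason strings are invented for illustration.

```python
def classify(line):
    """Return (a, b) on success, or a string naming the step that failed."""
    fields = line.split()[:2]
    try:
        a, b = fields          # step 1: unpacking
    except ValueError:
        return "too few fields"
    try:
        return int(a), int(b)  # step 2: integer conversion
    except ValueError:
        return "not an integer"


for line in ["23 19\n", "7\n", "# comment\n"]:
    print(repr(line), "->", classify(line))
```

With a single shared handler the last two lines would be indistinguishable; here each failure is attributed to the step that raised it.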

