Velocity Reviews

Bob Gailer 07-12-2003 09:06 PM

Re: Newbie with sort text file question
 
At 12:46 PM 7/12/2003 -0700, stuartc wrote:

>Hi:
>
>I'm not a total newbie, but I'm pretty green. I need to sort a text
>file and then get a total for the number of occurrences of a part of
>the string. Hopefully, this will explain it better:
>
>Here's the text file:
>
>banana_c \\yellow
>apple_a \\green
>orange_b \\yellow
>banana_d \\green
>orange_a \\orange
>apple_w \\yellow
>banana_e \\green
>orange_x \\yellow
>orange_y \\orange
>
>I would like two output files:
>
>1) Sorted like this, by the fruit name (the name before the underscore)
>
>apple_a \\green
>apple_w \\yellow
>banana_c \\yellow
>banana_d \\green
>banana_e \\green
>orange_a \\orange
>orange_b \\yellow
>orange_x \\yellow
>orange_y \\orange
>
>2) Then summarized like this, ordered with the highest occurrences
>first:
>
>orange occurs 4
>banana occurs 3
>apple occurs 2
>
>Total occurrences is 9


I am developing a Python version of IBM's CMS Pipelines, which is designed
for this kind of task. If you'd like to be an early recipient (read beta
tester) of this product, let me know.

You would invoke this program:
Pipe("""
< c:\input.txt
| split /_/
| nlocate -//-
| pad 10
| sort count
| spec 11-* 1 / occurs / 11 1-10 19
| > c:\output1.txt
| count
| spec /Total occurrences is / 1 1-* 21
| > c:\output2.txt""")

Explanation:
| == separates each stage of the pipe
< == read records from file
split == split each record into 2 records at first _
nlocate == select records that do not contain //
pad 10 == ensure each record has 10 characters (or whatever the longest
fruit name is)
sort count == sort; group by unique key and prepend count
spec ... == select cols 11-end of input, append literal, append cols
1-10
> == write records to file

spec ... == start with literal, append rest of record
> == write records to file
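
To make the stage descriptions above concrete, here is a rough sketch of the same idea in plain Python generators. It is not the Pipe product and not the CMS Pipelines API: the function names split_at, nlocate and sort_count are made up for illustration, the pad and spec stages are replaced by ordinary string formatting, and the marker used to discard the second half of each record is the double backslash from the sample data rather than the // in the spec.

```python
def split_at(records, sep):
    """Yield both halves of each record, split at the first sep."""
    for record in records:
        for part in record.split(sep, 1):
            yield part


def nlocate(records, text):
    """Keep only the records that do not contain text."""
    return (record for record in records if text not in record)


def sort_count(records):
    """Group equal records and yield (count, record) pairs in sorted order."""
    counts = {}
    for record in records:
        counts[record] = counts.get(record, 0) + 1
    for record in sorted(counts):
        yield counts[record], record


# A few records from the original question (raw strings keep the backslashes).
records = [r"banana_c \\yellow", r"apple_a \\green", r"banana_d \\green"]

# Stages are chained by nesting, innermost first, like reading the pipe left to right.
stages = sort_count(nlocate(split_at(records, "_"), "\\\\"))
for count, fruit in stages:
    print("%s occurs %d" % (fruit, count))
```

Each stage only sees the records handed to it by the previous one, which is the property that lets a pipe specification stay short.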


Or it can be run as a DOS Command:
C>python pipe.py spec.txt
where spec.txt contains the pipe specification

An enhancement to the IBM Pipeline specification for SPLIT will be to route
the 2nd part of each record to the secondary output, effectively discarding
it in this example, and eliminating the need for the NLOCATE stage.

This particular task can also be done fairly easily in Python. The appeal
of Pipe is that you focus on the specification rather than writing Python
code that is specific to the task. This shortens development time, and
enhances readability and maintainability.

The Python version:

input = file('c:\input.txt')
fruits = {} # a dictionary to hold each fruit and its count
lines = input.readlines()
for line in lines:
    fruit = line.split('_', 1)[0]
    if fruit in fruits:
        fruits[fruit] += 1 # increment count
    else:
        fruits[fruit] = 1 # add to dictionary with count of 1
output1 = file('c:\output1.txt', 'w')
for key, value in fruits.items():
    output1.write("%s occurs %s\n" % (key, value))
output1.close()
output2 = file('c:\output2.txt', 'w')
output2.write("Total occurrences is %s\n" % len(lines))
output2.close()

Bob Gailer
bgailer@alum.rpi.edu
303 442 2625



Andrew Dalke 07-13-2003 09:12 PM

Re: Newbie with sort text file question
 
Bob Gailer:
> [Pipeline]


Huh. Hadn't heard of that one before. Thanks for the pointer.
(And overall, nice post!)

> The Python version:


Some stylistic comments

> input = file('c:\input.txt')


Since 'input' is a builtin, I use 'infile'. That's only a preference of
mine.
For the OP, you'll need 'c:\\input.txt' because the '\' has special meaning
inside of a string so must be escaped.
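
As a small illustration of the escaping point, the sketch below compares three spellings of the same Windows path. It assumes a current Python; the file name new.txt is invented for the example. The paths in the code under review happen to work because "\i" and "\o" are not recognized escape sequences, but a name starting with "n" or "t" would silently turn into a newline or tab.

```python
# Three spellings of the same Windows path.
escaped = 'c:\\input.txt'   # backslash escaped by doubling it
raw = r'c:\input.txt'       # raw string: the backslash is taken literally
forward = 'c:/input.txt'    # forward slashes are generally accepted on Windows too

print(escaped == raw)       # True

# Why escaping matters: '\n' is a newline, so the unescaped spelling
# silently produces a different (and shorter) string.
broken = 'c:\new.txt'
fixed = 'c:\\new.txt'
print(len(broken), len(fixed))   # 9 10
```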

> fruits = {} # a dictionary to hold each fruit and its count
> lines = input.readlines()
> for line in lines:


Since you are using Python 2.2 (later you use "if fruit in fruits",
and "__contains__" support for dicts wasn't added until Python 2.2, I
think, and the 'file' usage is also new), this is best written as

for line in input:

>     fruit = line.split('_', 1)[0]


>     if fruit in fruits:
>         fruits[fruit] += 1 # increment count
>     else:
>         fruits[fruit] = 1 # add to dictionary with count of 1


Here's a handy idiom for what you want

fruits[fruit] = fruits.get(fruit, 0) + 1
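
Here is the get() idiom applied to the sample data from the original question, as a minimal sketch. The helper name count_fruits is made up for the example, and the records are held in a list rather than read from a file.

```python
def count_fruits(lines):
    """Count how often each fruit name (the text before the first "_") occurs."""
    counts = {}
    for line in lines:
        fruit = line.split("_", 1)[0]
        # get() returns 0 for a fruit not seen yet, so no if/else is needed
        counts[fruit] = counts.get(fruit, 0) + 1
    return counts


# Raw strings keep the two backslashes exactly as they appear in the text file.
sample = [
    r"banana_c \\yellow",
    r"apple_a \\green",
    r"orange_b \\yellow",
    r"banana_d \\green",
    r"orange_a \\orange",
    r"apple_w \\yellow",
    r"banana_e \\green",
    r"orange_x \\yellow",
    r"orange_y \\orange",
]

print(count_fruits(sample))
```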

> output1 = file('c:\output1.txt', 'w')
> for key, value in fruits.items():
>     output1.write("%s occurs %s\n" % (key, value))
> output1.close()
> output2 = file('c:\output2.txt', 'w')
> output2.write("Total occurrences is %s\n" % len(lines))
> output2.close()


That's missing some sorts, so I don't think it meets the OP's
requirements.

How about this?

infile = open("input.txt")
lines = []
counts = {}
for line in infile:
    lines.append(line)
    fruit = line.split("_", 1)[0]
    # get() needs the default of 0 for a fruit not seen before
    counts[fruit] = counts.get(fruit, 0) + 1
infile.close()

# Sort by name. Note that "_" sorts after uppercase letters but
# before lowercase ones, so "PLUM_" would be placed *after*
# "PLUMBAGO_", which is probably not what you want. Left as an
# exercise :)
lines.sort()
outfile = open("output1.txt", "w")
for line in lines:
    outfile.write(line)
outfile.close()

# Print counts from highest count to lowest
count_data = [(n, fruit) for (fruit, n) in counts.items()]
count_data.sort()
count_data.reverse()  # sort() is ascending, so flip it
outfile = open("output2.txt", "w")
total = 0
for n, fruit in count_data:
    outfile.write("%s occurs %s\n" % (fruit, n))
    total += n
outfile.write("\nTotal occurrences: %s\n" % total)
outfile.close()
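
One possible take on the sorting exercise, as a sketch only. It assumes a current Python in which sorted() accepts a key function (that is not available in the 2.2 release discussed in this thread), and the helper name sort_key and the PLUM/PLUMBAGO records are invented for the example. The ordering problem shows up when one name is a prefix of another and the next character sorts before "_", as uppercase letters do.

```python
def sort_key(line):
    """Split at the first "_" so the fruit name is compared on its own."""
    fruit, _, rest = line.partition("_")
    return (fruit, rest)


lines = [r"PLUMBAGO_a \\blue", r"PLUM_b \\purple", r"APPLE_a \\green"]

# Plain string comparison puts PLUMBAGO before PLUM, because "B" sorts before "_".
print(sorted(lines))

# Comparing the name separately puts the shorter name first.
print(sorted(lines, key=sort_key))
```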

Andrew
dalke@dalkescientific.com



Bob Gailer 07-14-2003 01:10 PM

Re: Newbie with sort text file question
 
At 03:12 PM 7/13/2003 -0600, Andrew Dalke wrote:

>Bob Gailer:
> > [Pipeline]

>
>Huh. Hadn't heard of that one before. Thanks for the pointer.


Since I am developing the Python version of Pipeline, I wonder if you have
any interest in it? Would you like to be an early recipient?

>(And overall, nice post!)
>
> > The Python version:

>
>Some stylistic comments
>
> > input = file('c:\input.txt')

>
>Since 'input' is a builtin, I use 'infile'.


Agree. When I'm in a hurry I let details slip.

>For the OP, you'll need 'c:\\input.txt' because the '\' has special meaning
>inside of a string so must be escaped.


Agree. When I'm in a hurry I let details slip.

> > fruits = {} # a dictionary to hold each fruit and its count
> > lines = input.readlines()
> > for line in lines:

>
>Since you are using Python 2.2 (later you use "if fruit in fruits",
>and "__contains__" support for dicts wasn't added until Python 2.2, I
>think, and the 'file' usage is also new), this is best written as
>
> for line in input:


Agree.

> >     fruit = line.split('_', 1)[0]

>
> >     if fruit in fruits:
> >         fruits[fruit] += 1 # increment count
> >     else:
> >         fruits[fruit] = 1 # add to dictionary with count of 1

>
>Here's a handy idiom for what you want
>
> fruits[fruit] = fruits.get(fruit, 0) + 1


Don't you want setdefault() instead of get()?

> > output1 = file('c:\output1.txt', 'w')
> > for key, value in fruits.items():
> >     output1.write("%s occurs %s\n" % (key, value))
> > output1.close()
> > output2 = file('c:\output2.txt', 'w')
> > output2.write("Total occurrences is %s\n" % len(lines))
> > output2.close()

>
>That's missing some sorts, so I don't think it meets the OP's requirements.


The only reason for sort that I could see was to group things for counting.
The output appears sorted descending, but that order was not specified, so
I assumed random output.
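
For comparison, the original question did ask for the summary "ordered with the highest occurrences first". A minimal sketch of producing that ordering follows. It assumes a current Python with sorted() and key functions, and the helper name format_summary is made up for the example.

```python
def format_summary(counts):
    """Return the summary text, highest count first, ties broken by name."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    lines = ["%s occurs %d" % (fruit, n) for fruit, n in ordered]
    lines.append("")
    lines.append("Total occurrences is %d" % sum(counts.values()))
    return "\n".join(lines) + "\n"


print(format_summary({"apple": 2, "banana": 3, "orange": 4}))
```

Negating the count in the key sorts the counts descending while keeping the names ascending, which a single reverse=True could not do.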

[snip]

Bob Gailer
bgailer@alum.rpi.edu
303 442 2625



Bengt Richter 07-14-2003 06:54 PM

Re: Newbie with sort text file question
 
On Mon, 14 Jul 2003 07:10:39 -0600, Bob Gailer <bgailer@alum.rpi.edu> wrote:
[...]
>>
>>Here's a handy idiom for what you want
>>
>> fruits[fruit] = fruits.get(fruit, 0) + 1

>
>Don't you want setdefault() instead of get()?
>

In this case that would set the value twice: once inside setdefault() on the
right of the '=' and again through the assignment on the left. setdefault() is
more useful when the key's associated value is a mutable object, such as a list
of things that you append to. You could use a length-1 list here
(initialized by the default to hold the count starting value of 0), e.g.,

fruits.setdefault(fruit,[0])[0]+=1

and then later retrieve the actual count as

fruits[fruit][0]
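
To put the two methods side by side, here is a short sketch. The variable names and the three sample records are invented for the example; only dict.get() and dict.setdefault() themselves come from the discussion above.

```python
words = ["apple_a", "banana_c", "apple_w"]

# Counting: the value is an immutable int, so get() plus reassignment fits.
counts = {}
for word in words:
    fruit = word.split("_", 1)[0]
    counts[fruit] = counts.get(fruit, 0) + 1

# Grouping: the value is a mutable list, so setdefault() can create it on
# first sight and hand it back for appending in the same expression.
groups = {}
for word in words:
    fruit, suffix = word.split("_", 1)
    groups.setdefault(fruit, []).append(suffix)

# The length-1 list trick: the list is a mutable holder for the count.
boxed = {}
for word in words:
    fruit = word.split("_", 1)[0]
    boxed.setdefault(fruit, [0])[0] += 1

print(counts)
print(groups)
print(boxed)
```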

Regards,
Bengt Richter

