Re: Newbie with sort text file question

 
 
Bob Gailer
 
      07-12-2003
At 12:46 PM 7/12/2003 -0700, stuartc wrote:

>Hi:
>
>I'm not a total newbie, but I'm pretty green. I need to sort a text
>file and then get a total for the number of occurances for a part of
>the string. Hopefully, this will explain it better:
>
>Here's the text file:
>
>banana_c \\yellow
>apple_a \\green
>orange_b \\yellow
>banana_d \\green
>orange_a \\orange
>apple_w \\yellow
>banana_e \\green
>orange_x \\yellow
>orange_y \\orange
>
>I would like two output files:
>
>1) Sorted like this, by the fruit name (the name before the dash)
>
>apple_a \\green
>apple_w \\yellow
>banana_c \\yellow
>banana_d \\green
>banana_e \\green
>orange_a \\orange
>orange_b \\yellow
>orange_x \\yellow
>orange_y \\orange
>
>2) Then summarized like this, ordered with the highest occurances
>first:
>
>orange occurs 4
>banana occurs 3
>apple occurs 2
>
>Total occurances is 9


I am developing a Python version of IBM's CMS Pipelines, which is designed
for this kind of task. If you'd like to be an early recipient (read beta
tester) of this product, let me know.

You would invoke this program:
Pipe("""
< c:\input.txt
| split /_/
| nlocate -//-
| pad 10
| sort count
| spec 11-* 1 / occurs / 11 1-10 19
| > c:\output1.txt
| count
| spec /Total occurrences is / 1 1-* 21
| > c:\output2.txt""")

Explanation:
| == separates each stage of the pipe
< == read records from file
split == split each record into 2 records at first _
nlocate == select records that do not contain //
pad 10 == ensure each record has 10 characters (or whatever the longest fruit name is)
sort count == sort; group by unique key and prepend count
spec ... == select cols 11-end of input, append literal, append cols 1-10
> == write records to file

spec ... == start with literal, append rest of record
> == write records to file


Or it can be run as a DOS Command:
C>python pipe.py spec.txt
where spec.txt contains the pipe specification

An enhancement to the IBM Pipeline specification for SPLIT will be to route
the 2nd part of each record to the secondary output, effectively discarding
it in this example, and eliminating the need for the NLOCATE stage.
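
The stage-by-stage flow described above can be approximated in plain Python with generator functions. This is only a rough sketch of the idea, not the Pipe implementation: the function names (split_at, nlocate, sort_count) are hypothetical, and it assumes the colour part of each record is marked by two backslashes, as in the sample data.

```python
# Hypothetical stage functions that mimic the stages described above.
# They are not part of Pipe or of CMS Pipelines.

def split_at(records, sep):
    """Split each record into two records at the first occurrence of sep."""
    for record in records:
        head, found, tail = record.partition(sep)
        yield head
        if found:
            yield tail

def nlocate(records, text):
    """Pass on only the records that do not contain text."""
    for record in records:
        if text not in record:
            yield record

def sort_count(records):
    """Group identical records and yield (count, record) in sorted order."""
    counts = {}
    for record in records:
        counts[record] = counts.get(record, 0) + 1
    for record in sorted(counts):
        yield counts[record], record

# Three sample records; raw strings keep the backslashes as typed.
lines = [r"banana_c \\yellow", r"apple_a \\green", r"banana_d \\green"]

# Each stage consumes the output of the previous one, like "|" in the pipe.
summary = list(sort_count(nlocate(split_at(lines, "_"), r"\\")))
for count, fruit in summary:
    print("%s occurs %s" % (fruit, count))
```

Each function only knows about records coming in and records going out, which is what lets the stages be rearranged without touching their bodies.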

This particular task can also be done fairly easily in Python. The appeal
of Pipe is that you focus on the specification rather than writing Python
code that is specific to the task. This shortens development time, and
enhances readability and maintainability.

The Python version:

input = file('c:\input.txt')
fruits = {} # a dictionary to hold each fruit and its count
lines = input.readlines()
for line in lines:
    fruit = line.split('_', 1)[0]
    if fruit in fruits:
        fruits[fruit] += 1 # increment count
    else:
        fruits[fruit] = 1 # add to dictionary with count of 1
output1 = file('c:\output1.txt', 'w')
for key, value in fruits.items():
    output1.write("%s occurs %s\n" % (key, value))
output1.close()
output2 = file('c:\output2.txt', 'w')
output2.write("Total occurrences is %s\n" % len(lines))
output2.close()

Bob Gailer

303 442 2625


---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.500 / Virus Database: 298 - Release Date: 7/10/2003

 
 
 
 
 
Andrew Dalke
 
      07-13-2003
Bob Gailer:
> [Pipeline]


Huh. Hadn't heard of that one before. Thanks for the pointer.
(And overall, nice post!)

> The Python version:


Some stylistic comments

> input = file('c:\input.txt')


Since 'input' is a builtin, I use 'infile'. That's only a preference of
mine.
For the OP, you'll need 'c:\\input.txt' because the '\' has special meaning
inside of a string so must be escaped.
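
A short illustration of the escaping point, as a sketch only. The file names are illustrative, and the raw-string form is an alternative that is not mentioned in the post.

```python
# Two equivalent spellings of the same Windows path.
escaped = 'c:\\input.txt'   # backslash written as the escape sequence "\\"
raw = r'c:\input.txt'       # raw string: backslashes are kept as typed

# A path where forgetting to escape changes the string silently:
trap = 'c:\temp.txt'        # "\t" is read as a TAB character

print(escaped == raw)       # True
print(len(escaped))         # 12
print('\t' in trap)         # True
```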

> fruits = {} # a dictionary to hold each fruit and its count
> lines = input.readlines()
> for line in lines:


Since you are using Python 2.2 (later you use "if fruit in fruits",
and "in" support for dicts, via "__contains__", wasn't added until
Python 2.2, I think, and the 'file' usage is also new), this is best written as

for line in input:

> fruit = line.split('_', 1)[0]


> if fruit in fruits:
> fruits[fruit] += 1 # increment count
> else:
> fruits[fruit] = 1 # add to dictionary with count of 1


Here's a handy idiom for what you want

fruits[fruit] = fruits.get(fruit, 0) + 1

> output1 = file('c:\output1.txt', 'w')
> for key, value in fruits.items():
> output1.write("%s occurs %s\n" % (key, value))
> output1.close()
> output2 = file('c:\output2.txt', 'w')
> output2.write("Total occurrences is %s\n" % len(lines))
> output2.close()


That's missing some sorts, so I don't think it meets the OP's
requirements.

How about this?

infile = open("input.txt")
lines = []
counts = {}
for line in infile:
    lines.append(line)
    fruit = line.split("_", 1)[0]
    counts[fruit] = counts.get(fruit, 0) + 1  # default of 0 for a new fruit
infile.close()

# Sort by line. In ASCII, "_" sorts after uppercase letters but before
# lowercase ones, so "PLUM_" would be placed *after* "PLUMBAGO_"
# (all-lowercase names are unaffected), which is probably not what
# you want. Left as an exercise
lines.sort()
outfile = open("output1.txt", "w")
for line in lines:
    outfile.write(line)
outfile.close()

# Print counts from highest count to lowest
count_data = [(n, fruit) for (fruit, n) in counts.items()]
count_data.sort()
count_data.reverse()  # sort() is ascending; reverse to get highest first
outfile = open("output2.txt", "w")
total = 0
for n, fruit in count_data:
    outfile.write("%s occurs %s\n" % (fruit, n))
    total += n
outfile.write("\nTotal occurrences: %s\n" % total)
outfile.close()

Andrew
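
One possible approach to the exercise left in the code comment above is to sort on the part before the underscore rather than on the whole line. This is a sketch assuming a Python version that has sorted() with a key argument (newer than the 2.2 discussed in the thread). The names fruit_name and sample_lines are hypothetical, and the sample data is made up to trigger the uppercase case.

```python
def fruit_name(line):
    """Return the part of the line before the first underscore."""
    return line.split("_", 1)[0]

# Made-up uppercase names, chosen because "_" sorts after "B" in ASCII.
sample_lines = [r"PLUM_b \\red", r"PLUMBAGO_a \\blue", r"PLUM_a \\red"]

# Plain sort compares whole lines, so PLUMBAGO lands before PLUM.
plain = sorted(sample_lines)

# Sorting on (name, line) groups each fruit first, then orders within it.
by_name = sorted(sample_lines, key=lambda line: (fruit_name(line), line))

# Counts ordered highest first, with ties broken alphabetically.
counts = {}
for line in sample_lines:
    name = fruit_name(line)
    counts[name] = counts.get(name, 0) + 1
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(plain[0])    # PLUMBAGO_a \\blue
print(by_name[0])  # PLUM_a \\red
print(ranked)      # [('PLUM', 2), ('PLUMBAGO', 1)]
```

Negating the count in the key gives descending counts while keeping names ascending within a tie, which a plain reversed sort would not do.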



 
 
 
 
 
Bob Gailer
 
      07-14-2003
At 03:12 PM 7/13/2003 -0600, Andrew Dalke wrote:

>Bob Gailer:
> > [Pipeline]

>
>Huh. Hadn't heard of that one before. Thanks for the pointer.


Since I am developing the Python version of Pipeline, I wonder if you have
any interest in it? Would you like to be an early recipient?

>(And overall, nice post!)
>
> > The Python version:

>
>Some stylistic comments
>
> > input = file('c:\input.txt')

>
>Since 'input' is a builtin, I use 'infile'.


Agree. When I'm in a hurry I let details slip.

>For the OP, you'll need 'c:\\input.txt' because the '\' has special meaning
>inside of a string so must be escaped.


Agree. When I'm in a hurry I let details slip.

> > fruits = {} # a dictionary to hold each fruit and its count
> > lines = input.readlines()
> > for line in lines:

>
>Since you are using Python 2.2 (later you use "if fruit in fruits",
>and "in" support for dicts, via "__contains__", wasn't added until
>Python 2.2, I think, and the 'file' usage is also new), this is best written as
>
> for line in input:


Agree.

> > fruit = line.split('_', 1)[0]

>
> > if fruit in fruits:
> > fruits[fruit] += 1 # increment count
> > else:
> > fruits[fruit] = 1 # add to dictionary with count of 1

>
>Here's a handy idiom for what you want
>
> fruits[fruit] = fruits.get(fruit, 0) + 1


Don't you want setdefault() instead of get()?

> > output1 = file('c:\output1.txt', 'w')
> > for key, value in fruits.items():
> > output1.write("%s occurs %s\n" % (key, value))
> > output1.close()
> > output2 = file('c:\output2.txt', 'w')
> > output2.write("Total occurrences is %s\n" % len(lines))
> > output2.close()

>
>That's missing some sorts, so I don't think it meets the OP's requirements.


The only reason for sort that I could see was to group things for counting.
The output appears sorted descending, but that order was not specified, so
I assumed random output.

[snip]

Bob Gailer

303 442 2625



 
 
Bengt Richter
 
      07-14-2003
On Mon, 14 Jul 2003 07:10:39 -0600, Bob Gailer wrote:
[...]
>>
>>Here's a handy idiom for what you want
>>
>> fruits[fruit] = fruits.get(fruit, 0) + 1

>
>Don't you want setdefault() instead of get()?
>

In this case that would set the value twice for a new key: once by setdefault() on the
right of the '=' and once by the assignment on the left. setdefault() is more useful when
the key's associated value is a mutable that you maintain in place, such as a list of things
that you append to. You could use a length-1 list here (initialized by the default to hold
the count's starting value of 0), e.g.,

fruits.setdefault(fruit,[0])[0]+=1

and then later retrieve the actual count as

fruits[fruit][0]

Regards,
Bengt Richter
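
The two idioms discussed above can be compared side by side. This is a small sketch with made-up data; the variable names (words, counts_get, counts_box, positions) are hypothetical.

```python
words = ["orange", "banana", "orange", "apple", "orange"]

# get(): look up with a default of 0, then assign the new count back.
counts_get = {}
for word in words:
    counts_get[word] = counts_get.get(word, 0) + 1

# setdefault() with a one-element list: the list is stored on first use
# and then updated in place, so no second assignment to the key is needed.
counts_box = {}
for word in words:
    counts_box.setdefault(word, [0])[0] += 1

# setdefault() with a growing list, the case where it is most natural:
# record every position at which each word occurs.
positions = {}
for index, word in enumerate(words):
    positions.setdefault(word, []).append(index)

print(counts_get["orange"])      # 3
print(counts_box["orange"][0])   # 3
print(positions["orange"])       # [0, 2, 4]
```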
 