# Simple Text Processing Help

patrick.waldo@gmail.com

 10-14-2007
Hi all,

I started Python just a little while ago and I am stuck on something
that is really simple, but I just can't figure out.

Essentially I need to take a text document with some chemical
information in Czech and organize it into another text file. The
information is always EINECS number, CAS, chemical name, and formula
in tables. I need to organize them into lines with | in between. So
it goes from:

200-763-1 71-73-8
nátrium-tiopentál C11H18N2O2S.Na to:

200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

but if I have a chemical like: kyselina močová

I get:
200-720-7|69-93-2|kyselina|močová
|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

and then it is all off.

How can I get Python to realize that a chemical name may have a space
in it?

Thank you,
Patrick

So far I have:

#take tables in one text file and organize them into lines in another

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

#read and enter into a list
chem_file = []
chem_file.append(input.read())

#split words and store them in a list
for word in chem_file:
    words = word.split()

#starting values in list
e=0 #EINECS
c=1 #CAS
ch=2 #chemical name
f=3 #formula

n=0
loop=1
x=len(words) #counts how many words there are in the file

print '-'*100
while loop==1:
    if n<x and f<=x:
        print words[e], '|', words[c], '|', words[ch], '|', words[f], '\n'
        output.write(words[e])
        output.write('|')
        output.write(words[c])
        output.write('|')
        output.write(words[ch])
        output.write('|')
        output.write(words[f])
        output.write('\r\n')
        #increase variables by 4 to get next set
        e = e + 4
        c = c + 4
        ch = ch + 4
        f = f + 4
        # increase by 1 to repeat
        n=n+1
    else:
        loop=0

input.close()
output.close()

Marc 'BlackJack' Rintsch

 10-14-2007
On Sun, 14 Oct 2007 13:48:51 +0000, patrick.waldo wrote:

> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file. The
> information is always EINECS number, CAS, chemical name, and formula
> in tables. I need to organize them into lines with | in between. So
> it goes from:
>
> 200-763-1 71-73-8
> nátrium-tiopentál C11H18N2O2S.Na to:

Is that in *one* line in the input file or two lines like shown here?

> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová
>
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off.
>
> How can I get Python to realize that a chemical name may have a space
> in it?

If the two elements before and the one element after the name can't
contain spaces it is easy: take the first two and the last as it is and
for the name take from the third to the next to last element = the name
and join them with a space.

In [202]: parts = '123 456 a name with spaces 789'.split()

In [203]: parts[0]
Out[203]: '123'

In [204]: parts[1]
Out[204]: '456'

In [205]: ' '.join(parts[2:-1])
Out[205]: 'a name with spaces'

In [206]: parts[-1]
Out[206]: '789'

This works too if the name doesn't have a space in it:

In [207]: parts = '123 456 name 789'.split()

In [208]: parts[0]
Out[208]: '123'

In [209]: parts[1]
Out[209]: '456'

In [210]: ' '.join(parts[2:-1])
Out[210]: 'name'

In [211]: parts[-1]
Out[211]: '789'
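
The same idea can be wrapped in a small function. This is only a sketch in Python 3 syntax (the thread itself uses Python 2), and the name split_record is made up for illustration. It assumes, as above, that the first two fields and the last field never contain spaces.

```python
def split_record(line):
    """Split one record into (einecs, cas, name, formula).

    Assumes the first two fields and the last field contain no spaces,
    so everything in between belongs to the name.
    """
    parts = line.split()
    if len(parts) < 4:
        # Fewer than four words cannot be a complete record.
        raise ValueError("expected at least 4 fields, got %d" % len(parts))
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]


print("|".join(split_record("200-720-7 69-93-2 kyselina močová C5H4N4O3")))
# prints 200-720-7|69-93-2|kyselina močová|C5H4N4O3
```

Raising an error on short input is a design choice for the sketch; it makes incomplete lines visible instead of silently producing shifted columns.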

> #read and enter into a list
> chem_file = []
> chem_file.append(input.read())

This reads the whole file and puts it into a list. This list will
*always* just contain *one* element. So why a list at all!?

> #split words and store them in a list
> for word in chem_file:
> words = word.split()

*If* the list contained more than one element, all of them would be
processed, but only the last would be bound to words. You could leave out
chem_file and the loop and simply do:

    words = input.read().split()

Same effect but less chatty.

The rest of the source seems to indicate that you don't really want to read
in the whole input file at once but process it line by line, i.e. chemical
element by chemical element.

Ciao,
Marc 'BlackJack' Rintsch

Paul Hankin

 10-14-2007
On Oct 14, 2:48 pm, patrick.wa...@gmail.com wrote:
> How can I get Python to realize that a chemical name may have a space
> in it?

In the original file, is every chemical on a line of its own? I assume
it is here.

You might use a regexp (look at the re module), or I think here you
can use the fact that only chemicals have spaces in them. Then, you
can split each line on whitespace (like you're doing), and join back
together all the words between the 3rd (ie index 2) and the last (ie
index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses
the somewhat unusual python syntax for replacing a section of a list
with another list.
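
A short illustration of that slice assignment on its own, in Python 3 syntax and with made-up token lists taken from the earlier example strings:

```python
tokens = ["123", "456", "a", "name", "with", "spaces", "789"]

# Replace the name tokens (index 2 up to, but not including, the last item)
# with a single joined string. The list shrinks from 7 items to 4.
tokens[2:-1] = [" ".join(tokens[2:-1])]
print(tokens)   # ['123', '456', 'a name with spaces', '789']

# A one-word name is replaced by itself, so the list stays at 4 items.
single = ["123", "456", "name", "789"]
single[2:-1] = [" ".join(single[2:-1])]
print(single)   # ['123', '456', 'name', '789']
```

The right-hand side has to be a list (here a one-element list); assigning a bare string to a slice would splice in its individual characters.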

The approach you took involves reading the whole file, and building a
list of all the chemicals which you don't seem to use: I've changed it
to a per-line version and removed the big lists.

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])]
    chemical = u'|'.join(tokens)
    print chemical + u'\n'
    output.write(chemical + u'\r\n')

input.close()
output.close()

Obviously, this isn't tested because I don't have your chem_1_utf8.txt
file.

--
Paul Hankin

patrick.waldo@gmail.com

 10-14-2007
Thank you both for helping me out. I am still rather new to Python
and so I'm probably trying to reinvent the wheel here.

When I try to do Paul's response, I get
>>>tokens = line.strip().split()

[]

So I am not quite sure how to read line by line.

tokens = input.read().split() gets me all the information from the
file. tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
in the example; however, how can I loop this for the entire document?
Also, when I try output.write(tokens), I get "TypeError: coercing to
Unicode: need string or buffer, list found".

Any ideas?

Marc 'BlackJack' Rintsch

 10-14-2007
On Sun, 14 Oct 2007 16:57:06 +0000, patrick.waldo wrote:

> Thank you both for helping me out. I am still rather new to Python
> and so I'm probably trying to reinvent the wheel here.
>
> When I try to do Paul's response, I get
>>>>tokens = line.strip().split()

> []

What is in line? Paul wrote this in the body of the for loop over
all the lines in the file.

> So I am not quite sure how to read line by line.

That's what the for loop over a file or file-like object is doing.
Maybe you should develop your script in smaller steps and do some printing
to see what you get at each step. For example after opening the input
file:

for line in input:
    print line # prints the whole line.
    tokens = line.split()
    print tokens # prints a list with the split line.

> tokens = input.read().split() gets me all the information from the
> file.

Right, it reads *all* of the file, not just one line.

> tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
> in the example; however, how can I loop this for the entire document?

Don't read the whole file but line by line, just like Paul showed you.

> Also, when I try output.write(tokens), I get "TypeError: coercing to
> Unicode: need string or buffer, list found".

tokens is a list but you need to write a unicode string. So you have to
reassemble the parts with '|' characters in between. Also shown by Paul.
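
A minimal sketch of that last point in Python 3 syntax, where the equivalent error is also a TypeError. The io.StringIO object is only a stand-in for the real output file, and the token list is taken from the example in the first post.

```python
import io

tokens = ["200-763-1", "71-73-8", "nátrium-tiopentál", "C11H18N2O2S.Na"]
output = io.StringIO()  # stand-in for the real output file

try:
    output.write(tokens)          # a list is not a string
except TypeError as exc:
    print("write() rejected the list:", exc)

# Reassemble the fields into one string first, then write that string.
output.write("|".join(tokens) + "\n")
print(repr(output.getvalue()))
```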

Ciao,
Marc 'BlackJack' Rintsch

John Machin

 10-14-2007
On Oct 14, 11:48 pm, patrick.wa...@gmail.com wrote:
> How can I get Python to realize that a chemical name may have a space
> in it?
>

Your input file could be in one of THREE formats:
(1) fields are separated by TAB characters (represented in Python by
the escape sequence '\t', and equivalent to '\x09')
(2) fields are fixed width and padded with spaces
(3) fields are separated by a random number of whitespace characters
(and can contain spaces).

What makes you sure that you have format 3? You might like to try
something like
print lines
print map(len, lines)
This will print a *precise* representation of what is in the first
four lines, plus their lengths. Please show us the output.
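
The snippet above does not show how lines is filled. A rough Python 3 sketch of the same inspection follows; the in-memory sample text is an assumption standing in for the real file, and its spacing is invented.

```python
import io

# Stand-in for the real input file; the content and spacing are assumptions.
sample = io.StringIO(
    "200-720-7    69-93-2\nkyselina močová  C5H4N4O3\n\n200-001-8\t50-00-0\n"
)

lines = sample.readlines()[:4]   # only the first four lines
for line in lines:
    # repr() makes tabs, runs of spaces and newlines visible.
    kind = "contains a tab" if "\t" in line else "no tab"
    print(len(line), repr(line), kind)
```

Seeing '\t' in the repr points to format 1, equal line lengths padded with spaces point to format 2, and anything else suggests format 3.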

patrick.waldo@gmail.com

 10-15-2007
> print lines
> print map(len, lines)

gave me:
['\xef\xbb\xbf200-720-7 69-93-2\n', 'kyselina mo\xc4\x8dov\xc3\xa1 C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
[28, 32, 1, 18]

I think it means that I'm still at option 3. I got the line by line
part. My code is a lot cleaner now:

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    tokens[2:-1] = [u' '.join(tokens[2:-1])] #this doesn't seem to combine the files correctly
    file = u'|'.join(tokens) #this does put '|' in between
    print file + u'\n'
    output.write(file + u'\r\n')

input.close()
output.close()

My sample input file looks like this (not organized, as you see it):
200-720-7 69-93-2
kyselina močová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH

etc...

and after the program I get:

200-720-7|69-93-2|
kyselina|močová||C5H4N4O3

200-001-8|50-00-0|
formaldehyd|CH2O|

200-002-3|
50-01-1|
guanidínium-chlorid|CH5N3.ClH|

etc...
So, I am sort of back at the start again.
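
The stray trailing and doubled '|' characters in this output appear to come from applying the slice assignment to lines with fewer than four words: the slice [2:-1] is then empty, so an empty string is inserted rather than a name being merged. A small Python 3 sketch of that behaviour, with merge_name as a made-up helper name:

```python
def merge_name(tokens):
    """Apply the slice assignment from the loop above to a copy of tokens."""
    tokens = list(tokens)
    tokens[2:-1] = [" ".join(tokens[2:-1])]
    return tokens


for line in ["200-720-7 69-93-2", "kyselina močová C5H4N4O3", "200-002-3"]:
    print("|".join(merge_name(line.split())))
# prints:
# 200-720-7|69-93-2|
# kyselina|močová||C5H4N4O3
# 200-002-3|
```

So the per-line approach only works when each input line holds one complete record of at least four words.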

tokens = line.strip().split()
for token in tokens:
    print token

I get all the single tokens, which I thought I could then put
together, except when I did:

for token in tokens:
    s = u'|'.join(token)
    print s

I got ?|2|0|0|-|7|2|0|-|7, etc...
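
What happens here is that join iterates over whatever it is given, and iterating over a single string yields its characters. A brief Python 3 sketch follows. The leading '?' is most likely the byte order mark visible as '\xef\xbb\xbf' in the output above; decoding with the standard "utf-8-sig" codec is one way to drop it, though whether that suits the real file is an assumption.

```python
token = "200-720-7"

# str.join iterates over its argument; a string yields single characters.
print("|".join(token))    # 2|0|0|-|7|2|0|-|7

tokens = ["200-720-7", "69-93-2"]
# Joining the list of tokens puts one separator between whole fields.
print("|".join(tokens))   # 200-720-7|69-93-2

# A UTF-8 byte order mark survives plain "utf-8" decoding as '\ufeff';
# the "utf-8-sig" codec strips it.
raw = b"\xef\xbb\xbf200-720-7"
print(repr(raw.decode("utf-8")))      # '\ufeff200-720-7'
print(repr(raw.decode("utf-8-sig")))  # '200-720-7'
```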

How can I join these together into nice neat little lines? When I try
to store the tokens in a list, the tokens double and I don't know
why. I can work on getting the chemical names together after...baby
steps, or maybe I am just missing something obvious. The first two
numbers will always be the same three digits-three digits-one digit
and then two digits-two digits-one digit. This seems to be the
only pattern.

My intuition tells me that I need to add an if statement that says, if
the first two numbers follow the pattern, then continue, if they don't
(i.e. a chemical name was accidentally split apart) then the third entry
needs to be put together. Something like

if tokens[1] and tokens[2] startswith('pattern') == true
tokens[2] = join(tokens[2]:tokens[3])
token[3] = token[4]
del token[4]

but the code isn't right...any ideas?
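
One way to express "the first two numbers follow the pattern" is with the re module. The following is only a Python 3 sketch: the two patterns are written from the description above and are assumptions about the data (real CAS numbers can have a longer first group), and starts_record is a made-up helper name.

```python
import re

# Patterns follow the description in the post; both are assumptions.
EINECS_RE = re.compile(r"^\d{3}-\d{3}-\d$")   # e.g. 200-720-7
CAS_RE = re.compile(r"^\d{2}-\d{2}-\d$")      # e.g. 69-93-2


def starts_record(tokens, i):
    """True if tokens[i] and tokens[i + 1] look like an EINECS/CAS pair."""
    return (
        i + 1 < len(tokens)
        and EINECS_RE.match(tokens[i]) is not None
        and CAS_RE.match(tokens[i + 1]) is not None
    )


tokens = "200-720-7 69-93-2 kyselina močová C5H4N4O3 200-001-8 50-00-0".split()
print([i for i in range(len(tokens)) if starts_record(tokens, i)])   # [0, 5]
```

Everything between one record start and the next then belongs to the same chemical, which avoids deleting and shifting list items by hand.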

Again, thanks so much. I've gone to http://gnosis.cx/TPiP/ and I have
a couple O'Reilly books, but they don't seem to have a straightforward
example for this kind of text manipulation.

Patrick

Marc 'BlackJack' Rintsch

 10-15-2007
On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:

> My sample input file looks like this (not organized, as you see it):
> 200-720-7 69-93-2
> kyselina močová C5H4N4O3
>
> 200-001-8 50-00-0
> formaldehyd CH2O
>
> 200-002-3
> 50-01-1
> guanidínium-chlorid CH5N3.ClH
>
> etc...

That's quite irregular so it is not that straightforward. One way is to
split everything into words, start a record by taking the first two
elements and then look for the start of the next record that looks like
three numbers concatenated by '-' characters. Quick and dirty hack:

import codecs
import re

NR_RE = re.compile(r'^\d+-\d+-\d+$')

def iter_elements(tokens):
    tokens = iter(tokens)
    try:
        nr_a = tokens.next()
        while True:
            nr_b = tokens.next()
            items = list()
            for item in tokens:
                if NR_RE.match(item):
                    yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                    nr_a = item
                    break
                else:
                    items.append(item)
    except StopIteration:
        yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

def main():
    in_file = codecs.open('test.txt', 'r', 'utf-8')
    tokens = in_file.read().split()
    in_file.close()
    for element in iter_elements(tokens):
        print '|'.join(element)

Ciao,
Marc 'BlackJack' Rintsch

Paul Hankin

 10-15-2007
On Oct 15, 12:20 pm, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> NR_RE = re.compile(r'^\d+-\d+-\d+$')
>
> def iter_elements(tokens):
>     tokens = iter(tokens)
>     try:
>         nr_a = tokens.next()
>         while True:
>             nr_b = tokens.next()
>             items = list()
>             for item in tokens:
>                 if NR_RE.match(item):
>                     yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
>                     nr_a = item
>                     break
>                 else:
>                     items.append(item)
>     except StopIteration:
>         yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

Maybe this is a bit more readable?

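def iter_elements(tokens):
    chem = []
    for tok in tokens:
        # a number-like token after at least four collected tokens
        # starts the next record
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    if chem:
        # the last record needs its name merged as well
        chem[2:-1] = [' '.join(chem[2:-1])]
        yield chem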
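
For anyone trying this on a current interpreter, here is a self-contained Python 3 sketch of the same grouping idea, run on sample text adapted from the thread. The function name iter_records and the inline sample are illustrative assumptions; the real script would read the tokens from the input file instead.

```python
import re

NR_RE = re.compile(r"^\d+-\d+-\d+$")


def iter_records(tokens):
    """Yield [einecs, cas, name, formula] lists from a flat token stream."""
    chem = []
    for tok in tokens:
        # A number-like token after at least four collected tokens
        # is taken as the start of the next record.
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [" ".join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    if chem:
        # Merge the name of the final record too.
        chem[2:-1] = [" ".join(chem[2:-1])]
        yield chem


text = """200-720-7 69-93-2
kyselina močová C5H4N4O3

200-001-8\t50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH
"""

for record in iter_records(text.split()):
    print("|".join(record))
# prints:
# 200-720-7|69-93-2|kyselina močová|C5H4N4O3
# 200-001-8|50-00-0|formaldehyd|CH2O
# 200-002-3|50-01-1|guanidínium-chlorid|CH5N3.ClH
```

Because the whole text is split into words first, it no longer matters how the four fields are spread across lines. The sketch still assumes that no chemical name or formula itself looks like digits-digits-digits.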

--
Paul Hankin
