Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > CSV module: incorrectly parsed file.

Reply
Thread Tools

CSV module: incorrectly parsed file.

 
 
Christopher Barrington-Leigh
Guest
Posts: n/a
 
      02-18-2008
Here is a file "test.csv"
number,name,description,value
1,"wer","tape 2"",5
1,vvv,"hoohaa",2

I want to convert it to tab-separated without those silly quotes. Note
in the second line that a field is 'tape 2"' , ie two inches: there is
a double quote in the string.

When I use csv module to read this:


import sys
outf=open(sys.argv[1]+'.tsv','wt')
import csv
reader=csv.reader(open(sys.argv[1], "rb"))
for row in reader:
outf.write('\t'.join([rr.strip() for rr in row]) +'\n')


it mangles it, messing up the double double-quote.
Can anyone help me? How do I use CSV to get it right?
Tjhanks!
c
 
Reply With Quote
 
 
 
 
Andrew McNamara
Guest
Posts: n/a
 
      02-18-2008
>Here is a file "test.csv"
>number,name,description,value
>1,"wer","tape 2"",5
>1,vvv,"hoohaa",2
>
>I want to convert it to tab-separated without those silly quotes. Note
>in the second line that a field is 'tape 2"' , ie two inches: there is
>a double quote in the string.


The input format is ambiguous - how is the parser to distinguish between
a double-quote in the field, and the double-quote that delimits the
field? Excel would have written that field as "tape 2""" (it doubles
double-quotes that appear within a field).

You can turn off the double-double-quote handling by passing
"doublequote=False" to the parser, but the results still might not be
what you want (because the format is ambiguous).


--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/
 
Reply With Quote
 
 
 
 
Paul McGuire
Guest
Posts: n/a
 
      02-18-2008
On Feb 17, 8:09*pm, Christopher Barrington-Leigh
<(E-Mail Removed)> wrote:
> Here is a file "test.csv"
> number,name,description,value
> 1,"wer","tape 2"",5
> 1,vvv,"hoohaa",2
>
> I want to convert it to tab-separated without those silly quotes. Note
> in the second line that a field is 'tape 2"' , ie two inches: there is
> a double quote in the string.
>


What is needed to disambiguate this data is to only accept closing
quotes if they are followed by a comma or the end of the line. In
pyparsing, you can define your own quoted string format. Here is one
solution using pyparsing. At the end, you can extract the data by
field name, and print it out however you choose:

data = """\
number,name,description,value
1,"wer","tape 2"",5
1,vvv,"hoohaa",2"""


from pyparsing import *

# very special definition of a quoted string, that ends with a " only
if
# followed by a , or the end of line
quotedString = ('"' +
ZeroOrMore(CharsNotIn('"')|('"' + ~FollowedBy(','|lineEnd))) +
'"')
quotedString.setParseAction(keepOriginalText, removeQuotes)
integer = Word(nums).setParseAction(lambda toks:int(toks[0]))
value = integer | quotedString | Word(printables.replace(",",""))

# first pass, just parse the comma-separated values
for line in data.splitlines():
print delimitedList(value).parseString(line)
print

# now second pass, assign field names using names from first line
names = data.splitlines()[0].split(',')
def setValueNames(tokens):
for k,v in zip(names,tokens):
tokens[k] = v
lineDef = delimitedList(value).setParseAction(setValueNames)

# parse each line, and extract data by field name
for line in data.splitlines()[1:]:
results = lineDef.parseString(line)
print "Desc:", results.description
print results.dump()


Prints:
['number', 'name', 'description', 'value']
[1, 'wer', 'tape 2"', 5]
[1, 'vvv', 'hoohaa', 2]

Desc: tape 2"
[1, 'wer', 'tape 2"', 5]
- description: tape 2"
- name: wer
- number: 1
- value : 5
Desc: hoohaa
[1, 'vvv', 'hoohaa', 2]
- description: hoohaa
- name: vvv
- number: 1
- value : 2

-- Paul

 
Reply With Quote
 
7stud
Guest
Posts: n/a
 
      02-18-2008
On Feb 17, 7:09*pm, Christopher Barrington-Leigh
<(E-Mail Removed)> wrote:
> Here is a file "test.csv"
> number,name,description,value
> 1,"wer","tape 2"",5
> 1,vvv,"hoohaa",2
>
> I want to convert it to tab-separated without those silly quotes. Note
> in the second line that a field is 'tape 2"' , ie two inches: there is
> a double quote in the string.
>
> When I use csv module to read this:
>
> import sys
> outf=open(sys.argv[1]+'.tsv','wt')
> import csv
> reader=csv.reader(open(sys.argv[1], "rb"))
> for row in reader:
> * * outf.write('\t'.join([rr.strip() for rr in row]) +'\n')
>
> it mangles it, messing up the double double-quote.
> Can anyone help me? How do I use CSV to get it right?
> Tjhanks!
> c



Try this:

infile = open('data.txt')
outfile = open('outfile.txt', 'w')

for line in infile:
pieces = line.strip().split(',')

data = []
for piece in pieces:
if piece[0] == '"':
data.append(piece[1:-2])
else:
data.append(piece)

out_line = '%s\n' % '\t'.join(data)
outfile.write(out_line)
 
Reply With Quote
 
7stud
Guest
Posts: n/a
 
      02-18-2008
On Feb 17, 9:11*pm, 7stud <(E-Mail Removed)> wrote:
> On Feb 17, 7:09*pm, Christopher Barrington-Leigh
>
>
>
> <(E-Mail Removed)> wrote:
> > Here is a file "test.csv"
> > number,name,description,value
> > 1,"wer","tape 2"",5
> > 1,vvv,"hoohaa",2

>
> > I want to convert it to tab-separated without those silly quotes. Note
> > in the second line that a field is 'tape 2"' , ie two inches: there is
> > a double quote in the string.

>
> > When I use csv module to read this:

>
> > import sys
> > outf=open(sys.argv[1]+'.tsv','wt')
> > import csv
> > reader=csv.reader(open(sys.argv[1], "rb"))
> > for row in reader:
> > * * outf.write('\t'.join([rr.strip() for rr in row]) +'\n')

>
> > it mangles it, messing up the double double-quote.
> > Can anyone help me? How do I use CSV to get it right?
> > Tjhanks!
> > c

>
> Try this:
>
> infile = open('data.txt')
> outfile = open('outfile.txt', 'w')
>
> for line in infile:
> * * pieces = line.strip().split(',')
>
> * * data = []
> * * for piece in pieces:
> * * * * if piece[0] == '"':
> * * * * * * data.append(piece[1:-2])
> * * * * else:
> * * * * * * data.append(piece)
>
> * * out_line = '%s\n' % '\t'.join(data)
> * * outfile.write(out_line)


Whoops. The line:

data.append(piece[1:-2])

should be:

data.append(piece[1:-1])

 
Reply With Quote
 
Steve Holden
Guest
Posts: n/a
 
      02-18-2008
7stud wrote:
> On Feb 17, 9:11 pm, 7stud <(E-Mail Removed)> wrote:
>> On Feb 17, 7:09 pm, Christopher Barrington-Leigh
>>
>>
>>
>> <(E-Mail Removed)> wrote:
>>> Here is a file "test.csv"
>>> number,name,description,value
>>> 1,"wer","tape 2"",5
>>> 1,vvv,"hoohaa",2
>>> I want to convert it to tab-separated without those silly quotes. Note
>>> in the second line that a field is 'tape 2"' , ie two inches: there is
>>> a double quote in the string.
>>> When I use csv module to read this:
>>> import sys
>>> outf=open(sys.argv[1]+'.tsv','wt')
>>> import csv
>>> reader=csv.reader(open(sys.argv[1], "rb"))
>>> for row in reader:
>>> outf.write('\t'.join([rr.strip() for rr in row]) +'\n')
>>> it mangles it, messing up the double double-quote.
>>> Can anyone help me? How do I use CSV to get it right?
>>> Tjhanks!
>>> c

>> Try this:
>>
>> infile = open('data.txt')
>> outfile = open('outfile.txt', 'w')
>>
>> for line in infile:
>> pieces = line.strip().split(',')
>>
>> data = []
>> for piece in pieces:
>> if piece[0] == '"':
>> data.append(piece[1:-2])
>> else:
>> data.append(piece)
>>
>> out_line = '%s\n' % '\t'.join(data)
>> outfile.write(out_line)

>
> Whoops. The line:
>
> data.append(piece[1:-2])
>
> should be:
>
> data.append(piece[1:-1])
>

Even when you have done all this you will still have problems. As Andrew
pointed out the form is ambiguous, and you'd just better hope none of
your data items look like

Nails 2", soldiers for the use of

because then you will be completely screwed. So there's a need for a
certain amount of visual scrutiny of the data: I would definitely write
a validation program first that tries to read the data and catches any
exceptions like unmatched quotes or the wrong number of items in a line.
If there aren't too many (and there usually aren't) just edit them out
of your input data by hand.

If this is to be a regular task then you'll have to program to recognize
and correct the common error cases.

regards
Steve

--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Best way to parse a csv...... a csv that has CRLF in the fields sso Java 20 04-26-2009 11:04 AM
read and write csv file using csv module jliu66 Python 0 10-19-2007 03:12 PM
How to move data from a CSV file to a JTable, and from a JTable to a CSV file ? Tintin92 Java 1 02-14-2007 06:51 PM
Re: csv writerow creates double spaced excel csv files Skip Montanaro Python 0 02-13-2004 08:50 PM
csv writerow creates double spaced excel csv files Michal Mikolajczyk Python 0 02-13-2004 08:38 PM



Advertisments