Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > csv.reader has trouble with comma inside quotes inside brackets

Reply
Thread Tools

csv.reader has trouble with comma inside quotes inside brackets

 
 
Bret
Guest
Posts: n/a
 
      06-09-2009
i have a csv file like so:
row1,field1,[field2][text in field2 "quote, quote"],field3,field
row2,field1,[field2]text in field2 "quote, quote",field3,field

using csv.reader to read the file, the first row is broken into two
fields:
[field2][text in field2 "quote
and
quote"

while the second row is read correctly with:
[field2]text in field2 "quote, quote"
being one field.

any ideas how to make csv.reader work correctly for the first case?
the problem is the comma inside the quote inside the brackets, ie:
[","]
 
Reply With Quote
 
 
 
 
John Machin
Guest
Posts: n/a
 
      06-09-2009
Bret <bretrouse <at> gmail.com> writes:

>
> i have a csv file like so:
> row1,field1,[field2][text in field2 "quote, quote"],field3,field
> row2,field1,[field2]text in field2 "quote, quote",field3,field
>
> using csv.reader to read the file, the first row is broken into two
> fields:
> [field2][text in field2 "quote
> and
> quote"
>
> while the second row is read correctly with:
> [field2]text in field2 "quote, quote"
> being one field.
>
> any ideas how to make csv.reader work correctly for the first case?
> the problem is the comma inside the quote inside the brackets, ie:
> [","]


I can't reproduce the behaviour that you describe based on these reasonable
assumptions: Python 2.6, delimiter=',', quotechar='"':

C:\junk\weird_csv>\python26\python
Python 2.6.2 (r262:71605, Apr 14 2009, 22:40:02) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
|>>> import pprint, csv
|>>> open('weird.csv', 'rb').read()
'row1,field1,[field2][text in field2 "quote, quote"],field3,field\r\nrow2,field1
,[field2]text in field2 "quote, quote",field3,field\r\n'
>>> pprint.pprint(list(csv.reader(open('weird.csv', 'rb'))))

[['row1',
'field1',
'[field2][text in field2 "quote',
' quote"]',
'field3',
'field'],
['row2',
'field1',
'[field2]text in field2 "quote',
' quote"',
'field3',
'field']]
|>>>

As you can see, it treats both lines in the same way. Opening the file with each
of Microsoft Excel 2003, OpenOffice.org Calc 3, and Gnumeric has exactly the
same effect.

The problem is that your data has not been prepared using the generally accepted
rules for quoting in CSV files:
[pseudocode]
if field contains any of quotechar, delimiter, newline, maybe others:
field = (quotechar
+ field.replace(quotechar, quotechar + quotechar)
+ quotechar)
which would change your first line from
row1,field1,[field2][text in field2 "quote, quote"],field3,field
to
row1,field1,"[field2][text in field2 ""quote, quote""]",field3,field

There's no option in the csv module to get around this, AFAICT. You'd have to
roll your own, something along these lines (assumes no embedded newlines etc):

8<--- parse_unquoted_csv.py
def parse_unquoted_csv(line, delimiter=',', quotechar='"'):
line = line.strip()
inside_quotes = False
fields = []
field = ''
for c in line:
if inside_quotes:
if c == quotechar:
inside_quotes = False
field += c
else:
if c == delimiter:
fields.append(field)
field = ''
else:
if c == quotechar:
inside_quotes = True
field += c
if inside_quotes:
print repr(line)
print fields
print repr(field)
raise Exception("Quotes not balanced")
fields.append(field)
return fields

if __name__ == "__main__":
tests = [
'row1,field1,[field2][text in field2 "quote, quote"],field3,field',
'row2,field1,[field2]text in field2 "quote, quote",field3,field',
'a,b,c',
'a,b,',
'',
'Look,here"s,a,typo',
]
for test in tests:
print repr(test)
print parse_unquoted_csv(test)
print
8<---

HTH,
John




 
Reply With Quote
 
 
 
 
Bret
Guest
Posts: n/a
 
      06-09-2009
Thanks John,

I didn't realize that the quotes were supposed to surround the entire
field. I ended up making a quick script to replace comma's outside
quotes with tabs. I was just trying to clean this crazy "csv" file to
import into msyql.

thanks again,

bret
 
Reply With Quote
 
Terry Reedy
Guest
Posts: n/a
 
      06-09-2009
Bret wrote:
> i have a csv file like so:
> row1,field1,[field2][text in field2 "quote, quote"],field3,field
> row2,field1,[field2]text in field2 "quote, quote",field3,field
>
> using csv.reader to read the file, the first row is broken into two
> fields:
> [field2][text in field2 "quote
> and
> quote"
>
> while the second row is read correctly with:
> [field2]text in field2 "quote, quote"
> being one field.
>
> any ideas how to make csv.reader work correctly for the first case?
> the problem is the comma inside the quote inside the brackets, ie:
> [","]


When posting, give version, minimum code that has problem, and actual
output. Cut and past latter two. Reports are less credible otherwise.

Using 3.1rc1

txt = [
'''row1,field1,[field2][text in field2 "quote, quote"],field3,field''',
'''row2,field1,[field2] text in field2 "quote, quote", field3,field''',
'''row2,field1, field2 text in field2 "quote, quote", field3,field''',
]
import csv
for row in csv.reader(txt): print(len(row),row)

produces

6 ['row1', 'field1', '[field2][text in field2 "quote', ' quote"]',
field3', 'field']
6 ['row2', 'field1', '[field2] text in field2 "quote', ' quote"', '
field3', 'field']
6 ['row2', 'field1', ' field2 text in field2 "quote', ' quote"', '
field3', 'field']

In 3.1 at least, the presence or absence of brackets is irrelevant, as I
expected it to be. For double quotes to protect the comma delimiter,
the *entire field* must be quoted, not just part of it.

If you want to escape the delimiter without quoting entire fields, use
an escape char and change the dialect. For example

txt = [
'''row1,field1,[field2][text in field2 "quote`, quote"],field3,field''',
'''row2,field1,[field2] text in field2 "quote`, quote", field3,field''',
'''row2,field1, field2 text in field2 "quote`, quote", field3,field''',
]
import csv
for row in csv.reader(txt, quoting=csv.QUOTE_NONE, escapechar = '`'):
print(len(row),row)

produces what you desire

5 ['row1', 'field1', '[field2][text in field2 "quote, quote"]',
'field3', 'field']
5 ['row2', 'field1', '[field2] text in field2 "quote, quote"', '
field3', 'field']
5 ['row2', 'field1', ' field2 text in field2 "quote, quote"', '
field3', 'field']


Terry Jan Reedy

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Searching regular expr: How to match the following pattern with quotes, brackets and semicolons? Peter Stacy Perl Misc 1 11-08-2009 09:27 AM
PHP double quotes inside double quotes MSB Computer Support 11 10-21-2006 01:09 PM
Asp.NET Javascript string, want to pass '(single quotes' within '(single quotes) Chris ASP .Net 1 03-24-2006 09:03 PM
Quotes/Double Quotes in Image Control Chris White ASP .Net 1 09-22-2004 06:22 AM
#include -- double quotes vs. angle brackets bartek C++ 1 05-25-2004 11:12 PM



Advertisments