Some help in refining this regex for CSV files
Hi guys,
I have to deal with CSVs that look like the following (one header and 3 legit rows, where each legit row has 3 columns):

----
Some info
Date: 12/6/2012
Author: Some guy
Total records: 100

header1, header2, header3
one, two, three
one, "Python is great, so are other languages, isn't ?", three
one, two, 'some languages, are realyl beautiful\r\n, I really cannot deny \n this \t\t\t fact. \t\t\t\tthis fact alone is amazing'
----

So inside this CSV there will always be bad lines like the top 4 (they could show up at the beginning, in the middle, or even at the end). The sample above has 3 legit lines and a header. I want to read those three lines, and here is a regex that I came up with (which clearly isn't working):

    #print line
    pattern = r"([^\t]+\t|,+)"
    matches = re.match(pattern, line)

Do you have any better ideas, guys? I would really appreciate any help.
Re: Some help in refining this regex for CSV files
On 06/12/2012 07:21, Oltmans wrote:
> Hi guys,
>
> I've to deal with CSVs that look like following
>
> CSV (with one header and 3 legit rows where each legit row has 3 columns)
> ----
> Some info
> Date: 12/6/2012
> Author: Some guy
> Total records: 100
>
> header1, header2, header3
> one, two, three
> one, "Python is great, so are other languages, isn't ?", three
> one, two, 'some languages, are realyl beautiful\r\n, I really cannot deny \n this \t\t\t fact. \t\t\t\tthis fact alone is amazing'
> ----
>
> So inside this CSV, there will always be bad lines like the top 4 (they could end up in the beginning, in the middle and even in the last). So above sample, csv has 3 legit lines and a header. I want to read those three lines and here is a regex that I came up with (which clearly isn't working)
>
> #print line
> pattern = r"([^\t]+\t|,+)"
> matches = re.match(pattern, line)
>
> Do you've any better ideas guys? I will really appreciate all help.
>

I'd simply use the csv module from the standard library to read your files, discarding anything that you regard as bad. I'd certainly not use a regex for this.

--
Cheers.

Mark Lawrence.
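As a rough illustration of that suggestion (a sketch, not code from Mark's post), the snippet below feeds every line to csv.reader and keeps only the rows that come back with three columns. It is written for Python 3. The names SAMPLE and read_legit_rows are made up for this example, and the "exactly three columns" rule is an assumption about what counts as a legit row. skipinitialspace=True is used because the sample puts a space after each comma; without it the double quote would not be recognised as the start of a quoted field.

```python
import csv
import io

# Sample data copied from the original post (raw string so that the
# backslash sequences stay literal, as they appear in the file).
SAMPLE = r"""Some info
Date: 12/6/2012
Author: Some guy
Total records: 100

header1, header2, header3
one, two, three
one, "Python is great, so are other languages, isn't ?", three
one, two, 'some languages, are realyl beautiful\r\n, I really cannot deny \n this \t\t\t fact. \t\t\t\tthis fact alone is amazing'
"""


def read_legit_rows(lines, expected_columns=3):
    """Return only the parsed rows that have the expected column count.

    `lines` is any iterable of text lines (an open file works too).
    """
    reader = csv.reader(lines, skipinitialspace=True)
    return [row for row in reader if len(row) == expected_columns]


rows = read_legit_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

With the default double-quote quotechar, the single-quoted last row splits on its inner commas, so the length check drops it along with the metadata lines. The header row passes the check and is returned as the first row.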
Re: Some help in refining this regex for CSV files
On 12/06/12 01:21, Oltmans wrote:
> Hi guys,
>
> I've to deal with CSVs that look like following
>
> CSV (with one header and 3 legit rows where each legit row has 3 columns)
> ----
> Some info
> Date: 12/6/2012
> Author: Some guy
> Total records: 100
>
> header1, header2, header3
> one, two, three
> one, "Python is great, so are other languages, isn't ?", three
> one, two, 'some languages, are realyl beautiful\r\n, I really cannot deny \n this \t\t\t fact. \t\t\t\tthis fact alone is amazing'
> ----
>
> So inside this CSV, there will always be bad lines like the top 4 (they could end up in the beginning, in the middle and even in the last). So above sample, csv has 3 legit lines and a header. I want to read those three lines and here is a regex that I came up with (which clearly isn't working)
>
> #print line
> pattern = r"([^\t]+\t|,+)"
> matches = re.match(pattern, line)
>
> Do you've any better ideas guys? I will really appreciate all help.

I agree with Mark that using the "csv" module will likely be your easiest way to go. Just consume the lines you don't want before passing it to the csv.reader(), or parse them and discard invalid items. The first could be done something like

    import csv
    f = file("data.csv", "rb")
    while True:
        line = f.next().rstrip("\r\n")
        if not line: break
    r = csv.reader(f)
    for row in r:
        print repr(row)

The latter might be done something like

    f = file("data.csv", "rb")
    r = csv.reader(f)
    for row in r:
        if len(row) != 3: continue
        print repr(row)

However, I also noticed that your example file doesn't seem to fit a true csv file definition, as you seem to switch quoting notations, sometimes using single, sometimes using double quotes.

-tkc
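On the mixed-quoting point: csv.reader accepts only one quotechar per reader, so one possible workaround (a sketch under assumptions, not something proposed in the thread) is to parse each line on its own, trying the double quote first and then the single quote, and keep the first result that has three columns. The code is Python 3, whereas the snippets above use Python 2 idioms (file(), f.next(), the print statement). The helper name parse_line is hypothetical, and the approach assumes each record sits on a single physical line, as in the sample, where \r\n appears as literal backslash text.

```python
import csv


def parse_line(line, expected_columns=3, quotechars=('"', "'")):
    """Try each quote character in turn on a single line.

    Returns the first parse with the expected number of columns,
    or None if no quote character produces one.
    """
    for quotechar in quotechars:
        reader = csv.reader([line], quotechar=quotechar, skipinitialspace=True)
        row = next(reader, None)
        if row is not None and len(row) == expected_columns:
            return row
    return None


# Lines taken from the sample in the original post (shortened for the demo).
lines = [
    "Date: 12/6/2012",
    "one, two, three",
    "one, \"Python is great, so are other languages, isn't ?\", three",
    r"one, two, 'some languages, are realyl beautiful\r\n, I really cannot deny'",
]

parsed = [parse_line(line) for line in lines]
for row in parsed:
    print(row)
```

Trying the double quote first matters here: the apostrophe in "isn't" sits inside a double-quoted field, so that row is accepted on the first attempt and never reaches the single-quote pass. A line that happens to yield three columns under the wrong quote character would still be accepted, so this is a heuristic rather than a real fix for inconsistent quoting.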