Split string but ignore quotes

 
 
Scooter
      09-29-2009
I'm attempting to reformat an Apache log file that was written with a
custom output format. I want to convert it to W3C format using a
Python script. The problem I'm having is the field-to-field matching.
In my Python code I'm using split with a space as the delimiter, but it
fails when it reaches the user agent, because that field itself
contains spaces. The user agent is enclosed in double quotes, though.
So is there a way to split on a certain delimiter but not split
inside quoted strings?

For example, a line might look like

2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0;
Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC
5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200
1923 1360 31715 -
 
 
 
 
 
Björn Lindqvist
 
      09-29-2009


Try shlex:

>>> import shlex
>>> s = '2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200'
>>> shlex.split(s)

['2009-09-29', '12:00:00', '-', 'GET', '/', 'Mozilla/4.0 (compatible;
MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media
Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)',
'http://somehost.com', '200']
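
One caveat: shlex.split() follows POSIX shell rules, so it also treats apostrophes as quotes and backslashes as escapes, and a stray apostrophe in a log field raises ValueError. A minimal sketch of a stricter variant, assuming only double quotes should delimit fields; the function name split_log_line and the sample line are made up for illustration:

```python
import shlex

def split_log_line(line):
    """Split on whitespace, treating only double quotes as field delimiters."""
    lexer = shlex.shlex(line, posix=True)
    lexer.whitespace_split = True   # split on whitespace only, as shlex.split() does
    lexer.commenters = ''           # '#' does not start a comment
    lexer.quotes = '"'              # an apostrophe is an ordinary character
    lexer.escape = ''               # a backslash is an ordinary character
    return list(lexer)

# Hypothetical sample line with an apostrophe outside the quoted field.
sample = '12:00:00 GET /o\'brien "Some Agent/1.0 (x; y)" 200'
fields = split_log_line(sample)
print(fields)
```

Plain shlex.split(sample) would fail on this line with "No closing quotation", because it reads the apostrophe as an opening quote.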





--
mvh Björn
 
 
 
 
 
MRAB
 
      09-29-2009
Björn Lindqvist wrote:
> Try shlex:

The regex solution is:

>>> import re
>>> s = '2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200'
>>> re.findall(r'".*?"|\S+', s)

['2009-09-29', '12:00:00', '-', 'GET', '/', '"Mozilla/4.0 (compatible;
MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center
PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)"',
'http://somehost.com', '200']
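
Note that this pattern keeps the surrounding double quotes in the user-agent field, unlike shlex. If the quotes are not wanted, capture groups can drop them. A small sketch, with a shortened sample line and a hypothetical helper name split_fields:

```python
import re

# Either a double-quoted run (captured without its quotes)
# or a run of non-whitespace characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')

def split_fields(text):
    fields = []
    for match in TOKEN.finditer(text):
        quoted, bare = match.groups()
        # The group that did not take part in the match is None.
        fields.append(quoted if quoted is not None else bare)
    return fields

line = '2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0)" http://somehost.com 200'
print(split_fields(line))
```

This sketch assumes quoted fields never contain an escaped double quote.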
 
 
Simon Forman
 
      09-29-2009


s = '''2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0;
Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0;
.NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200 1923
1360 31715 -'''


initial, user_agent, trailing = s.split('"')

# Then depending on what you want to do with them...
foo = initial.split() + [user_agent] + trailing.split()
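
The three-way unpacking above only works when a line contains exactly two double quotes, and any other count makes the unpacking itself fail. A guarded sketch of the same idea, assuming one quoted field per line; split_one_quoted is a made-up name:

```python
def split_one_quoted(line):
    """Split a line that contains exactly one double-quoted field."""
    parts = line.split('"')
    if len(parts) != 3:
        raise ValueError('expected exactly one quoted field: %r' % line)
    initial, quoted, trailing = parts
    # Collapse line breaks and repeated spaces inside the quoted field.
    quoted = ' '.join(quoted.split())
    return initial.split() + [quoted] + trailing.split()

line = '''2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible;
MSIE 7.0)" http://somehost.com 200 1923 1360 31715 -'''
print(split_one_quoted(line))
```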
 
 
BJ Swope
 
      09-30-2009
Would the csv module be appropriate?
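
For what it's worth, a rough sketch of how that might look: csv.reader accepts any iterable of lines, so a space can be passed as the delimiter with the double quote as the quote character. The sample line is shortened from the original post:

```python
import csv

lines = [
    '2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0)" '
    'http://somehost.com 200 1923 1360 31715 -',
]

# Space-delimited fields; double quotes protect embedded spaces.
reader = csv.reader(lines, delimiter=' ', quotechar='"')
rows = list(reader)
print(rows[0])
```

With these settings two consecutive spaces produce an empty field, so this only fits logs that use a single space between fields.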




 
 
Processor-Dev1l
 
      09-30-2009


The best option is to use the shlex module, as Björn said.
This is quite a simple question, and you would surely have found the
answer on your own by searching the Python docs a little.
 