Convert AWK regex to Python

 
 
J
      05-16-2011
Good morning all,
I wonder if you could please help me with the following query.
I started learning Python last weekend, after a colleague of mine showed me how to dramatically cut the time a Bash script takes to execute by rewriting it in Python. I was amazed at how fast it ran. I would now like to do the same thing with another script I have.

This other script reads a log file and, using AWK, filters certain fields from the log and writes them to a new file. See below for the regex the script is executing. I would like to rewrite this regex in Python, as my script currently takes about 1 hour to execute on a log file with about 100,000 lines. I would like to cut this time down as much as possible.

cat logs/pdu_log_fe.log | awk -F\- '{print $1,$NF}' | awk -F\. '{print $1,$NF}' | awk '{print $1,$4,$5}' | sort | uniq | while read service command status; do echo "Service: $service, Command: $command, Status: $status, Occurrences: `grep $service logs/pdu_log_fe.log | grep $command | grep $status |wc -l | awk '{ print $1 }'`" >> logs/pdu_log_fe_clean.log; done

This AWK command gets lines which look like this:-

2011-05-16 09:46:22,361 [Thread-4847133] PDU D <G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - 2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004 Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >

And outputs lines like this:-

CC_SMS_SERVICE_51408 submit_resp: 0

I have tried writing the Python script myself but I am getting stuck writing the regex. So far I have the following:-

#!/usr/bin/python

# Import RegEx module
import re as regex
# Log file to work on
filetoread = open('/tmp/ pdu_log.log', "r")
# File to write output to
filetowrite = file('/tmp/ pdu_log_clean.log', "w")
# Perform filtering in the log file
linetoread = filetoread.readlines()
for line in linetoread:
    filter0 = regex.sub(r"<G_","",line)
    filter1 = regex.sub(r"\."," ",filter0)
    # Write new log file
    filetowrite.write(filter1)
filetowrite.close()
# Read new log and get required fields from it
filtered_log = open('/tmp/ pdu_log_clean.log', "r")
filtered_line = filtered_log.readlines()
for line in filtered_line:
    token = line.split(" ")
    print token[0], token[1], token[5], token[13], token[20]
print "Done"

Ugly, I know, but please bear in mind that I only started learning Python two days ago.

I have been looking on this group and on the Internet for snippets of code that I could use, but so far what I have found either does not fit my needs or is too complicated (at least for me).

Any suggestions or advice you can give me on how to accomplish this task will be greatly appreciated.

On another note, can you also recommend a good no-nonsense book for learning Python? I have read “A Byte of Python” by Swaroop C H (a great introductory book!) and I am now reading “Dive into Python” by Mark Pilgrim. I am looking for a book that explains things in simple terms and goes straight to the point (similar to how “A Byte of Python” was written).

Thanks in advance

Kind regards,

Junior
 
Chris Angelico
      05-16-2011
On Mon, May 16, 2011 at 6:19 PM, J <> wrote:
> cat logs/pdu_log_fe.log | awk -F\- '{print $1,$NF}' | awk -F\. '{print $1,$NF}' | awk '{print $1,$4,$5}' | sort | uniq | while read service command status; do echo "Service: $service, Command: $command, Status: $status, Occurrences: `grep $service logs/pdu_log_fe.log | grep $command | grep $status| wc -l | awk '{ print $1 }'`" >> logs/pdu_log_fe_clean.log; done


Small side point: Instead of "| sort | uniq |", you could use a Python
dictionary. That'll likely speed things up somewhat!

Chris Angelico
 
Chris Angelico
      05-16-2011
On Mon, May 16, 2011 at 6:43 PM, J <> wrote:
> Good morning Angelico,
> Do I understand correctly? Do you mean incorporating a Python dict inside the AWK command? How can I do this?


No, inside Python. What I mean is that you can achieve the same
uniqueness requirement by simply storing the intermediate data in a
dictionary and then retrieving it at the end.
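
A minimal sketch of that idea in Python 3, assuming the (service, command, status) triples have already been extracted from each log line. The `triples` list is made-up sample data (the `deliver_sm:` entry in particular is hypothetical, not taken from the log). Dictionary keys are unique, so repeated triples collapse into one entry, which does the job of "| sort | uniq |":

```python
# Hypothetical, already-extracted (service, command, status) triples.
triples = [
    ("CC_SMS_SERVICE_51408", "submit_resp:", "0"),
    ("CC_SMS_SERVICE_51408", "submit_resp:", "0"),
    ("CC_SMS_SERVICE_51408", "deliver_sm:", "0"),
]

seen = {}
for triple in triples:
    # Keys are unique; the value records how often each triple occurred.
    seen[triple] = seen.get(triple, 0) + 1

# Retrieve the unique triples at the end, in sorted order.
unique_triples = sorted(seen)
for service, command, status in unique_triples:
    print(service, command, status)
```

Because the dictionary also keeps a count per key, the separate grep pass for "Occurrences" would no longer be needed.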

Chris Angelico
 
 
Peter Otten
      05-16-2011
J wrote:

> Good morning all,
> Wondering if you could please help me with the following query:-
> I have just started learning Python last weekend after a colleague of mine
> showed me how to dramatically cut the time a Bash script takes to execute
> by re-writing it in Python. I was amazed at how fast it ran. I would now
> like to do the same thing with another script I have.
>
> This other script reads a log file and using AWK it filters certain fields
> from the log and writes them to a new file. See below the regex the
> script is executing. I would like to re-write this regex in Python as my
> script is currently taking about 1 hour to execute on a log file with
> about 100,000 lines. I would like to cut this time down as much as
> possible.
>
> cat logs/pdu_log_fe.log | awk -F\- '{print $1,$NF}' | awk -F\. '{print
> $1,$NF}' | awk '{print $1,$4,$5}' | sort | uniq | while read service
> command status; do echo "Service: $service, Command: $command, Status:
> $status, Occurrences: `grep $service logs/pdu_log_fe.log | grep $command |
> grep $status | wc -l | awk '{ print $1 }'`" >> logs/pdu_log_fe_clean.log;
> done
>
> This AWK command gets lines which look like this:-
>
> 2011-05-16 09:46:22,361 [Thread-4847133] PDU D
> <G_CC_SMS_SERVICE_51408_656.O_
> CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX
> - 2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004
> Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >
>
> And outputs lines like this:-
>
> CC_SMS_SERVICE_51408 submit_resp: 0
>
> I have tried writing the Python script myself but I am getting stuck
> writing the regex. So far I have the following:-


For the moment forget about the implementation. The first thing you should
do is to describe the problem as clearly as possible, in plain English.


 
 
Giacomo Boffi
      05-16-2011
J <> writes:

> cat logs/pdu_log_fe.log | awk -F\- '{print $1,$NF}' | awk -F\. '{print $1,$NF}' | awk '{print $1,$4,$5}' | sort | uniq | while read service command status; do echo "Service: $service, Command: $command, Status: $status, Occurrences: `grep $service logs/pdu_log_fe.log | grep $command | grep $status | wc -l | awk '{ print $1 }'`" >> logs/pdu_log_fe_clean.log; done
>
> This AWK command gets lines which look like this:-
>
> 2011-05-16 09:46:22,361 [Thread-4847133] PDU D <G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - 2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004 Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >
>
> And outputs lines like this:-
>
> CC_SMS_SERVICE_51408 submit_resp: 0
>


i see some discrepancies in the description of your problem

1. if i echo a properly quoted line "like this" above in the pipeline
formed by the first three awk commands i get

$ echo $likethis | awk -F\- '{print $1,$NF}' \
| awk -F\. '{print$1,$NF}' \
| awk '{print $1,$4,$5}'
2011 ) )
$
not a triple 'service command status'

2. with regard to the final product, your script outputs lines like in

echo "Service: $service, [...]"

and you say that it produces lines like

CC_SMS_SERVICE_51408 submit_resp:


WHATEVER, the abnormal run time is due to the fact that for every
output line you rescan the whole log file again and again

IF i have understood what you want, imho you should run your data
through sort and uniq -c

$ awk -F\- '{print $1,$NF}' < $file \
| awk -F\. '{print$1,$NF}' \
| awk '{print $1,$4,$5}' | sort | uniq -c | format_program

uniq -c drops repeated lines from a sorted input AND prepends to each
line the count of equal lines in the original stream
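
The same single-pass counting can be sketched in Python 3 with `collections.Counter`, which plays the role of `sort | uniq -c`. The regular expression is an assumption derived only from the one sample line quoted in this thread (it takes the service name from the text after `<G_`, minus the trailing `_656`-style suffix), so it would need checking against the real log. `count_triples` and `LINE_RE` are illustrative names.

```python
import re
from collections import Counter

# Assumed pattern, based solely on the single sample line in this thread.
LINE_RE = re.compile(r"<G_(\w+?)_\d+\..*?\((\w+:).*?Status:\s*(\d+)")


def count_triples(lines):
    """Count (service, command, status) triples in one pass over the lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # lines that do not fit the assumed format are skipped
            counts[match.groups()] += 1
    return counts


sample = (
    "2011-05-16 09:46:22,361 [Thread-4847133] PDU D "
    "<G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread-"
    "VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - "
    "2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004 "
    "Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >"
)

counts = count_triples([sample, sample, "unrelated line"])
for (service, command, status), n in sorted(counts.items()):
    print("Service: %s, Command: %s, Status: %s, Occurrences: %d"
          % (service, command, status, n))
```

For a real log, an open file object can be passed straight to `count_triples`, so the file is read once instead of once per output line.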

hth
g
 
 
Matt Berends
      05-16-2011
This doesn't directly bear upon the posted example, but I found the
following tutorial extremely helpful for learning how to parse log
files with idiomatic Python. Maybe you'll find it useful, too.

http://www.dabeaz.com/generators/

http://www.dabeaz.com/generators/Generators.pdf
 
 
MRAB
      05-16-2011
On 16/05/2011 09:19, J wrote:
[snip]
> #!/usr/bin/python
>
> # Import RegEx module
> import re as regex
> # Log file to work on
> filetoread = open('/tmp/ pdu_log.log', "r")
> # File to write output to
> filetowrite = file('/tmp/ pdu_log_clean.log', "w")
> # Perform filtering in the log file
> linetoread = filetoread.readlines()
> for line in linetoread:
>     filter0 = regex.sub(r"<G_","",line)
>     filter1 = regex.sub(r"\."," ",filter0)
>     # Write new log file
>     filetowrite.write(filter1)
> filetowrite.close()
> # Read new log and get required fields from it
> filtered_log = open('/tmp/ pdu_log_clean.log', "r")
> filtered_line = filtered_log.readlines()
> for line in filtered_line:
>     token = line.split(" ")
>     print token[0], token[1], token[5], token[13], token[20]
> print "Done"
>

[snip]

If you don't need the power of regex, it's faster to use string methods:

filter0 = line.replace("<G_", "")
filter1 = filter0.replace(".", " ")

Actually, seeing as how you're reading all the lines in one go anyway,
it's probably faster to do this instead:

text = filetoread.read()
text = text.replace("<G_", "")
text = text.replace(".", " ")
# Write new log file
filetowrite.write(text)
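
As a quick check that the two approaches agree, here is a small Python 3 sketch. The `line` value is a shortened fragment of the sample log line from this thread, and the variable names are illustrative only.

```python
import re

# Shortened fragment of the sample log line quoted in this thread.
line = "PDU D <G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread"

# Regex version, equivalent to the two regex.sub() calls in the original script.
with_regex = re.sub(r"\.", " ", re.sub(r"<G_", "", line))

# Plain string methods: no pattern to compile and no need to escape ".".
with_str = line.replace("<G_", "").replace(".", " ")

print(with_str)
print(with_regex == with_str)
```

Both patterns here are fixed strings, so `str.replace` gives the same result; a regex is only needed when the text to match actually varies.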
 