groveling over a file for Q:: and A:: stmts

 
 
paul618
 
      07-24-2012
#!/usr/bin/env python
# grep_for_QA.py I am only looking to isolate uniq Q:: and A:: stmts from my daily files
#
# note: This algorithm will fail if there are any blank lines within the Q and A area of interest (a paragraph)

# D. Beazley is my fav documentation

import re, glob
import pprint as pp

sampledata = '''
A:: And Straight Street is playin on the Radio Free Tibet. What are the chances, DTMB?
Q:: About 1 in 518400, Professor.
A:: Correct! Err, I thought it was 1:410400, but <i>close enough for jazz!</i>


'''

pattern0 = re.compile("Q::")
pattern1 = re.compile("A::") # objects of interest can start with A:: ;; not alway Q::
END_OF_PARAGRAPH_pat = "\n\s*\n"

path = "/Users/paultaney/dailies2012/0722" # an example of real data set.

toggle = False
L = []
M = []

#file = open(path, "r")
try:
    #for line in file.readlines():
    for line in sampledata:
        try:
            # Later, I also need to treat Unicode -- and I am clueless.

            # falsestarts::
            #line.encode("utf8").decode('xxx', 'ignore')
            #line.encode("utf8", 'ignore')
            #line.decode('8859')
            #line.decode('8859') # 8859, Latin-1 doesn't cover my CJK pastings AT ALL
            #line.decode('GB18030') # 171006 -- ack
            #encoded_line = line # xxx line.encode("utf8")

            mo0 = re.search(pattern0, line)
            mo1 = re.search(pattern1, line)
            mo2 = re.search(END_OF_PARAGRAPH_pat, line)

            if mo0:
                if 1: print ("I see pattern 0")
                toggle = True
                if 1: print(line)
                M.append(mo0.group())

            if mo1:
                if 1: print ("I see pattern 1")
                toggle = True
                M.append(mo1.group())

            if mo2 and toggle:
                if 1: print ("I see pattern 2 AND toggle is set")
                # got one. save it for uniqifying, and empty the container
                toggle = False
                L.append(M)
                M = []

        except Exception as e:
            print("--- " + e + " ---")

except UnicodeDecodeError:
    #encoded_line = encoded_line.urlsafe_b64encode(re.replace("asdf", encoded_line))
    #line = re.sub(".+", "--- asdf ---", line)
    pass

L.sort
print (L)

# and what"s wrong with some of this, here!
#myHash = set(L) # uniqify
#pp.pprint(myHash) # july 23, 131001 hike!
 
 
 
 
 
Steven D'Aprano
 
      07-24-2012
On Tue, 24 Jul 2012 00:50:22 -0700, paul618 wrote:

> #!/usr/bin/env python
> # grep_for_QA.py I am only looking to isolate uniq Q:: and A:: stmts from my daily files
> #
> # note: This algorithm will fail if there are any blank lines within the Q and A area of interest (a paragraph)
>
> # D. Beazley is my fav documentation



If you are going to ask a question, please ask a question. Don't just
dump a whole pile of code in our laps and expect us to work out what your
question is.

It may help if you read this page:

http://sscce.org/

Some further comments below:

> import re, glob
> import pprint as pp
>
> sampledata = '''
> A:: And Straight Street is playin on the Radio Free Tibet. What are the chances, DTMB?
> Q:: About 1 in 518400, Professor.
> A:: Correct! Err, I thought it was 1:410400, but <i>close enough for jazz!</i>
>
>
> '''
>
> pattern0 = re.compile("Q::")


There is no point in using a regular expression for something as trivial
as that. That is like swinging a 20 kg sledge-hammer to crack a peanut.

Just use a string method:

if my_string.startswith("Q::"): ...
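
As a rough sketch of that idea (the sample lines and the `is_qa_line` helper name are made up for illustration, not taken from your files): `str.startswith` also accepts a tuple of prefixes, so both markers can be tested in one call.

```python
# Hypothetical helper: True if the line begins with either marker.
MARKERS = ("Q::", "A::")

def is_qa_line(line):
    # str.startswith accepts a tuple and is True if any prefix matches.
    return line.startswith(MARKERS)

lines = [
    "A:: What are the chances, DTMB?",
    "Q:: About 1 in 518400, Professor.",
    "just an ordinary line",
]

# Keep only the lines that begin with one of the markers.
qa_lines = [line for line in lines if is_qa_line(line)]
print(qa_lines)
```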


[...]
> # Later, I also need to treat Unicode -- and I am clueless.


If you have a question about Unicode, you should ask it.

If you have not already read this page, you should read it now:

http://www.joelonsoftware.com/printe...s/Unicode.html
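
In the meantime, here is a minimal sketch of one common approach. It assumes, and this is only an assumption, that the daily files are saved as UTF-8: decode once when opening the file and work with text from then on. The temporary file stands in for the real path, and `errors="replace"` is just one possible policy for bytes that cannot be decoded.

```python
import io
import os
import tempfile

# Write a small UTF-8 sample file standing in for a real daily file.
sample = u"Q:: \u4f60\u597d?\nA:: caf\u00e9\n"
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

# io.open decodes while reading, so every line arrives as text, not bytes.
with io.open(tmp_path, encoding="utf-8", errors="replace") as f:
    decoded_lines = f.read().splitlines()

os.remove(tmp_path)
print(len(decoded_lines), "lines decoded")
```

If the files turn out to use a different encoding, only the `encoding` argument should need to change.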



> except Exception as e:
>     print("--- " + e + " ---")


Please don't throw away useful debugging information.

You should learn to read exception tracebacks, not hide them. They
contain a lot of very useful information to help you debug your code.
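
As an aside, the quoted handler would itself fail, because adding a string to an exception object raises `TypeError`. Here is a small sketch of reporting an error without losing the traceback, using the standard `traceback` module. The `risky` function is a made-up stand-in for the real loop body.

```python
import traceback

def risky():
    # Stand-in for code that fails; int("abc") raises ValueError.
    return int("abc")

try:
    risky()
except Exception as e:
    # "--- " + e + " ---" would raise TypeError, so convert explicitly.
    message = "--- " + str(e) + " ---"
    print(message)
    # format_exc() returns the full traceback text as a string.
    details = traceback.format_exc()
    print(details)
```

While debugging, it is often simpler still to remove the `try`/`except` and let Python print the traceback itself.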

> except UnicodeDecodeError:
>     #encoded_line = encoded_line.urlsafe_b64encode(re.replace("asdf", encoded_line))
>     #line = re.sub(".+", "--- asdf ---", line)
>     pass


This will never be caught because any UnicodeDecodeError will already be
caught by the "except Exception" line above.
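
A short sketch of the usual arrangement, with the specific handler listed before the general one (the `classify` function and its return strings are invented for this example):

```python
def classify(raw_bytes):
    try:
        raw_bytes.decode("ascii")
        return "ok"
    except UnicodeDecodeError:
        # Specific handler first: UnicodeDecodeError is a subclass of Exception.
        return "bad encoding"
    except Exception:
        # Anything else (e.g. AttributeError for a non-bytes argument).
        return "some other error"

results = [classify(b"plain"), classify(b"\xff\xfe"), classify(None)]
print(results)
```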


> L.sort
> print (L)
>
> # and what"s wrong with some of this, here!
> #myHash = set(L) # uniqify
> #pp.pprint(myHash) # july 23, 131001 hike!


I don't know what's wrong with it. What do you expect it to do, and what
does it actually do instead?



--
Steven
 
 
 
 
 
paul618
 
      07-24-2012
Hi Steve:


Thank you for your quick response.

Ah, indeed I failed to ask my question: Why doesn't this code print the sampledata? Instead, it prints an empty list.

The answer is probably quite simple, as I really am an idiot.


Thanks again,
paul


 
 
MRAB
 
      07-24-2012
On 24/07/2012 08:50, paul618 wrote:
> #!/usr/bin/env python
> # grep_for_QA.py I am only looking to isolate uniq Q:: and A:: stmts from my daily files
> #
> # note: This algorithm will fail if there are any blank lines within the Q and A area of interest (a paragraph)
>
> # D. Beazley is my fav documentation
>
> import re, glob
> import pprint as pp
>
> sampledata = '''
> A:: And Straight Street is playin on the Radio Free Tibet. What are the chances, DTMB?
> Q:: About 1 in 518400, Professor.
> A:: Correct! Err, I thought it was 1:410400, but <i>close enough for jazz!</i>
>
>
> '''
>
> pattern0 = re.compile("Q::")
> pattern1 = re.compile("A::") # objects of interest can start with A:: ;; not alway Q::
> END_OF_PARAGRAPH_pat = "\n\s*\n"
>
> path = "/Users/paultaney/dailies2012/0722" # an example of real data set.
>
> toggle = False
> L = []
> M = []
>
> #file = open(path, "r")
> try:
>     #for line in file.readlines():
>     for line in sampledata:


sampledata is a string, therefore this is iterating over the string,
which yields characters, not lines. Try using sampledata.splitlines():

for line in sampledata.splitlines():
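
A quick way to see the difference, using a shortened stand-in for the sample text:

```python
sampledata = '''
A:: first line
Q:: second line
'''

# Iterating over a string yields one character at a time.
chars = [item for item in sampledata][:4]

# splitlines() yields whole lines, without the trailing newlines.
lines = sampledata.splitlines()

print(chars)   # ['\n', 'A', ':', ':']
print(lines)   # ['', 'A:: first line', 'Q:: second line']
```

No single character can contain "Q::" or "A::", so neither pattern ever matches and the list stays empty.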

>         try:
>             # Later, I also need to treat Unicode -- and I am clueless.
>
>             # falsestarts::
>             #line.encode("utf8").decode('xxx', 'ignore')
>             #line.encode("utf8", 'ignore')
>             #line.decode('8859')
>             #line.decode('8859') # 8859, Latin-1 doesn't cover my CJK pastings AT ALL
>             #line.decode('GB18030') # 171006 -- ack
>             #encoded_line = line # xxx line.encode("utf8")
>
>             mo0 = re.search(pattern0, line)


This searches for pattern0 anywhere in the line. You really want to
check whether the line starts with pattern0, which is better done with:

line.startswith("Q::")
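
To make the difference concrete, here is a small comparison on a made-up line where the marker appears in the middle:

```python
import re

pattern0 = re.compile("Q::")
line = "He wrote Q:: in the margin"

# search() scans the whole line, so it finds the marker mid-line.
found_anywhere = pattern0.search(line) is not None

# match() only tries at the start of the string.
found_at_start = pattern0.match(line) is not None

# The plain string method gives the same answer as match() here.
starts_with = line.startswith("Q::")

print(found_anywhere, found_at_start, starts_with)   # True False False
```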

>             mo1 = re.search(pattern1, line)
>             mo2 = re.search(END_OF_PARAGRAPH_pat, line)
>
>             if mo0:
>                 if 1: print ("I see pattern 0")
>                 toggle = True
>                 if 1: print(line)
>                 M.append(mo0.group())
>
>             if mo1:
>                 if 1: print ("I see pattern 1")
>                 toggle = True
>                 M.append(mo1.group())
>
>             if mo2 and toggle:
>                 if 1: print ("I see pattern 2 AND toggle is set")
>                 # got one. save it for uniqifying, and empty the container
>                 toggle = False
>                 L.append(M)
>                 M = []
>
>         except Exception as e:
>             print("--- " + e + " ---")
>
> except UnicodeDecodeError:
>     #encoded_line = encoded_line.urlsafe_b64encode(re.replace("asdf", encoded_line))
>     #line = re.sub(".+", "--- asdf ---", line)
>     pass
>
> L.sort
> print (L)
>
> # and what"s wrong with some of this, here!
> #myHash = set(L) # uniqify
> #pp.pprint(myHash) # july 23, 131001 hike!
>
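
As far as I can tell, three more things in the quoted code would still leave L empty or break the uniquifying step. A single line never contains "\n\s*\n", so the end-of-paragraph search cannot match when applied line by line. `L.sort` without parentheses only names the method and does not call it. And `set(L)` raises `TypeError` when L holds lists, because lists are unhashable. Below is a rough sketch of one way around all three: split the text into paragraphs first, then store tuples. The `qa_paragraphs` name and the sample text are made up for illustration.

```python
import re

sampledata = '''
A:: What are the chances, DTMB?
Q:: About 1 in 518400, Professor.

some unrelated notes

A:: What are the chances, DTMB?
Q:: About 1 in 518400, Professor.
'''

def qa_paragraphs(text):
    """Return the sorted, unique Q::/A:: paragraphs found in text."""
    unique = set()
    # Split the whole text on blank lines; each chunk is one paragraph.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        qa_lines = [ln for ln in paragraph.splitlines()
                    if ln.startswith(("Q::", "A::"))]
        if qa_lines:
            # Tuples are hashable, so they can be members of a set.
            unique.add(tuple(qa_lines))
    # sorted() returns a new list; note the call parentheses.
    return sorted(unique)

result = qa_paragraphs(sampledata)
print(result)
```

The duplicate paragraph collapses to a single entry, and the paragraph with no markers is dropped.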


 
 
 
 