Velocity Reviews


Steven D'Aprano 01-08-2011 05:35 AM

Re: More Help with python .find fucntion
 
On Fri, 07 Jan 2011 22:43:54 -0600, Keith Anthony wrote:

> My previous question asked how to read a file into a structure a line at
> a time. Figured it out. Now I'm trying to use .find to separate out
> the PDF objects. (See code) PROBLEM/QUESTION: My call to lines[i].find
> does NOT find all instances of endobj. Any help available? Any
> insights?
>
> #!/usr/bin/python
>
> inputfile = file('sample.pdf','rb') # This is PDF with which we will work
> lines = inputfile.readlines() # read file one line at a time


That's incorrect. readlines() reads the entire file in one go, and splits
it into individual lines.
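
To illustrate the difference, here is a minimal sketch in Python 3 syntax. It assumes an in-memory io.BytesIO object as a stand-in for the real file, and the contents are made up:

```python
import io

# Stand-in for the real file; the contents are invented for illustration.
fake_file = io.BytesIO(b"first line\nsecond line\nthird line\n")

# readlines() consumes the whole file in one call and returns a list.
all_lines = fake_file.readlines()

# Nothing is left to read afterwards.
leftover = fake_file.read()

# Iterating over the file object instead hands back one line per step.
fake_file.seek(0)
first = next(fake_file)

print(len(all_lines), repr(leftover), repr(first))
```

For a small file the difference hardly matters, but the comment should say what the code actually does.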


> linestart = [] # Starting address for each line
> lineend = [] # Ending address for each line
> linetype = []


*raises eyebrow*

How is an empty list a starting or ending address?

The only thing worse than no comments where you need them is misleading
comments. A variable called "linestart" implies that it should be a
position, e.g. linestart = 0. Or possibly a flag.


> print len(lines) # print number of lines
>
> i = 0 # define an iterator, i


Again, 0 is not an iterator. 0 is a number.
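
If the distinction seems pedantic, this small sketch (Python 3 syntax, with a made-up list of lines) shows what an iterator actually is, as opposed to an integer used as a counter:

```python
lines = ["a\n", "b\n", "c\n"]  # invented sample data

i = 0              # a counter: just an int
it = iter(lines)   # an iterator: an object that yields items on demand

# An int has no __next__ method, so it is not an iterator.
counter_is_iterator = hasattr(i, "__next__")

# An iterator is advanced with next().
first = next(it)

# enumerate() pairs each item with its index, so no manual counter is needed.
pairs = list(enumerate(lines))

print(counter_is_iterator, repr(first), pairs)
```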


> addr = 0 # and address pointer
>
> while i < len(lines): # Go through each line
>     linestart = linestart + [addr]
>     length = len(lines[i])
>     lineend = lineend + [addr + (length-1)]
>     addr = addr + length
>     i = i + 1


Complicated and confusing and not the way to do it in Python. Something
like this is much simpler:


linetypes = []  # note plural
inputfile = open('sample.pdf','rb')  # Don't use file, use open.

for line_number, line in enumerate(inputfile):
    # Process one line at a time. No need for that nonsense with manually
    # tracked line numbers, enumerate() does that for us.
    # No need to initialise linetypes.
    status = 'normal'
    i = line.find(' obj')
    if i >= 0:
        print "Object found at offset %d in line %d" % (i, line_number)
        status = 'object'
    i = line.find('endobj')
    if i >= 0:
        print "endobj found at offset %d in line %d" % (i, line_number)
        if status == 'normal': status = 'endobj'
        else: status = 'object & endobj'  # both found on the one line
    linetypes.append(status)
    # What if obj or endobj exist more than once in a line?
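
One possible answer to that last question: find() takes an optional start position, so you can call it repeatedly and resume just past each match. The following is only a sketch in Python 3 syntax; find_all is a hypothetical helper name, not anything built in, and the sample line is invented:

```python
def find_all(line, token):
    """Return every offset at which token occurs in line (non-overlapping)."""
    offsets = []
    start = 0
    while True:
        pos = line.find(token, start)
        if pos == -1:          # find() returns -1 when nothing more is found
            break
        offsets.append(pos)
        start = pos + len(token)  # resume searching just past this match
    return offsets


# Invented sample: two objects squeezed onto a single line.
sample = b"1 0 obj << >> endobj 2 0 obj << >> endobj\n"

endobj_offsets = find_all(sample, b"endobj")
obj_offsets = find_all(sample, b" obj")  # leading space avoids matching "endobj"

print(endobj_offsets, obj_offsets)
```

A single call to find() only ever reports the first match, which is one likely reason some instances of endobj went missing.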



One last thing... if PDF files are a binary format, what makes you think
that they can be processed line-by-line? They may not have lines, except
by accident.
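
As a rough illustration: PDF permits a bare carriage return as an end-of-line marker, while a file opened in binary mode splits lines on "\n" only. The sketch below (Python 3 syntax) uses made-up bytes and an io.BytesIO object as a stand-in for a real file:

```python
import io

# Invented bytes shaped like two PDF objects, using bare CR line endings.
data = b"1 0 obj\r<< >>\rendobj\r2 0 obj\r<< >>\rendobj\r"

# In binary mode, lines are split on b"\n" only, so this is all one "line".
lines = io.BytesIO(data).readlines()

# Searching the raw bytes does not depend on line endings at all.
endobj_count = data.count(b"endobj")

print(len(lines), endobj_count)
```

With data like this, a line-by-line approach would report at most one endobj per find() call, however many the file contains.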


--
Steven

