Velocity Reviews

Miki Tebeka 07-07-2003 07:18 AM

PDF Parser?
 
Hello All,

I'm looking for a PDF parser.
Any pointers?

10x.
Miki

John Hunter 07-07-2003 04:12 PM

Re: PDF Parser?
 
>>>>> "Miki" == Miki Tebeka <tebeka@cs.bgu.ac.il> writes:

Miki> Hello All, I'm looking for a PDF parser. Any pointers?

A little more info would be helpful: do you need access to all the PDF
structures or just the text? AFAIK, there is no full PDF parser in
Python. The subject has come up several times before, so check the
Google Groups archives:

http://groups.google.com/groups?q=pd...=Google+Search

Things people have suggested before:

1) use pdftotext and parse the text
2) wrap xpdf's parser.

For example, if you have pdftotext, the following will give you a
Python file-like handle to the extracted text:

import os

def pdf2txt(fname):
    return os.popen('pdftotext -raw -ascii7 %s -' % fname)
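One caveat with the os.popen approach: the file name goes through a shell, so a name containing spaces or shell metacharacters will break the command. A rough sketch of the same idea with subprocess and an argument list, assuming pdftotext is on your PATH; the function names pdftotext_command and pdf_to_text are made up for illustration, and I have left out -ascii7 since not every pdftotext build accepts it:

```python
import subprocess

def pdftotext_command(fname, raw=True):
    """Build the argument list for pdftotext (hypothetical helper).

    The trailing '-' asks pdftotext to write the text to stdout.
    """
    cmd = ['pdftotext']
    if raw:
        cmd.append('-raw')  # keep text in content-stream order
    cmd.extend([fname, '-'])
    return cmd

def pdf_to_text(fname):
    """Run pdftotext and return the extracted text as a string.

    Assumes the pdftotext binary is installed and on PATH.
    """
    result = subprocess.run(pdftotext_command(fname),
                            capture_output=True, check=True)
    return result.stdout.decode('utf-8', errors='replace')
```

Because the arguments are passed as a list, no shell is involved and a name like 'my file.pdf' needs no quoting.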

If you just want to search and index PDF files, see
http://pdfsearch.sourceforge.net.

John Hunter


Adam Twardoch 07-15-2003 08:09 AM

Re: PDF Parser?
 
"John Hunter" <jdhunter@ace.bsd.uchicago.edu>

> A little more info would be helpful: do you need access to all the PDF
> structures or just the text? AFAIK, there is no full PDF parser in
> Python.


If you need to access the graphical elements, you can use pstoedit to
convert the PDF into SVG (Scalable Vector Graphics). Since SVG is XML, you
can then use any Python-based XML toolkit to parse the data.
http://www.pstoedit.net/pstoedit
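
Once you have an SVG file, the parsing side might look roughly like this with the standard library's xml.etree.ElementTree. This is only a sketch: the helper names and the SAMPLE document are invented for illustration, and it assumes the converter emits elements in the standard SVG namespace, which you should check against your actual output:

```python
import xml.etree.ElementTree as ET

# Standard SVG namespace, in ElementTree's {uri}tag notation.
SVG_NS = '{http://www.w3.org/2000/svg}'

def count_elements(svg_text):
    """Map each element name (namespace prefix stripped) to its count."""
    root = ET.fromstring(svg_text)
    counts = {}
    for elem in root.iter():
        name = elem.tag
        if name.startswith(SVG_NS):
            name = name[len(SVG_NS):]
        counts[name] = counts.get(name, 0) + 1
    return counts

def path_data(svg_text):
    """Return the 'd' attribute of every <path> element, in document order."""
    root = ET.fromstring(svg_text)
    return [p.get('d') for p in root.iter(SVG_NS + 'path')]

# Tiny hand-written SVG standing in for real converter output.
SAMPLE = '''<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
  <rect x="0" y="0" width="10" height="10"/>
  <path d="M0 0 L10 10"/>
  <path d="M10 0 L0 10"/>
</svg>'''

print(count_elements(SAMPLE))
print(path_data(SAMPLE))
```

For a file on disk, ET.parse(filename).getroot() gives you the same kind of root element to iterate over.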

Adam



