Extract Text from PDF

 
 
Mark Dodwell
 
      04-13-2007
Hi,

Does anyone know a way to extract plain text from a PDF using Ruby?

Many Thanks,

~ Mark

--
Posted via http://www.ruby-forum.com/.

 
 
 
 
 
Robert Klemme
 
      04-13-2007
On 13.04.2007 14:06, Mark Dodwell wrote:
> Does anyone know a way to extract plain text from a PDF using Ruby?


IIRC there is a project under way to extend PDFWriter with reading
capabilities. I don't know the current status of that. HTH

robert
 
 
 
 
 
Chris Lowis
 
      04-13-2007
Robert Klemme wrote:
> On 13.04.2007 14:06, Mark Dodwell wrote:
>> Does anyone know a way to extract plain text from a PDF using Ruby?

>
> IIRC there is a project under way to extend PDFWriter with reading
> capabilities. I don't know the current status of that. HTH


In the meantime, you could use the command-line tools pdf2ps and ps2ascii
(I think they use Ghostscript as a backend), and read the resulting
ASCII file with Ruby in the usual way.
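
A minimal sketch of that two-step conversion, assuming pdf2ps and ps2ascii are both installed and on the PATH. The method names pdf_to_ascii and run! are made up for illustration, and the temporary file names are arbitrary:

```ruby
require "open3"
require "tmpdir"

# Runs an external command and raises if it exits with a non-zero status.
# (Hypothetical helper, not part of any library.)
def run!(*cmd)
  _out, err, status = Open3.capture3(*cmd)
  raise "#{cmd.first} failed: #{err.strip}" unless status.success?
end

# Converts PDF -> PostScript -> ASCII in a temporary directory and
# returns the resulting text as a String.
def pdf_to_ascii(pdf_path)
  raise ArgumentError, "no such file: #{pdf_path}" unless File.file?(pdf_path)

  Dir.mktmpdir do |dir|
    ps_path  = File.join(dir, "out.ps")
    txt_path = File.join(dir, "out.txt")
    run!("pdf2ps", pdf_path, ps_path)   # pdf2ps input.pdf output.ps
    run!("ps2ascii", ps_path, txt_path) # ps2ascii input.ps output.txt
    File.read(txt_path)                 # read the ASCII file the usual way
  end
end
```

Calling pdf_to_ascii("report.pdf") would then return the extracted text, or raise if either tool is missing or fails. Passing the arguments as separate strings avoids going through a shell, so file names with spaces need no quoting.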

Regards,


Chris

--
Posted via http://www.ruby-forum.com/.

 
 
Kouhei Sutou
 
      04-13-2007
Hi,

2007/4/13, Mark Dodwell:

> Does anyone know a way to extract plain text from a PDF using Ruby?


You can use Ruby/Poppler:
http://ruby-gnome2.sourceforge.jp/hi...Ruby%2FPoppler

Here is an example to do that:
http://ruby-gnome2.cvs.sourceforge.n...AD&view=markup
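
For readers who cannot reach the link, here is a rough sketch of what such a script can look like. It is not the linked example itself. It assumes the Ruby/Poppler binding (the "poppler" gem from Ruby-GNOME2) is installed; the method name poppler_text is made up for illustration:

```ruby
# Sketch only: requires the Ruby/Poppler binding to be installed.
def poppler_text(path)
  require "poppler" # loaded here so a missing gem only fails on use

  # Poppler::Document.new accepts a file:// URI. Paths containing
  # spaces or other special characters may need URI escaping first.
  uri = "file://#{File.expand_path(path)}"
  doc = Poppler::Document.new(uri)

  pages = []
  doc.each { |page| pages << page.get_text } # one String per page
  pages.join("\n")
end

# When run as a script: ruby pdf_text.rb input.pdf
puts poppler_text(ARGV[0]) if __FILE__ == $PROGRAM_NAME && ARGV[0]
```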


Thanks,
--
kou

 
 
M. Edward (Ed) Borasky
 
      04-13-2007
Robert Klemme wrote:
> On 13.04.2007 14:06, Mark Dodwell wrote:
>> Does anyone know a way to extract plain text from a PDF using Ruby?

>
> IIRC there is a project under way to extend PDFWriter with reading
> capabilities. I don't know the current status of that. HTH
>
> robert

At least on Linux, there is "pdftotext", which is part of the "poppler"
package. So you can simply shell out to it if it's installed. If you're
more ambitious, you could write an extension to use the underlying
libraries in poppler.
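
Shelling out can be sketched as follows, assuming pdftotext is installed and on the PATH. The method name pdf_to_text and its command: keyword are made up for illustration. Passing "-" as the output file name asks pdftotext to write to standard output instead of a file:

```ruby
require "open3"

# Returns the plain text of the PDF at +path+ by running pdftotext.
# Raises if the file is missing or the command exits with an error.
def pdf_to_text(path, command: "pdftotext")
  raise ArgumentError, "no such file: #{path}" unless File.file?(path)

  # Arguments are passed separately, so no shell quoting is needed.
  stdout, stderr, status = Open3.capture3(command, path, "-")
  raise "#{command} failed: #{stderr.strip}" unless status.success?

  stdout
end
```

With that in place, text = pdf_to_text("report.pdf") returns the text as a String. If pdftotext is not installed, Open3.capture3 raises Errno::ENOENT.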



--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.net/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.


 
 
John Joyce
 
      04-13-2007
The trouble is that PDFs vary widely. Sometimes there is no text at
all in a PDF: the pages can be all vector outlines or even all raster
images. There is never a guarantee that you will get any or all of
the text that is otherwise human-readable in a PDF. PDF has really
become a kitchen-sink format, so it is good to anticipate trouble
when parsing PDF files.
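
One way to anticipate that trouble is to check whether an extraction returned anything usable before relying on it. The sketch below is only a heuristic: the method name usable_text? and the threshold of 20 characters are arbitrary assumptions, not an established rule:

```ruby
# Returns true when +text+ contains at least +min_chars+ letters or
# digits. Image-only or outline-only PDFs typically yield an empty
# string or just whitespace and form feeds.
def usable_text?(text, min_chars: 20)
  text.to_s.scan(/[[:alnum:]]/).length >= min_chars
end
```

Whatever tool produced the text, a caller can then fall back to another approach (OCR, for example) or report the file as unreadable when usable_text? returns false.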

 
 
 
 