Java - extract text from a PDF file with JAVA

 
08-02-2006, 03:19 PM   #1
Sergio


Hi to all the newsgroup, this is my first post.
I'm trying to extract text from PDF files with Java.
I'm looking for example code, a tutorial, a guide or similar.

For the moment I'm using the PDFBox library, but I've noticed a lot of
errors in its PDF parsing.
So I tried the "Pjx" library and found some good example code on
this site:
http://www.jguru.com/faq/view.jsp?EID=1074237
...but I can't find a way to call the "PdfParser.getContents()" method.

I would appreciate any advice.
Thanks in advance.

Sergio.



08-02-2006, 05:31 PM   #2
Oliver Wong

"Sergio" wrote:
>
> So i've tried with "Pjx" library and i've found a good example code in
> this site:
> http://www.jguru.com/faq/view.jsp?EID=1074237
> ...but i can't find a way to call "PdfParser.getContents()" method.


How can you "not find a way" to call a specific method? What did you
type and what error message was produced?

- Oliver



08-02-2006, 05:47 PM   #3
Lars Enderin

Oliver Wong wrote:
> "Sergio" wrote:
>>
>> So i've tried with "Pjx" library and i've found a good example code in
>> this site:
>> http://www.jguru.com/faq/view.jsp?EID=1074237
>> ...but i can't find a way to call "PdfParser.getContents()" method.

>
> How can you "not find a way" to call a specific method? What did you
> type and what error message was produced?
>


The method is declared private. It's not supposed to be called from
outside the class.


08-02-2006, 06:14 PM   #4
Sergio

Lars Enderin wrote:

> The method is declared private. It's not supposed to be called from
> outside the class.


First of all, thanks for the answers.
I made that method public before calling it.
My calling code is this (very simple):

File f = new File("sample.pdf");
String text = new String();
PdfParser p = new PdfParser();
Document doc = p.parse(f);
text = p.getContents();


These are the errors displayed on the console:

Exception in thread "main" java.lang.ClassCastException: java.lang.String
    at com.etymon.pj.PdfParser.parse(PdfParser.java:427)
    at com.etymon.pj.PdfParser.getNextXref(PdfParser.java:67)
    at com.etymon.pj.PdfParser.getXref(PdfParser.java:57)
    at com.etymon.pj.PdfParser.getObjects(PdfParser.java:12)
    at com.etymon.pj.Pdf.readFromFile(Pdf.java:1227)
    at com.etymon.pj.Pdf.<init>(Pdf.java:32)
    at PdfParser.getContents(PdfParser.java:82)
    at PdfParser.parse(PdfParser.java:47)
    at PdfParser.parse(PdfParser.java:29)
    at Prova.main(Prova.java:31)

Thanks in advance for your interest.

Sergio.



08-02-2006, 06:34 PM   #5
Oliver Wong

"Sergio" wrote:
> Lars Enderin wrote:
>
>> The method is declared private. It's not supposed to be called from
>> outside the class.

>
> First of all, thanks for the answers.
> I made that method public before calling it.
> My calling code is this (very simple):
>
> File f = new File("sample.pdf");
> String text = new String();
> PdfParser p = new PdfParser();
> Document doc = p.parse(f);
> text = p.getContents();
>
>
> These are the errors displayed on the console:
>
> Exception in thread "main" java.lang.ClassCastException: java.lang.String
>     at com.etymon.pj.PdfParser.parse(PdfParser.java:427)
>     at com.etymon.pj.PdfParser.getNextXref(PdfParser.java:67)
>     at com.etymon.pj.PdfParser.getXref(PdfParser.java:57)
>     at com.etymon.pj.PdfParser.getObjects(PdfParser.java:12)
>     at com.etymon.pj.Pdf.readFromFile(Pdf.java:1227)
>     at com.etymon.pj.Pdf.<init>(Pdf.java:32)
>     at PdfParser.getContents(PdfParser.java:82)
>     at PdfParser.parse(PdfParser.java:47)
>     at PdfParser.parse(PdfParser.java:29)
>     at Prova.main(Prova.java:31)
>
> Thanks in advance for your interest.


Please show the parse method of the file com.etymon.pj.PdfParser. Be
sure to include line 427.

- Oliver



08-02-2006, 07:38 PM   #6
Sergio

> Please show the parse method of the file com.etymon.pj.PdfParser. Be
> sure to include line 427.
>
> - Oliver


As requested, here is the parse method of the file
com.etymon.pj.PdfParser.
It's quite long... line 427 is the return statement at the end of the
method.
Thanks again.

public static PjObject parse(Pdf pdf, RandomAccessFile raf, long[][] xref,
        byte[] data, int start)
        throws IOException, PjException {
    PdfParserState state = new PdfParserState();
    state._data = data;
    state._pos = start;
    state._stream = -1;
    Stack stack = new Stack();
    boolean endFlag = false;
    while ( ( ! endFlag ) && (getToken(state)) ) {
        if (state._stream != -1) {
            stack.push(state._streamToken);
            state._stream = -1;
        }
        else if (state._token.equals("startxref")) {
            endFlag = true;
        }
        else if (state._token.equals("endobj")) {
            endFlag = true;
        }
        else if (state._token.equals("%%EOF")) {
            endFlag = true;
        }
        else if (state._token.equals("endstream")) {
            byte[] stream = (byte[])(stack.pop());
            PjStreamDictionary pjsd = new PjStreamDictionary(
                ((PjDictionary)(stack.pop())).getHashtable());
            PjStream pjs = new PjStream(pjsd, stream);
            stack.push(pjs);
        }
        else if (state._token.equals("stream")) {
            // get length of stream
            PjObject obj = ((PjObject)(
                (((PjDictionary)(stack.peek())).
                    getHashtable().
                    get(new PjName("Length")))));
            if (obj instanceof PjReference) {
                obj = getObject(pdf, raf, xref,
                    ((PjReference)(obj)).getObjNumber().getInt());
            }
            state._stream =
                ((PjNumber)(obj)).getInt();

            // the following if() clause added to
            // handle the case of "Length" being
            // incorrect (larger than the actual
            // stream length)
            if ( state._stream >
                    (state._data.length - state._pos)
                    ) {
                state._stream =
                    state._data.length -
                    state._pos - 17;
            }

            if (state._pos < state._data.length) {
                if ((char)(state._data[state._pos]) == '\r') {
                    state._pos++;
                }
                if ( (state._pos < state._data.length) &&
                        ((char)(state._data[state._pos]) ==
                            '\n') ) {
                    state._pos++;
                }
            }
        }
        else if (state._token.equals("null")) {
            stack.push(new PjNull());
        }
        else if (state._token.equals("true")) {
            stack.push(new PjBoolean(true));
        }
        else if (state._token.equals("false")) {
            stack.push(new PjBoolean(false));
        }
        else if (state._token.equals("R")) {
            // we ignore the generation number
            // because all objects get reset to
            // generation 0 when we collapse the
            // incremental updates
            stack.pop(); // the generation number
            PjNumber obj = (PjNumber)(stack.pop());
            stack.push(new PjReference(obj, PjNumber.ZERO));
        }
        else if ( (state._token.charAt(0) == '<') &&
                (state._token.startsWith("<<") == false) ) {
            stack.push(new PjString(PjString.decodePdf(state._token)));
        }
        else if (
                (Character.isDigit(state._token.charAt(0)))
                || (state._token.charAt(0) == '-')
                || (state._token.charAt(0) == '.') ) {
            stack.push(new PjNumber(new Float(state._token).floatValue()));
        }
        else if (state._token.charAt(0) == '(') {
            stack.push(new PjString(PjString.decodePdf(state._token)));
        }
        else if (state._token.charAt(0) == '/') {
            stack.push(new PjName(state._token.substring(1)));
        }
        else if (state._token.equals(">>")) {
            boolean done = false;
            Object obj;
            Hashtable h = new Hashtable();
            while ( ! done ) {
                obj = stack.pop();
                if ( (obj instanceof String) &&
                        (((String)obj).equals("<<")) ) {
                    done = true;
                } else {
                    h.put((PjName)(stack.pop()),
                        (PjObject)obj);
                }
            }
            // figure out what kind of dictionary we have
            PjDictionary dictionary = new PjDictionary(h);
            if (PjPage.isLike(dictionary)) {
                stack.push(new PjPage(h));
            }
            else if (PjPages.isLike(dictionary)) {
                stack.push(new PjPages(h));
            }
            else if (PjFontType1.isLike(dictionary)) {
                stack.push(new PjFontType1(h));
            }
            else if (PjFontDescriptor.isLike(dictionary)) {
                stack.push(new PjFontDescriptor(h));
            }
            else if (PjResources.isLike(dictionary)) {
                stack.push(new PjResources(h));
            }
            else if (PjCatalog.isLike(dictionary)) {
                stack.push(new PjCatalog(h));
            }
            else if (PjInfo.isLike(dictionary)) {
                stack.push(new PjInfo(h));
            }
            else if (PjEncoding.isLike(dictionary)) {
                stack.push(new PjEncoding(h));
            }
            else {
                stack.push(dictionary);
            }
        }
        else if (state._token.equals("]")) {
            boolean done = false;
            Object obj;
            Vector v = new Vector();
            while ( ! done ) {
                obj = stack.pop();
                if ( (obj instanceof String) &&
                        (((String)obj).equals("[")) ) {
                    done = true;
                } else {
                    v.insertElementAt((PjObject)obj, 0);
                }
            }
            // figure out what kind of array we have
            PjArray array = new PjArray(v);
            if (PjRectangle.isLike(array)) {
                stack.push(new PjRectangle(v));
            }
            else if (PjProcSet.isLike(array)) {
                stack.push(new PjProcSet(v));
            }
            else {
                stack.push(array);
            }
        }
        else if (state._token.startsWith("%")) {
            // do nothing
        }
        else {
            stack.push(state._token);
        }
    }
    /*line 427*/ return (PjObject)(stack.pop());
}



08-02-2006, 07:51 PM   #7
Sergio

I've uploaded the pjx library to this site
http://rapidshare.de/files/27945483/pjx-1.4.0.jar.html
I think it could be useful.
Thanks for all your help.
Sergio.



08-02-2006, 08:01 PM   #8
Oliver Wong

"Sergio" wrote:

[OP has a ClassCastException on line 427, actual class type is String]
>
>> Please show the parse method of the file com.etymon.pj.PdfParser. Be
>> sure to include line 427.
>>
>> - Oliver

>
> As you've requested here is the parse method of the file
> com.etymon.pj.PdfParser.
> It's quite long...the line 427 is the return instruction at the end of
> method.
> Thanks again.
>
> public static PjObject parse(Pdf pdf, RandomAccessFile raf, long[][]
> xref, byte[] data, int start)

[...]
> Stack stack = new Stack();

[...]
> stack.push(state._streamToken);

[...]
> byte[] stream = (byte[])(stack.pop());
> PjStreamDictionary pjsd = new PjStreamDictionary(
> ((PjDictionary)(stack.pop())).getHashtable());
> PjStream pjs = new PjStream(pjsd, stream);
> stack.push(pjs);

[...]
> /*line 427*/ return (PjObject)(stack.pop());


This code is extremely messy in that it pushes all sorts of different types
of objects onto the stack object. I wouldn't be surprised if this were
generated code instead of hand written.

If this is your code, you've got a bug and you need to fix it. If it's
someone else's code, then you should write up an SSCCE demonstrating the bug
and submit it to them. See http://mindprod.com/jgloss/sscce.html
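
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the Pjx source: the names MixedStackDemo, Node, parse and describe are hypothetical, and the tokenizer is reduced to two cases. It only mirrors the pattern visible in the posted method, where the final else branch pushes the raw String token onto the same stack as the parsed objects and the return statement casts whatever is on top. If the last token pushed was an unrecognised one, that cast is a plausible source of the ClassCastException: java.lang.String.

```java
import java.util.Stack;

final class MixedStackDemo {
    // Hypothetical stand-in for the library's PjObject hierarchy.
    static final class Node {
        final String value;
        Node(String value) { this.value = value; }
    }

    // Recognised tokens ("/Name") are pushed as Node objects; anything else
    // is pushed as a raw String, like the final else branch in the posted code.
    static Node parse(String[] tokens) {
        Stack<Object> stack = new Stack<>();
        for (String token : tokens) {
            if (token.startsWith("/")) {
                stack.push(new Node(token.substring(1)));
            } else {
                stack.push(token);
            }
        }
        // Unchecked assumption that the top of the stack is a Node,
        // analogous to the cast on the posted "line 427".
        return (Node) stack.pop();
    }

    // Reports either the parsed value or the exception type.
    static String describe(String[] tokens) {
        try {
            return "Node:" + parse(tokens).value;
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new String[] {"/Length"}));
        System.out.println(describe(new String[] {"/Length", "trailer"}));
    }
}
```

A reduced program of this kind, built around the real library call and a sample PDF that triggers the error, is roughly what an SSCCE for the library author would look like.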

- Oliver



08-02-2006, 08:48 PM   #9
Sergio

Oliver Wong wrote:

> This code is extremely messy in that it pushes all sorts of different types
> of objects onto the stack object. I wouldn't be surprised if this were
> generated code instead of hand written.
>
> If this is your code, you've got a bug and you need to fix it. If it's
> someone else's code, then you should write up an SSCCE demonstrating the bug
> and submit it to them. See http://mindprod.com/jgloss/sscce.html


The code of the parse method is from the pjx library... the only code I wrote
is the calling code, and I think the problem is in that code.
Thanks for your help.
Sergio.



08-03-2006, 08:51 AM   #10
Chris Uppal

Sergio wrote:

> i've made that method public before calling it.


And you are surprised to find that it doesn't work?

Presumably the author made that method private for a reason -- for instance it
may depend on certain kinds of initialisation being done first. Why not
explore the library for the /correct/ way to use it for what you want. If you
find there isn't a way, then you could drop a line to the author suggesting an
enhancement -- which would probably be more welcome if you can supply /working/
code too.
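
As a rough illustration of why a private helper may not be safe to call directly, the sketch below uses a hypothetical Extractor class (not the jGuru PdfParser and not Pjx) whose private method relies on a field that only the public entry point sets up. Reflection stands in for "making the method public". Note also that the posted stack trace shows getContents being reached from inside parse, which suggests the public method already calls it for you.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

final class PrivateHelperDemo {
    // Hypothetical class: collect() is private because it assumes
    // extract() has already initialised the buffer.
    static final class Extractor {
        private StringBuilder buffer;

        public String extract(String raw) {
            buffer = new StringBuilder(); // required initialisation
            collect(raw);
            return buffer.toString();
        }

        private void collect(String raw) {
            buffer.append(raw.trim());
        }
    }

    // Invokes the private helper on a fresh object, skipping the setup,
    // and returns the simple name of the exception it throws (or "ok").
    static String callHelperDirectly() {
        try {
            Method collect = Extractor.class.getDeclaredMethod("collect", String.class);
            collect.setAccessible(true);
            collect.invoke(new Extractor(), " some text ");
            return "ok";
        } catch (InvocationTargetException e) {
            return e.getCause().getClass().getSimpleName();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(new Extractor().extract(" some text "));
        System.out.println(callHelperDirectly());
    }
}
```

Going through the public method works, while the direct call fails because the state the helper depends on was never set up.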

-- chris



