Java - extract text from a PDF file with JAVA
#1

Hi to all the newsgroup, this is my first post.

I'm working on retrieving text from PDF files with Java, and I'm looking for example code, a tutorial, a guide or similar. For the moment I'm using the PDFBox library, but I've noticed a lot of errors in its PDF parsing. So I tried the "Pjx" library and found a good code example on this site:

http://www.jguru.com/faq/view.jsp?EID=1074237

...but I can't find a way to call the "PdfParser.getContents()" method. I would appreciate any advice.

Thanks in advance.
Sergio
#2

"Sergio" wrote:
> So I tried the "Pjx" library and found a good code example on
> this site:
> http://www.jguru.com/faq/view.jsp?EID=1074237
> ...but I can't find a way to call the "PdfParser.getContents()" method.

How can you "not find a way" to call a specific method? What did you type, and what error message was produced?

- Oliver Wong
#3

Oliver Wong wrote:
> "Sergio" wrote:
>> So I tried the "Pjx" library and found a good code example on
>> this site:
>> http://www.jguru.com/faq/view.jsp?EID=1074237
>> ...but I can't find a way to call the "PdfParser.getContents()" method.
>
> How can you "not find a way" to call a specific method? What did you
> type, and what error message was produced?

The method is declared private. It's not supposed to be called from outside the class.

Lars Enderin
#4

Lars Enderin wrote:
> The method is declared private. It's not supposed to be called from
> outside the class.

First of all, thanks for the answers.
I made that method public before calling it.
My calling code is this (very simple):

    File f = new File("sample.pdf");
    String text = new String();
    PdfParser p = new PdfParser();
    Document doc = p.parse(f);
    text = p.getContents();

These are the errors displayed on the console:

    Exception in thread "main" java.lang.ClassCastException: java.lang.String
        at com.etymon.pj.PdfParser.parse(PdfParser.java:427)
        at com.etymon.pj.PdfParser.getNextXref(PdfParser.java:67)
        at com.etymon.pj.PdfParser.getXref(PdfParser.java:57)
        at com.etymon.pj.PdfParser.getObjects(PdfParser.java:12)
        at com.etymon.pj.Pdf.readFromFile(Pdf.java:1227)
        at com.etymon.pj.Pdf.<init>(Pdf.java:32)
        at PdfParser.getContents(PdfParser.java:82)
        at PdfParser.parse(PdfParser.java:47)
        at PdfParser.parse(PdfParser.java:29)
        at Prova.main(Prova.java:31)

Thanks in advance for your interest.
Sergio
#5

"Sergio" wrote:
> Lars Enderin wrote:
>
>> The method is declared private. It's not supposed to be called from
>> outside the class.
>
> First of all, thanks for the answers.
> I made that method public before calling it.
> My calling code is this (very simple):
>
>     File f = new File("sample.pdf");
>     String text = new String();
>     PdfParser p = new PdfParser();
>     Document doc = p.parse(f);
>     text = p.getContents();
>
> These are the errors displayed on the console:
>
>     Exception in thread "main" java.lang.ClassCastException: java.lang.String
>         at com.etymon.pj.PdfParser.parse(PdfParser.java:427)
>         at com.etymon.pj.PdfParser.getNextXref(PdfParser.java:67)
>         at com.etymon.pj.PdfParser.getXref(PdfParser.java:57)
>         at com.etymon.pj.PdfParser.getObjects(PdfParser.java:12)
>         at com.etymon.pj.Pdf.readFromFile(Pdf.java:1227)
>         at com.etymon.pj.Pdf.<init>(Pdf.java:32)
>         at PdfParser.getContents(PdfParser.java:82)
>         at PdfParser.parse(PdfParser.java:47)
>         at PdfParser.parse(PdfParser.java:29)
>         at Prova.main(Prova.java:31)
>
> Thanks in advance for your interest.

Please show the parse method of the file com.etymon.pj.PdfParser. Be sure to include line 427.

- Oliver Wong
#6

> Please show the parse method of the file com.etymon.pj.PdfParser. Be
> sure to include line 427.
>
> - Oliver

As you requested, here is the parse method of the file com.etymon.pj.PdfParser.
It's quite long... line 427 is the return statement at the end of the method.
Thanks again.

    public static PjObject parse(Pdf pdf, RandomAccessFile raf, long[][] xref, byte[] data, int start) throws IOException, PjException {
        PdfParserState state = new PdfParserState();
        state._data = data;
        state._pos = start;
        state._stream = -1;
        Stack stack = new Stack();
        boolean endFlag = false;
        while ( ( ! endFlag ) && (getToken(state)) ) {
            if (state._stream != -1) {
                stack.push(state._streamToken);
                state._stream = -1;
            }
            else if (state._token.equals("startxref")) {
                endFlag = true;
            }
            else if (state._token.equals("endobj")) {
                endFlag = true;
            }
            else if (state._token.equals("%%EOF")) {
                endFlag = true;
            }
            else if (state._token.equals("endstream")) {
                byte[] stream = (byte[])(stack.pop());
                PjStreamDictionary pjsd = new PjStreamDictionary(
                    ((PjDictionary)(stack.pop())).getHashtable());
                PjStream pjs = new PjStream(pjsd, stream);
                stack.push(pjs);
            }
            else if (state._token.equals("stream")) {
                // get length of stream
                PjObject obj = ((PjObject)(
                    (((PjDictionary)(stack.peek())).
                    getHashtable().
                    get(new PjName("Length")))));
                if (obj instanceof PjReference) {
                    obj = getObject(pdf, raf, xref,
                        ((PjReference)(obj)).getObjNumber().getInt());
                }
                state._stream = ((PjNumber)(obj)).getInt();
                // the following if() clause added to
                // handle the case of "Length" being
                // incorrect (larger than the actual
                // stream length)
                if ( state._stream > (state._data.length - state._pos) ) {
                    state._stream = state._data.length - state._pos - 17;
                }
                if (state._pos < state._data.length) {
                    if ((char)(state._data[state._pos]) == '\r') {
                        state._pos++;
                    }
                    if ( (state._pos < state._data.length) && ((char)(state._data[state._pos]) == '\n') ) {
                        state._pos++;
                    }
                }
            }
            else if (state._token.equals("null")) {
                stack.push(new PjNull());
            }
            else if (state._token.equals("true")) {
                stack.push(new PjBoolean(true));
            }
            else if (state._token.equals("false")) {
                stack.push(new PjBoolean(false));
            }
            else if (state._token.equals("R")) {
                // we ignore the generation number
                // because all objects get reset to
                // generation 0 when we collapse the
                // incremental updates
                stack.pop(); // the generation number
                PjNumber obj = (PjNumber)(stack.pop());
                stack.push(new PjReference(obj, PjNumber.ZERO));
            }
            else if ( (state._token.charAt(0) == '<') && (state._token.startsWith("<<") == false) ) {
                stack.push(new PjString(PjString.decodePdf(state._token)));
            }
            else if ( (Character.isDigit(state._token.charAt(0))) || (state._token.charAt(0) == '-') || (state._token.charAt(0) == '.') ) {
                stack.push(new PjNumber(new Float(state._token).floatValue()));
            }
            else if (state._token.charAt(0) == '(') {
                stack.push(new PjString(PjString.decodePdf(state._token)));
            }
            else if (state._token.charAt(0) == '/') {
                stack.push(new PjName(state._token.substring(1)));
            }
            else if (state._token.equals(">>")) {
                boolean done = false;
                Object obj;
                Hashtable h = new Hashtable();
                while ( ! done ) {
                    obj = stack.pop();
                    if ( (obj instanceof String) && (((String)obj).equals("<<")) ) {
                        done = true;
                    } else {
                        h.put((PjName)(stack.pop()), (PjObject)obj);
                    }
                }
                // figure out what kind of dictionary we have
                PjDictionary dictionary = new PjDictionary(h);
                if (PjPage.isLike(dictionary)) {
                    stack.push(new PjPage(h));
                }
                else if (PjPages.isLike(dictionary)) {
                    stack.push(new PjPages(h));
                }
                else if (PjFontType1.isLike(dictionary)) {
                    stack.push(new PjFontType1(h));
                }
                else if (PjFontDescriptor.isLike(dictionary)) {
                    stack.push(new PjFontDescriptor(h));
                }
                else if (PjResources.isLike(dictionary)) {
                    stack.push(new PjResources(h));
                }
                else if (PjCatalog.isLike(dictionary)) {
                    stack.push(new PjCatalog(h));
                }
                else if (PjInfo.isLike(dictionary)) {
                    stack.push(new PjInfo(h));
                }
                else if (PjEncoding.isLike(dictionary)) {
                    stack.push(new PjEncoding(h));
                }
                else {
                    stack.push(dictionary);
                }
            }
            else if (state._token.equals("]")) {
                boolean done = false;
                Object obj;
                Vector v = new Vector();
                while ( ! done ) {
                    obj = stack.pop();
                    if ( (obj instanceof String) && (((String)obj).equals("[")) ) {
                        done = true;
                    } else {
                        v.insertElementAt((PjObject)obj, 0);
                    }
                }
                // figure out what kind of array we have
                PjArray array = new PjArray(v);
                if (PjRectangle.isLike(array)) {
                    stack.push(new PjRectangle(v));
                }
                else if (PjProcSet.isLike(array)) {
                    stack.push(new PjProcSet(v));
                }
                else {
                    stack.push(array);
                }
            }
            else if (state._token.startsWith("%")) {
                // do nothing
            }
            else {
                stack.push(state._token);
            }
        }
        /*line 427*/ return (PjObject)(stack.pop());
    }

Sergio
#7

I've uploaded the pjx library to this site:

http://rapidshare.de/files/27945483/pjx-1.4.0.jar.html

I think it could be useful. Thanks for all your help.
Sergio
#8

"Sergio" wrote:

[OP has a ClassCastException on line 427, actual class type is String]

>> Please show the parse method of the file com.etymon.pj.PdfParser. Be
>> sure to include line 427.
>>
>> - Oliver
>
> As you requested, here is the parse method of the file
> com.etymon.pj.PdfParser.
> It's quite long... line 427 is the return statement at the end of
> the method.
> Thanks again.
>
> public static PjObject parse(Pdf pdf, RandomAccessFile raf, long[][]
> xref, byte[] data, int start)
[...]
> Stack stack = new Stack();
[...]
> stack.push(state._streamToken);
[...]
> byte[] stream = (byte[])(stack.pop());
> PjStreamDictionary pjsd = new PjStreamDictionary(
> ((PjDictionary)(stack.pop())).getHashtable());
> PjStream pjs = new PjStream(pjsd, stream);
> stack.push(pjs);
[...]
> /*line 427*/ return (PjObject)(stack.pop());

This code is extremely messy in that it pushes all sorts of differently typed objects onto the stack object. I wouldn't be surprised if this were generated code instead of handwritten.

If this is your code, you've got a bug and you need to fix it. If it's someone else's code, then you should write up an SSCCE demonstrating the bug and submit it to them. See http://mindprod.com/jgloss/sscce.html

- Oliver Wong
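The failure mode discussed in this post can be reproduced in miniature. The quoted parse() method uses a raw Stack, its final else branch pushes any unrecognised token as a plain String, and line 427 casts whatever is on top to PjObject. The sketch below mimics only that shape. The names StackCastDemo, ParsedObject, parseTop and describe are hypothetical stand-ins, not part of pjx, and which token actually ends up on top for Sergio's sample.pdf is not known from the thread.

```java
import java.util.Stack;

// Hypothetical stand-ins for illustration only; none of this is pjx code.
class StackCastDemo {

    /** Stands in for pjx's PjObject hierarchy (assumption for the sketch). */
    static class ParsedObject {
        final String text;
        ParsedObject(String text) { this.text = text; }
    }

    /**
     * Mimics the shape of the quoted parse(): recognised tokens become
     * objects, anything else is pushed onto the raw stack as a String.
     */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static Object parseTop(String[] tokens) {
        Stack stack = new Stack();
        for (String token : tokens) {
            if (token.equals("null") || token.equals("true") || token.equals("false")) {
                stack.push(new ParsedObject(token));
            } else {
                stack.push(token); // unrecognised token: a String lands on the stack
            }
        }
        return stack.pop(); // caller casts this, like line 427 does
    }

    /** Performs the unchecked cast and reports what happened. */
    static String describe(String[] tokens) {
        try {
            ParsedObject obj = (ParsedObject) parseTop(tokens);
            return "ok: " + obj.text;
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new String[] {"true"}));            // ok: true
        System.out.println(describe(new String[] {"true", "garbage"})); // ClassCastException
    }
}
```

With a raw Stack the compiler cannot flag the mismatch, so the error only appears at the cast. An instanceof check before the cast would at least allow a clearer error message that names the offending token.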
#9

Oliver Wong wrote:
> This code is extremely messy in that it pushes all sorts of differently typed
> objects onto the stack object. I wouldn't be surprised if this were
> generated code instead of handwritten.
>
> If this is your code, you've got a bug and you need to fix it. If it's
> someone else's code, then you should write up an SSCCE demonstrating the bug
> and submit it to them. See http://mindprod.com/jgloss/sscce.html

The code of the parse method is from the pjx library... the only code I wrote is the calling method, and I think the problem is in that procedure.

Thanks for your help.
Sergio
#10

Sergio wrote:
> I made that method public before calling it.

And you are surprised to find that it doesn't work?

Presumably the author made that method private for a reason -- for instance it may depend on certain kinds of initialisation being done first. Why not explore the library for the /correct/ way to use it for what you want? If you find there isn't a way, then you could drop a line to the author suggesting an enhancement -- which would probably be more welcome if you can supply /working/ code too.

    -- chris (Chris Uppal)
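The point about initialisation can be illustrated with a small sketch. The class below is entirely hypothetical (TextExtractor, extract, collect and collectWithoutSetup are invented names, not the jGuru example or pjx); it only shows one way a private helper can rely on state that the public entry point prepares, so that exposing the helper and calling it directly fails.

```java
// Hypothetical example; not the jGuru PdfParser and not pjx.
class PrivateMethodDemo {

    static class TextExtractor {
        private StringBuilder buffer; // created only inside extract()

        /** The intended public entry point: prepares state, then delegates. */
        public String extract(String input) {
            buffer = new StringBuilder();
            collect(input);
            return buffer.toString();
        }

        /** Private on purpose: assumes extract() has already created the buffer. */
        private void collect(String input) {
            buffer.append(input.trim());
        }

        /** Models "just make it public": the same helper, reached without any setup. */
        public void collectWithoutSetup(String input) {
            collect(input);
        }
    }

    /** Calls the helper directly on a fresh object and reports the outcome. */
    static String tryDirectCall() {
        try {
            new TextExtractor().collectWithoutSetup(" hello ");
            return "ok";
        } catch (NullPointerException e) {
            return "NullPointerException";
        }
    }

    public static void main(String[] args) {
        System.out.println(new TextExtractor().extract(" hello ")); // hello
        System.out.println(tryDirectCall());                        // NullPointerException
    }
}
```

Whether the real getContents() has such a dependency is not established in the thread. The stack trace in post #4 does show getContents() being reached from inside parse(), which suggests the public parse(File) call is already the intended way in.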