Velocity Reviews - Computer Hardware Reviews

Extracting text data from MS Word document

 
 
Max
Guest
Posts: n/a
 
      09-15-2004
Hello,
I need to extract text (as a stream of ASCII characters, for
instance, or as a text file) from an MS Word document using Java. Since
this will run in a non-MS environment, e.g. UNIX, I cannot rely on
Windows-specific APIs.

Does anybody have experience with this?
I'd appreciate any references or hints on the subject.


The MS Word part of the Jakarta POI project doesn't look ready for use
right now, or is it?

Thank you,
Max
 
 
 
 
 
Ike
Guest
Posts: n/a
 
      09-16-2004
You may want to look at openoffice.org; I believe they are on SourceForge.
They read/write MS Word files in Java.

Secondly, check out http://jakarta.apache.org/poi/

Thirdly, try http://www.wotsit.org/ and search on 'word'; you'll find they
have the docs on all MS Word formats.

-Ike




 
 
 
 
 
Ann
Guest
Posts: n/a
 
      09-16-2004



If you control how the files are produced, save the Word document in RTF
format, which is mostly text. Then just read it like any other text file.


 
 
Paul Lutus
Guest
Posts: n/a
 
      09-16-2004
Ann wrote:

/ ...

> If you have control, save the word document in RTF format
> which is mostly text.


Not really. Look at one sometime in a plain-text editor.

> Then just read it as any other text file.


If the intent is to read it "as any other text file", why not save it as any
other text file? Word does that too. If instead it is saved as RTF, it
should be read as RTF, which Java can do with some limited success.

--
Paul Lutus
http://www.arachnoid.com
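
Paul's remark about reading RTF from Java can be illustrated with the RTF support that ships with Swing. The following is only a minimal sketch: the class name RtfToText and the sample input are made up for illustration, and javax.swing.text.rtf.RTFEditorKit handles a limited subset of RTF, so real Word-generated files may lose content or fail to parse.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class, not part of any library.
class RtfToText {

    /** Parses an RTF stream with Swing's RTFEditorKit and returns the plain text. */
    static String extractText(InputStream rtf) {
        try {
            RTFEditorKit kit = new RTFEditorKit();
            Document doc = kit.createDefaultDocument();
            kit.read(rtf, doc, 0);                    // parse RTF into the document model
            return doc.getText(0, doc.getLength());   // drop formatting, keep the text
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Tiny hand-written RTF sample (an assumption, not a real Word export).
        String sample = "{\\rtf1\\ansi Hello \\b World\\b0 }";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.US_ASCII));
        System.out.println(extractText(in));
    }
}
```

This only helps when the input is already RTF; it does not read the binary .doc format.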

 
 
Max
Guest
Posts: n/a
 
      09-16-2004
Unfortunately, this approach doesn't fit.

I cannot make users save documents in a specific format.
The Big Idea is that
1. the user works with their preferred document format (MS Word),
2. sends it to the System,
3. and the System extracts the text from it and processes it as necessary.

Any other ideas?
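
Since the System cannot dictate the upload format, one possible first step is to sniff what actually arrived before choosing a converter. The sketch below is an illustration only: FormatSniffer and its enum are hypothetical names. It relies on two well-known signatures: binary Word files are stored in an OLE2 compound file, which starts with the bytes D0 CF 11 E0 A1 B1 1A E1, and RTF files start with the characters {\rtf. Note that other Office files (Excel, PowerPoint) use the same OLE2 container, so a match means "possibly Word", not "definitely Word".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
class FormatSniffer {

    enum Format { OLE2_CONTAINER, RTF, UNKNOWN }

    // Signature at the start of every OLE2 compound file (used by binary .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // RTF documents begin with the literal characters {\rtf
    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    /** Classifies a file by its first few bytes. */
    static Format sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Format.OLE2_CONTAINER;
        if (startsWith(head, RTF_MAGIC)) return Format.RTF;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] rtf = "{\\rtf1 hi}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(rtf));
    }
}
```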




 
 
Paul Lutus
Guest
Posts: n/a
 
      09-16-2004
Max wrote:

> Unfortunately, this approach doesn't fit.
>
> I cannot make users save documents in a specific format.


Then you cannot get RTF either, which rules out Ann's suggestion. Too bad.

I guess you will have to see what MS Word converters are available on the
receiving end.

> The Big Idea is that
> 1. the user works with their preferred document format (MS Word),
> 2. sends it to the System,
> 3. and the System extracts the text from it and processes it as necessary.


Yes, and for that, you will need an MS Word converter. Since the target
platform is described as "UNIX", I can't go further without finding out
which Unix. If it were Linux, I would know exactly what to tell you (a
Linux installation can, and usually does, host several MS Word
converters).

--
Paul Lutus
http://www.arachnoid.com

 
 
Malcolm Dew-Jones
Guest
Posts: n/a
 
      09-16-2004
Max ((E-Mail Removed)) wrote:
: Hello,
: I need to extract textual information (as ASCII chars' stream for
: instance, or a text file) from MS Word document using Java.


antiword is not written in Java, but it does what you want. Run it with
the Java equivalent of the well-known system() call.

E.g., in Perl:

system("antiword ms-word-file.doc > temporary-file.txt");

temporary-file.txt then contains what you want.
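
A rough Java counterpart to the Perl one-liner is sketched below using java.lang.ProcessBuilder. It captures antiword's standard output directly instead of redirecting to a temporary file. The class name AntiwordRunner is made up, and the sketch assumes that an antiword executable is on the PATH and that its output can be decoded as UTF-8 (the actual encoding depends on how antiword is configured on the machine).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper class, not part of any library.
class AntiwordRunner {

    /** Runs an external command and returns everything it wrote to standard output. */
    static String run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectError(ProcessBuilder.Redirect.INHERIT) // tool errors go to our stderr
                    .start();
            String output;
            try (InputStream in = process.getInputStream()) {
                // Assumption: the tool emits UTF-8 text.
                output = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException(command.get(0) + " exited with code " + exitCode);
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + command.get(0), e);
        }
    }

    /** Assumes an "antiword" executable is installed and on the PATH. */
    static String extractText(String docPath) {
        return run(List.of("antiword", docPath));
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java AntiwordRunner ms-word-file.doc");
            return;
        }
        System.out.print(extractText(args[0]));
    }
}
```

Passing the command as a list of arguments avoids shell quoting problems with file names that contain spaces, which the system() form with a single string is prone to.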

 