> Hi! I'm trying to use Ruby and win32ole to parse a Word document. So
> far, I'm able to extract the style and text of each paragraph. That
> works great to convert it into individual divs (in the HTML CSS sense).
>
> Now, inside the paragraphs, there are certain words that have special
> formatting (e.g. the name of a command, which is in monospace) - I'm
> trying to find how to extract those special cases. Does anyone know how
> to achieve that?
>
Dear Mohit,
you could save the Word file as HTML and then extract the relevant information from it.
I did that using OpenOffice and got a file containing the font information in the following form:
<BODY LANG="en-US" DIR="LTR">
<P STYLE="margin-bottom: 0in">A command in <FONT FACE="Linux Libertine">Linux
Libertine</FONT></P>
<P STYLE="margin-bottom: 0in">A text in <FONT FACE="Bitstream Charter, serif">Bitstream
Charter</FONT></P>
</BODY>
If you read in the text of that file as a String, you can then find the relevant bits using regexps.
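Here is a minimal sketch of that regexp approach, assuming the export looks like the snippet above (upper-case FONT tags with a FACE attribute and no nested tags inside them). The helper name font_runs is made up for illustration, and real Word or OpenOffice output may need a proper HTML parser instead.

```ruby
# Sample HTML, copied from the OpenOffice export shown above.
# In practice you would read it from disk, e.g. html = File.read("export.html")
# (the file name is only a placeholder).
html = <<~HTML
  <BODY LANG="en-US" DIR="LTR">
  <P STYLE="margin-bottom: 0in">A command in <FONT FACE="Linux Libertine">Linux
  Libertine</FONT></P>
  <P STYLE="margin-bottom: 0in">A text in <FONT FACE="Bitstream Charter, serif">Bitstream
  Charter</FONT></P>
  </BODY>
HTML

# Hypothetical helper: returns [font_face, text] pairs for every
# <FONT FACE="...">...</FONT> run found in the given HTML string.
def font_runs(html)
  # /i ignores tag case; /m lets "." match newlines, since the text
  # inside a tag may wrap across lines.
  pattern = %r{<FONT\s+FACE="([^"]*)"[^>]*>(.*?)</FONT>}im
  html.scan(pattern).map do |face, text|
    # Collapse line breaks and repeated spaces into single spaces.
    [face, text.gsub(/\s+/, " ").strip]
  end
end

font_runs(html).each do |face, text|
  puts "#{face}: #{text}"
end
```

Each pair gives you the font name and the words it applies to, so you can decide per font which runs to wrap in, say, a code span.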
Best regards,
Axel