Re: Html or Pdf to Rtf (Linux) with Python

 
 
Axel Straschil
      12-16-2004
Hello!

> However, our company's product, PDFTextStream does do a phenomenal job of
> extracting text and metadata out of PDF documents. It's crazy-fast, has a
> clean API, and in general gets the job done very nicely. It presents two
> points of compromise from your ideal situation:
> 1. It only produces text, so you would have to take the text it provides and
> write it out as an RTF yourself (there are tons of packages and tools that do
> this). Since the RTF format has pretty weak formatting capabilities compared


My input source is HTML; the problem is converting from any format to
RTF. Please give me a hint as to where these tons of packages are.

Thanks,
AXEL.
--
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in RFC 2119 [http://ietf.org/rfc/rfc2119.txt]
 
Mike Meyer
      12-16-2004
Axel Straschil <(E-Mail Removed)> writes:

> Hello!
>
>> However, our company's product, PDFTextStream does do a phenomenal
>> job of extracting text and metadata out of PDF documents. It's
>> crazy-fast, has a clean API, and in general gets the job done very
>> nicely. It presents two points of compromise from your ideal
>> situation:
>> 1. It only produces text, so you would have to take the text it
>> provides and write it out as an RTF yourself (there are tons of
>> packages and tools that do this). Since the RTF format has pretty
>> weak formatting capabilities compared

>
> My input source is HTML; the problem is converting from any format
> to RTF. Please give me a hint as to where these tons of packages are.


That's easy. Load the HTML in MS Word, and save it as RTF. Script it
via COM using the python win32all (I think that's what it's now
called) package.
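
A minimal sketch of the Word-over-COM route described above, assuming a Windows machine with MS Word and the pywin32 package installed. The helper names `rtf_target` and `word_html_to_rtf` are made up for illustration; the value 6 is Word's `wdFormatRTF` constant.

```python
from pathlib import Path

WD_FORMAT_RTF = 6  # numeric value of Word's wdFormatRTF constant


def rtf_target(html_path):
    """Return the absolute output path: same name, .rtf extension."""
    return Path(html_path).resolve().with_suffix(".rtf")


def word_html_to_rtf(html_path):
    """Open an HTML file in Word and save it as RTF (Windows only)."""
    import win32com.client  # from pywin32; imported lazily so the module loads elsewhere

    source = Path(html_path).resolve()
    target = rtf_target(source)
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word needs absolute paths; relative ones resolve against its own cwd.
        doc = word.Documents.Open(str(source))
        doc.SaveAs(str(target), WD_FORMAT_RTF)
        doc.Close()
    finally:
        word.Quit()
    return target
```

This only helps on Windows; it does nothing for a Linux server.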

<mike
--
Mike Meyer <(E-Mail Removed)> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
 
Axel Straschil
      12-16-2004
Hello!

> That's easy. Load the HTML in MS Word, and save it as RTF. Script it
> via COM using the python win32all (I think that's what it's now
> called) package.


As I wrote in my posting and in the subject: Linux.
I could try to do this with OpenOffice, but I'm afraid that would not
be a performant solution ;-(
I really spent hours on this; the only thing I found was a
specification for Rich Text, which may be a good starting point for a project ...
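
For reference, a hedged sketch of what the office-suite route might look like from Python. It assumes a LibreOffice-style `soffice` binary that supports `--headless` and `--convert-to`; depending on the version, HTML input may need an explicit export filter name after `rtf`. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(html_path, out_dir, soffice="soffice"):
    """Build the argv for a headless HTML -> RTF conversion."""
    return [
        soffice,
        "--headless",          # no GUI, suitable for a server
        "--convert-to", "rtf",
        "--outdir", str(out_dir),
        str(html_path),
    ]


def convert_to_rtf(html_path, out_dir, timeout=120):
    """Run the conversion and return the path where the RTF should appear."""
    subprocess.run(build_convert_command(html_path, out_dir),
                   check=True, timeout=timeout)
    return Path(out_dir) / (Path(html_path).stem + ".rtf")
```

Each call starts a full office process, so the performance worry is justified for bulk conversion.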

Regards,
AXEL.

 
Mike Meyer
      12-17-2004
Axel Straschil <(E-Mail Removed)> writes:

> Hello!
>
>> That's easy. Load the HTML in MS Word, and save it as RTF. Script it
>> via COM using the python win32all (I think that's what it's now
>> called) package.

> As I wrote in my posting and in the subject: Linux.
> I could try to do this with OpenOffice, but I'm afraid that would not
> be a performant solution ;-(
> I really spent hours on this; the only thing I found was a
> specification for Rich Text, which may be a good starting point for a project ...


Sorry. I forgot the original subject.

You might take a look at PyRTF in PyPI. It's still in beta,
though. But coupled with HTMLParser.py, it might be enough to
get you where you need to go.
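
To illustrate the parser-plus-writer idea, here is a minimal sketch that does NOT use PyRTF: it pairs the standard library's `html.parser` (the Python 3 home of HTMLParser) with a few hand-written RTF control words. The class and function names are invented for this example, only a handful of tags are handled, and script/style content is not filtered out.

```python
import re
from html.parser import HTMLParser


def rtf_escape(text):
    """Escape backslash, braces, and non-ASCII characters for RTF."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes signed 16-bit UTF-16 code units; "?" is the fallback char.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                code = int.from_bytes(units[i:i + 2], "little", signed=True)
                out.append("\\u%d?" % code)
    return "".join(out)


class HtmlToRtf(HTMLParser):
    # HTML tag -> (RTF control word to switch on, control word to switch off)
    INLINE = {
        "b": ("\\b ", "\\b0 "), "strong": ("\\b ", "\\b0 "),
        "i": ("\\i ", "\\i0 "), "em": ("\\i ", "\\i0 "),
        "u": ("\\ul ", "\\ulnone "),
    }
    BLOCK = {"p", "div", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag][0])
        elif tag == "br":
            self.parts.append("\\line ")

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag][1])
        elif tag in self.BLOCK:
            self.parts.append("\\par\n")

    def handle_data(self, data):
        # Collapse runs of whitespace roughly the way a browser would.
        self.parts.append(rtf_escape(re.sub(r"\s+", " ", data)))


def html_to_rtf(html):
    """Convert a small HTML fragment to an RTF document string."""
    parser = HtmlToRtf()
    parser.feed(html)
    parser.close()
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Helvetica;}}\\f0 "
    return header + "".join(parser.parts) + "}"
```

The result is pure ASCII, so it can be written out with `open(path, "w", encoding="ascii")`. Tables, images, and CSS would need considerably more work, which is where a library such as PyRTF might come in.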

<mike
--
Mike Meyer <(E-Mail Removed)> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
 
Stephen Thorne
      12-17-2004
On Thu, 16 Dec 2004 19:30:37 +0000 (UTC), Axel Straschil
<(E-Mail Removed)> wrote:
> > That's easy. Load the HTML in MS Word, and save it as RTF. Script it
> > via COM using the python win32all (I think that's what it's now
> > called) package.

>
> As I wrote in my posting and in the subject: Linux.
> I could try to do this with OpenOffice, but I'm afraid that would not
> be a performant solution ;-(
> I really spent hours on this; the only thing I found was a
> specification for Rich Text, which may be a good starting point for a project ...


I've successfully gotten Konqueror to generate a PDF from an
HTML file via dcop. It's something along the lines of:
% dcop konqueror-25827 html-widget1 print 1
You can launch Konqueror in an Xvfb (X Virtual Framebuffer), then
communicate via dcop to send commands to the browser (load this URL,
print this page, etc.).
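
A rough sketch of driving that dcop call from Python, assuming a KDE 3-era system where the `dcop` command-line tool is installed and a Konqueror instance is already running (for example under Xvfb). The function names are hypothetical; the argument list simply mirrors the command line quoted above.

```python
import subprocess


def dcop_print_command(konqueror_pid, widget="html-widget1"):
    """Build the dcop argv that asks a running Konqueror to print its page."""
    return ["dcop", "konqueror-%d" % konqueror_pid, widget, "print", "1"]


def print_current_page(konqueror_pid):
    """Send the print request; raises CalledProcessError if dcop fails."""
    subprocess.run(dcop_print_command(konqueror_pid), check=True)
```

The process ID is part of the dcop application name, so it has to be looked up for each Konqueror instance.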

I've been investigating doing the same thing using JS/XUL/etc. in
Mozilla. It is probably possible. There's lots of documentation about
the XPCOM API available from http://xulplanet.com/

As for converting to RTF, someone has already pointed out PyRTF.

Regards,
Stephen Thorne
 
Axel Straschil
      12-17-2004
Hello!

> I've successfully gotten Konqueror to generate a PDF from an
> HTML file via dcop. It's something along the lines of:


For that, I'm using htmldoc (http://www.htmldoc.org/).

Regards,
AXEL.

 
Axel Straschil
      12-17-2004
Hello!

> You might take a look at PyRTF in PyPI. It's still in beta,


I think PyRTF would be the right choice, thanks. I just had a quick look
at it.

Regards,
AXEL.

 
Stephen Thorne
      12-18-2004
On Fri, 17 Dec 2004 07:55:10 +0000 (UTC), Axel Straschil
<(E-Mail Removed)> wrote:
> Hello!
>
> > I've successfully gotten Konqueror to generate a PDF from an
> > HTML file via dcop. It's something along the lines of:

>
> For that, I'm using htmldoc (http://www.htmldoc.org/).


I found htmldoc and every other open-source, purpose-built HTML-to-PDF
converter deficient enough to discourage us from using them. For
our requirements, only web browsers had the rendering quality we
needed.

Stephen.
 