Multiple PDF, PPT, DOC to html or text conversion

 
 
osiceanu
      02-21-2008
Hello,

I have an ASP.NET application that stores PDF files and Word documents in a
database. The problem appears when I try to show a preview of a document on
the aspx page, which means converting the document to HTML or text. Is
there a method for doing this? Keeping the images or the formatting of the
document is not necessary.
If that is not possible, maybe an image preview of the document (i.e.
its first page) would be more suitable and easier.

Thanks in advance!
 
Lasse Vågsæther Karlsen
      02-21-2008
Perhaps not an "optimal" solution in terms of resource usage on the
server, but could you use the Office 2007 COM objects for this?

A PDF document you can easily embed into a page.

A Word document you could, on the server, load into the Word
application, save as a temporary pdf file, and then embed that into the
page.

If that puts too much load on the server, you could tag new
documents in the database as "must be rendered to pdf", and then run a job
at intervals that does the same thing, i.e. loads the Word document into
Word, saves it as PDF, and then uploads the PDF to the database as an
alternate representation of the Word document.
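
The tag-and-process job described above can be sketched roughly as follows. This is only an illustration, written in Python to keep it short, and every name in it is hypothetical: the `documents` table, the `needs_pdf` and `pdf_content` columns, and the `render_pending` function are assumptions, not part of any real schema. The converter is passed in as a parameter, and `fake_convert` is a stand-in that produces no real PDF.

```python
import sqlite3
from typing import Callable


def render_pending(conn: sqlite3.Connection,
                   convert: Callable[[bytes], bytes]) -> int:
    """Convert every document still flagged as pending; return how many succeeded."""
    rows = conn.execute(
        "SELECT id, content FROM documents WHERE needs_pdf = 1"
    ).fetchall()
    done = 0
    for doc_id, content in rows:
        try:
            pdf = convert(content)
        except Exception:
            # Leave the flag set so the next scheduled run retries this document.
            continue
        conn.execute(
            "UPDATE documents SET pdf_content = ?, needs_pdf = 0 WHERE id = ?",
            (pdf, doc_id),
        )
        done += 1
    conn.commit()
    return done


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with two documents waiting for conversion."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, "
        "content BLOB NOT NULL, "
        "pdf_content BLOB, "
        "needs_pdf INTEGER NOT NULL DEFAULT 1)"
    )
    conn.executemany(
        "INSERT INTO documents (content) VALUES (?)",
        [(b"first",), (b"second",)],
    )
    conn.commit()
    return conn


def fake_convert(data: bytes) -> bytes:
    # Placeholder only: a real job would call an actual converter here.
    return b"%PDF-stub " + data


if __name__ == "__main__":
    db = make_demo_db()
    print(render_pending(db, fake_convert), "document(s) rendered")
```

A scheduler (a timer, a service, or a scheduled task) would call `render_pending` at whatever interval suits the load. Because the flag is cleared only after a successful conversion, a failed document is simply picked up again on the next run.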

You mention that you want to convert it to HTML or text. Is this a
must-have criterion? If it is, you need either a
server component that can output HTML from PDF and Word files (Word 2007 can
do this for the Word file), or a similar interval-based
rendering of the files to HTML.

Third-party class libraries exist that do either, and while I don't
know the current state of PDF libraries that would fit, I do know that
the only way to support all the features of the Word application is to
use Word itself.

As for showing only the text, you can probably use such third-party
libraries. TX Text Control can be used to grab the text from a Word
file, and there are probably similar tools for PDF. Do note, however, that
PDF is a format designed for printing. I've seen badly formed PDF files
that consist of words on a page, where the words are not laid out
line by line or sentence by sentence but are simply placed on the
page at the right spots. Text grabbed from such a document would
most likely not look good.
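
To illustrate the positioning problem: if an extractor hands back words with page coordinates instead of lines, the lines have to be rebuilt by grouping words with a similar vertical position. The sketch below assumes hypothetical input tuples of `(x, y, text)` with `y` growing down the page. It is not the API of any particular PDF library, and the `y_tolerance` value is an arbitrary guess.

```python
from typing import List, Tuple

# Hypothetical word record: (x, y, text), with y increasing down the page.
Word = Tuple[float, float, str]


def words_to_lines(words: List[Word], y_tolerance: float = 2.0) -> List[str]:
    """Group positioned words into text lines by their vertical position."""
    lines: List[List[Word]] = []
    # Visit words top to bottom, then left to right.
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        # Compare against the first word of the current line to decide
        # whether this word sits on the same baseline.
        if lines and abs(word[1] - lines[-1][0][1]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right and join them with spaces.
    return [
        " ".join(w[2] for w in sorted(line, key=lambda w: w[0]))
        for line in lines
    ]


if __name__ == "__main__":
    scattered = [
        (50.0, 100.0, "world"),
        (10.0, 100.5, "Hello"),
        (10.0, 120.0, "second"),
        (60.0, 119.5, "line"),
    ]
    for line in words_to_lines(scattered):
        print(line)
```

This handles only the simple case. Multi-column layouts, rotated text, or words split into individual glyphs would need more work, which is why text from such files often comes out badly.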

--
Lasse Vågsæther Karlsen
http://presentationmode.blogspot.com/
PGP KeyID: 0xBCDEA2E3
 
osiceanu
      02-21-2008
Thank you for your response!

I was trying to do something like Google's "View as HTML" for
documents such as pdf, doc, ppt and xls. This component would also be used
for searching the site, returning the documents that
contain the search text as results.
Another approach would be to keep the documents on the hard disk and
store only references to them in the db. But for
searching I would also have to store the text of the documents.
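
The second approach (files on disk, with references and extracted text in the database) could look roughly like the sketch below. The `document_text` table and the function names are hypothetical, and the text is assumed to have been extracted already by some other component. A plain `LIKE` match stands in for whatever full-text search the real database offers.

```python
import sqlite3
from typing import List


def create_index(conn: sqlite3.Connection) -> None:
    """Create the table that maps a file path to its extracted text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document_text ("
        "path TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )


def index_document(conn: sqlite3.Connection, path: str, text: str) -> None:
    """Store or refresh the extracted text for one document on disk."""
    conn.execute(
        "INSERT OR REPLACE INTO document_text (path, body) VALUES (?, ?)",
        (path, text),
    )


def search(conn: sqlite3.Connection, term: str) -> List[str]:
    """Return the paths of all documents whose text contains the term."""
    # Escape LIKE wildcards so the user's input is matched literally.
    escaped = (
        term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT path FROM document_text "
        "WHERE body LIKE ? ESCAPE '\\' ORDER BY path",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    create_index(db)
    index_document(db, "docs/a.pdf", "Quarterly sales report")
    index_document(db, "docs/b.doc", "Meeting notes about sales and hiring")
    index_document(db, "docs/c.ppt", "Holiday schedule")
    print(search(db, "sales"))
```

Only the path and the plain text live in the database, so the preview page can still stream the original file from disk while the search runs entirely against the stored text.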
 
 
Nicholas Paldino [.NET/C# MVP]
      02-21-2008
Automating Word in a server environment is a very bad solution.
See the following thread as to why:

http://groups.google.com/group/micro...7b1b19ebfa0e34

Additionally, that thread references the following link, which explains why
MS says it is a bad idea to automate Word in a server environment:

http://support.microsoft.com/default...b;EN-US;257757

--
- Nicholas Paldino [.NET/C# MVP]
 
Mark Rae [MVP]
      02-21-2008
Lasse Vågsæther Karlsen wrote:
> Perhaps not an "optimal" solution in terms of resource usage on the
> server, but could you use the Office 2007 COM objects for this?


Under no circumstances should server-side Office automation be attempted:
http://support.microsoft.com/default...US;q257757#kb2
http://support.microsoft.com/default.aspx/kb/288367
http://www.aspose.com/Products/Aspos...utomation.html

Use this instead:
http://www.aspose.com/Products/Aspose.Words/


--
Mark Rae
ASP.NET MVP
http://www.markrae.net

 