Velocity Reviews - Computer Hardware Reviews

MS Word to XHTML

 
 
Joris Gillis
 
      09-11-2005
Hi,

At 12:19:53 on Sunday 11 September 2005, in the groups {microsoft.public.word.vba.general,microsoft.public.word.docmanagement,alt.html,comp.text.xml}, Alan J. Flavell <(E-Mail Removed)> wrote:

>> Word XP and upwards stores its documents in XML format doesn't it?

>
> So what? XML is only a format for defining markup. If the markup
> doesn't do anything meaningful (specifically - if it only creates a
> visual result on a printed page, without having any significant
> structure) then it's not going to turn into effective HTML: it'd just
> be the usual garbage in / garbage out that we're accustomed to with
> Word conversions to soi-disant "web" format.
>
>> You could probably write your own XSLT to turn it into HTML fairly
>> easily.

>
> There seems to be some kind of conceptual disconnect here. Most Word
> documents (in my experience) simply don't contain the necessary
> structure for useful conversion to HTML: they've been created as a
> purely visual construction for printing onto paper. It's irrelevant
> what underlying technology you use (RTF, XML, SGML, whatever) - the
> problem is that the source material simply does not represent the
> needed structures, *because the document authors do not put it there*.
>
> You might as well try to convert cheese into fresh cream: both are
> fine milk products, it's true, but instead of trying to convert the
> one into the other, you'd do better to produce them both starting from
> fresh milk. And the kind of "fresh milk" that's needed here is
> logically structured text markup. Not visual formatting. Until the
> authors of Word documents can grasp that, the prospects for conversion
> of Word to web formats are poor, IMHO.


I warmheartedly applaud your brilliant analysis. You stated your point very clearly.

It's depressing to see what a tiny percentage of people realize (or bother with) the importance of structural markup.

The future does not look bright. I have seen so-called 'IT-classes' where they make innocent people believe they are IT-experts when they can change the background color of characters typed in Word...

regards,
--
Joris Gillis (http://users.telenet.be/root-jg/me.html)
Spread the wiki (http://www.wikipedia.org)
 
 
 
 
 
SpaceGirl
 
      09-11-2005
Roy Schestowitz wrote:
> __/ [Alan J. Flavell] on Sunday 11 September 2005 11:19 \__
>
>
>>On Sun, 11 Sep 2005, SpaceGirl wrote:
>>
>>
>>>Alan J. Flavell wrote:

>>
>>[comprehensive quote of my posting, without apparently having anything
>>relevant to say about it.]
>>
>>
>>>Word XP and upwards stores its documents in XML format doesn't it?

>>
>>So what? XML is only a format for defining markup. If the markup
>>doesn't do anything meaningful (specifically - if it only creates a
>>visual result on a printed page, without having any significant
>>structure) then it's not going to turn into effective HTML: it'd just
>>be the usual garbage in / garbage out that we're accustomed to with
>>Word conversions to soi-disant "web" format.


Word documents, being style based, are easy to convert. Use XSLT to
strip out all the crap so that all you end up with is basic HTML - <p>'s
and <h>'s. I wasn't suggesting that anything more complicated than that
should be attempted - but I HAVE seen it done pretty successfully with
Word 2003 files. In the case of that client (although I wasn't part of
the team who wrote those tools), their customers would submit Word
documents and the XSLT would convert them into both HTML and PDFs, and
the reproduction was almost perfect (styling and colours anyway).
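
For illustration only, here is a rough sketch of the style-based mapping described above, written in Python rather than XSLT. It is not the client's tool. It assumes a simplified Word 2003 XML fragment in which headings carry a paragraph style such as "Heading1"; the STYLE_TO_TAG table, the function name and the sample document are all made up for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace used by Word 2003 XML (WordprocessingML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Hypothetical mapping from Word paragraph style names to HTML tags.
STYLE_TO_TAG = {"Heading1": "h1", "Heading2": "h2", "Heading3": "h3"}


def paragraphs_to_html(wordml: str) -> str:
    """Turn each w:p into <hN> or <p>, keeping text and dropping formatting."""
    root = ET.fromstring(wordml)
    out = []
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        style_name = style.get(f"{{{W}}}val") if style is not None else None
        # Paragraphs without a known style fall back to a plain <p>.
        tag = STYLE_TO_TAG.get(style_name, "p")
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if text.strip():
            out.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(out)


# Made-up sample: one styled heading, one directly formatted paragraph.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Report</w:t></w:r></w:p>
    <w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Plain </w:t></w:r><w:r><w:t>text &amp; more</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

if __name__ == "__main__":
    print(paragraphs_to_html(SAMPLE))
```

This only works as well as the styles in the source document: a paragraph with no recognised style comes out as a plain <p>, whatever it looked like on the page.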

>>>You could probably write your own XSLT to turn it into HTML fairly
>>>easily.

>>
>>There seems to be some kind of conceptual disconnect here. Most Word
>>documents (in my experience) simply don't contain the necessary
>>structure for useful conversion to HTML: they've been created as a
>>purely visual construction for printing onto paper. It's irrelevant
>>what underlying technology you use (RTF, XML, SGML, whatever) - the
>>problem is that the source material simply does not represent the
>>needed structures, *because the document authors do not put it there*.


That wasn't what I saw, but like I said I wasn't on that team. As far as
I could tell they wrote a simple parser.

>>You might as well try to convert cheese into fresh cream: both are
>>fine milk products, it's true, but instead of trying to convert the
>>one into the other, you'd do better to produce them both starting from
>>fresh milk. And the kind of "fresh milk" that's needed here is
>>logically structured text markup. Not visual formatting. Until the
>>authors of Word documents can grasp that, the prospects for conversion
>>of Word to web formats are poor, IMHO.


Strange, as I've never had a problem. Generally I have to do it in a
sort of round-robin of programs; First save your Word documents as PDF,
then save the PDF as a web page. It works just fine.

<snip stuff I can't be bothered to read, seeing as everyone else is being
so ****ing rude>


--


x theSpaceGirl (miranda)

# lead designer @ http://www.dhnewmedia.com #
# remove NO SPAM to email, or use form on website #
# this post (c) Miranda Thomas 2005
# explicitly no permission given to Forum4Designers
# to duplicate this post.
 
 
 
 
 
Roy Schestowitz
 
      09-11-2005
__/ [SpaceGirl] on Sunday 11 September 2005 20:46 \__

> Roy Schestowitz wrote:
>> __/ [Alan J. Flavell] on Sunday 11 September 2005 11:19 \__
>>
>>
>>>On Sun, 11 Sep 2005, SpaceGirl wrote:
>>>
>>>
>>>>Alan J. Flavell wrote:
>>>
>>>[comprehensive quote of my posting, without apparently having anything
>>>relevant to say about it.]
>>>
>>>
>>>>Word XP and upwards stores its documents in XML format doesn't it?
>>>
>>>So what? XML is only a format for defining markup. If the markup
>>>doesn't do anything meaningful (specifically - if it only creates a
>>>visual result on a printed page, without having any significant
>>>structure) then it's not going to turn into effective HTML: it'd just
>>>be the usual garbage in / garbage out that we're accustomed to with
>>>Word conversions to soi-disant "web" format.

>
> Word documents, being style based, are easy to convert. Use XSLT to
> strip out all the crap so that all you end up with is basic HTML - <p>'s
> and <h>'s. I wasn't suggesting that anything more complicated than that
> should be attempted - but I HAVE seen it done pretty successfully with
> Word 2003 files. In the case of that client (although I wasn't part of
> the team who wrote those tools), their customers would submit Word
> documents and the XSLT would convert them into both HTML and PDFs, and
> the reproduction was almost perfect (styling and colours anyway).
>
>>>>You could probably write your own XSLT to turn it into HTML fairly
>>>>easily.
>>>
>>>There seems to be some kind of conceptual disconnect here. Most Word
>>>documents (in my experience) simply don't contain the necessary
>>>structure for useful conversion to HTML: they've been created as a
>>>purely visual construction for printing onto paper. It's irrelevant
>>>what underlying technology you use (RTF, XML, SGML, whatever) - the
>>>problem is that the source material simply does not represent the
>>>needed structures, *because the document authors do not put it there*.

>
> That wasn't what I saw, but like I said I wasn't on that team. As far as
> I could tell they wrote a simple parser.



I believe that's possible, but it depends on the standard that the author
sticks to. Word does not /force/ the author to add structural information.
Hence, hacks are allowed which leave parts of the document dangling without any structure.
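
To make the point concrete, here is a small Python sketch that flags paragraphs which only look like headings. It assumes Word 2003 XML input and uses a deliberately crude heuristic (no paragraph style, every run bold); the function names and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 XML (WordprocessingML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def looks_like_fake_heading(p: ET.Element) -> bool:
    """True if the paragraph has no named style but every run is bold."""
    if p.find("w:pPr/w:pStyle", NS) is not None:
        return False  # the author did supply structural information
    runs = p.findall("w:r", NS)
    return bool(runs) and all(r.find("w:rPr/w:b", NS) is not None for r in runs)


def fake_headings(wordml: str) -> list[str]:
    """Return the text of every paragraph flagged by the heuristic above."""
    root = ET.fromstring(wordml)
    found = []
    for p in root.iter(f"{{{W}}}p"):
        if looks_like_fake_heading(p):
            found.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return found


# Made-up sample: a real heading, a bold-only "heading", and body text.
DOC = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Summary</w:t></w:r></w:p>
    <w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Budget</w:t></w:r></w:p>
    <w:p><w:r><w:t>Costs rose last year.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

if __name__ == "__main__":
    print(fake_headings(DOC))
```

A converter can at best guess at such paragraphs. The structure was never recorded, so any heading level assigned to them is an assumption rather than something read from the file.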


>>>You might as well try to convert cheese into fresh cream: both are
>>>fine milk products, it's true, but instead of trying to convert the
>>>one into the other, you'd do better to produce them both starting from
>>>fresh milk. And the kind of "fresh milk" that's needed here is
>>>logically structured text markup. Not visual formatting. Until the
>>>authors of Word documents can grasp that, the prospects for conversion
>>>of Word to web formats are poor, IMHO.

>
> Strange, as I've never had a problem. Generally I have to do it in a
> sort of round-robin of programs; First save your Word documents as PDF,
> then save the PDF as a web page. It works just fine.



I have had bad experiences converting PDFs to HTML. I even wrote about this
very conversion <http://schestowitz.com/Weblog/archives/2005/05/24/pdf-to-html/>
because I found it frustrating. PDF positions its objects to fit the
medium, e.g. A4 paper, so it is bound to lose what is necessary for a
good conversion.


> <snip stuff I can't be bothered to read, seeing as everyone else is being
> so ****ing rude>



Are you referring to me? Did I say anything rude? Please clarify if
possible.

Roy
 
 
Alan J. Flavell
 
      09-12-2005
On Sun, 11 Sep 2005, Roy Schestowitz wrote:

> To suggest ways forward, I suggest that
> the OP, who clearly wants to publish material on the Web, learns LaTeX.


Well, this drifts somewhat off the topic of some of the crossposted
groups, but our physicists are accustomed to writing their
publications in some form of latex, and I can say that when I was
handling the web-ifying of their publications, several years back, I
was (for the most part) getting good results from a program called
latex2html, and most problems were attributable to identifiable
causes, none of which were usually a major hindrance. (Back then we
had to make do with the deplorable HTML version called HTML/3.2, but,
aside from that, the principles seemed right).

> Should the idea of editing raw text become daunting, I suggest LyX
> < lyx.org > [LyX: Front-end to LaTeX]. 5 minutes with LyX would help
> anyone realise the difference and convey the idea, e.g. varying
> outputs, styles, imposition of structure, etc.
>
> Only a few days ago, somebody in the LyX mailing lists mentioned his
> upcoming presentation on "Word: What you See Is What a Mess".


googled!

It's really the principles which count here: but in practical terms,
I'm sure you're right in aiming at a format which promotes >doing the
right thing< by default - as opposed to one which has prominent
direct-formatting buttons on its user interface, and logical markup as
an apparently advanced topic which, I'm afraid, too many authors
seem to disdain learning.

all the best
 
 
Roy Schestowitz
 
      09-12-2005
[Groups distribution reduced]

__/ [Alan J. Flavell] on Monday 12 September 2005 17:33 \__

> On Sun, 11 Sep 2005, Roy Schestowitz wrote:
>
>> To suggest ways forward, I suggest that
>> the OP, who clearly wants to publish material on the Web, learns LaTeX.

>
> Well, this drifts somewhat off the topic of some of the crossposted
> groups, but our physicists are accustomed to writing their
> publications in some form of latex, and I can say that when I was
> handling the web-ifying of their publications, several years back, I
> was (for the most part) getting good results from a program called
> latex2html, and most problems were attributable to identifiable
> causes, none of which were usually a major hindrance. (Back then we
> had to make do with the deplorable HTML version called HTML/3.2, but,
> aside from that, the principles seemed right).



I use latex2html almost religiously. I estimate that about 1000 pages in my
site are in one form or another a product of latex2html, which has always
produced better output than lyx2html, for example. I discussed latex2html
in depth a couple of days ago and I continue to promote it.


>> Should the idea of editing raw text become daunting, I suggest LyX
>> < lyx.org > [LyX: Front-end to LaTeX]. 5 minutes with LyX would help
>> anyone realise the difference and convey the idea, e.g. varying
>> outputs, styles, imposition of structure, etc.
>>
>> Only a few days ago, somebody in the LyX mailing lists mentioned his
>> upcoming presentation on "Word: What you See Is What a Mess".

>
> googled!
>
> It's really the principles which count here: but in practical terms,
> I'm sure you're right in aiming at a format which promotes >doing the
> right thing< by default - as opposed to one which has prominent
> direct-formatting buttons on its user interface, and logical markup as
> an apparently advanced topic which, I'm afraid, too many authors
> seem to disdain learning.
>
> all the best



Only last night I was in a similar position involving my supervisor who
heads the Computer Science Department [I believe it is sensible to make
this public given the nature of the discussion]. For a Windows-centric
person like himself, who uses Office almost exclusively, it was difficult
to satisfy a Linux-dominated department. Conversion of a Word document to
HTML, also to be embedded in E-mail (I must bite my tongue) was never a
good idea. The final outcome is a PDF attachment with hyperlinks. My
arguments about standards, structure-based composition and the like seem to
have led to this result, which I suspect many will be satisfied with.

Best Wishes,

Roy

--
Roy S. Schestowitz | "Avoid missing ball for higher score"
http://Schestowitz.com | SuSE Linux | PGP-Key: 74572E8E
6:10pm up 18 days 13:16, 3 users, load average: 0.66, 0.29, 0.29
 
 
Peter Flynn
 
      09-13-2005
Toby Inkster wrote:

> Alan J. Flavell wrote:
>
>> You might as well try to convert cheese into fresh cream: both are
>> fine milk products, it's true, but instead of trying to convert the
>> one into the other, you'd do better to produce them both starting from
>> fresh milk.

>
> That is a very nice analogy -- I must try to remember it.


The others in common use are

Turning hamburgers back into cows
Turning scrambled eggs back into chickens

///Peter

 
 
 
 