Flat HTML headers to nested XML sections

 
 
CrazyAtlantaGuy
      05-16-2007
I am working on an XSLT stylesheet that transforms HTML into an XML
format that can be imported into FrameMaker. The challenge, it turns
out, is correctly transforming the flat HTML header tags (<h1>, <h2>,
etc.) into nested sections inside the XML. I have made significant
progress, but have run into a roadblock.

Here is an example of my input HTML:

<html><body>
<p>abc abc</p>
<h1 class='header'>A</h1>
<p>A abc abc</p>
<h2 class='header'>B</h2>
<p>B abc abc</p>
<h3 class='header'>C</h3>
<p>Cabc abc</p>
<h2 class='header'>D</h2> <!-- this is missing in the output -->
<p>D abc abc</p> <!-- this is missing in the output -->
<h1 class='header'>E</h1>
<p>E abc abc</p>
</body></html>

Here is an example of the output. You'll notice that the <h2>D</h2>
section is missing.

<?xml version="1.0" encoding="UTF-8"?>
<article>
<title/>
<para>abc abc</para>
<section depth="1" id="A">
<title>A</title>
<para>A abc abc</para>
<section depth="2" id="B">
<title>B</title>
<para>B abc abc</para>
<section depth="3" id="C">
<title>C</title>
<para>C abc abc</para>
</section>
</section>
</section>
<section depth="1" id="E">
<title>E</title>
<para>E abc abc</para>
</section>
</article>

The problem is that my code currently applies templates to all
nodes following a header whose nearest preceding header is that same
header. For this reason, when a header follows another header that is
not its parent (like the <h2> D following the <h3> C), it and its
content are never output. What I don't understand is how to fix it.
Any help would be much appreciated. I'm not really an XSL guru, so I'm
doing the best I can to get through this.

Here is the relevant code from my xsl:

<xsl:template match="body">
<article>
<title>
<xsl:value-of select="$docTitle" />
</title>

<xsl:for-each select='child::*[not(preceding-sibling::*[@class="header"])]
[not(@class="header")]'>
<xsl:apply-templates select="."/>
</xsl:for-each>

<xsl:variable name='depth'
select='substring(name(child::*[@class="header"][1]),2)'/>
<xsl:for-each select='child::*[@class="header"]
[substring(name(),2)&lt;=$depth]'>
<xsl:apply-templates select="."/>
</xsl:for-each>

</article>
</xsl:template>

<xsl:template match="h1 | h2 | h3 | h4 | h5">
<xsl:call-template name="header">
<xsl:with-param name="depth" select="substring(name(),2)"/>
</xsl:call-template>
</xsl:template>

<xsl:template name="header">
<xsl:param name="depth"/>
<section>
<xsl:attribute name="depth">
<xsl:value-of select="$depth"/>
</xsl:attribute>

<xsl:attribute name="id">
<xsl:value-of select="translate(.,' ','')" />
</xsl:attribute>
<title><xsl:value-of select="."/></title>

<xsl:variable name='thisHeader' select='generate-id(.)'/>
<xsl:for-each select='following-sibling::*
[$thisHeader=generate-id(preceding-sibling::*[@class="header"][last()])]
[not(@class="header") or (@class="header" and substring(name(),2)>=$depth)]'>
<xsl:apply-templates select="."/>
</xsl:for-each>

</section>

</xsl:template>

 
Peter Flynn
      05-16-2007
CrazyAtlantaGuy wrote:
> I am working on creating an XSLT that transforms Html into an XML
> format that can be imported into Framemaker. The challenge, it turns
> out, is correctly transforming the flat html header tags (<H1>, <H2>,
> etc) into nested sections inside the xml.


This is called encapsulation, and there's a much neater way than writing
XSLT to try and reach-forward-down-the-tree-up-to-but-not-including the
next H1/H2/H3/etc.

1. Run Tidy to make the HTML into well-formed XHTML (tidy -nc -asxml)

2. Write a short script to turn the XHTML back into valid SGML
(remove NETs, namespaces)

3. Apply a DocType Declaration for the ISO 15445 HTML DTD, which
includes a DIV1/DIV2 containment structure, in "preparation" mode
(declare % Preparation as INCLUDE in the internal subset and use
pre-html as the declared root element type)

4. Run osgmlnorm to normalize the document: this adds the missing
markup, switches single quotes to double where possible, etc

<!doctype pre-html
public "ISO/IEC 15445:2000//DTD HyperText Markup Language//EN" [
<!entity % Preparation "include" >
]>
<PRE-HTML>
<HEAD>
<META CONTENT="HTML Tidy for Linux/x86 (vers 1 September 2005), see
www.w3.org" NAME="GENERATOR">
<TITLE></TITLE>
</HEAD>
<BODY>
<P>abc abc</P>
<H1 CLASS="header">A</H1>
<DIV1>
<P>A abc abc</P>
<H2 CLASS="header">B</H2>
<DIV2>
<P>B abc abc</P>
<H3 CLASS="header">C</H3>
<DIV3>
<P>Cabc abc</P>
</DIV3>
</DIV2>
<H2 CLASS="header">D</H2>
<DIV2>
<P>D abc abc</P>
</DIV2>
</DIV1>
<H1 CLASS="header">E</H1>
<DIV1>
<P>E abc abc</P>
</DIV1>
</BODY>
</PRE-HTML>

You can easily mess with the Preparation structure in the DTD if you
don't like the way they did it (I don't).

///Peter
 
Joe Kesselman
      05-17-2007
You could try adapting something from the XSLT FAQ. Likely candidates
would be
http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e424
or
http://www.dpawson.co.uk/xsl/sect2/N...tml#d5891e1051

Some of the other examples on that page may also be adaptable to this
question.

(It's always worth checking Dave's page; he has done an excellent job of
collecting useful answers from XSL-List, which is unofficial but has
been in existence since before XSL was a Recommendation and has had
participation by a lot of XSL's architects and implementers. I still try
to keep half an eye on that list, though I must admit I don't watch it
as closely as I should.)
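
As a rough illustration of the kind of grouping involved (this is not
taken from the FAQ pages above, and it is an untested sketch rather than
a finished stylesheet), one XSLT 1.0 approach is to index every child of
body by its "owner": plain content is owned by the nearest preceding
header, and a header is owned by the nearest preceding header of a lower
level. Assumptions: the input is un-namespaced HTML with lowercase
element names, headers carry class="header" as in the question, <p> maps
to <para> as in the sample output, and docTitle is passed in by the
caller. The key name "owned" is made up for this example. It also relies
on current() inside xsl:key/@use referring to the node being indexed,
which is how XSLT 1.0 defines it, but check it with your processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Drop whitespace-only text between block-level elements. -->
  <xsl:strip-space elements="html body"/>

  <!-- Assumed to be supplied by the caller, as in the original stylesheet. -->
  <xsl:param name="docTitle"/>

  <!-- "owned" is a hypothetical key name.
       Ordinary content is keyed by the id of the nearest preceding header.
       A header is keyed by the id of the nearest preceding header with a
       lower level (an h2 belongs to the closest earlier h1, and so on).
       Nodes with no owner get the empty string as their key value. -->
  <xsl:key name="owned" match="body/*"
           use="generate-id(preceding-sibling::*[@class='header']
                [not(current()/@class='header')
                 or substring(name(), 2) &lt; substring(name(current()), 2)][1])"/>

  <!-- Ignore the HTML head so its text does not leak into the output. -->
  <xsl:template match="head"/>

  <xsl:template match="body">
    <article>
      <title><xsl:value-of select="$docTitle"/></title>
      <!-- Top level: content before the first header, plus ownerless headers. -->
      <xsl:apply-templates select="key('owned', '')"/>
    </article>
  </xsl:template>

  <xsl:template match="body/*[@class='header']">
    <section depth="{substring(name(), 2)}" id="{translate(., ' ', '')}">
      <title><xsl:value-of select="."/></title>
      <!-- Everything owned by this header, in document order. -->
      <xsl:apply-templates select="key('owned', generate-id())"/>
    </section>
  </xsl:template>

  <!-- Assumed mapping, inferred from the sample output in the question. -->
  <xsl:template match="p">
    <para><xsl:apply-templates/></para>
  </xsl:template>

</xsl:stylesheet>
```

With the sample input from the question, the <h2> D is owned by A (the
nearest earlier header with a lower level), so D and its paragraph
should end up inside section A, after section B. Because ownership looks
for any lower level rather than exactly one level up, a skipped level
(an <h3> directly under an <h1>) is nested rather than dropped.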

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
 
CrazyAtlantaGuy
      05-22-2007
On May 17, 12:37 am, Joe Kesselman wrote:
> You could try adapting something from the XSLT FAQ. Likely candidates
> would be http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e424
> or http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e1051
>
> Some of the other examples on that page may also be adaptable to this
> question.
>
> (It's always worth checking Dave's page; he has done an excellent job of
> collecting useful answers from XSL-List, which is unofficial but has
> been in existence since before XSL was a Recommendation and has had
> participation by a lot of XSL's architects and implementers. I still try
> to keep half an eye on that list, though I must admit I don't watch it
> as closely as I should.)
>
> --
> () ASCII Ribbon Campaign | Joe Kesselman
> /\ Stamp out HTML e-mail! | System architexture and kinetic poetry


Thanks for the help!

 