HTML Parsing Question

 
 
Stefan Kleineikenscheidt
      12-31-2006
Hi all,

I'm trying to convert an HTML page to a hierarchical structure, but I am
stuck. Consider a page like this:

<h1>First Heading1</h1>
<p>some text</p>
<p>more text</p>

<h2>First Heading2</h2>
<p>more text</p>

<h2>Second Heading2</h2>
...
<h1>Second Heading1</h1>
...
<h2>Third Heading2</h2>
...


Now I would like to convert this into a hierarchical structure like
this (think of Docbook):

<article>
|
+ <sect1>
| |
| + <sect2>
| + <sect2>
|
+ <sect1>
|
+ <sect2>

This is my 'h1' template, where I try to process all elements between
two 'h1' elements:

<xsl:template match="//h:h1">
<section>
<title><xsl:value-of select="text()" /></title>
<xsl:variable name="nexth1" select="position(parent::*/*[(name()
= 'h1')])" />
<xsl:apply-templates select="following-sibling::*[position()
&lt;= $nexth1]" />
</section>
</xsl:template>

$nexth1 should be the position of the next 'h1' element. However,
position() does not take any arguments, and I don't have a clue how to
get the position. (I need to change the context node, but I don't know
how...)

Can you give me any directions on this?

Thanks in advance,
-Stefan

 
Peter Flynn
      01-02-2007
Stefan Kleineikenscheidt wrote:
> Hi all,
>
> I'm trying to convert an HTML page to a hierarchical structure, but I am
> stuck. Consider a page like this:
>
> <h1>First Heading1</h1>
> <p>some text</p>
> <p>more text</p>
>
> <h2>First Heading2</h2>
> <p>more text</p>
>
> <h2>Second Heading2</h2>
> ...
> <h1>Second Heading1</h1>
> ...
> <h2>Third Heading2</h2>
> ...


First of all you would need to make it well-formed XHTML (use W3C Tidy
for that). This ensures that any subsequent XSLT process won't gag.

> This is my 'h1' template, where I try to process all elements between
> two 'h1' elements:
>
> <xsl:template match="//h:h1">
> <section>
> <title><xsl:value-of select="text()" /></title>
> <xsl:variable name="nexth1" select="position(parent::*/*[(name()
> = 'h1')])" />
> <xsl:apply-templates select="following-sibling::*[position()
> &lt;= $nexth1]" />
> </section>
> </xsl:template>


<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="h1|h2|h3|h4">
    <xsl:variable name="id" select="generate-id(.)"/>
    <!-- heading level: 'h2' becomes 2 -->
    <xsl:variable name="level">
      <xsl:value-of select="number(translate(name(),'h',''))"/>
    </xsl:variable>
    <xsl:variable name="gi" select="name()"/>
    <xsl:element name="{concat('sect',$level)}">
      <!-- xsl:attribute has no select attribute in XSLT 1.0 -->
      <xsl:attribute name="id"><xsl:value-of select="$id"/></xsl:attribute>
      <title>
        <xsl:apply-templates/>
      </title>
      <!-- following siblings whose nearest preceding heading of the same
           name is this one, filtered by heading level -->
      <xsl:apply-templates select="following-sibling::*
        [generate-id(preceding-sibling::*[name()=$gi][1])=$id]
        [not(substring(name(),1,1)='h' and name()!='hr' and
             number(translate(substring(name(),1,1),'h',''))&lt;$level)]
        [not(number(translate(name(preceding-sibling::*[substring(name(),1,1)='h'
             and name()!='hr'][1]),'h',''))&lt;$level)]"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="p">
    <para>
      <xsl:apply-templates/>
    </para>
  </xsl:template>

</xsl:stylesheet>

This needs some more work: it's not subsetting out the higher-level H*
element types, but I've run out of time here.

///Peter
--
XML FAQ: http://xml.silmaril.ie/
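
One way the unfinished grouping might be completed in XSLT 1.0 is with keys instead of chained predicates. In the stylesheet above, the second predicate tests translate(substring(name(),1,1),'h',''), which is always an empty string, so that filter never removes anything; this may be why higher-level headings slip through. The sketch below sidesteps the problem. It is only an illustration under several assumptions: the input elements are in no namespace (as in the stylesheet above), the headings and paragraphs are sibling children of a `body` element, and only `h1` and `h2` need handling. The key names `content` and `h2-by-h1` are made up for this example. Anything that appears before the first `h1` is dropped.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical key: every non-heading child of body, indexed by the
       id of its nearest preceding h1 or h2 -->
  <xsl:key name="content"
           match="body/*[not(self::h1 or self::h2)]"
           use="generate-id(preceding-sibling::*[self::h1 or self::h2][1])"/>

  <!-- Hypothetical key: every h2, indexed by the id of its nearest
       preceding h1 -->
  <xsl:key name="h2-by-h1"
           match="body/h2"
           use="generate-id(preceding-sibling::h1[1])"/>

  <!-- Start from the h1 elements only; everything else is pulled in
       through the keys, so nothing is processed twice -->
  <xsl:template match="/">
    <article>
      <xsl:apply-templates select="//body/h1"/>
    </article>
  </xsl:template>

  <xsl:template match="h1">
    <sect1>
      <title><xsl:apply-templates/></title>
      <!-- content directly under this h1, then its h2 subsections -->
      <xsl:apply-templates select="key('content', generate-id())"/>
      <xsl:apply-templates select="key('h2-by-h1', generate-id())"/>
    </sect1>
  </xsl:template>

  <xsl:template match="h2">
    <sect2>
      <title><xsl:apply-templates/></title>
      <xsl:apply-templates select="key('content', generate-id())"/>
    </sect2>
  </xsl:template>

  <xsl:template match="p">
    <para><xsl:apply-templates/></para>
  </xsl:template>

</xsl:stylesheet>
```

Because each element is indexed under exactly one heading, a paragraph that follows an `h2` ends up only inside that `sect2`, not also inside the enclosing `sect1`. Supporting `h3` and `h4` would need one more key per level along the same lines.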
 
Johannes Koch
      01-02-2007
Peter Flynn wrote:
> First of all you would need to make it well-formed XHTML

[...]
> <?xml version="1.0" encoding="iso-8859-1"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> version="1.0">
>
> <xsl:output method="xml" indent="yes"/>
>
> <xsl:template match="h1|h2|h3|h4">


If the source is "well-formed XHTML" you will have to deal with
namespaces as the OP already did.
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
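
To illustrate the namespace point: in XHTML every element belongs to the namespace http://www.w3.org/1999/xhtml, so an unprefixed pattern such as match="h1" selects nothing in XSLT 1.0. A minimal sketch of the adjustment follows. It assumes the `h` prefix used in the original question is meant to be bound to the XHTML namespace, and it shows only the matching, not the nesting of sections.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h"
                version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- The h: prefix is bound above (assumed to be XHTML), so these
       patterns match namespaced headings -->
  <xsl:template match="h:h1|h:h2|h:h3|h:h4">
    <!-- local-name() gives 'h1'..'h4' whatever prefix the source uses;
         dropping the first character leaves the level digit -->
    <xsl:element name="sect{substring(local-name(), 2)}">
      <title><xsl:apply-templates/></title>
    </xsl:element>
  </xsl:template>

  <xsl:template match="h:p">
    <para><xsl:apply-templates/></para>
  </xsl:template>

</xsl:stylesheet>
```

The exclude-result-prefixes attribute keeps the XHTML namespace declaration out of the generated output. Any test that uses name() in a predicate would likewise need local-name() if the source document uses a prefix for its XHTML elements.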
 