Loading a simple XHTML transitional document into an org.w3c.dom.Document

 
 
Ion Freeman
      07-09-2009
Hi!
I'm just trying to do the simplest thing in the world. Where input
is a java.io.File that contains a transitional XHTML 1.0 file, I do

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(false);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(input);

Unfortunately, this tries to pull the DTD from the W3C, and they
didn't like that. So, they give me a 503 error. I tried the
EntityResolver from http://forums.sun.com/thread.jspa?threadID=5244492,
but that just gives me a MalformedURLException. Either way, my parse
fails.

I'm sure that at least tens of thousands of people have written code
to do this, but I can't find a (working) reference online. I think
most of my XML parsing happened when the W3C would just give the DTDs
out -- I understand that they found that unworkable, but I still need
to parse my document.

How should I be doing this?

Thanks!

Ion
 
markspace
      07-09-2009
Ion Freeman wrote:
> Hi!
> I'm just trying to do the simplest thing in the world. Where input
> is a java.io.File that contains a transitional XHTML 1.0 file, [snip ....]


> Unfortunately, this tries to pull the DTD from the W3C, and they
> didn't like that. So, they give me a 503 error.



There might be some clues here:

http://www.javalobby.org/java/forums/t105916.html
 
Ion Freeman
      07-09-2009
Thanks, markspace. I did try Axiom, but it looks like I would have to
figure out how to do everything all over again -- like finding an element
by id and replacing it, which is all I really want to accomplish. I'd
really just like to get the Xerces parser to load my DTDs locally, as
opposed to erroring out on the W3C site.
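
The "find an element by id and replace it" step can be sketched with the plain org.w3c.dom API once a Document is loaded. This is only an illustration under stated assumptions: the class name ReplaceByIdDemo, its helper methods, and the sample markup are hypothetical, and the lookup walks every element and compares the id attribute directly rather than calling Document.getElementById, which only works when the parser knows from the DTD that id is an ID-typed attribute.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ReplaceByIdDemo {
    // Parse markup held in a String (hypothetical helper for this sketch).
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Linear scan over all elements; returns the first one whose id matches, or null.
    static Element findById(Document doc, String id) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (id.equals(e.getAttribute("id"))) {
                return e;
            }
        }
        return null;
    }

    // Swap the matching element for a new element that carries the same id.
    static boolean replaceById(Document doc, String id, String newTag, String newText) {
        Element old = findById(doc, id);
        if (old == null) {
            return false;
        }
        Element fresh = doc.createElement(newTag);
        fresh.setAttribute("id", id);
        fresh.setTextContent(newText);
        old.getParentNode().replaceChild(fresh, old);
        return true;
    }

    public static void main(String[] args) throws Exception {
        // No DOCTYPE here, so this sample does not touch the DTD at all.
        Document doc = parse("<html><body><p id=\"target\">old</p></body></html>");
        replaceById(doc, "target", "div", "new");
        System.out.println(findById(doc, "target").getTagName());
    }
}
```

The scan is O(n) in the number of elements, which is usually fine for a single page; it also keeps the sketch independent of whether the DTD was loaded.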

On Jul 9, 3:30 pm, markspace <nos...@nowhere.com> wrote:
> Ion Freeman wrote:
> > Hi!
> > I'm just trying to do the simplest thing in the world. Where input
> > is a java.io.File that contains a transitional XHTML 1.0 file, [snip .....]
> > Unfortunately, this tries to pull the DTD from the W3C, and they
> > didn't like that. So, they give me a 503 error.
>
> There might be some clues here:
>
> http://www.javalobby.org/java/forums/t105916.html
 
markspace
      07-09-2009
Ion Freeman wrote:
> Thanks, markspace. I did try Axiom, but it looks like I have to figure
> out how to do everything all over again -- like find an element by id
> and replace it, all I really want to accomplish. I'd really just like
> to get the Xerces parser to load my DTDs locally, as opposed to
> erroring out on the W3C site.
>
> On Jul 9, 3:30 pm, markspace <nos...@nowhere.com> wrote:
>> Ion Freeman wrote:
>>> Hi!
>>> I'm just trying to do the simplest thing in the world. Where input
>>> is a java.io.File that contains a transitional XHTML 1.0 file, [snip .....]
>>> Unfortunately, this tries to pull the DTD from the W3C, and they
>>> didn't like that. So, they give me a 503 error.
>>
>> There might be some clues here:
>>
>> http://www.javalobby.org/java/forums/t105916.html



I tried a quick little program of my own, which had a different problem
than yours did, although mine still threw a fatal error. My takeaway
from that error was that the Xerces parser just isn't going to parse the
looser syntax of a transitional HTML document. You'll have to use a
special one. The parsers built into Java all seem to be XML and nothing
else; they don't allow for HTML's funky syntax. I'm guessing, but in
the small amount of work I did that seemed to be the case.
 
Mike Schilling
      07-09-2009
Ion Freeman wrote:
> Hi!
> I'm just trying to do the simplest thing in the world. Where input
> is a java.io.File that contains a transitional XHTML 1.0 file, I do
>
> DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
> dbf.setNamespaceAware(false);
> DocumentBuilder db = dbf.newDocumentBuilder();
> Document doc = db.parse(input);
>
> Unfortunately, this tries to pull the DTD from the W3C, and they
> didn't like that. So, they give me a 503 error. I tried the
> EntityResolver from
> http://forums.sun.com/thread.jspa?threadID=5244492, but that just
> gives me a MalformedURLException. Either way, my parse fails.
>
> I'm sure that at least tens of thousands of people have written code
> to do this, but I can't find a (working) reference online. I think
> most of my XML parsing happened when the W3C would just give the DTDs
> out -- I understand that they found that unworkable, but I still need
> to parse my document.
>
> How should I be doing this?


You should be able to solve this with an entity resolver that returns an
input source containing the right DTD text. They're not that difficult to
construct; just recognize the URL and return a StringReader or
ByteArrayInputStream. Return null for any URL you don't recognize.
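
A minimal sketch of such a resolver follows, assuming the DTD text is already available in memory. The class name InMemoryDtdDemo is hypothetical, and STUB_DTD is a deliberately tiny stand-in that declares only the nbsp entity; a real setup would substitute the full text of xhtml1-transitional.dtd (and handle the entity files it references in the same way).

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class InMemoryDtdDemo {
    static final String XHTML_SYSTEM_ID =
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";

    // Assumption: a stub standing in for the real DTD text, which is far longer.
    static final String STUB_DTD = "<!ENTITY nbsp \"&#160;\">";

    static Document parse(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        EntityResolver resolver = (publicId, systemId) -> {
            if (XHTML_SYSTEM_ID.equals(systemId)) {
                // Recognized URL: serve the DTD from memory instead of the network.
                InputSource source = new InputSource(new StringReader(STUB_DTD));
                source.setSystemId(systemId);
                return source;
            }
            // Unrecognized URL: null tells the parser to resolve it the default way.
            return null;
        };
        db.setEntityResolver(resolver);
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \""
                + XHTML_SYSTEM_ID + "\">"
                + "<html><body><p>a&nbsp;b</p></body></html>";
        Document doc = parse(xml);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Because the resolver returns null for anything it does not recognize, a document that points at some other DTD would still trigger a network fetch; returning an empty InputSource instead is one way to suppress that, at the cost of silently dropping declarations.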

If you know for a fact that the parser is Xerces (it's the default in Java
1.5 and later), you could try setting the Xerces-specific features to ignore
DTDs. One suggestion is to set the standard SAX feature
http://xml.org/sax/features/external-parameter-entities to "false", though
we set
"http://apache.org/xml/features/nonvalidating/load-dtd-grammar" and
"http://apache.org/xml/features/nonvalidating/load-external-dtd" to false.
Be sure to call setValidating(false) too, though I'm pretty sure that's the
default anyway.
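
As a rough sketch of the feature-based approach, assuming the JDK's bundled Xerces-derived parser is the one in use: the two apache.org feature names are the ones quoted above, while the class name SkipDtdDemo and the sample document are hypothetical. A different JAXP implementation may reject these features with a ParserConfigurationException.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SkipDtdDemo {
    static Document parseWithoutDtd(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false); // already the default, stated for clarity
        // Xerces-specific switches: do not build a DTD grammar, do not fetch the external DTD.
        dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
        dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html><head><title>t</title></head><body><p>text</p></body></html>";
        Document doc = parseWithoutDtd(xml);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

One trade-off: with the external DTD skipped, named entities such as &nbsp; are never declared, so documents that use them are better served by a local entity resolver.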


 
Mike Schilling
      07-09-2009
markspace wrote:

>
>
> I tried a quick little program of my own, which had a different
> problem than yours did, although mine still threw a fatal error. My
> takeaway from that error was that the Xerces parser just isn't going
> to parse the looser syntax of a transitional HTML document. You'll
> have to use a special one. The parsers built into Java all seem to be
> XML and nothing else; they don't allow for HTML's funky syntax. I'm
> guessing, but in the small amount of work I did that seemed to be the
> case.


The original poster did say he's parsing xhtml, which is an XML-compatible
version of html. And DTDs (which is what's causing his problems) are a
standard and supported XML feature.


 
markspace
      07-09-2009
Mike Schilling wrote:
> markspace wrote:
>
>>
>> I tried a quick little program of my own, which had a different
>> problem than yours did, although mine still threw a fatal error. My
>> takeaway from that error was that the Xerces parser just isn't going
>> to parse the looser syntax of a transitional HTML document. You'll
>> have to use a special one. The parsers built into Java all seem to be
>> XML and nothing else; they don't allow for HTML's funky syntax. I'm
>> guessing, but in the small amount of work I did that seemed to be the
>> case.

>
> The original poster did say he's parsing xhtml, which is an XML-compatible
> version of html. And DTDs (which is what's causing his problems) are a
> standard and supported XML feature.



Theoretically, yes, but he said he was parsing a transitional document,
and I assume that means "web page." For my test, I used the home page
of http://cnn.com. It has 42 errors, according to the validator at
w3c.org. And Xerces barfed on stuff that the W3C validator passed.

My takeaway: transitional documents aren't. The OP will need a parser
specially built to deal with common errors that appear on web pages.

 
Arne Vajhøj
      07-10-2009
Ion Freeman wrote:
> I'm just trying to do the simplest thing in the world. Where input
> is a java.io.File that contains a transitional XHTML 1.0 file, I do
>
> DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
> dbf.setNamespaceAware(false);
> DocumentBuilder db = dbf.newDocumentBuilder();
> Document doc = db.parse(input);
>
> Unfortunately, this tries to pull the DTD from the W3C, and they
> didn't like that. So, they give me a 503 error. I tried the
> EntityResolver from http://forums.sun.com/thread.jspa?threadID=5244492,
> but that just gives me a MalformedURLException. Either way, my parse
> fails.
>
> I'm sure that at least tens of thousands of people have written code
> to do this, but I can't find a (working) reference online. I think
> most of my XML parsing happened when the W3C would just give the DTDs
> out -- I understand that they found that unworkable, but I still need
> to parse my document.
>
> How should I be doing this?


Download the DTD and the 3 ENT files to your hard drive and tell
the parser to use those.

See code below.

Arne

=======================================================

import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XhtmlParse {
    public static void main(String[] args) throws Exception {
        // A small XHTML 1.0 Transitional document with the standard DOCTYPE.
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n"
                + "<html>\r\n<head>\r\n<title> simple document</title>\r\n</head>\r\n"
                + "<body>\r\n<p>a simple paragraph</p>\r\n</body>\r\n</html>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Route DTD and entity lookups to the local copies.
        db.setEntityResolver(new DTDHandler());
        Document doc = db.parse(new InputSource(new StringReader(xml)));
    }
}

// Maps the W3C URLs of the DTD and its three entity files to local files.
// (Not related to org.xml.sax.DTDHandler, which is not imported here.)
class DTDHandler implements EntityResolver {
    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        if (systemId.equals("http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd")) {
            return new InputSource("C:\\xhtml1-transitional.dtd");
        } else if (systemId.equals("http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent")) {
            return new InputSource("C:\\xhtml-lat1.ent");
        } else if (systemId.equals("http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent")) {
            return new InputSource("C:\\xhtml-symbol.ent");
        } else if (systemId.equals("http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent")) {
            return new InputSource("C:\\xhtml-special.ent");
        } else {
            // Anything else falls back to the parser's default resolution.
            return null;
        }
    }
}
 
Arne Vajhøj
      07-10-2009
markspace wrote:
> Mike Schilling wrote:
>> markspace wrote:
>>> I tried a quick little program of my own, which had a different
>>> problem than yours did, although mine still threw a fatal error. My
>>> takeaway from that error was that the Xerces parser just isn't going
>>> to parse the looser syntax of a transitional HTML document. You'll
>>> have to use a special one. The parsers built into Java all seem to be
>>> XML and nothing else; they don't allow for HTML's funky syntax. I'm
>>> guessing, but in the small amount of work I did that seemed to be the
>>> case.

>>
>> The original poster did say he's parsing xhtml, which is an
>> XML-compatible version of html. And DTDs (which is what's causing his
>> problems) are a standard and supported XML feature.

>
> Theoretically, yes, but he said he was parsing a transitional document,
> and I assume that means "web page." For my test, I used the home page
> of http://cnn.com. It has 42 errors, according to the validator at
> w3c.org. And Xerces barfed on stuff that the W3C validator passed.
>
> My takeaway: transitional documents aren't. The OP will need a parser
> specially built to deal with common errors that appear on web pages.


CNN does not claim to be XHTML 1.0 Transitional.

CNN claims to be HTML 4.01 Transitional.

Difference.

There are web pages and there are web pages.

If something is valid XHTML, then it can be parsed
by an XML parser.

If something claims to be XHTML but is actually not
valid XHTML, then it may not be parseable by an
XML parser.
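
The difference can be illustrated with a small sketch. The class name WellFormedDemo and both sample strings are hypothetical; note that the failure shown here is at the level of XML well-formedness (an unclosed br element), which is a prerequisite for XHTML validity, and neither sample carries a DOCTYPE so no DTD is fetched.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WellFormedDemo {
    // Returns true if the default JAXP parser accepts the markup as XML.
    static boolean parses(String markup) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Not well-formed XML (the default error handler also logs to stderr).
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // XHTML style: every element is closed.
        String xhtmlStyle = "<html><body><p>line one<br/>line two</p></body></html>";
        // HTML 4 style: <br> and <p> are left open, which HTML allows but XML does not.
        String html4Style = "<html><body><p>line one<br>line two</body></html>";
        System.out.println(parses(xhtmlStyle));
        System.out.println(parses(html4Style));
    }
}
```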

Arne
 
markspace
      07-10-2009
Arne Vajhøj wrote:

>
> CNN does not claim to be XHTML 1.0 Transitional.
>
> CNN claims to be HTML 4.01 Transitional.
>
> Difference.



Hmm, Wikipedia said they were the same. Care to elaborate?

"XHTML 1.0 Transitional is the equivalent of HTML 4.01 Transitional, and
includes the presentational elements (such as center, font and strike)
excluded from the strict version."

http://en.wikipedia.org/wiki/XHTML
 