lxml precaching DTD for document verification.

 
 
Gelonida N
 
      11-27-2011
Hi,

I'd like to verify some (X)HTML / HTML5 / XML documents from a server.

These documents have a very limited number of different doc types / DTDs.

So what I would like to do is build a small DTD cache and some code
that would avoid fetching the DTDs from the net over and over.

What would be the best way to do this?
I guess that the fields of an ElementTree that I have to look at are
docinfo.public_id
docinfo.system_url

There's also mention of a catalogue, but I don't know how to
use a catalogue, or how to find out what is inside my catalogue
and what isn't.
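
On the catalogue question, here is a minimal sketch using only the standard library. It assumes that the libxml2 underneath lxml honours the XML_CATALOG_FILES environment variable, which it reads when catalogues are first initialised, so the variable would need to be set before the first parse. The helper names write_catalog and catalog_entries are hypothetical, as are any paths passed to them.

```python
import pathlib
import xml.etree.ElementTree as ET

# Namespace of the OASIS XML Catalogs format.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def write_catalog(catalog_path, entries):
    """Write an XML catalogue mapping DTD identifiers to local files.

    entries: iterable of (public_id, system_url, local_path) tuples.
    """
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element("{%s}catalog" % CATALOG_NS)
    for public_id, system_url, local_path in entries:
        uri = pathlib.Path(local_path).resolve().as_uri()
        if public_id:
            ET.SubElement(root, "{%s}public" % CATALOG_NS,
                          publicId=public_id, uri=uri)
        if system_url:
            ET.SubElement(root, "{%s}system" % CATALOG_NS,
                          systemId=system_url, uri=uri)
    ET.ElementTree(root).write(catalog_path, encoding="utf-8",
                               xml_declaration=True)


def catalog_entries(catalog_path):
    """Return the set of public and system identifiers in a catalogue."""
    root = ET.parse(catalog_path).getroot()
    identifiers = set()
    for element in root:
        identifier = element.get("publicId") or element.get("systemId")
        if identifier:
            identifiers.add(identifier)
    return identifiers
```

With a catalogue written this way, setting os.environ["XML_CATALOG_FILES"] to its path before lxml parses anything should let the parser find the local copies, and catalog_entries gives a way to check what is and isn't listed.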


Below is a non-working skeleton (first shot):
---------------------------------------------
Would this be the right way??

### functions with '???' are not implemented / are the ones
### where I don't know whether they exist already.

import os
import urllib.request

from lxml import etree

cache_dir = os.path.join(os.environ['HOME'], '.my_dtd_cache')

def get_from_cache(docinfo):
    """ the function which I'd like to implement most efficiently """
    fpi = docinfo.public_id
    uri = docinfo.system_url
    dtd = ???get_from_dtd_cache(fpi, uri)
    if dtd is not None:
        return dtd
    # how can I check what is in my 'catalogue'
    if ???dtd_in_catalogue(??):
        return ???get_dtd_from_catalogue???
    # not cached anywhere yet: download the DTD once and keep the file
    dtd_filename = ???create_cache_filename(docinfo)
    (fname, _headers) = urllib.request.urlretrieve(uri, dtd_filename)
    return etree.DTD(fname)


def check_doc_cached(filename):
    """ function, which should report errors
    if a doc doesn't validate.
    """
    doc = etree.parse(filename)
    dtd = get_from_cache(doc.docinfo)
    rslt = dtd.validate(doc)
    if not rslt:
        print("validate error:")
        print(dtd.error_log.filter_from_errors()[0])
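
One possible way to fill in the '???' download step of the skeleton, as a rough sketch with the standard library only. The names create_cache_filename and fetch_dtd_once, and the choice of a SHA-256 digest as the file name, are assumptions for illustration and not something lxml provides.

```python
import hashlib
import os
import urllib.request


def create_cache_filename(cache_dir, public_id, system_url):
    """Derive a stable local file name from the DTD's identifiers."""
    key = (public_id or "") + "\n" + (system_url or "")
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".dtd")


def fetch_dtd_once(cache_dir, public_id, system_url):
    """Return a local path for the DTD, downloading only on a cache miss."""
    path = create_cache_filename(cache_dir, public_id, system_url)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        tmp_path = path + ".part"
        urllib.request.urlretrieve(system_url, tmp_path)
        # Rename last, so a half-finished download never looks like a hit.
        os.replace(tmp_path, path)
    return path
```

The returned path could then be handed to etree.DTD(). One caveat: some DTDs (the XHTML 1.0 ones, for example) pull in further entity files by relative reference, and this sketch caches only the top-level file.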





 
 
 
 
 
Roy Smith
 
      11-27-2011
Gelonida N wrote:

> I'd like to verify some (X)HTML / HTML5 / XML documents from a server.


I'm sure you could roll your own validator with lxml and some DTDs, but
you would probably save yourself a huge amount of effort by just using
the validator the W3C provides (http://validator.w3.org/).
 
 
 
 
 
John Gordon
 
      11-27-2011
Roy Smith writes:

> Gelonida N wrote:
>
> > I'd like to verify some (X)HTML / HTML5 / XML documents from a server.


> I'm sure you could roll your own validator with lxml and some DTDs, but
> you would probably save yourself a huge amount of effort by just using
> the validator the W3C provides (http://validator.w3.org/).


With regards to XML, he may mean that he wants to validate that the
document conforms to a specific format, not just that it is generally
valid XML. I don't think the w3 validator will do that.
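
To illustrate the distinction, a small sketch with lxml: a document can be well-formed XML and still fail validation against a specific DTD. The note/to/body format and the conforms helper are made up for this example.

```python
from io import StringIO

from lxml import etree

# Hypothetical DTD for a custom format: <note> must hold <to>, then <body>.
NOTE_DTD = etree.DTD(StringIO(
    "<!ELEMENT note (to, body)>"
    "<!ELEMENT to (#PCDATA)>"
    "<!ELEMENT body (#PCDATA)>"
))


def conforms(xml_text):
    """Return True if xml_text is well-formed and valid against NOTE_DTD."""
    # fromstring raises XMLSyntaxError if the text is not well-formed.
    doc = etree.fromstring(xml_text)
    return NOTE_DTD.validate(doc)
```

Here conforms("<note><body>hi</body></note>") parses without complaint, since the markup is well-formed, yet should come back false because the required <to> element is missing.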

--
John Gordon                   A is for Amy, who fell down the stairs
                              B is for Basil, assaulted by bears
                                -- Edward Gorey, "The Gashlycrumb Tinies"

 
 
Gelonida N
 
      11-28-2011
On 11/27/2011 10:33 PM, John Gordon wrote:
> Roy Smith writes:
>
>> Gelonida N wrote:
>>
>>> I'd like to verify some (X)HTML / HTML5 / XML documents from a server.

>
>> I'm sure you could roll your own validator with lxml and some DTDs, but
>> you would probably save yourself a huge amount of effort by just using
>> the validator the W3C provides (http://validator.w3.org/).


This validator requires that I post the code to some host.
The content that I'd like to verify is intranet content, which I am
not allowed to post to an external site.
>
> With regards to XML, he may mean that he wants to validate that the
> document conforms to a specific format, not just that it is generally
> valid XML. I don't think the w3 validator will do that.
>



Basically I want to integrate this into a Django unit test.

I noticed that some of the templates generate documents whose
DTD headers do not match their contents.
All of the HTML code is parsable as XML (if it isn't, it's a bug).

There are also some custom XML files, which have their own specific DTDs.

So I thought about validating some of the generated HTML with lxml.

The Django test environment allows running test clients, which are
supposedly much faster than a real HTTP client.
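
A sketch of how such a check might look, assuming lxml's custom resolver mechanism (etree.Resolver with a validating XMLParser) is used to serve DTDs from local files. The class and function names, and the mapping from system URL to local path, are hypothetical.

```python
import os

from lxml import etree


class LocalDTDResolver(etree.Resolver):
    """Serve DTDs from local files instead of the network."""

    def __init__(self, local_paths):
        super().__init__()
        self.local_paths = local_paths  # system URL -> local file path

    def resolve(self, system_url, public_id, context):
        path = self.local_paths.get(system_url)
        if path and os.path.exists(path):
            return self.resolve_filename(path, context)
        return None  # unknown DTD: leave it to lxml's default handling


def make_validating_parser(local_paths):
    """Build a parser that validates against the DOCTYPE's DTD, offline."""
    parser = etree.XMLParser(dtd_validation=True, no_network=True)
    parser.resolvers.add(LocalDTDResolver(local_paths))
    return parser


def validation_errors(content, parser):
    """Return a list of error messages, empty if the markup validates."""
    try:
        etree.fromstring(content, parser)
    except etree.XMLSyntaxError as exc:
        return [str(exc)]
    return []
```

In a Django test, the bytes in response.content from the test client could be passed as content, with something like self.assertEqual(validation_errors(response.content, parser), []) as the check.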

 