Need more info about problem resolving entity reference

 
 
David Karr
      05-23-2013
I have a Cygwin Perl script that makes numerous REST API calls to a local service, parses the results from those, and makes other calls with that data. It also runs some of these calls in multiple threads, using LWP::UserAgent.

It mostly works, but I sometimes get errors like this:

-----------------------
caught error:
500 Can't connect to www.w3.org:80 (Operation now in progress) http://www.w3.org/TR/html4/strict.dtd
Handler couldn't resolve external entity at line 1, column 90, byte 92
error in processing external entity reference at line 1, column 90, byte 92:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
=========================================================================================^
<html>
<head>
at /usr/lib/perl5/vendor_perl/5.14/i686-cygwin-threads-64int/XML/Parser.pm line 187 thread 2
----------------------

That's the entire error message. I have no idea where in the script this gets called from, and I'm not really sure what this error is telling me.
 
 
David Karr
      05-24-2013
On Thursday, May 23, 2013 7:34:04 PM UTC-7, Ben Morrow wrote:
> Quoth David Karr <(E-Mail Removed)>:
> > I have a Cygwin Perl script makes numerous REST api calls to a local
> > service, parses the results from those, and makes other calls with that
> > data. It also runs some of these calls in multiple threads, using
> > LWP::UserAgent.
> >
> > It mostly works, but I sometimes get errors like this:
> >
> > -----------------------
> > caught error:
> > 500 Can't connect to www.w3.org:80 (Operation now in progress)
> > http://www.w3.org/TR/html4/strict.dtd
> > Handler couldn't resolve external entity at line 1, column 90, byte 92
> > error in processing external entity reference at line 1, column 90, byte 92:
> > <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
> > "http://www.w3.org/TR/html4/strict.dtd">
> > =========================================================================================^
> > <html>
> > <head>
> > at
> > /usr/lib/perl5/vendor_perl/5.14/i686-cygwin-threads-64int/XML/Parser.pm
> > line 187 thread 2
> > ----------------------
> >
> > That's the entire error message. I have no idea where in the script
> > this gets called from, and I'm not really sure what this error is
> > telling me.
>
> This error comes from XML::Parser. I assume you are invoking that
> directly, to parse the REST response? What's happening is that
> XML::Parser sees a DOCTYPE declaration like
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
> "http://www.w3.org/TR/html4/strict.dtd">
>
> and, like a good little SGML-derived XML parser, tries to fetch the DTD
> (using LWP) so it can validate the rest of the file. For some reason,
> when it tries to connect to www.w3.org to download the DTD file, the
> connection is failing with EINPROGRESS. Since LWP isn't expecting that
> error code, it throws an error.
>
> So, what's the real problem? Well, first, that's an HTML doctype. You
> can't, in general, parse HTML with an XML parser, so are you sure you're
> getting the responses you expect? REST services are usually pretty good
> about getting their Content-types right, so you ought to be able to
> check for an XML Content-type before passing the data to XML::Parser.

I'm certain that in these anomalous cases I'm not getting the response I expect. The problem with this error message is that it gives me absolutely no clue where in the script this is happening. I'm guessing that our back-end server gets confused in some cases, but it's hard to diagnose when I don't know what URL was being attempted, or where in the script it was done.
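
For anyone hitting the same wall, here is a rough sketch of one way to get that missing context. It assumes nothing about the real script: parse_with_context is a made-up helper name, and the parser is passed in as a code ref so the sketch does not depend on which XML module is in use. It rejects responses whose Content-Type is not XML and uses Carp's confess so that any failure carries both the URL and a stack trace.

```perl
use strict;
use warnings;
use Carp qw(confess);

# Hypothetical helper (not from the original script): refuse non-XML
# responses and attach the request URL plus a stack trace to any failure.
# $parse is a code ref wrapping whatever parser the script already uses.
sub parse_with_context {
    my ($url, $content_type, $body, $parse) = @_;
    $content_type //= '';

    # Accept application/xml, text/xml and "+xml" types such as
    # application/atom+xml; anything else (e.g. text/html) is rejected.
    unless ($content_type =~ m{\A(?:application|text)/(?:[\w.-]+\+)?xml\z}i) {
        confess "Unexpected Content-Type '$content_type' from $url";
    }

    # Trap the parser's own error and re-throw it with the URL attached.
    my $result;
    my $ok = eval { $result = $parse->($body); 1 };
    confess "Parse failed for $url: $@" unless $ok;
    return $result;
}

# Possible wiring with LWP (assumes $ua is an LWP::UserAgent and $parser is
# an XML::Parser object created elsewhere in the script):
#   my $resp = $ua->get($url);
#   my $tree = parse_with_context($url, scalar $resp->content_type,
#                                 $resp->decoded_content,
#                                 sub { $parser->parse($_[0]) });
```

The `scalar` in the wiring comment matters: in list context `content_type` also returns the header parameters, which would shift the remaining arguments.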

> Second, you really don't want to keep fetching the DTDs like that. Does
> the XML you're actually trying to parse use external DTDs? If not, then
> you want to pass the NoLWP option to XML::Parser, so that it doesn't
> even try to fetch DTDs from the network. In the case of a public DTD
> like HTML the attempt to load it as a local file will fail, of course,
> but the parsing wasn't going to succeed anyway, because it wasn't XML.


That "NoLWP" option sounds useful, but it's somewhat moot here.
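
For reference, a minimal sketch of what passing that option might look like. The helper name new_offline_parser, the Tree style and the sample payload are assumptions made for illustration; only the NoLWP option itself comes from the suggestion above.

```perl
use strict;
use warnings;
use XML::Parser;

# Build a parser that does not use LWP for external entities.
# Per the XML::Parser documentation, NoLWP forces the file-based external
# entity handler. ParseParamEnt is deliberately left unset here.
sub new_offline_parser {
    return XML::Parser->new(NoLWP => 1, Style => 'Tree');
}

# Hypothetical payload; the element names are invented for the example.
my $tree = new_offline_parser()->parse('<result><id>42</id></result>');
print "root element: $tree->[0]\n";
```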

> However, I'm slightly confused here, because the XML::Parser
> documentation seems to say it doesn't parse external DTDs by default.
> It's possible I'm misunderstanding; I don't think I've used XML::Parser
> myself. Are you passing ParseParamEnt, and if so, why?


I don't know what "ParseParamEnt" is, so I imagine I'm not.

> Third, you probably don't want to be using XML::Parser at all. As you
> can see, it's old and rather cronky, and while it's extremely solid code
> it also takes a rather SGMLish approach to parsing XML. Most of the
> time, with modern XML use, DTDs are not used, and instead the XML just
> needs to be well-formed and properly namespaced. For this sort of thing
> (small documents) I would use XML::LibXML (which, incidentally, also
> includes a reasonable HTML parser); if a streaming model is more
> appropriate, either because your documents may be ridiculously large or
> simply because your program is structured that way, I would use one of
> the SAX modules.


The funny thing about searching in CPAN is that there are no packages (I'm guessing) that say "do not use this, use something better". I'll take a look at XML::LibXML to see what it does for me.
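
As a starting point, here is a small sketch of parsing a response body with XML::LibXML while keeping it off the network. The helper name parse_offline and the sample document are invented for the example; no_network and load_ext_dtd are parser options documented by XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

# Parse a response body without touching the network or loading external
# DTDs. load_xml dies if the input is not well-formed, so a caller can
# trap that with eval and report which request produced it.
sub parse_offline {
    my ($xml) = @_;
    return XML::LibXML->load_xml(
        string       => $xml,
        no_network   => 1,    # never fetch anything over the network
        load_ext_dtd => 0,    # do not load the external DTD subset
    );
}

# Hypothetical payload; the element names are invented for the example.
my $doc = parse_offline('<items><item id="7">widget</item></items>');
print $doc->findvalue('/items/item[1]/@id'), "\n";
```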

> Finally, fourth, I have no idea where that EINPROGRESS is coming from.
> That error is supposed to be returned if a socket is connected while in
> non-blocking mode, and the connection cannot be completed without
> blocking; it's basically the equivalent of EAGAIN for connect(). This
> means it shouldn't be possible to get that error without having asked
> for it by setting nonblocking mode on the socket, which LWP does not
> (normally) do.
>
> Are you doing something peculiar which might cause this to happen?
> Alternatively, it's possible this is some sort of Cygwin peculiarity,
> which unfortunately may be difficult to track down; if you can isolate
> the conditions where the error occurs it would be useful. (For instance,
> does it tend to occur when the network goes down? When the network is
> overloaded? When the DNS doesn't respond promptly?)


The script runs for perhaps 30-40 minutes, basically walking the entire data model of a REST API. It sends hundreds of requests to the (load-balanced) service, some from multiple threads. This kind of error happens several times during the run of the script, which means that the vast majority of requests work well enough. I ended up putting a hack into my "sendGet" sub that just checks for "DOCTYPE HTML" in the output and simply tries again, with a reasonable limit on retries. Almost all of the calls that hit this once or twice eventually get good data.
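
The retry hack is roughly of this shape. This is a sketch and not the actual "sendGet" code: get_with_retry, the default of three attempts and the $fetch code ref (which stands in for the real HTTP call) are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical retry wrapper: call $fetch->($url) until the body no longer
# looks like an HTML page, giving up after $max_tries attempts.
sub get_with_retry {
    my ($fetch, $url, $max_tries) = @_;
    $max_tries //= 3;    # assumed default, not taken from the real script

    for my $attempt (1 .. $max_tries) {
        my $body = $fetch->($url);

        # An HTML doctype means the service sent an error page, not XML.
        return $body if defined $body && $body !~ /<!DOCTYPE\s+HTML/i;
    }
    die "Giving up on $url after $max_tries attempts\n";
}

# Possible wiring with LWP (assumes $ua is an LWP::UserAgent):
#   my $xml = get_with_retry(sub { $ua->get($_[0])->decoded_content }, $url);
```

Because the final failure names the URL, a run that exhausts its retries at least says which request the back end could not serve.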
 