this code: &#x3, an invalid XML character error.

 
 
Kaidi
      09-27-2004
Hello guys,
I get the "an invalid XML character" error when using Xerces to parse
an XML file. I know that XML represents &, <, >, and " with special
strings like "&gt;" and "&lt;". But what if the XML file really
needs to contain text like "&#x3;&#x4;&#x14;&#x8;&#x8;" (as the
content of a tag)?

The story is:
I am writing a program to parse some XML files produced by another program.
That program grabs web pages and saves each page's URL and
content into an XML file, something like this (for each page):

<pageurl>http://www.cs.waikato.ac.nz/~ml/weka/agridatasets.jar</pageurl>
<pagecontent> the_page_HTML_content </pagecontent>

This works fine, since that program replaces &, <, >, etc. with &amp;,
&lt;, &gt;, etc.

However, some URLs point to binary files (.zip, .pdf, etc.). The
program just "prints" the file content as text and puts it in the XML
file. In this case, the content of <pagecontent> looks like:

PK&#x3;&#x4;&#x14;&#x0;&#x8;&#x0;&#x8;&#x0;ÈR&lt; +&#x0;&#x0;&#x0;&#
......
(Just think what you will see if you open a .pdf file in notepad!)

When I then use an XML parser (Xerces) to parse the file, I get
errors like:

FATAL: line 5079: Character reference "&#x3" is an invalid XML character.
org.xml.sax.SAXParseException: Character reference "&#x3" is an invalid XML character.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.scanCharReferenceValue(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCharReference(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

So, any idea how I can make this work?
How can I tell the Xerces parser to ignore the "&#x..;" references (except
the entities for <, >, ", etc.) and parse them as plain text?

Thanks a lot.
 
 
 
 
 
Patrick TJ McPhee
      09-27-2004
Kaidi wrote:

% I get the "an invalid XML character" error when using Xerces to parse
% an XML file. I know that XML represents &, <, >, and " with special
% strings like "&gt;" and "&lt;". But what if the XML file really
% needs to contain text like "&#x3;&#x4;&#x14;&#x8;&#x8;" (as the
% content of a tag)?

The only valid characters in an XML 1.0 file are tab, carriage return,
line feed, and the Unicode code points from U+0020 upward (excluding
surrogates, U+FFFE, and U+FFFF). Even if you enter them as numeric
character references, other control characters (such as &#x3;) are not
allowed. I suggest encoding binary data using one of the schemes
recognised in MIME, such as quoted-printable (for text with the odd
control character) or base64.
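
The character rule described above can be sketched as a small Java check. This is only an illustration of the Char production in the XML 1.0 specification; the class and method names (XmlChars, isValidXml10) are made up for this example and are not part of Xerces.

```java
// Minimal sketch of the XML 1.0 "Char" production.
// XmlChars and isValidXml10 are hypothetical names, not a Xerces API.
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point may appear in an XML 1.0 document. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD      // tab, line feed, carriage return
            || (cp >= 0x20 && cp <= 0xD7FF)             // up to the surrogate range
            || (cp >= 0xE000 && cp <= 0xFFFD)           // skips surrogates, U+FFFE, U+FFFF
            || (cp >= 0x10000 && cp <= 0x10FFFF);       // supplementary planes
    }
}
```

With this check, XmlChars.isValidXml10(0x3) is false, which is why the parser rejects the reference &#x3; even though it is written as a numeric character reference.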

% However, some URLs point to binary files (.zip, .pdf, etc.). The
% program just "prints" the file content as text and puts it in the XML
% file. In this case, the content of <pagecontent> looks like:

For these, use base64.
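
As a rough sketch of the base64 suggestion, assuming the producing program can be changed and runs on Java 8 or later, the raw bytes could be encoded before being written as element content and decoded again after parsing. The class name PageContentCodec is hypothetical; java.util.Base64 is the standard library class.

```java
import java.util.Base64;

// Sketch only: PageContentCodec is a hypothetical helper, not part of
// the program discussed in this thread.
final class PageContentCodec {
    private PageContentCodec() {}

    /** Encodes raw bytes so they can be stored safely as XML element content. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Decodes element content back to bytes; the MIME decoder skips whitespace and line breaks. */
    static byte[] decode(String elementText) {
        return Base64.getMimeDecoder().decode(elementText);
    }
}
```

The encoded text uses only letters, digits, '+', '/', and '=', so it never needs escaping and never contains a character the parser would reject.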

--

Patrick TJ McPhee
East York Canada
 
 
 
 
 
Johannes Koch
      09-27-2004
Kaidi wrote:
> The
> program just "prints" the .pdf content as text and puts it in the XML
> file. In this case, the content of <pagecontent> will look like:
>
> PK&#x3;&#x4;&#x14;&#x0;&#x8;&#x0;&#x8;&#x0;ÈR&lt; +&#x0;&#x0;&#x0;&#
> ......
> (Just think what you will see if you open a .pdf file in notepad!)
>
> In this way, when I use an XML parser (xerces) to parse it,


Why do you want to parse PDF with an XML parser? When downloading the
resources, you could store the content-type and make XML parsing dependent
on the content-type.
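
A minimal sketch of that idea, assuming the downloader keeps the HTTP Content-Type header value for each resource. The class name ContentTypeCheck and the particular list of media types treated as text are assumptions made for illustration.

```java
import java.util.Locale;

// Sketch only: ContentTypeCheck and its list of "textual" types are
// illustrative assumptions.
final class ContentTypeCheck {
    private ContentTypeCheck() {}

    /** Returns true if the stored Content-Type suggests text that is worth parsing. */
    static boolean isTextual(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=ISO-8859-1" and normalise case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.startsWith("text/")
            || mediaType.equals("application/xml")
            || mediaType.equals("application/xhtml+xml");
    }
}
```

Resources for which isTextual returns false (for example application/pdf or application/java-archive) could then be skipped or stored in an encoded form instead of being written out as text.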
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
 
 
Kaidi
      09-27-2004
Johannes Koch wrote:
> Kaidi wrote:
> > The
> > program just "prints" the .pdf content as text and puts it in the XML
> > file. In this case, the content of <pagecontent> will look like:
> >
> > PK&#x3;&#x4;&#x14;&#x0;&#x8;&#x0;&#x8;&#x0;ÈR&lt; +&#x0;&#x0;&#x0;&#
> > ......
> > (Just think what you will see if you open a .pdf file in notepad!)
> >
> > In this way, when I use an XML parser (xerces) to parse it,

>
> Why do you want to parse PDF with an XML parser? When downloading the
> resources, you could store the content-type and make XML parsing dependent
> on the content-type.


Yes, if I were writing the whole program, I would do it that way. The
problem is that the existing program (which I cannot change) works
this way: it just puts .jar, .pdf, etc. content into one XML file, and I
need to process this XML file.
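
When the producing program cannot be changed, one possible workaround is to pre-process the text before handing it to the parser. The sketch below is not a Xerces feature. It escapes the ampersand of every numeric character reference that resolves to a character XML 1.0 does not allow, so the parser reports it as the literal text "&#x3;" instead of failing. The class and method names are hypothetical, and the sketch assumes the file fits in memory as a String and ignores CDATA sections and comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: CharRefSanitizer is a hypothetical helper, not a Xerces API.
final class CharRefSanitizer {
    // Matches hexadecimal (&#x3;) and decimal (&#3;) character references.
    private static final Pattern CHAR_REF =
        Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    private CharRefSanitizer() {}

    /** Rewrites references to invalid characters, e.g. "&#x3;" becomes "&amp;#x3;". */
    static String escapeInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String ref = m.group();
            String replacement = isValidXml10(codePoint(m))
                ? ref                            // legal reference: keep as-is
                : "&amp;" + ref.substring(1);    // illegal: escape the ampersand
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static int codePoint(Matcher m) {
        try {
            return m.group(1) != null
                ? Integer.parseInt(m.group(1), 16)
                : Integer.parseInt(m.group(2), 10);
        } catch (NumberFormatException e) {
            return -1; // value too large to be a code point: treat as invalid
        }
    }

    // The Char production from the XML 1.0 specification.
    private static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

The cleaned string can then be parsed through an org.xml.sax.InputSource wrapping a java.io.StringReader. A reference that is cut off mid-way (a bare "&#" with no terminating ";") is not matched by the pattern and would still be reported as an error.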
 