Facing exception: Invalid byte 2 of 4-byte UTF-8 sequence.

 
 
dk
      01-21-2010
Hi All,

While parsing XML that contains some UTF-8 characters with the JDOM
parser, I'm getting the exception below:

Malformed XML, Caused by: 'Invalid byte 2 of 4-byte UTF-8 sequence.'
    at com.clarify.boss.utility.xml.SimpleXmlParser.build(SimpleXmlParser.java:236)
    at com.clarify.boss.msf.handler.RespHeaderInitiateHandler.getStandardHeader(RespHeaderInitiateHandler.java:366)
    at com.clarify.boss.msf.handler.RespHeaderInitiateHandler.execute(RespHeaderInitiateHandler.java:289)
    at com.clarify.boss.utility.appcontroller.support.AbstractHandler.execute(AbstractHandler.java:42)
    at com.clarify.boss.utility.appcontroller.support.ApplicationControllerImpl.handleRequest(ApplicationControllerImpl.java:174)
    at com.clarify.boss.utility.appcontroller.support.ApplicationControllerImpl.execute(ApplicationControllerImpl.java:311)
    at com.clarify.boss.msf.support.ServiceFaultPublisherAB.executeImpl(ServiceFaultPublisherAB.java:87)
    at com.clarify.boss.common.base.BossActionBeanBase.execute(BossActionBeanBase.java:125)
    at com.clarify.boss.sa.msf.xbean.InvokeResponseXB.executeImpl(InvokeResponseXB.java:19
    at com.clarify.cbo.XBeanImpl.baselineExecuteImpl_(XBeanImpl.java:275)
    at com.amdocs.oss.sm.core.common.XBeanBase.baselineExecuteImpl_(XBeanBase.java:75)
    at com.clarify.cbo.XBeanImpl.execute(XBeanImpl.java:197)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:615)
    at com.clarify.sam.JavaDispatch.invokeMethodImp(JavaDispatch.java:396)
    at com.clarify.sam.JavaDispatch.invokeMethod(JavaDispatch.java:34
    at com.clarify.sam.ActionBeanService.invokeBeanMethod(ActionBeanService.java:509)
    at com.clarify.sam.ActionBeanService.invokeAifOperation(ActionBeanService.java:12
    at com.clarify.sam.AppFrameworkBindingHandler.executeOperation(AppFrameworkBindingHandler.java:69)
    at com.amdocs.aif.consumer.ServiceContext.executeWithRetries(ServiceContext.java:900)
    at com.amdocs.aif.consumer.ServiceContext.executeOperationImpl(ServiceContext.java:756)
    at com.amdocs.aif.consumer.ServiceContext.executeOperation(ServiceContext.java:676)
    at com.amdocs.aif.consumer.ServiceContext.executeOperation(ServiceContext.java:323)
    at com.clarify.boss.errorhandler.resolver.ResolverLauncherSynchXB.executeImpl(ResolverLauncherSynchXB.java:157)
    ... 35 more
Caused by: org.jdom.input.JDOMParseException: Error on line 72: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:46
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:770)
    at com.clarify.boss.utility.xml.SimpleXmlParser.build(SimpleXmlParser.java:231)
    ... 60 more
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
    ... 62 more

I have declared the encoding in my XML as UTF-8:
<?xml version="1.0" encoding="UTF-8"?>

Initially I suspected that the XML backup had some problem, because
the same XML worked as input on the same application server but
failed from one of my friend's machines. Could this be the cause?

But now I have found something even more interesting. I tried
changing the encoding to ISO-8859-1, i.e. <?xml version="1.0"
encoding="ISO-8859-1"?>, and to my surprise it worked.

Now this has led to a confusion. I thought ISO-8859-1 is a charset
which is subset of UTF-8. Then why didn't UTF-8 work whereas
ISO-8859-1 worked?

And lastly, I can't change the encoding in my XML, as I would then
have to redo all the regression testing on my application. So please
let me know where I have gone wrong.

The Java code that I'm using is:

/*
 * (non-Javadoc)
 *
 * @see com.clarify.boss.utility.xml.XmlParser#build(org.springframework.core.io.Resource)
 */
public Document build(Resource source) {
    try {
        return (getSystemId() == null
                ? getSaxBuilder().build(source.getInputStream())
                : getSaxBuilder().build(source.getInputStream(), getSystemId()));
    } catch (Exception e) {
        e.printStackTrace();
        BossErrorCode bossErrorCode = new BossErrorCode(ErrorCode.BOSS_MALFORMED_XML);
        throw new BossException(bossErrorCode, new String[] {e.getCause().getMessage()}, e);
    }
}

the sax builder method is:

/**
 * Getter method for the <b>saxBuilder</b> property
 *
 * @return Returns the saxBuilder.
 */
private PropertyAwareSAXBuilder getSaxBuilder() {
    if (saxBuilder == null) {

        PropertyAwareSAXBuilder myParser = new PropertyAwareSAXBuilder(isValidate());

        myParser.setFeature("http://apache.org/xml/features/validation/schema", isValidate());
        myParser.setFeature("http://xml.org/sax/features/namespaces", true);

        //CatalogResolver myResolver = new CatalogResolver();

        CatalogResolver myResolver = getCatalogResolver();

        myParser.setEntityResolver(myResolver);
        setSaxBuilder(myParser);

        Iterator it = getProperties().keySet().iterator();
        while (it.hasNext()) {
            String name = (String) it.next();
            saxBuilder.setProperty(name, getProperties().get(name));
        }
    }
    return saxBuilder;
}

Regards,
Dhirendra
 
Roedy Green
      01-21-2010
On Thu, 21 Jan 2010 02:13:27 -0800 (PST), dk <(E-Mail Removed)>
wrote, quoted or indirectly quoted someone who said :

>
>While I'm trying to use some UTF-8 characters in my xml while parsing
>the xml using JDOM parser I'm getting this below exception:


Partition your problem. Is it that the file is malformed or is the
problem getting the XML parser to understand the file is in UTF-8
encoding?

You can examine your file in a hex viewer if you are familiar with
UTF-8 encoding, or you could feed it to the Sun utility native2ascii
to see if it likes it.
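
If no hex viewer is at hand, the same check can be done in a few lines of Java. This is only a minimal sketch and not something from the original code: the names Utf8Validator and firstInvalidOffset are made up for illustration, and it assumes Java 7 or later for StandardCharsets (on Java 5, Charset.forName("UTF-8") should do the same job). It decodes the bytes with a CharsetDecoder that is told to report malformed input instead of replacing it, and returns the offset of the first bad byte, which is the place to inspect in the file.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the poster's application.
class Utf8Validator {

    /** Returns the offset of the first malformed byte, or -1 if the data is valid UTF-8. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On an error the input position is left at the start of the bad sequence.
                return in.position();
            }
            if (result.isUnderflow()) {
                // All input consumed without an error.
                return -1;
            }
            // Overflow: the decoded characters are not needed, so drop them and continue.
            out.clear();
        }
    }
}
```

To check a whole file, read it into a byte array first (for example with Files.readAllBytes) and pass the array to firstInvalidOffset.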

See http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/encoding.html

You could also give up and use entities (NCRs).
see http://mindprod.com/jgloss/xml.html#AWKWARD
--
Roedy Green Canadian Mind Products
http://mindprod.com
Responsible Development is the style of development I aspire to now. It can be summarized by answering the question, How would I develop if it were my money? I'm amazed how many theoretical arguments evaporate when faced with this question.
~ Kent Beck (born: 1961 age: 49), evangelist for extreme programming.
 
dk
      01-21-2010
On Jan 21, 6:26 pm, Roedy Green <(E-Mail Removed)>
wrote:
> On Thu, 21 Jan 2010 02:13:27 -0800 (PST), dk <(E-Mail Removed)>
> wrote, quoted or indirectly quoted someone who said :
>
> >While I'm trying to use some UTF-8 characters in my xml while parsing
> >the xml using JDOM parser I'm getting this below exception:
>
> Partition your problem. Is it that the file is malformed or is the
> problem getting the XML parser to understand the file is in UTF-8
> encoding?
>
> You can examine your file in a hex viewer if you are familiar with
> UTF-8 encoding, or you could feed it to the Sun utility native2ascii
> to see if it likes it.
>
> See http://mindprod.com/jgloss/utf.html
> http://mindprod.com/jgloss/encoding.html
>
> You could also give up and use entities (NCRs).
> see http://mindprod.com/jgloss/xml.html#AWKWARD



@BugBear: yeah, the xml is well-formed and properly validated.

@Roedy: right now I'm using UltraEdit and inserting the characters
from the ASCII table that it has. I have even tried viewing it in hex
mode, and I got the same value in both places.

Meanwhile I have found something more interesting: if I explicitly
specify UTF-8 in getByteStream while reading the input stream from my
xml, it works fine. So is this a Java bug (1.5.0.12), or something
else?
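
One possible explanation for this observation, offered only as a guess since the code around getByteStream was never posted: if the change wraps the byte stream in a Java Reader, the decoding is done by Java instead of by the XML parser. A plain InputStreamReader does not report malformed UTF-8; it silently substitutes U+FFFD for the bad bytes, so the parser never sees them and the error goes away while the character is still lost. The sketch below shows that substitution; the class and method names are invented for illustration, and it assumes Java 7 or later for StandardCharsets.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the application in this thread.
class LenientDecodeDemo {

    /** Decodes bytes the way a plain InputStreamReader does: bad input is replaced, not reported. */
    static String decodeLeniently(byte[] data) {
        try {
            Reader reader = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
    }
}
```

Feeding it the bytes 'a', 0xF1, 'b' gives back a string that contains U+FFFD where 0xF1 was, with no exception. If that is what is happening here, "working fine" would mean the document parses, not that the data survived intact.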
 
 
Mike Schilling
      01-21-2010
It may be a clue that 4-byte UTF-8 sequences only occur for characters
that Java represents as surrogate pairs (code points above U+FFFF),
and there are two plausible ways to encode those:

1. Encode the code point as 4 bytes
2. Encode each 16-bit "char" as 3 bytes

Only 1 is correct, but I'm sure there's lots of non-surrogate-aware
code that does 2.
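
To make the two options concrete, here is a small sketch (class and method names are made up for illustration; it assumes Java 7 or later for StandardCharsets). utf8 uses the standard Java encoder, which produces form 1. perCharThreeByte deliberately does the wrong thing described in 2 by encoding each UTF-16 char on its own; that byte layout is what CESU-8 uses.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for comparing the two encodings of a supplementary character.
class SupplementaryDemo {

    /** Formats bytes as upper-case hex pairs separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Form 1: encode the code point itself; gives 4 bytes for code points above U+FFFF. */
    static byte[] utf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Form 2 (not valid UTF-8): encode each 16-bit char as its own 3-byte sequence.
     * Only meaningful for chars at or above U+0800, which surrogates always are.
     */
    static byte[] perCharThreeByte(int codePoint) {
        char[] chars = Character.toChars(codePoint);
        byte[] out = new byte[chars.length * 3];
        int i = 0;
        for (char c : chars) {
            out[i++] = (byte) (0xE0 | (c >> 12));
            out[i++] = (byte) (0x80 | ((c >> 6) & 0x3F));
            out[i++] = (byte) (0x80 | (c & 0x3F));
        }
        return out;
    }
}
```

For U+1F600, hex(utf8(0x1F600)) gives F0 9F 98 80, while hex(perCharThreeByte(0x1F600)) gives ED A0 BD ED B8 80. Note that the second form starts with ED, a 3-byte lead byte, so a strict parser would be expected to complain about a 3-byte sequence rather than a 4-byte one if that were the cause.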


 
 
Lew
      01-21-2010
dk wrote:
> @BugBear: yeah the xml [sic] is a well formed and properly validated xml [sic].
>


That didn't answer his question. Answer his question.
"Have you checked that your data IS valid UTF-8 ?"

Clearly there is an improperly-encoded character in your XML file.
Find that and fix it.

> @Roedy: write now I'm using ultraEdit and inserting the characters
> from the ASCII table that it has. I have even tried seeing it in hex
> mode and I got the same value from both the places.
>


ASCII != UTF-8.

Does the hex value for the bad character match the UTF-8 encoding of
that character's code point? Is it four bytes long? What character is
it, and what is the hex value you observe? (Note: that's four
questions, so there ought to be four answers.)

> Meanwhile I have found something more interesting while reading the
> input stream from my xml [sic] if I exclusively define it to be formatted to
> UTF-8 in getByteStream it is working fine. Now here is this a Java bug
> (1.5.0.12)? or something else?
>


It's not a Java bug.

> Now this has led to a confusion. I thought ISO-8859-1 is a charset


Did you mean "encoding"?

> which is subset of UTF-8. Then why didn't UTF-8 work whereas
> ISO-8859-1 worked?
>


Because you were wrong. The two encodings differ.

If you have an assumption, let's call it an hypothesis, and the
evidence contradicts the hypothesis, then the hypothesis is wrong.
Simple.
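
How the two encodings differ can be shown with a short experiment. The sketch below is an illustration under stated assumptions, not the poster's setup: it uses the JDK's built-in JAXP parser instead of JDOM with standalone Xerces, the class and method names are invented, and the sample text "España" is made up. The document bytes are always written as ISO-8859-1; only the declaration changes. In ISO-8859-1 the letter ñ is the single byte 0xF1, and in UTF-8 the bytes 0xF0 to 0xF7 announce a 4-byte sequence, so a Latin-1 character in that range (ð ñ ò ó ô õ ö ÷) followed by an ordinary letter is one way to get exactly "Invalid byte 2 of 4-byte UTF-8 sequence". Whether that is what happened in this file is not confirmed anywhere in the thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class; uses the JDK's own XML parser as a stand-in for JDOM/Xerces.
class DeclaredEncodingDemo {

    /** Builds a tiny document whose bytes are always ISO-8859-1, whatever the declaration says. */
    static byte[] latin1Document(String declaredEncoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                + "<name>Espa\u00F1a</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Returns the root element's text, or null if the parser rejects the bytes. */
    static String parse(byte[] document) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(document))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            // Covers SAXException and IOException from a parser that refuses the input.
            return null;
        }
    }
}
```

With the declaration set to UTF-8 the parse fails, and with ISO-8859-1 it succeeds. It succeeds because every byte value is a legal ISO-8859-1 character, so that declaration can never produce a malformed-byte error; it only gives the right text if the bytes really are ISO-8859-1.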

--
Lew
 
 
Arne Vajhøj
      01-22-2010
On 21-01-2010 10:03, dk wrote:
> Meanwhile I have found something more interesting while reading the
> input stream from my xml if I exclusively define it to be formatted to
> UTF-8 in getByteStream it is working fine. Now here is this a Java bug
> (1.5.0.12)? or something else?


If you post the XML input and the Java code, then we can
tell you.

Arne
 
 
Roedy Green
      01-22-2010
On Thu, 21 Jan 2010 07:03:23 -0800 (PST), dk <(E-Mail Removed)>
wrote, quoted or indirectly quoted someone who said :

>@Roedy: write now I'm using ultraEdit and inserting the characters
>from the ASCII table that it has. I have even tried seeing it in hex
>mode and I got the same value from both the places.


You need to know what the hex SHOULD look like.
See http://mindprod.com/jgloss/utf8.html

You need a tool to see what it DOES look like.
See http://www.sweetscape.com/010editor/
http://funduc.com/otsoft.htm#hexview

And a tool to validate the encoding:
http://mindprod.com/jgloss/native2asciiexe.html
http://mindprod.com/applet/ecodingrecogniser.html
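
For the "SHOULD look like" half, Java itself can print the expected bytes. A minimal sketch, with invented class and method names and assuming Java 7 or later for StandardCharsets: expectedUtf8Hex shows the UTF-8 bytes a given piece of text ought to have, and matches compares them with the bytes actually seen in the hex viewer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for comparing expected and observed bytes.
class Utf8HexCompare {

    /** Formats bytes as upper-case hex pairs separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** The bytes the text should have in a correctly encoded UTF-8 file. */
    static String expectedUtf8Hex(String text) {
        return hex(text.getBytes(StandardCharsets.UTF_8));
    }

    /** True if the observed bytes are exactly the UTF-8 encoding of the text. */
    static boolean matches(String text, byte[] observed) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8), observed);
    }
}
```

For example, expectedUtf8Hex("\u00F1") (the letter ñ) gives C3 B1. If the hex viewer shows a lone F1 at that position instead, the file was saved in a single-byte encoding such as ISO-8859-1 and not in UTF-8.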


 