Sanitizing HTML strings.

 
 
jason.cipriani@gmail.com
      06-19-2008
Is there anything in the Java API that will sanitize strings for
display in HTML (e.g. replace HTML tokens with escape sequences), or
is it normal to roll your own? I'm playing around with Java, and Java
servlets; I'm primarily a C++ programmer and not too familiar with
Java, so sorry if this is a silly question.

Thanks,
Jason
 
 
Stefan Ram
      06-19-2008
"(E-Mail Removed)" <(E-Mail Removed)> writes:
>Is there anything in the Java API that will sanitize strings for
>display in HTML (e.g. replace HTML tokens with escape sequences), or
>is it normal to roll your own?


Some people suggest:

http://commons.apache.org/lang/api/o...va.lang.String)

But I found defects in it some years ago:

import org.apache.commons.lang.StringEscapeUtils;

public final class Test
{
  public static void main( final String[] args )
  {
    System.out.println( StringEscapeUtils.escapeXml( "a" ) );
    System.out.println( StringEscapeUtils.escapeXml( "" ) );
    System.out.println( StringEscapeUtils.escapeXml( "&" ) );
    /* one code point (U+10000) encoded as a surrogate pair */
    final String text = "\ud800\udc00";
    System.out.println( text.codePointCount( 0, text.length() ) );
    System.out.println( StringEscapeUtils.escapeXml( text ) );
  }
}

IIRC, this should show two things: characters other than those
with special meaning in HTML are replaced as well (which might
or might not be wanted), and code points represented by
surrogate pairs were not handled correctly.
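
To illustrate the surrogate-pair point, here is a minimal sketch (not the Commons Lang implementation) that walks the text by code point, so that a pair such as "\ud800\udc00" (U+10000) becomes one numeric character reference rather than two. The class name CodePointDemo and the method toNumericReferences are made up for this example; only standard String and Character methods are used.

```java
// Hypothetical helper for illustration; not part of any library.
final class CodePointDemo
{
    // Replaces every code point above U+007F with a decimal character reference.
    static String toNumericReferences( final String text )
    {
        final StringBuilder out = new StringBuilder();
        for( int i = 0; i < text.length(); )
        {
            // codePointAt reads a complete surrogate pair if one starts at i
            final int cp = text.codePointAt( i );
            if( cp > 0x7F ) out.append( "&#" ).append( cp ).append( ';' );
            else out.append( (char) cp );
            // advance by one or two chars, depending on the code point
            i += Character.charCount( cp );
        }
        return out.toString();
    }

    public static void main( final String[] args )
    {
        final String text = "\ud800\udc00";
        System.out.println( text.length() );                            // 2
        System.out.println( text.codePointCount( 0, text.length() ) );  // 1
        System.out.println( toNumericReferences( text ) );              // &#65536;
    }
}
```

A loop over text.charAt( i ) would instead see two separate surrogate chars, which is the kind of mistake described above.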

The last time I looked, the JDK had some methods for this, but
most of them were private, protected or not intended for use
by applications:

javax.swing.text.html.HTMLWriter#output(char[] chars, int start, int length)
java.util.logging.XMLFormatter#escape(StringBuffer sb, String text)
java.beans.XMLEncoder#quoteCharacters(String s)
com.sun.org.apache.xml.internal.serialize.XMLSerializer#printEscaped(String source)
com.sun.org.apache.xml.internal.serialize.XML11Serializer#printEscaped(String source)
com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDAbstractTraverser#processAttValue(String original)

"package":

com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaDOM#processAttValue(String original)

"public":

com.sun.org.apache.xalan.internal.client.XSLTProcessorApplet#escapeString(String s)

So, you might try an approach like in the following class I wrote.

public final class Text
{
  /* Returns the character reference for a character with special
     meaning in HTML, or null if the character needs no escaping. */
  public static java.lang.String sourceCharacter( final char s )
  { return
      ( s < 63 && s >= 34 ) ?
        ( s < 40 ?
            ( s == '"'  ? "&quot;" :
              s == '&'  ? "&amp;"  :
              s == '\'' ? "&#39;"  : null ) :
          s >= 60 ?
            ( s == '<' ? "&lt;" :
              s == '>' ? "&gt;" : null ) : null ) : null; }

  /* Returns the text with all special characters escaped; the
     original string is returned unchanged if nothing was replaced. */
  public static java.lang.String sourceText( final java.lang.String text )
  { java.lang.StringBuilder buffer = null;
    int growth = 0;
    final int length = text.length();
    for( int i = 0; i < length; ++i )
    { final java.lang.String sourceChar =
        sourceCharacter( text.charAt( i ) );
      if( sourceChar != null )
      { if( buffer == null ) buffer =
          new java.lang.StringBuilder( text );
        final int position = i + growth;
        buffer.replace( position, position + 1, sourceChar );
        /* one char was replaced, so the buffer grew by length - 1 */
        growth += sourceChar.length() - 1; }}
    return buffer == null ? text : buffer.toString(); }

  /* untested */
  public static void main( final String[] args )
  { java.lang.System.out.println
    ( sourceText
      ( "<alpha beta=\"gamma\" delta='epsilon' />" )); }}

This is only intended to encode characters with special
meanings. Depending on the encoding used for the HTML
document, other characters might have to be represented using
character references, too.
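
The encoding remark can be made concrete. A rough sketch, assuming the charset the page will be served in is known: any code point that charset cannot encode is written as a numeric character reference, and the five special characters are always escaped. EncodingAwareEscaper and escape are names invented for this example; the JDK calls used are Charset.forName, CharsetEncoder.canEncode and the code point methods of String and Character.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical class for illustration only.
final class EncodingAwareEscaper
{
    static String escape( final String text, final String charsetName )
    {
        final CharsetEncoder encoder = Charset.forName( charsetName ).newEncoder();
        final StringBuilder out = new StringBuilder( text.length() );
        for( int i = 0; i < text.length(); )
        {
            final int cp = text.codePointAt( i );      // full code point, pair-aware
            final int n = Character.charCount( cp );   // 1 or 2 chars
            switch( cp )
            {
                case '"':  out.append( "&quot;" ); break;
                case '&':  out.append( "&amp;" );  break;
                case '\'': out.append( "&#39;" );  break;
                case '<':  out.append( "&lt;" );   break;
                case '>':  out.append( "&gt;" );   break;
                default:
                {
                    final String s = text.substring( i, i + n );
                    // keep what the target charset can represent, reference the rest
                    if( encoder.canEncode( s ) ) out.append( s );
                    else out.append( "&#" ).append( cp ).append( ';' );
                }
            }
            i += n;
        }
        return out.toString();
    }

    public static void main( final String[] args )
    {
        // U+00E9 is not encodable in US-ASCII, so it becomes a reference
        System.out.println( escape( "caf\u00e9 <b>", "US-ASCII" ) ); // caf&#233; &lt;b&gt;
    }
}
```

With "ISO-8859-1" or "UTF-8" as the charset name, the same U+00E9 would be passed through unchanged.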

Another idea might be to use a
javax.swing.text.html.HTMLEditorKit to write the text into a
document and then serialize it to HTML, but I have not tried
this.

 
 
Mark Space
      06-19-2008
(E-Mail Removed) wrote:
> Is there anything in the Java API that will sanitize strings for
> display in HTML (e.g. replace HTML tokens with escape sequences), or
> is it normal to roll your own? I'm playing around with Java, and Java
> servlets; I'm primarily a C++ programmer and not too familiar with
> Java, so sorry if this is a silly question.


I think you can get by with just replacing "&" with "&amp;" and "<" with
"&lt;". This is really an HTML/XML type of question, not Java per se. The
String class has a replaceAll() method you can use.
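
A minimal sketch of that two-replacement idea, with made-up names (MinimalEscape, escape). It uses String.replace rather than replaceAll, since replaceAll treats its first argument as a regular expression and plain literal replacement is all that is needed here.

```java
// Hypothetical helper for illustration only.
final class MinimalEscape
{
    static String escape( final String s )
    {
        // '&' must be handled first; otherwise the '&' in "&lt;" would be escaped again
        return s.replace( "&", "&amp;" ).replace( "<", "&lt;" );
    }

    public static void main( final String[] args )
    {
        System.out.println( escape( "a < b && c" ) ); // a &lt; b &amp;&amp; c
    }
}
```

This covers text placed in element content only; text placed inside an attribute value would also need its quote characters escaped.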

If you're talking about data validation in general, that's a whole
'nother ball of wax. Be careful about assuming that all you need to do
is "sanitize HTML." I'm not an expert, but it would be wise to become
one before designing a strategy to validate input, especially input
taken from a web form.

This is really too complicated for the Java API. You might try
libraries designed to interact with users on the web; there may be
sub-libraries designed with various validators in mind. Struts is
probably the oldest, and the Struts home page has links to other
web-oriented UI frameworks that might be useful, under Similar Projects.

http://struts.apache.org/
 
 
Roedy Green
      06-20-2008
On Thu, 19 Jun 2008 11:20:20 -0700 (PDT), "(E-Mail Removed)"
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone
who said :

>Is there anything in the Java API that will sanitize strings for
>display in HTML (e.g. replace HTML tokens with escape sequences), or
>is it normal to roll your own? I'm playing around with Java, and Java
>servlets; I'm primarily a C++ programmer and not too familiar with
>Java, so sorry if this is a silly question.


I have written some stuff that might prove useful.

http://mindprod.com/products1.html#ENTITIES

which interconverts between Unicode and &xxx; entities.
There is also a method to strip out HTML tags, leaving you just the raw
text.

http://mindprod.com/products1.html#AMPER
which converts & to &amp; where appropriate in a malformed HTML
document.

If your HTML is well-formed, you can render it inside Java JLabels and
JTextAreas. See http://mindprod.com/jgloss/htmlrendering.html

There is also a utility http://mindprod.com/applet/quoter.html
that will transform data in many different ways, including converting
HTML to Java string literals.
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
 