Regex and Unicode

 
 
michael.biden@gmail.com
 
      03-19-2007
I am receiving a String from a non-Java system. The system that
generates the String encodes some characters, such as a hyphen, as
Unicode escapes. However, it marks each escape with a percent sign
rather than a backslash.

Thus the String test-victorf becomes test%u002dvictorf. I'd love to
be able to simply replace the percent sign with a backslash, but there
seems to be no way to insert the backslash dynamically so that the
result is treated like an escape in a literal. For example:
public static void main(String args[]) {
    String user = "test%u002dvictof";
    user = user.replace('%', '\\');
    System.out.println(user);
}

Does not work. The output is test\002dvictorf.

So I tried to use a regular expression with a capturing group:
public static void main(String args[]) {
    String user = "test%u002dvictof";
    user = user.replaceAll(
        "%u([a-f | A-F | 0-9][a-f | A-F | 0-9][a-f | A-F | 0-9][a-f | A-F | 0-9])",
        Character.toString((char) Integer.valueOf("$1", 16).intValue()));
    System.out.println(user);
}
This throws a java.lang.NumberFormatException because the compiler
does not like the $1 at runtime. It seems that the $1 is being
interpreted literally. The value I expect $1 to have at run time is '002d'.

Any help is appreciated.

Thanks.

 
 
 
 
 
Oliver Wong
 
      03-19-2007

<(E-Mail Removed)> wrote:
>I have a situation in which I am receiving a String from a non-java
> system. The system that generates the String attempts to encode some
> characters such a slash to unicode. However it encodes characters
> using the percent sign rathern than the backslash.
>
> Thus the String test-victorf becomes test%u002dvictorf. I'd love to
> be able to simply replace the percent with a backslash, but it seems
> that there is no way to dynamically insert the backslash like a
> literal. For example:
> public static void main (String args[]){
> String user = "test%u002dvictof";
> user = user.replace('%', '\\');
> System.out.println(user);
> }
>
> Does not work. The output is test\002dvictorf.
>
> So I tried to use a regular expression with a capturing parantheses:
> public static void main (String args[]){
> String user = "test%u002dvictof";
> user = user.replaceAll("%u([a-f | A-F | 0-9][a-f | A-F | 0-9][a-f |
> A-F | 0-9][a-f | A-F | 0-9])",
> Character.toString((char)Integer.valueOf("$1", 16).intValue()) );
> System.out.println(user);
> }
> Which generates a java.lang.NumberFormatException becuase the compiler
> does not like the $1 at runtime. It seems that the $1 is being
> interpretted literally. The real value of $1 at run time is '002d'


"$1" is interpreted literally, because "$1" is a literal. It has the
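same value at runtime as it does at compile time, namely the two-character
string consisting of the character '$' followed by the character '1'.
Java evaluates the arguments before it calls replaceAll, so
Integer.valueOf("$1", 16) runs first and throws, because "$1" is not a
hexadecimal number.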
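
A small sketch of the difference, in case it helps. The class and method names (LiteralDollarOneDemo, isHex, bracketGroups) are made up for illustration. It shows that "$1" on its own is just two characters, and that a group reference is only substituted when the replacement string itself reaches replaceAll:

```java
class LiteralDollarOneDemo {
    // Returns true if s can be parsed as a base-16 integer.
    static boolean isHex(String s) {
        try {
            Integer.parseInt(s, 16);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Here "$1" is part of the replacement string that replaceAll receives,
    // so the regex engine substitutes the captured group for it.
    static String bracketGroups(String input) {
        return input.replaceAll("%u([0-9a-fA-F]{4})", "[$1]");
    }

    public static void main(String[] args) {
        System.out.println(isHex("$1"));                // false: '$' is not a hex digit
        System.out.println(isHex("002d"));              // true
        System.out.println(bracketGroups("a%u002db"));  // a[002d]b
    }
}
```

The character class here is written as [0-9a-fA-F]; the spaces and '|' inside the brackets of the original pattern would be matched as literal characters.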

Do the replace in three smaller steps instead of one big step: In the
first step, extract the "specially-encoded" char, "%u002d", and in the
second step, convert this 6-character string into a 1-character string
"-". In the third step, put your 1-character string where it should be in
the original string you were parsing.
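
The three steps could be sketched roughly like this, assuming Java 5 or later and that every escape is exactly "%u" followed by four hex digits. PercentUDecoder and decode are hypothetical names, not anything from the standard library:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PercentUDecoder {
    // "%u" followed by exactly four hex digits; group 1 holds the digits.
    private static final Pattern ESCAPE = Pattern.compile("%u([0-9a-fA-F]{4})");

    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = ESCAPE.matcher(input);
        int last = 0; // index just past the previous match
        while (m.find()) {
            // Step 1: extract the specially-encoded char (its hex digits).
            String hex = m.group(1);
            // Step 2: convert the hex digits into a single char.
            char decoded = (char) Integer.parseInt(hex, 16);
            // Step 3: copy the untouched text before the match, then the char.
            out.append(input, last, m.start()).append(decoded);
            last = m.end();
        }
        // Copy whatever follows the last match.
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("test%u002dvictof")); // test-victof
    }
}
```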

- Oliver


 
 
 
 
 
Robert Klemme
 
      03-19-2007
On 19.03.2007 17:25, (E-Mail Removed) wrote:
> I have a situation in which I am receiving a String from a non-java
> system. The system that generates the String attempts to encode some
> characters such a slash to unicode. However it encodes characters
> using the percent sign rathern than the backslash.
>
> Thus the String test-victorf becomes test%u002dvictorf. I'd love to
> be able to simply replace the percent with a backslash, but it seems
> that there is no way to dynamically insert the backslash like a
> literal. For example:
> public static void main (String args[]){
> String user = "test%u002dvictof";
> user = user.replace('%', '\\');
> System.out.println(user);
> }
>
> Does not work. The output is test\002dvictorf.


Well, the string contains no Unicode escape sequence, only a literal
"%", and that is what gets replaced. For the Unicode escape to be
translated, the text "test\u002dvictof" has to appear in the
*source code*, because it is the compiler that performs the translation.
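
To illustrate, here is a small sketch (the class and method names are made up). The first string has its escape translated by the compiler; the second is built at runtime and keeps all six characters of the escape text:

```java
class UnicodeEscapeDemo {
    // The escape sits in the source code, so the compiler turns it into '-'.
    static String compiledForm() {
        return "test\u002dvictof";
    }

    // Built at runtime: the backslash is just another character, and
    // nothing re-reads the result as an escape sequence.
    static String runtimeForm() {
        return "test%u002dvictof".replace('%', '\\');
    }

    public static void main(String[] args) {
        System.out.println(compiledForm());          // test-victof
        System.out.println(compiledForm().length()); // 11
        System.out.println(runtimeForm().length());  // 16
    }
}
```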

> So I tried to use a regular expression with a capturing parantheses:
> public static void main (String args[]){
> String user = "test%u002dvictof";
> user = user.replaceAll("%u([a-f | A-F | 0-9][a-f | A-F | 0-9][a-f |
> A-F | 0-9][a-f | A-F | 0-9])",
> Character.toString((char)Integer.valueOf("$1", 16).intValue()) );
> System.out.println(user);
> }
> Which generates a java.lang.NumberFormatException becuase the compiler
> does not like the $1 at runtime. It seems that the $1 is being
> interpretted literally. The real value of $1 at run time is '002d'


You need to compute the replacement string *while replacing*, because
the replacement value has to be calculated separately for every
individual match. See

http://java.sun.com/j2se/1.4.2/docs/...va.lang.String)
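
One standard-library way to compute a replacement per match is Matcher.appendReplacement together with Matcher.appendTail; whether that is exactly the method the truncated link points to is an assumption. A minimal sketch, with AppendReplacementDecoder and decode as illustrative names, and Matcher.quoteReplacement (Java 5 or later) used so that a decoded '$' or '\' is not treated as a group reference:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDecoder {
    private static final Pattern ESCAPE = Pattern.compile("%u([0-9a-fA-F]{4})");

    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Compute the replacement for this particular match.
            String decoded = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            // Quote it, because '$' and '\' are special in replacement strings.
            m.appendReplacement(sb, Matcher.quoteReplacement(decoded));
        }
        // Append the text after the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("test%u002dvictof")); // test-victof
        System.out.println(decode("cost: %u00245"));    // cost: $5
    }
}
```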


> Any help is appreciated.


I think a cleaner solution would be to create a custom
InputStreamReader that does the conversion to char while reading the
binary input. Maybe one of the default encodings even does this already.
IIRC java.util.Properties.load() already decodes \uXXXX escapes when
reading from files, but relying on that here would be an ugly hack, so
I'd rather look for an existing decoder or write your own.

Kind regards

robert
 
 
Hendrik Maryns
 
      03-20-2007
(E-Mail Removed) wrote:
> I have a situation in which I am receiving a String from a non-java
> system. The system that generates the String attempts to encode some
> characters such a slash to unicode. However it encodes characters
> using the percent sign rathern than the backslash.
>
> Thus the String test-victorf becomes test%u002dvictorf. I'd love to
> be able to simply replace the percent with a backslash, but it seems
> that there is no way to dynamically insert the backslash like a
> literal. For example:
> public static void main (String args[]){
> String user = "test%u002dvictof";
> user = user.replace('%', '\\');
> System.out.println(user);
> }
>
> Does not work. The output is test\002dvictorf.


Actually the output is test\u002dvictof, which is what I thought you
wanted from your description. If you really want to replace the percent
encoding with the character it represents, read the other replies.

H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
 