ascii char 26

 
 
bob
Guest
Posts: n/a
 
      09-11-2011
Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

I had to write this function to deal with this:

public static String convertToAscii(String html) {
    html = html.replaceAll("\u2019", "'");
    html = html.replaceAll("\u201D", "\"");
    html = html.replaceAll("\u201C", "\"");

    byte[] b = null;
    try {
        b = html.getBytes("US-ASCII");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

    // hyphen replace
    for (int ctr = 0; ctr < b.length; ctr++) {
        if (b[ctr] == 26) {
            b[ctr] = 45;
        }
    }

    html = new String(b);
    return html;
}
 
 
Arne Vajhøj
Guest
Posts: n/a
 
      09-11-2011
On 9/11/2011 5:33 PM, bob wrote:
> Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?
>
> I had to write this function to deal with this:
>
> public static String convertToAscii(String html) {
> html = html.replaceAll("\u2019", "'");
> html = html.replaceAll("\u201D", "\"");
> html = html.replaceAll("\u201C", "\"");
>
> byte[] b = null;
> try {
> b = html.getBytes("US-ASCII");
> } catch (UnsupportedEncodingException e) {
> e.printStackTrace();
> }
>
> // hyphen replace
> for (int ctr = 0; ctr< b.length; ctr++)
> if (b[ctr] == 26)
> b[ctr] = 45;
>
> html = new String(b);
> return html;
> }


ASCII code 26 is not, in general, replaced with a hyphen.

If you are asking why some code might do that, then in
some contexts (usually on the Windows platform) ASCII code
26 indicates EOF.

Arne


 
 
Joshua Cranmer
Guest
Posts: n/a
 
      09-11-2011
On 9/11/2011 4:33 PM, bob wrote:
> Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?


The US-ASCII encoder only properly encodes characters in the range of
0-127, i.e., the characters that are present in ASCII. Any other
character is replaced with some sort of substitution character; in this
case, it looks like the charset has chosen to use ^Z as the "I don't
know what this character is" character (I would have guessed '?'
instead, but I suppose they decided to go with the less-commonly used
variant).

My guess is your input is using one of the characters like the minus
sign, em dash, or perhaps an en dash instead (there may be others),
which are visually close in appearance to a hyphen but do not share the
same Unicode codepoint.
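
A minimal sketch of how one might check this, assuming a runtime that has java.nio.charset.StandardCharsets (Java 7 or later). The class and method names (AsciiProbe, unmappable, substitute) are made up for illustration. It lists the code points that US-ASCII cannot represent and asks the encoder which byte it substitutes for them. On a desktop JVM the substitute is typically '?' (63); the 26 reported in this thread suggests a different default on the poster's platform.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class AsciiProbe {
    /** Returns the code points in s that US-ASCII cannot represent, as "U+XXXX" strings. */
    static List<String> unmappable(String s) {
        CharsetEncoder enc = StandardCharsets.US_ASCII.newEncoder();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int n = Character.charCount(cp);
            // canEncode is false for anything outside 0-127
            if (!enc.canEncode(s.substring(i, i + n))) {
                out.add(String.format("U+%04X", cp));
            }
            i += n;
        }
        return out;
    }

    /** The byte(s) this runtime's US-ASCII encoder substitutes for unmappable input. */
    static byte[] substitute() {
        return StandardCharsets.US_ASCII.newEncoder().replacement();
    }

    public static void main(String[] args) {
        String text = "a\u2014b"; // 'a', EM DASH, 'b'
        System.out.println("unmappable: " + unmappable(text));
        byte[] encoded = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println("middle byte: " + encoded[1]
                + ", encoder substitute: " + substitute()[0]);
    }
}
```

Running something like this on the real input should show which dash-like character is actually in the string, rather than guessing from the byte that comes out.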

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
 
 
Roedy Green
Guest
Posts: n/a
 
      09-11-2011
On Sun, 11 Sep 2011 14:33:05 -0700 (PDT), bob <(E-Mail Removed)>
wrote, quoted or indirectly quoted someone who said :
>Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?
>html = html.replaceAll("\u201C", "\"");


\u0026 is replaced by an ampersand at compile time, as if you had
typed one into the source code.

I presume you are talking about

26 0x1a ^Z SUB, substitute

\u001a is not useful. It gets replaced by a ^z character, as if you
had typed it into the source text, possibly creating a syntax error.
If you want this char you probably want (char)0x001a

This is true for ascii, UTF and UTF-8. If you see a -, it might just
be some font's attempt to render a SUB char.

You can use &#x241a; in HTML or \u241a in Java to render a tiny SUB
glyph to represent the char.
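
As a small illustration of the last two points (the names SubChar, SUB, SUB_SYMBOL and visible are hypothetical, not from any library): the cast gives the control character itself, and U+241A can stand in for it when you want to see it in output.

```java
final class SubChar {
    /** ASCII 26 (SUB), written as a cast rather than a Unicode escape. */
    static final char SUB = (char) 0x1a;

    /** U+241A, a printable glyph that represents SUB. */
    static final char SUB_SYMBOL = '\u241a';

    /** Replaces each SUB control character with its visible stand-in. */
    static String visible(String s) {
        return s.replace(SUB, SUB_SYMBOL);
    }

    public static void main(String[] args) {
        String raw = "a" + SUB + "b";
        System.out.println((int) SUB);                   // numeric value of the control char
        System.out.println(Character.isISOControl(SUB)); // SUB is a control character
        System.out.println(visible(raw));                // rendering depends on the font
    }
}
```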

see
http://mindprod.com/jgloss/ascii.html
http://mindprod.com/jgloss/unicode.html
http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/literal.html
--
Roedy Green Canadian Mind Products
http://mindprod.com
The modern conservative is engaged in one of man's oldest exercises in moral philosophy; that is,
the search for a superior moral justification for selfishness.
~ John Kenneth Galbraith (born: 1908-10-15 died: 2006-04-29 at age: 97)
 
 
Eric Sosman
Guest
Posts: n/a
 
      09-11-2011
On 9/11/2011 5:52 PM, Joshua Cranmer wrote:
> On 9/11/2011 4:33 PM, bob wrote:
>> Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

>
> The US-ASCII encoder only properly encodes characters in the range of
> 0-127, i.e., the characters that are present in ASCII. Any other
> character is replaced with some sort of substitution character; in this
> case, it looks like the charset has chosen to use ^Z as the "I don't
> know what this character is" character (I would have guessed '?'
> instead, but I suppose they decided to go with the less-commonly used
> variant).


It makes more sense when you think of 26 not as ^Z, but as SUB.

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)d
 
 
Bent C Dalager
Guest
Posts: n/a
 
      09-11-2011
On 2011-09-11, bob <(E-Mail Removed)> wrote:
> Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?


Unicode has multiple different hyphens and hyphen-like characters.

The traditional ASCII hyphen is the Unicode "hyphen-minus" which
encodes to 0x2d in utf-8.

http://www.fileformat.info/info/unic...r/2d/index.htm suggests the
following additional hyphen-like characters that you may actually be
working with in your string, and that will probably be mapped to 26 in
your case:

hyphen U+2010
non-breaking hyphen U+2011
figure dash U+2012
en dash U+2013
minus sign U+2212
roman uncia sign U+10191

If hyphens are of particular interest to you it may be a better
approach to replace non-ASCII-supported hyphens from the above list
with "hyphen-minus", before you transcode to ASCII.
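
A rough sketch of that approach, using only core String and Character methods. The class and method names (HyphenFolder, foldHyphens) are invented for this example. The table holds the code points listed above plus U+2014 (em dash), which is an addition not in the list. It walks the string by code point because U+10191 lies outside the BMP and takes two chars.

```java
final class HyphenFolder {
    // Code points from the list above; U+2014 (em dash) is an extra assumption.
    private static final int[] HYPHEN_LIKE = {
        0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2212, 0x10191
    };

    static boolean isHyphenLike(int codePoint) {
        for (int h : HYPHEN_LIKE) {
            if (h == codePoint) {
                return true;
            }
        }
        return false;
    }

    /** Replaces every hyphen-like code point with the ASCII hyphen-minus (0x2d). */
    static String foldHyphens(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.appendCodePoint(isHyphenLike(cp) ? '-' : cp);
            i += Character.charCount(cp); // 2 for supplementary code points
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(foldHyphens("pages 10\u201320, x \u2212 y"));
    }
}
```

Calling foldHyphens before getBytes("US-ASCII") means the dashes never reach the encoder, so there is no substitute byte to patch up afterwards.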

One would tend to think there ought to be a library function somewhere
to convert a unicode string to ASCII-supported variants of its various
characters where possible, that you should be using instead. I don't
know if such a function is easily available.

Cheers,
Bent D
--
Bent Dalager - (E-Mail Removed) - http://www.pvv.org/~bcd
powered by emacs
 
 
Joshua Cranmer
Guest
Posts: n/a
 
      09-11-2011
On 9/11/2011 6:18 PM, Bent C Dalager wrote:
> One would tend to think there ought to be a library function somewhere
> to convert a unicode string to ASCII-supported variants of its various
> characters where possible, that you should be using instead. I don't
> know if such a function is easily available.


This generally falls under the umbrella of Unicode normalization, which
can resolve, e.g., Å (U+212B, the Angstrom sign) and Å (U+00C5, the Swedish
letter) to the same representation (this may require compatibility
normalization). You can do this in Java using the java.text.Normalizer class.
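
For example, a minimal sketch (the class and helper names NormalizeDemo, nfkc and stripMarks are made up; java.text.Normalizer needs Java 6 or later). One caveat: normalization folds compatibility variants and lets you drop accents, but the em dash and en dash have no decomposition, so they come through unchanged and still need explicit handling.

```java
import java.text.Normalizer;

final class NormalizeDemo {
    /** Compatibility composition: folds variants such as U+212B or the fi ligature. */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Decomposes, then drops combining marks, leaving the base letters. */
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(nfkc("\u212B").equals("\u00C5")); // Angstrom sign vs. letter
        System.out.println(nfkc("\uFB01"));                  // fi ligature
        System.out.println(stripMarks("\u00C5ngstr\u00F6m"));
        System.out.println(nfkc("\u2014").equals("\u2014")); // em dash is left alone
    }
}
```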

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
 
 
Retahiv Oopsiscame
Guest
Posts: n/a
 
      09-11-2011
On Sep 11, 7:18 pm, Bent C Dalager <(E-Mail Removed)> wrote:
> On 2011-09-11, bob <(E-Mail Removed)> wrote:
>
> > Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

>
> Unicode has multiple different hyphens and hyphen-like characters.
>
> The traditional ASCII hyphen is the Unicode "hyphen-minus" which
> encodes to 0x2d in utf-8.
>
> http://www.fileformat.info/info/unic...ex.htm suggests the
> following additional hyphen-like characters that you may actually be
> working with in your string, and that will probably be mapped to 26 in
> your case:
>
> hyphen U+2010
> non-breaking hyphen U+2011
> figure dash U+2012
> en dash U+2013
> minus sign U+2212
> roman uncia sign U+10191


Wow, what a mess!

> One would tend to think there ought to be a library function somewhere
> to convert a unicode string to ASCII-supported variants of its various
> characters where possible,


Indeed.
 
 
bob
Guest
Posts: n/a
 
      09-12-2011
You're right. I messed up, and it was the em dash. It turned into 26
after going thru 'b = html.getBytes("US-ASCII");'

Here's the new code:

public static String convertToAscii(String html) {
    html = html.replaceAll("\u2019", "'");
    html = html.replaceAll("\u201D", "\"");
    html = html.replaceAll("\u201C", "\"");

    // em dash
    html = html.replaceAll("\u2014", "-");

    try {
        // Round-trip through US-ASCII so that any remaining non-ASCII
        // characters are replaced by the encoder's substitute.
        byte[] b = html.getBytes("US-ASCII");
        return new String(b, "US-ASCII");
    } catch (UnsupportedEncodingException e) {
        // Every Java platform must support US-ASCII, so this should not happen.
        e.printStackTrace();
        return html;
    }
}

Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
work.





 
 
Joshua Cranmer
Guest
Posts: n/a
 
      09-12-2011
On 9/11/2011 9:12 PM, bob wrote:
> You're right. I messed up, and it was the em dash. It turned into 26
> after going thru 'b = html.getBytes("US-ASCII");'
>
> Here's the new code:


Hardcoding a list of tables is generally not a good thing; in
particular, I don't think it's going to solve your problems. I have seen
sites that use the Unicode ff and fi ligatures instead of relying on
fonts to automatically pick up on that as well.

If I may ask, why do you need to convert the string to US-ASCII as
opposed to UTF-8? That is going to cause major issues for the ~90% of
the world that doesn't speak English as their main language.
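
To illustrate the difference, a small sketch (the class name Utf8RoundTrip and its methods are invented here, and java.nio.charset.StandardCharsets is assumed to be available, which it is not on older runtimes). UTF-8 can encode every Unicode character, so nothing is substituted and the text survives a round trip. The cost is that non-ASCII characters take more than one byte.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    /** Encodes to UTF-8 and decodes again; the result equals the input. */
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Number of bytes the string occupies in UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "a\u2014b"; // 'a', EM DASH, 'b'
        System.out.println(roundTrip(text).equals(text));
        System.out.println(utf8Length("-") + " byte vs " + utf8Length("\u2014") + " bytes");
    }
}
```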

> Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
> work.


It shouldn't be that hard to find other Java Unicode normalization
libraries out there.

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
 