Turn a Hebrew string into Unicode

 
 
dana livni
      06-29-2004
Hello,
I hope you can help me.
I have a Hebrew string (it can be in other languages too) and I need to
turn it into something I can send in a "get", but it has to be in Unicode.
For example:
the string &#1491;&#1504;&#1492; (דנה)
needs to be turned into %D7%93%D7%A0%D7%94

I tried a lot of methods but nothing works, please help me.
 
Jürgen Exner
      06-29-2004
dana livni wrote:
> I hope you can help me.
> I have a Hebrew string (it can be in other languages too) and I need to
> turn it into something I can send in a "get", but it has to be in Unicode.
> For example:
> the string &#1491;&#1504;&#1492; (דנה)


This appears to be _one_ textual representation of those characters in
Unicode, presumably their numerical value in UTF-16?

> needs to be turned into %D7%93%D7%A0%D7%94


This on the other hand doesn't look like Unicode at all but rather like
maybe URL encoding?

"Unicode" can be encoded in many different ways. Not only UTF-8 versus
UTF-16 versus UTF-32, but the resulting values can then be written as
numeric code points (maybe what you gave first), or as Base64, or
URL-encoded, and so on.
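
To make the difference between these encodings concrete, here is a minimal sketch that prints the octets of the same three letters in UTF-8, UTF-16 and UTF-32. The helper name hex_octets is made up for this illustration, and the big-endian variants are chosen only to keep the output free of byte-order marks.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The three letters from the question (DALET, NUN, HE), written as code
# points so the example does not depend on the source file encoding.
my $text = "\x{05D3}\x{05E0}\x{05D4}";

# Hypothetical helper: octets of $chars in the named encoding, as hex.
sub hex_octets {
    my ($encoding, $chars) = @_;
    return uc(unpack("H*", encode($encoding, $chars)));
}

for my $encoding (qw(UTF-8 UTF-16BE UTF-32BE)) {
    printf "%-8s %s\n", $encoding, hex_octets($encoding, $text);
}
```

Only the UTF-8 row matches the D7 93 D7 A0 D7 94 pattern in the question.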

Without knowing from where to where you really want to go it is very
difficult to offer any advice.

jue


 
dana livni
      06-30-2004
I guess you're right. I need to convert the text in order to send it in a
get request - in the format of the www.vivvisimo.com site.
I think that all the %d7 means that this is Hebrew and that the second
pair symbolizes the specific letter.
I'm not sure which encoding it is.
I meant to send the real string (my name - Dana) but the Google site
encoded it.

Is there any function that takes a string and the encoding to use and
returns a string of two pairs:
1. one symbolizing the language
2. one symbolizing the specific letter
like in my example? With that I will find the encoding I'm looking for.

Thanks
 
Ian Wilson
      07-02-2004
dana livni wrote:
> I guess you're right. I need to convert the text in order to send it in a
> get request - in the format of the www.vivvisimo.com site.


That's a parked domain, maybe you mean www.vivisimo.com?

> I think that all the %d7 means that this is Hebrew and that the second
> pair symbolizes the specific letter.


In Unicode, Hebrew glyphs are in the range 0590-05FF

You originally said
>> the string &#1491;&#1504;&#1492; (דנה)


Decimal 1491 is Hex 05D3 which is the Unicode code-point for Hebrew
letter DALET. Presumably this is the first letter of "Dana" in Hebrew.
http://www.unicode.org/charts/
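
The decimal-to-hex lookup can also be done in Perl rather than by hand. The following is a small sketch, assuming the three decimal values discussed in this thread; the helper name describe is invented for the example, and charnames::viacode is the core function that returns a character's Unicode name.

```perl
use strict;
use warnings;
use charnames ();

# Decimal code points as discussed above.
my @decimal = (1491, 1504, 1492);

# Hypothetical helper: format one code point as "decimal  U+HEX  NAME".
sub describe {
    my ($cp) = @_;
    my $name = charnames::viacode($cp);
    $name = 'unknown' unless defined $name;
    return sprintf("%d  U+%04X  %s", $cp, $cp, $name);
}

print describe($_), "\n" for @decimal;
```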

> I'm not sure which encoding it is.
> I meant to send the real string (my name - Dana) but the Google site
> encoded it.


Does vivisimo accept Unicode? I thought most sites expected ISO-8859-1
(Latin 1). Which does not include Hebrew characters AFAIK.

> Is there any function that takes a string and the encoding to use and
> returns a string of two pairs:
> 1. one symbolizing the language
> 2. one symbolizing the specific letter
> like in my example? With that I will find the encoding I'm looking for.


Does such an encoding exist? A pair of 8-bit bytes would allow 256
languages of 256 glyphs. There must be more than 256 languages in
Unicode and most of them have more than 256 glyphs. So such an encoding
could not represent more than a small subset of Unicode.

http://www.marsengineering.com/charCodeConverter.html
 
Alan J. Flavell
      07-02-2004
On Fri, 2 Jul 2004, Ian Wilson wrote:

> dana livni wrote:
> > I guess you're right. I need to convert the text in order to send it in a
> > get request - in the format of the www.vivvisimo.com site.

>
> That's a parked domain, maybe you mean www.vivisimo.com?
>
> > I think that all the %d7 means that this is Hebrew and that the second
> > pair symbolizes the specific letter.


I worried about the fact that I didn't understand exactly what the
questioner was trying to achieve, so I was reluctant to try to answer
the question, even if I might have some of the relevant expertise.

> In Unicode, Hebrew glyphs are in the range 0590-05FF
>
> You originally said
> >> the string &#1491;&#1504;&#1492; (דנה)

>
> Decimal 1491 is Hex 05D3 which is the Unicode code-point for Hebrew
> letter DALET. Presumably this is the first letter of "Dana" in Hebrew.
> http://www.unicode.org/charts/


Looking good so far.

> Does vivisimo accept Unicode? I thought most sites expected ISO-8859-1
> (Latin 1). Which does not include Hebrew characters AFAIK.


This is the point at which your reply lost credibility for me, I'm
afraid. If you don't know that for sure, I'm puzzled that you thought
it helpful to try to offer an answer.

> > Is there any function that takes a string and the encoding to use and
> > returns a string of two pairs:
> > 1. one symbolizing the language
> > 2. one symbolizing the specific letter
> > like in my example? With that I will find the encoding I'm looking for.

>
> Does such an encoding exist? A pair of 8-bit bytes would allow 256
> languages of 256 glyphs.


I'm not sure where you're heading here. Seems to be devising a
problem for which there have long since been solutions.

Current Perl versions have a natural way of representing Unicode
internally; and natural ways of turning it into other useful
representations (could be iso-8859-8; could be HTML &#number;
representations which the questioner evidently already knows about;
etc.) if utf-8 coding is somehow not appropriate.
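
As a rough illustration of those other representations, the sketch below turns the same character string into HTML numeric character references and into iso-8859-8 octets. The helper names to_ncr and to_hex_octets are made up for this example; Encode is the standard module for the conversion.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "\x{05D3}\x{05E0}\x{05D4}";   # DALET NUN HE

# Hypothetical helper: one &#number; reference per character.
sub to_ncr {
    my ($chars) = @_;
    return join "", map { sprintf "&#%d;", ord($_) } split //, $chars;
}

# Hypothetical helper: octets in the named encoding, shown as hex.
sub to_hex_octets {
    my ($encoding, $chars) = @_;
    return unpack("H*", encode($encoding, $chars));
}

print to_ncr($text), "\n";
print to_hex_octets("iso-8859-8", $text), "\n";
```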

But I still don't feel confident that I know what the original poster
wanted to achieve, so I couldn't offer a practical answer to their
questions yet, not with any degree of confidence.

> There must be more than 256 languages in Unicode


Unicode doesn't really "do" languages, except in the context of
disambiguating unified CJK characters. Greek (language) is still
Greek (language) even when transcribed into Latin characters; English
(language) is still Engrish (language) when transcribed into Japanese
writing. Unicode represents *writing systems*, not languages.

have fun
 
dana livni
      07-04-2004
I'm not sure I understood your answer.

What do I want to do?
I want to create a URI.
It should look like the URIs of Google or Vivisimo.
When you enter one of those sites and search for a word in Hebrew or
any other language, both sites turn every character in it into an
expression in this pattern:
%xx%xx. The first pair seems to mark the language (in Hebrew, d7) and the
second the specific letter.

I want to find a way to do the same.
I hope it is clear enough now.
 
Ian
      07-15-2004
"Alan J. Flavell" wrote:
> On Fri, 2 Jul 2004, Ian Wilson wrote:
>
> > dana livni wrote:
> > > I guess you're right. I need to convert the text in order to send it in a
> > > get request - in the format of the www.vivvisimo.com site.

> >
> > That's a parked domain, maybe you mean www.vivisimo.com?
> >
> > > I think that all the %d7 means that this is Hebrew and that the second
> > > pair symbolizes the specific letter.

>
> I worried about the fact that I didn't understand exactly what the
> questioner was trying to achieve, so I was reluctant to try to answer
> the question, even if I might have some of the relevant expertise.
>
> > In Unicode, Hebrew glyphs are in the range 0590-05FF
> >
> > You originally said
> > >> the string &#1491;&#1504;&#1492; (דנה)

> >
> > Decimal 1491 is Hex 05D3 which is the Unicode code-point for Hebrew
> > letter DALET. Presumably this is the first letter of "Dana" in Hebrew.
> > http://www.unicode.org/charts/

>
> Looking good so far.


Uh oh.


> > Does vivisimo accept Unicode? I thought most sites expected ISO-8859-1
> > (Latin 1). Which does not include Hebrew characters AFAIK.

>
> This is the point at which your reply lost credibility for me, I'm
> afraid. If you don't know that for sure, I'm puzzled that you thought
> it helpful to try to offer an answer.


Translation: "Please don't post replies to CLPM unless you are certain
the information you give is correct".

Further reading revealed ...

http://www.unicode.org/faq/unicode_web.html#11
says "If you have a single CGI and a single HTML form, then the
browsers will return the data in the encoding of the original form".

Both Google and Vivisimo search forms refer to UTF-8

For example, google has
<meta http-equiv="content-type" content="text/html; charset=UTF-8">



> > > Is there any function that takes a string and the encoding to use and
> > > returns a string of two pairs:
> > > 1. one symbolizing the language
> > > 2. one symbolizing the specific letter
> > > like in my example? With that I will find the encoding I'm looking for.

> >
> > Does such an encoding exist? A pair of 8-bit bytes would allow 256
> > languages of 256 glyphs.

>
> I'm not sure where you're heading here. Seems to be devising a
> problem for which there have long since been solutions.


I was attempting reductio ad absurdum. Some further playing with a
calculator shows that the %D7 which the OP refers to is simply a hex
representation of the first byte of the UTF-8 encoding of the Hebrew
DALET character. This byte does not designate a specific language
(i.e. script) as the OP appears to mistakenly assume.
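
The calculator exercise can be written down as a short sketch. It applies the standard two-octet UTF-8 layout (110xxxxx 10xxxxxx) by hand; the function name utf8_two_octets is invented for the illustration and only covers code points from U+0080 to U+07FF.

```perl
use strict;
use warnings;

# Hypothetical helper: split a code point in U+0080..U+07FF into its two
# UTF-8 octets, 110xxxxx 10xxxxxx, where the x bits are the code point.
sub utf8_two_octets {
    my ($cp) = @_;
    die "code point out of two-octet range\n" if $cp < 0x80 || $cp > 0x7FF;
    my $first  = 0xC0 | ($cp >> 6);      # marker bits plus the top 5 bits
    my $second = 0x80 | ($cp & 0x3F);    # marker bits plus the low 6 bits
    return ($first, $second);
}

printf "U+%04X -> %%%02X%%%02X\n", $_, utf8_two_octets($_)
    for 0x05D3, 0x05E0, 0x05D4;
```

The first octet depends only on the top bits of the code point, which is why nearby characters share it.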

> Current Perl versions have a natural way of representing Unicode
> internally; and natural ways of turning it into other useful
> representations (could be iso-8859-8; could be HTML &#number;
> representations which the questioner evidently already knows about;
> etc.) if utf-8 coding is somehow not appropriate.


> But I still don't feel confident that I know what the original poster
> wanted to achieve, so I couldn't offer a practical answer to their
> questions yet, not with any degree of confidence.


Translation: Fools rush in where angels fear to tread.
I'm pretty sure the OP wanted to search Google for a name in Hebrew.
Point taken however.

> > There must be more than 256 languages in Unicode

>
> Unicode doesn't really "do" languages, except in the context of
> disambiguating unified CJK characters. Greek (language) is still
> Greek (language) even when transcribed into Latin characters; English
> (language) is still Engrish (language) when transcribed into Japanese
> writing. Unicode represents *writing systems*, not languages.


This is of course true, my mistake. In fact, there is a web page at
unicode.org which does refer to the number of languages which can be
written using the various writing systems covered by Unicode. This
number is less than 256.

> have fun


I did.
 
Alan J. Flavell
      07-15-2004
On Thu, 15 Jul 2004, Ian wrote:

> > > I thought most sites expected ISO-8859-1
> > > (Latin 1). Which does not include Hebrew characters AFAIK.

> >
> > This is the point at which your reply lost credibility for me, I'm
> > afraid.

>
> Translation: "Please don't post replies to CLPM unless you are certain
> the information you give is correct".


Oh no, I wouldn't go *that* far; but trying to answer a question about
Hebrew, when you say you're not sure whether iso-8859-1 has Hebrew
characters in it, *did* seem to be rather adventurous, in the
circumstances. IMHO and RTL and YMMV.

> Further reading revealed ...
>
> http://www.unicode.org/faq/unicode_web.html#11
> says "If you have a single CGI and a single HTML form, then the
> browsers will return the data in the encoding of the original form".


Kind-of odd wording they are using, but yes, that's right: by default,
browsers submit their forms input using the same character encoding as
the HTML page which contains the form. And this is basically the only
option which works widely enough to be used (putting accept-charset on
the <form...> element is technically valid, but not widely supported).

However, Netscape 4.* versions get this massively wrong when the HTML
page is in utf-8. (Not that I really use NN4.* any more, but I keep
a copy for test purposes).

There's more about this topic (for anyone who's interested) at
http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html

> For example, google has
> <meta http-equiv="content-type" content="text/html; charset=UTF-8">


If Google thinks the browser is capable of it, indeed it does.
(Try it from NN4.* and you'll find a different result).

> Some further playing with a calculator shows that the %D7 which the
> OP refers to is simply a hex representation of the first byte of the
> UTF-8 encoding of the Hebrew DALET character.


Confirmed: based on the input mentioned in the original posting -

דנה

<!-- 1491 5d3 d793 -->
<!-- 1504 5e0 d7a0 -->
<!-- 1492 5d4 d794 -->

Those are decimal and hexadecimal code points, followed by the utf-8
representation. DALET NUN HE (reading off the unicode page U+05xx,
since I can't actually read Hebrew, sorry).

> This byte does not designate a specific language
> (i.e. script) as the OP appears to mistakenly assume.


That's technically accurate; although it just so happens that the
letters of the Hebrew alphabet (not counting the combining marks) all
have "d7" as the first octet (byte) of their utf-8 representation, so,
in a way, it -is- indicative of the Hebrew script.
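
That observation is easy to check mechanically. The sketch below collects the distinct leading utf-8 octets over two ranges of the Hebrew block; the helper name lead_octets and the choice of ranges (U+05D0 to U+05EA for the letters, U+0591 to U+05BF for the accents-and-points area) are assumptions made for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the distinct first utf-8 octets (as hex) seen
# across a range of code points.
sub lead_octets {
    my ($from, $to) = @_;
    my %seen;
    for my $cp ($from .. $to) {
        my $octets = encode("UTF-8", chr($cp));
        $seen{ sprintf "%02x", ord($octets) }++;   # ord() looks at the first octet
    }
    return sort keys %seen;
}

# U+05D0..U+05EA are the letters ALEF..TAV; U+0591..U+05BF is the part
# of the block that mostly holds accents and points.
print "letters: ", join(" ", lead_octets(0x05D0, 0x05EA)), "\n";
print "marks:   ", join(" ", lead_octets(0x0591, 0x05BF)), "\n";
```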

Well, the questioner referred to a "get" (which to me indicates
"form-URL-encoded" format), and said at the outset:

| the string &#1491;&#1504;&#1492; (דנה)
| needs to be turned into %D7%93%D7%A0%D7%94

Juergen Exner's reply seemed to be headed in the direction of
understanding the result as a url-encoded utf-8 representation, which
indeed hits the nail on the head, right.

But the original poster then added in a followup:

| I meant to send the real string (my name - Dana) but the Google site
| encoded it.

by which I understood that Google had turned the original encoding
(whatever it might have been) into &#number; notations. Unfortunately
the actual posting

http://groups.google.com/groups?selm...&output=gplain

claims to be in:

Content-Type: text/plain; charset=ISO-8859-1

which throws no light at all on what the actual posting details would
have been.

So we really don't know for sure from this whether the questioner is
working in iso-8859-8, utf-8 or what, in their practical application.

> I'm pretty sure the OP wanted to search Google for a name in Hebrew.


To submit a search request to something, indeed.

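Once the Hebrew text has been entered into Perl's natural Unicode
format, it can be turned into utf-8 octets with the Encode module,
rather than relying on unpack to expose the internal representation
(whether it does so depends on the Perl version).

Witness, step by step:

use Encode qw(encode);

my $string = chr(1491) . chr(1504) . chr(1492);

# Encode to utf-8 octets explicitly, then dump them as hex
my $result = unpack("H*", encode("UTF-8", $string));

print $result, "\n";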

Gives the result:

d793d7a0d794

Quod Erat Demonstrandum. It remains to insert the "%" characters
at appropriate points (no doubt a Perl golfer will be along any moment
to boil this down to a one-liner).
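
For completeness, here is one way the remaining step might look, spelled out rather than golfed. It is a sketch using only core modules; the function name percent_encode_utf8 is made up, and it escapes every octet outside the unreserved set (letters, digits, "-", ".", "_", "~"). The URI::Escape module from CPAN offers uri_escape_utf8 for the same job if that dependency is acceptable.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode a Perl character string as utf-8.
sub percent_encode_utf8 {
    my ($chars) = @_;
    my $octets = encode("UTF-8", $chars);
    # Replace every octet that is not an unreserved character with %XX.
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $octets;
}

my $name = chr(1491) . chr(1504) . chr(1492);
print percent_encode_utf8($name), "\n";   # %D7%93%D7%A0%D7%94
```

Note that this writes a space as %20; HTML form submissions conventionally use "+" instead, so a real query string may need that extra substitution.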

But of course I cheated: I created the input by using the chr()
function. I say again, we need to know how our questioner is creating
this input before we can replace that initial step with something
useful.
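
As one possible shape for that initial step, the sketch below assumes, purely for illustration, that the input arrives as raw iso-8859-8 octets (for example from a file or form field in that encoding). The variable names are invented; decode and encode are the standard Encode functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the name as raw octets in iso-8859-8. This is a guess at
# one plausible source encoding, not something the questioner confirmed.
my $input_octets = "\xE3\xF0\xE4";

# Step 1: decode the octets into Perl characters.
my $chars = decode("iso-8859-8", $input_octets);

# Step 2: encode the characters as utf-8 and dump the octets as hex.
my $hex = unpack("H*", encode("UTF-8", $chars));

print $hex, "\n";
```

If the input were already utf-8, the first step would use decode("UTF-8", ...) instead; the second step stays the same.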

> > have fun

>
> I did.


Great stuff.

 