How to identify double bytes language?

 
 
sqlcamel
Guest
Posts: n/a
 
      11-13-2009
Hello,

I have a text file, there are some double-bytes words in it, like
Chinese, Japanese.
Is there a way to identify them separately with Perl? Thanks.
 
 
 
 
 
Dr.Ruud
Guest
Posts: n/a
 
      11-13-2009
sqlcamel wrote:

> I have a text file, there are some double-bytes words in it, like
> Chinese, Japanese.
> Is there a way to identify them separately with Perl? Thanks.


See
`perldoc perlopentut`,
`perldoc -f open`,
`perldoc open`,
`perldoc PerlIO`
and look for "layer".

--
Ruud
 
 
 
 
 
Dr.Ruud
Guest
Posts: n/a
 
      11-13-2009
Ben Morrow wrote:
> Dr.Ruud:


>>> I have a text file, there are some double-bytes words in it, like
>>> Chinese, Japanese.
>>> Is there a way to identify them separately with Perl? Thanks.

>> See
>> `perldoc perlopentut`,
>> `perldoc -f open`,
>> `perldoc open`,
>> `perldoc PerlIO`
>> and look for "layer".

>
> IMHO you should start with perldoc perlunitut and perldoc perlunicode.


I don't understand. Maybe you thought that UTF-16 was meant?

The data in the "double-byte" encoded files (probably Shift-JIS, GB2312
or Big5) will just become normal Perl strings if the right IO-layer is used.

After that, some basic Unicode knowledge will of course help.
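
A minimal sketch of the IO-layer approach, assuming the input really is GB2312. The helper name `slurp_decoded` and the in-memory filehandle are only there to keep the example self-contained; a real script would open the file by name with the same layer.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Read bytes through an :encoding(...) layer and return the decoded
# character string. $encoding is any name that Encode knows about.
sub slurp_decoded {
    my ($bytes, $encoding) = @_;
    open(my $fh, "<:encoding($encoding)", \$bytes)
        or die "Cannot open in-memory file: $!";
    local $/;                 # slurp mode
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Build GB2312 test data from code points (the three Han characters
# U+4E2D U+56FD U+4EBA), so this source file can stay pure ASCII.
my $bytes = encode('GB2312', "\x{4E2D}\x{56FD}\x{4EBA}");
my $text  = slurp_decoded($bytes, 'GB2312');

printf "%d bytes in, %d characters out\n", length($bytes), length($text);
```

After decoding, length() and regexes count characters rather than bytes, which is the point of using the layer.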

--
Ruud
 
 
Ilya Zakharevich
Guest
Posts: n/a
 
      11-13-2009
On 2009-11-13, sqlcamel <(E-Mail Removed)> wrote:
> Hello,
>
> I have a text file, there are some double-bytes words in it, like
> Chinese, Japanese.
> Is there a way to identify them separately with Perl? Thanks.


As you can see, the posters may be confused about the meaning of your
question.

Myself, I think your question is about "how to guess which encoding it
is?". But please be more specific...
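
If the question really is about guessing the encoding, the core module Encode::Guess may be a starting point. This is only a sketch: the wrapper name `guess_name` is made up here, and the list of suspects is an assumption you have to supply yourself, since guessing among several similar double-byte encodings is often ambiguous.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# Return the name of the guessed encoding, or undef if the guess failed.
# guess_encoding() returns an encoding object on success and a plain
# error string otherwise, hence the ref() check.
sub guess_name {
    my ($bytes, @suspects) = @_;
    my $guess = guess_encoding($bytes, @suspects);
    return ref($guess) ? $guess->name : undef;
}

# Test data: the example sentence, encoded as GB2312.
my $bytes = encode('GB2312',
    "There is a \x{4E2D}\x{56FD}\x{4EBA} in the park.");

my $name = guess_name($bytes, 'euc-cn');
print defined $name ? "Looks like $name\n" : "Could not decide\n";
```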

Ilya
 
 
sqlcamel
Guest
Posts: n/a
 
      11-14-2009
Thanks for all the suggestions.
What I wanted is, for example, given the text piece below:

There is a 中国人 in the park.

So how to scratch the gb2312 word of 中国人 from the text?

Thanks again.


On Nov 14, 5:58 am, Ilya Zakharevich <(E-Mail Removed)> wrote:
> On 2009-11-13, sqlcamel <(E-Mail Removed)> wrote:
>
> > Hello,

>
> > I have a text file, there are some double-bytes words in it, like
> > Chinese, Japanese.
> > Is there a way to identify them separately with Perl? Thanks.

>
> As you can see, the posters may be confused about the meaning of your
> question.
>
> Myself, I think your question is about "how to guess which encoding it
> is?". But please be more specific...
>
> Ilya


 
 
Peter J. Holzer
Guest
Posts: n/a
 
      11-14-2009
On 2009-11-14 03:31, sqlcamel <(E-Mail Removed)> wrote:
> Thanks for all the suggestions.


Please don't top-post. Quote the relevant parts of the posting you are
replying to and write your answers below each part.

> What I wanted is, for example, given the text piece below:
>
> There is a 中国人 in the park.
>
> So how to scratch the gb2312 word of 中国人 from the text?


There isn't a "gb2312 word" in the text. The whole text is gb2312.

You want to distinguish the Chinese characters from the Latin
characters.

I think in GB2312 this is easy: Just search for pairs of bytes with the
high bit set.
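
A rough sketch of that byte-level search, assuming the data is well-formed GB2312 (EUC-CN), so that every byte with the high bit set belongs to a two-byte character. The function name is made up for this example, and the pattern works on the raw, undecoded bytes:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Return every run of byte pairs in which both bytes have the high bit
# set. Call in list context. $bytes must be raw (undecoded) data.
sub gb2312_double_byte_runs {
    my ($bytes) = @_;
    return $bytes =~ /((?:[\x80-\xFF]{2})+)/g;
}

# Test data: the example sentence, encoded as GB2312.
my $bytes = encode('GB2312',
    "There is a \x{4E2D}\x{56FD}\x{4EBA} in the park.");

for my $run (gb2312_double_byte_runs($bytes)) {
    printf "%d bytes, %d characters\n",
        length($run), length(decode('GB2312', $run));
}
```

This does not carry over to encodings such as Shift-JIS or Big5, where the second byte of a pair can be in the ASCII range.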

But in general I would convert the whole text to Unicode and check the
character properties. This works for *all* encodings, no matter how
complicated they are:

#!/usr/bin/perl
use warnings;
use strict;

binmode STDIN, ":encoding(GB2312)";  # input is GB2312
binmode STDOUT, ":encoding(UTF-8)";  # my terminal is UTF-8

# Read one decoded character at a time and list the scripts it matches.
while (read(STDIN, my $char, 1)) {
    my $classes = "";
    for my $class (qw(Han Latin)) {
        if ($char =~ /\p{$class}/) {
            $classes .= " $class";
        }
    }
    print "$char - $classes\n";
}
__END__

Prints for a file containing "There is a 中国人 in the park." in GB2312:


T - Latin
h - Latin
e - Latin
r - Latin
e - Latin
-
i - Latin
s - Latin
-
a - Latin
-
中 - Han
国 - Han
人 - Han
-
i - Latin
n - Latin
-
t - Latin
h - Latin
e - Latin
-
p - Latin
a - Latin
r - Latin
k - Latin
. -

-
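
To pull out the Chinese word itself rather than classify each character, the same property can be matched as a run. A minimal sketch, with the sub name `han_runs` invented for this example and the text assumed to be decoded already:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Return all maximal runs of Han characters in a decoded string.
# Call in list context.
sub han_runs {
    my ($text) = @_;
    return $text =~ /(\p{Han}+)/g;
}

# Simulate GB2312 input, then decode it the way an IO layer would.
my $bytes = encode('GB2312',
    "There is a \x{4E2D}\x{56FD}\x{4EBA} in the park.");
my $text = decode('GB2312', $bytes);

binmode STDOUT, ':encoding(UTF-8)';   # assumes a UTF-8 terminal
print "$_\n" for han_runs($text);
```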


hp
 
 
Jürgen Exner
Guest
Posts: n/a
 
      11-14-2009
[Please no TOFU, trying to repair]
sqlcamel <(E-Mail Removed)> wrote:
>> On 2009-11-13, sqlcamel <(E-Mail Removed)> wrote:
>> > I have a text file, there are some double-bytes words in it, like
>> > Chinese, Japanese.
>> > Is there a way to identify them separately with Perl? Thanks.

>
>What I wanted is, for example, given the text piece below:
>
>There is a 中国人 in the park.
>
>So how to scratch the gb2312 word of 中国人 from the text?


gb2312 is a character set; it includes at least Chinese as well as Latin
characters. Therefore all of your text is gb2312, not just that word.

Now, having said that, your real task seems to be to distinguish between
Latin/ASCII/... and non-Latin/ASCII/... characters.
There are several POSIX classes in the regular expressions that will
help you with that, please check 'perldoc perlre' for what is most
suitable for you.
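
As one possible illustration, the POSIX class [:ascii:] can be negated to find everything that is not plain ASCII. This is a sketch only: it assumes the text has already been decoded to a Perl character string, and the sub name is invented for the example.

```perl
use strict;
use warnings;

# Return all maximal runs of non-ASCII characters in a decoded string.
# Call in list context.
sub non_ascii_runs {
    my ($text) = @_;
    return $text =~ /([[:^ascii:]]+)/g;
}

# The example sentence, written with code points so the source stays ASCII.
my $text = "There is a \x{4E2D}\x{56FD}\x{4EBA} in the park.";

my @runs = non_ascii_runs($text);
printf "%d non-ASCII run(s), first one is %d characters long\n",
    scalar(@runs), length($runs[0]);
```

Note that this also catches accented Latin letters such as "ü", so it separates ASCII from non-ASCII, not Latin from Chinese.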

jue
 
 
Dr.Ruud
Guest
Posts: n/a
 
      11-14-2009
Ben Morrow wrote:
> Dr.Ruud:


>> The data in the "double-byte" encoded files (probably Shift-JIS, GB2312
>> or Big5) will just become normal Perl strings if the right IO-layer is used.

>
> No, they will become SvUTF8 strings, which (shouldn't, but do) behave
> differently from byte strings under some circumstances.


Please Ben, stop messing things up. I said Perl strings, not byte
strings. The unit of Perl strings is characters, not bytes.

--
Ruud
 
 
Peter J. Holzer
Guest
Posts: n/a
 
      11-14-2009
On 2009-11-14 10:03, Peter J. Holzer <(E-Mail Removed)> wrote:
> But in general I would convert the whole text to Unicode and check the
> character properties. This works for *all* encodings, no matter how
> complicated they are:

[...]
> for my $class (qw(Han Latin)) {
> if ($char =~ /\p{$class}/) {


Forgot to add: The full list of properties can be found in
perldoc perlunicode.
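
For example, adding the Hiragana and Katakana properties gives a rough way to tell Japanese text from Chinese. A small sketch with a hypothetical helper, `script_counts`, working on an already decoded string:

```perl
use strict;
use warnings;

# Count how many characters of a decoded string fall into each script.
# The list of scripts is just an example selection.
sub script_counts {
    my ($text) = @_;
    my %count;
    for my $char (split //, $text) {
        for my $script (qw(Han Hiragana Katakana Latin)) {
            $count{$script}++ if $char =~ /\p{$script}/;
        }
    }
    return \%count;
}

# "Perl", two Han characters, four Hiragana, four Katakana.
my $text = "Perl \x{4E2D}\x{56FD} "
         . "\x{3072}\x{3089}\x{304C}\x{306A} "
         . "\x{30AB}\x{30BF}\x{30AB}\x{30CA}";

my $count = script_counts($text);
printf "%-8s %d\n", $_, $count->{$_} // 0 for qw(Han Hiragana Katakana Latin);
```

Han alone cannot tell Chinese from Japanese, because kanji are Han characters too; the kana counts are only a hint.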

hp
 
 
 
 