Newbe Unicode question

 
 
Scottie
 
      02-18-2004
How do I write a Perl script that handles Unicode when it is run as:
perl zapotec.pl zapotecUnicode.txt > asdf.txt
where "zapotecUnicode.txt" is a UTF-8 file?

In the zapotec.pl I have:
binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");
use encoding "latin2";
at the very top.

Any help would be appreciated.

Scott
 
 
Ben Morrow
 
      02-18-2004

Scottie wrote:
> How do I write a Perl script that handles Unicode when it is run as:
> perl zapotec.pl zapotecUnicode.txt > asdf.txt
> where "zapotecUnicode.txt" is a UTF-8 file?
>
> In the zapotec.pl I have:
> binmode(STDOUT, ":utf8");
> binmode(STDIN, ":utf8");
> use encoding "latin2";


Why? Is your source in latin2?

> at the very top.
>
> Any help would be appreciated.


Err... what does your script do, and in what ways is it not working?

Ben

--
don't get my sympathy hanging out the 15th floor. you've changed the locks 3
times, he still comes reeling though the door, and soon he'll get to you, teach
you how to get to purest hell. you do it to yourself and that's what really
hurts is you do it to yourself just you, you and noone else
 
 
Scottie
 
      02-19-2004
Ben,

> > In the zapotec.pl I have:
> > binmode(STDOUT, ":utf8");
> > binmode(STDIN, ":utf8");
> > use encoding "latin2";

>
> Why? Is your source in latin2?


I'm sorry. The 3rd line is:
use encoding "latin1";

> Err... what does your script do, and in what ways is it not working?


I started with a GAWK script and used a2p to convert it to Perl. I think
the @Fld line is what keeps it from handling Unicode. I have hunted
through the Perl docs for my problem and haven't come up with
an answer. What do you think?

# Perl - a2p - Combines many changes to the Zapotec-Spanish dictionary.
# Scott Starker

binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");
use encoding "latin1";

# ${^WIDE_SYSTEM_CALLS} = 1;
$[ = 1; # set array base to 1
$, = " "; # set output field separator
$\ = "\n"; # set output record separator

$AlreadyGN = 0;
$notes = 0;
$gnsgnFirstLine = 0;
$anyline = 0;
$position = 0;
$lxline = '';
$mldef = '';
$seline = '';
$line = '';
$beg = '';
$end = '';

# This program takes out the "lx"'s that are alone on the line ("\k").
while (<>) {
chomp; # strip record separator
@Fld = split("\x{0020}", $_, 9999); # " "
print "\x{002a}";
# if ($Fld[1] eq " \\ l x") {
# if ($Fld[1] eq "\x{005c}\x{006c}\x{0078}") { # "\\lx"
if ($Fld[1] eq "\x{005c}\x{005c}\x{006c}\x{0078}") { # "\\lx"
print "\x{002a}\x{002a}";
$s = "\x{002d}", s/$s/\^\x{007e}/g; # "-"
# Make "tone" un-bolded
$Fld[2] = "\x{007c}\x{0062}" . $Fld[2]; # "\x{007c}\x{0062}"
s/\x{005b}/\x{007c}\x{0072}\x{005b}/g; # If "[" or "," exist
s/\x{005d}/\x{005d}\x{007c}\x{0062}/g;
s/\x{005d}\x{007c}\x{0062}\x{00b8}\x{0020}/\x{005d}\x{00b8}\x{0020}\x{007c}\x{0062}/g;
$Fld[$#Fld] = $Fld[$#Fld] . "\x{007c}\x{0072}";
$position = index($Fld[$#Fld], "\x{005d}");
$lxline = $_;
..
..
..

Scott
 
 
Ben Morrow
 
      02-19-2004

Scottie wrote:
> Ben,
>
> > > In the zapotec.pl I have:
> > > binmode(STDOUT, ":utf8");
> > > binmode(STDIN, ":utf8");
> > > use encoding "latin2";

> >
> > Why? Is your source in latin2?

>
> I'm sorry. The 3rd line is:
> use encoding "latin1";
>
> > Err... what does your script do, and in what ways is it not working?

>
> I started with a GAWK script and used a2p to convert it to Perl. I think
> the @Fld line is what keeps it from handling Unicode. I have hunted
> through the Perl docs for my problem and haven't come up with
> an answer. What do you think?
>
> # Perl - a2p - Combines many changes to the Zapotec-Spanish dictionary.
> # Scott Starker
>
> binmode(STDOUT, ":utf8");
> binmode(STDIN, ":utf8");
> use encoding "latin1";


This is unnecessary because a. latin1 is the default anyway and b. your
source is all ASCII.

> # ${^WIDE_SYSTEM_CALLS} = 1;
> $[ = 1; # set array base to 1


Aaarg... run away... $[ is highly deprecated and double-plus-ungood.
Yes, I know it's not your code.

> $, = " "; # set output field separator
> $\ = "\n"; # set output record separator
>
> $AlreadyGN = 0;
> $notes = 0;
> $gnsgnFirstLine = 0;
> $anyline = 0;
> $position = 0;
> $lxline = '';
> $mldef = '';
> $seline = '';
> $line = '';
> $beg = '';
> $end = '';
>
> # This program takes out the "lx"'s that are alone on the line ("\k").
> while (<>) {
> chomp; # strip record separator
> @Fld = split("\x{0020}", $_, 9999); # " "
> print "\x{002a}";
> # if ($Fld[1] eq " \\ l x") {
> # if ($Fld[1] eq "\x{005c}\x{006c}\x{0078}") { # "\\lx"
> if ($Fld[1] eq "\x{005c}\x{005c}\x{006c}\x{0078}") { # "\\lx"
> print "\x{002a}\x{002a}";
> $s = "\x{002d}", s/$s/\^\x{007e}/g; # "-"
> # Make "tone" un-bolded
> $Fld[2] = "\x{007c}\x{0062}" . $Fld[2]; # "\x{007c}\x{0062}"
> s/\x{005b}/\x{007c}\x{0072}\x{005b}/g; # If "[" or "," exist
> s/\x{005d}/\x{005d}\x{007c}\x{0062}/g;
> s/\x{005d}\x{007c}\x{0062}\x{00b8}\x{0020}/\x{005d}\x{00b8}\x{0020}\x{007c}\x{0062}/g;
> $Fld[$#Fld] = $Fld[$#Fld] . "\x{007c}\x{0072}";
> $position = index($Fld[$#Fld], "\x{005d}");
> $lxline = $_;


Right, let's attempt to translate that into Perl... (untested)

#!/usr/bin/perl

use strict;
use warnings;

$, = " ";
$\ = "\n";

binmode STDIN, ':encoding(utf8)';
binmode STDOUT, ':encoding(utf8)';
# :encoding(utf8) is better than :utf8, as you get fallback if the input is invalid

my $ced = "\xb8";

while (<>) {
    chomp;
    my ($a, $b, $c) = split " ";
    if ($a eq '\\\lx') { # this comes out as two \
        print '**';
        s/-/^~/g;
        $b = "|b$b";
        s/\[/|r[/g;
        s/]/]|b/g;
        s/]\|b$ced /]$ced |b/g; # move the cedilla back in front of |b

....etc. (Bog, that code's making my eyes hurt!) You can carry on, and
finish it (what you posted wasn't complete, right?).

Now, I can't really see what this is supposed to do, so what do you want
it to do, and what is it in fact doing?

Ben

--
$.=1;*g=sub{print@_};sub r($$\$){my($w,$x,$y)=@_;for(keys%$x){/main/&&next;*p=$
$x{$_};/(\w)::$/&&(r($w.$1,$x.$_,$y),next);$y eq\$p&&&g("$w$_")}};sub t{for(@_)
{$f&&($_||&g(" "));$f=1;r"","::",$_;$_&&&g(chr(0012))}};t # (E-Mail Removed)
$J::u::t, $a::n:::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $.
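
A note on the binmode lines in the scripts above: binmode on STDIN only changes the STDIN handle, while "while (<>)" reads the files named on the command line through a separate handle, so a file passed as an argument is not decoded by it. Below is a minimal sketch of one way around that, opening the file explicitly with a decoding layer. The helper name read_fields is made up for illustration, and the sketch assumes the file really is valid UTF-8 with whitespace-separated fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read one UTF-8 file and
# return its lines as lists of whitespace-separated fields, decoded
# to Perl characters.
sub read_fields {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my @rows;
    while (my $line = <$in>) {
        chomp $line;
        push @rows, [ split ' ', $line ];   # 0-based fields, no $[ needed
    }
    close $in;
    return @rows;
}

# Usage: perl zapotec.pl zapotecUnicode.txt > asdf.txt
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';     # encode characters on the way out
    $, = ' ';                               # output field separator
    $\ = "\n";                              # output record separator
    print @$_ for read_fields($ARGV[0]);
}
```

Opening the file by hand keeps the decoding step visible in the script, rather than depending on how the handle behind <> was set up.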
 
 
Scottie
 
      02-20-2004
Ben,

> ... (what you posted wasn't complete, right?).


It wasn't nearly all of it!

> Now, I can't really see what this is supposed to do, so what do you want
> it to do, and what is it in fact doing?


Well, zapotecUnicode.txt is a file that contains a "dictionary" of
Zapotec words (Zapotec is spoken in Mexico) with Spanish words as their
definitions. It's almost a database type of thing. The program is
called Shoebox. Each record has several different lines. The records all
start with "\lx" (lexicon). Then the definition(s) (\gn) follow.
There might be at least one subentry (\se) along with its definition(s)
(\sgn). There are more fields than these. (The Perl line print "**";
was for testing purposes.) So I need a @Fld = split(" ",
$_, 9999); that gives an array like this. Can you help me out? I need
to know how to get the line into @Fld.

Scott
 
 
Ben Morrow
 
      02-21-2004

Scottie wrote:
>
> Well, zapotecUnicode.txt is a file that contains a "dictionary" of
> Zapotec words (Zapotec is spoken in Mexico) with Spanish words as their
> definitions. It's almost a database type of thing. The program is
> called Shoebox. Each record has several different lines. The records all
> start with "\lx" (lexicon). Then the definition(s) (\gn) follow.
> There might be at least one subentry (\se) along with its definition(s)
> (\sgn). There are more fields than these. (The Perl line print "**";
> was for testing purposes.) So I need a @Fld = split(" ",
> $_, 9999); that gives an array like this. Can you help me out? I need
> to know how to get the line into @Fld.


Well, that's easy:

my @F = split ' ';

if the fields on each line are space-separated. Alternatively,

my @F = split /\\/;

may work better, as it will split the line on the backslashes. There are
two 'unfortunately's here: firstly, you'll get an initial empty field,
before the first backslash; secondly, the actual backslashes themselves
will be removed, so you'll have to remember to put them back in.
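
A minimal sketch of the backslash split and its two caveats, assuming a line holds markers such as \lx and \gn followed by text. The helper name parse_markers and the sample words are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on backslashes and return a list of
# [marker, text] pairs, so that "\lx word" becomes ['lx', 'word'].
sub parse_markers {
    my ($line) = @_;
    my @pairs;
    for my $field (split /\\/, $line) {
        next if $field !~ /\S/;      # skips the empty field before the first backslash
        my ($marker, $text) = split ' ', $field, 2;
        $text = '' unless defined $text;
        $text =~ s/\s+$//;           # trim whitespace left before the next marker
        push @pairs, [ $marker, $text ];
    }
    return @pairs;
}

# Printing puts the backslash back, one marker per line.
print "\\$_->[0] $_->[1]\n" for parse_markers('\lx word \gn palabra');
```

Keeping the marker and its text as a pair means the backslash only has to be restored in one place, at output time.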

It's probably easiest if you then iterate over the fields, and do
whatever you need to based on the field type:

#!/usr/bin/perl -lanF\\
# see perldoc perlrun for the above: it automagically iterates over all
# lines and splits them into @F

BEGIN {
    $\ = '';
    binmode STDIN, ':encoding(utf8)';
    binmode STDOUT, ':encoding(utf8)';
}

for (@F) {
    /^lx/ and next;

    /^gn/ and do {
        s/\xB8/|b/; # or whatever it is you want to do
        next;
    };
}
continue {
    # this makes sure each entry gets printed, with its backslash,
    # when you're done with it.

    print '\\' . $_ . $\;
}

Ben

--
For the last month, a large number of PSNs in the Arpa[Inter-]net have been
reporting symptoms of congestion ... These reports have been accompanied by an
increasing number of user complaints ... As of June,... the Arpanet contained
47 nodes and 63 links. [ftp://rtfm.mit.edu/pub/arpaprob.txt]
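
Since the thread describes records that each start with \lx and are followed by lines such as \gn, \se and \sgn, here is a rough sketch of grouping already-decoded lines into records before working on individual fields. It assumes one marker per line, which the thread does not confirm. The helper name group_records and the sample data are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: group chomped lines into records, starting a new
# record at every line whose marker is \lx. Each record is a list of
# [marker, text] pairs in their original order.
sub group_records {
    my @lines = @_;
    my @records;
    for my $line (@lines) {
        next unless $line =~ /^\\(\w+)\s*(.*)$/;   # skip lines without a marker
        my ($marker, $text) = ($1, $2);
        push @records, [] if $marker eq 'lx' || !@records;
        push @{ $records[-1] }, [ $marker, $text ];
    }
    return @records;
}

# Made-up sample data in the shape described in the thread.
my @records = group_records(
    '\lx word',  '\gn palabra', '\se subword', '\sgn subpalabra',
    '\lx other', '\gn otra',
);
for my $rec (@records) {
    print join(' | ', map { "\\$_->[0] $_->[1]" } @$rec), "\n";
}
```

With whole records in hand, a change that depends on more than one line (for example, a \gn that belongs to a particular \lx) can look at its neighbours instead of relying on global state carried between lines.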
 