Three questions: UTF-8, DBM, hash of lists, ...

 
 
Wes Groleau
      01-12-2005
I've been rooting around in perlutf8, perlencoding, perlunicode,
and other such things. I think I follow most of it, but there
are some contradictions. Or I thought there were.

1. At the moment, my source is pure ASCII, but I want to
treat it as UTF-8 because the text I work with is UTF-8
and my editor is configured accordingly. (And data
can easily become literals in source). I put -CSD on
my bang-line, which one man page said covers everything
(except -CL which I did not want for some reason). But
another man page seemed to say that "use utf8;" covered
something that -CSD did not, so I put that in, too. Is
either one interfering with the other in any way?

2. One of my applications is reading in a large file, finding
certain patterns, and using them as keys to store everything
else in a DBM hash (use DBM_File; dbmopen %hash, etc.)
The input is 99.5% ASCII--only a few French diacritics, one
copyright symbol, and two Polish characters. Yet adding
the utf-8 constructs to the script and regenerating the DBM
made a HUGE difference in the size of the file. Why is
that?

3. Say an input file contains key and value pairs, BUT
there is more than one possible value for a key.

For example, occupations.

Key         Value
----------- ---------
firefighter Fred
chef        Charlotte
firefighter Felicia

Can I store a list at the key, or do I have to append
to a string and split on output?

If I can store a list, what is the syntax? The following
is not allowed:


push (@the_hash{$the_job}, $the_name);


If the hash is tied with

use DBM_File;
dbmopen %the_hash .......

does that change the answer?


OK, more than three.

--
Wes Groleau

In any formula, constants (especially those obtained
from handbooks) are to be treated as variables.
 
 
Jim Keenan
      01-12-2005
Wes Groleau wrote:

>
> 3. Say an input file contains key and value pairs, BUT
> there is more than one possible value for a key.
>
> For example, occupations.
>
> Key         Value
> ----------- ---------
> firefighter Fred
> chef        Charlotte
> firefighter Felicia
>
> Can I store a list at the key, or do I have to append
> to a string and split on output?
>
> If I can store a list, what is the syntax? The following
> is not allowed:
>
>
> push (@the_hash{$the_job}, $the_name);
>
>

But wouldn't this be appropriate?

push @{$the_hash{$the_job}}, $the_name;
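
A minimal sketch of that idiom for an ordinary in-memory hash, using the sample data from the question. The variable names follow the ones above; the @pairs list is only there to make the example self-contained.

```perl
use strict;
use warnings;

# Sample (job, name) pairs from the question.
my @pairs = (
    [ firefighter => 'Fred' ],
    [ chef        => 'Charlotte' ],
    [ firefighter => 'Felicia' ],
);

my %the_hash;
for my $pair (@pairs) {
    my ($the_job, $the_name) = @$pair;
    # Autovivification creates the array reference on first use.
    push @{ $the_hash{$the_job} }, $the_name;
}

# Each value is now a reference to a list of names.
for my $job (sort keys %the_hash) {
    print "$job: ", join(', ', @{ $the_hash{$job} }), "\n";
}
```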



> If the hash is tied with
>
> use DBM_File;
> dbmopen %the_hash .......


Shouldn't that be ...?

use DB_File;
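
Whichever DBM module is used, a tied DBM hash stores only flat strings, so an array reference pushed into it does not survive the round trip. A hedged sketch of the append-and-split approach follows. It uses the core SDBM_File module and a temporary file purely for illustration; the file name and the NUL separator are assumptions, not anything from the original script.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;                  # core module; stands in for whichever DBM is in use
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $SEP = "\0";                 # assumed never to occur inside a name

tie my %the_hash, 'SDBM_File', "$dir/jobs", O_RDWR | O_CREAT, 0666
    or die "Cannot tie DBM file: $!";

for my $pair ([firefighter => 'Fred'], [chef => 'Charlotte'], [firefighter => 'Felicia']) {
    my ($the_job, $the_name) = @$pair;
    # Append to the stored string instead of pushing onto a list.
    $the_hash{$the_job} = exists $the_hash{$the_job}
        ? $the_hash{$the_job} . $SEP . $the_name
        : $the_name;
}

# Split on the way out to recover the list.
my @firefighters = split /\Q$SEP\E/, $the_hash{firefighter};
print join(', ', @firefighters), "\n";

untie %the_hash;
```

If nested data in a DBM file is needed often, CPAN modules such as MLDBM exist to serialize structures into the string values.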

Jim Keenan
 
 
Alan J. Flavell
      01-12-2005
On Tue, 11 Jan 2005, Wes Groleau wrote:

> Three questions


There are no special awards for folding several questions into one
posting. All that it achieves is: several unrelated subthreads
hanging-off the original posting. Confusion all round.

The key to effective problem-solving is to break up a complex problem
into manageable parts, and deal with each separately, until one
understands it well enough to use it as a component of the whole. In
that sense, I'd commend to you the strategy of asking detailed
questions one at a time (with enough context for the group to
understand the detailed question). If, on the other hand, you can't
decide how to partition a complex problem, then ask about the problem
itself, at a higher level, without pre-judging the lower-level
implementation detail. IMHO and YMMV, anyway.

> I've been rooting around in perlutf8, perlencoding, perlunicode,
> and other such things. I think I follow most of it, but there
> are some contradictions. Or I thought there were.
>
> 1. At the moment, my source is pure ASCII, but I want to
> treat it as UTF-8 because the text I work with is UTF-8
> and my editor is configured accordingly.


Please distinguish carefully between your program source and your
data.

As a matter of fact, us-ascii -is- a subset of utf-8 - utf-8 was
deliberately designed that way - but you *don't* have to use utf-8
encoding in your program source in order to process unicode data.

In any case, Perl's unicode implementation is supposed to be
transparent, i.e. you shouldn't normally need to know that its internal
representation happens to be utf-8. What you /do/ need to know is
what encoding is used in your /external data/, and to tell Perl about
it at the appropriate time (e.g. by an encoding layer on an I/O
statement).
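
A minimal sketch of an encoding layer on an I/O statement, assuming the external data is UTF-8. The temporary file and the sample string are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Sample text written with escapes: "cafe" with e-acute, a copyright sign, and a Polish word.
my $text = "caf\x{e9} \x{a9} \x{141}\x{f3}d\x{17a}";

# Characters are encoded to UTF-8 bytes on the way out...
open my $out, '>:encoding(UTF-8)', $path or die "open for write: $!";
print {$out} "$text\n";
close $out or die "close: $!";

# ...and UTF-8 bytes are decoded back to characters on the way in.
open my $in, '<:encoding(UTF-8)', $path or die "open for read: $!";
chomp(my $line = <$in>);
close $in;

# The character count and the byte count on disk differ for non-ASCII text.
printf "characters: %d, bytes on disk: %d\n", length($line), -s $path;
```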

> (And data can easily become literals in source).


In many situations, you might be better advised to write unicode
characters into the source by means of their \x{..} representation.
Which is not to deny that there can also be situations where you'd
want to write unicode characters directly - but then you have to be a
lot more careful with how you edit and transfer your source code.
See
http://www.perldoc.com/perl5.8.4/pod...cter-Semantics
for more details.
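
A short sketch of the \x{..} approach, which keeps the source file pure ASCII. The sample words are assumptions chosen to match the French and Polish characters mentioned in the question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# ASCII-only source: each non-ASCII character is named by its code point.
my $cafe    = "caf\x{e9}";          # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $wroclaw = "Wroc\x{142}aw";      # U+0142 LATIN SMALL LETTER L WITH STROKE

# length() counts characters; the encoded form shows the UTF-8 byte count.
my $chars = length($wroclaw);
my $bytes = length(encode('UTF-8', $wroclaw));
print "characters: $chars, UTF-8 bytes: $bytes\n";

# The same escapes work inside a regular expression.
print "found e-acute\n" if $cafe =~ /\x{e9}/;
```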

> I put -CSD on
> my bang-line, which one man page said covers everything
> (except -CL which I did not want for some reason).


Could we have a cite on that?

-C is a request to use wide system calls. It doesn't influence Perl's
interpretation of your program source or data "as such".

> But
> another man page seemed to say that "use utf8;" covered
> something that -CSD did not, so I put that in, too.


The perlunicode pod, for the version of Perl that you're using, should
be your "bible". Don't go tossing-in arbitrary bits and pieces that
you may have acquired from elsewhere - treat them as possibly
misleading clues, but check with the authoritative documentation to
make sure that they really do what you want.

See what
http://www.perldoc.com/perl5.8.4/pod...ortant-Caveats
says about "use utf8;".

> Is either one interfering with the other in any way?


I don't know of any reason why they should.

good luck
 
 
Wes Groleau
      01-15-2005
Alan J. Flavell wrote:
> There are no special awards for folding several questions into one


No rewards expected or requested.

> hanging-off the original posting. Confusion all round.


Welcome to Usenet.

>>1. At the moment, my source is pure ASCII, but I want to
>> treat it as UTF-8 because the text I work with is UTF-8
>> and my editor is configured accordingly.

>
> Please distinguish carefully between your program source and your
> data.


I did. When I said "source," I meant "source" and when
I said "text" I meant what you apparently call "data."

> As a matter of fact, us-ascii -is- a subset of utf-8 - utf-8 was
> deliberately designed that way - but you *don't* have to use utf-8
> encoding in your program source in order to process unicode data.


I know that. However, I prefer that everything on my system
be interpreted as UTF-8, as I work with French, Spanish, Polish,
and Japanese. The script is all ASCII _now_ but I could add
literals for searching or whatever at any time.

> In any case, Perl's unicode implementation is supposed to be
> transparent, i.e you shouldn't normally need to know that its internal
> representation happens to be utf-8. What you /do/ need to know is


I don't want to know what it does internally, as long as everything
comes out UTF-8 and is decoded as such going in.

> what encoding is used in your /external data/, and to tell Perl about
> it at the appropriate time (e.g by an encoding layer on an I/O
> statement).


Since I want _everything_ UTF-8, the appropriate time
is (if possible) at the beginning of the script.

> In many situations, you might be better advised to write unicode
> characters into the source by means of their \x{..} representation.


My terminal renders the glyphs correctly when I 'cat' UTF-8.
Why should I have to look up the codes every time instead?
And although I can compose characters in hex, why should
I do that instead of cut-and-paste from the editor?

> Which is not to deny that there can also be situations where you'd
> want to write unicode characters directly - but then you have to be a
> lot more careful with how you edit and transfer your source code.
> See
> http://www.perldoc.com/perl5.8.4/pod...cter-Semantics
> for more details.


Yes, I read that. I'm trying to minimize the need for "being careful"
about all those ten zillion details by specifying "everything is UTF-8."

> -C is a request to use wide system calls. It doesn't influence Perl's
> interpretation of your program source or data "as such".


You're right:

man perlrun
.....

As of 5.8.1, the "-C" can be followed either by a number or a list
of option letters. The letters, their numeric values, and effects
are as follows; listing the letters is equal to summing the numbers.

    I     1    STDIN is assumed to be in UTF-8
    O     2    STDOUT will be in UTF-8
    E     4    STDERR will be in UTF-8
    S     7    I + O + E
    i     8    UTF-8 is the default PerlIO layer for input streams
    o    16    UTF-8 is the default PerlIO layer for output streams
    D    24    i + o

Seems to say -CSDA should handle all my IO (I left off the A because
I still have a little bit of resistance to overcome from the shell)
except for the script itself. A detail I missed. Not an issue yet,
but I'd like to fix it before it becomes one.
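
One way to ask for similar I/O defaults from inside the script, rather than on the #! line, is the open pragma. This is a sketch offered as an assumption about what would suit the setup described, not an exact equivalent of -CSD. It also checks the arithmetic in the table above.

```perl
use strict;
use warnings;

# Default UTF-8 layers for open() in this lexical scope and,
# via :std, for STDIN, STDOUT and STDERR as well.
use open qw(:std :encoding(UTF-8));

# Listing letters equals summing their numbers: S is I + O + E, D is i + o.
my %value = (I => 1, O => 2, E => 4, i => 8, o => 16);
my $sum = 0;
$sum += $value{$_} for qw(I O E i o);
print "S + D = $sum\n";

# A character above U+00FF goes out as UTF-8, with no "Wide character" warning.
print "Wroc\x{142}aw\n";
```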

>> But
>> another man page seemed to say that "use utf8;" covered
>> something that -CSD did not, so I put that in, too.

>
> The perlunicode pod, for the version of Perl that you're using, should
> be your "bible". Don't go tossing-in arbitrary bits and pieces that


I have 5.8.1 but no pod, so my 'elsewhere' is the man pages
derived from the pod.

> See what
> http://www.perldoc.com/perl5.8.4/pod...ortant-Caveats
> says about "use utf8;".


It says the same as my man page: that the pragma is needed
to "enable UTF-8" in scripts. It doesn't say whether
"enable" means the script itself or the IO or both.
However, 'man perlrun' says the -CSD handles the IO,
and perlunicode says for script encoding, see encoding
which says that UTF-8 already works in scripts.

So, things are a little unclear. I put in both, and
was able to read UTF-8 text, put it in a DBM hash, and
get it back out. That's good enough for now.

--
Wes Groleau
"Beware the barrenness of a busy life."
-- George Verwer

 
 
Alan J. Flavell
      01-15-2005
On Sat, 15 Jan 2005, Wes Groleau wrote:

> Welcome to Usenet.


Indeed. It seems from your response, and the rarity of responses from
other contributors, that you're in the position to offer us all a
valuable tutorial on the topic.

> I don't want to know what it does internally, as long as everything
> comes out UTF-8 and is decoded as such going in.


Fine, then we're pretty much up to speed already, and I'm sorry that I
misinterpreted your original posting.

> > Which is not to deny that there can also be situations where you'd
> > want to write unicode characters directly - but then you have to
> > be a lot more careful with how you edit and transfer your source
> > code. See
> > http://www.perldoc.com/perl5.8.4/pod...cter-Semantics
> > for more details.

>
> Yes, I read that. I'm trying to minimize the need for "being
> careful" about all those ten zillion details by specifying
> "everything is UTF-8."


Point made. If you're really in control of all that data then you're
in a much happier position than I've ever been.

>     I     1    STDIN is assumed to be in UTF-8
>     O     2    STDOUT will be in UTF-8
>     E     4    STDERR will be in UTF-8
>     S     7    I + O + E
>     i     8    UTF-8 is the default PerlIO layer for input streams
>     o    16    UTF-8 is the default PerlIO layer for output streams
>     D    24    i + o
>
> Seems to say -CSDA should handle all my IO


It does, doesn't it? Did I miss the specific problem you were having,
and your test case that demonstrated it?

> > > But
> > > another man page seemed to say that "use utf8;" covered
> > > something that -CSD did not, so I put that in, too.

> >
> > The perlunicode pod, for the version of Perl that you're using,
> > should be your "bible". Don't go tossing-in arbitrary bits and
> > pieces that

>
> I have 5.8.1 but no pod, so my 'elsewhere' is the man pages
> derived from the pod.


No disagreement there. More than one way to...read the documentation.

> > See what
> > http://www.perldoc.com/perl5.8.4/pod...ortant-Caveats
> > says about "use utf8;".

>
> It says the same as my man page: that the pragma is needed
> to "enable UTF-8" in scripts.


Hmmm? At 5.8.4 (and I don't remember it being different in recent
versions before that) it says [this'll need monospace display, and go
sadly wrong with these newfangled usenet-ish interfaces, sorry]:

As a compatibility measure, the use utf8 pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts
                                        ^^^^^^^^^^^^^^^^^^^
themselves (in string or regular expression literals, or in
^^^^^^^^^^
identifier names) on ASCII-based machines or to recognize UTF-EBCDIC
on EBCDIC-based machines. These are the only times when an explicit
                                        ^^^^^^^^^^
use utf8 is needed.
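
A small sketch of what that means for a literal in the script itself. It assumes the script file is saved as UTF-8, so the e-acute in the source is two bytes on disk.

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;       # the literal is decoded as UTF-8: one character
    $with = length("é");
}
{
    no utf8;        # the same two bytes are taken one by one: two characters
    $without = length("é");
}
print "with use utf8: $with, without: $without\n";
```

The pragma is lexically scoped, which is why the two blocks can disagree about the same bytes.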

> However, 'man perlrun' says the -CSD handles the IO,


Indeed, and (fwiw) I don't see anything there about encoding of the
script's source code itself.

> and perlunicode says for script encoding, see encoding
> which says that UTF-8 already works in scripts.


It "works", yes, but (as I understand it, anyway) I think you have to
ask for it. It could just be that if you call for locale-awareness
with -CL, and you have utf-8 in your locale, it will come out in the
wash; but I don't see any harm in asking for it directly, if you're so
certain that you'll never not want it (sorry for the double-negative).

> So, things are a little unclear. I put in both,


Looks as if you're (a) right and (b) unlikely to cause any harm.

> was able to read UTF-8 text, put it in a DBM hash, and
> get it back out. That's good enough for now.


Good luck
 
 
Wes Groleau
      01-16-2005
Alan J. Flavell wrote:
[re UTF-8 in perl scripts]

> It "works", yes, but (as I understand it, anyway) I think you have to
> ask for it. It could just be that if you call for locale-awareness
> with -CL, and you have utf-8 in your locale, it will come out in the
> wash; but I don't see any harm in asking for it directly, if you're so
> certain that you'll never not want it (sorry for the double-negative).


I also left the L off of -C because I don't think I have that completely
coerced to UTF-8.

>>So, things are a little unclear. I put in both,

>
> Looks as if you're (a) right and (b) unlikely to cause any harm.


Sigh, now it starts getting weird. Kind of long, summary at the bottom.

The script with -CSD and use utf8 created a database,
and a test script pulled the records out of the database
and printed them. The non-ASCII characters rendered
correctly BUT that doesn't mean anything, since the test
script had the same -CSD and use utf8. (Right?)

So I figured I needed to eyeball inside the DB file
and see if I could find some nonASCII and see how it was encoded.

But a series of unfortunate events resulted in my having
to re-create the script, and then it crashed (bus error
or segmentation fault). Figured out which record it
was crashing on, put it in its own file, and ....
well to skip over the long tedious details, I eventually
had a version of the script that would crash and one that
would not crash on the same input file.

'diff' showed only one difference:

wgroleau$ diff ~/bin/GEDCOM_DB ./tempGCDB
1c1
< #!/usr/bin/perl -w -CSD
---
> #!/usr/bin/perl -w -CSD


od -xc revealed that the extra space is indeed a (hex 20)
regular space and not a UTF-8 construct.

More study showed that the space made a difference on the only
two systems I currently have access to:

wgroleau$ uname -a
Darwin Groleau.local 7.7.0 Darwin Kernel Version 7.7.0: Sun Nov 7
16:06:51 PST 2004; rootnu/xnu-517.9.5.obj~1/RELEASE_PPC Power
Macintosh powerpc
wgroleau$ perl -v

This is perl, v5.8.1-RC3 built for darwin-thread-multi-2level
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2003, Larry Wall

AND

[0:ag/g/groleau> uname -a
NetBSD otaku 1.6.2_STABLE NetBSD 1.6.2_STABLE (sdf) #0: Sun Jul 25
04:17:09 UTC 2004 root@ol:/var/src/src/sys/arch/alpha/compile/sdf alpha

[0:ag/g/groleau> perl -v

This is perl, v5.8.0 built for alpha-netbsd

Copyright 1987-2002, Larry Wall


On Darwin/PPC, the extra space prevents bus error/segmentation fault.
On Net-BSD/Alpha, it prevents the following:

[0:ag/g/groleau> rm wgroleau.DB; ./tempGCDB < bad.record.GED
Recompile perl with -DDEBUGGING to use -D switch
Can't emulate -S on #! line at ./tempGCDB line 1.
[255:ag/g/groleau> head -1 ./tempGCDB
#!/usr/pkg/bin/perl -w -CSD


Summary: On two different platforms, in

#!/usr/bin/perl -w -CSD

the extra space is required.

If anyone wants to try it on a different system, I can provide
the script and the input file.

--
Wes Groleau
-----------

"Thinking I'm dumb gives people something to
feel smug about. Why should I disillusion them?"
-- Charles Wallace
(in _A_Wrinkle_In_Time_)
 
 
Tad McClellan
      01-16-2005
Wes Groleau wrote:
> Alan J. Flavell wrote:
>> There are no special awards for folding several questions into one

>
> No rewards expected or requested.
>
>> hanging-off the original posting. Confusion all round.

>
> Welcome to Usenet.



So long then.


--
Tad McClellan SGML consulting
Perl programming
Fort Worth, Texas
 