Velocity Reviews - Computer Hardware Reviews

Replacing hundreds of hash keys with their values in a text document

 
 
Arvin Portlock
      02-12-2004
I'm writing a script that replaces the direct form of a
special character with its SDATA equivalent. For example,
it would replace all occurrences of é with &eacute;. I've
compiled an enormous hash with the "direct" form as the
key and the SDATA version as its value. I can think of two
ways to accomplish this. The first is to loop through all
keys and do a global replace with the correct value:

foreach my $key (keys %characters) {
    $fulltext =~ s/$key/$characters{$key}/g;
}

The second is to process the document character by character
and if the character is in the hash then replace it:

local $/ = undef;
open (FILE, $file);
my $fulltext = <FILE>;
close (FILE);
my @chars = split (//, $fulltext);
foreach my $char (@chars) {
    if ($characters{$char}) {
        print $characters{$char};
    } else {
        print $char;
    }
}

The second seems the faster option, but neither one of them
is exactly an elegant solution. Is there something obvious
I'm missing?

Arvin

 
Ben Morrow
      02-12-2004
Arvin Portlock wrote:
> I'm writing a script that replaces the direct form of a
> special character with its SDATA equivalent. For example,
> it would replace all occurrences of é with &eacute;. I've
> compiled an enormous hash with the "direct" form as the
> key and the SDATA version as its value. I can think of two
> ways to accomplish this. The first is to loop through all
> keys and do a global replace with the correct value:
>
> foreach my $key (keys %characters) {
> $fulltext =~ s/$key/$characters{$key}/g;
> }
>
> The second is to process the document character by character
> and if the character is in the hash then replace it:
>
> local $/ = undef;
> open (FILE, $file);
> my $fulltext = <FILE>;
> close (FILE);
> my @chars = split (//, $fulltext);
> foreach my $char (@chars) {
> if ($characters{$char}) {
> print $characters{$char};
> } else {
> print $char;
> }
> }
>
> The second seems the faster option, but neither one of them
> is exactly an elegant solution. Is there something obvious
> I'm missing?


If you're using 5.8, and don't mind having &#nnnn; instead of named
entities, you can do

use Encode qw/:fallbacks/;

$PerlIO::encoding::fallback = FB_HTMLCREF;
binmode STDOUT, ':encoding(ascii)';

open my $FILE, '<:encoding(latin1)', $file or die "Cannot open $file: $!";
# or whatever encoding is appropriate
print while <$FILE>;
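
To see what that fallback produces without involving any filehandles, the same constant can be passed straight to Encode::encode. This is only a minimal sketch, assuming the text has already been decoded into Perl characters; the helper name to_ncr is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode :fallbacks);

# Hypothetical helper: replace every character that ASCII cannot
# represent with a decimal numeric character reference.
sub to_ncr {
    my ($text) = @_;
    # FB_HTMLCREF includes LEAVE_SRC, so $text itself is not modified.
    return encode('ascii', $text, FB_HTMLCREF);
}

print to_ncr("caf\x{e9} na\x{ef}ve"), "\n";   # prints: caf&#233; na&#239;ve
```

Plain ASCII input passes through unchanged, so the helper can be applied to whole lines.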

Otherwise, I'd do

open my $FILE, '<', $file or die "Cannot open $file: $!";
while (<$FILE>) {
    s/([^[:ascii:]])/$characters{$1}/g;
    print;
}

If your %characters doesn't include all the non-ASCII characters in
the file, you could use

my $to_encode = '[' . (join '', keys %characters) . ']';
while (<$FILE>) {
    s/($to_encode)/$characters{$1}/g;
    print;
}
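
One caveat when building the class from raw keys: a key such as ], ^, - or \ would change the meaning of the bracket expression. The following is a minimal sketch that escapes each key with quotemeta first. The sample %characters entries and the sub name sdata_escape are invented for illustration, and every key is assumed to be a single character.

```perl
use strict;
use warnings;

# Sample mapping, made up for this sketch; the real hash is much larger.
my %characters = (
    "\x{e9}" => '&eacute;',
    "\x{e8}" => '&egrave;',
    ']'      => '&rsqb;',
    '-'      => '&hyphen;',
);

# quotemeta backslash-escapes anything that could be special in a class.
my $class = '[' . join('', map { quotemeta($_) } keys %characters) . ']';

# Hypothetical helper: replace every character that is a key of
# %characters with its mapped value, leaving everything else alone.
sub sdata_escape {
    my ($line) = @_;
    $line =~ s/($class)/$characters{$1}/g;
    return $line;
}

print sdata_escape("caf\x{e9} [a-z]"), "\n";   # prints: caf&eacute; [a&hyphen;z&rsqb;
```

Because only characters that are keys in the hash can match, $characters{$1} is always defined here.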

Ben

--
Like all men in Babylon I have been a proconsul; like all, a slave ... During
one lunar year, I have been declared invisible; I shrieked and was not heard,
I stole my bread and was not decapitated.
~ Jorge Luis Borges, 'The Babylon Lottery'
 
Arvin Portlock
      02-13-2004
> If your %characters doesn't include all the non-ascii in the file, you
> could use
>
> my $to_encode = '[' . (join '', keys %characters) . ']';
> while (<$FILE>) {
> s/($to_encode)/$characters{$1}/g;
> print;
> }
>
> Ben



Boy, do I feel like an idiot. That makes MUCH more sense and is just
what I'll do. I have no idea what I was thinking.

> If you're using 5.8, and don't mind having &#nnnn; instead of named
> entities, you can do
>
> use Encode qw/:fallbacks/;
>
> $PerlIO::encoding::fallback = FB_HTMLCREF;
> binmode STDOUT, ':encoding(ascii)';
>
> open my $FILE, '<:encoding(latin1)', $file or die...;
> # or whatever encoding is appropriate
> print while <$FILE>;


Nah, I have to use SDATA entities. I'm not dealing with HTML.
But this is a good trick for another project: converting Unicode
characters to decimal numeric entities in HTML files so older
browsers can view them.

Thanks!

Arvin

 
Brian McCauley
      02-13-2004
Arvin Portlock writes:

> > If your %characters doesn't include all the non-ascii in the file, you
> > could use
> >
> > my $to_encode = '[' . (join '', keys %characters) . ']';
> > while (<$FILE>) {
> > s/($to_encode)/$characters{$1}/g;
> > print;
> > }
> >
> > Ben

>
>
> That makes MUCH more sense and is just what I'll do.


I've not benchmarked it, but I suspect it would be more efficient to
take the appending of the () outside the loop. I'd also explicitly
precompile the regex - although I think Perl will actually manage to
avoid unnecessary recompilation anyhow.

my $to_encode = join '', keys %characters;
$to_encode = qr/([$to_encode])/;
while (<$FILE>) {
    s/$to_encode/$characters{$1}/g;
    print;
}

Note: Some people would use @{[]} interpolation here, but although I'm
a proponent of @{[]} in here-docs, I think it looks messy in qr//.

my $to_encode = qr/([@{[ join '', keys %characters ]}])/;
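
Since none of the variants in this thread were actually timed, a rough comparison could be set up with the core Benchmark module. This is a sketch only: the three-entry hash, the sample text, and the sub names are invented here, and the figures will depend on the Perl version and the real data, so no results are quoted.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data; substitute the real hash and document.
my %characters = (
    "\x{e9}" => '&eacute;',
    "\x{e8}" => '&egrave;',
    "\x{e0}" => '&agrave;',
);
my $text = join '', ("d\x{e9}j\x{e0} vu, tr\x{e8}s bien\n") x 500;

my $class = join '', map { quotemeta($_) } keys %characters;
my $re    = qr/([$class])/;

# One global substitution per hash key.
sub per_key {
    my $t = $text;
    for my $key (keys %characters) {
        $t =~ s/\Q$key\E/$characters{$key}/g;
    }
    return $t;
}

# One pass, pattern interpolated from a plain string.
sub interpolated {
    my $t = $text;
    $t =~ s/([$class])/$characters{$1}/g;
    return $t;
}

# One pass, pattern precompiled with qr//.
sub precompiled {
    my $t = $text;
    $t =~ s/$re/$characters{$1}/g;
    return $t;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    per_key      => \&per_key,
    interpolated => \&interpolated,
    precompiled  => \&precompiled,
});
```

All three subs return the same converted text, so any difference in the table reflects only how the substitution is driven.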

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\