Arvin Portlock wrote:
> I'm writing a script that replaces the direct form of a
> special character with its SDATA equivalent. For example
> it would replace all occurrences of é with &eacute;. I've
> compiled an enormous hash with the "direct" form as the
> key and the SDATA version as its value. I can think of two
> ways to accomplish this. The first is to loop through all
> keys and do a global replace with the correct value:
>
> foreach my $key (keys %characters) {
>     $fulltext =~ s/$key/$characters{$key}/g;
> }
>
> The second is to process the document character by character
> and if the character is in the hash then replace it:
>
> local $/ = undef;
> open (FILE, $file);
> my $fulltext = <FILE>;
> close (FILE);
> my @chars = split (//, $fulltext);
> foreach my $char (@chars) {
>     if ($characters{$char}) {
>         print $characters{$char};
>     } else {
>         print $char;
>     }
> }
>
> The second seems the faster option, but neither one of them
> is exactly an elegant solution. Is there something obvious
> I'm missing?

If you're using 5.8, and don't mind having &#nnnn; instead of named
entities, you can do

    use Encode qw/:fallbacks/;
    $PerlIO::encoding::fallback = FB_HTMLCREF;
    binmode STDOUT, ':encoding(ascii)';
    open my $FILE, '<:encoding(latin1)', $file or die...;
    # or whatever encoding is appropriate
    print while <$FILE>;
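
To see what that fallback produces without involving any filehandles, here is a minimal sketch that calls Encode's encode() directly on a string. The helper name to_ncr and the sample text are made up for illustration; FB_HTMLCREF is the same fallback constant as above.

```perl
use strict;
use warnings;
use Encode qw/encode :fallbacks/;

# Hypothetical helper: turn every character that ASCII can't
# represent into a decimal &#nnnn; reference.
sub to_ncr {
    my ($text) = @_;    # work on a copy; encode() may modify its input when CHECK is set
    return encode('ascii', $text, FB_HTMLCREF);
}

print to_ncr("caf\x{e9}"), "\n";    # prints caf&#233;
```

The PerlIO layer version does the same thing on every print, so it is the better fit when the whole document just streams from one handle to another.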

Otherwise, I'd do

    open my $FILE, $file or die...;
    while (<$FILE>) {
        s/([^[:ascii:]])/$characters{$1}/g;
        print;
    }
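
With warnings enabled, that substitution complains about, and silently drops, any non-ASCII character that has no entry in %characters. One way to hedge against that is to fall back to a numeric reference. This is only a sketch: the one-entry table and the sub name replace_non_ascii are stand-ins for illustration, not the real hash.

```perl
use strict;
use warnings;

# Stand-in table for illustration; the real %characters is much larger.
my %characters = ("\x{e9}" => '&eacute;');

# Hypothetical helper: named entity where the table has one,
# decimal numeric reference otherwise.
sub replace_non_ascii {
    my ($text) = @_;
    $text =~ s{([^[:ascii:]])}
              { exists $characters{$1} ? $characters{$1} : sprintf('&#%d;', ord $1) }ge;
    return $text;
}

print replace_non_ascii("caf\x{e9} na\x{ef}ve"), "\n";    # prints caf&eacute; na&#239;ve
```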
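
If your %characters doesn't include all the non-ASCII characters in the
file, you could use

    # quotemeta keeps keys like ']' or '^' from breaking the character class
    my $to_encode = '[' . (join '', map { quotemeta($_) } keys %characters) . ']';
    while (<$FILE>) {
        s/($to_encode)/$characters{$1}/g;
        print;
    }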
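
A character class only works while every key is a single character. If some of the "direct" forms turned out to be longer (say a base letter plus a combining accent), an alternation would be one way round it. A rough sketch, with a made-up three-entry table and sub name; the longest keys go first so they win over their own prefixes.

```perl
use strict;
use warnings;

# Made-up table for illustration; note the two-character last key.
my %characters = (
    "\x{e9}"   => '&eacute;',
    "\x{e0}"   => '&agrave;',
    "e\x{301}" => '&eacute;',    # 'e' followed by a combining acute accent
);

# Hypothetical helper: replace every key found in the text with its value.
sub to_entities {
    my ($text) = @_;
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) }
        keys %characters;
    $text =~ s/($alternation)/$characters{$1}/g;
    return $text;
}

print to_entities("d\x{e9}j\x{e0} vu"), "\n";    # prints d&eacute;j&agrave; vu
```

Anything not in the table passes through untouched, same as with the character class.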
Ben
--
Like all men in Babylon I have been a proconsul; like all, a slave ... During
one lunar year, I have been declared invisible; I shrieked and was not heard,
I stole my bread and was not decapitated.
        Jorge Luis Borges, 'The Babylon Lottery'