search/replace

 
 
molsted
      03-30-2009
Hi all
I'm trying to replace some strings in a text file like this:
&00Antiques^M&00Antiquit<0x00E4>ten^M&00Antiquit<0x00E9>s^M&00Antig<0x00FC>edades^M&00Antikviteter^M

I've tried the following:
s/&00(.+?)\r\n&00(.+?)\r\n&00(.+?)\r\n&00(.+?)\r\n&00(.+?)\r\n/
<Style:GB>$1\n<Style:DE>$2\n<Style:FR>$3\n<Style:ES>$4\n<Style:DK>$5/
g;

with no luck.

--
Rene
 
Tad J McClellan
      03-30-2009
molsted wrote:

> I'm trying to replace some strings in a text file like this:
> &00Antiques^M&00Antiquit<0x00E4>ten^M&00Antiquit<0x00E9>s^M&00Antig<0x00FC>edades^M&00Antikviteter^M




Does your data really have caret-M in it or does it instead have
carriage return-linefeed in it?

You should write the data in Real Perl Code so that there is no ambiguity.

Have you seen the Posting Guidelines that are posted here frequently?


--------------------------
#!/usr/bin/perl
use warnings;
use strict;

my @lang = qw/ <Style:GB> <Style:DE> <Style:FR> <Style:ES> <Style:DK> /;

$_ = "&00Antiques\r\n&00Antiquit<0x00E4>ten\r\n&00Antiquit<0x00E9>s\r\n"
   . "&00Antig<0x00FC>edades\r\n&00Antikviteter\r\n";
print;
print "\n";

my $capture_num = 0;
s/&00([^\r]+)\r\n/$lang[$capture_num++]$1\n/g;
print;
--------------------------


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
 
Alex
      03-31-2009
molsted wrote:
> Hi all
> I'm trying to replace some strings in a text file like this:
> &00Antiques^M&00Antiquit<0x00E4>ten^M&00Antiquit<0x00E9>s^M&00Antig<0x00FC>edades^M&00Antikviteter^M
>
> I've tried the following:
> s/&00(.+?)\r\n&00(.+?)\r\n&00(.+?)\r\n&00(.+?)\r\n&00(.+?)\r\n/
> <Style:GB>$1\n<Style:DE>$2\n<Style:FR>$3\n<Style:ES>$4\n<Style:DK>$5/
> g;


What does the '^M' mean to you? My editor shows carriage returns as
'^M', but I see you're searching for a carriage return followed by a
newline. Since not all line breaks are the same (it depends on your
system), you want to match "at the end of the line". You want to use $
to match "just before the end of a line" and .*? to chomp off the
line-breaking characters. This requires using the flags s and m. The
x flag permits whitespace in your expression, which improves readability.

Here's a version that works on my system:

s/


^ & 00 ([^\r\n]+?)$ .*?



^ & 00 ([^\r\n]+?)$ .*?



^ & 00 ([^\r\n]+?)$ .*?



^ & 00 ([^\r\n]+?)$ .*?



^ & 00 ([^\r\n]+?)$ .*?



/<Style:GB>$1\n<Style:DE>$2\n<Style:FR>$3\n<Style:ES>$4\n<Style:DK>$5/msx;


Note: The white space in "^ & 00 ([^\r\n]+?)$ .*?" is ignored, so it
really means "^&00([^\r\n]+?)$.*?", which means "At the start of a line,
match an ampersand, followed by two zeros, followed by any number of
characters which are not carriage returns or line feeds, just before the
end of the line".

HTH!

--
Alex
domain: iki dot fi
localpart: alext
email: localpart at domain
 
Alex
      03-31-2009
Alex meant to write:
> s/
>
> ^ & 00 ([^\r\n]+?)$ .*?
> ^ & 00 ([^\r\n]+?)$ .*?
> ^ & 00 ([^\r\n]+?)$ .*?
> ^ & 00 ([^\r\n]+?)$ .*?
> ^ & 00 ([^\r\n]+?)$ .*?
> /<Style:GB>$1\n<Style:DE>$2\n<Style:FR>$3\n<Style:ES>$4\n<Style:DK>$5/msx;


And sorry for all the extra lines, which my ng-reader added for me.


--
Alex
domain: iki dot fi
localpart: alext
email: localpart at domain
 
molsted
      04-01-2009
On 30 Mar., 14:41, Tad J McClellan wrote:
> Does your data really have caret-M in it or does it instead have
> carriage return-linefeed in it?
>
> You should write the data in Real Perl Code so that there is no ambiguity.
>
> Have you seen the Posting Guidelines that are posted here frequently?
>
> --------------------------
> #!/usr/bin/perl
> use warnings;
> use strict;
>
> my @lang = qw/ <Style:GB> <Style:DE> <Style:FR> <Style:ES> <Style:DK> /;
>
> $_ = "&00Antiques\r\n&00Antiquit<0x00E4>ten\r\n&00Antiquit<0x00E9>s\r\n"
>    . "&00Antig<0x00FC>edades\r\n&00Antikviteter\r\n";
> print;
> print "\n";
>
> my $capture_num = 0;
> s/&00([^\r]+)\r\n/$lang[$capture_num++]$1\n/g;
> print;
> --------------------------


Hi Tad,
I haven't seen the Posting Guidelines; this is my first post to the group.
Can I read them somewhere?
I'm going with your suggestion, but it only matches the first line.
However, if I put more sequences in the @lang array it will work.
How would I overcome that?

----------------------------

#!/usr/bin/perl

use strict;

my $fileName=$ARGV[0];

open(FILE,"$fileName") || die("Cannot Open File");

my(@fcont) = <FILE>;

close FILE;

open(FOUT,">$fileName.txt") || die("Cannot Open File");

foreach my $line (@fcont) {

    $line =~ s/\r/\r\n/g;

    #### METHOD #1 BEGIN ####

    my @lang = qw/ <Style:GB> <Style:DE> <Style:FR> <Style:ES> <Style:DK> /;
    my $capture_num = 0;
    $line =~ s/&00([^\r]+)\r\n/$lang[$capture_num++]$1\n/g;

    #### METHOD #1 END ####

    print FOUT $line;
}
close FOUT;

exit 0

----------------------------


--
Rene
 
Tad J McClellan
      04-01-2009
molsted wrote:

> I haven't seen the Posting Guidelines; this is my first post to the group.
> Can I read them somewhere?



http://tinyurl.com/dg27de


> I'm going with your suggestion but it only matches the first line.



To analyse the behavior of a pattern match, we need two things:

1) the pattern that is to be matched
2) the string that the pattern is to be matched against

Since we only have access to one of them, we cannot analyse why it
fails to match.


> #!/usr/bin/perl
>
> use strict;



use warnings;


> my $fileName=$ARGV[0];
>
> open(FILE,"$fileName") || die("Cannot Open File");



You should not quote lone variables:

perldoc -q vars

What's wrong with always quoting "$vars"?

You should use the 3-argument form of open() and a lexical filehandle.

You should include the name of the file in the diag message.

You should put delimiters around the filename in your diag message.

You should include the $! variable in the diag messages.


open my $FILE, '<', $file_name or die "could not open '$file_name' $!";


> my(@fcont) = <FILE>;



my @fcont = <$FILE>;

(but see below)


> foreach my $line (@fcont) {



You should not read the entire file into memory if you only need
one line of the file at a time.

while ( my $line = <$FILE> ) {
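
Put together, the advice above might look like the following minimal sketch. The sample file and the count_lines helper are hypothetical, made up only so that the example runs on its own; they are not from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small sample file so the sketch is self-contained (hypothetical data).
my ( $tmp_fh, $file_name ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "&00Antiques\n&00Antikviteter\n";
close $tmp_fh or die "could not close '$file_name' $!";

# Count the lines of a file while holding only one line in memory at a time.
sub count_lines {
    my ($name) = @_;

    # 3-argument open, lexical filehandle, file name and $! in the message
    open my $FILE, '<', $name or die "could not open '$name' $!";

    my $count = 0;
    while ( my $line = <$FILE> ) {
        $count++;
    }

    close $FILE or die "could not close '$name' $!";
    return $count;
}

print count_lines($file_name), "\n";
```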


> $line =~ s/\r/\r\n/g;



Why are you doing this?

Is the file a MAC-OS (not OS X) text file?

It is too late to fix line endings after you have used <> to read "lines".

You need to fix them *before* applying the <> operator.

Perhaps by setting the $/ variable to an appropriate value.
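
As a rough illustration of the $/ suggestion, here is a sketch that reads records ending in a bare carriage return. It assumes the data really uses "\r" alone as its line ending; the sample string and the read_cr_lines helper are invented for the example, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Sample data with bare carriage returns as line endings (assumed).
my $data = "&00Antiques\r&00Antikviteter\r";

# Split text into lines by telling <> that a record ends at each "\r".
sub read_cr_lines {
    my ($text) = @_;

    # An in-memory filehandle stands in for the real input file.
    open my $FILE, '<', \$text or die "could not open in-memory file $!";

    local $/ = "\r";    # input record separator, restored when the sub returns
    my @lines;
    while ( my $line = <$FILE> ) {
        chomp $line;    # chomp removes the current value of $/
        push @lines, $line;
    }

    close $FILE;
    return @lines;
}

my @lines = read_cr_lines($data);
print scalar(@lines), " lines\n";
```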


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
 
molsted
      04-02-2009
On 1 Apr., 14:32, Tad J McClellan wrote:
> To analyse the behavior of a pattern match, we need two things:
>
> 1) the pattern that is to be matched


Sample pattern:
&00Antiques^M
&00Antiquit<0x00E4>ten^M
&00Antiquit<0x00E9>s^M
&00Antig<0x00FC>edades^M
&00Antikviteter^M

Sample output:
<Style:GB>Antiques
<Style:DE>Antiquit<0x00E4>ten
<Style:FR>Antiquit<0x00E9>s
<Style:ES>Antig<0x00FC>edades
<Style:DK>Antikviteter

All on separate lines. The file is generated on a Windows PC (\r\n);
my file needs to end up as a UNIX file on Mac OS X.

The first file had accidentally been opened on a Mac, hence the \r end
of line.

I hope this clears things up a bit.

The file is converted from Windows-1252 to MacRoman before being run
through the script (/usr/bin/iconv -f WINDOWS-1252 -t MACROMAN). However, I
am considering using 'Text::Iconv' instead.

--
Rene
 
Tad J McClellan
      04-02-2009
molsted wrote:
> On 1 Apr., 14:32, Tad J McClellan wrote:
>> To analyse the behavior of a pattern match, we need two things:
>>
>> 1) the pattern that is to be matched

>
> Sample pattern:
> &00Antiques^M
> &00Antiquit<0x00E4>ten^M
> &00Antiquit<0x00E9>s^M
> &00Antig<0x00FC>edades^M
> &00Antikviteter^M



That is NOT the pattern to be matched!

The pattern to be matched is:

&00([^\r]+)\r\n

Those are (meant to be) the strings that the pattern is to be matched against.

The reason that none of those strings match the pattern is because
none of those strings contain a carriage return, and the pattern requires
a carriage return.

A hex dump, such as from xxd, shows that there are no carriage returns
in that data. Each line ends with a caret (ASCII 0x5e), an upper
case "M" (ASCII 0x4d) and a linefeed (ASCII 0x0a):

0000000: 2630 3041 6e74 6971 7565 735e 4d0a 2630  &00Antiques^M.&0
                                    ^^ ^^^^
0000010: 3041 6e74 6971 7569 743c 3078 3030 4534  0Antiquit<0x00E4
0000020: 3e74 656e 5e4d 0a26 3030 416e 7469 7175  >ten^M.&00Antiqu
                   ^^^^ ^^
0000030: 6974 3c30 7830 3045 393e 735e 4d0a 2630  it<0x00E9>s^M.&0
                                    ^^ ^^^^
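
If xxd is not at hand, the same check can be sketched in Perl itself. The hex_bytes helper below is hypothetical, written only for this illustration; it shows how the posted text (caret, "M", linefeed) differs byte for byte from a real carriage return-linefeed pair.

```perl
use strict;
use warnings;

# Return the characters of a string as space-separated two-digit hex values.
sub hex_bytes {
    my ($string) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $string;
}

my $as_posted = "&00Antiques^M\n";    # caret, "M", linefeed
my $intended  = "&00Antiques\r\n";    # carriage return, linefeed

print hex_bytes($as_posted), "\n";    # ends in 5e 4d 0a
print hex_bytes($intended),  "\n";    # ends in 0d 0a
```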

If you cannot figure out how to post data with the line endings that
are actually in your data, then write the data in Real Perl Code.

(that sounds familiar...)

instead of

while ( <FILE> ) {

put the data into an array and loop over the array:

my @lines = ( "&00Antiques\r\n", "&00Antiquit<0x00E4>ten\r\n", ...
foreach ( @lines ) {


> The file is generated on a Windows PC (\r\n),
> my file needs to end up as a UNIX-file on Mac OS X



Then all you need to do is delete all of the carriage returns before
matching:

tr/\r//d;

and change the pattern to not require carriage returns.
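
A minimal sketch of that approach follows, with some assumptions flagged: the restyle helper is made up for the example, the tag names <Style:DE> and <Style:DK> are assumed spellings of the German and Danish tags, and the modulo on the index is an addition that assumes the records repeat in groups of five, so that the tag list restarts instead of running off the end of @lang.

```perl
use strict;
use warnings;

# Tag names are assumptions; substitute the real style tags.
my @lang = qw/ <Style:GB> <Style:DE> <Style:FR> <Style:ES> <Style:DK> /;

# Delete carriage returns, then tag each "&00..." line in turn.
sub restyle {
    my ($text) = @_;

    $text =~ tr/\r//d;    # delete all carriage returns

    my $capture_num = 0;

    # The pattern no longer requires a carriage return. The modulo is an
    # assumption: it cycles through @lang for files with more than five records.
    $text =~ s{^&00([^\n]+)\n}{ $lang[ $capture_num++ % @lang ] . "$1\n" }mge;

    return $text;
}

print restyle("&00Antiques\r\n&00Antiquit<0x00E4>ten\r\n");
```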


> The first file had accidentally been opened on a Mac, hence the \r end
> of line.



That explains it then.

On Linux/OS X the input operator, <>, reads until it finds a newline.

Since there were no newlines, a single read gets the entire file in one go.
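
The effect can be reproduced with a short sketch. The count_records helper and the sample strings are invented for the illustration, and an in-memory filehandle stands in for the file; $/ is left at its default of "\n".

```perl
use strict;
use warnings;

# Count how many records <> returns for a string, using the default $/ ("\n").
sub count_records {
    my ($text) = @_;
    open my $FILE, '<', \$text or die "could not open in-memory file $!";
    my @records = <$FILE>;
    close $FILE;
    return scalar @records;
}

print count_records("one\rtwo\rthree\r"), "\n";    # carriage returns only
print count_records("one\ntwo\nthree\n"), "\n";    # newlines
```

With carriage returns only, everything comes back as a single record; with newlines, each line is its own record.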


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
 