File handling and regex

 
 
Luca Villa
 
      11-05-2007
Hi all!

I need help with Perl on the Windows command line to solve the
following task:

I have many unsorted txt files and subdirectories under the root
directory "c:\dir", like this:
c:\dir\foobar.txt
c:\dir\popo.txt
c:\dir\sub1\agsds.txt
c:\dir\sub1\popo.txt
c:\dir\sub2\hghghg.txt
c:\dir\sub2\subbb\abc.txt

These txt files are of three types:
type1: those that contain a string matching the regular expression
"abc[0-9]+def"
type2: those that contain a string matching the regular expression
"lmn[0-9]+opq"
type3: those that contain a string matching the regular expression
"rst[0-9]+uvw"

I would like a Perl command-line script for Windows that copies all
these txt files into a single directory "c:\output", with each filename
composed of the number found in the regex match (the "[0-9]+" part of
the regex) and a "-type1.txt", "-type2.txt" or "-type3.txt" suffix,
depending on which of the three regexes above is found in the file,
giving a result that looks like this:
c:\output\15-type2.txt
c:\output\102-type1.txt
c:\output\33-type1.txt
c:\output\49-type3.txt
c:\output\4-type1.txt
c:\output\335-type2.txt
c:\output\32-type3.txt

How can I do it?

 
John W. Krahn
 
      11-05-2007
Luca Villa wrote:
> How can I do it?


*UNTESTED* YMMV

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
use File::Copy;

my $from = 'c:/dir';
my $to   = 'c:/output';

# pattern => type suffix; the capture group grabs the number
my %trans = qw(
    abc(\d+)def type1
    lmn(\d+)opq type2
    rst(\d+)uvw type3
);

find sub {
    return unless open my $fh, '<', $_;
    return unless -f $fh;    # skip anything that is not a plain file
    read $fh, my $data, -s _;
    close $fh;
    for my $pat ( keys %trans ) {
        next unless $data =~ $pat;
        copy( $File::Find::name, "$to/$1-$trans{$pat}.txt" )
            or warn "Could not copy $File::Find::name: $!\n";
        last;
    }
}, $from;

__END__



John
--
use Perl;
program
fulfillment
 
jordilin
 
      11-06-2007
On Nov 5, 5:52 pm, "John W. Krahn" wrote:
> read $fh, my $data, -s _;

One question: when you write
read $fh, my $data, -s _;
shouldn't that be
read $fh, my $data, -s $_;

I have searched the web without success. I don't know whether _
is the same as $_ in this particular case.
Best regards,
jordi

 
 
Josef Moellers
 
      11-06-2007
jordilin wrote:
> One doubt,
> when you write
> read $fh, my $data, -s _;
> should not be
> read $fh, my $data, -s $_;
>
> I have searched along the web without success. I don't know if _
> equals $_ in this particular case


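No, it doesn't, neither literally nor conceptually.
"_" is the special filehandle that stands for the stat structure left
behind by the most recent file test or stat operation. In John's code
that is the "-f $fh" test on the line before, so "-s _" returns the size
of the file that was just opened: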

"If any of the file tests (or either the "stat" or "lstat" operators)
are given the special filehandle consisting of a solitary underline,
then the stat structure of the previous file test (or stat operator) is
used, saving a system call."
(perldoc -f -s)
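
To see the underscore filehandle at work outside File::Find, here is a minimal sketch. The helper name plain_file_size and the temporary demo file are assumptions made up for illustration; they are not part of John's script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the size of a plain file, or undef for anything else.
# "-f" performs the stat; "-s _" reuses that cached result.
sub plain_file_size {
    my ($path) = @_;
    return undef unless -f $path;   # one stat call, result cached in "_"
    return -s _;                    # no second stat: reads the cached data
}

# Hypothetical demo data: a temporary file holding five bytes.
my ($fh, $tmp) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "hello";
close $fh or die "close failed: $!";

print "size: ", plain_file_size($tmp), "\n";   # prints "size: 5"
```

Writing "-s $_" instead would also work inside the find() callback, but it would stat the file a second time by name rather than reuse the result of the "-f" test.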


--
These are my personal views and not those of Fujitsu Siemens Computers!
Josef Möllers (Pinguinpfleger bei FSC)
If failure had no penalty success would not be a prize (T. Pratchett)
Company Details: http://www.fujitsu-siemens.com/imprint.html

 
 
Luca Villa
 
      11-09-2007
Thanks to all, and to John in particular.

John's solution may well have worked, but I had difficulty adapting it
to my needs, so I ended up using this alternative solution:


use File::Find;

find(\&found, 'c:/dir');


sub found {
    unless(open(IN,"<$File::Find::name")) {
        warn "Could not open $File::Find::name: $! (SKIPPING)\n";
        return;
    }
    local $/;
    my $data=<IN>;
    close(IN);

    my($type, $number);
    if($data =~ /abc([0-9]+)def/) {
        $number=$1;
        $type=1;
    }
    elsif($data =~ /lmn([0-9]+)opq/) {
        $number=$1;
        $type=2;
    }
    elsif($data =~ /rst([0-9]+)uvw/) {
        $number=$1;
        $type=3;
    }
    else {
        warn "File $File::Find::name is unknown type\n";
        return;
    }

    my $outfn="c:/output/$number-type$type.txt";
    if(-e $outfn) {
        warn "File $outfn already exists.\n";
        return;
    }
    unless(open(OUT,">$outfn")) {
        warn "Could not open $outfn: $!\n";
        return;
    }
    print OUT $data;
    close(OUT);
}

 
 
Tad McClellan
 
      11-10-2007
Luca Villa wrote:


> unless(open(IN,"<$File::Find::name")) {
> warn "Could not open $File::Find::name: $! (SKIPPING)\n";
> return;
> }
> local $/;
> my $data=<IN>;
> close(IN);



If you are going to mess with the special variables anyway,
then you could replace all of that with:

local @ARGV = $_;
local $/;
my $data = <>;


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
 
 
Luca Villa
 
      11-10-2007
> If you are going to mess with the special variables anyway,
> then you could replace all of that with:
>
> local @ARGV = $_;
> local $/;
> my $data = <>;


I received this error:
"Can't do inplace edit: . is not a regular file at c:\script.src line
12."

inplace edit? What does it want to do?

 
 
Tad McClellan
 
      11-10-2007
Luca Villa wrote:
>> If you are going to mess with the special variables anyway,
>> then you could replace all of that with:
>>
>> local @ARGV = $_;
>> local $/;
>> my $data = <>;

>
> I received this error:
> "Can't do inplace edit: . is not a regular file at c:\script.src line
> 12."



The error message has nothing to do with the code you quoted above.


> inplace edit? What does it want to do?



It wants to edit the file "inplace", that is, with the same name.

You have turned on inplace editing either with the -i command line
switch, or by setting the $^I variable somewhere...

Also, what it is trying to edit is not a file, it is a directory. You
may want to test what find() is operating on with the -d or -f filetest.
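
A minimal sketch of that -f check, combined with the @ARGV slurp from earlier in the thread. The temporary directory, the sample.txt file and the @slurped array are hypothetical stand-ins for the real tree; this has not been run against the poster's data.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical demo tree: one directory holding a single text file.
my $root = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($root, 'sample.txt');
open my $out, '>', $file or die "Could not create $file: $!";
print {$out} "abc123def\n";
close $out or die "close failed: $!";

my @slurped;   # names of the entries whose contents were read

find(sub {
    # find() also calls this sub for "." and every subdirectory; skip them.
    return unless -f $_;

    local @ARGV = ($_);   # <> reads the files named in @ARGV
    local $/;             # undefined separator: read the whole file at once
    my $data = <>;
    push @slurped, $_ if defined $data;
}, $root);

print "read: @slurped\n";   # prints "read: sample.txt"
```

With the guard in place the callback never hands a directory to <>, so the "." entry that find() reports first is simply skipped.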


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
 
 
Luca Villa
 
      11-10-2007
Hi Tad,

I'm not passing any arguments apart from "script.src", which contains
the script.

I started getting the error when I switched to the replacement block
you suggested.

This is the exact content of script.src, which gives the error
mentioned above:

use File::Find;

find(\&found, 'c:/tempebay/1');

sub found {
    local @ARGV = $_;
    local $/;
    my $data = <>;


    my($type, $number);
    if($data =~ /<td align="right" nowrap>\s+Item number:\s+(\d+)<\/td>/) {
        $number=$1;
        $type="item_description_html";
    }
    elsif($data =~ /Item number:\s*<img src="http:\/\/pics\.ebaystatic\.com\/aw\/pics\/s\.gif" width="\d+">(\d+)<\/div>/) {
        $number=$1;
        $type="buyers_history_html";
    }
    else {
        warn "File $File::Find::name is of not interesting type, for example an eBay page of item\n";
        return;
    }

    my $outfn="c:/tempebay/2/$number-$type.htm";
    if(-e $outfn) {
        warn "File $outfn already exists.\n";
        return;
    }
    unless(open(OUT,">$outfn")) {
        warn "Could not open $outfn: $!\n";
        return;
    }
    print OUT $data;
    close(OUT);
}


___

I launch it with: perl script.src
and despite that initial error message it actually works!

Can you see why it wants to do that inplace edit?

 