[NEWBIE] Multiline match on two files ignoring newlines, tabs & blank chars

 
 
Ga
12-15-2003
Hi all,

I have two files (some thousands of file pairs, in fact...), file1 and
file2. File2 looks similar to file1, but:
- it contains more data than file1 (and such info is what I need to get)
- it is formatted differently

Example:

File1:

Divitias alius fulvo sibi congerat auro Et teneat culti iugera multa soli,
Me mea paupertas vita traducat inerti, Dum meus adsiduo luceat igne focus.

File2:

Divitias alius fulvo sibi congerat auro
Et teneat culti iugera multa soli,
Quem labor adsiduus vicino terreat hoste,
Martia cui somnos classica pulsa fugent.
Me mea paupertas vita traducat inerti,
Dum meus adsiduo luceat igne focus.

In the example, file2 contains the same *data* as file1 (rows 1, 2, 5, 6), but
formatted differently. What I would like is a script that extracts rows 3
and 4 of file2 without losing their format (tabs, spaces and newlines).

I played a little with regexps and nested loops over both files, but it's
really becoming too complicated, and I think there should be some "easy
way" I'm missing.

Any hint, anybody?

Thanks a lot.

G.
 
Anno Siegel
12-15-2003
Ga wrote in comp.lang.perl.misc:
> I played a little with regexps and nested loops over both files, but it's
> really becoming too complicated, and I think there should be some "easy
> way" I'm missing.


I don't think there is. Identifying text differences looks simple,
intuitively, but it isn't.

Let's forget about format differences for the moment and assume you have
two identically formatted strings. Assume also that the only differences
are insertions: nothing is deleted from or changed in the first text. Even
then, the position of an insertion isn't uniquely determined.

Suppose one string is "the right way", and the other is "the right to go
the right way". Has "right to go the" been inserted after "the", or has
"to go the right" been inserted after "right"? Somehow your algorithm
will have to decide. And that is only a single insertion; with multiple
ones the problem becomes more formidable.
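
To make the ambiguity concrete, here is a small sketch in Python (chosen only for illustration; the helper name single_insertions is made up). It assumes the longer text differs from the shorter one by exactly one contiguous run of inserted words and lists every position where that run could sit.

```python
def single_insertions(base, target):
    """List every (position, inserted_words) pair such that inserting
    the words into base at that position yields target.

    Assumes target is base plus exactly one contiguous insertion.
    """
    extra = len(target) - len(base)
    readings = []
    for pos in range(len(base) + 1):
        # The words before the insertion point and the words after the
        # inserted run must both agree with base.
        if target[:pos] == base[:pos] and target[pos + extra:] == base[pos:]:
            readings.append((pos, target[pos:pos + extra]))
    return readings


base = "the right way".split()
target = "the right to go the right way".split()

for pos, words in single_insertions(base, target):
    print(f"after word {pos}: {' '.join(words)}")
```

For this pair the loop finds three equally valid readings: the two described above, plus "the right to go" inserted at the very beginning. Nothing in the data prefers one over another.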

The diff program (Unix) tackles these problems on a line-by-line basis.
A possible approach would be to split your files into one-word lines,
run them through diff and interpret the output. There are also modules
on CPAN that implement the diff algorithm without an external program.
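
A minimal sketch of the word-by-word comparison, written in Python with the standard difflib module standing in for diff. Note that difflib uses its own matching heuristic rather than diff's algorithm, and the function name inserted_words is an assumption made for this example. In Perl, a diff module from CPAN would play the same role.

```python
import difflib


def inserted_words(short_text, long_text):
    """Return the runs of words that occur in long_text but not in
    short_text, comparing word by word and ignoring all whitespace."""
    a = short_text.split()   # split() discards spaces, tabs and newlines
    b = long_text.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    runs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "insert": words only in b; "replace": words in b that differ from a.
        if tag in ("insert", "replace"):
            runs.append(b[j1:j2])
    return runs


file1 = (
    "Divitias alius fulvo sibi congerat auro Et teneat culti iugera multa soli,\n"
    "Me mea paupertas vita traducat inerti, Dum meus adsiduo luceat igne focus.\n"
)
file2 = (
    "Divitias alius fulvo sibi congerat auro\n"
    "Et teneat culti iugera multa soli,\n"
    "Quem labor adsiduus vicino terreat hoste,\n"
    "Martia cui somnos classica pulsa fugent.\n"
    "Me mea paupertas vita traducat inerti,\n"
    "Dum meus adsiduo luceat igne focus.\n"
)

for run in inserted_words(file1, file2):
    print(" ".join(run))
```

On the example from the original post this reports a single run, the twelve words of rows 3 and 4. The line breaks are lost at this stage, because the comparison only ever sees words.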

Ambiguities like the one above will be resolved in one way or another.
If it matters, a manual check can't be avoided. You will also have the
problem of putting the insertions back into their original format, but
that should be solvable.
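
One way to put the insertions back into their original format is to remember where each word sits in the longer file and cut the raw text between the first and last inserted word. The sketch below (Python, with difflib standing in for diff; the name inserted_chunks is hypothetical) assumes that whitespace alone separates words.

```python
import difflib
import re


def inserted_chunks(short_text, long_text):
    """Return the inserted stretches of long_text exactly as written,
    including the tabs, spaces and newlines between the inserted words."""
    a = short_text.split()
    # Keep a match object per word so its character offsets stay known.
    tokens = list(re.finditer(r"\S+", long_text))
    b = [m.group() for m in tokens]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    chunks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            start = tokens[j1].start()     # first inserted word
            end = tokens[j2 - 1].end()     # last inserted word
            chunks.append(long_text[start:end])
    return chunks


short = "a b c d"
long_ = "a b\nX\tY\n  Z\nc d\n"
for chunk in inserted_chunks(short, long_):
    print(repr(chunk))
```

Each chunk keeps the whitespace between its words but not the whitespace before the first or after the last one. If leading indentation matters, the slice could be widened back to the end of the preceding word.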

Anno
 