Velocity Reviews - Computer Hardware Reviews

Regexp issue . . .

 
 
MichaelC
 
      11-25-2003
Hi all. I am having a particularly difficult time with a perl script that I
am writing. The problem area is a place where I need to strip some newlines
out of a file.

My source data is text which is in paragraph form, but has line breaks
within the paragraphs. I need to do as much processing as possible in order
to minimise the amount of manual changes that I have to make.

Sample text is as follows:

"This document is intended to give you an
overview of DG as well as highlight some of
the features. This is a brought to your handheld using DG."
With DG you can view and edit word processing and spreadsheet files on
your handheld. Simple push-button synchronization of
the handheld with the desktop will maintain the most up-to-date
version of a file on both the desktop and handheld.

I want these to be parsed as follows:

"This document is intended to give you an overview of DG as well as
highlight some of the features. This is a brought to your handheld using
DG." With DG you can view and edit word processing and spreadsheet files on
your handheld. Simple push-button synchronization of the handheld with the
desktop will maintain the most up-to-date version of a file on both the
desktop and handheld.

--

One way that I thought might work is to catch all lines that begin upper
case, prepend them with a line break, strip the trailing break, then trap
all lines that start lower case and dump them as-is. Repeat this until no
matches are made on the lower case test, then clean up all those extra line
breaks.

I came up with this . . . but all it seems to do is strip all newlines out.

while( <infl> ) {

my $x = $_;
if ( $x =~ ?^[^a-z]? ) { $x =~ s!(.*)\n!\n\1 ! }
else { $x =~ s!(.*)\n!\1 ! }
print outfl $x;
}

Any help would be greatly appreciated.

Michael
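
A likely reason the loop above strips every newline, assuming it is run as posted: in Perl, ?PATTERN? is the match-once operator, not an ordinary match. It succeeds only the first time it matches (until reset() is called), so after the first upper-case line every later line falls through to the else branch. On Perl 5.22 and later the bare ?PATTERN? form is a syntax error and must be written m?PATTERN?. The sketch below compares the two operators; the sub name count_matches and the sample lines are only illustrative.

```perl
use strict;
use warnings;

# Count how many lines match with the match-once operator m?...?
# versus an ordinary m/.../ match using the same pattern.
sub count_matches {
    my @lines = @_;
    reset;    # re-arm m?...? searches in this package so the sub is repeatable
    my ($once, $every) = (0, 0);
    for my $line (@lines) {
        $once++  if $line =~ m?^[^a-z]?;   # succeeds only on the first matching line
        $every++ if $line =~ m/^[^a-z]/;   # succeeds on every matching line
    }
    return ($once, $every);
}

my @sample = ("This document\n", "overview of DG\n", "With DG you\n", "Simple push\n");
my ($once, $every) = count_matches(@sample);
print "match-once: $once, ordinary: $every\n";
```

Three of the four sample lines start with an upper-case letter, but the match-once form counts only the first. Using any other delimiter for the test (for example m!^[^a-z]!) avoids the problem.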



 
 
 
 
 
Eric J. Roode
 
      11-25-2003
"MichaelC" <(E-Mail Removed)> wrote in
news:d9Dwb.492453$9l5.241927@pd7tw2no:

> Hi all. I am having a particularly difficult time with a perl script
> that I am writing. The problem area is a place where I need to strip
> some newlines out of a file.
>
> My source data is text which is in paragraph form, but has line breaks
> within the paragraphs. I need to do as much processing as possible in
> order to minimise the amount of manual changes that I have to make.


You don't say what you mean by "paragraph form". If you're using that
term in the usual sense, then you mean that the paragraphs have double
newlines between them. Is that so? If so, Perl can read paragraph-at-a-
time for you:

$/ = '';
$paragraph = <>;
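
For input that does separate paragraphs with blank lines, paragraph mode could be combined with a join of the wrapped lines. This is a minimal sketch under that assumption; the sub name read_paragraphs and the in-memory filehandle are illustrative and not part of the original post.

```perl
use strict;
use warnings;

# Read blank-line-separated paragraphs from a string and return each
# paragraph as a single line with the internal line breaks joined.
sub read_paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    local $/ = '';            # paragraph mode: a record ends at one or more blank lines
    my @paragraphs = <$fh>;
    close $fh;
    for (@paragraphs) {
        s/\n+\z//;            # drop the newlines that terminate the record
        s/\n/ /g;             # join the wrapped lines inside the paragraph
    }
    return @paragraphs;
}

print "$_\n" for read_paragraphs("first line\nsecond line\n\nnext paragraph\ncontinues\n");
```

With local, the change to $/ is confined to the sub, so other reads in the program keep the default line-at-a-time behaviour.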

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
 
 
 
 
MichaelC
 
      11-26-2003
"Eric J. Roode" <(E-Mail Removed)> wrote in message
news:Xns943E4EE1E1E8Dsdn.comcast@216.196.97.136...
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> "MichaelC" <(E-Mail Removed)> wrote in
> news:d9Dwb.492453$9l5.241927@pd7tw2no:
>
> > Hi all. I am having a particularly difficult time with a perl script
> > that I am writing. The problem area is a place where I need to strip
> > some newlines out of a file.
> >
> > My source data is text which is in paragraph form, but has line breaks
> > within the paragraphs. I need to do as much processing as possible in
> > order to minimise the amount of manual changes that I have to make.

>
> You don't say what you mean by "paragraph form". If you're using that
> term in the usual sense, then you mean that the paragraphs have double
> newlines between them. Is that so? If so, Perl can read paragraph-at-a-
> time for you:
>
> $/ = '';
> $paragraph = <>;
>


Sorry, I thought that I had defined my problem in
enough detail. My problem is that the text that I am
processing does NOT have double line breaks
between paragraphs, and the text has been presented
wrapped to 72 character width. I do not have access
to the original, as it was lost. That is the reason for
my current problem.
That said, statistically, in the text that I am processing,
the vast majority of lines that start with the set [A-Z"]
will start a new paragraph. The converse is also true,
in that lines that start [a-z,.!?] are definitely part of a
logical paragraph. In that sense, I am not using the
term "paragraph" in the way that you normally assume.

As a concrete example, the explanation above is a reasonable simulation of
the problem that I am facing. Logically, the manually broken text is two
paragraphs with no extra line breaks between them. I neither require nor
desire double line breaks between paragraphs. What I do need, though, is
each paragraph on a single line with a single line break at the end, and
ONLY there.

For example, I need to strip all but two line breaks out of the example that
I have provided, so that the text is contiguous from "Sorry, I" to "current
problem." and from "That said, " to "normally assume." After some thought,
I found a solution:

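#!/usr/bin/perl
use strict;
use warnings;

# Three-argument open with lexical filehandles; stop if a file cannot be opened.
open(my $infl,  '<', 'in.txt')  or die "Cannot open in.txt: $!";
open(my $outfl, '>', 'out.txt') or die "Cannot open out.txt: $!";

while ( my $x = <$infl> ) {

# A line starting with an upper-case letter or a double quote begins a new paragraph.
if ( $x =~ m!^[A-Z"]! ) { print $outfl "\n"; }
# Replace the trailing newline of a non-empty line with a space ($1, not \1, in the replacement).
$x =~ s!(^.+)\n!$1 !m;

print $outfl $x;
}

close($infl);
close($outfl);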

Thanks,

Michael


 
 
Eric J. Roode
 
      11-26-2003
"MichaelC" <(E-Mail Removed)> wrote in
news:s5Vwb.496786$pl3.155625@pd7tw3no:

> Sorry, I thought that I had defined my problem in
> enough detail.


I would say not.

> My problem is that the text that I am
> processing does NOT have double line breaks
> between paragraphs, and the text has been presented
> wrapped to 72 character width. I do not have access
> to the original, as it was lost. That is the reason for
> my current problem.
> That said, statistically, in the text that I am processing,
> the vast majority of lines that start with the set [A-Z"]
> will start a new paragraph. The converse is also true,
> in that lines that start [a-z,.!?] are definitely part of a
> logical paragraph. In that sense, I am not using the
> term "paragraph" in the way that you normally assume.


It sounds like you want to remove all newlines, except where the newline
is followed by an uppercase character. Is that correct?

If so, I'd suggest reading the entire file into memory, and doing a
simple substitution on it:

$/ = undef;
$content = <FILE>;
$content =~ s/\n(?![[:upper:]])//g;
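
Note that the substitution above deletes each newline outright, so unless the wrapped lines end in a space, the last word of one line would run into the first word of the next. The variation below is only a sketch of how the same idea might be adapted to the heuristic described earlier in the thread: it replaces the newline with a space, also treats a leading double quote as the start of a paragraph, and keeps the final newline of the text. The sub name unwrap is hypothetical.

```perl
use strict;
use warnings;

# Join hard-wrapped lines. A newline is kept only when the next line
# starts with an upper-case letter or a double quote, or when the
# newline is the last character of the text; otherwise it becomes a space.
sub unwrap {
    my ($content) = @_;
    $content =~ s/\n(?![[:upper:]"]|\z)/ /g;
    return $content;
}

my $sample = "Sorry, I thought that I had defined my problem in\n"
           . "enough detail.\n"
           . "That said, statistically,\n"
           . "the vast majority of lines\n";
print unwrap($sample);
```

To apply this to a file, the contents could be slurped first, for example with my $content = do { local $/; <$fh> }; on an already opened handle, and the result of unwrap($content) printed to the output file.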

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 