Velocity Reviews


DANIEL BURCH 02-19-2006 06:59 PM

Pattern Matching on Case
 
I have a file that apparently had html tags stripped out of it, or
something, but no space characters added to replace the tags so it ended up
with a lot of words run together like "ExplosionThis". In almost all cases
there is a lower case letter followed by an upper case letter. I am trying
to figure out a substitution statement that would separate them, but I'm not
sure what would work. Maybe something like

s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;

but I don't have a clue if that is even close to working or if it will give
me an "a" at the end and beginning of the words. Any help would be greatly
appreciated.



A. Sinan Unur 02-19-2006 07:11 PM

Re: Pattern Matching on Case
 
"DANIEL BURCH" <dburchm1@verizon.net> wrote in
news:Ge3Kf.3027$GQ.2625@trnddc03:

> I have a file that apparently had html tags stripped out of it, or
> something, but no space characters added to replace the tags so it
> ended up with a lot of words run together like "ExplosionThis". In
> almost all cases there is a lower case letter followed by an upper
> case letter. I am trying to figure out a substitution statement that
> would separate them, but I'm not sure what would work. Maybe
> something like
>
> s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;


I am curious: What do you think this does?

Here is a quick and dirty attempt based on your vague specification, and
nothing else. You might want to post some real code along with data
after reading the posting guidelines for this group.

#!/usr/bin/perl

use strict;
use warnings;

my $text;
{
    local $/;          # undefine the record separator to slurp all of DATA
    $text = <DATA>;
}

$text =~ s{\.\s+}{}g;

$text =~ s{([[:lower:]])([[:upper:]])}{$1\. $2}g;

print "$text\n";

__DATA__
I have a file that apparently had html tags stripped out of it,
or something, but no space characters added to replace the tags
so it ended up with a lot of words run together like "ExplosionThis."
In almost all cases there is a lower case letter followed by an
upper case letter. I am trying to figure out a substitution
statement that would separate them, but I'm not sure what would
work. Maybe something like

Notice the mess this makes of "ExplosionThis".

Sinan


--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/cl...uidelines.html
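
For readers wondering what the statement in the original question does, here is a minimal sketch (not part of any poster's code; the variable names are illustrative). It assumes the two problems are the bare leading `*`, which is a quantifier with nothing to repeat, and the character classes on the replacement side, which Perl treats as literal text rather than as a pattern.

```perl
use strict;
use warnings;

# The pattern from the question starts with a bare '*'.
# Compiling it at run time lets us trap the error with eval.
my $broken   = '*[a-z][A-Z]*';
my $compiled = eval { qr/$broken/ };
my $error    = $@;
print "pattern rejected: $error" unless defined $compiled;

# The replacement side is an interpolated string, not a regex,
# so a character class written there is inserted literally.
(my $literal = 'ExplosionThis') =~ s/[a-z][A-Z]/[a-z] [A-Z]/;
print "$literal\n";

# Capturing the two letters and reusing them keeps both characters.
(my $fixed = 'ExplosionThis') =~ s/([a-z])([A-Z])/$1 $2/g;
print "$fixed\n";
```

The last substitution is the same capture-and-reuse idea that appears in the replies further down the thread.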


it_says_BALLS_on_your_forehead 02-19-2006 07:15 PM

Re: Pattern Matching on Case
 

DANIEL BURCH wrote:
> I have a file that apparently had html tags stripped out of it, or
> something, but no space characters added to replace the tags so it ended up
> with a lot of words run together like "ExplosionThis". In almost all cases
> there is a lower case letter followed by an upper case letter. I am trying
> to figure out a substitution statement that would separate them, but I'm not
> sure what would work. Maybe something like
>
> s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;
>
> but I don't have a clue if that is even close to working or if it will give
> me an "a" at the end and beginning of the words. Any help would be greatly
> appreciated.


use strict; use warnings;

my $string = 'Hello theRe danielBurch howAreYou?';
$string =~ s/([a-z])([A-Z])/$1\ $2/g; # i escape the space b/c in Perl 6 /x will be default
print $string, "\n";


Ala Qumsieh 02-19-2006 10:18 PM

Re: Pattern Matching on Case
 
it_says_BALLS_on_your_forehead wrote:
> use strict; use warnings;
>
> my $string = 'Hello theRe danielBurch howAreYou?';
> $string =~ s/([a-z])([A-Z])/$1\ $2/g; # i escape the space b/c in Perl 6 /x will be default


Still, you don't need to escape it. /x only affects the regexp part, and
not the replacement part.

--Ala
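
A small sketch of the point about /x, using made-up variable names: whitespace inside the pattern is ignored under /x, while the space in the replacement is ordinary string content and needs no escaping either way.

```perl
use strict;
use warnings;

my $input = 'danielBurch howAreYou';

# With /x, the spaces inside the pattern are ignored,
# but the space in the replacement is kept as written.
(my $with_x = $input) =~ s/ ([a-z]) ([A-Z]) /$1 $2/gx;

# Without /x and without escaping, the result is the same.
(my $plain = $input) =~ s/([a-z])([A-Z])/$1 $2/g;

print "$with_x\n";
print "$plain\n";
```

Both substitutions should give the same string, which suggests the backslash before the space is harmless but unnecessary.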


it_says_BALLS_on_your_forehead 02-19-2006 10:33 PM

Re: Pattern Matching on Case
 

Ala Qumsieh wrote:
> it_says_BALLS_on_your_forehead wrote:
> > use strict; use warnings;
> >
> > my $string = 'Hello theRe danielBurch howAreYou?';
> > $string =~ s/([a-z])([A-Z])/$1\ $2/g; # i escape the space b/c in Perl 6 /x will be default

>
> Still, you don't need to escape it. /x only affects the regexp part, and
> not the replacement part.


ahh, right you are! i always forget that.


John W. Krahn 02-20-2006 12:32 AM

Re: Pattern Matching on Case
 
DANIEL BURCH wrote:
> I have a file that apparently had html tags stripped out of it, or
> something, but no space characters added to replace the tags so it ended up
> with a lot of words run together like "ExplosionThis". In almost all cases
> there is a lower case letter followed by an upper case letter. I am trying
> to figure out a substitution statement that would separate them, but I'm not
> sure what would work. Maybe something like
>
> s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;
>
> but I don't have a clue if that is even close to working or if it will give
> me an "a" at the end and beginning of the words. Any help would be greatly
> appreciated.


$ perl -le'$_ = "ThisIsATest"; print; s/(?<=.)(?=[[:upper:]])/ /g; print'
ThisIsATest
This Is A Test


John
--
use Perl;
program
fulfillment

it_says_BALLS_on_your_forehead 02-20-2006 02:20 AM

Re: Pattern Matching on Case
 

John W. Krahn wrote:
> DANIEL BURCH wrote:
> > I have a file that apparently had html tags stripped out of it, or
> > something, but no space characters added to replace the tags so it ended up
> > with a lot of words run together like "ExplosionThis". In almost all cases
> > there is a lower case letter followed by an upper case letter.


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^
> > I am trying
> > to figure out a substitution statement that would separate them, but I'm not
> > sure what would work. Maybe something like
> >
> > s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;
> >
> > but I don't have a clue if that is even close to working or if it will give
> > me an "a" at the end and beginning of the words. Any help would be greatly
> > appreciated.

>
> $ perl -le'$_ = "ThisIsATest"; print; s/(?<=.)(?=[[:upper:]])/ /g; print'
> ThisIsATest
> This Is A Test


the above is pretty slick, but doesn't address what the OP asked for.
what about cases where the data consists of a word in all caps?
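
To make the all-caps concern concrete, here is a hedged sketch comparing the lookaround from the one-liner with a variant that only splits where a lowercase letter is followed by an uppercase one. The sub names `split_any` and `split_lower_upper` are invented for this illustration.

```perl
use strict;
use warnings;

# Space before every capital that has any character in front of it.
sub split_any {
    my ($s) = @_;
    $s =~ s/(?<=.)(?=[[:upper:]])/ /g;
    return $s;
}

# Space only at a lowercase-to-uppercase boundary.
sub split_lower_upper {
    my ($s) = @_;
    $s =~ s/(?<=[[:lower:]])(?=[[:upper:]])/ /g;
    return $s;
}

for my $word ('ThisIsATest', 'NASA') {
    printf "%-12s any: %-16s lower/upper: %s\n",
        $word, split_any($word), split_lower_upper($word);
}
```

The stricter variant leaves a word like "NASA" alone, at the cost of no longer separating "ATest". Neither rule can tell which capitals start a new word, so some manual cleanup is likely either way.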


Matt Garrish 02-20-2006 03:12 AM

Re: Pattern Matching on Case
 

"it_says_BALLS_on_your_forehead" <simon.chao@gmail.com> wrote in message
news:1140402015.655308.107700@g43g2000cwa.googlegroups.com...
>
> John W. Krahn wrote:
>> DANIEL BURCH wrote:
>> > I have a file that apparently had html tags stripped out of it, or
>> > something, but no space characters added to replace the tags so it
>> > ended up
>> > with a lot of words run together like "ExplosionThis". In almost all
>> > cases
>> > there is a lower case letter followed by an upper case letter.

>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^
>> > I am trying
>> > to figure out a substitution statement that would separate them, but
>> > I'm not
>> > sure what would work. Maybe something like
>> >
>> > s/*[a-z][A-Z]*/*[a-z] [A-Z]*/g;
>> >
>> > but I don't have a clue if that is even close to working or if it will
>> > give
>> > me an "a" at the end and beginning of the words. Any help would be
>> > greatly
>> > appreciated.

>>
>> $ perl -le'$_ = "ThisIsATest"; print; s/(?<=.)(?=[[:upper:]])/ /g; print'
>> ThisIsATest
>> This Is A Test

>
> the above is pretty slick, but doesn't address what the OP asked for.
> what about cases where the data consists of a word in all caps?
>


That's why the OP will probably learn the hard way that regexes are more
trouble than they're worth in this kind of situation, and that it's easier
to go back to the source and start over. A spellchecker might prove more
useful if that's not possible...

Matt



Samwyse 02-20-2006 05:17 AM

Re: Pattern Matching on Case
 
DANIEL BURCH wrote:
> I have a file that apparently had html tags stripped out of it, or
> something, but no space characters added to replace the tags so it ended up
> with a lot of words run together like "ExplosionThis".


This is a bit off-topic, and definitely not related to Perl, but your
file didn't have HTML tags stripped from it. When stripping HTML tags,
you aren't supposed to replace them with whitespace. For example,
consider the following HTML, which italicizes some of the alphabet:

a<i>bcd</i>e<i>fgh</i>i<i>jklmn</i>o<i>pqrst</i>u<i>vwx</i>y<i>z</i>

Introducing spaces for the tags would mess everything up.

Matt Garrish 02-20-2006 12:18 PM

Re: Pattern Matching on Case
 

"Samwyse" <samwyse@gmail.com> wrote in message
news:phcKf.34787$F_3.23009@newssvr29.news.prodigy.net...
> DANIEL BURCH wrote:
>> I have a file that apparently had html tags stripped out of it, or
>> something, but no space characters added to replace the tags so it ended
>> up
>> with a lot of words run together like "ExplosionThis".

>
> This is a bit off-topic, and definitely not related to Perl, but your file
> didn't have HTML tags stripped from it. When stripping HTML tags, you
> aren't supposed to replace them with whitespace. For example, consider
> the following HTML, which italicizes some of the alphabet:
>
> a<i>bcd</i>e<i>fgh</i>i<i>jklmn</i>o<i>pqrst</i>u<i>vwx</i>y<i>z</i>
>
> Introducing spaces for the tags would mess everything up.


But then consider:

<td>I like to</td><td>Format everything</td><td>Inside cells</td><td>On one
line</td>

Never underestimate a bad html parsing job... : )

Matt
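
Both examples can be tried with a deliberately naive tag stripper. This is a sketch for illustration only: the `strip_tags` sub is hypothetical, and a single regex like this is not a reliable way to handle real HTML, which generally calls for a proper parser.

```perl
use strict;
use warnings;

# Naive tag removal: replace anything that looks like <...>
# with the given replacement string.
sub strip_tags {
    my ($html, $replacement) = @_;
    (my $text = $html) =~ s/<[^>]*>/$replacement/g;
    return $text;
}

my $inline = 'a<i>bcd</i>e<i>fgh</i>i';
my $cells  = '<td>I like to</td><td>Format everything</td>';

print strip_tags($inline, ''),  "\n";   # abcdefghi
print strip_tags($inline, ' '), "\n";   # a bcd e fgh i
print strip_tags($cells,  ''),  "\n";   # I like toFormat everything
```

Dropping inline tags without a space keeps the alphabet intact, while doing the same to table cells produces exactly the kind of "toFormat" run-on described in the original question.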



