Velocity Reviews

Jason C 11-02-2012 12:31 AM

Regex question, limit repeats UNLESS within specified tags
 
I'm currently limiting repeated characters like so:

$text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
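For reference, a minimal sketch of what that substitution does on its own. The wrapper name limit_repeats and the sample string are made up for illustration; only the s### line comes from the post. Any run of seven or more identical characters is cut down to six, and shorter runs are left alone.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the post.
sub limit_repeats {
    my ($text) = @_;
    # (.) captures one character; \1{6,} requires at least six more copies,
    # so only runs of 7+ match. The replacement keeps six of them.
    $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
    return $text;
}

# Sample input: "N", ten "o"s, ten "!"s.
my $sample = "N" . ("o" x 10) . ("!" x 10);
print limit_repeats($sample), "\n";
```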

I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.

I'm guessing that this would be done with a negative lookbehind, like this:

# Note, these aren't tested, just here for the explanation
$text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
$text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;

Neither of these is going to be perfect, though, because:

1. in the first one, I need to test for both an opening <img and an ending >; otherwise, I think it would not catch something like "<img src='aaa.jpg'> bbbbbbbbbb" (since the repeated "b" comes after "<img").

2. in the second one, I also need to test for the ending >, but also for the closing </a>. Even if I fixed the ending >, I could still end up with a confusing "<a href='http://www.aaaaaaaaaa.com'>http://www.aaaaaa.com</a>"
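A small sketch of why the (?<!<img) form falls short: the assertion only inspects the four characters immediately before the repeated run, so it does not know whether the run sits somewhere inside a tag. The sub name and the sample string are invented for the demonstration; the pattern is the first untested one from above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the first untested pattern from the post.
sub limit_unless_right_after_img {
    my ($text) = @_;
    # (?<!<img) only rejects a run that starts directly after the literal
    # text "<img"; anything further inside the tag is still fair game.
    $text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
    return $text;
}

# The seven zeros are inside the tag but not right after "<img",
# so they get shortened along with the run of b's after the tag.
my $sample = q{<img src="pic0000000123.jpg"> bbbbbbbbbb};
print limit_unless_right_after_img($sample), "\n";
```

So the file name inside the tag is still altered, which is the opposite of what is wanted.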


Any suggestions on how to do either of these better? TIA,

Jason

Justin C 11-02-2012 09:40 AM

Re: Regex question, limit repeats UNLESS within specified tags
 
On 2012-11-02, Jason C <jwcarlton@gmail.com> wrote:
> I'm currently limiting repeated characters like so:
>
> $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
>
> I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.
>
> I'm guessing that this would be done with a negative lookbehind, like this:
>
> # Note, these aren't tested, just here for the explanation
> $text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
> $text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;



Found in /usr/share/perl/5.10/pod/perlfaq6.pod
How do I match XML, HTML, or other nasty, ugly things with a regex?
(contributed by brian d foy)

If you just want to get work done, use a module and forget about the
regular expressions. The "XML::Parser" and "HTML::Parser" modules are
good starts, although each namespace has other parsing modules
specialized for certain tasks and different ways of doing it. Start at
CPAN Search ( http://search.cpan.org ) and wonder at all the work
people have done for you already! :)

Use the modules and use your regex on what's left. Don't try to
write REs for HTML; life is too short.


Justin.

--
Justin C, by the sea.

Jason C 11-02-2012 08:37 PM

Re: Regex question, limit repeats UNLESS within specified tags
 
On Friday, November 2, 2012 6:08:03 AM UTC-4, Justin C wrote:
> On 2012-11-02, Jason C <jwcarlton@gmail.com> wrote:
> > I'm currently limiting repeated characters like so:
> >
> > $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
> >
> > I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.
> >
> > I'm guessing that this would be done with a negative lookbehind, like this:
> >
> > # Note, these aren't tested, just here for the explanation
> > $text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
> > $text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;
>
> Found in /usr/share/perl/5.10/pod/perlfaq6.pod
> How do I match XML, HTML, or other nasty, ugly things with a regex?
> (contributed by brian d foy)
>
> If you just want to get work done, use a module and forget about the
> regular expressions. The "XML::Parser" and "HTML::Parser" modules are
> good starts, although each namespace has other parsing modules
> specialized for certain tasks and different ways of doing it. Start at
> CPAN Search ( http://search.cpan.org ) and wonder at all the work
> people have done for you already! :)
>
> Use the modules and use your regex on what's left. Don't try to
> write REs for HTML; life is too short.
>
> Justin.
>
> --
> Justin C, by the sea.

I've used HTML::Parser at length, but I don't think that it offers anything like what I'm needing. I looked through CPAN, and didn't find anything like this.

I might have made the OP seem too complicated. What I really need to figure out is how to run a regex where both the look-behind AND look-ahead match.

Something like this, I guess:

# Not tested
while (($text !~ /<img[^>]*?>/gi) &&
       ($text !~ /<a href[^>]*?>/gi)) {
    $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
}

Or maybe two separate loops, like this:

while ($text !~ /<img[^>]*?>/gi) {
    $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
}

while ($text !~ /<a href([^>]*?)>(.*?)<\/a>/gi) {
    $pattern = $repl = $1;

    $pattern = quotemeta($pattern);
    $repl =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

    $text =~ s#$pattern#$repl#gsi;
}

Thoughts?
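One possible regex-only alternative to the loops above, offered as a rough sketch rather than a robust HTML solution: cut the string into markup and non-markup pieces with split, then run the substitution on the non-markup pieces only. The sub name limit_repeats_outside_tags is made up, and the pattern assumes simple, well-formed tags (no ">" inside attribute values, no nested <a> elements).

```perl
use strict;
use warnings;

# Hypothetical helper: limit repeats everywhere except inside tags
# and inside <a ...>...</a> elements (link text included).
sub limit_repeats_outside_tags {
    my ($text) = @_;

    # The capturing parentheses make split() return the separators as well,
    # so the list alternates: text, markup, text, markup, ...
    my @pieces = split m{(<a\b[^>]*>.*?</a>|<[^>]*>)}is, $text;

    for my $i (0 .. $#pieces) {
        next if $i % 2;    # odd indexes are the protected markup pieces
        $pieces[$i] =~ s/(.)\1{6,}/$1 x 6/gsie;
    }
    return join '', @pieces;
}

print limit_repeats_outside_tags(q{<img src='aaaaaaaaaa.jpg'> bbbbbbbbbb}), "\n";
```

This covers both cases from the original question (the <img> tag and the whole <a href=...>...</a> element), but it shares the usual weakness of regex-based HTML handling on messy input.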

Peter J. Holzer 11-03-2012 12:31 PM

Re: Regex question, limit repeats UNLESS within specified tags
 
On 2012-11-02 21:11, Eli the Bearded <*@eli.users.panix.com> wrote:
> In comp.lang.perl.misc, Jason C <jwcarlton@gmail.com> wrote:
>> On Friday, November 2, 2012 6:08:03 AM UTC-4, Justin C wrote:
>>> Found in /usr/share/perl/5.10/pod/perlfaq6.pod
>>> How do I match XML, HTML, or other nasty, ugly things with a regex?
>>> (contributed by brian d foy)
>>> If you just want to get work done, use a module and forget about the
>>> regular expressions. The "XML::Parser" and "HTML::Parser" modules

>> I've used HTML::Parser at length, but I don't think that it offers anything
>> like what I'm needing. I looked through CPAN, and didn't find anything like
>> this.

>
> Your use case is exotic. You will not find exactly what you need off the
> shelf. You will find ways to break a document up into <IMG>, <A>, and
> neither of those when you use a parsing module. Thus broken up, you can
> then do your substring regexp.


Agreed.

>
>> I might have made the OP seem too complicated. What I really need to figure
>> out is how to run a regex where both the look-behind AND look-ahead match.

>
> No, I don't think you made it seem "too complicated", it *is* too
> complicated.


I don't know whether it is complicated, but I do know that I don't
understand it. My best guess is that he wants to limit duplicate
characters in the text of the document, but wants to avoid mangling URLs.

So if someone writes:

<p>John is stupid!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!</p>

he wants to change this to

<p>John is stupid!!!!!!</p>

But something like

<img src="/images/img0000000123.jpg" title="Little Johnny and his dog">

should not be changed to

<img src="/images/img000000123.jpg" title="Little Johnny and his dog">

because that would invalidate the link.

But this is just a guess.

Assuming I am right, I would use HTML::Parser to parse the file and then
do those substitutions only in text nodes. This is probably most easily
done with a handler.
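A minimal sketch of that handler approach, assuming the CPAN module HTML::Parser (API version 3) is installed. The sub name limit_repeats_in_text and the $in_link counter are invented for this example, and leaving link text untouched is an assumption taken from the original question, not from the suggestion above.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: copy all markup through unchanged and limit
# repeated characters only in text nodes outside <a>...</a>.
sub limit_repeats_in_text {
    my ($html) = @_;
    my $out     = '';
    my $in_link = 0;    # how many <a> elements are currently open

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $src) = @_;
            $in_link++ if $tag eq 'a';
            $out .= $src;                 # tag source, exactly as written
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $src) = @_;
            $in_link-- if $tag eq 'a' && $in_link > 0;
            $out .= $src;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($src) = @_;
            $src =~ s/(.)\1{6,}/$1 x 6/gsie unless $in_link;
            $out .= $src;
        }, 'text' ],
        # Comments, declarations and so on are passed through as-is.
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $parser->unbroken_text(1);    # deliver each text node in one piece
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print limit_repeats_in_text(
    q{<p>John is stupid} . ('!' x 20) . q{</p>}
  . q{<img src="/images/img0000000123.jpg">}
), "\n";
```

Attribute values never reach the text handler, so the image path stays intact. The contents of <script> and <style> elements also arrive as text events, so a real version would probably want to skip those too.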

hp



--
_ | Peter J. Holzer | The curse of electronic word processing:
|_|_) | Sysadmin WSR | one keeps polishing at one's text until
| | | hjp@hjp.at | the parts of the sentence no longer
__/ | http://www.hjp.at/ | fits together. -- Ralph Babel

