Regex question, limit repeats UNLESS within specified tags

 
 
Jason C
      11-02-2012
I'm currently limiting repeated characters like so:

$text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.

I'm guessing that this would be done with negative lookbehind, like this:

# Note, these aren't tested, just here for the explanation
$text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
$text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;

Neither of these is going to be perfect, though, because:

1. in the first one, I need to test for both an opening <img and an ending >; otherwise, I think it would not catch something like "<img src='aaa.jpg'> bbbbbbbbbb" (since the repeated "b" comes after "<img").

2. in the second one, I also need to test for the ending >, but also for the closing </a>. Even if I fixed the ending >, I could still end up with a confusing "<a href='http://www.aaaaaaaaaa.com'>http://www.aaaaaa.com</a>"
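To make the limitation concrete, here is a small sketch (the sample string is made up for illustration, not taken from real data). A lookbehind such as (?<!<img) only inspects the characters immediately before the match position, so it cannot tell whether the match lies somewhere inside a tag:

```perl
use strict;
use warnings;

# Hypothetical sample input: one repeat inside the tag, one outside it.
my $text = q{<img src='aaaaaaaaaa.jpg'> bbbbbbbbbb};

# The lookbehind only looks at the four characters directly before the
# repeated run. Here those are "rc='", not "<img", so the assertion
# succeeds and the run inside the src attribute is shortened as well.
(my $limited = $text) =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;

print "$limited\n";   # <img src='aaaaaa.jpg'> bbbbbb
```

On this input the lookbehind version behaves exactly like the original substitution, which is why a different approach is needed.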


Any suggestions on how to do either of these better? TIA,

Jason
 
 
Justin C
      11-02-2012
On 2012-11-02, Jason C <(E-Mail Removed)> wrote:
> I'm currently limiting repeated characters like so:
>
> $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
>
> I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.
>
> I'm guessing that this would be done with negative lookbehind, like this:
>
> # Note, these aren't tested, just here for the explanation
> $text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
> $text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;



Found in /usr/share/perl/5.10/pod/perlfaq6.pod
How do I match XML, HTML, or other nasty, ugly things with a regex?
(contributed by brian d foy)

If you just want to get work done, use a module and forget about the
regular expressions. The "XML::Parser" and "HTML::Parser" modules are
good starts, although each namespace has other parsing modules
specialized for certain tasks and different ways of doing it. Start at
CPAN Search ( http://search.cpan.org ) and wonder at all the work
people have done for you already!

Use the modules and use your regex on what's left; don't try to
write REs for HTML, life is too short.


Justin.

--
Justin C, by the sea.
 
 
Jason C
      11-02-2012
On Friday, November 2, 2012 6:08:03 AM UTC-4, Justin C wrote:
> On 2012-11-02, Jason C <(E-Mail Removed)> wrote:
> > I'm currently limiting repeated characters like so:
> >
> > $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
> >
> > I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.
> >
> > I'm guessing that this would be done with negative lookbehind, like this:
> >
> > # Note, these aren't tested, just here for the explanation
> > $text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
> > $text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;
>
> Found in /usr/share/perl/5.10/pod/perlfaq6.pod
> How do I match XML, HTML, or other nasty, ugly things with a regex?
> (contributed by brian d foy)
>
> If you just want to get work done, use a module and forget about the
> regular expressions. The "XML::Parser" and "HTML::Parser" modules are
> good starts, although each namespace has other parsing modules
> specialized for certain tasks and different ways of doing it. Start at
> CPAN Search ( http://search.cpan.org ) and wonder at all the work
> people have done for you already!
>
> Use the modules and use your regex on what's left; don't try to
> write REs for HTML, life is too short.
>
> Justin.
>
> --
> Justin C, by the sea.


I've used HTML::Parser at length, but I don't think that it offers anything like what I'm needing. I looked through CPAN, and didn't find anything like this.

I might have made the OP seem too complicated. What I really need to figure out is how to run a regex where both the look-behind AND look-ahead match.

Something like this, I guess:

# Not tested
while (($text !~ /<img[^>]*?>/gi) &&
       ($text !~ /<a href[^>]*?>/gi)) {
    $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
}

Or maybe two separate loops, like this:

while ($text !~ /<img[^>]*?>/gi) {
    $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
}

while ($text !~ /<a href([^>]*?)>(.*?)<\/a>/gi) {
    $pattern = $repl = $1;

    $pattern = quotemeta($pattern);
    $repl =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

    $text =~ s#$pattern#$repl#gsi;
}

Thoughts?
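
For comparison, one regex-only way to get "skip these spans, limit everything else" is to match the spans to be skipped as the first alternative and copy them through unchanged. The following is a rough sketch under stated assumptions, not something posted in the thread: the helper name limit_repeats_outside_tags is made up, and the pattern assumes well-formed tags with no ">" inside attribute values and no nested <a> elements. Real-world HTML breaks those assumptions easily, which is why the replies in this thread recommend a parser.

```perl
use strict;
use warnings;

# Hypothetical helper name; not from the thread.
sub limit_repeats_outside_tags {
    my ($text) = @_;

    $text =~ s{
        (                            # group 1: a span to copy through untouched
            <img\b [^>]* >           #   an img tag
          | <a\b   [^>]* > .*? </a>  #   an anchor element including its link text
        )
      |
        (.) \2{6,}                   # group 2: a character followed by 6+ repeats
    }{
        defined $1 ? $1 : $2 x 6     # keep protected spans, shorten other runs
    }gsixe;

    return $text;
}

print limit_repeats_outside_tags(
    q{<img src="/img0000000123.jpg"> zzzzzzzzzz}
), "\n";
# prints: <img src="/img0000000123.jpg"> zzzzzz
```

Because the tag alternatives are tried first at each position, a repeated run can only be matched by the second alternative when it starts outside a protected span.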
 
Peter J. Holzer
      11-03-2012
On 2012-11-02 21:11, Eli the Bearded <*@eli.users.panix.com> wrote:
> In comp.lang.perl.misc, Jason C <(E-Mail Removed)> wrote:
>> On Friday, November 2, 2012 6:08:03 AM UTC-4, Justin C wrote:
>>> Found in /usr/share/perl/5.10/pod/perlfaq6.pod
>>> How do I match XML, HTML, or other nasty, ugly things with a regex?
>>> (contributed by brian d foy)
>>> If you just want to get work done, use a module and forget about the
>>> regular expressions. The "XML::Parser" and "HTML::Parser" modules
>
>> I've used HTML::Parser at length, but I don't think that it offers anything
>> like what I'm needing. I looked through CPAN, and didn't find anything like
>> this.
>
> Your use case is exotic. You will not find exactly what you need off the
> shelf. You will find ways to break a document up into <IMG>, <A>, and
> neither of those when you use a parsing module. Thus broken up, you can
> then do your substring regexp.


Agreed.

>
>> I might have made the OP seem too complicated. What I really need to figure
>> out is how to run a regex where both the look-behind AND look-ahead match.

>
> No, I don't think you made it seem "too complicated", it *is* too
> complicated.


I don't know whether it is complicated, but I do know that I don't
understand it. My best guess is that he wants to limit duplicate
characters in the text of a document, but wants to avoid mangling URLs.

So if someone writes:

<p>John is stupid!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!</p>

he wants to change this to

<p>John is stupid!!!!!!</p>

But something like

<img src="/images/img0000000123.jpg" title="Little Johnny and his dog">

should not be changed to

<img src="/images/img000000123.jpg" title="Little Johnny and his dog">

because that would invalidate the link.

But this is just a guess.

Assuming I am right, I would use HTML::Parser to parse the file and then
do those substitutions only in text nodes. This is probably most easily
done with a handler.
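
A minimal sketch of what such a handler setup might look like, assuming the version 3 API of HTML::Parser from CPAN. The helper name limit_repeats_in_text is made up for illustration, and the <a> depth counter is an addition meant to cover the original poster's wish to leave link text alone; it is not part of the suggestion above. Tags, comments and declarations are copied through verbatim, so attribute values such as URLs are never touched.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper name; the thread does not give an implementation.
sub limit_repeats_in_text {
    my ($html) = @_;
    my $out     = '';
    my $in_link = 0;    # depth of currently open <a> elements

    my $p = HTML::Parser->new(
        api_version => 3,

        # Tags are copied through verbatim; only track <a> nesting.
        start_h => [ sub {
            my ($tag, $src) = @_;
            $in_link++ if $tag eq 'a';
            $out .= $src;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $src) = @_;
            $in_link-- if $tag eq 'a' && $in_link > 0;
            $out .= $src;
        }, 'tagname, text' ],

        # Text nodes: limit repeats unless we are inside a link.
        text_h => [ sub {
            my ($src) = @_;
            $src =~ s/(.)\1{6,}/$1 x 6/gsie unless $in_link;
            $out .= $src;
        }, 'text' ],

        # Comments, declarations, etc. pass through unchanged.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );
    $p->unbroken_text(1);    # deliver each text run as a single event

    $p->parse($html);
    $p->eof;
    return $out;
}
```

With the examples above, the run of exclamation marks in the <p> element would be cut to six while the <img> tag is returned unchanged. Note that the contents of <script> and <style> elements are also reported as text events, so a real implementation would probably want to skip those as well.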

hp



--
   _  | Peter J. Holzer    | The curse of electronic word processing:
|_|_) | Sysadmin WSR       | you keep polishing on your text until the
| |   | (E-Mail Removed)   | parts of the sentence of the sentence no
__/   | http://www.hjp.at/ | longer fits together. -- Ralph Babel
 