Velocity Reviews - Computer Hardware Reviews


RegEx Help Needed

 
 
DeepDiver
Guest
Posts: n/a
 
      12-04-2004
I'm trying to parse a string of HTML that contains a mix of tags and text.
My goal is to match double quote marks in the text (but not within the
tags) and replace them with the equivalent HTML character entity
(i.e., &quot;).

For example, this string:
The "slow" red fox.<div class="test">The "quick" brown fox.</div>

would become this:
The &quot;slow&quot; red fox.<div class="test">The &quot;quick&quot;
brown fox.</div>

TIA!!!


 
 
 
 
 
Sherm Pendley
Guest
Posts: n/a
 
      12-04-2004
DeepDiver wrote:

> I'm trying to parse a string of HTML


Have a look at HTML::Parser on CPAN.
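
For readers outside Perl, here is a rough sketch of the same parser-based idea using Python's standard html.parser module rather than HTML::Parser itself. The class name QuoteEscaper and the function escape_with_parser are made up for illustration. It re-emits each start tag exactly as written and escapes double quotes only in text, skipping the contents of script and style elements. End tags are rebuilt in lowercase, so the output is not guaranteed to be byte-identical to the input.

```python
from html.parser import HTMLParser


class QuoteEscaper(HTMLParser):
    """Re-emit the document, escaping double quotes only in text nodes."""

    def __init__(self):
        # convert_charrefs=False reports existing entities separately,
        # so they can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.raw_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag in ("script", "style"):
            self.raw_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in ("script", "style") and self.raw_depth:
            self.raw_depth -= 1

    def handle_data(self, data):
        if self.raw_depth == 0:
            data = data.replace('"', "&quot;")
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def escape_with_parser(html):
    parser = QuoteEscaper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


sample = 'The "slow" red fox.<div class="test">The "quick" brown fox.</div>'
print(escape_with_parser(sample))
```

On the sample string from the original question this prints the requested result, with the class="test" attribute left alone.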

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
 
 
 
 
 
DeepDiver
Guest
Posts: n/a
 
      12-04-2004
"Sherm Pendley" <> wrote in message
news:SOydnYD65MRH0izcRVn-...
>
> Have a look at HTML::Parser on CPAN.
>


Thanks, but I'm in need of a pure RegEx solution.


 
 
Lars Eighner
Guest
Posts: n/a
 
      12-04-2004
In our last episode, <JZbsd.9270$_>, the lovely
and talented DeepDiver broadcast on comp.lang.perl.misc:

> I'm trying to parse a string of HTML that contains a mix of tags and text.
> My goal is to match and replace double quote marks in the text (but not
> within the tags) and replace them with the equivalent html character entity
> (i.e., &quot;).


> For example, this string:
> The "slow" red fox.<div class="test">The "quick" brown fox.</div>


> would become this:
> The &quot;slow&quot; red fox.<div class="test">The &quot;quick&quot;
> brown fox.</div>


> TIA!!!


I can't do it in one, but --

WARNING! Those offended by brute force ugliness should look away now!
WARNING!

goodwill~/test$ perl -0777 -wpi -e 'while( s/\"([^<>]*<)/&quot\;$1/g ){}' test.html

This won't work if you have unbalanced < and/or > characters anywhere in
the document, such as a script with something like document.write("<"),
or simply unclosed tags. If you actually run this as a one-liner,
beware of what your shell may do with $1 if you double-quote the
code passed to -e.
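
Since the original poster is not working in Perl, here is a minimal Python sketch of the same brute-force idea: keep applying the substitution until a pass changes nothing. The function name escape_by_repeated_sub is invented for this example. The loop is needed because each match consumes everything up to the next "<", so only one quote per run of text is replaced in a single pass. Like the one-liner, it leaves quotes that follow the last tag untouched, because no "<" comes after them.

```python
import re

# A double quote whose next angle bracket is "<" sits in text, not in a tag.
QUOTE_BEFORE_TAG = re.compile(r'"([^<>]*<)')


def escape_by_repeated_sub(html):
    """Repeat the substitution until a pass makes no replacement."""
    while True:
        html, count = QUOTE_BEFORE_TAG.subn(r"&quot;\1", html)
        if count == 0:
            return html


sample = 'The "slow" red fox.<div class="test">The "quick" brown fox.</div>'
print(escape_by_repeated_sub(sample))
# The &quot;slow&quot; red fox.<div class="test">The &quot;quick&quot; brown fox.</div>
```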


--
Lars Eighner -finger for geek code- http://www.io.com/~eighner/
War on Terrorism: Camp Follower
"I am ... a total sucker for the guys ... with all the ribbons on and stuff,
and they say it's true and I'm ready to believe it. -Cokie Roberts,_ABC_
 
 
David H. Adler
Guest
Posts: n/a
 
      12-04-2004
On 2004-12-04, DeepDiver <no-> wrote:
> "Sherm Pendley" <> wrote in message
> news:SOydnYD65MRH0izcRVn-...
>>
>> Have a look at HTML::Parser on CPAN.
>>

>
> Thanks, but I'm in need of a pure RegEx solution.


This of course raises the question: Why?

We can probably help you better if we have some idea of why you reject
the generally accepted solution...

dha

--
David H. Adler - <> - http://www.panix.com/~dha/
[Insert Angus Prune Tune here]
 
 
DeepDiver
Guest
Posts: n/a
 
      12-04-2004
"David H. Adler" <> wrote in message
news:...
> On 2004-12-04, DeepDiver <no-> wrote:
> > "Sherm Pendley" <> wrote in message
> > news:SOydnYD65MRH0izcRVn-...
> >>
> >> Have a look at HTML::Parser on CPAN.
> >>

> >
> > Thanks, but I'm in need of a pure RegEx solution.

>
> This of course raises the question: Why?



A few reasons:

1. I'm not programming in Perl. In fact, my experience with Perl was a long
time ago (and not very extensive even then). I came here because I believe
that Perl programmers are generally the most proficient with regular
expressions.

2. I'm writing the current routine in C#. But I would still prefer a "pure"
RegEx solution so that I have something that is concise and (higher-level)
language independent.

3. I'm trying to improve my RegEx skills, so the more I can learn how to do
things like this in RegEx (without "massaging" in a higher-level language)
the better.

I hope this addresses your concerns.

Thanks,
Michael


 
 
Sherm Pendley
Guest
Posts: n/a
 
      12-04-2004
DeepDiver wrote:

> 1. I'm not programming in Perl.
>
> 2. I'm writing the current routine in C#.


This is a Perl group. The C# group is down the hall to the left. Don't
let the door hit you on the way out.

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
 
 
Joe Smith
Guest
Posts: n/a
 
      12-04-2004
DeepDiver wrote:

> 1. I came here because I believe
> that Perl programmers are generally the most proficient with regular
> expressions.


Regular expressions as implemented in other languages are not the same.

Using just a regular expression won't cut it; correct parsing usually
requires program logic as well.
-Joe
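
One way to read "regex plus program logic" is to let the regex only split the input into tags and text, and let ordinary code decide what to do with each piece. The Python sketch below is an illustration under that reading; the names TAG_OR_TEXT and escape_text_segments are hypothetical. It still treats anything between "<" and the next ">" as a tag, so it shares the usual weaknesses with scripts, comments, and attribute values that contain ">".

```python
import re

# Group 1: something that looks like a tag. Group 2: a run of text.
TAG_OR_TEXT = re.compile(r"(<[^>]*>)|([^<]+)")


def escape_text_segments(html):
    def fix(match):
        tag, text = match.group(1), match.group(2)
        if tag is not None:
            return tag  # leave tags exactly as they are
        return text.replace('"', "&quot;")  # escape quotes in text only

    return TAG_OR_TEXT.sub(fix, html)


sample = 'The "slow" red fox.<div class="test">The "quick" brown fox.</div>'
print(escape_text_segments(sample))
```

Unlike a pattern that requires a following "<", this version also handles quotes in text that comes after the last tag.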
 
 
Tassilo v. Parseval
Guest
Posts: n/a
 
      12-04-2004
Also sprach DeepDiver:

> "David H. Adler" <> wrote in message
> news:...
>> On 2004-12-04, DeepDiver <no-> wrote:
>> > "Sherm Pendley" <> wrote in message
>> > news:SOydnYD65MRH0izcRVn-...
>> >>
>> >> Have a look at HTML::Parser on CPAN.
>> >>
>> >
>> > Thanks, but I'm in need of a pure RegEx solution.

>>
>> This of course raises the question: Why?

>
>
> A few reasons:
>
> 1. I'm not programming in Perl. In fact, my experience with Perl was a long
> time ago (and not very extensive even then). I came here because I believe
> that Perl programmers are generally the most proficient with regular
> expressions.


This nonetheless makes your posting rather off-topic in this group. Perl
did not invent regular expressions. Also, Perl regular expressions are
likely to be more powerful than regular expressions found in other
languages. This means you probably couldn't use a regex solution
from this group in your program.

> 2. I'm writing the current routine in C#. But I would still prefer a "pure"
> RegEx solution so that I have something that is concise and (higher-level)
> language independent.


I have my doubts as to the conciseness of a pure regex solution.
Classical regular expressions aren't even remotely powerful enough to
parse HTML (and there's not much to argue about: it can be proven with
the famous pumping lemma). Perl's regular expressions might be powerful
enough as they have some non-regular extensions (they allow
back-references, they can be recursive etc.). Still, a regex solution
could hardly be robust. Let alone the fact that .NET regular expressions
lack many of the Perl features.
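
To make the robustness concern concrete, here is a small Python demonstration. The pattern is a one-pass lookahead variant written for this illustration, not something posted in the thread: it treats a quote as text when the next angle bracket is "<" or when there is none. It gets the simple case right, but it has no notion of element context, so it rewrites quotes inside a script and inside an attribute value that happens to contain "<".

```python
import re

# Illustrative pattern: a quote is "text" if the next angle bracket is "<",
# or if no angle bracket follows at all.
TEXT_QUOTE = re.compile(r'"(?=[^<>]*(?:<|\Z))')


def naive_escape(html):
    return TEXT_QUOTE.sub("&quot;", html)


# Simple case: works as intended.
print(naive_escape('The "quick" fox.<div class="test">ok</div>'))
# The &quot;quick&quot; fox.<div class="test">ok</div>

# Script content is rewritten, which breaks the JavaScript.
print(naive_escape('<script>alert("x")</script>'))
# <script>alert(&quot;x&quot;)</script>

# A "<" inside an attribute value fools the lookahead.
print(naive_escape('<a title="a < b">link</a>'))
# <a title=&quot;a < b">link</a>
```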

Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus}) !JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexi ixesixeseg;y~\n~~dddd;eval
 
 
Alan J. Flavell
Guest
Posts: n/a
 
      12-04-2004
On Sat, 4 Dec 2004, Tassilo v. Parseval wrote:

> Perl regular expressions are likely to be more powerful than regular
> expressions found in other languages.


Would this be a moment to mention PCRE, http://www.pcre.org/ ?

"Perl Compatible Regular Expressions" library.

I often use its diagnostic command, "pcretest", to explore the
behaviour of some complex regex that I'm working with, when fed with
various data. Whether the regex is meant for Perl or, indeed, when
writing ACLs for the same author's excellent MTA, exim.

(Of course, that has nothing to do with attempting to use regexes for
parsing arbitrary HTML - which is ultimately hopeless.)
 