Re: Regular expression to find <tr> tags in 2nd level HTML tables

 
 
Shannon Jacobs
 
      01-09-2004
Brian Genisio <> wrote in message news:<>...
> Shannon Jacobs wrote:
>

<snip>
> Take a look at the TidyLib. It is a C library that will parse HTML for
> you, in DOM-Like nodes, which you can traverse like a tree. It was
> originally developed via the W3C, but it is available via SourceForge

<snip>
> Using a RegExp will break as soon as the HTML format changes, but a
> smart tree traversal will likely be more robust.
>
> If you go the TidyLib method, you can manipulate the data quickly, and
> easily develop your palm database via C routines.


From your description, this doesn't really sound like an approach I
want to take. It's not a matter of simple access, but pruning
manipulation. If I really wanted to follow this approach, the most
bankable-for-use-in-the-real-office approach would be the Excel macro
programming approach I mentioned. However, anytime anyone mentions
Microsoft or Visual <anything> I feel like I want to hold up a silver
cross and scream "Return to Hades, you evil demons!"

However, due to your hint and another source, I thought to explore the
DOM tree to get a better understanding of the problem. Mozilla has a
DOM explorer that was quite good for this, and I can clarify the
problem now. Here is a reduction of the situation:

<table>
<tr>
<tr>
<table>
<tr>
<tr>
....
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
...
<tr>

In the outermost table, there is some useful data worth saving in the
first <tr> row. In the 2nd level table, there is some useful data,
mostly numbers, in each of those <tr> rows. Returning to the outer
table, the 7th <tr> row also contains some information that would be
worth saving. That's the legend I mentioned in the earlier post, but
which I still feel would be too difficult to parse in a robust way.
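
For comparison, the tree-walking idea suggested earlier in the thread can be sketched without TidyLib. This is only a minimal illustration in Python (not the Perl or JavaScript under discussion), using the standard library's `html.parser` and counting open `<table>` tags to tell first-level rows from second-level ones. The class name, function name, and sample data are all hypothetical, and the sample assumes `<tr>` tags with no closing tags, as in the reduction above.

```python
from html.parser import HTMLParser


class NestedRowCollector(HTMLParser):
    """Collect the text of each <tr> row together with its table nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <table> elements currently open
        self.current = {}   # depth -> text fragments of the row open at that depth
        self.rows = []      # (depth, fragments) in document order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag == "tr" and self.depth > 0:
            # A new <tr> implicitly ends the previous row at this depth.
            row = []
            self.current[self.depth] = row
            self.rows.append((self.depth, row))

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.current.pop(self.depth, None)
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        row = self.current.get(self.depth)
        if text and row is not None:
            row.append(text)


def rows_at_depth(html, depth):
    """Return the text fragments of every row found at the given table depth."""
    parser = NestedRowCollector()
    parser.feed(html)
    parser.close()
    return [cells for d, cells in parser.rows if d == depth]


# Made-up sample shaped like the reduction above.
SAMPLE = """
<table>
<tr><td>Report header
<tr><td>
  <table>
  <tr><td>10<td>20
  <tr><td>30<td>40
  </table>
<tr><td>legend
</table>
"""

second_level = rows_at_depth(SAMPLE, 2)
first_level = rows_at_depth(SAMPLE, 1)
```

Here `second_level` holds one list per inner row, and `first_level` holds the outer rows, including the empty one that wraps the inner table.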

The rest of it is basically dross, and my current regexes toss it away
quite nicely. The main problem is that the line breaks associated with
those second level <tr> tags are useful and significant, and I want to
keep them.

There seem to be two regex-based approaches that are possible. One is
to use one regex to mark them in a way that prevents them from being
tossed, and then restore them at the end after the other line
breaks have been removed, basically with the reverse regex. I'm
already doing that with some other information that needs to be
preserved.
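
A minimal sketch of this mark-and-restore idea, written in Python rather than JavaScript or Perl. It shows only the mechanics: the placeholder `MARK` and the pattern `<tr class="data">` are hypothetical, and finding a pattern that really picks out the second-level rows is the open question of this thread.

```python
import re

# Hypothetical placeholder; assumed never to appear in the real data.
MARK = "\x00KEEP\x00"


def keep_marked_breaks(text, protect_pattern):
    """Keep only the line breaks that directly precede protect_pattern."""
    # 1. Mark the line breaks worth keeping.
    marked = re.sub(r"\r?\n(?=" + protect_pattern + r")", MARK, text)
    # 2. Toss every remaining line break.
    flattened = re.sub(r"[\r\n]+", "", marked)
    # 3. Reverse step 1 to restore the marked breaks.
    return flattened.replace(MARK, "\n")


# Made-up input; the class attribute is an assumption for illustration only.
sample = '<table>\n<tr>\n<td>a\n<tr class="data">\n<td>1\n<tr class="data">\n<td>2\n'
result = keep_marked_breaks(sample, r'<tr class="data">')
```

The same three steps carry over to JavaScript's `String.replace` or Perl's `s///g`, since only a lookahead and plain substitutions are involved.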

The other approach would be to just save the immediately preceding
line breaks while tossing all the others. I think I favor this
approach because it strikes me as most elegant and in keeping with the
spirit of the great regex of the heading of 137 degrees. A related
approach to this one would be to toss all the line breaks at the
beginning, and then insert the correct ones before throwing the other
dross.
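
The related variant (toss every line break first, then insert the wanted ones, then throw out the dross) might look like the following Python sketch. As before, the row pattern `<tr class="data">` is a hypothetical stand-in for whatever actually distinguishes the second-level rows.

```python
import re


def rebreak_then_strip(text, row_pattern):
    """Drop all line breaks, re-insert one before each wanted row, then strip tags."""
    # 1. Toss all the line breaks at the beginning.
    flat = re.sub(r"[\r\n]+", "", text)
    # 2. Insert a line break before every occurrence of the row pattern.
    rebroken = re.sub(r"(?=" + row_pattern + r")", "\n", flat)
    # 3. Throw away the remaining tags.
    return re.sub(r"<[^>]*>", "", rebroken)


# Made-up input; the class attribute is an assumption for illustration only.
sample = '<table>\n<tr>\n<td>a\n<tr class="data">\n<td>1\n<tr class="data">\n<td>2\n'
result = rebreak_then_strip(sample, r'<tr class="data">')
```

Step 2 uses a zero-width lookahead, so nothing is consumed and the row tag itself survives until step 3.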

I actually found a rather similar recent thread in the comp.lang.perl
newsgroup, so I've cross-posted to that newsgroup, too. That involved
using

s/<[^>]*>//g;

to remove all of the HTML tags, but I need to be more selective.
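
As a rough illustration of what "more selective" could mean, here is a Python sketch of the blanket substitution next to a variant that spares chosen tags by means of a negative lookahead. The function names are made up for this example, and neither version knows anything about nesting level.

```python
import re


def strip_all_tags(html):
    """Python equivalent of s/<[^>]*>//g: remove every tag."""
    return re.sub(r"<[^>]*>", "", html)


def strip_tags_except(html, keep=("tr",)):
    """Remove every tag except opening and closing tags named in keep."""
    names = "|".join(re.escape(name) for name in keep)
    # The lookahead rejects a match when the tag name is one of the kept names.
    pattern = r"<(?!/?(?:" + names + r")\b)[^>]*>"
    return re.sub(pattern, "", html, flags=re.IGNORECASE)


everything_gone = strip_all_tags("<tr><td>1</td><td>2</td></tr>")
rows_kept = strip_tags_except("<table><tr><td>1</td><td>2</td></tr></table>")
```

The `\b` after the tag name keeps a tag such as `<track>` from being mistaken for `<tr>`.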

I also wanted to include a response to the other reply, snide though
it was.

His first snide question was "Why?", in response to my preference for
a regex-based solution. I've already mostly answered that question,
but I'll add that I think regex-based solutions can be quite elegant,
and apparently I sometimes like having my head bent through the regex
dimension.

He then recommended using an HTML parsing module and suggested asking
in a JavaScript newsgroup. In the original post I had already
explained why I wanted this direct approach, and I had already asked
in the JavaScript newsgroup with the original cross-post. I suspect
him of being a wannabe Perler, since real Perl people tend to be very
observant of all details. The regex experts even more so. However, I
just wanted to note that his attitude is one of the main reasons I
quit working in Perl. IMNSHO, it's rather too common among Perl users,
and I'd hate to wind up like that.
 
 
 
 
 
Shannon Jacobs
 
      01-11-2004
"Alan J. Flavell" <> wrote in message news:<. gla.ac.uk>...
> On Fri, 9 Jan 2004, Shannon Jacobs wrote:
>
> > By the way, I've relinked the Perl group which is accessible from this
> > particular server. In spite of the attitude thing, I still think the best
> > regex people are Perl-centric.

>
> And they will presumably tell you, as I've seen them doing many times
> before, that regexes are not the way to parse HTML. Then what? Will
> you be griping about "attitude" again, or deferring to their
> expertise?


Yeah, I think I will be griping. You certainly haven't exhibited any
"expertise" to defer to. This time your "attitude" reminds me of the
religious zealots. I still seek truth and beauty and all that jazz,
but when I was much younger I thought the zealots might know something
about them--after all, they were SO certain of their "expertise".

I certainly have managed to understand that you say that a regex
replacement of the <tr> tags in the second level <table> is not a
perfect solution. I also believe:

1. It will work well enough for my narrow purpose,
2. A regex may be elegant, and
3. I will also learn something from studying it.

I think an actual expert could craft the kernel regex in the same time
required to write your four-line negativistic reply--and that expert
would actually understand its limitations, too. If the expert was
feeling really helpful (though I have no reason to expect such
helpfulness except for fading memories of when usenet was a much more
friendly and helpful place), he or she would provide a regex solution
and share additional wisdom, such as the comparable solution written
with a better approach, or a concrete example of the most obvious
problem with the regex.

Time for a hats trick:

Putting on my mathematician's hat, I like elegance and love learning
about new ways to solve problems. And I still miss working in APL.

Putting on my engineer's hat, Excel is a practical and available tool
and regular expressions are just a waste of time. Don't waste time on
elegance. Mea culpa.

Putting on my technical historian's hat, regular expressions and Perl
are elitist technologies and are fading into insignificance. Just an
observation.


--
Did you know that is sometimes a black hole?
That's right, resident Dubya does NOT care what you emailing peasants
think.
 
 
 
 
 
Jürgen Exner
 
      01-11-2004
Shannon Jacobs wrote:
> [...] or a concrete example of the most obvious
> problem with the regex.


Which parts of the negative examples in FAQ "How do I remove HTML from a
string?" do you have problem with when trying to adapt them to your concrete
"<tr>" problem?

jue


 
 
Shannon Jacobs
 
      01-11-2004
"Jürgen Exner" <> wrote in message news:<mW2Mb.4024$>...
> Shannon Jacobs wrote:
> > [...] or a concrete example of the most obvious
> > problem with the regex.

>
> Which parts of the negative examples in FAQ "How do I remove HTML from a
> string?" do you have problem with when trying to adapt them to your concrete
> "<tr>" problem?
>
> jue


Thank you for the reference to
http://www.perldoc.com/perl5.6/pod/perlfaq9.html. Unfortunately, the
category of structural problem that I encountered is not covered
there, and my source HTML does not include any of the problems covered
in the "tricky cases". If the FAQ included any examples of the use of
HTML::FormatText, or a more concrete reference, it might have been
more helpful.

As it stands, I've decided to return to Excel. Ugly and inelegant (and
typical of Microsoft), but useful and adequate.

With regards to the other recent comments in this thread, I will note:

1. Just because a particular NNTP server does not carry a particular
newsgroup, that does not mean that the newsgroup in question does not
exist.

2. With regards to the unhelpful advice to stop using Perl, I already
have (except for infrequent maintenance work on a few CGI/Perl systems
I wrote some years ago). As noted several times earlier, I am
currently working from a JavaScript perspective, but sought out Perl
people because of the compatibility of the regex implementations and
because of old memories of their expertise (though not found this time
around).

3. I used the term "elitist" in the sense of high technical expertise.
Perhaps I should have tried the XSL community. Recently all I have
seen around Perl are the laziness, impatience, and hubris, but without
the justification of results.
 
 
Jürgen Exner
 
      01-11-2004
Shannon Jacobs wrote:
> "Jürgen Exner" <> wrote in message
> news:<mW2Mb.4024$>...
>> Shannon Jacobs wrote:
>>> [...] or a concrete example of the most obvious
>>> problem with the regex.

>>
>> Which parts of the negative examples in FAQ "How do I remove HTML
>> from a string?" do you have problem with when trying to adapt them
>> to your concrete "<tr>" problem?
>>
>> jue

>
> Thank you for the reference to
> http://www.perldoc.com/perl5.6/pod/perlfaq9.html. Unfortunately, the
> category of structural problem that I encountered is not covered
> there, and my source HTML does not include any of the problems covered
> in the "tricky cases".


Well, ok. Your call. But please keep in mind that first of all these are
just a few examples for illustration. There are more ways to break RE-based
parser code.
And second unless you own and control the source HTML code (which may or may
not be the case, I don't know) this source code can change at any moment
without notice.
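
To make one such breakage concrete, here is a small Python sketch (not taken from the FAQ) comparing the blanket tag-stripping regex with the standard library's `html.parser` on two inputs in which a `>` appears somewhere other than the end of a tag. The helper names and the sample strings are made up for illustration.

```python
import re
from html.parser import HTMLParser


def strip_with_regex(html):
    """Blanket approach: delete anything that looks like <...>."""
    return re.sub(r"<[^>]*>", "", html)


class TextOnly(HTMLParser):
    """Collect only character data, ignoring tags and comments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_with_parser(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# A '>' inside a quoted attribute value, and a '>' inside a comment.
attribute_case = '<td title="a > b">42</td>'
comment_case = "<!-- a > b -->7"

regex_results = [strip_with_regex(attribute_case), strip_with_regex(comment_case)]
parser_results = [strip_with_parser(attribute_case), strip_with_parser(comment_case)]
```

In both cases the regex stops at the first `>` and leaves the tail of the attribute or comment in the output, while the parser returns only the cell text.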

> If the FAQ included any examples of the use of
> HTML::FormatText, or a more concrete reference, it might have been
> more helpful.


That would be a poor use of the FAQ, because instructions and examples are
included in the standard documentation for each module already.

jue


 
 
Tad McClellan
 
      01-11-2004
Shannon Jacobs <> wrote:


> 1. Just because a particular NNTP server does not carry a particular
> newsgroup, that does not mean that the newsgroup in question does not
> exist.



Just because a particular newsgroup _is_ listed on a
server does not mean that the newsgroup actually exists.
That server may be wrong.

comp.lang.perl was rmgroup'd many years ago, servers that still
list it as a valid newsgroup look like they've been neglected
for many years.


> 2. With regards to the unhelpful advice to stop using Perl, I already
> have



Thank you.

We will miss your valuable contributions to the community.


--
Tad McClellan SGML consulting
Perl programming
Fort Worth, Texas
 
 
Shannon Jacobs
 
      01-23-2004
I'm so sorry to hear that the Google Groups system has been "neglected
for many years", as you put it so thoughtfully. It really is
unfortunate that so many people regard Google as a useful information
resource, isn't it?

Incidentally, when I finally had a bit of free time this morning, I
rethought the technical problem and did come up with a trivial
regex-based solution. It did exactly what I required on the first
attempt, confirming that the technical problem was pretty much as
trivial as I had thought it was. I guess it's just too bad that none
of you "experts" and "community contributors" were able to help.

However, this does lead to a new question:

Why did the newsgroups fail to produce the technically trivial answer?

While I can be abrasive or even rude when provoked, there is nothing
like that in my original query. I asked a simple technical question,
and wound up being dragged into a religious war about proper ways to
handle HTML. Not very useful.

If the religious issue of HTML was the problem, my advice to other
people seeking similar help is to avoid mentioning HTML. Try
describing your problem as structured database output, and maybe
you'll have better "luck" than I had.

I still regard regular expressions as useful and worthy of further
study. I cannot say the same thing about most of the people who
responded so religiously to my trivial question.

Oh yeah, I suppose I should give a hint about the solution, even
though it's a bit embarrassing. (I don't mind much as long as I can
feel I learned something along the way.) Returning to the problem
fresh and without the "box" around my thoughts, I looked at the data
files again and asked myself whether there was some other unique
string associated with the data that was associated with the second
level <tr> tags. I picked one of the likely candidates, and sure
enough, it worked. I still think there is a more clever way to do it
considering the logical structure of the HTML tags and the powerful
features of regular expressions, and I'd have been quite glad to learn
something new about those features. That would have been more
instructional than just solving the original rather trivial problem.
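
Since the actual string was never posted, the following Python sketch only guesses at the general shape of such a solution: split the flattened page at each `<tr>` and keep the chunks that contain some marker unique to the data rows. The marker `align="right"`, the function name, and the sample page are all assumptions made for illustration.

```python
import re


def data_rows(html, marker):
    """Return the text of each <tr> chunk that contains the marker string."""
    # Collapse line breaks so that a row is never split across lines.
    flat = re.sub(r"\s*[\r\n]+\s*", " ", html)
    # Split at every <tr ...> tag; each chunk runs up to the next <tr>.
    chunks = re.split(r"<tr\b[^>]*>", flat, flags=re.IGNORECASE)
    rows = []
    for chunk in chunks:
        if marker in chunk:
            text = re.sub(r"<[^>]*>", " ", chunk)  # drop the remaining tags
            rows.append(" ".join(text.split()))    # normalize whitespace
    return rows


# Made-up page; only the inner rows carry the hypothetical marker.
SAMPLE = """
<table>
<tr><td>Header
<tr><td>
<table>
<tr><td align="right">10<td align="right">20
<tr><td align="right">30<td align="right">40
</table>
<tr><td>legend
</table>
"""

rows = data_rows(SAMPLE, 'align="right"')
```

This relies on a property of the data rather than on the table nesting, so it would stop working if the marker ever showed up in a first-level row.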

(By the way, the Excel-based solution was just TOO ugly to bear.)

(Tad McClellan) wrote in message news:<>.. .
> Shannon Jacobs <> wrote:
>
>
> > 1. Just because a particular NNTP server does not carry a particular
> > newsgroup, that does not mean that the newsgroup in question does not
> > exist.

>
>
> Just because a particular newsgroup _is_ listed on a
> server does not mean that the newsgroup actually exists.
> That server may be wrong.
>
> comp.lang.perl was rmgroup'd many years ago, servers that still
> list it as a valid newsgroup look like they've been neglected
> for many years.
>
>
> > 2. With regards to the unhelpful advice to stop using Perl, I already
> > have

>
>
> Thank you.
>
> We will miss your valuable contributions to the community.

 
 
Matt Garrish
 
      01-23-2004

"Shannon Jacobs" <> wrote in message
news: om...
> I'm so sorry to hear that the Google Groups system has been "neglected
> for many years", as you put it so thoughtfully. It really is
> unfortunate that so many people regard Google as a useful information
> resource, isn't it?
>


Well, if Google still archives the messages then it must be a group. Someone
should re-revise this horribly outdated faq:

http://www.perldoc.com/perl5.8.0/pod...ost-questions-

>
> Why did the newsgroups fail to produce the technically trivial answer?
>


Because the point of this newsgroup is NOT to produce technically trivial
answers, because technically trivial answers are useless. So what if you
found some way you think might work for you? What good would posting some
bad advice that's bound to fail but that might do the job for you do for
someone searching on the same topic? Parsing html questions come up every
few days. Do you think people here want to sit and answer them with
technically trivial answers over and over again? Do you think they want to
be responding to questions along the lines of "Duh, how come this trivial
answer didn't work for me?"?

Get a life. You got flamed for asking a stupid question. If you had any
knowledge of markup languages you wouldn't have even asked it. And if you
don't like being told you're dumb, don't post to usenet.

Matt


 
 
John W. Kennedy
 
      01-24-2004
Shannon Jacobs wrote:
> I'm so sorry to hear that the Google Groups system has been "neglected
> for many years", as you put it so thoughtfully. It really is
> unfortunate that so many people regard Google as a useful information
> resource, isn't it?


Google Groups is an archive, and, as such, obviously does not delete
obsolete groups.

> Incidentally, when I finally had a bit of free time this morning, I
> rethought the technical problem and did come up with a trivial
> regex-based solution.


No you didn't, because it's impossible. Either you misstated your
requirement, your "solution" does not work, or it is not "regex-based".

> Oh yeah, I suppose I should give a hint about the solution, even
> though it's a bit embarrassing. (I don't mind much as long as I can
> feel I learned something along the way.) Returning to the problem
> fresh and without the "box" around my thoughts, I looked at the data
> files again and asked myself whether there was some other unique
> string associated with the data that was associated with the second
> level <tr> tags. I picked one of the likely candidates, and sure
> enough, it worked.


In other words, you came up with an ad-hoc solution that does not
involve the use of regex's for parsing (which regex's cannot do), and
which no-one here could possibly have thought of, since it involves
facts that you never mentioned.

That's a cute job of drawing your target around the bullet holes, but
you can't really expect adults to be impressed by that, can you?

--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"
 