Velocity Reviews

Jerome --- 11-20-2006 10:52 PM

Hpricot - best way to parse based on comments
 
I am trying to parse some files that contain comments like this:

<html>
<body>

<!-- BEGIN ad_content -->

images, text, etc...

<!-- END ad_content -->

Interesting text of site here.

</body>
</html>


I am wondering how to go about extracting the data within the comments
block using Hpricot. I am not aware of a way to refer to commented HTML
through CSS or XPath selectors.

Thanks for any ideas!

- Jerome



Keith Fahlgren 11-20-2006 11:50 PM

Re: Hpricot - best way to parse based on comments
 
On 11/20/06, Jerome --- <jerome@tut0r.com> wrote:
> I am trying to parse some files that contain comments like this:
> ...
> I am not aware of a way to refer to commented HTML
> through CSS or XPath selectors.


The XPath comment() node test selects every comment node in a document.

For example, using the XMLStarlet command-line tool (the XPath expression
follows the -m flag):
keith@devel ~ $ xml sel -t -m '//comment()' -v '.' -n simple.xml
one comment
two comment

keith@devel ~ $ cat simple.xml
<simple>
<!-- one comment -->
<foo/>
<!-- two comment -->
<bar/>
</simple>
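
The same query can be sketched in Ruby. This is only an illustration and uses REXML rather than Hpricot; whether Hpricot's own XPath support accepts comment() is not confirmed here. The variable names are made up for the example. Note that comment() returns the comment nodes themselves, not the content that sits between two of them.

```ruby
require "rexml/document"

# Same sample document as simple.xml above.
xml = <<~DOC
  <simple>
  <!-- one comment -->
  <foo/>
  <!-- two comment -->
  <bar/>
  </simple>
DOC

doc = REXML::Document.new(xml)

# "//comment()" matches every comment node anywhere in the document.
# Each match is a REXML::Comment; #string holds the text between <!-- and -->.
comments = REXML::XPath.match(doc, "//comment()").map { |c| c.string.strip }

comments.each { |text| puts text }
```

Once the BEGIN and END comment nodes are located this way, the nodes between them would still have to be collected separately.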


HTH,
Keith


Ken Bloom 11-21-2006 03:20 PM

Re: Hpricot - best way to parse based on comments
 
On Tue, 21 Nov 2006 07:52:12 +0900, Jerome --- wrote:

> I am wondering how to go about extracting the data within the comments
> block using Hpricot. I am not aware of a way to refer to commented HTML
> through CSS or XPath selectors.


Why not gsub out the unwanted sections before parsing with Hpricot? Or,
if the data you want is nested between the comments, use a regexp to
narrow the document down to only the text between them before parsing
with Hpricot.
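
A minimal sketch of both approaches, assuming the markers look exactly like the ones in the original post (BEGIN ad_content / END ad_content). The variable names and the pattern are illustrative, and the pattern would need adjusting if the markers vary or nest.

```ruby
# Sample input modeled on the original post.
page = <<~PAGE
  <html>
  <body>

  <!-- BEGIN ad_content -->

  images, text, etc...

  <!-- END ad_content -->

  Interesting text of site here.

  </body>
  </html>
PAGE

# Matches from the BEGIN marker through the END marker.
# The /m flag lets "." span newlines; ".*?" stops at the first END marker.
ad_block = /<!--\s*BEGIN ad_content\s*-->(.*?)<!--\s*END ad_content\s*-->/m

# Approach 1: gsub the marked section out, keeping the rest of the page.
without_ads = page.gsub(ad_block, "")

# Approach 2: keep only the text between the markers (capture group 1).
ad_content = page[ad_block, 1].to_s.strip

puts ad_content
```

Either string can then be handed to Hpricot for normal parsing.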

--Ken Bloom

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

