Hpricot elem index/position

 
 
henryturnerlists@googlemail.com
      10-06-2008
Hey,

Trying to find the String index of an Hpricot::Elem within its doc.
For example..

doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
elem = doc.search("a")[1]
elem.start #=> 10 ( the first '<' of the second a tag.)

and eventually the following would be good..

elem.length #=> 12
elem.end #=> 21

Any thoughts appreciated!
Henners

 
Mark Thomas
      10-06-2008
On Oct 6, 10:19 am, "(E-Mail Removed)"
<(E-Mail Removed)> wrote:
> Hey,
>
> Trying to find the String index of an Hpricot::Elem within its doc.
> For example..
>
> doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
> elem = doc.search("a")[1]
> elem.start #=> 10 ( the first '<' of the second a tag.)
>
> and eventually the following would be good..
>
> elem.length #=> 12
> elem.end #=> 21
>
> Any thoughts appreciated!
> Henners


My first thought is: Why do you want that information? Character
position is meaningless in an XML and HTML DOM. Whitespace can change
character positions without affecting the DOM at all.

-- Mark.
 
henryturnerlists@googlemail.com
      10-07-2008
Hi Mark,

I'm writing a broken-link reporting tool. When I find a dodgy tag
I'd like to be able to relay the character position and/or line number
to the user. Useful for debugging.

Thanks -h

 
Mark Thomas
      10-07-2008
On Oct 7, 3:58 am, "(E-Mail Removed)"
<(E-Mail Removed)> wrote:
> Hi Mark,
>
> I'm writing a broken link reporting type tool. When I find a dodgy tag
> I'd like to be able to relay the character position and or line number
> to the user. Useful for debugging.


So, are you really interested in broken *links* (as in a GET does not
return a 200 result code) or broken HTML? I have done the former via
AJAX (jQuery sends each link to a backend Rails action, and if the link
is broken, changes its class to display a red background). The
latter may be possible with libxml, which reports the character
position of broken input.

-- Mark.
 
 
henryturnerlists@googlemail.com
      10-07-2008
Well, I suppose there are incorrectly formatted links too... I was
talking about correctly formatted links that point to a 400+ status
code resource. Something libxml would not pick up since I guess you're
talking about its syntax checking bit.

Since the entire document is accessible from the Hpricot::Elem it
seems plausible to count the characters up to and after the element. A
15min look at the source didn't reveal anything obvious.. Have a nasty
feeling that this type of thing would have to be done in the compiled
C section of it..
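
The "count the characters up to the element" idea can be approximated without touching Hpricot's internals by scanning the original source string. The sketch below is only that: `element_span` is a hypothetical helper, not part of Hpricot, and it uses a regex rather than a real parser, so it assumes the tag is not nested inside another tag of the same name. The index `n` is meant to line up with `doc.search("a")[n]` for flat markup like the example in the first post.

```ruby
# Hypothetical helper (not an Hpricot API): locate the nth (0-based)
# <tag>...</tag> in the raw HTML string and report where it sits.
# Regex-based, so same-name nested tags are not handled.
def element_span(html, tag, n)
  name = Regexp.escape(tag)
  pattern = /<#{name}\b[^>]*>.*?<\/#{name}>/mi
  pos = 0
  match = nil
  (n + 1).times do
    match = pattern.match(html, pos)
    return nil unless match          # fewer than n + 1 elements in the source
    pos = match.end(0)               # continue scanning after this element
  end
  start = match.begin(0)
  {
    start:  start,                          # index of the opening '<'
    length: match[0].length,                # characters in the whole element
    end:    match.end(0) - 1,               # index of the closing '>'
    line:   html[0...start].count("\n") + 1 # 1-based line number
  }
end

span = element_span("<a>bob</a><a>james</a><a>dan</a>", "a", 1)
puts span.inspect
```

For the example string this gives a start of 10, a length of 12 and an end of 21, which are the values asked for in the first post.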

 
Mark Thomas
      10-07-2008
On Oct 7, 10:28 am, "(E-Mail Removed)"
<(E-Mail Removed)> wrote:
> Well, I suppose there are incorrectly formatted links too... I was
> talking about correctly formatted links that point to a 400+ status
> code resource. Something libxml would not pick up since I guess you're
> talking about its syntax checking bit.

Well, libxml stores the line number of every element. So you can
extract all links, check them, and print out element.line_num for each
one that fails the check.

Here's some starter code:

#----------------------------------------------

require 'rubygems'
require 'xml'

# Ask libxml to record the source line number of each node.
XML::Parser.default_line_numbers = true

html = <<END_HTML
<html>
<head><title>test</title></head>
<body>
Here is a <a href="http://brok.en">broken link.</a>
</body>
</html>
END_HTML

parser = XML::Parser.string html
doc = parser.parse

# Placeholder check: treats every link as broken.
def broken?(link)
  true
end

doc.find("//a[@href]").each do |link|
  if broken?(link)
    puts "Broken link to #{link['href']} on line #{link.line_num}"
  end
end
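
The `broken?` stub above always returns true. One way it might be filled in, using only Ruby's standard `net/http`, is sketched below. Several things here are assumptions rather than anything from the thread: this version takes the href string instead of the link node (so it would be called as `broken?(link['href'])`), it sends a HEAD request, it uses 5-second timeouts, it treats any network error as broken, and it skips hrefs that are not absolute http(s) URLs. `broken_status?` is a hypothetical helper name.

```ruby
require 'net/http'
require 'uri'

# A status of 400 or above counts as broken ("400+" in the thread's words).
def broken_status?(code)
  code.to_i >= 400
end

# Takes an href string. Returns true when the target looks broken.
def broken?(href)
  uri = URI.parse(href)
  # Relative links, mailto: and similar cannot be checked here, so skip them.
  # (URI::HTTPS is a subclass of URI::HTTP.)
  return false unless uri.is_a?(URI::HTTP) && uri.host

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 5) do |http|
    http.head(uri.request_uri)
  end
  broken_status?(response.code)
rescue StandardError
  # Unparsable href, DNS failure, refused connection, timeout, and so on.
  true
end
```

Some servers answer HEAD differently from GET, so switching `http.head` to `http.get` may be worth trying if results look wrong. Redirects (3xx) are counted as not broken in this sketch.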
 
 
Mark Thomas
      10-08-2008
On Oct 7, 1:36 pm, I wrote:
> Well, libxml stores the line number of every element. So you can
> extract all links, check them, and print out element.line_num for each
> one that fails the check.


Oops, my example mistakenly used the XML parser, so replace that with
XML::HTMLParser since you are parsing HTML.

-- Mark.
 
 
henryturnerlists@googlemail.com
      10-08-2008
Thanks for the hint towards libxml-ruby! I didn't even know it
existed. Can't see anything for character position but very happy
indeed. Will have a go at implementing it myself when poss..

cheers
-h
