Velocity Reviews

Mark G 12-03-2009 03:24 AM

Beginner Q. interrogate html object OR file search?
 
Hi all,

I am new to Python and don't yet know the libraries well. What would
be the best way to approach this problem: I have an HTML file parsing
script, and the file sits on my hard drive. I want to extract the date
modified from the metadata. Should I read through the lines of the file
doing a string.find to look for the character patterns of the meta
tag, or should I use a DOM-type library to retrieve the HTML element I
want? Which is best practice? Which occupies the least code?

Regards, Mark

inhahe 12-03-2009 05:19 AM

Re: Beginner Q. interrogate html object OR file search?
 
or i guess you could go the middle-way and just use regex.
people generally say don't use regex for html (regex can't do the
nesting), but it's what i would do in this case.
though i don't exactly understand the question, re the html file
parsing script you say you have already, or how the date is 'modified
from' the meta-data.
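
A minimal sketch of the regex route, assuming the tag is written as <meta name="DC.Date.Modified" content="..."> with name before content and double-quoted values. The function name and the sample string are made up for illustration; a tag with the attributes in another order would need a looser pattern.

```python
import re

# Assumption: name comes before content and both values use double quotes.
META_DATE = re.compile(
    r'<meta\s+name="DC\.Date\.Modified"\s+content="([^"]*)"',
    re.IGNORECASE,
)

def find_date_modified(html_text):
    """Return the content of the DC.Date.Modified meta tag, or None."""
    match = META_DATE.search(html_text)
    return match.group(1) if match else None

# Hypothetical sample document, not taken from the original poster's files.
sample = '<html><head><meta name="DC.Date.Modified" content="2009-11-17" /></head></html>'
print(find_date_modified(sample))
```

Because the pattern only looks at one flat tag, the usual objection about regex and nested HTML does not really bite here.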


Mark G 12-03-2009 05:32 AM

Re: Beginner Q. interrogate html object OR file search?
 
I'm tempted to use regex too. I have done a bit of perl & bash, and
that is how I would do it with those.

However, I thought there would be a smarter way to do it with
libraries. I have done some digging through the libraries and think I
can do it with xml.dom.minidom. Here is what I want to try...

import xml.dom.minidom

# if html file already exists, inherit the created date
# 'output' is the filename for the parsed file
# (minidom is an XML parser, so the file must be well-formed XHTML)
html_xml = xml.dom.minidom.parse(output)
for node in html_xml.getElementsByTagName('meta'):  # visit every <meta /> node
    # debug: print(node.toxml())
    if node.getAttribute("name") == "DC.Date.Modified":
        # getAttribute returns the attribute value as a string ("" if absent)
        date_created = node.getAttribute("content")
        print(date_created)

I haven't quite got up to here in my debugging. I'll let you know if
it works...

RE: your questions. 1) I already have the script in bash and am
converting it to Python :) I'm halfway through.
2) I want to extract the value of the tag <meta
name="DC.Date.Modified" content="2009-11-17">


r0g 12-03-2009 07:24 AM

Re: Beginner Q. interrogate html object OR file search?
 
Beautiful Soup is the HTML parser of choice, partly because it handles
badly formed HTML well.

http://www.crummy.com/software/BeautifulSoup/
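
Beautiful Soup is a third-party install. As a rough stand-in, here is a sketch using the standard library's html.parser (not Beautiful Soup itself), which also copes with sloppy markup. The class name MetaDateFinder and the sample string are invented for illustration, and the tag layout is assumed to be <meta name="DC.Date.Modified" content="...">.

```python
from html.parser import HTMLParser

class MetaDateFinder(HTMLParser):
    """Remember the content of <meta name="DC.Date.Modified">."""

    def __init__(self):
        super().__init__()
        self.date_modified = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "meta":
            attributes = dict(attrs)
            if attributes.get("name") == "DC.Date.Modified":
                self.date_modified = attributes.get("content")

# Deliberately sloppy HTML: unclosed <p>, no closing </body> or </html>
sample = '<html><head><meta name="DC.Date.Modified" content="2009-11-17"></head><body><p>text'
finder = MetaDateFinder()
finder.feed(sample)
finder.close()
print(finder.date_modified)
```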


If the date info occurs at a consistent offset from the start of the tag
then you can use simple string slicing to snip out the date. If not
then, as others suggest, regex is your friend.

If you need to convert a date/time string back into a Unix-style
timestamp, chop the string into bits, put them into a tuple of length 9
and give that to time.mktime()...

import time

def time_to_timestamp( t ):
    # t is a string such as "2009-11-17 12:30:00"
    return time.mktime( (int(t[0:4]), int(t[5:7]), int(t[8:10]),
        int(t[11:13]), int(t[14:16]), int(t[17:19]), 0, 0, 0) )

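Note the last 3 values (weekday, day of year and DST flag) are hardcoded
to 0. mktime() ignores the first two, and 0 for the DST flag means "no
daylight saving"; pass -1 there if you want the library to decide. The
strings I deal with look like YYYY/MM/DD hh:mm:ss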
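
An alternative sketch that lets time.strptime() do the chopping instead of hand-written slices. The function name string_to_timestamp and the default format string are assumptions for illustration; adjust the format to match your own strings.

```python
import time

def string_to_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a local date/time string and return seconds since the epoch."""
    parsed = time.strptime(text, fmt)  # struct_time with the DST flag left at -1 (unknown)
    return time.mktime(parsed)         # mktime treats the fields as local time

stamp = string_to_timestamp("2009-11-17 12:30:00")
# Converting back to local time should give the same fields we started with.
print(time.localtime(stamp)[:6])
```

The timestamp itself depends on the machine's time zone, so the example prints the round-tripped fields rather than the raw number.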


Roger.

Steven D'Aprano 12-03-2009 07:33 AM

Re: Beginner Q. interrogate html object OR file search?
 
On Wed, 02 Dec 2009 19:24:07 -0800, Mark G wrote:

> Hi all,
>
> I am new to python and don't yet know the libraries well. What would be
> the best way to approach this problem: I have a html file parsing script
> - the file sits on my harddrive. I want to extract the date modified
> from the meta-data. Should I read through lines of the file doing a
> string.find to look for the character patterns of the meta- tag,


That will probably be the fastest, simplest, and easiest to develop. But
the downside is that it will be subject to false positives, if some tag
happens to include text which by chance looks like your meta-data. So,
strictly speaking, this approach is incorrect.

> or
> should I use a DOM type library to retrieve the html element I want?
> Which is best practice?


"Best practice" would imply DOM.

As for which you use, you need to weigh up the risks of a false positive
versus the convenience and speed of string matching versus the
correctness of a DOM approach.
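
To make the false-positive risk concrete, here is a small sketch comparing the two approaches on a made-up page where an old tag has been commented out. The page content and both function names are hypothetical, and the markup is assumed to be well-formed XHTML so that xml.dom.minidom can parse it.

```python
import xml.dom.minidom

# Hypothetical page: an old, commented-out tag sits before the live one.
page = """<html><head>
<!-- <meta name="DC.Date.Modified" content="2001-01-01" /> -->
<meta name="DC.Date.Modified" content="2009-11-17" />
</head><body></body></html>"""

def date_by_string_search(text):
    """Naive search: take whatever follows the first matching pattern."""
    marker = 'name="DC.Date.Modified" content="'
    start = text.find(marker)
    if start == -1:
        return None
    start += len(marker)
    return text[start:text.index('"', start)]

def date_by_dom(text):
    """DOM search: only real <meta> elements are considered."""
    document = xml.dom.minidom.parseString(text)
    for node in document.getElementsByTagName("meta"):
        if node.getAttribute("name") == "DC.Date.Modified":
            return node.getAttribute("content")
    return None

print(date_by_string_search(page))  # picks up the commented-out date
print(date_by_dom(page))            # sees only the live element
```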


> which occupies least code?


Unless you're writing for an embedded system, or the difference is
vast (e.g. 300 lines versus 30), that's premature optimization.

Personally, I'd use string matching or a regex, and feel guilty about it.



--
Steven

