Velocity Reviews - Computer Hardware Reviews

Newbie ? -- SGML metadata extraction

 
 
ProvoWallis
 
      01-16-2006
Hi,

I'm trying to write a script that will extract the value of an
attribute from an element using the attribute value of another element
as the basis for extraction.

For example, I have a pre-defined list of main sections. I want to
extract the id attribute of each form element and build a dictionary of
graphic ID and section number pairs, but only for the sections on my
list; the id values from any section that is not on the list should be
excluded. I.e., I want to know the id value for the forms that appear in
sections 1 and 3 but not in 2.

Boiled down my SGML looks something like this:

<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">

This is what I have come up with on my own so far. My problem is that I
can't seem to pick up the value of the id attribute.

Any advice appreciated.

Greg

###

import os, re, csv

root = raw_input("Enter the path where the program should run: ")
fname = raw_input("Enter name of the CSV file containing the section numbers: ")
sgmlname = raw_input("Enter name of the SGML file to search: ")
print

given,ext = os.path.splitext(fname)
root_name = os.path.join(root,fname)
n = given + '.new'
outputName = os.path.join(root,n)

reader = csv.reader(open(root_name, 'r'), delimiter=',')

sections = []

for row in reader:
    sections.append(row[0])


inputFile = open(os.path.join(root,sgmlname), 'r')

illoList ={}

while 1:
    lines = inputFile.readlines()
    if not lines:
        break
    for line in lines:

        main = re.search(r'(?i)(?m)(?s)<main-section no=\"(\w+)\"', line)
        id = re.search(r'(?i)id=\"(.*?tif)\"', line)
        if main is not None and main.group(1) in sections:

            if id is not None:

                illoList[id.group(1)] = main.group(1)
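
In the sample above the section number and the form id never sit on the same line, so a per-line test that needs both matches at once will not fire. One way around that, sketched here in Python 3 rather than the Python 2 of the post, is to remember the most recent section number while looping. The names `map_graphics_to_sections`, `SECTION_RE`, `FORM_RE` and `wanted` are made up for this illustration, and the sample string stands in for the real SGML file.

```python
import re

# Sample markup copied from the post; a stand-in for the real SGML file.
SGML = '''<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
'''

# Hypothetical patterns: one for the section start tag, one for the form tag.
SECTION_RE = re.compile(r'<main-section\s+no="(\w+)"', re.IGNORECASE)
FORM_RE = re.compile(r'<form\s+id="([^"]*\.tif)"', re.IGNORECASE)


def map_graphics_to_sections(lines, wanted):
    """Return {graphic id: section number} for the sections in `wanted`."""
    current = None  # section number seen most recently, carried across lines
    result = {}
    for line in lines:
        section = SECTION_RE.search(line)
        if section:
            current = section.group(1)
        form = FORM_RE.search(line)
        if form and current in wanted:
            result[form.group(1)] = current
    return result


illo_list = map_graphics_to_sections(SGML.splitlines(), wanted={"1", "3"})
for graphic, section in illo_list.items():
    print(graphic, section)
```

In the original script, `wanted` would presumably be the `sections` list read from the CSV file.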

 
 
Adonis
 
      01-17-2006
ProvoWallis wrote:

<snip>

From what I gather, here is a quickie. There are probably better
solutions on the way, but I think this accomplishes the idea.

Some helpful links:
http://docs.python.org/lib/module-sgmllib.html
http://docs.python.org/lib/module-HTMLParser.html
http://docs.python.org/lib/module-htmllib.html

---

from HTMLParser import HTMLParser

data = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""

class ParseForms(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # attrs argument is a list of tuples [(attribute, value)]
            # converted it to a dictionary to access attribute easier
            print "form id: %s" % dict(attrs).get('id')

if __name__ == "__main__":
    parser = ParseForms()
    parser.feed(data)
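
For readers on Python 3, where the `HTMLParser` class lives in the `html.parser` module and `print` is a function, the same idea might look roughly like the sketch below. The class name `CollectFormIds` and the `form_ids` attribute are invented for this example, and the data is a shortened copy of the sample.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser

# Shortened copy of the sample markup from the thread.
DATA = '''<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">
'''


class CollectFormIds(HTMLParser):
    """Collect the id attribute of every <form> start tag (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.form_ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; dict() makes lookup easier.
        if tag == "form":
            self.form_ids.append(dict(attrs).get("id"))


parser = CollectFormIds()
parser.feed(DATA)
parser.close()
print(parser.form_ids)
```

Collecting the ids in a list instead of printing them makes the result easier to reuse later.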
 
 
ProvoWallis
 
      01-17-2006
Thanks. One more question, though.

I'm not sure how to limit the scope of my search so that I'm just
extracting the id attribute from the sections that I want. I.e., I want
the id attributes from the forms in sections 1 and 3 but not from 2.

Maybe I'm missing something.

 
 
Adonis
 
      01-17-2006
ProvoWallis wrote:
> Thanks. One more question, though.
>
> I'm not sure how to limit the scope of my search so that I'm just
> extracting the id attribute from the sections that I want. I.e., I want
> the id attributes from the forms in sections 1 and 3 but not from 2.
>
> Maybe I'm missing something.
>


If the data has closing tags, this is easily achieved using a DOM or SAX
parser, but here is a slightly modified version, very ugly but simple.
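
As an aside on the closing-tags remark, and separate from the modified HTMLParser version further down: if every section and form were properly closed, a DOM parser could scope the search to one section at a time. The Python 3 sketch below uses `xml.dom.minidom` on a hypothetical well-formed variant of the sample. The `<doc>` root element and the closing tags are assumptions added for illustration and are not in the actual data.

```python
from xml.dom import minidom

# Hypothetical well-formed variant of the sample: a <doc> root was added and
# every element is closed. The real SGML in this thread has no closing tags.
XML = '''<doc>
<main-section no="1"><form id="graphic_1.tif"/><form id="graphic_2.tif"/></main-section>
<main-section no="2"><form id="graphic_3.tif"/></main-section>
<main-section no="3"><form id="graphic_4.tif"/></main-section>
</doc>'''

wanted = {"1", "3"}  # section numbers to keep (illustrative)
doc = minidom.parseString(XML)

ids_by_section = {}
for section in doc.getElementsByTagName("main-section"):
    number = section.getAttribute("no")
    if number in wanted:
        # Only forms nested inside this section element are returned.
        forms = section.getElementsByTagName("form")
        ids_by_section[number] = [form.getAttribute("id") for form in forms]

print(ids_by_section)
```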

Hope this helps.

Adonis

---

from HTMLParser import HTMLParser

data = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""

class ParseForms(HTMLParser):

    _section = None
    _secDict = dict()

    def getSection(self, key):
        return self._secDict.get(str(key))

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            if not self._secDict.has_key(self._section):
                self._secDict[self._section] = [dict(attrs).get('id')]
            else:
                self._secDict[self._section].append(dict(attrs).get('id'))

        if tag == "main-section":
            self._section = dict(attrs).get('no')

if __name__ == "__main__":
    parser = ParseForms()
    parser.feed(data)
    print parser.getSection(1)
    print parser.getSection(3)
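
To get the dictionary of graphic ID and section number pairs asked for in the first post, the same state-tracking idea can be turned around so the filtering happens while parsing. The following is only a Python 3 sketch: `SectionForms`, `wanted`, `current` and `illo_list` are names chosen for this example, and the list of wanted sections is hard-coded here where the original script would read it from a CSV file.

```python
from html.parser import HTMLParser

# Sample markup copied from the thread.
DATA = '''<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
'''


class SectionForms(HTMLParser):
    """Map each form id to its section number, for wanted sections only."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)  # section numbers to keep, as strings
        self.current = None        # number of the section being read
        self.illo_list = {}        # {graphic id: section number}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "main-section":
            self.current = attributes.get("no")
        elif tag == "form" and self.current in self.wanted:
            self.illo_list[attributes.get("id")] = self.current


parser = SectionForms(wanted=["1", "3"])
parser.feed(DATA)
parser.close()
print(parser.illo_list)
```

Keeping the state in instance attributes set in `__init__` means each parser object gets its own dictionary, which matters if more than one file is parsed in the same run.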

 
 
ProvoWallis
 
      01-18-2006
Thanks very much for your help. It's greatly appreciated.

It took a couple of tries to see what was happening, but I've figured
it out.

Greg

 