identifying and parsing string in text file

 
 
Bryan.Fodness@gmail.com
      03-08-2008
I have a large file that has many lines like this,

<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>

I would like to identify the line by the tag (300a,0014) and then grab
the name (DoseReferenceStructureType) and value (SITE).

I would like to create a file that would have the structure,

DoseReferenceStructureType = Site
...
...

Also, there is a possibility that there are multiple lines with the
same tag, but different values. These all need to be recorded.

So far, I have a little bit of code to look at everything that is
available,

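import sys

for line in open(sys.argv[1]):
    i_line = line.split()
    # skip blank lines and anything that is not an <element ...> line
    if i_line and i_line[0] == "<element":
        a = i_line[1]  # tag="..."
        b = i_line[5]  # name="...">value</element>
        print("%s | %s" % (a, b))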

but do not see a clever way of doing what I would like.

Any help or guidance would be appreciated.

Bryan
 
Bernard
      03-08-2008
Hey Bryan,

The text you are trying to parse looks like XML/HTML,
so I'd use BeautifulSoup[1] if I were you.

Here's some sample code for your scraping case:

<python>

from BeautifulSoup import BeautifulSoup

# assume the s variable has your text
s = "whatever xml or html here"
# turn it into a tasty & parsable soup
soup = BeautifulSoup(s)
# for every element tag in the soup
for el in soup.findAll("element"):
    # print out its tag & name attribute plus its inner value!
    print("%s %s %s" % (el["tag"], el["name"], el.string))

</python>

that's it!

[1] http://www.crummy.com/software/BeautifulSoup/

Nemesis
      03-08-2008
(E-Mail Removed) wrote:

> I have a large file that has many lines like this,
>
> <element tag="300a,0014" vr="CS" vm="1" len="4"
> name="DoseReferenceStructureType">SITE</element>
>
> I would like to identify the line by the tag (300a,0014) and then grab
> the name (DoseReferenceStructureType) and value (SITE).
>
> I would like to create a file that would have the structure,
>
> DoseReferenceStructureType = Site
> ...
> ...


You should try regular expressions, or, if it is something like XML, there
is surely a library you can use to parse it ...
Anyway, you can try something simpler like this:

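import sys

elem_dic = dict()
for line in open(sys.argv[1]):
    line_splitted = line.split()
    for item in line_splitted:
        item_splitted = item.split("=")
        # keep only the attr="value" pieces; note that each key is
        # overwritten when a later line carries the same attribute
        if len(item_splitted) > 1:
            elem_dic[item_splitted[0]] = item_splitted[1]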

... then you have to retrieve the items you need from the dict. For example,
the line you posted splits into these items:

['<element']
['tag', '"300a,0014"']
['vr', '"CS"']
['vm', '"1"']
['len', '"4"']
['name', '"DoseReferenceStructureType">SITE</element>']

and elem_dic will contain the last five, with the keys
'tag', 'vr', 'vm', 'len', 'name' and the values "300a,0014" etc.
(quotes included), i.e. this:

{'vr': '"CS"', 'tag': '"300a,0014"', 'vm': '"1"', 'len': '"4"', 'name': '"DoseReferenceStructureType">SITE</element>'}
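
The regular-expression route mentioned above could be sketched roughly as follows. This is only an illustration, not tested against the real file: the pattern name ELEMENT_RE and the helper collect_elements are made up here, and the pattern assumes double-quoted attributes with tag before name and no markup nested inside the element. It keeps every match, so repeated tags with different values are all recorded.

```python
import re
from collections import defaultdict

# Hypothetical pattern: assumes double-quoted attributes, "tag" before
# "name", and plain text (no nested markup) inside the element.
ELEMENT_RE = re.compile(
    r'<element\s+tag="(?P<tag>[^"]*)"[^>]*?\sname="(?P<name>[^"]*)"[^>]*>'
    r'(?P<value>[^<]*)</element>'
)

def collect_elements(text):
    """Map each tag to a list of (name, value) pairs, keeping duplicates."""
    found = defaultdict(list)
    for m in ELEMENT_RE.finditer(text):
        found[m.group("tag")].append((m.group("name"), m.group("value")))
    return found

sample = '''<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>'''

for name, value in collect_elements(sample)["300a,0014"]:
    print("%s = %s" % (name, value))
```

Because the pattern runs over the whole text rather than line by line, an element that wraps onto a second line is still matched. It will break on anything less regular than the posted sample, which is where a real parser earns its keep.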




--
Age is not a particularly interesting subject. Anyone can get old. All
you have to do is live long enough.

 
Paul McGuire
      03-08-2008
On Mar 8, 2:02 pm, Nemesis <(E-Mail Removed)> wrote:
> (E-Mail Removed) wrote:
> > I have a large file that has many lines like this,
> >
> > <element tag="300a,0014" vr="CS" vm="1" len="4"
> > name="DoseReferenceStructureType">SITE</element>
> >
> > I would like to identify the line by the tag (300a,0014) and then grab
> > the name (DoseReferenceStructureType) and value (SITE).
>
> You should try with Regular Expressions or if it is something like xml there
> is for sure a library you can use to parse it ...
<snip>

When it comes to parsing HTML or XML of uncontrolled origin, regular
expressions are an iffy proposition. You'd be amazed what kind of
junk shows up inside an XML (or worse, HTML) tag.

Pyparsing includes a builtin method for constructing tag matching
parsing patterns, which you can then use to scan through the XML or
HTML source:

from pyparsing import makeXMLTags, withAttribute, SkipTo

testdata = """
<blah>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
<blahblah>
"""

elementStart, elementEnd = makeXMLTags("element")
# accept only <element> tags whose "tag" attribute is exactly "300a,0014"
elementStart.setParseAction(withAttribute(tag="300a,0014"))
search = elementStart + SkipTo(elementEnd)("body")

for t in search.searchString(testdata):
    print(t.name)
    print(t.body)

Prints:

DoseReferenceStructureType
SITE
DoseReferenceStructureType
SITE2

In this case, the parse action withAttribute filters <element> tag
matches, accepting *only* those with the attribute "tag" and the value
"300a,0014". The pattern search adds on the body of the
<element></element> tag, and gives it the name "body" so it is easily
accessed after parsing is completed.

-- Paul
(More about pyparsing at http://pyparsing.wikispaces.com.)
 
bruno.desthuilliers@gmail.com
      03-09-2008
On 8 mar, 20:49, "(E-Mail Removed)" <(E-Mail Removed)>
wrote:
> I have a large file that has many lines like this,
>
> <element tag="300a,0014" vr="CS" vm="1" len="4"
> name="DoseReferenceStructureType">SITE</element>
>
> I would like to identify the line by the tag (300a,0014) and then grab
> the name (DoseReferenceStructureType) and value (SITE).


It's obviously an XML file, so use an XML parser - there are SAX and
DOM parsers in the stdlib, as well as the ElementTree module.
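
A minimal ElementTree sketch of that idea might look like the following. It assumes the file is well-formed XML with a single enclosing element (the <root> wrapper in the sample is an assumption, since the thread only shows individual <element> lines), and the helper name names_and_values is made up for illustration.

```python
import xml.etree.ElementTree as ET

def names_and_values(xml_text, wanted_tag):
    """Yield (name, value) for every <element> whose tag attribute matches."""
    root = ET.fromstring(xml_text)
    for el in root.iter("element"):
        if el.get("tag") == wanted_tag:
            yield el.get("name"), (el.text or "").strip()

# Sample data in the shape shown in this thread; the <root> wrapper is assumed.
sample = """<root>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
</root>"""

lines = ["%s = %s" % pair for pair in names_and_values(sample, "300a,0014")]
print("\n".join(lines))
```

Every matching element is reported, so repeated tags with different values are all kept. To get the "name = value" file from the original question, the same lines can be written out with an ordinary open(..., "w") instead of printed.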

 