Regular Expression assistance

 
 
Steve Dunn
12-29-2003
I'm wondering if anyone can help with the following problem:

I have the following text:

<DOCUMENT>

<TYPE>EX-5

<SEQUENCE>3

<DESCRIPTION>OPINION OF

BRADLEY ARANT, ET AL.

<TEXT>

..

And I have the following (multi-line) regular expression:

^<([^/].+?[^/])>([\S ]+)



This correctly matches any line that contains "<tag>any characters" but not
"</tag>" or "<tag>". The following captures are returned from the
expression:

1 => TYPE

2 => EX-5



1 => SEQUENCE

2 => 3



1 => DESCRIPTION

2 => OPINION OF



I now need to modify the expression to take into account multi-line content.
To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
AL."
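
A minimal Perl sketch of the behaviour described above, assuming the sample text is held in one string (blank lines omitted, truncated at <TEXT>) and the pattern is applied with /m and /g. The helper name single_line_captures is hypothetical. The second group, [\S ]+, accepts non-whitespace characters and spaces only, so it can never cross a newline; that is why the DESCRIPTION capture stops at "OPINION OF". Note also that [^/].+?[^/] needs at least three characters, so a one- or two-letter tag name would not match.

```perl
use strict;
use warnings;

# Sample text copied from the post (blank lines omitted, truncated at <TEXT>).
my $sample = <<'END';
<DOCUMENT>
<TYPE>EX-5
<SEQUENCE>3
<DESCRIPTION>OPINION OF
BRADLEY ARANT, ET AL.
<TEXT>
END

# Hypothetical helper: apply the posted pattern with /m (so ^ matches at the
# start of every line) and /g (to walk all matches), returning [tag, content].
sub single_line_captures {
    my ($input) = @_;
    my @found;
    while ($input =~ /^<([^\/].+?[^\/])>([\S ]+)/mg) {
        push @found, [$1, $2];
    }
    return @found;
}

# Prints the three pairs listed in the post:
# TYPE => EX-5, SEQUENCE => 3, DESCRIPTION => OPINION OF
printf "%s => %s\n", @$_ for single_line_captures($sample);
```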



Many thanks in advance,



Steve.




 
Ragnar Hafstað
12-29-2003
"Steve Dunn" <(E-Mail Removed)> wrote in message
news:UnSHb.12489$(E-Mail Removed)...
> I'm wondering if anyone can help with the following problem:
>
> I have the following text:
>
> <DOCUMENT>

snipped vaguely xml-like text ...
> And I have the following (multi-line) regular expression:
>
> ^<([^/].+?[^/])>([\S ]+)

first, a warning:
regular expressions will only work for simple xml-like stuff.
i hope you do not have tag nesting or attributes.

> I now need to modify the expression to take into account multi-line content.
> To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
> but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
> AL."


a few methods come to mind:

1) if the file is small (not huge), you can slurp it in, and use something like
m!^<([^/].+?[^/])>([^<]+)!s

2) set the input record separator to '<' and work with that

3) when you read a line not starting with '<', add it to previous item


what have you tried?

gnari



 
Steve Dunn
12-29-2003
Hi Ragnar,
Thanks. I'm not using perl just the regular expression (in .NET). It's
not XML (nor HTML), but some half-baked attempt at mark-up that was thought
of shortly after the dinosaurs became extinct! There are no nested tags
within the text, but empty tags must be ignored (in the example below,
<DOCUMENT> is an empty tag). The files are very small, and 'slurping' (like
the expression!) is one possibility if I can't get the regex to work.

Thanks again,

Steve.

"Ragnar Hafstað" <(E-Mail Removed)> wrote in message
news:bsotm8$vjt$(E-Mail Removed)...
> "Steve Dunn" <(E-Mail Removed)> wrote in message
> news:UnSHb.12489$(E-Mail Removed)...
> > I'm wondering if anyone can help with the following problem:
> >
> > I have the following text:
> >
> > <DOCUMENT>
>
> snipped vaguely xml-like text ...
> > And I have the following (multi-line) regular expression:
> >
> > ^<([^/].+?[^/])>([\S ]+)
>
> first, a warning:
> regular expressions will only work for simple xml-like stuff.
> i hope you do not have tag nesting or attributes.
>
> > I now need to modify the expression to take into account multi-line content.
> > To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
> > but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
> > AL."
>
> a few methods come to mind:
>
> 1) if the file is small (not huge), you can slurp it in, and use something like
> m!^<([^/].+?[^/])>([^<]+)!s
>
> 2) set the input record separator to '<' and work with that
>
> 3) when you read a line not starting with '<', add it to previous item
>
> what have you tried?
>
> gnari



 
Ragnar Hafstað
12-29-2003
"Steve Dunn" <(E-Mail Removed)> wrote in message
news:sjTHb.12510$(E-Mail Removed)...
> Hi Ragnar,
> Thanks. I'm not using perl just the regular expression (in .NET). It's

well, i do not know if many here are familiar with it.
are you processing the file line by line?

> not XML (nor HTML), but some half-baked attempt at mark-up that was thought
> of shortly after the dinosaurs became extinct! There are no nested tags
> within the text, but empty tags must be ignored (in the example below,
> <DOCUMENT> is an empty tag).


in your example there were no end tags (</xxx>), so I am not sure of the file
format.

if you can collect the file into one string without linebreaks, you probably
can do a match with
<([^/].+?[^/])>([^<]+)

gnari
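
A sketch of the "one string without linebreaks" idea, under a few assumptions of mine: lines are joined with a single space so words do not run together, blank lines are dropped, and whitespace-only content counts as an empty tag. The helper name match_joined is hypothetical. One substitution to flag: in the pattern as posted, .+? can also match '>' and '<', so on a single joined string the tag group can run across an empty tag such as <DOCUMENT> into the next tag. The sketch narrows the tag group to [^/<>]+ to avoid that; this is not the pattern from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: join the lines into one string, then match tag/content pairs.
sub match_joined {
    my (@lines) = @_;

    # Trim each line, drop blank ones, and join with a single space.
    my @kept;
    for my $line (@lines) {
        (my $trimmed = $line) =~ s/^\s+|\s+$//g;
        push @kept, $trimmed if $trimmed ne '';
    }
    my $joined = join ' ', @kept;

    my @pairs;
    # Tag group narrowed to [^/<>]+ (an assumption, see above);
    # content runs up to the next '<'.
    while ($joined =~ /<([^\/<>]+)>([^<]+)/g) {
        my ($tag, $content) = ($1, $2);
        $content =~ s/^\s+|\s+$//g;
        push @pairs, [$tag, $content] if $content ne '';   # ignore empty tags
    }
    return @pairs;
}

printf "%s => %s\n", @$_
    for match_joined('<DOCUMENT>', '<TYPE>EX-5', '<SEQUENCE>3',
                     '<DESCRIPTION>OPINION OF', 'BRADLEY ARANT, ET AL.', '<TEXT>');
```

Since [^<]+ ends the content at the next '<' wherever it occurs, any literal '<' inside the text cuts the capture short.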

P.S.:
in this newsgroup, it is considered bad form to top-post, i.e. to
put a reply/followup at the top of the message and quote the whole thread
below. it is better to quote relevant parts along with replies and comments
(a bit like I am doing in this message).
if the conversation develops into a thread, the top-posting becomes more and
more irritating.



 
Steve Dunn
12-29-2003
Hi Gnari,

"Ragnar Hafstað" <(E-Mail Removed)> wrote in message
news:bsp41o$vrp$(E-Mail Removed)...
> "Steve Dunn" <(E-Mail Removed)> wrote in message
> news:sjTHb.12510$(E-Mail Removed)...
> > Hi Ragnar,
> > Thanks. I'm not using perl just the regular expression (in .NET). It's
>
> well, i do not know if many here are familiar with it.
> are you processing the file line by line?

I am processing the text as one whole string. I've implemented a
work-around that 'slurps' line by line, although I'm not happy with it.

>
> > not XML (nor HTML), but some half-baked attempt at mark-up that was thought
> > of shortly after the dinosaurs became extinct! There are no nested tags
> > within the text, but empty tags must be ignored (in the example below,
> > <DOCUMENT> is an empty tag).
>
> in your example there were no end tags (</xxx>), so I am not sure of the file
> format.

End tags for these elements do not exist in this mark-up (I haven't got a
clue as to why not, but as I said, it was designed before the wheel !)
>
> if you can collect the file into one string without linebreaks, you probably
> can do a match with
> <([^/].+?[^/])>([^<]+)

Thanks for this. It works great although doesn't take into account the '<'
being on a new-line. It is returning the desired results, but will break if
there's any '<' characters in the text (and this 'mark-up' has no
escaping(!))
>
> gnari

Steve.
>
> P.S.:
> in this newsgroup, it is considered bad form to top-post, i.e. to
> put a reply/followup at the top of the message and quote the whole thread
> below. it is better to quote relevant parts along with replies and comments
> (a bit like I am doing in this message).
> if the conversation develops into a thread, the top-posting becomes more and
> more irritating.

Message understood. Many thanks for pointing this out and many many thanks
for your help!
>
>
>



 
Ragnar Hafstað
12-29-2003
"Steve Dunn" <(E-Mail Removed)> wrote in message
news:YKVHb.12565$(E-Mail Removed)...
> Hi Gnari,
>
> "Ragnar Hafstað" <(E-Mail Removed)> wrote in message
> news:bsp41o$vrp$(E-Mail Removed)...
> > if you can collect the file into one string without linebreaks, you probably
> > can do a match with
> > <([^/].+?[^/])>([^<]+)
>
> Thanks for this. It works great although doesn't take into account the '<'
> being on a new-line. It is returning the desired results, but will break if
> there's any '<' characters in the text (and this 'mark-up' has no
> escaping(!))


ok. if you collect the string *with* linefeeds, you should be able to match
with
\n<([^/].+?[^/])>([^<]+)
then you will have to deal with linefeeds in the capture
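
One way this could look in Perl, as a sketch only. It assumes the whole text is in one string with its linefeeds intact, and it makes three substitutions of mine: ^ with /m replaces the literal \n so a tag on the very first line is also seen, the tag group is narrowed to [^/<>\n]+, and the content is a lazy .*? that stops only where the next line begins with '<' (or at end of string). That last change means a '<' in the middle of a line no longer ends the capture, although a text line that starts with '<' still would. The helper name parse_tagged_text and the "x < y" line in the sample are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: match each tag at a line start, capture content up to
# the next line that begins with '<', then tidy the linefeeds in the capture.
sub parse_tagged_text {
    my ($text) = @_;
    my @pairs;
    while ($text =~ /^<([^\/<>\n]+)>(.*?)(?=^<|\z)/msg) {
        my ($tag, $content) = ($1, $2);
        $content =~ s/\s+/ /g;      # collapse linefeeds and runs of spaces
        $content =~ s/^ | $//g;     # trim the single leading/trailing space
        next if $content eq '';     # ignore empty tags such as <DOCUMENT>
        push @pairs, [$tag, $content];
    }
    return @pairs;
}

# Sample from the thread, plus an invented body line containing '<'.
my $doc = <<'END';
<DOCUMENT>
<TYPE>EX-5
<SEQUENCE>3
<DESCRIPTION>OPINION OF
BRADLEY ARANT, ET AL.
<TEXT>
x < y
</TEXT>
END

printf "%s => %s\n", @$_ for parse_tagged_text($doc);
```

The /m and /s modifiers used here correspond to RegexOptions.Multiline and RegexOptions.Singleline in .NET, so the same pattern should carry over with little change, though that has not been checked against the actual files.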

by the way, why are you testing for </xxx> and <xxx/> tags?
i thought you said there were none.

> Message understood. Many thanks for pointing this out and many many thanks
> for your help!


you are welcome

gnari



 
Steve Dunn
12-30-2003

"Ragnar Hafstað" <(E-Mail Removed)> wrote in message
news:bspo66$3nl$(E-Mail Removed)...
> "Steve Dunn" <(E-Mail Removed)> wrote in message
> news:YKVHb.12565$(E-Mail Removed)...
> > Hi Gnari,
> >
> > "Ragnar Hafstað" <(E-Mail Removed)> wrote in message
> > news:bsp41o$vrp$(E-Mail Removed)...
> > > if you can collect the file into one string without linebreaks, you probably
> > > can do a match with
> > > <([^/].+?[^/])>([^<]+)
> >
> > Thanks for this. It works great although doesn't take into account the '<'
> > being on a new-line. It is returning the desired results, but will break if
> > there's any '<' characters in the text (and this 'mark-up' has no
> > escaping(!))
>
> ok. if you collect the string *with* linefeeds, you should be able to match
> with
> \n<([^/].+?[^/])>([^<]+)
> then you will have to deal with linefeeds in the capture


Many thanks Gnari. I think we're almost there.

>
> by the way, why are you testing for </xxx> and <xxx/> tags?
> i thought you said there were none.
>

There aren't any in the snippet that I'm parsing, but the regex is also
used on larger pieces of text that might contain closing tags.

> > Message understood. Many thanks for pointing this out and many many thanks
> > for your help!

>
> you are welcome
>
> gnari
>

Steve.
p.s. Happy New Year!
>
>



 
Matt Garrish
12-30-2003

"Steve Dunn" <(E-Mail Removed)> wrote in message
news:UnSHb.12489$(E-Mail Removed)...
>
> I now need to modify the expression to take into account multi-line content.
> To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
> but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
> AL."
>


You're probably better off "unbust"ing the file first (never checked if
that's actually a technical term, but it is the name of a script we have
where I work). Essentially, you'd just have to write a script to remove
newlines from the file unless the line begins with a top-level tag. You
could then read the file line-by-line with a simple expression like:

m#^<([^>]*)>(.*)(</\1>)?#i

to grab all the data you need. The usefulness, however, will vary depending
on what you are trying to capture and how it is formatted.
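
A small sketch of what such an "unbust" step might look like, assuming that any line beginning with '<' followed by something other than '/' counts as a top-level tag, and that continuation lines are joined with a space. The names unbust and parse_unbusted are made up for illustration. The matching pattern is adjusted from the one above: with a greedy (.*), the optional closing-tag group never gets a chance to match, so a trailing </TAG> ends up inside the second capture. The sketch uses a lazy (.*?) anchored with $ instead.

```perl
use strict;
use warnings;

# Hypothetical preprocessor: fold every line that does not begin with an
# opening tag into the line before it, so each item sits on one line.
sub unbust {
    my ($text) = @_;
    my @out;
    for my $line (split /\n/, $text) {
        next if $line !~ /\S/;                  # drop blank lines
        if ($line =~ /^<[^\/]/ || !@out) {      # opening tag (or very first line)
            push @out, $line;
        }
        else {
            $out[-1] .= " $line";               # continuation or closing tag
        }
    }
    return join("\n", @out) . "\n";
}

# Hypothetical parser: read the unbusted text line by line.
sub parse_unbusted {
    my ($text) = @_;
    my @pairs;
    for my $line (split /\n/, unbust($text)) {
        # Lazy content plus $ lets the optional closing tag actually match.
        next unless $line =~ m{^<([^>/][^>]*)>(.*?)(?:</\1>)?$}i;
        my ($tag, $content) = ($1, $2);
        $content =~ s/^\s+|\s+$//g;
        push @pairs, [$tag, $content] if $content ne '';
    }
    return @pairs;
}

printf "%s => %s\n", @$_
    for parse_unbusted("<DESCRIPTION>OPINION OF\nBRADLEY ARANT, ET AL.\n<TEXT>\nbody\n</TEXT>\n");
```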

Matt


 
Tad McClellan
12-30-2003
Matt Garrish <(E-Mail Removed)> wrote:

> You're probably better off "unbust"ing the file first (never checked if
> that's actually a technical term, but it is the name of a script we have
> where I work).



I call my unbusters "preprocessor"s when in polite company,
otherwise they're "defoo"s.


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 