split xml file between two processing instructions

 
 
kcwolle
 
      06-23-2004
hello,

I want to split an xml file on processing instructions into different
files.
All content between the two PIs should be included in the new file.
The file name should contain the content of first and the last <no>
elements.


example:
<?split ?>
<h1>... text ...</h1>
<start-element/>
<text>
....text text text...
<nr>4</nr>
</text>
text text text
<nr>18</nr>
<end-element/>
<h6> ... text ...</h6>
<?split ?>

In this case the file name should be: test-no4to18.xml and everything
from <h1> to </h6> should be included.
(btw there can be different start and end tags so that no rule on the
starting and ending elements is possible)
I would like to use an XML module (e.g. XML::Twig) but how do I get a
node list that contains all nodes between the processing instructions
for further processing?

Can anybody help me?

Yours

Wolfgang
 
 
Anno Siegel
 
      06-23-2004
kcwolle <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> hello,
>
> I want to split an xml file on processing instructions into different
> files.
> All content between the two PIs should be included in the new file.
> The file name should contain the content of first and the last <no>
> elements.
>
>
> example:
> <?split ?>
> <h1>... text ...</h1>
> <start-element/>
> <text>
> ...text text text...
> <nr>4</nr>
> </text>
> text text text
> <nr>18</nr>
> <end-element/>
> <h6> ... text ...</h6>
> <?split ?>
>
> In this case the file name should be: test-no4to18.xml and everything
> from <h1> to </h6> should be included.
> (btw there can be different start and end tags so that no rule on the
> starting and ending elements is possible)
> I would like to use an XML module (e.g. XML::Twig) but how do I get a
> node list that contains all nodes between the processing instructions
> for further processing?


What have you tried so far?

We help people with programming, but we don't deliver programs
according to specification.

Anno
 
 
Tad McClellan
 
      06-23-2004
kcwolle <(E-Mail Removed)> wrote:

> I want to split an xml file on processing instructions into different
> files.



Does it have to work on arbitrary XML or only on "your" XML?

Might you have PIs like this?

<?split   ?>
or
<?split
?>

If so, you're on your own. If not, see below.
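
If the PIs do vary in whitespace, a slightly looser pattern may still be enough for plain-text splitting. The following is only a minimal sketch under that assumption: the sample string in $xml is made up for illustration, and the pattern still expects the PI to carry no data after the target name.

```perl
use strict;
use warnings;

# Hypothetical sample input; the PI spellings vary on purpose.
my $xml = "<?split ?><nr>1</nr><?split   ?><nr>2</nr><?split\n?><nr>3</nr><?split?>";

# \s* tolerates any amount of whitespace (including newlines) before '?>'.
# grep drops the empty leading field that split returns before the first PI.
my @sections = grep { /\S/ } split /<\?split\s*\?>/, $xml;

print scalar(@sections), " sections\n";
print "$_\n" for @sections;
```

The \Q...\E form used further down matches one exact spelling only, so it is the safer choice when the input is known to be uniform.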


> All content between the two PIs should be included in the new file.
> The file name should contain the content of first and the last <no>
> elements.



There are no <no> elements...


> example:
><?split ?>
><h1>... text ...</h1>
><start-element/>
><text>
> ...text text text...
><nr>4</nr>
></text>
> text text text
><nr>18</nr>
><end-element/>
><h6> ... text ...</h6>
><?split ?>
>
> In this case the file name should be: test-no4to18.xml and everything
> from <h1> to </h6> should be included.



> I would like to use an XML module



Since you don't need to make use of the XML structuring, I would
treat them as plain ol' text files.


> Can anybody help me?



What have you tried so far?

We generally prefer to help those who have attempted to help
themselves first...


This should get you started:

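# Assumes $_ holds the entire file contents
foreach my $section ( split /\Q<?split ?>\E/ ) {
    # first and last <nr> value in this section
    my( $num1, $num2 ) = ( $section =~ /<nr>(\d+)/g )[0, -1];
    next unless defined $num1;    # skip sections without any <nr>
    my $fname = "test-no${num1}to$num2.xml";
    print "$fname\n";
}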


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
kcwolle
 
      06-24-2004
Hello Anno,

I tried the following code to split the document. The problem is that
I get only the first two <no> elements and not the first and the last.

use strict;

my $text;
my $file = shift;       # input XML file
my $outfile = shift;    # output directory
my $testfile;

# read the whole input file into $text
open(INPUT, "<$file") or die "Cannot read file $file!\n";
local $/;
$text = <INPUT>;
close INPUT;

# each match is the text between two consecutive <?split ?> PIs
while ($text =~ /<\?split \?>(.*?)(?=<\?split \?>)/sg)
{
    my $fragment = $1;
    # list assignment keeps only the first two captures
    my ($from, $to) = $fragment =~ /<no>(.*?)<\/no>/isg;
    $testfile = $outfile."\\test-nr".${from}."to".${to}."\.xml";
    open(OUTPUT, ">$testfile") or die "Cannot write file $testfile!\n";
    print OUTPUT $fragment;
    close OUTPUT;
}

The general problem with using regular expressions is that there could
be broken elements, e.g.
<?split ?><level1><text>xxx</text><level2><text>yyy</text></level2>
<?split ?><level2><text>zzz</text></level2></level1>
where a level1 element begins in the first <?split ?> section and ends
in the second.
How can such broken elements be handled so that I end up with
well-formed XML?
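
One possible way to handle the broken-element case with plain text processing is to keep a stack of the elements that are open at each split point, close them at the end of a piece, and re-open them at the start of the next one. What follows is a rough sketch only, not a real XML parser: split_balanced is a hypothetical helper name, the tag pattern is naive (it would be fooled by tags inside comments or CDATA sections, or by a ">" inside an attribute value), and it assumes the input as a whole is well-formed.

```perl
use strict;
use warnings;

# Split $xml at <?split ?> PIs so that every piece has balanced tags.
# Elements still open at a split point are closed at the end of the piece
# and re-opened (with their original attributes) at the start of the next.
sub split_balanced {
    my ($xml) = @_;
    my @open;      # stack of [ name, start-tag text ] for open elements
    my @pieces;
    for my $raw ( split /<\?split\s*\?>/, $xml ) {
        # Re-open whatever was inherited from the previous piece.
        my $prefix = join '', map { $_->[1] } @open;

        # Naive tokenizer: start tags, end tags and empty-element tags only.
        while ( $raw =~ m{<(/?)([\w:.-]+)([^<>]*?)(/?)>}g ) {
            my ( $close, $name, $attrs, $empty ) = ( $1, $2, $3, $4 );
            next if $empty;                      # <foo/> opens nothing
            if ($close) { pop @open }
            else        { push @open, [ $name, "<$name$attrs>" ] }
        }
        next unless $raw =~ /\S/;                # skip empty pieces

        # Close everything still open, innermost element first.
        my $suffix = '';
        $suffix .= "</$_->[0]>" for reverse @open;

        push @pieces, $prefix . $raw . $suffix;
    }
    return @pieces;
}

my $xml = '<?split ?><level1><text>xxx</text><level2><text>yyy</text></level2>'
        . '<?split ?><level2><text>zzz</text></level2></level1>';
my @pieces = split_balanced($xml);
print "$_\n" for @pieces;
```

Each piece has balanced tags afterwards, but it may still contain several top-level elements, so it would need a wrapper root element before an XML parser accepts it as a document.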

On the other hand, if I use an XML module the PI is a node that has no
children. How can the following nodes up to the next PI be handled?

Btw I'm a relative newbie to Perl and XML programming so that I need
some support in these things. Maybe you can help me?

Yours

Wolfgang
 
 
Tad McClellan
 
      06-24-2004
kcwolle <(E-Mail Removed)> wrote:

> The problem is that
> I get only the first two <no> elements and not the first and the last.



> my ($from, $to) = $fragment =~ /<no>(.*?)<\/no>/isg;



Use a "list slice" ("Slices" section in perldata.pod) to slice
the list that m//g is returning, like I did in my earlier followup:


my ($from, $to) = ($fragment =~ /<no>(.*?)<\/no>/isg)[ 0, -1 ];
                  ^                                 ^^^^^^^^^^
                  ^                                 ^^^^^^^^^^
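
To see the difference in isolation, here is a small self-contained sketch; the sample string in $fragment is made up for illustration.

```perl
use strict;
use warnings;

my $fragment = '<no>4</no> text <no>7</no> text <no>18</no>';

# In list context m//g returns every capture: (4, 7, 18).
my @all = $fragment =~ /<no>(.*?)<\/no>/isg;

# Plain list assignment takes the first two and discards the rest.
my ($first, $second) = $fragment =~ /<no>(.*?)<\/no>/isg;

# Slicing the returned list with [0, -1] picks the first and the last.
my ($from, $to) = ($fragment =~ /<no>(.*?)<\/no>/isg)[0, -1];

print "all: @all\n";                   # all: 4 7 18
print "first two: $first $second\n";   # first two: 4 7
print "first and last: $from $to\n";   # first and last: 4 18
```

If the fragment contains only one match, both $from and $to get that same value; if it contains none, both are undef.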

--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 