Re: ignoring namespaces?

 
 
Joe Kesselman
 
      06-04-2010
> So - using XML::LibXML, is there a way
> of using XPaths, without namespaces?


Can't vouch for that tool.

You can, if you insist on doing so, write XPaths which are specifically
testing the local name rather than the qualified name:
/*[local-name()="foo"]/@*[local-name()="bar"]
though in some processors the performance of this variant will be
inferior to the proper namespace-aware path. And of course the increased
verbosity makes it harder to write, harder to read, and harder to maintain.
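
For comparison, here is a minimal sketch of the same "match on local name only" test outside XPath, in Python's standard xml.etree.ElementTree (not XML::LibXML, which I can't vouch for). ElementTree stores names as "{uri}local", so the helper just drops the "{uri}" part. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(name):
    """Drop ElementTree's '{uri}' part, leaving only the local name."""
    return name.rpartition("}")[2]


def root_attr_values(xml_text, elem_local, attr_local):
    """Return values of root-element attributes whose local name matches,
    whatever namespace (if any) the element and attributes are in."""
    root = ET.fromstring(xml_text)
    if local(root.tag) != elem_local:
        return []
    return [value for key, value in root.attrib.items()
            if local(key) == attr_local]


# Hypothetical sample: one prefixed and one unprefixed "bar" attribute.
doc = ('<a:foo xmlns:a="urn:example:a" xmlns:b="urn:example:b" '
       'b:bar="1" bar="2"/>')
print(root_attr_values(doc, "foo", "bar"))
```

Note that this only works when the prefixes are actually declared; a namespace-aware parser will reject a document that uses an undeclared prefix.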

If at all possible, I really recommend hammering on people to fix the
documents and use namespaces correctly. This will continue to cause
problems, and not every XML tool will let you construct this sort of
workaround. You can pay the cost to fix them now, or you can wait and
fix them in a complete panic (probably at greater cost) later.

--
Joe Kesselman,
http://www.love-song-productions.com...lam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
 
Joe Kesselman
 
      06-05-2010
bugbear wrote:
>> If at all possible, I really recommend hammering on people to fix the
>> documents and use namespaces correctly.

>
> Too late. Legacy applications and legacy files make this impossible.


Understood. As I say, that's going to continue to add to their costs in
the future, but if they can't/won't get everything fixed now, that's
their choice.

"The customer is not always right. The customer is the one with the
money. Sometimes you have to choose between being right and getting the
money."

(This is one reason for always having file formats -- in XML or any
other representation -- carry version numbers. That gives you some hope
of being able to recognize newer data, and process it more efficiently,
while still supporting the "quirks mode" needed by older/sloppier
instances.)
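
A minimal sketch of that idea, in Python and assuming the format carries a `version` attribute on the root element (the attribute name, the 2.0 cut-off, and the function names are all hypothetical): anything unversioned, unparseable, or older than the cut-off is routed to the quirks path.

```python
import xml.etree.ElementTree as ET


def parse_version(raw):
    """Turn '1.1' into (1, 1); return None if missing or malformed."""
    try:
        parts = [int(p) for p in raw.split(".")]
    except (AttributeError, ValueError):
        return None
    # Pad or trim to (major, minor) so '3' compares as (3, 0).
    return tuple((parts + [0, 0])[:2])


def choose_mode(xml_text, strict_from=(2, 0)):
    """Pick 'strict' for instances at or above strict_from, else 'quirks'."""
    root = ET.fromstring(xml_text)
    version = parse_version(root.get("version"))
    if version is None or version < strict_from:
        return "quirks"
    return "strict"


print(choose_mode('<Profile version="1.1"/>'))
print(choose_mode('<Profile version="2.0"/>'))
```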

--
Joe Kesselman,
http://www.love-song-productions.com...lam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
 
Peter Flynn
 
      06-06-2010
bugbear wrote:
[...]
> I also considered walking the entire tree REMOVING namespaces,
> but that doesn't sound like a high performance solution.


sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?

///Peter
 
P. Lepin
 
      06-07-2010

Peter Flynn wrote:
> bugbear wrote:
> [...]
>> I also considered walking the entire tree REMOVING namespaces,
>> but that doesn't sound like a high performance solution.

>
> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?


Haven't posted anything for a long while, but I cannot keep quiet after
seeing this.

That's barbarous, sir! Just barbarous!

(smileys implied)

--
P. Lepin
 
Peter Flynn
 
      06-07-2010
bugbear wrote:
> Peter Flynn wrote:
>> bugbear wrote:
>> [...]
>>> I also considered walking the entire tree REMOVING namespaces,
>>> but that doesn't sound like a high performance solution.

>>
>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?

>
> Given that my problem (corrupt data) cannot be solved by a "squeaky
> clean" solution (*), that's strangely appealing.


I always counsel to avoid the non-XML approach because it carries no
guarantee that the object you elect to operate on is actually what you
think it is.

(At least, a formal XML method like XSLT/XPath doesn't have any
"guarantee" as such, but at least I can be reasonably certain that if I
select the fifth paragraph of section 4 of chapter 6, then that is what
I will get, leaving aside my own programming errors.)

But there are times (and invalid XML is one of them) when a combination
of sed, awk, grep, tr, and the rest of the tribe, including Perl, Emacs,
Python, and your own personal favourite, is the only viable solution.

sed has the advantage and disadvantage of being spectacularly fast: get
it wrong and it will eat your data. Properly tested, however, the above
will remove all namespace prefixes to element type names within the
document element. It will not remove the xmlns:* namespace binding
attributes from the root element start-tag, nor will it remove
namespace prefixes from attributes anywhere (the addition of more REs,
alternations, subexpressions, and backreferences to achieve this is left
as an exercise to the reader). Because it is unparsed, it *will*
remove the namespace prefixes from examples of XML markup in CDATA
marked sections in documentation, for example.
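
To make those limits concrete, here is a hedged Python sketch with two patterns. LITERAL is a direct translation of the sed expression. TIGHT is my own assumption, not something tested against real data: it requires the prefix to be a single name directly after "<" or "</". One caveat worth checking before relying on the one-liner: `[^:]*` matches any character except a colon, including ">" and whitespace, so on a line where an unprefixed tag precedes a prefixed one, the match runs from the first "<" to the later colon.

```python
import re

# Literal translation of the sed pattern: "<", optional slashes, then
# everything up to and including the next colon.
LITERAL = re.compile(r"<(/*)([^:]*:)")

# Tighter variant (an assumption, not from the thread): the prefix must be
# one name immediately following "<" or "</".
TIGHT = re.compile(r"<(/?)[A-Za-z_][\w.-]*:")


def strip_prefixes(text, pattern=TIGHT):
    """Remove namespace prefixes from element names, as plain text."""
    return pattern.sub(r"<\1", text)


sample = '<p:a xmlns:p="urn:p" p:x="1"><b/><![CDATA[<p:c/>]]></p:a>'
print(strip_prefixes(sample))

# The literal pattern swallows an unprefixed tag that precedes a prefixed one.
print(strip_prefixes("<foo><a:bar/></foo>", LITERAL))
```

In the first output the xmlns:p declaration and the p:x attribute prefix are still there, and the markup inside the CDATA section has lost its prefix, as described above. Unlike sed, which works line by line, Python's `[^:]` also matches newlines, so the literal pattern can reach even further on multi-line input.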

P. Lepin wrote:
> Peter Flynn wrote:
>> bugbear wrote:
>> [...]
>>> I also considered walking the entire tree REMOVING namespaces,
>>> but that doesn't sound like a high performance solution.

>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?

>
> Haven't posted anything for a long while, but I cannot keep quiet
> after seeing this.
>
> That's barbarous, sir! Just barbarous!
> (smileys implied)


Peh. I have seen *far* worse [better], both in the Humanities and the
Natural Sciences, trying to coerce evilly-formed documents into XML.

///Peter
--
XML FAQ: http://xml.silmaril.ie/
 
Martijn Lievaart
 
      06-07-2010
On Mon, 07 Jun 2010 09:55:07 +0100, bugbear wrote:

> Peter Flynn wrote:
>> bugbear wrote:
>> [...]
>>> I also considered walking the entire tree REMOVING namespaces, but
>>> that doesn't sound like a high performance solution.

>>
>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?

>
> Given that my problem
> (corrupt data) cannot be solved
> by a "squeaky clean" solution (*),
> that's strangely appealing.


It is also very error prone, but may be acceptable. To improve on the
above solution, split it into two steps. In the first step, a custom
program (instead of sed) cleans up the files and produces clean files
without namespaces; in the second step, your program(s) process those
clean files.

By creating a separate program for the first step, you can have it do
checks to see if the output it produces is sensible and die (to let you
investigate the problem) if it is not.

Once the files are cleaned, none of the programs that process them
(second step) has to carry convoluted logic to deal with the dirty files.
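
A minimal sketch of such a first-step program, in Python. The prefix pattern, the function name, and the choice of "does the result parse as XML?" as the sanity check are my assumptions; a real cleaner would add whatever checks fit your data.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern: a single prefix name directly after "<" or "</".
ELEMENT_PREFIX = re.compile(r"<(/?)[A-Za-z_][\w.-]*:")


def clean(text):
    """Strip element prefixes, then refuse to return output that is not
    well-formed XML, so problems surface here and not in step two."""
    cleaned = ELEMENT_PREFIX.sub(r"<\1", text)
    try:
        ET.fromstring(cleaned)
    except ET.ParseError as err:
        raise ValueError(f"cleaned output is not well-formed: {err}") from err
    return cleaned


print(clean('<m:App Name="x"><m:Id>1</m:Id></m:App>'))

# An attribute prefix survives the substitution and is undeclared, so the
# check fires and the caller can stop and investigate.
try:
    clean('<m:App t:Name="x"/>')
except ValueError as err:
    print("rejected:", err)
```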

M4
 
sln@netherlands.com
 
      06-07-2010
On Mon, 07 Jun 2010 12:32:15 +0100, Peter Flynn <(E-Mail Removed)> wrote:

>bugbear wrote:
>> Peter Flynn wrote:
>>> bugbear wrote:
>>> [...]
>>>> I also considered walking the entire tree REMOVING namespaces,
>>>> but that doesn't sound like a high performance solution.
>>>
>>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?

>>
>> Given that my problem (corrupt data) cannot be solved by a "squeaky
>> clean" solution (*), that's strangely appealing.

>
>I always counsel to avoid the non-XML approach because it carries no
>guarantee that the object you elect to operate on is actually what you
>think it is.
>
>(At least, a formal XML method like XSLT/XPath doesn't have any
>"guarantee" as such, but at least I can be reasonably certain that if I
>select the fifth paragraph of section 4 of chapter 6, then that is what
>I will get, leaving aside my own programming errors.)
>
>But there are times (and invalid XML is one of them) when a combination
>of sed, awk, grep, tr, and the rest of the tribe, including Perl, Emacs,
>Python, and your own personal favourite, is the only viable solution.
>
>sed has the advantage and disadvantage of being spectacularly fast: get
>it wrong and it will eat your data. Properly tested, however, the above
>will remove all namespace prefixes to element type names within the
>document element. It will not remove the xmlns:* namespace binding
>attributes from the root element start-tag, nor will it remove
>namespace prefixes from attributes anywhere (the addition of more REs,
>alternations, subexpressions, and backreferences to achieve this is left
>as an exercise to the reader). Because it is unparsed, it *will*
>remove the namespace prefixes from examples of XML markup in CDATA
>marked sections in documentation, for example.
>


This might parse it (with a slight bit of validation)
using regex, while changing just specific parts of the source xml
dealing with namespace in tags and/or attributes.

-sln

# -----------------------------------------------------------
# rx_xml_fixnamespace.pl
# -sln, 6/7/2010
#
# Util to search/replace xml namespace from tags/attributes
# -----------------------------------------------------------

use strict;
use warnings;

## Initialization
##

my $Name = "[A-Za-z_:][\\w:.-]*";
my $SkipName = "[A-Za-z_][\\w.-]*";
my $rxskip_tag = "(?: $SkipName )"; # Skip tags
my $rxskip_attr = "(?: $SkipName )"; # Skip attribute's
my $rxtag = "(?: $Name )"; # Tags
my $rxattr = "(?: $Name )"; # Attribute's


use re 'eval';
my $topen = 0;

my $Rxmarkup = qr
{
  (?(?{$topen}) # Begin Conditional

    # Have open <TAG> ?
    (?:
      # Try to match next attribute
      (?:
        \s*=\s* (?:".*?"|'.*?') \K
        |
        \s* (?<=\s)
        (?: $rxskip_attr \K | \K (?<ATTR> $rxattr) )
        (?= \s*=\s* (?:".*?"|'.*?'))
      )
      (?= [^>]*? \s* /? > )
      |
      # No more attr's
      (?{$topen = 0})
    )
    |
    # Look for new open or close <TAG>
    (?:
      [^<]*
      (?:
        # Things that hide markup:
        # - Comments/CDATA
        (?: <!
          (?:
            \[CDATA\[.*?\]\]
            | --.*?--
            | \[[A-Z][A-Z\ ]*\[.*?\]\]
          )
          > \K
        )
        |
        # Specific markup we seek:
        # - TAG
        <
        (?:
          /* $rxskip_tag \K (?= \s* /* >)
          |
          /* \K (?<TAG> $rxtag ) (?= \s* /* >)
          |
          (?: $rxskip_tag \K | \K (?<TAG> $rxtag ) )
          (?= \s [^>]*? \s* /? > )
          (?{$topen = 1})
        )
      )
      |
      < \K
    )
  ) # End Conditional
}xs;

## Code
##

my $xml = join '', <DATA>;
$xml =~ s/$Rxmarkup/ fixnamespace( $+{TAG}, $+{ATTR} ) /eg;
print "\n",$xml;

exit (0);


## Subs
##

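# Called once per match with ($+{TAG}, $+{ATTR}); at most one is defined.
# Strips a leading "prefix:" from the name and returns the result
# (the name comes back unchanged if it has no prefix).
sub fixnamespace {

  if (defined $_[0]) {
    my $tag = $_[0];
    if ($tag =~ s/^[^:]*://) {
      print "Replaced\t$_[0]\n with \t$tag\n";
    }
    return $tag;
  }
  if (defined $_[1]) {
    my $attr = $_[1];
    if ($attr =~ s/^[^:]*://) {
      print "Replaced\t$_[1]\n with \t$attr\n";
    }
    return $attr;
  }
  return "";
}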


__DATA__

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>

<Profile xmlns="xxxxxxxxx" name="" version="1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" junk="">

<monday:Application Name="App1" Id="/Local/App/App1"
Id2="/Local/App/App2" services="1" policy=""
StartApp="" Bal="5" sessInt="500" WaterMark="1.0"/>

<AppProfileGuid>586e3456dt</AppProfileGuid>

</Profile>

<Application
Name="App99" Id='/Dummy/Test/iii' Services="3"
policy="99" monday:StartApp="2" Bal="7" sessInt="27"
tuesday:WaterMark="4.3" />

<wednesday:Application Id="/testing"
Name="App100" monday:Id="/Dum
my/Test/iii
" Services="4"
policy="99" StartApp="2" Bal="7" sessInt="27"
WaterMark="4.3"/>

<Application
Name="Yyee" Id="/Dat/Inp/Out" Services="5"
policy="88" StartApp="" Bal="1" sessInt="8"
thrusday:WaterMark="2.1"/>

<![CDATA[ <Applic:ation Name="App" Id=""/> ]]>

<AppProfile:Guid>586e3456dt</AppProfile:Guid>
<AppProfile:Guid>a46y2hktt7</AppProfile:Guid>
<AppProfile:Guid>mi6j77mae6</AppProfile:Guid>
</Profile>

 
Peter Flynn
 
      06-09-2010
(E-Mail Removed) wrote:
> On Mon, 07 Jun 2010 12:32:15 +0100, Peter Flynn <(E-Mail Removed)> wrote:

[...]
>> I always counsel to avoid the non-XML approach

[...]
> This might parse it (with a slight bit of validation)


It occurs to me that you can combine both methods, iff the document is
well-formed.

Run onsgmls -wxml /usr/share/xml/declaration/xml.dcl doc.xml >doc.esis
to get the ESIS, and then tweak the W3C's esis2xml.py script to re-form
the XML document, omitting the namespaces. Or write your own in Perl...
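
A rough sketch of the "write your own" route, in Python and not based on esis2xml.py. It assumes the usual nsgmls/onsgmls output conventions: "(name" starts an element, ")name" ends it, "Aname TYPE value" lines give the attributes of the next element to start, and "-text" is character data with \n, \\ and \ooo octal escapes. Only those line types are handled; the function names are hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def local(name):
    """Drop any 'prefix:' from an element or attribute name."""
    return name.rpartition(":")[2]


def unescape(data):
    """Decode the common ESIS escapes; other escapes are left as they are."""
    def repl(match):
        token = match.group(1)
        if token == "n":
            return "\n"
        if token == "\\":
            return "\\"
        if token == "|":
            return ""          # SDATA bracket, dropped
        return chr(int(token, 8))
    return re.sub(r"\\(n|\\|\||[0-7]{3})", repl, data)


def esis_to_xml(lines):
    """Rebuild XML text from ESIS lines, omitting prefixes and xmlns
    declarations. Line types other than A ( ) - are ignored."""
    out, pending = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            name, _, tail = rest.partition(" ")
            kind, _, value = tail.partition(" ")
            if kind == "IMPLIED" or name == "xmlns" or name.startswith("xmlns:"):
                continue
            pending.append((local(name), unescape(value)))
        elif code == "(":
            attrs = "".join(f" {n}={quoteattr(v)}" for n, v in pending)
            out.append(f"<{local(rest)}{attrs}>")
            pending = []
        elif code == ")":
            out.append(f"</{local(rest)}>")
        elif code == "-":
            out.append(escape(unescape(rest)))
    return "".join(out)
```

Usage would be along the lines of `esis_to_xml(open("doc.esis"))`. Stripping prefixes can leave two attributes with the same name on one element, so the result still wants a well-formedness check.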

///Peter
 
sln@netherlands.com
 
      06-12-2010
On Wed, 09 Jun 2010 22:02:32 +0100, Peter Flynn <(E-Mail Removed)> wrote:

>(E-Mail Removed) wrote:
>> On Mon, 07 Jun 2010 12:32:15 +0100, Peter Flynn <(E-Mail Removed)> wrote:

>[...]
>>> I always counsel to avoid the non-XML approach

>[...]
>> This might parse it (with a slight bit of validation)

>
>It occurs to me that you can combine both methods, iff the document is
>well-formed.
>
>Run onsgmls -wxml /usr/share/xml/declaration/xml.dcl doc.xml >doc.esis
>to get the ESIS, and then tweak the W3C's esis2xml.py script to re-form
>the XML document, omitting the namespaces. Or write your own in Perl...
>
>///Peter


Hey thanks!


 