Velocity Reviews

bauer@b3s.de 05-24-2005 12:46 PM

Yet another Java regex problem
 
Hi,
there's a DocBook XML file which I want to modify. The file contains
something like
....
<mediaobject>
<imageobject>
<imagedata fileref="PathToImage" format="ImgFormat"/>
</imageobject>
</mediaobject>
....
I just want to match the whole <mediaobject> thingy and prepend one
line which contains the PathToImage as a XML comment just like
<!-- PathToImage -->

My input to the matcher is the whole file as is. First I tried to get a
regex to match the whole thing

content = content.replaceFirst(
"<mediaobject>" +
"\\s*<imageobject>" +
"\\s*<imagedata fileref=\".*\".*/>" +
"\\s*</imageobject>" +
"\\s*</mediaobject>",
"<!-- Test -->"
);

But when I use a backref (like \0 for the whole match or \1 if I use
parentheses for the filename) in the replacement string like this:
"<!-- Test -->\0"
I just get
<!-- Test --> + this square char which cannot display here

The strange thing is that when I use exactly the same pattern with
Pattern.compile(regex).matcher(str).replaceAll(repl)
nothing matches (contrary to what the Java API statement for
String.replaceAll() says).

I tried Pattern.MULTILINE and Pattern.DOTALL in any combination. I
tried to use .* instead of \\s and even used \r?\n? for the line
endings ... nothing works.

Please can anyone help me?

_

Tom


TechBookReport 05-24-2005 02:20 PM

Re: Yet another Java regex problem
 

Have you tried a pattern of "(<mediaobject)(.*)(</mediaobject>)"? You
can then use a replacement along the lines of
"<!-- PathToImage -->$1$2$3". I'd also use
Pattern.MULTILINE | Pattern.DOTALL when building the pattern.
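
A minimal sketch of that suggestion, assuming the file content is already held in a String. The class name GroupReplaceSketch, the method prependComment and the sample input are made up for illustration; only Pattern, its flags and Matcher.replaceAll come from java.util.regex. DOTALL is the flag that lets "." cross line breaks; MULTILINE only changes how ^ and $ behave, so it is harmless but not required here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not code from the thread.
class GroupReplaceSketch {

    // DOTALL lets '.' match line terminators, so (.*) can span the whole element.
    private static final Pattern MEDIA = Pattern.compile(
            "(<mediaobject)(.*)(</mediaobject>)",
            Pattern.MULTILINE | Pattern.DOTALL);

    // Prepends a fixed comment and re-inserts the three groups with $1$2$3.
    static String prependComment(String content) {
        return MEDIA.matcher(content).replaceAll("<!-- PathToImage -->$1$2$3");
    }

    public static void main(String[] args) {
        String xml = "<para/>\r\n<mediaobject>\r\n<imageobject/>\r\n</mediaobject>\r\n<para/>";
        System.out.println(prependComment(xml));
    }
}
```

Because (.*) is greedy, this version matches from the first opening tag to the last closing tag in the input, which is only what you want if there is a single mediaobject element.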

Hope that helps.

Pan
======================================================================
TechBookReport Java http://www.techbookreport.com/JavaIndex.html

bauer@b3s.de 05-24-2005 04:22 PM

Re: Yet another Java regex problem
 

TechBookReport wrote:
> Have you tried a pattern of "(<mediaobject)(.*)(</mediaobject>)". You
> can then use a replacement along the lines of "<!-- PathToImage
> -->$1$2$3". I'd also use Pattern.MULTILINE | Pattern.DOTALL when
> building the pattern.
>
> Hope that helps.


Not really ... this results in the same problem I already described.
Instead of substituting \1\2\3 with the matching groups I get only this
special char (looks like a square, cannot be displayed here). Btw, I
noticed that you used $1$2$3. This is Perl, right? In Java it would be
\1\2\3, or am I wrong?

You can try it yourself. Save the following content to a file:
<chapter>
<title>Chapter 1</title>
<sect1>
<title>Section 1</title>
<para>
Test Test Test Test Test Test Test Test Test
</para>
<mediaobject>
<imageobject>
<imagedata fileref="image.svg" format="SVG"/>
</imageobject>
</mediaobject>
<para>
Test Test Test Test Test Test Test Test Test
</para>
</sect1>
</chapter>

Read this file with
public String readPlain( File file ) throws Exception
{
    String content = "";
    String line;
    BufferedReader brd = new BufferedReader( new FileReader( file ) );
    // readLine() strips the line terminator, so one is re-appended per line
    while ( ( line = brd.readLine() ) != null )
        content += line + "\r\n";
    brd.close();
    return content;
}

and then apply a
content = Pattern.compile( "(<mediaobject)(.*)(</mediaobject>)",
Pattern.MULTILINE|Pattern.DOTALL).matcher(
content).replaceAll("<!-- Test -->\1\2\3");

_

Tom


bauer@b3s.de 05-24-2005 04:30 PM

Re: Yet another Java regex problem
 
Damn Java regex !!! It is $1$2$3. That was the point. I used the wrong
syntax for backrefs. But the Java API 1.4.2 documentation for
java.util.regex.Pattern says:

Back references
\n Whatever the nth capturing group matched

So what ... ?!?


TechBookReport 05-24-2005 04:43 PM

Re: Yet another Java regex problem
 
Did you escape the backslashes? Also, the funny square character is
probably the \r\n you are using. Try
System.getProperty("line.separator") instead.

Pan


bauer@b3s.de 05-24-2005 05:08 PM

Re: Yet another Java regex problem
 

TechBookReport wrote:
> Did you escape the backslashes? Also, the funny square character is
> probably the \r\n you are using. Try
> System.getProperty("line.separator") instead.

No, the funny square char is not the \r\n, because if it were it would
be on every line, independent of the regex code. I'm on Windows and the
app runs only on this system, but you are right, I'd better use
getProperty("line.separator").
I guess the funny square is some Unicode character (\1=0x01?) if I use
\1 without escaping the backslash.
But that doesn't matter anymore, my problem is solved. Thanks for your
help.
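
A small sketch that checks the guess above, assuming nothing beyond the standard library. The class EscapeSketch and its members are invented for illustration. In Java source, "\1" is an octal escape that the compiler turns into the single character U+0001 before the regex engine ever sees it, whereas "$1" is passed through and interpreted by the matcher as a group reference.

```java
// Hypothetical demo class; not code from the thread.
class EscapeSketch {

    // "\1" is an octal escape: the compiler stores the single char U+0001.
    static final String OCTAL = "<!-- Test -->\1";

    // "$1" reaches the regex engine unchanged and means "text of group 1".
    static String withDollar(String s) {
        return s.replaceAll("(b+)", "[$1]");
    }

    // "\\1" reaches the engine as \1; in a replacement string the backslash
    // only escapes the next character, so this inserts a literal '1'.
    static String withBackslash(String s) {
        return s.replaceAll("(b+)", "[\\1]");
    }

    public static void main(String[] args) {
        System.out.println((int) OCTAL.charAt(OCTAL.length() - 1)); // prints 1
        System.out.println(withDollar("abbc"));                     // prints a[bb]c
        System.out.println(withBackslash("abbc"));                  // prints a[1]c
    }
}
```

So escaping the backslash removes the square character but still does not insert the group; only the dollar form does that in a replacement string.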


Alan Moore 05-24-2005 09:54 PM

Re: Yet another Java regex problem
 
On Tue, 24 May 2005 15:20:13 +0100, TechBookReport <tbr@nospam.nos>
wrote:

>Have you tried a pattern of "(<mediaobject)(.*)(</mediaobject>)". You
>can then use a replacement along the lines of "<!-- PathToImage
>-->$1$2$3". I'd also use Pattern.MULTILINE | Pattern.DOTALL when
>building the pattern.


If there can be more than one mediaobject element in a document, you
need to use a reluctant dot-star:

"<mediaobject.*?</mediaobject>"

Otherwise, it will match everything from the first opening tag to the
last closing tag. Even if there's only one such element, it will
probably be more efficient this way.

You don't really need to use capturing parentheses, since you're
re-inserting the whole match; just use $0:

// (?s) is the inline form of Pattern.DOTALL, so '.' also matches line breaks
str = str.replaceAll("(?s)<mediaobject.*?</mediaobject>",
"<!-- PathToImage -->$0");
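
One way the reluctant quantifier and $0 might be combined for the original goal, i.e. putting the real fileref value into the comment. This is a sketch under assumptions, not code from the thread: the class FilerefCommentSketch and the method annotate are hypothetical, the (?s) inline flag is added so that "." spans line breaks, and each mediaobject is assumed to contain exactly one imagedata element with a double-quoted fileref attribute.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not code from the thread.
class FilerefCommentSketch {

    // (?s) = DOTALL. Group 1 captures the fileref value; .*? stops at the
    // first possible closing tag instead of running on to the last one.
    private static final Pattern MEDIA = Pattern.compile(
            "(?s)<mediaobject>.*?<imagedata\\s+fileref=\"([^\"]*)\".*?</mediaobject>");

    // Prepends "<!-- path -->" plus a newline to every matched element.
    // Assumes each mediaobject contains exactly one imagedata element.
    static String annotate(String content) {
        return MEDIA.matcher(content).replaceAll("<!-- $1 -->\n$0");
    }

    public static void main(String[] args) {
        String xml =
                "<mediaobject>\n<imageobject>\n<imagedata fileref=\"a.svg\" format=\"SVG\"/>\n"
              + "</imageobject>\n</mediaobject>\n<para/>\n"
              + "<mediaobject>\n<imageobject>\n<imagedata fileref=\"b.png\" format=\"PNG\"/>\n"
              + "</imageobject>\n</mediaobject>\n";
        System.out.println(annotate(xml));
    }
}
```

With two elements in the input, each one gets its own comment line; a greedy .* in the same pattern would treat both elements as a single match.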


The JDK regex package uses the same syntax as Perl WRT
backreferences--"\n" within the regex and "$n" in the replacement
string--except that it uses $0 instead of $& for the whole match, and
doesn't emulate the other dollar-plus-punctuation variables: $`, $',
and $+.
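
To make the two syntaxes concrete, here is a tiny sketch; the class BackrefSketch and the method markDoubledLetters are invented for illustration. The backreference \1 lives inside the regex (written "\\1" in Java source), while $0 lives in the replacement string.

```java
// Hypothetical demo class; not code from the thread.
class BackrefSketch {

    // \1 inside the regex (written "\\1" in Java source) must match the same
    // text group 1 matched; $0 in the replacement re-inserts the whole match.
    static String markDoubledLetters(String s) {
        return s.replaceAll("(\\w)\\1", "<$0>");
    }

    public static void main(String[] args) {
        System.out.println(markDoubledLetters("aa bb cd")); // prints "<aa> <bb> cd"
    }
}
```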

