Help: String search in Windows 2000 doesn't find text in Windows

 
 
Barry Millman
Guest
Posts: n/a
 
      11-27-2005
Hi:

I am using Perl 5 (I believe both machines are using ActivePERL 5) on
two machines with the same data files. One machine is Win 2000 the
other is Win XP. The files are MS Word 2000 documents e-mailed
(manually) from the Win 2000 machine to the XP machine.

The program searches the MS Word Files (both created with MS Word 2000)
for the word HYPERLINK. The format for the HYPERLINK that I am
searching for in the document is:

HYPERLINK "mydoc.doc"

(I checked this on the XP machine in Notepad and it is OK.)

PROBLEM: The program works on the Windows 2000 machine, but does not
find the files on the Win Xp machine.

The code that is not finding the text on the Win XP machine (same as
the Win 2000 machine, which does find the text) is:

----------- start actual code segment --------------------
while (/HYPERLINK(\s+.{1,80}?\.doc)/gim) # the "g" causes multiple matches
{
    $fndxx = $1;

    $fndxx =~ s/\"//; # remove leading quote
    $fndxx =~ s/\s+//; # remove leading spaces
    $dir="C:\\IGINproducts\\UserDocuments\\";

    $fullname = ($dir . $fndxx);
    $date_string = "Cannot Find";
    if (-e $fullname) { $date_string = ctime(stat($dir . $fndxx)->mtime); } #last update date of that file
    print(OUTFILE $fndxx,",",$date_string,", in: ",basename($file), "\n") ;
    $matches += 1; # count matches

} #end while HYPERLINK
----------- end actual code segment --------------------

The output for a found HYPERLINK should look like this (it does on the
Win 2000 machine):

mydoc.doc,(date of last update), in: otherdoc.doc

On Win XP, the program cannot even find the word HYPERLINK (if I modify
the code to just search for that). The directories are valid, I can
have the program print a list of all files as it processes them.

If I try this with a test program (the string to test is in the program
itself ) it works fine on the XP machine.

There are no encryption issues, nor any file or directory problems.

I would really appreciate any comments or suggestions about what I am
doing wrong.

Thanks,

Barry Millman


 
 
 
 
 
Barry Millman
Guest
Posts: n/a
 
      11-27-2005
Just some added info:

The search works fine if I save the MS Word files as RTF.

Also I wanted to mention that I have this around the hyperlink search code:
#open the file
open(INFILE,"< $file") or die "Couldn't open file ",$file;


while(<INFILE>)
{
# the hyperlink code I posted earlier
} # end while infile

Barry



Tad McClellan
Guest
Posts: n/a
 
      11-27-2005
Barry Millman wrote:

> The format for the HYPERLINK that I am
> searching for in the document is:
>
> HYPERLINK "mydoc.doc"


> PROBLEM: The program works on the Windows 2000 machine, but does not
> find the files on the Win Xp machine.



I don't think I can help with that part, but the code is too hokey
to just let it pass...


> ----------- start actual code segment --------------------
> while (/HYPERLINK(\s+.{1,80}?\.doc)/gim) # the "g" causes multiple
> matches



The //m does not do anything, so why is it there?

It changes the meaning of ^ and $, but you don't use those
anchors in your pattern, so you don't need //m.

.{1,80}?

is the same as

.{0,80}

Do you really want to match ' .doc' ?


We can't help you analyse why the match is failing because we
need two things to do that: the pattern and the string that
the pattern is to be matched against.

We have only one of those two things...


>
> {
> $fndxx = $1;
>
> $fndxx =~ s/\"//; # remove leading quote
> $fndxx =~ s/\s+//; # remove leading spaces



Why capture them only to strip them out of the captured string?

Why not just leave them out of the capture in the first place?


while (/HYPERLINK\s+"(.{1,78}\.doc)"/gi)

or, probably better:

while (/HYPERLINK\s+"([^"]{1,78}\.doc)"/gi)
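
A minimal sketch of that idea, using a variant of the second pattern with the closing quote placed outside the capture group so that $1 holds only the file name. The helper name extract_links and the sample string are made up for illustration; they are not from the original program.

```perl
use strict;
use warnings;

# Return every .doc name referenced by a HYPERLINK field in $text.
# extract_links is a hypothetical helper name used only in this sketch.
sub extract_links {
    my ($text) = @_;
    my @names;
    # The quotes and whitespace sit outside the parentheses, so nothing
    # has to be stripped from $1 afterwards.
    while ($text =~ /HYPERLINK\s+"([^"]{1,78}\.doc)"/gi) {
        push @names, $1;
    }
    return @names;
}

# Sample input invented for the example.
my @found = extract_links(q{x HYPERLINK "mydoc.doc" y HYPERLINK "other.doc" z});
print "$_\n" for @found;
```

With this shape the two s/// clean-up statements in the loop body are no longer needed.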


> $dir="C:\\IGINproducts\\UserDocuments\\";
>



Use single quotes unless you want to make use of one of the two
extra things that double quotes give you (interpolation
and backslash escapes).

Use forward slashes instead of silly slashes unless the path
is going to be fed to the "command interpreter".


$dir='C:/IGINproducts/UserDocuments/';


> print(OUTFILE $fndxx,",",$date_string,", in: ",basename($file),
> "\n") ;



Gak!

Use double quoted strings to concatenate your output string:

print(OUTFILE "$fndxx,$date_string, in: ", basename($file), "\n") ;


> If I try this with a test program (the string to test is in the program
> itself ) it works fine on the XP machine.



If you had shown us your complete test program, then we could
have helped you debug it.

But you didn't, so we can't. (hint)


> I would really appreciate any comments or suggestions about what I am
> doing wrong.



Not posting a short and complete program that we can run that
illustrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?


--
Tad McClellan SGML consulting
Perl programming
Fort Worth, Texas
 
 
Barry Millman
Guest
Posts: n/a
 
      11-27-2005
Hi:

I tried your suggestions, but no luck. I did move that directory
assignment outside the loop. Stupid of me!

There is something really odd in MS Word storage in Win XP. If I save
the document to RTF it finds the stuff in the RTF file.

I looked at both the MS Word and RTF files with the XVI32 Hex editor.
They both showed the same hex values for the string HYPERLINK.

Barry




Purl Gurl wrote:

> Barry Millman wrote:
>
> (snipped)
>
>
>>The code that is not finding the text on the Win XP machine (same as
>>the Win 2000 machine which does find the test)is:

>
>
> (snipped)
>
> Move this line above and outside your while loop:
>
>
>> $dir="C:\\IGINproducts\\UserDocuments\\";

>
>
> The reason for moving that line above and outside your while loop
> is you are creating a new value for that variable with each loop
> iteration. That is inefficient because that variable has a "fixed"
> value; set the value above and outside your while loop.
>
> You do not need to use double left hand slashes for your
> file path but doing so causes no harm. You can use single
> right hand slashes for your path, for a open(FILE) syntax
> as shown below.
>
> However, despite claims of one the "experts" in this group,
> you must use double lefthand slashes for some syntax,
> certainly for some system command syntax for Win32.
>
> For a file open, you do not need double slashes but it
> is perfectly ok to use them.
>
> Uppercase letters in a file path are not needed for Win32
> but are ok to use; no problem.
>
> Your code produces this directory / file name path:
>
> C:\IGINproducts\UserDocuments\mydoc.doc
>
> That "appears" to be a valid path. Check to be sure it is valid.
> Double check to be sure there are not spaces in a directory
> name, such as, User Documents which is typical.
>
> You do not show your syntax for your OUTFILE open for write.
> Be sure to use error checking to verify that file opens for write.
>
> Run this test code,
>
> #!perl
>
> open (TEST, "c:/iginproducts/userdocuments/mydoc.doc") || die "File Open Failed: $!";
>
> while (<TEST>)
> {
> if (index ($_, "HYPERLINK") > -1)
> { print "HYPERLINK found at line $.\n"; }
> }
>
> close (TEST) || die "File Close Failed $!";
>
>
> Clearly I cannot test that code not having your file to test.
> However, my syntax is ok,
>
> C:\APACHE\USERS\TEST>perl -c test.pl
> test.pl syntax OK
>
> Running that test code will determine if your file path and file name
> are valid, and will determine if HYPERLINK is actually in your file.
>
> Be cautious. If your HYPERLINK word spans lines, index will not
> find that specific instance.
>
> Often, reducing your code to most simple version possible will find
> errors for you, quickly.
>
> Purl Gurl

 
 
Barry Millman
Guest
Posts: n/a
 
      11-27-2005
OK. Sorry about the bad code. However, let's reduce this to the
minimum, removing the search for the text. All we will do is read
chunks of data, with this program:

-------------------- start of program --------------------------
open (TEST, "c:\\PERL\\Barry\\Starthere.rtf") || die "File Open Failed: $!";

while (<TEST>)
{

print( "Chunk length: ", length($_),"\n");
$chunks += 1;
}

close (TEST) || die "File Close Failed $!";

print( $chunks, " Chunks\n");
-------------------- end of program --------------------------

Now, if I run this using Starthere.rtf, I get 1544 Chunks and they have
all sorts of different lengths. Some of the first chunks are of length:
103, 218, 250,1,230,63, 255.

However, if I run this using Starthere.doc, I get only ONE chunk, and it
is of length 6 bytes.

If I examine the MS Word file using a Hex editor, I get the following
values for bytes 5 through 7 (calling the first byte as zero):
B1 1A E1

The 1A is the seventh byte of the file.

The PERL program (above) seems to stop at this character.

So forgetting about the search, does this yield any clues?

Thank you,

Barry




 
Tad McClellan
Guest
Posts: n/a
 
      11-27-2005
Purl Gurl wrote:
> Tad McClellan wrote:
>
> (snipped)
>
>> I don't think I can help with that part, but the code is too hokey
>> to just let it pass...

>
> Have you helped the author resolve his problem?



Have you?


--
Tad McClellan SGML consulting
Perl programming
Fort Worth, Texas
 
 
Bob Walton
Guest
Posts: n/a
 
      11-27-2005
Barry Millman wrote:

> Hi:
>
> I am using Perl 5 (I believe both machines are using ActivePERL 5)
> on two machines with the same data files. One machine is Win 2000 the
> other is Win XP. The files are MS Word 2000 documents e-mailed
> (manually) from the Win 2000 machine to the XP machine.
>
> The program searches the MS Word Files (both created with MS Word
> 2000) for the word HYPERLINK. The format for the HYPERLINK that I am
> searching for in the document is:
>
> HYPERLINK "mydoc.doc"
>
> (I checked this on the XP machine in Notepad and it is OK.)
>


Note that MS Word documents are stored in a proprietary binary
gibberish format. To assume that a given word in a document will
actually always be stored in an ASCII string in the .doc file is
assuming too much. For example, perhaps it is stored in Unicode?
And maybe newer Notepad versions understand enough to present
Unicode strings? Try looking at your files with an editor that
you *know* won't munge the contents. I suggest VIM.
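
One way to test the Unicode theory without trusting any editor is to let Perl look at the raw bytes. This is only a sketch: it assumes the word might be stored either as single-byte text or as UTF-16LE (the thread does not confirm the encoding), and has_word is a hypothetical helper name. $bytes is expected to be the file content read in binary mode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# has_word (hypothetical helper): true if $word occurs in the raw bytes,
# either as single-byte text or encoded as UTF-16LE.
sub has_word {
    my ($bytes, $word) = @_;
    my $wide = encode('UTF-16LE', $word);   # e.g. "H\0Y\0P\0..." with no BOM
    return index($bytes, $word) >= 0 || index($bytes, $wide) >= 0;
}
```

If only the UTF-16LE form is found, a plain /HYPERLINK/ match against the bytes cannot succeed, whatever Notepad displays.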

It is a mystery why a document would get changed while emailing
it from one system to another. Or did you perhaps open the
document with Word after emailing it, and then save it? You
don't say. Is it the same version of Word? And what email
system are you using on each of the computers? Does the same
thing happen if you zip the file, email the zipped version, and
unzip it on the other system?

> PROBLEM: The program works on the Windows 2000 machine, but does not
> find the files on the Win Xp machine.
>
> The code that is not finding the text on the Win XP machine (same as
> the Win 2000 machine which does find the test)is:
>
> ----------- start actual code segment --------------------
> while (/HYPERLINK(\s+.{1,80}?\.doc)/gim) # the "g" causes multiple
> matches


As others have mentioned, the /m modifier does nothing, and the
.{1,80}? would be better as .{0,80} .

>
> {
> $fndxx = $1;
>
> $fndxx =~ s/\"//; # remove leading quote


Your comment doesn't match the regex -- it will remove the first
quote, not a leading quote.

> $fndxx =~ s/\s+//; # remove leading spaces


Again, this will remove the first run of whitespace from the
string, not leading whitespace.
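
To make the difference concrete, here is a small sketch comparing the posted substitutions with an anchored version. The helper names and the sample string are invented for the example.

```perl
use strict;
use warnings;

# As posted: removes the FIRST quote and the FIRST run of whitespace,
# wherever they happen to be in the string.
sub strip_unanchored {
    my ($s) = @_;
    $s =~ s/"//;
    $s =~ s/\s+//;
    return $s;
}

# Anchored with ^: removes only leading whitespace and one leading quote.
sub strip_anchored {
    my ($s) = @_;
    $s =~ s/^\s*"?//;
    return $s;
}

# Hypothetical captured value with a quote in the middle, not at the start.
my $captured = ' my "draft".doc';
print "unanchored: [", strip_unanchored($captured), "]\n";
print "anchored:   [", strip_anchored($captured), "]\n";
```

For a typical capture such as ' "mydoc.doc' both versions give mydoc.doc; they only differ when the first quote is not at the start of the string.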

> $dir="C:\\IGINproducts\\UserDocuments\\";
>
> $fullname = ($dir . $fndxx);
> $date_string = "Cannot Find";
> if (-e $fullname) { $date_string = ctime(stat($dir .
> $fndxx)->mtime); } #last update date of that file
> print(OUTFILE $fndxx,",",$date_string,", in:
> ",basename($file), "\n") ;
> $matches += 1; # count matches
>
> } #end while HYPERLINK
> ----------- end actual code segment --------------------
>
> The output for a found HYPERLINK should look like this (it does on the
> Win 2000 machine):
>
> mydoc.doc,(date of last update), in: otherdoc.doc
>
> On Win XP, the program cannot even find the word HYPERLINK (if I modify
> the code to just search for that). The directories are valid, I can
> have the program print a list of all files as it processes them.
>
> If I try this with a test program (the string to test is in the program
> itself ) it works fine on the XP machine.
>
> There are no encryption issues, nor any file or directory problems.


How exactly do you know this? Using a piece of garbage like
Notepad won't definitively tell you this. I would trust Perl
much further than Notepad.
...
> Barry Millman

--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl
 
 
foo bar baz qux
Guest
Posts: n/a
 
      11-27-2005
Purl Gurl wrote:
> Purl Gurl wrote:


Isn't talking to yourself the first sign?


>
> I have looked over Word Perfect and MS Word but not RTF formats, on a
> 9.x machine, a 2K machine and an XP machine.


Somewhat irrelevant because the OP wrote " The files are MS Word 2000
documents e-mailed (manually) from the Win 2000 machine to the XP
machine."


<half-baked story about WordPerfect deleted>


> A hex editor will display plaintext format, if in a binary file. I use
> Hex Workshop v. 2.2x for this. Very old program but works with
> excellence. You could simply open your Word document with a
> hex editor, then search for http: from there.


Pay attention Kira, the OP already wrote "I looked at both the MS Word
and RTF files with the XVI32 Hex editor. They both showed the same hex
values for the string HYPERLINK."


It's so sad to see an old rusty V8 that's only running on three
cylinders.

 
 
foo bar baz qux
Guest
Posts: n/a
 
      11-27-2005

Purl Gurl wrote:
> Tad McClellan wrote:
>
> > Purl Gurl wrote:
> > > Tad McClellan wrote:

>
> (snipped)
>
> > >> I don't think I can help with that part, but the code is too hokey
> > >> to just let it pass...

>
> > > Have you helped the author resolve his problem?

>
> > Have you?

>
> I have. You have not.
>


The OP wrote about MS Word and you entertained him with a pointless and
inconclusive story about an unrelated product: WordPerfect. After he
wrote about using a hex editor you advised him to use a hex editor.

 
 
foo bar baz qux
Guest
Posts: n/a
 
      11-27-2005

Purl Gurl wrote:
> Barry Millman wrote:
>
> (snipped)
>
> > If I examine the MS Word file using a Hex editor, I get the following
> > values for bytes 5 through 7 (calling the first byte as zero):
> > B1 1A E1

>
> > The 1A is the seventh byte of the file.

>
> > The PERL program (above) seems to stop at this character.

>
> Possible false end of file (eof) signal


"Possible"? Don't be such an unassertive wimp Kira, it is well known
that control-Z (hex 1A) *is* the end of file marker for text files on
MS-DOS and hence (for compatibility reasons) on Win32.

Perl uses the OS for file I/O and it is inevitable that Windows stops
reading your binary file prematurely unless you tell it to use binary
mode.
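
A minimal sketch of what "use binary mode" looks like in Perl, assuming the script only needs the raw bytes of the .doc file: call binmode on the handle before the first read, so the 1A byte is passed through as data instead of ending the read. The helper name count_hyperlinks and the choice to slurp the whole file are illustrative, not taken from the original program.

```perl
use strict;
use warnings;

# Count occurrences of the literal word HYPERLINK in a file, reading raw bytes.
# count_hyperlinks is a hypothetical helper name used only in this sketch.
sub count_hyperlinks {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Couldn't open $path: $!";
    binmode($in);      # raw bytes: no CRLF translation, no Ctrl-Z end of file
    local $/;          # slurp mode: a binary file has no meaningful "lines"
    my $data = <$in>;
    close($in) or die "Couldn't close $path: $!";
    $data = '' unless defined $data;   # guard against an empty file
    my $count = () = $data =~ /HYPERLINK/g;   # number of matches
    return $count;
}
```

In the chunk-counting test program above, the equivalent change would be a binmode(TEST); line directly after the open.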

 