find a pattern in binary file

 
 
vizzz
 
      06-20-2008
Hi there,
I need to find a hex pattern like 0x650A1010 in a binary file.
I could write a small algorithm that scans the whole file for the match,
but the file is huge, and I'm worried about performance.
Is there any STL method for a fast search?
Andrea
 
Kai-Uwe Bux
 
      06-20-2008
vizzz wrote:

> Hi there,
> i need to find an hex pattern like 0x650A1010 in a binary file.
> i can make a small algorithm that fetch all the file for the match,
> but this file is huge, and i'm scared about performances.
> Is there any stl method for a fast search?


You could try std::search() with istreambuf_iterator< unsigned char >.

However:

(a) It is not clear that you will get good performance. Some implementations
are not really all that good with stream iterators.

(b) I am not sure whether search() is allowed to use backtracking
internally, in which case you cannot use it with stream iterators. You
should check.

(c) Even if search finds an occurrence, it reports the result as an
iterator. I do not know of a convenient way to convert that into an offset.


Maybe, rolling your own is not all that bad. You could read the file in
chunks (keeping the last three characters from the previous block) and use
std::search() on the blocks. With the right blocksize, this could be really
fast.


If your OS allows memory mapping of the file, you could do that and use
std::search() with unsigned char * on the whole thing. That could be the
fastest way, but it leaves the realm of standard C++.


Best

Kai-Uwe Bux
 
Ivan
 
      06-20-2008
On Jun 20, 1:11 pm, vizzz <(E-Mail Removed)> wrote:
> Hi there,
> i need to find an hex pattern like 0x650A1010 in a binary file.
> i can make a small algorithm that fetch all the file for the match,
> but this file is huge, and i'm scared about performances.
> Is there any stl method for a fast search?
> Andrea


Hmmm... I had a look at this and ran across a simple problem. How do
you read a binary file and just echo the hex for each byte to the screen?
The issue is that the C++ read function doesn't return the number of bytes
read... so on the last read into a buffer, how do you know how many
characters to print?

Thanks,
Ivan Novick
http://www.mycppquiz.com
 
 
Kai-Uwe Bux
 
      06-21-2008
Ivan wrote:

> On Jun 20, 1:11 pm, vizzz <(E-Mail Removed)> wrote:
>> Hi there,
>> i need to find an hex pattern like 0x650A1010 in a binary file.
>> i can make a small algorithm that fetch all the file for the match,
>> but this file is huge, and i'm scared about performances.
>> Is there any stl method for a fast search?
>> Andrea
>
> Hmmm... I had a look at this and ran across a simple problem. How do
> you read a binary file and just echo the hex for each byte to the screen?


#include <iostream>
#include <ostream>
#include <fstream>
#include <iterator>
#include <iomanip>
#include <algorithm>
#include <cassert>

class print_hex {

    std::ostream * ostr_ptr;
    unsigned int line_length;
    unsigned int index;

public:

    print_hex ( std::ostream & str_ref, unsigned int length )
        : ostr_ptr( &str_ref )
        , line_length ( length )
        , index ( 0 )
    {}

    void operator() ( unsigned char ch ) {
        ++index;
        if ( index >= line_length ) {
            (*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
                        << (unsigned int)(ch) << '\n';
            index = 0;
        } else {
            (*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
                        << (unsigned int)(ch) << ' ';
        }
    }

};

int main ( int argn, char ** args ) {
    assert( argn == 2 );
    std::ifstream in ( args[1] );
    std::for_each( std::istreambuf_iterator< char >( in ),
                   std::istreambuf_iterator< char >(),
                   print_hex( std::cout, 25 ) );
    std::cout << '\n';
}


> The issue is the c++ read function doesn't return number of bytes
> read... so on the last read into a buffer how do you know how many
> characters to print?


Have a look at readsome().



Best

Kai-Uwe Bux
 
 
Eric Pruneau
 
      06-21-2008

"vizzz" <(E-Mail Removed)> wrote in message news:
aad55897-6560-4fd7-ae4f-5b8cc810fe51...oglegroups.com...
> Hi there,
> i need to find an hex pattern like 0x650A1010 in a binary file.
> i can make a small algorithm that fetch all the file for the match,
> but this file is huge, and i'm scared about performances.
> Is there any stl method for a fast search?
> Andrea


Check out boost::regex

http://www.boost.org/doc/libs/1_35_0...tml/index.html



 
 
James Kanze
 
      06-21-2008
On Jun 20, 10:43 pm, Kai-Uwe Bux <(E-Mail Removed)> wrote:
> vizzz wrote:


> > i need to find an hex pattern like 0x650A1010 in a binary
> > file. i can make a small algorithm that fetch all the file
> > for the match, but this file is huge, and i'm scared about
> > performances. Is there any stl method for a fast search?


> You could try std::search() with istreambuf_iterator< unsigned char >.


That's very problematic. istreambuf_iterator< unsigned char >
will expect a basic_streambuf< unsigned char >, which isn't
defined by the standard (and you're not allowed to define it).
A number of implementations do provide a generic version of
basic_streambuf, but since the standard doesn't say what the
generic version should do, they tend to differ. (I remember
sometime back someone posting in fr.comp.lang.c++ that he had
problems because g++ and VC++ provide incompatible generic
versions.)

It would, I suppose, be possible to use istream_iterator<
unsigned char >, provided the file was opened in binary mode,
and you reset skipws. I have my doubts about the performance of
this solution, but it's probably worth a try---if the
performance turns out to be acceptable, you won't get much
simpler.

Except, of course, that search requires forward iterators, and
won't (necessarily) work with input iterators.

[...]
> Maybe, rolling your own is not all that bad. You could read
> the file in chunks (keeping the last three characters from the
> previous block) and use std::search() on the blocks. With the
> right blocksize, this could be really fast.


A lot depends on other possible constraints. He didn't say, but
his example was to look for 0x650A1010, not the sequence 0x65,
0x0A, 0x10, 0x10. If what he is really looking for is a four
byte word, correctly aligned, then as long as the block size is
a multiple of 4, he could use search() with an
iterator::value_type of uint32_t. For arbitrary positions and
sequences, on the other hand, some special handling might be
necessary for cases where the sequence spans a block boundary.
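
To illustrate the aligned case, here is a small sketch that treats a block as a sequence of 32-bit words and looks for one value with std::find(). It assumes the words are stored most significant byte first, which the thread does not state; the helper names load_be32 and find_aligned_word are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assembles a 32-bit value from four bytes stored most significant first.
// Big-endian storage is an assumption, not something the thread confirms.
std::uint32_t load_be32(const unsigned char* p)
{
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}

// Returns the byte offset of the first 4-byte-aligned word equal to
// `value`, or -1 if there is none. The block size must be a multiple of 4.
long find_aligned_word(const std::vector<unsigned char>& block,
                       std::uint32_t value)
{
    assert(block.size() % 4 == 0);
    std::vector<std::uint32_t> words(block.size() / 4);
    for (std::size_t i = 0; i < words.size(); ++i) {
        words[i] = load_be32(&block[4 * i]);
    }
    const std::vector<std::uint32_t>::iterator hit =
        std::find(words.begin(), words.end(), value);
    return hit == words.end()
               ? -1
               : static_cast<long>(4 * (hit - words.begin()));
}
```

Assembling the words explicitly keeps the result independent of the host's byte order. A word that starts at an unaligned offset is deliberately not found, which is the difference from a byte-wise search.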

When I had to do something similar, I reserved a guard zone in
front of my buffer, and used a Boyer-Moore (BM) search in the buffer. When
the BM search would have taken me beyond the end of the buffer,
I copied the last N bytes of the buffer into the end of the
guard zone before reading the next block, and started my next
search from them. This would probably make keeping track of the
offset a bit tricky (I didn't need the offset), and for the best
performance on the system I was using then, I had to respect
alignment of the buffer as well, which also added some extra
complexity. (But I got the speed we needed.)
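
For readers unfamiliar with this family of algorithms, the sketch below shows the Horspool simplification of Boyer-Moore on a single in-memory buffer. It is not the code described above: the guard zone, block refill, and alignment handling are left out, and the name horspool_find is hypothetical. std::string is used only as a convenient byte container.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Boyer-Moore-Horspool: returns the index of the first occurrence of
// `pat` in `text`, or std::string::npos if there is none.
std::size_t horspool_find(const std::string& text, const std::string& pat)
{
    assert(!pat.empty());
    const std::size_t m = pat.size();
    // shift[b]: how far the window may move when its last byte is b.
    std::vector<std::size_t> shift(256, m);
    for (std::size_t i = 0; i + 1 < m; ++i) {
        shift[static_cast<unsigned char>(pat[i])] = m - 1 - i;
    }
    std::size_t pos = 0;
    while (pos + m <= text.size()) {
        // Compare right to left, as Boyer-Moore does.
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pat[j - 1]) {
            --j;
        }
        if (j == 0) {
            return pos;
        }
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return std::string::npos;
}
```

The window can jump by up to the pattern length per step, which is where the speed comes from. That same jump is why the search can run past the end of a block and why the leftover bytes have to be carried over.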

> If your OS allows memory mapping of the file, you could do
> that and use std::search() with unsigned char * on the whole
> thing. That could be the fasted way, but will leave the realm
> of standard C++.


If the entire file will fit into memory, perhaps just reading it
all into memory, and then using std::search, would be an
appropriate solution. Or perhaps not: it's often faster to use
a somewhat smaller buffer, and manage the "paging" yourself.

--
James Kanze (GABI Software) email:(E-Mail Removed)
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
 
 
James Kanze
 
      06-21-2008
On Jun 21, 2:13 am, Kai-Uwe Bux <(E-Mail Removed)> wrote:
> Ivan wrote:
> > On Jun 20, 1:11 pm, vizzz <(E-Mail Removed)> wrote:

> > Hmmm... I had a look at this and ran across a simple
> > problem. How do you read a binary file and just echo the
> > hex for each byte to the screen?

> #include <iostream>
> #include <ostream>
> #include <fstream>
> #include <iterator>
> #include <iomanip>
> #include <algorithm>
> #include <cassert>

> class print_hex {

> std::ostream * ostr_ptr;
> unsigned int line_length;
> unsigned int index;

> public:

> print_hex ( std::ostream & str_ref, unsigned int length )
> : ostr_ptr( &str_ref )
> , line_length ( length )
> , index ( 0 )
> {}


> void operator() ( unsigned char ch ) {
> ++index;
> if ( index >= line_length ) {
> (*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
> << (unsigned int)(ch) << '\n';
> index = 0;
> } else {
> (*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
> << (unsigned int)(ch) << ' ';


Wouldn't it be preferable to set the formatting flags in the
constructor? I'd also provide an "indent" argument; if index
were 0, I'd output indent spaces, otherwise a single space---or
perhaps the best solution would be to provide a start of line
and a separator string to the constructor, then:

    (*ostr_ptr)
        << (inLineCount == 0 ? startString : separString)
        << std::setw( 2 ) << (unsigned int)( ch ) ;
    ++ inLineCount ;
    if ( inLineCount == lineLength ) {
        (*ostr_ptr) << endString ;
        inLineCount = 0 ;
    }

(This supposes that hex and fill were set in the constructor.)
Given the copying that's going on, I'd also simulate move
semantics, so that the final destructor could do something like:

    if ( inLineCount != 0 ) {
        (*ostr_ptr) << endString ;
    }
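
Putting those suggestions together, a self-contained sketch might look like the following. It uses a plain function instead of a functor, which avoids the copying issue altogether; the name hex_dump, its parameter names, and the default strings are assumptions for illustration, not the code from either post.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <ios>
#include <istream>
#include <iterator>
#include <ostream>
#include <string>

// Writes every byte of `in` to `out` as two hex digits, `per_line` bytes
// per line, using caller-supplied start, separator and end strings.
void hex_dump(std::istream& in, std::ostream& out, std::size_t per_line,
              const std::string& start = "", const std::string& sep = " ",
              const std::string& end = "\n")
{
    assert(per_line > 0);
    const std::ios::fmtflags old_flags = out.flags();
    const char old_fill = out.fill('0');
    out << std::hex;                    // set once, not once per byte
    std::size_t in_line = 0;
    std::istreambuf_iterator<char> it(in), eof;
    for (; it != eof; ++it) {
        const unsigned int value = static_cast<unsigned char>(*it);
        out << (in_line == 0 ? start : sep) << std::setw(2) << value;
        if (++in_line == per_line) {
            out << end;
            in_line = 0;
        }
    }
    if (in_line != 0) {
        out << end;                     // finish a partial last line only
    }
    out.fill(old_fill);                 // leave the stream as we found it
    out.flags(old_flags);
}
```

Because the end string is written only when a line is actually open, a byte count that is an exact multiple of the line length does not produce a spare blank line.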

> }
> }
> };



> int main ( int argn, char ** args ) {
> assert( argn == 2 );
> std::ifstream in ( args[1] );
> std::for_each( std::istreambuf_iterator< char >( in ),
> std::istreambuf_iterator< char >(),
> print_hex( std::cout, 25 ) );


Unless you're doing something relatively generic, with support
for different separators, etc., this really looks like a case of
for_each abuse.

> std::cout << '\n';


Which results in one new line too many if the number of elements
just happened to be an exact multiple of the line length.

About the only real use for this sort of output I've found is
debugging or experimenting, but there, I use it often enough
that I've a generic Dump<T> class (and a generic function which
returns it, for automatic type deduction), so that I can write
things like:

std::cout << dump( someObject ) << std::endl ;

The code that ends up getting called in the << operator is:

    IOSave saver( dest ) ;
    dest.fill( '0' ) ;
    dest.setf( std::ios::hex, std::ios::basefield ) ;
    char const* baseStr = "" ;
    if ( (dest.flags() & std::ios::showbase) != 0 ) {
        baseStr = "0x" ;
        dest.unsetf( std::ios::showbase ) ;
    }
    unsigned char const* const
                        end = myObj + sizeof( T ) ;
    for ( unsigned char const* p = myObj ; p != end ; ++ p ) {
        if ( p != myObj ) {
            dest << ' ' ;
        }
        dest << baseStr << std::setw( 2 ) << (unsigned int)( *p ) ;
    }

(Note that there's extra code there to support my personal
preference: a "0x" with a small x, even if std::ios::uppercase
is specified.)
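
Since the surrounding class is not shown, here is a hypothetical reconstruction of how such a Dump<T> and its dump() helper could be put together. The member name bytes_ is invented, the IOSave class is replaced by saving and restoring the flags by hand, and the "0x" prefix handling is omitted to keep the sketch short.

```cpp
#include <cstddef>
#include <iomanip>
#include <ios>
#include <ostream>

// Hypothetical reconstruction: refers to the raw bytes of one object.
template <typename T>
class Dump {
public:
    explicit Dump(const T& obj)
        : bytes_(reinterpret_cast<const unsigned char*>(&obj)) {}

    friend std::ostream& operator<<(std::ostream& dest, const Dump& d)
    {
        // Saved and restored by hand; the IOSave class from the post is
        // not shown there, so it is not reproduced here.
        const std::ios::fmtflags old_flags = dest.flags();
        const char old_fill = dest.fill('0');
        dest.setf(std::ios::hex, std::ios::basefield);
        dest.unsetf(std::ios::showbase);
        for (std::size_t i = 0; i != sizeof(T); ++i) {
            if (i != 0) {
                dest << ' ';
            }
            dest << std::setw(2) << static_cast<unsigned int>(d.bytes_[i]);
        }
        dest.fill(old_fill);
        dest.flags(old_flags);
        return dest;
    }

private:
    const unsigned char* bytes_;
};

// Helper so that T is deduced from the argument.
template <typename T>
Dump<T> dump(const T& obj)
{
    return Dump<T>(obj);
}
```

The Dump object only stores a pointer, so it must not outlive the object it refers to; using it directly inside one output expression, as in the example above, is safe.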

> }
> > The issue is the c++ read function doesn't return number of
> > bytes read... so on the last read into a buffer how do you
> > know how many characters to print?


> Have a look at readsome().


Yes, have a look at it. Read its specification very carefully,
because if you do, you'll realize that it is absolutely
worthless here.

The function he's looking for is istream::gcount(), which
returns the number of bytes read by the last unformatted read.
His basic loop would be:

    while ( input.read( &buffer[ 0 ], buffer.size() ) ) {
        process( buffer.begin(), buffer.end() ) ;
    }
    process( buffer.begin(), buffer.begin() + input.gcount() ) ;

(But IMHO, istream really isn't appropriate for binary; if I'm
really working with a binary file, I'll drop down to the system
API.)

 
 
James Kanze
 
      06-21-2008
On Jun 21, 3:59 am, "Eric Pruneau" <(E-Mail Removed)> wrote:
> "vizzz" <(E-Mail Removed)> wrote in message news:
> (E-Mail Removed)...


> > i need to find an hex pattern like 0x650A1010 in a binary file.
> > i can make a small algorithm that fetch all the file for the match,
> > but this file is huge, and i'm scared about performances.
> > Is there any stl method for a fast search?
> > Andrea


> Check out boost::regex


Which requires a forward iterator, and so can't be used on data
in a file (for which he'll have at best an input iterator).

Also, if he's only looking for a fixed string, it's likely to be
significantly slower than some other algorithms.

> http://www.boost.org/doc/libs/1_35_0...tml/index.html


 
 
vizzz
 
      06-21-2008
On 21 Jun, 12:26, James Kanze <(E-Mail Removed)> wrote:
> On Jun 21, 3:59 am, "Eric Pruneau" <(E-Mail Removed)> wrote:
>
> > "vizzz" <(E-Mail Removed)> wrote in message news:
> > (E-Mail Removed)...
> > > i need to find an hex pattern like 0x650A1010 in a binary file.
> > > i can make a small algorithm that fetch all the file for the match,
> > > but this file is huge, and i'm scared about performances.
> > > Is there any stl method for a fast search?
> > > Andrea

> > Check out boost::regex

>
> Which requires a forward iterator, and so can't be used on data
> in a file (for which he'll have at best an input iterator).
>
> Also, if he's only looking for a fixed string, it's likely to be
> significantly slower than some other algorithms.


Maybe explaining my goal will help.
JPEG 2000 files (JP2) contain several boxes, each made of a 4-byte length,
a 4-byte type, and then data.
I must check whether a box exists by searching the file (boxes
can be anywhere in the whole file) for the box type (e.g. 0x650A1010).
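
Given that layout, one alternative to a raw byte search is to hop from box to box using the length field, so that data bytes which happen to contain the type value are never mistaken for a box. The sketch below makes several assumptions that are not stated in this thread: the fields are big-endian, the length counts the 8 header bytes, and only boxes laid end to end at the top level are visited (nested boxes are not descended into). The JP2 format also gives the length values 0 and 1 special meanings, which this sketch does not handle. The names read_be32 and has_box are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads a 32-bit value stored most significant byte first.
std::uint32_t read_be32(const unsigned char* p)
{
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}

// Walks the boxes laid end to end in `file` and reports whether any of
// them has the given type. Nested boxes are not descended into.
bool has_box(const std::vector<unsigned char>& file, std::uint32_t type)
{
    std::size_t pos = 0;
    while (file.size() - pos >= 8) {          // room for length + type
        const std::uint32_t length = read_be32(&file[pos]);
        if (read_be32(&file[pos + 4]) == type) {
            return true;
        }
        // A length below 8 cannot cover its own header, and one that runs
        // past the end is malformed. This sketch stops in both cases.
        if (length < 8 || length > file.size() - pos) {
            return false;
        }
        pos += length;                        // jump to the next box
    }
    return false;
}
```

Only 8 bytes per box are inspected, so this touches far less of a huge file than any search algorithm would, at the cost of needing the box structure to be intact.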
 
 
Kai-Uwe Bux
 
      06-21-2008
James Kanze wrote:

> On Jun 21, 2:13 am, Kai-Uwe Bux <(E-Mail Removed)> wrote:
>> Ivan wrote:
>> > On Jun 20, 1:11 pm, vizzz <(E-Mail Removed)> wrote:
>
>> > Hmmm... I had a look at this and ran across a simple
>> > problem. How do you read a binary file and just echo the
>> > hex for each byte to the screen?
>
[snip]
>> > The issue is the c++ read function doesn't return number of
>> > bytes read... so on the last read into a buffer how do you
>> > know how many characters to print?
>
>> Have a look at readsome().
>
> Yes, have a look at it. Read its specification very carefully,
> because if you do, you'll realize that it is absolutely
> worthless here.


I reread it. I still fail to see why it's worthless. Obviously, I am missing
something.

> The function he's looking for is istream::gcount(), which
> returns the number of bytes read by the last unformatted read.
> His basic loop would be:
>
> while ( input.read( &buffer[ 0 ], buffer.size() ) ) {
> process( buffer.begin(), buffer.end() ) ;
> }
> process( buffer.begin(), buffer.begin() + input.gcount() ) ;


On the other hand, that looks very clean.


Best

Kai-Uwe
 