Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > Performance of hash_set vs. Java

Reply
Thread Tools

Performance of hash_set vs. Java

 
 
Alex Gerdemann
Guest
Posts: n/a
 
      10-11-2004
Hello,

I have spent a bunch of time converting a Java program I wrote to C++ in
order to improve performance, and have found that it is not necessarily
faster.

Specifically, I'm writing a circuit simulator, so I must first parse the
netlist file. The program goes through an input file, and makes a hash_set
of unique nodes (std::string's). The words are then sorted and numbered by
copying the hash_set into a vector and sorting. Then, I run a series of
binary searches on the vector to allocate lines in a matrix for each
element.

I've modified STL's lower_bound() to my getIndex() which functions like
Java's equivalent by returning negative indices when the element is not
found.

Also, Java provides a HashSet class that can hash any kind of element it may
contain. Since, the STL only hashes selected items I had to convert my
std::strings to const char*'s for hashing.

Other than that, my code in the Java and C++ is quite similar. However, the
parse of a 100k line file took around a second for Java, but more like 17s
for C++ . In fact, using std::set instead of hash_set actually improved
performance to about 15s in C++.

Any idea why there would be such a difference? In particular, I suspect
that my hash function may be slow from the call to c_str(). Does this have
to allocate new memory or just point to somewhere inside the object? I'm
kind of curious what Java actually uses for its hash. Any general
performace/other criticisms on my scheme are certainly welcome.

On the plus side there is a great direct sparse linear solver for C. I've
only been able to find a iterative sparse solver for Java. This makes the
C++ code overall much faster than Java, but I suspect that the parse stage
should run at least as fast or faster.

Thanks,

Alex Gerdemann
University of Illinois at Urbana-Champaign

For reference, I've included a sketch of the code

So I did something like the following:

struct hashSet {
__gnu_css::hash<const char*> h;

bool operator()(std::string& s) const {
return h(s.ctr());
}
};

inline int getIndex(const std::vector<std::string>& list, const std::string&
item) {
std::vector<std::string>::const_iterator i =
std::lower_bound(list.begin(), list.end(), item);
if(i==list.end()) {
return -(static_cast<int>(i-list.begin())+1);
} else if(*i != item) {
return -(static_cast<int>(i-list.begin())+1);
}
return static_cast<int>(i-list.begin());
}

__gnu_cxx::hash_set<std::string, hashString> nodeSet;
std::vector<std::string> nodeVector;

while(!done) {
//read a line of file
//split the line into words and push into a vector
//check for syntax errors
nextNode = some word in line;
nodeSet.insert(nextNode);
}
std::copy(nodeSet.begin(),node.end(),std::back_ins erter(nodeVector));
std::sort(nodeVector.begin(),nodeVector.end());
while(!done) {
//read line from previously stored words
//more error checking
thisNode = some word in line
getIndex(nodeVector, thisNode);
//allocate an object representing element containing proper indices
representing nodes
//push pointer to element into a vector
}


 
Reply With Quote
 
 
 
 
=?ISO-8859-1?Q?=22Nils_O=2E_Sel=E5sdal=22?=
Guest
Posts: n/a
 
      10-11-2004
Alex Gerdemann wrote:
> Hello,
> struct hashSet {
> __gnu_css::hash<const char*> h;
>
> bool operator()(std::string& s) const {
> return h(s.ctr());
> }
> };
>
> inline int getIndex(const std::vector<std::string>& list, const std::string&
> item) {
> std::vector<std::string>::const_iterator i =
> std::lower_bound(list.begin(), list.end(), item);
> if(i==list.end()) {
> return -(static_cast<int>(i-list.begin())+1);
> } else if(*i != item) {
> return -(static_cast<int>(i-list.begin())+1);
> }
> return static_cast<int>(i-list.begin());
> }
>
> __gnu_cxx::hash_set<std::string, hashString> nodeSet;
> std::vector<std::string> nodeVector;

When you use std::string , you will have *alot* of string copying
going on. Java will just pass its "refrece"/"pointer" along.
Try convert lists, etc. to hold std::string* (pointers )rather

 
Reply With Quote
 
 
 
 
Tom Widmer
Guest
Posts: n/a
 
      10-11-2004
On Mon, 11 Oct 2004 11:56:38 GMT, "Alex Gerdemann"
<(E-Mail Removed)> wrote:

>Hello,
>
>I have spent a bunch of time converting a Java program I wrote to C++ in
>order to improve performance, and have found that it is not necessarily
>faster.


No, it probably won't be unless you use a better class than
std::string to hold the strings. java.lang.String is in several ways
more efficient than typical implementations of std::string.

>Specifically, I'm writing a circuit simulator, so I must first parse the
>netlist file. The program goes through an input file, and makes a hash_set
>of unique nodes (std::string's). The words are then sorted and numbered by
>copying the hash_set into a vector and sorting. Then, I run a series of
>binary searches on the vector to allocate lines in a matrix for each
>element.
>
>I've modified STL's lower_bound() to my getIndex() which functions like
>Java's equivalent by returning negative indices when the element is not
>found.
>
>Also, Java provides a HashSet class that can hash any kind of element it may
>contain. Since, the STL only hashes selected items I had to convert my
>std::strings to const char*'s for hashing.


The main benefit Java has in hashing is that Strings cache their
hashcodes, so they need only be calculated once. std::string knows
nothing about hashing, and therefore doesn't perform this
optimization.

>Other than that, my code in the Java and C++ is quite similar. However, the
>parse of a 100k line file took around a second for Java, but more like 17s
>for C++ . In fact, using std::set instead of hash_set actually improved
>performance to about 15s in C++.


How long did it take just to read in the file? I don't know much about
GCC's hash_set, but I assume it is configurable for bucket size, etc.
You might want to tweak this. What version of GCC are you using?

>Any idea why there would be such a difference? In particular, I suspect
>that my hash function may be slow from the call to c_str(). Does this have
>to allocate new memory or just point to somewhere inside the object?


Sometimes is has to do the former, due to reference counting problems,
but usually it just returns a pointer to the storage. Tracing in using
the debugger would tell you what's going on in the hash calls.

I'm
>kind of curious what Java actually uses for its hash. Any general
>performace/other criticisms on my scheme are certainly welcome.


Java just iterates over the whole String, generating the hash value,
and then caches it for future calls.

>On the plus side there is a great direct sparse linear solver for C. I've
>only been able to find a iterative sparse solver for Java. This makes the
>C++ code overall much faster than Java, but I suspect that the parse stage
>should run at least as fast or faster.
>
>Thanks,
>
>Alex Gerdemann
>University of Illinois at Urbana-Champaign
>
>For reference, I've included a sketch of the code
>
>So I did something like the following:
>
>struct hashSet {
> __gnu_css::hash<const char*> h;
>
> bool operator()(std::string& s) const {
> return h(s.ctr());
> }


I doubt that is your bottleneck, but if the strings are long, then
Java's hashcode caching might be giving it a major advantage.

>};
>
>inline int getIndex(const std::vector<std::string>& list, const std::string&
>item) {
> std::vector<std::string>::const_iterator i =
>std::lower_bound(list.begin(), list.end(), item);
> if(i==list.end()) {
> return -(static_cast<int>(i-list.begin())+1);
> } else if(*i != item) {
> return -(static_cast<int>(i-list.begin())+1);
> }
> return static_cast<int>(i-list.begin());
>}
>
>__gnu_cxx::hash_set<std::string, hashString> nodeSet;
>std::vector<std::string> nodeVector;
>
>while(!done) {
> //read a line of file
> //split the line into words and push into a vector
> //check for syntax errors
> nextNode = some word in line;
> nodeSet.insert(nextNode);


The above code may be where your main bottleneck is. How do you read
the line and then split it into words? If you are creating a number of
temporary strings, that will be slowing you down. You might consider
using a fixed vector or two and reusing them, to avoid temporaries.

>}


Here you definitely want:
nodeVector.reserve(nodeSet.size());

>std::copy(nodeSet.begin(),node.end(),std::back_in serter(nodeVector));
>std::sort(nodeVector.begin(),nodeVector.end());


Both of those operations should be fast if you are using a reference
counted string class. I believe GCC 3+ uses such a beast.

>while(!done) {
> //read line from previously stored words
> //more error checking
> thisNode = some word in line
> getIndex(nodeVector, thisNode);
> //allocate an object representing element containing proper indices
>representing nodes
> //push pointer to element into a vector
>}


Overall, I think your best bet would be to benchmark smaller bits of
the code or put it through a profiler (-gprof IIRC, which I probably
don't), to find out where the bottleneck actually lies.

Tom
 
Reply With Quote
 
Ivan Vecerina
Guest
Posts: n/a
 
      10-11-2004
"Alex Gerdemann" <(E-Mail Removed)> wrote in message
news:Wxuad.369933$Fg5.53609@attbi_s53...
> I have spent a bunch of time converting a Java program I wrote to C++ in
> order to improve performance, and have found that it is not necessarily
> faster.

This can often be the case. Some native Java features can be more efficient
that the C++ equivalents (which sometimes are designed for more
flexibility).
This is especially true for C++ streams.

> Specifically, I'm writing a circuit simulator, so I must first parse the
> netlist file. The program goes through an input file, and makes a
> hash_set of unique nodes (std::string's). The words are then sorted and
> numbered by copying the hash_set into a vector and sorting. Then, I run a
> series of binary searches on the vector to allocate lines in a matrix for
> each element.

As someone already pointed out, copying strings can be expensive in C++.
An alternative would be to keep the originally read std::string,
then use only a "const char*" pointer obtained using the string::c_str()
member function (I'd do this rather than use a std::string*).

> I've modified STL's lower_bound() to my getIndex() which functions like
> Java's equivalent by returning negative indices when the element is not
> found.

This wrapper is an unnecessary Java-ism, but ok,
it should be irrelevant to performance.

> Also, Java provides a HashSet class that can hash any kind of element it
> may contain. Since, the STL only hashes selected items I had to convert
> my std::strings to const char*'s for hashing.

All C++ hash_set implementations I've seen allow users to provide
a custom hash function as an additional parameter. But this shouldn't
be a big problem.

> Other than that, my code in the Java and C++ is quite similar. However,
> the parse of a 100k line file took around a second for Java, but more like
> 17s for C++ . In fact, using std::set instead of hash_set actually
> improved performance to about 15s in C++.

Unfortunatly, hash_set is not yet in the C++ standard, so the explanation
has to be implementation-specific.
However, you may want to see if you can pre-allocate or predefine the size
of your hash table. Also, as previously pointed out, beware of std::string
instances being copied.

> Any idea why there would be such a difference? In particular, I suspect
> that my hash function may be slow from the call to c_str(). Does this
> have to allocate new memory or just point to somewhere inside the object?
> I'm kind of curious what Java actually uses for its hash. Any general
> performace/other criticisms on my scheme are certainly welcome.

Are you sure that the hashing is the performance bottleneck?
C++ i/o streams could also be a likely cause for the weak performance.

All in all, I am very confident that better performance can be achieved
in C++ than in Java. But in some cases, especially string and file i/o,
this can require extra work

Unfortunately, the code you posted shows very little about the file
reading code and hash table, so we can't really help there.
Regarding the vector and sorting only, I can say that using const char*
instead of std::string is likely to improve execution speed.

Cheers -Ivan
--
http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form


 
Reply With Quote
 
Alex Gerdemann
Guest
Posts: n/a
 
      10-12-2004
Tom wrote:
>>Other than that, my code in the Java and C++ is quite similar. However,
>>the
>>parse of a 100k line file took around a second for Java, but more like 17s
>>for C++ . In fact, using std::set instead of hash_set actually improved
>>performance to about 15s in C++.

>
> How long did it take just to read in the file? I don't know much about
> GCC's hash_set, but I assume it is configurable for bucket size, etc.
> You might want to tweak this. What version of GCC are you using?
>


I'm using Cygwin's special version of GCC 3.3.3. I can't really seperate
how much time is spent just reading the file, as I process it line by line,
doing some work between each I/O call.

>>while(!done) {
>> //read a line of file
>> //split the line into words and push into a vector
>> //check for syntax errors
>> nextNode = some word in line;
>> nodeSet.insert(nextNode);

>
> The above code may be where your main bottleneck is. How do you read
> the line and then split it into words? If you are creating a number of
> temporary strings, that will be slowing you down. You might consider
> using a fixed vector or two and reusing them, to avoid temporaries.
>


Since you asked here's exactly what I did:

std::vector< std::vector< std::string > > lines;
std::string line;
std::vector<std::string> thisLine;

do {
std::getline(file,line);
if (line.length() != 0) {
split(line,thisLine);
lines.push_back(thisLine);
//process the line
} while(!netlist.eof());

inline void split(const std::string& line, std::vector<std::string>& words)
{
unsigned int firstMark = 0, lastMark = 0;
words.clear();
while(lastMark!=std::string::npos) {
firstMark=line.find_first_not_of(" ",lastMark);
if(firstMark==std::string::npos) break;
lastMark=line.find_first_of(" ",firstMark);
words.push_back(line.substr(firstMark,lastMark-firstMark));
}
}


>>}

>
> Here you definitely want:
> nodeVector.reserve(nodeSet.size());
>
>>std::copy(nodeSet.begin(),node.end(),std::back_i nserter(nodeVector));
>>std::sort(nodeVector.begin(),nodeVector.end()) ;

>
> Both of those operations should be fast if you are using a reference
> counted string class. I believe GCC 3+ uses such a beast.
>


I added the reserve call which saved me a couple of tenths of a second.
Nice, but must not be the major bottleneck. What is a "reference counted"
class?

> Overall, I think your best bet would be to benchmark smaller bits of
> the code or put it through a profiler (-gprof IIRC, which I probably
> don't), to find out where the bottleneck actually lies.


I went ahead and tried this, but can't quite understand the results. The
total time it computes is way off. (It says its 8s vs. the actual 14s).
Also, it thinks that only 30% of run time was spent in the main() function.
This can't possibly be right.

-Alex Gerdemann
University of Illinois Urbana-Champaign


 
Reply With Quote
 
Alex Gerdemann
Guest
Posts: n/a
 
      10-12-2004
Ivan wrote:
>> I've modified STL's lower_bound() to my getIndex() which functions like
>> Java's equivalent by returning negative indices when the element is not
>> found.

> This wrapper is an unnecessary Java-ism, but ok,
> it should be irrelevant to performance.


Well the specific negative value returned is unnecessary, but I do need to
know if the string isn't found. It's kind of annoying that the function
doesn't return this information because to find it myself I have to:

1) check that the returned iterator does not point to set.end()
2) check that the iterator points to what it claims.

Certainly, the search algorithm knows internally if it found what it was
looking for. Having to do that job again will certainly cost at least some
time.

The reason the node may not be in the list has to do with the details of
circuit analysis. Specifically, the matrix describing the system should not
include a row or column representing the ground node. So, my program
collects all the nodes the user defines, searches for the ground node, and
erases it from the list. That way, I can sort the list, and use the index
of each node to allocate a line in the matrix. Since I've collected all the
nodes, I know that if a particular node isn't found in the list, it must be
the ground node. It's kind of an ugly mechanism, I guess, but I currently
don't have a better idea.

>> Also, Java provides a HashSet class that can hash any kind of element it
>> Other than that, my code in the Java and C++ is quite similar. However,
>> the parse of a 100k line file took around a second for Java, but more
>> like 17s for C++ . In fact, using std::set instead of hash_set actually
>> improved performance to about 15s in C++.

> Unfortunatly, hash_set is not yet in the C++ standard, so the explanation
> has to be implementation-specific.
> However, you may want to see if you can pre-allocate or predefine the size
> of your hash table. Also, as previously pointed out, beware of std::string
> instances being copied.


I would like to convert the vector to store pointers to strings, rather than
the strings themselves, but then I cannot search use the built in sort, and
binary searches to find a particular string efficiently.

> Unfortunately, the code you posted shows very little about the file
> reading code and hash table, so we can't really help there.
> Regarding the vector and sorting only, I can say that using const char*
> instead of std::string is likely to improve execution speed.


I didn't write my own hash table. I though hash_set handled this problem on
its own. Not coming from a CS background, I don't know how to write a good
hash scheme, and it seems like this is the sort of thing that should be
provided by the built in libraries. Since it works faster, for now, I've
switched back to the regular (tree?) set. I want to use a set so repeat
copies of nodes will not be duplicated when added to the set. I actually
don't do any of the searches until after the set is copied into a vector.
Given that, should I actually expect the hash_set to have a faster insertion
time? Not having a CS background, I just read Java's documentation which
says that a hash set is faster in most cases.

On the I/O code, I posted this in my other reply, but here's another copy:

std::vector< std::vector< std::string > > lines;
std::string line;
std::vector<std::string> thisLine;

do {
std::getline(file,line);
if (line.length() != 0) {
split(line,thisLine);
lines.push_back(thisLine);
//process the line
} while(!netlist.eof());

inline void split(const std::string& line, std::vector<std::string>& words)
{
unsigned int firstMark = 0, lastMark = 0;
words.clear();
while(lastMark!=std::string::npos) {
firstMark=line.find_first_not_of(" ",lastMark);
if(firstMark==std::string::npos) break;
lastMark=line.find_first_of(" ",firstMark);
words.push_back(line.substr(firstMark,lastMark-firstMark));
}
}

Thanks for the tips,

-Alex Gerdemann
University of Illinois Urbana-Champaign


 
Reply With Quote
 
Ivan Vecerina
Guest
Posts: n/a
 
      10-12-2004
"Alex Gerdemann" <(E-Mail Removed)> wrote in message
news:%WTad.234454$D%.43700@attbi_s51...
> Ivan wrote:
>>> I've modified STL's lower_bound() to my getIndex() which functions like
>>> Java's equivalent by returning negative indices when the element is not
>>> found.

>> This wrapper is an unnecessary Java-ism, but ok,
>> it should be irrelevant to performance.

>
> Well the specific negative value returned is unnecessary, but I do need to
> know if the string isn't found. It's kind of annoying that the function
> doesn't return this information because to find it myself I have to:
>
> 1) check that the returned iterator does not point to set.end()
> 2) check that the iterator points to what it claims.
>
> Certainly, the search algorithm knows internally if it found what it was
> looking for. Having to do that job again will certainly cost at least
> some
> time.

Actually lower_bound may will not test that the hit value is actually
equal to the one being looked for (because it only uses '<' for comparison)
but I agree the interface is cumbersome.
std::equal_range could be an alternative...

>>> Also, Java provides a HashSet class that can hash any kind of element it
>>> Other than that, my code in the Java and C++ is quite similar. However,
>>> the parse of a 100k line file took around a second for Java, but more
>>> like 17s for C++ . In fact, using std::set instead of hash_set actually
>>> improved performance to about 15s in C++.

>> Unfortunatly, hash_set is not yet in the C++ standard, so the explanation
>> has to be implementation-specific.
>> However, you may want to see if you can pre-allocate or predefine the
>> size
>> of your hash table. Also, as previously pointed out, beware of
>> std::string
>> instances being copied.

>
> I would like to convert the vector to store pointers to strings, rather
> than
> the strings themselves, but then I cannot search use the built in sort,
> and
> binary searches to find a particular string efficiently.


Actually you can, but you will need to provide these algorithms with an
additional parameter, which is a comparison function:

// add this declaration
struct StrPtrCompare
{
bool operator()(char const* a, char const* b) const
{ return std::strcmp( a, b ) < 0; }
};

//and from your function call call:
std::sort( vect.begin(), vect.end(), StrPtrCompare() );


>> Unfortunately, the code you posted shows very little about the file
>> reading code and hash table, so we can't really help there.
>> Regarding the vector and sorting only, I can say that using const char*
>> instead of std::string is likely to improve execution speed.

>
> I didn't write my own hash table.


I did not suggest you should. Just some implementations of hash_set
have an equivalent of vector::reserve() to preallocate a larger
table (and avoid later reallocations). But no big deal...

> copies of nodes will not be duplicated when added to the set. I actually
> don't do any of the searches until after the set is copied into a vector.
> Given that, should I actually expect the hash_set to have a faster
> insertion
> time? Not having a CS background, I just read Java's documentation which
> says that a hash set is faster in most cases.

What may be faster is to first copy and sort everything into a vector,
then look for (contiguous) duplicate items and remove them.
[ since you need a sorted vector anyway, the hash may be redundant ]

> On the I/O code, I posted this in my other reply, but here's another copy:
>
> std::vector< std::vector< std::string > > lines;


Unfortunately, such a data structure can be inefficient in C++,
and involve many object copies an memory allocations.
But there are a few tricks that can help....

> std::string line;
> std::vector<std::string> thisLine;
>
> do {
> std::getline(file,line);
> if (line.length() != 0) {
> split(line,thisLine);
> lines.push_back(thisLine);


Instead of the last two lines, the following will avoid
an intermediate copy of the objects and sub-objects:
lines.push_back( std::vector<std::string>() );
split( line, lines.back() );

Also, it will help if you call
lines.reserve( someGuessOfTheNumberOfInputLines );
prior to reading the file.
Alternatively, you could use a different type:
std::vector< std::vector< std::string > > lines;
// will be faster on some platforms.

> //process the line
> } while(!netlist.eof());
>
> inline void split(const std::string& line, std::vector<std::string>&
> words)
> {
> unsigned int firstMark = 0, lastMark = 0;
> words.clear();
> while(lastMark!=std::string::npos) {
> firstMark=line.find_first_not_of(" ",lastMark);
> if(firstMark==std::string::npos) break;
> lastMark=line.find_first_of(" ",firstMark);
> words.push_back(line.substr(firstMark,lastMark-firstMark));
> }
> }


Now not talking about efficiency, all the input code above probably
could be simplified as follows (including <sstream> and <iterator>):

while( std::getline(file,line) )
{
lines.push_back( std::vector<std::string>() ); // add empty object
lines.back().assign(
std::istream_iterator<std::string>( istringstream(line) )
std::istream_iterator<std::string>() );
}

This is the C++ input code I would start with.
[ for the rest, the tips above - use char* after this input code
and skip the hash if possible - should improve execution speed ]


If the reading code remains a critical bottleneck, the
ultimate solution would be to:
1) skip iostreams, and use memory mapping or a single fread
to bring the whole file into memory
2) parse the file in-place, and add null chars at the end of
each 'word', so I can use a simple char* to access all
strings in-place, without any memory allocation.
That would take me an extra day of programming and be less
portable/maintainable, but be blazingly fast...



Well... I hope some of this stuff will be useful.
[I'm a bit in a rush]

Cheers,
Ivan
--
http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form


 
Reply With Quote
 
Ivan Vecerina
Guest
Posts: n/a
 
      10-13-2004
I (Ivan Vecerina) wrote in message news:ckh97b$sna$(E-Mail Removed)...
> > copies of nodes will not be duplicated when added to the set. I

actually
> > don't do any of the searches until after the set is copied into a

vector.
> > Given that, should I actually expect the hash_set to have a faster
> > insertion
> > time? Not having a CS background, I just read Java's documentation

which
> > says that a hash set is faster in most cases.

> What may be faster is to first copy and sort everything into a vector,
> then look for (contiguous) duplicate items and remove them.
> [ since you need a sorted vector anyway, the hash may be redundant ]

NB: however this only makes sense if there are few duplicate strings to
detect.

> Well... I hope some of this stuff will be useful.

And of course it is difficult to suggest solutions
without seeing the big picture...

Regards
-Ivan
--
http://ivan.vecerina.com/contact/?subject=NG_POST <- e-mail contact form


 
Reply With Quote
 
Tom Widmer
Guest
Posts: n/a
 
      10-13-2004
On Tue, 12 Oct 2004 16:28:16 GMT, "Alex Gerdemann"
<(E-Mail Removed)> wrote:

>I'm using Cygwin's special version of GCC 3.3.3. I can't really seperate
>how much time is spent just reading the file, as I process it line by line,
>doing some work between each I/O call.


Ok, I'm 95% sure that gcc 3.3 uses a copy-on-write (COW) string
implementation, with reference counting (gcc 3.4 certainly does). This
means that copying a string doesn't require that any memory is
allocated - the copy shares representation with the original and only
creates its own unique copy when it might modify it. This also means
that copying strings is cheap, as long as those strings are only
accessed through const member functions after the copy has occurred.

>>>while(!done) {
>>> //read a line of file
>>> //split the line into words and push into a vector
>>> //check for syntax errors
>>> nextNode = some word in line;
>>> nodeSet.insert(nextNode);

>>
>> The above code may be where your main bottleneck is. How do you read
>> the line and then split it into words? If you are creating a number of
>> temporary strings, that will be slowing you down. You might consider
>> using a fixed vector or two and reusing them, to avoid temporaries.
>>

>
>Since you asked here's exactly what I did:
>
>std::vector< std::vector< std::string > > lines;


You definitely want:
lines.reserve(estimateOfNumberOfLines);

>std::string line;
>std::vector<std::string> thisLine;
>
>do {


line.reserve(enoughForALine); //may well help

> std::getline(file,line);
> if (line.length() != 0) {
> split(line,thisLine);
> lines.push_back(thisLine);


The above line is going to be slow, since it involves copying the
whole vector. Instead, you might do:
//avoid coping the vector:
lines.resize(lines.size() + 1); //add extra default element
lines.back().swap(thisLine); //swap it with the current element

> //process the line
>} while(!netlist.eof());
>
>inline void split(const std::string& line, std::vector<std::string>& words)
>{
> unsigned int firstMark = 0, lastMark = 0;
> words.clear();


words.reserve(50); //say

> while(lastMark!=std::string::npos) {
> firstMark=line.find_first_not_of(" ",lastMark);
> if(firstMark==std::string::npos) break;
> lastMark=line.find_first_of(" ",firstMark);
> words.push_back(line.substr(firstMark,lastMark-firstMark));


That might be slightly more efficient as:

words.push_back(std::string(line, firstMark, lastMark - firstMark));

but I doubt it will make much difference with a COW string
implementation.

> }
>}
>
>
>>>}

>>
>> Here you definitely want:
>> nodeVector.reserve(nodeSet.size());
>>
>>>std::copy(nodeSet.begin(),node.end(),std::back_ inserter(nodeVector));
>>>std::sort(nodeVector.begin(),nodeVector.end() );

>>
>> Both of those operations should be fast if you are using a reference
>> counted string class. I believe GCC 3+ uses such a beast.
>>

>
>I added the reserve call which saved me a couple of tenths of a second.
>Nice, but must not be the major bottleneck. What is a "reference counted"
>class?


See above.

>
>> Overall, I think your best bet would be to benchmark smaller bits of
>> the code or put it through a profiler (-gprof IIRC, which I probably
>> don't), to find out where the bottleneck actually lies.

>
>I went ahead and tried this, but can't quite understand the results. The
>total time it computes is way off. (It says its 8s vs. the actual 14s).
>Also, it thinks that only 30% of run time was spent in the main() function.
>This can't possibly be right.


It might not be counting time spent in system functions, such as IO
(system vs user time?).

If you want to optimize the above, I'd do this:

Write a simple immutable string type that is initialized with a
pointer and a length, but allocate no memory and does nothing in the
destructor. Include in the string a cached hashcode value, so that the
hashcode need only be calculated once. e.g.

class mystring
{
char const* m_ptr;
std::size_t m_length;
mutable unsigned long m_hashCode;
public:
mystring(char const* ptr, std::size_t length)
:m_ptr(ptr), m_length(length),
m_hashCode(static_cast<unsigned long>(-1))
{
}

unsigned long hashCode() const
{
if (m_hashCode == static_cast<unsigned long>(-1))
{
//calculate hashCode (copy java.lang.String code?)
}
return m_hashCode;
}

char operator[](std::size_t index) const
{
return m_ptr[index];
}

//operator==, <, etc.
//compiler generated destructor, copy, assignment are fine.
};

Read the entire file into a vector<char>.

Iterate over the vector, adding creating "mystring"s pointing into the
vector for each word. You might also consider replacing ' ' characters
with '\0's, so that you can add c_str() method to mystring that simply
looks like this:
char const* c_str() const
{
return m_ptr;
}

Operate on this new vector<vector<mystring> >.

That should be much much faster, since the memory allocation overhead
will be vastly decreased. If you don't need a vector of lines, just
have an overall vector of words for another speed up. Essentially,
optimization in non-numerical C++ is often about reducing the number
of calls to "new" and "delete", which are often even slower than the
Java versions (new + gc).

Tom
 
Reply With Quote
 
Karl Heinz Buchegger
Guest
Posts: n/a
 
      10-13-2004
Alex Gerdemann wrote:
>

[snip]
> Other than that, my code in the Java and C++ is quite similar. However, the
> parse of a 100k line file took around a second for Java, but more like 17s
> for C++ . In fact, using std::set instead of hash_set actually improved
> performance to about 15s in C++.


Hmm. Just a quick question:
Are you sure the C++ optimizer run over your code and you are not
timing a debug version?

Especially with container templates the optimizer can often dramatically
reduce execution time.

--
Karl Heinz Buchegger
http://www.velocityreviews.com/forums/(E-Mail Removed)
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
hash_set was not declared even I #include<hash_set> eric C++ 3 07-12-2011 10:27 AM
STL hash_set problem, SGI impl Timo Qvist C++ 1 11-18-2004 03:04 PM
using hash_set in gcc3.3 Bart Blommerde C++ 5 10-13-2004 01:51 PM
hash_set : need the location /index where stored . Abhijit Ray C++ 1 04-27-2004 05:23 AM
template-id `hash_set<Tool*>' used as a declarator Vasileios C++ 4 11-04-2003 01:46 PM



Advertisments