Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > C Containers Library vs STL

Reply
Thread Tools

C Containers Library vs STL

 
 
jacob navia
Guest
Posts: n/a
 
      08-03-2011
Hi

I would like to compare the C containers library (written in C for C
programmers) against the STL.

Here is the code for the CCL. Maybe one C++ wizard would solve the
same problem using the STL?

I would be very ingterested in comparing code size, complexity, etc.


Thanks in advance, and here is the C part:
--------------------------------------------------------------
Unique
------

Given a text file, print in standard output the lines that are
unique in it, i.e. filtering all duplicated lines.

Algorithm:
---------

Normally this involves keeping a sorted list/array of lines
and testing if a line is in the set or not.

Solution using the CCL.
----------------------

1 #include <containers.h>
2 int main(int argc,char *argv[])
3 {
4 FILE *f;
5 int i=1,r;
6 Dictionary *dict;
7 char buf[8192];
8
9 if (argc < 2) {
10 fprintf(stderr,"%s <file name>\n",argv[0]);
11 return -1;
12 }
13 f = fopen(argv[1],"r");
14 if (f == NULL)
15 return -1;
16 dict = iDictionary.Create(0,500);
17 if (dict == NULL)
18 return -1;
19 while (fgets(buf,sizeof(buf),f)) {
20 r= iDictionary.Add(dict,buf,NULL);
21 if (r > 0)
22 printf("[%3d] %s",i,buf);
23 else if (r < 0) break;
24 i++;
25 }
26 iDictionary.Finalize(dict);
27 fclose(f);
28 }

Algorithm
---------
A hash table will be used to determine if a line is a duplicate
or not.

Commentary
----------
We use the following local variables (lines 4-7):
Name Usage
f Input stream bound to the file to read
i Counter for lines read
r Result of adding a line
dict Dictionary (Hash table)
buf Line buffer limited to 8K per line

Lines 9-15 are concerned with opening the input file, with some error
checking.

In line 16 we create a dictionary, requesting a size of zero for the
data associated with the key since we aren't storing any data, just the
key, and we suppose that the table will contain more or less 500
entries. If the file contains much more lines performance could
suffer but the algorithm would still work.

Lines 19-25 are the main loop of the program. We read each line into
the buffer and add it to then dictionary. If the "Add" API returns
a positive number the line wasn't there, if it returns zero the
line was already in the dictionary. If the result is negative it
is an error code and we stop the loop aborting the operation. Failure
can be provoked only by lack of memory.

If the result is positive we print the line.

Cleanup is performed in lines 26 and 27: we dispose of the dictionary
and close the file.

The C containers library web page is:
http://code.google.com/p/ccl/
You will find there the source code and the documentation.

-----------------------------------------------------------------------
 
Reply With Quote
 
 
 
 
Ian Collins
Guest
Posts: n/a
 
      08-03-2011
On 08/ 4/11 08:40 AM, jacob navia wrote:
> Hi
>
> I would like to compare the C containers library (written in C for C
> programmers) against the STL.
>
> Here is the code for the CCL. Maybe one C++ wizard would solve the
> same problem using the STL?
>
> I would be very ingterested in comparing code size, complexity, etc.
>
>
> Thanks in advance, and here is the C part:
> --------------------------------------------------------------
> Unique
> ------
>
> Given a text file, print in standard output the lines that are
> unique in it, i.e. filtering all duplicated lines.
>
> Algorithm:
> ---------
>
> Normally this involves keeping a sorted list/array of lines
> and testing if a line is in the set or not.
>
> Solution using the CCL.
> ----------------------
>
> 1 #include<containers.h>
> 2 int main(int argc,char *argv[])
> 3 {
> 4 FILE *f;
> 5 int i=1,r;
> 6 Dictionary *dict;
> 7 char buf[8192];
> 8
> 9 if (argc< 2) {
> 10 fprintf(stderr,"%s<file name>\n",argv[0]);
> 11 return -1;
> 12 }
> 13 f = fopen(argv[1],"r");
> 14 if (f == NULL)
> 15 return -1;
> 16 dict = iDictionary.Create(0,500);
> 17 if (dict == NULL)
> 18 return -1;
> 19 while (fgets(buf,sizeof(buf),f)) {
> 20 r= iDictionary.Add(dict,buf,NULL);
> 21 if (r> 0)
> 22 printf("[%3d] %s",i,buf);
> 23 else if (r< 0) break;
> 24 i++;
> 25 }
> 26 iDictionary.Finalize(dict);
> 27 fclose(f);
> 28 }
>
> Algorithm
> ---------
> A hash table will be used to determine if a line is a duplicate
> or not.


This is probably the closest equivalent (sticking to a similar layout
style):

#include <set>
#include <string>
#include <iostream>
#include <fstream>

typedef std::set<std::string> Lines;

int main( int argc, char** argv )
{
std::ifstream in( argv[1] );

Lines unique;

while( in ) {
std::string line;

std::getline( in, line );

if( unique.insert(line).second ) {
std::cout << line << std::endl;
}
}
}


> Commentary
> ----------
> We use the following local variables (lines 4-7):
> Name Usage
> f Input stream bound to the file to read
> i Counter for lines read
> r Result of adding a line
> dict Dictionary (Hash table)
> buf Line buffer limited to 8K per line


It would have been much easier to give the variables meaningful names!

--
Ian Collins
 
Reply With Quote
 
 
 
 
jacob navia
Guest
Posts: n/a
 
      08-03-2011
Le 04/08/11 00:38, Robert Wessel a écrit :
> On Wed, 03 Aug 2011 17:36:54 -0500, Robert Wessel
> <(E-Mail Removed)> wrote:
>


>>
>> I'd point out that unlike Jacob's original, you're not displaying a
>> line number, although that's trivial. You'd probably change to cout
>> to something like
>>
>> std::cout<< "["<< setw(3)<< linenumber<< "] "<< line<< std:endl;
>>
>> And add the appropriate code to maintain linenumber.

>
>
> errr... std::setw(3)


Thanks to you and to Ian.

More important than the line number, is the behavior of the STL
container in case of errors, for instance when there is no more
memory. Does it throw an exception? In that case it would be
fully equivalent of the C code that tests for the return code
being less than zero.

In my Macintosh I compile using gcc the code to
-rw-r--r-- 1 jacobnavia staff 6548 3 aoû 23:23 unique.o // CPP
-rw-r--r-- 1 jacobnavia staff 1348 3 aoû 23:23 unique1.o // C

both are compiled using -Os (size optimization) using the same commpiler
~/Documents/stl $ gcc -v
Using built-in specs.
Target: i686-apple-darwin10
gcc version 4.2.1 (Apple Inc. build 5664)


I think that the codes are really equivalent, and I am pleased that
the CCL doesn't come all that bad. What performance is concerned
I doubt that there would be any differences since the programs are
bound by I/O anyway


Thanks again for your answers.

jacob
 
Reply With Quote
 
Ian Collins
Guest
Posts: n/a
 
      08-04-2011
On 08/ 4/11 11:30 AM, jacob navia wrote:
>
> Thanks to you and to Ian.
>
> More important than the line number, is the behavior of the STL
> container in case of errors, for instance when there is no more
> memory. Does it throw an exception? In that case it would be
> fully equivalent of the C code that tests for the return code
> being less than zero.
>
> In my Macintosh I compile using gcc the code to
> -rw-r--r-- 1 jacobnavia staff 6548 3 aoû 23:23 unique.o // CPP
> -rw-r--r-- 1 jacobnavia staff 1348 3 aoû 23:23 unique1.o // C
>
> both are compiled using -Os (size optimization) using the same commpiler
> ~/Documents/stl $ gcc -v
> Using built-in specs.
> Target: i686-apple-darwin10
> gcc version 4.2.1 (Apple Inc. build 5664)
>
>
> I think that the codes are really equivalent, and I am pleased that
> the CCL doesn't come all that bad. What performance is concerned
> I doubt that there would be any differences since the programs are
> bound by I/O anyway


Um, I tried both versions with the same optimisations (cc/CC -fast) on
my Solaris box and the run times (over several runs) for a 350K line
file with 214K unique lines were:

C: 5.027s
CC: 1.835s

--
Ian Collins
 
Reply With Quote
 
Ian Collins
Guest
Posts: n/a
 
      08-04-2011
On 08/ 4/11 12:34 PM, Ian Collins wrote:
> On 08/ 4/11 11:30 AM, jacob navia wrote:
>>
>> Thanks to you and to Ian.
>>
>> More important than the line number, is the behavior of the STL
>> container in case of errors, for instance when there is no more
>> memory. Does it throw an exception? In that case it would be
>> fully equivalent of the C code that tests for the return code
>> being less than zero.
>>
>> In my Macintosh I compile using gcc the code to
>> -rw-r--r-- 1 jacobnavia staff 6548 3 aoû 23:23 unique.o // CPP
>> -rw-r--r-- 1 jacobnavia staff 1348 3 aoû 23:23 unique1.o // C
>>
>> both are compiled using -Os (size optimization) using the same commpiler
>> ~/Documents/stl $ gcc -v
>> Using built-in specs.
>> Target: i686-apple-darwin10
>> gcc version 4.2.1 (Apple Inc. build 5664)
>>
>>
>> I think that the codes are really equivalent, and I am pleased that
>> the CCL doesn't come all that bad. What performance is concerned
>> I doubt that there would be any differences since the programs are
>> bound by I/O anyway

>
> Um, I tried both versions with the same optimisations (cc/CC -fast) on
> my Solaris box and the run times (over several runs) for a 350K line
> file with 214K unique lines were:
>
> C: 5.027s
> CC: 1.835s


A quick profile shows the C version spends most of its time in strcmp
while the C++ version spends most of its time in I/O. So I guess the
performance difference is mainly between the r-b tree used by std::set
and your hashing algorithm.

So if I change your code to use std::set in order to remove the
difference in I/O libraries:

#include <set>
#include <string>
#include <stdio.h>

int main(int argc,char *argv[])
{
FILE* f = fopen(argv[1],"r");
if (f == NULL)
return -1;

std::set<std::string> unique;
char buf[8192];
int i = 0;

while (fgets(buf,sizeof(buf),f)) {
if( unique.insert(buf).second ) {
printf("[%3d] %s",i,buf);
}
i++;
}
fclose(f);
}

the run time drops to 0.66 seconds.

--
Ian Collins
 
Reply With Quote
 
Juha Nieminen
Guest
Posts: n/a
 
      08-04-2011
Ian Collins <(E-Mail Removed)> wrote:
> the run time drops to 0.66 seconds.


If you wanted to reduce the running time even further, and assuming the
entire input file fits in RAM, what you could do is to read the entire
file into RAM with one single fread() call and then use a std::set of
char* (and the proper comparator), each pointer pointing to this single
data block (which needs newlines replaced with null characters).

In this case the (probably significant) speedup comes from removing the
need to make a memory allocation for each line.

(And if that's not enough, then the next step is to not to use a
std::set, but a std::vector<char*> instead. Once it has been initialized,
sort it and make the elements unique (both one-liners). That way we avoid
making another memory allocation per line.)
 
Reply With Quote
 
Ian Collins
Guest
Posts: n/a
 
      08-04-2011
On 08/ 4/11 07:02 PM, Juha Nieminen wrote:
> Ian Collins<(E-Mail Removed)> wrote:
>> the run time drops to 0.66 seconds.

>
> If you wanted to reduce the running time even further, and assuming the
> entire input file fits in RAM, what you could do is to read the entire
> file into RAM with one single fread() call and then use a std::set of
> char* (and the proper comparator), each pointer pointing to this single
> data block (which needs newlines replaced with null characters).


Running more than once should be sufficient to cache the file in RAM.

> In this case the (probably significant) speedup comes from removing the
> need to make a memory allocation for each line.


My point was the choice of container was the dominant player in this
comparison. All of the above optimisations can be applied to either
version.

--
Ian Collins
 
Reply With Quote
 
jacob navia
Guest
Posts: n/a
 
      08-04-2011
Le 04/08/11 02:34, Ian Collins a écrit :
>
> Um, I tried both versions with the same optimisations (cc/CC -fast) on
> my Solaris box and the run times (over several runs) for a 350K line
> file with 214K unique lines were:
>
> C: 5.027s
> CC: 1.835s
>


Did you change the number passed to the Create API?

The code as shown is optimized for a file of 500 lines or so.

Changing that 500 to (say 135000) would be significantly
better but would use much more memory.

jacob
 
Reply With Quote
 
Ian Collins
Guest
Posts: n/a
 
      08-04-2011
On 08/ 4/11 07:42 PM, jacob navia wrote:
> Le 04/08/11 02:34, Ian Collins a écrit :
>>
>> Um, I tried both versions with the same optimisations (cc/CC -fast) on
>> my Solaris box and the run times (over several runs) for a 350K line
>> file with 214K unique lines were:
>>
>> C: 5.027s
>> CC: 1.835s
>>

>
> Did you change the number passed to the Create API?
>
> The code as shown is optimized for a file of 500 lines or so.
>
> Changing that 500 to (say 135000) would be significantly
> better but would use much more memory.


That does make a very significant difference:

0.51s

Slightly quicker than using std::set.

--
Ian Collins
 
Reply With Quote
 
jacob navia
Guest
Posts: n/a
 
      08-04-2011
Le 04/08/11 01:40, Robert Wessel a écrit :
>
>
> Yes, you'd get an exception if you ran out of memory. The handling of
> that is a bit different than your program (you quietly return -1,
> where as the C++ program would probably die with an "unhandled
> exception" message, neither of which is really acceptable, although
> both programs could obviously be improved).


The default behavior in the CCL is to print an error message in the
standard error stream. In this case the user would get a "No memory
left" message. That is why I do not print any error message and
return -1.

You can obviously change this behavior by another one if you wish.
(For instance aborting the program).

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Are sequence containers not a subset of general containers? Sebastian Mach C++ 5 10-06-2012 07:54 PM
nested stl containers - aliasing stl objects without invokingoperator= Andrey Vul C++ 6 10-22-2009 09:00 AM
Containers of iterators vs. containers of references clark.coleman@att.net C++ 7 01-25-2008 01:37 PM
Library exposing STL containers Bob C++ 2 07-26-2006 02:04 AM
sorting with STL containers containing objects of different subclasses Koen C++ 1 06-30-2003 03:51 PM



Advertisments