Home  Forums  Reviews  Guides  Newsgroups  Register  Search 
Thread Tools 
Markus Dehmann 


 
Gianni Mariani
Guest
Posts: n/a

Markus Dehmann wrote:
> I am planning to have tens of thousands of bit vectors (or bitsets). > Each of them has about 20,000 bits, but it's sparse: In a given > vector, only 2030 bits will be set. What data structure would be best > for space (and probably also time) efficiency? Is std::bitset better > than std::bit_vector (vector<bool>)? Can I trust that the used space > for each of the bitvectors will be in the order of 2030 bits (i.e. > the expected number of bits set), or rather 20,000 bits (the constant > number of bits that each vector has)? Or are there other, better > classes to use? Sparse Matrix classes? You could use map. I have attached the Austria C++ "rangemap" that can be used as a sparse bitmap. What kind of operations do you need to have on your bitmap ? // // The Austria library is copyright (c) Gianni Mariani 2004. // // Grant Of License. Grants to LICENSEE the nonexclusive right to use the Austria // library subject to the terms of the LGPL. // // A copy of the license is available in this directory or one may be found at this URL: // http://www.gnu.org/copyleft/lesser.txt // /** * at_rangemap.h * */ #ifndef x_at_rangemap_h_x #define x_at_rangemap_h_x 1 #include "at_exports.h" #include "at_os.h" #include "at_assert.h" #include <map> // Austria namespace namespace at { // ======== TypeRange ================================================= /** * TypeRange describes the range of a particular type * */ template <typename w_RangeType> class TypeRange { public: // range type typedef w_RangeType t_RangeType; // ======== Adjacent ============================================== /** * Adjacent returns true if the two parameters are "one apart" * * @param i_lesser is the lesser of the two values * @param i_greater is the greater of the two * @return true is no other elements exist between i_lesser and i_greater */ static bool Adjacent( const t_RangeType & i_lesser, const t_RangeType & i_greater ) { t_RangeType l_greater_less( i_greater );  l_greater_less; // go to the earlier element // deal with wrapping if ( i_greater < l_greater_less ) { return false; } return !( i_lesser < l_greater_less ); } }; // ======== RangeMap ================================================== /** * RangeMap is a template that defines ranges. * */ template <typename w_RangeType, typename w_RangeTraits=TypeRange<w_RangeType> > class RangeMap { public: // range type typedef w_RangeType t_RangeType; typedef w_RangeTraits t_RangeTraits; // index on the end of the range typedef std::map< t_RangeType, t_RangeType > t_Map; typedef typename t_Map::iterator t_Iterator; // ======== AddRange ============================================== /** * Add a segment to the range. * * @param i_begin The beginning of the range (inclusive) * @param i_end The end of the range (inclusive) * @return nothing */ void AddRange( const t_RangeType & i_begin, const t_RangeType & i_end ) { const bool l_less_than( i_end < i_begin ); const t_RangeType & l_begin = ! l_less_than ? i_begin : i_end; const t_RangeType & l_end = l_less_than ? i_begin : i_end; // deal with an empty map here if ( m_map.empty() ) { // shorthand adding the first element into the map m_map[ l_end ] = l_begin; return; } // see if there is a segment to merge  find the element that preceeds // l_begin t_Iterator l_begin_bound = m_map.lower_bound( l_begin ); if ( l_begin_bound == m_map.end() ) { // l_begin is after the last element  l_begin_bound; if ( t_RangeTraits::Adjacent( l_begin_bound>first, l_begin ) ) { // yes, they are mergable t_RangeType l_temp = l_begin_bound>second; m_map.erase( l_begin_bound ); m_map[ l_end ] = l_temp; return; } // not mergable  add the segment at the end m_map[ l_end ] = l_begin; return; } // if the end of the segment being inserted is not beyond this one if ( ( l_end < l_begin_bound>second ) && ! t_RangeTraits::Adjacent( l_end, l_begin_bound>second ) ) { // NOT mergable with subsequent segments if ( l_begin_bound == m_map.begin() ) { // There is no previous segment m_map[ l_end ] = l_begin; return; } // The segment being inserted can't be merged at the end // see if it can be merged with the previous one t_Iterator l_previous = l_begin_bound;  l_previous; AT_Assert( l_previous>first < l_begin ); if ( ! t_RangeTraits::Adjacent( l_previous>first, l_begin ) ) { // not overlapping with previous and not mergable m_map[ l_end ] = l_begin; return; } else { // we are mergable with the previous element // yes, they are mergable t_RangeType l_temp = l_previous>second; m_map.erase( l_previous ); m_map[ l_end ] = l_temp; return; } } if ( l_begin_bound == m_map.begin() ) { if ( l_end < l_begin_bound>first ) { if ( l_end < l_begin_bound>second ) { if ( t_RangeTraits::Adjacent( l_end, l_begin_bound>second ) ) { l_begin_bound>second = l_begin; return; } else { m_map[ l_end ] = l_begin; return; } } else { if ( l_begin < l_begin_bound>second ) { l_begin_bound>second = l_begin; } return; } } else { t_RangeType l_new_begin = l_begin; if ( l_begin_bound>second < l_begin ) { l_new_begin = l_begin_bound>second; } // Check to see what segment is close to the end t_Iterator l_end_bound = m_map.lower_bound( l_end ); if ( l_end_bound == m_map.end() ) { // erase all the segments from l_previous to the end and // replace with one m_map.erase( l_begin_bound, l_end_bound ); m_map[ l_end ] = l_new_begin; return; } if ( l_end < l_end_bound>second && ! t_RangeTraits::Adjacent( l_end, l_end_bound>second ) ) { m_map.erase( l_begin_bound, l_end_bound ); m_map[ l_end ] = l_new_begin; return; } // merge with the current end m_map.erase( l_begin_bound, l_end_bound ); // erase segments in between l_end_bound>second = l_new_begin; return; } } if ( l_begin_bound == m_map.begin() ) { // no previous ranges // see if we can merge with the current range } // find the previous iterator t_Iterator l_previous = l_begin_bound;  l_previous; t_RangeType l_new_begin = l_begin; if ( t_RangeTraits::Adjacent( l_previous>first, l_begin ) ) { l_new_begin = l_previous>second; } else { ++ l_previous; if ( l_previous>second < l_new_begin ) { l_new_begin = l_previous>second; } } t_RangeType l_new_end = l_end; // Check to see what segment is close to the end t_Iterator l_end_bound = m_map.lower_bound( l_end ); if ( l_end_bound == m_map.end() ) { // erase all the segments from l_previous to the end and // replace with one m_map.erase( l_previous, l_end_bound ); m_map[ l_end ] = l_new_begin; return; } if ( l_end < l_end_bound>second && ! t_RangeTraits::Adjacent( l_end, l_end_bound>second ) ) { m_map.erase( l_previous, l_end_bound ); m_map[ l_end ] = l_new_begin; return; } // merge with the current end m_map.erase( l_previous, l_end_bound ); // erase segments in between l_end_bound>second = l_new_begin; return; } // ======== SubtractRange ========================================= /** * SubtractRange removes the range. (opposite of Add) * * * @param i_begin Beginning of range to subtract * @param i_end End of range to subtract * @return nothing */ void SubtractRange( const t_RangeType & i_begin, const t_RangeType & i_end ) { const bool l_less_than( i_end < i_begin ); const t_RangeType & l_begin = ! l_less_than ? i_begin : i_end; const t_RangeType & l_end = l_less_than ? i_begin : i_end; // deal with an empty map here if ( m_map.empty() ) { // Nothing to remove return; } // See if we find a segment // l_begin t_Iterator l_begin_bound = m_map.lower_bound( l_begin ); if ( l_begin_bound == m_map.end() ) { // this does not cover any segments return; } if ( l_begin_bound>second < l_begin ) { // this segment is broken up t_RangeType l_newend = l_begin;  l_newend; m_map[ l_newend ] = l_begin_bound>second; l_begin_bound>second = l_begin; } t_Iterator l_end_bound = m_map.lower_bound( l_end ); if ( l_end_bound == m_map.end() ) { // erase all the segments from the beginning to end m_map.erase( l_begin_bound, l_end_bound ); return; } if ( !( l_end < l_end_bound>first ) ) { // the segment end must be equal the segment given ++ l_end_bound; m_map.erase( l_begin_bound, l_end_bound ); return; } // need to break up the final segment m_map.erase( l_begin_bound, l_end_bound ); if ( !( l_end < l_end_bound>second ) ) { t_RangeType l_newbegin = l_end; ++ l_newbegin; l_end_bound>second = l_newbegin; } return; } // ======== IsSet ================================================= /** * Checks to see if the position is set * * @param i_pos * @return True if the position is set */ bool IsSet( const t_RangeType & i_pos ) { t_Iterator l_bound = m_map.lower_bound( i_pos ); if ( l_bound == m_map.end() ) { // this does not cover any segments return false; } return !( i_pos < l_bound>second ); } t_Map m_map; }; }; // namespace #endif // x_at_rangemap_h_x 




Gianni Mariani 


 
Markus Dehmann
Guest
Posts: n/a

On Oct 26, 1:41 am, Gianni Mariani <(EMail Removed)> wrote:
> Markus Dehmann wrote: > > I am planning to have tens of thousands of bit vectors (or bitsets). > > Each of them has about 20,000 bits, but it's sparse: In a given > > vector, only 2030 bits will be set. What data structure would be best > > for space (and probably also time) efficiency? Is std::bitset better > > than std::bit_vector (vector<bool>)? Can I trust that the used space > > for each of the bitvectors will be in the order of 2030 bits (i.e. > > the expected number of bits set), or rather 20,000 bits (the constant > > number of bits that each vector has)? Or are there other, better > > classes to use? Sparse Matrix classes? > > You could use map. I have attached the Austria C++ "rangemap" that can > be used as a sparse bitmap. Thanks! What are the differences compared to std::map? Could I also just use map<bool>? > What kind of operations do you need to have on your bitmap ? I will have to compute unions of the vectors/sets, and I need a fast way to lookup what bits are set in a given vector. I have one vector of doubles (also 20,000 elements), and I frequently have to multiply the bit vectors with that vector of doubles and then sum over all the elements. Thanks! Markus > [at_rangemap.h]// 




Markus Dehmann 
=?UTF8?B?RXJpayBXaWtzdHLDtm0=?=
Guest
Posts: n/a

On 20071026 07:32, Markus Dehmann wrote:
> I am planning to have tens of thousands of bit vectors (or bitsets). > Each of them has about 20,000 bits, but it's sparse: In a given > vector, only 2030 bits will be set. What data structure would be best > for space (and probably also time) efficiency? Is std::bitset better > than std::bit_vector (vector<bool>)? A bitset is probably better than a vector<bool>, but I do not think that you should use any of them. > Can I trust that the used space > for each of the bitvectors will be in the order of 2030 bits (i.e. > the expected number of bits set), or rather 20,000 bits (the constant > number of bits that each vector has)? The bitset will be at least as big as the maximum number of bits (20,000 bits ~ 2.5 KB). > Or are there other, better classes to use? Sparse Matrix classes? You should just store the index of the bits that are set. If we do some quick calculations we see that with a bitset you will need (for 10,000 sets of 20,000 bits each) 10,000 * 20,000 ~ 25 MB. Using vector<int> instead to store the indexes of the set bits would require (given 20 set bits in each set and 4 byte per int) 10,000 * 20 * 4 ~ 800 KB. Using shorts and you might be able to halve that. Depending on how you plan to use the sets you might want to use sets, deques, or some other container, a vector together with the *_heap() algorithms might also be an idea. Computing the unions when storing only the indexes can be done quite easily by hand or if your store them in sorted order using set_union(). To get which bits that are set just iterate over the container. To get the sum from the vector of doubles something like this can be used: double sum(const std::vector<int>& bits, const std::vector<double>& val) { double sum = 0; for (size_t i = 0; i < bits.size();++i) sum += val[bits[i]]; return sum; } This can easily be adopted to other containers using iterators.  Erik WikstrÃ¶m 




=?UTF8?B?RXJpayBXaWtzdHLDtm0=?= 
Gianni Mariani
Guest
Posts: n/a

Markus Dehmann wrote:
> On Oct 26, 1:41 am, Gianni Mariani <(EMail Removed)> wrote: >> Markus Dehmann wrote: >>> I am planning to have tens of thousands of bit vectors (or bitsets). >>> Each of them has about 20,000 bits, but it's sparse: In a given >>> vector, only 2030 bits will be set. What data structure would be best >>> for space (and probably also time) efficiency? Is std::bitset better >>> than std::bit_vector (vector<bool>)? Can I trust that the used space >>> for each of the bitvectors will be in the order of 2030 bits (i.e. >>> the expected number of bits set), or rather 20,000 bits (the constant >>> number of bits that each vector has)? Or are there other, better >>> classes to use? Sparse Matrix classes? >> You could use map. I have attached the Austria C++ "rangemap" that can >> be used as a sparse bitmap. > > Thanks! What are the differences compared to std::map? Could I also > just use map<bool>? You can just use std::set<int>. The rangemap has the advantage that if you have consecutive ranges, then only one entry in the rangemap is used. > >> What kind of operations do you need to have on your bitmap ? > > I will have to compute unions of the vectors/sets, and I need a fast > way to lookup what bits are set in a given vector. I have one vector > of doubles (also 20,000 elements), and I frequently have to multiply > the bit vectors with that vector of doubles and then sum over all the > elements. rangemap is a map of "ranges". It's like a bitmap in that you're storing just the ranges of set bits. If you want to "multiply", then the quickest thing is to go through the ranges in the rangemap and add all the elements that are indexed by the ranges. 




Gianni Mariani 
Markus Dehmann
Guest
Posts: n/a

On Oct 26, 4:20 am, Erik Wikström <(EMail Removed)> wrote:
> On 20071026 07:32, Markus Dehmann wrote: > > > I am planning to have tens of thousands of bit vectors (or bitsets). > > Each of them has about 20,000 bits, but it's sparse: In a given > > vector, only 2030 bits will be set. What data structure would be best > > for space (and probably also time) efficiency? Is std::bitset better > > than std::bit_vector (vector<bool>)? > > A bitset is probably better than a vector<bool>, but I do not think that > you should use any of them. > > > Can I trust that the used space > > for each of the bitvectors will be in the order of 2030 bits (i.e. > > the expected number of bits set), or rather 20,000 bits (the constant > > number of bits that each vector has)? > > The bitset will be at least as big as the maximum number of bits (20,000 > bits ~ 2.5 KB). That's quite shocking, together with your statement that vector<bool> (I suppose you mean the bit_vector version) would be even worse. I thought that some space optimization would be done? > > Or are there other, better classes to use? Sparse Matrix classes? > > You should just store the index of the bits that are set. If we do some > quick calculations we see that with a bitset you will need (for 10,000 > sets of 20,000 bits each) 10,000 * 20,000 ~ 25 MB. Using vector<int> > instead to store the indexes of the set bits would require (given 20 set > bits in each set and 4 byte per int) 10,000 * 20 * 4 ~ 800 KB. Using > shorts and you might be able to halve that. Depending on how you plan to > use the sets you might want to use sets, deques, or some other > container, a vector together with the *_heap() algorithms might also be > an idea. Storing just the bits that are set makes sense. Actually I was thinking that bit_vector would pretty much do that for me as a space optimization. Another good container to use might be the googlesparsehash (http:// code.google.com/p/googlesparsehash). It is spaceefficient and a hash would make it easier to compute the unions of two of them. 




Markus Dehmann 
=?UTF8?B?RXJpayBXaWtzdHLDtm0=?=
Guest
Posts: n/a

On 20071026 15:25, Markus Dehmann wrote:
> On Oct 26, 4:20 am, Erik WikstrÃ¶m <(EMail Removed)> wrote: >> On 20071026 07:32, Markus Dehmann wrote: >> >> > I am planning to have tens of thousands of bit vectors (or bitsets). >> > Each of them has about 20,000 bits, but it's sparse: In a given >> > vector, only 2030 bits will be set. What data structure would be best >> > for space (and probably also time) efficiency? Is std::bitset better >> > than std::bit_vector (vector<bool>)? >> >> A bitset is probably better than a vector<bool>, but I do not think that >> you should use any of them. >> >> > Can I trust that the used space >> > for each of the bitvectors will be in the order of 2030 bits (i.e. >> > the expected number of bits set), or rather 20,000 bits (the constant >> > number of bits that each vector has)? >> >> The bitset will be at least as big as the maximum number of bits (20,000 >> bits ~ 2.5 KB). > > That's quite shocking, together with your statement that vector<bool> > (I suppose you mean the bit_vector version) would be even worse. I > thought that some space optimization would be done? My problem with vector<bool> is not so much with how much space it uses (which should be roughly equal to bitset) but rather the fact that it does not behave in the same way as a vector<T>. >> > Or are there other, better classes to use? Sparse Matrix classes? >> >> You should just store the index of the bits that are set. If we do some >> quick calculations we see that with a bitset you will need (for 10,000 >> sets of 20,000 bits each) 10,000 * 20,000 ~ 25 MB. Using vector<int> >> instead to store the indexes of the set bits would require (given 20 set >> bits in each set and 4 byte per int) 10,000 * 20 * 4 ~ 800 KB. Using >> shorts and you might be able to halve that. Depending on how you plan to >> use the sets you might want to use sets, deques, or some other >> container, a vector together with the *_heap() algorithms might also be >> an idea. > > Storing just the bits that are set makes sense. Actually I was > thinking that bit_vector would pretty much do that for me as a space > optimization. I have not studied the standard in great detail so such an optimisation might be allowed, but the most likely implementation is to allocate as many bits (rounded up to an even byte) as necessary. The problem with such an optimisation would be if all bits are set, in which case you would require more memory than is you just set bits in memory. I am not sure if the complexity requirements allows for a scheme that switches between the most efficient storage format as needed, even if it does the extra complexity is not something you want unless it is really necessary.  Erik WikstrÃ¶m 




=?UTF8?B?RXJpayBXaWtzdHLDtm0=?= 
James Kanze
Guest
Posts: n/a

On Oct 26, 10:20 am, Erik Wikström <(EMail Removed)> wrote:
> On 20071026 07:32, Markus Dehmann wrote: > > I am planning to have tens of thousands of bit vectors (or > > bitsets). Each of them has about 20,000 bits, but it's > > sparse: In a given vector, only 2030 bits will be set. What > > data structure would be best for space (and probably also > > time) efficiency? Is std::bitset better than std::bit_vector > > (vector<bool>)? > A bitset is probably better than a vector<bool>, but I do not > think that you should use any of them. I tend to agree, but not for reasons of space. > > Can I trust that the used space > > for each of the bitvectors will be in the order of 2030 bits (i.e. > > the expected number of bits set), or rather 20,000 bits (the constant > > number of bits that each vector has)? > The bitset will be at least as big as the maximum number of > bits (20,000 bits ~ 2.5 KB). > > Or are there other, better classes to use? Sparse Matrix classes? > You should just store the index of the bits that are set. If we do some > quick calculations we see that with a bitset you will need (for 10,000 > sets of 20,000 bits each) 10,000 * 20,000 ~ 25 MB. Using vector<int> > instead to store the indexes of the set bits would require (given 20 set > bits in each set and 4 byte per int) 10,000 * 20 * 4 ~ 800 KB. Using > shorts and you might be able to halve that. Using short probably wouldn't change anything. Set it a node based container; each element in the set is in its own node, a node which also contains a couple of pointers (at least three, I think, for std::set). On most modern machines, pointers are either 4 or 8 bytes, and require a similar alignment. So you have to add 4 or 8 times 3 (or more) to each element (2 or 4), then round up if necessary to ensure alignment. And maybe add a bit more for the overhead of the allocator. On a 32 bit machine, you end up with at least 16 bytes, and on a 64 bit machine, at least 32 bytes, per element. Depending on how the data is used, it might be best to maintain it in an vector of short or int, using lower_bound, etc., to find the insertion point in order to maintain the vector sorted. If there aren't too many elements in each set, even an insertion at the beginning will not be too expensive; probably less expensive than the allocation of a new node when inserting into a set. And in this case, the extra overhead will be a couple of pointers for the entire set, rather than for each element. If you know the maximum number of elements, and it's fairly small, you could also use boost::array. Or if it's usually under some very small value, but with a few outlyers, you could use something like the small string optimization frequently used for strings. Whatever you start with (I'd probably start with set, or maybe vector), you should wrap it in an application specific class, so that you can easily change it at will. > Depending on how you plan to > use the sets you might want to use sets, deques, or some other > container, a vector together with the *_heap() algorithms might also be > an idea. > Computing the unions when storing only the indexes can be done quite > easily by hand or if your store them in sorted order using set_union(). > To get which bits that are set just iterate over the container. To get > the sum from the vector of doubles something like this can be used: There are a set of functions for treating sorted sequences (set, but also sorted vector, etc.) as a set. Which means that a lot of his work is already done for him.  James Kanze (GABI Software) email:(EMail Removed) Conseils en informatique orientée objet/ Beratung in objektorientierter Datenverarbeitung 9 place Sémard, 78210 St.Cyrl'École, France, +33 (0)1 30 23 00 34 




James Kanze 
James Kanze
Guest
Posts: n/a

On Oct 26, 3:25 pm, Markus Dehmann <(EMail Removed)> wrote:
[...] > > The bitset will be at least as big as the maximum number of bits (20,000 > > bits ~ 2.5 KB). > That's quite shocking, together with your statement that vector<bool> > (I suppose you mean the bit_vector version) would be even worse. I > thought that some space optimization would be done? Why? The most frequent use is probably fairly small sets, for which space optimization is counter productive.  James Kanze (GABI Software) email:(EMail Removed) Conseils en informatique orientée objet/ Beratung in objektorientierter Datenverarbeitung 9 place Sémard, 78210 St.Cyrl'École, France, +33 (0)1 30 23 00 34 




James Kanze 


 
Thread Tools  


Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Re: How include a large array?  Edward A. Falk  C Programming  1  04042013 08:07 PM 
Powered by vBulletin®. Copyright ©2000  2014, vBulletin Solutions, Inc..
SEO by vBSEO ©2010, Crawlability, Inc. 