Home  Forums  Reviews  Guides  Newsgroups  Register  Search 
Thread Tools 
Gustavo Rondina 


 
kid joe
Guest
Posts: n/a

On Sat, 27 Jun 2009 14:07:55 0700, Gustavo Rondina wrote:
> Hello all, > > First of all, this might be somewhat offtopic for some of you, so let > me > apology in advance. > > I am writing a program that requires a huge 2D symmetric matrix of > booleans > (i.e., either 0 or 1). My first thought was to store it in a packed > way, so > the memory needed would be about half of the amount required to store > the > whole array in a traditional fashion. My packing is described as > follows. > > Given this symmetric matrix: > > 0 1 2 3 4 > ++++++ > 0  1  0  0  1  1  > ++++++ > 1  0  0  1  1  0  > ++++++ > 2  0  1  1  1  1  > ++++++ > 3  1  1  1  1  1  > ++++++ > 4  1  0  1  1  0  > ++++++ > > I am storing only the upper triangular portion plus the diagonal > continuously > in memory, so I have: > > 00 01 02 03 04 11 12 13 14 22 23 24 33 34 44 > ++++++++++++++++ >  1  0  0  1  1  0  1  1  0  1  1  1  1  1  0  > ++++++++++++++++ > > The two digits numbers above the schematic correspond to the indexes > in the > original matrix. Assuming each position is one char, instead of N^2 > bytes the > packed way is using only N*(N+1)/2 bytes. The tradeoff is of course > having a > more costly way to access the matrix. Instead of simply doing M[i][J], > I need > a function like this: > > #define N 10000 /* size of matrix */ > ... > 1 unsigned char is_set(unsigned int i, unsigned int j) > 2 { > 3 unsigned int tmp; > 4 unsigned int real_index; > 5 > 6 if (i > j) { > 7 tmp = j; > 8 j = i; > 9 i = tmp; > 10 } > 11 > 12 real_index = (i * N)  (i * (i  1) / 2) + j  i; > 13 return packed_array[real_index]; > 14 } > > I came up with the formula to convert the indexes from the traditional > array > to the packed array and it seems to work. > > Then I realized it is a waste to use a char to store a single > information, so > I decided to the bits of a char to do this, then instead of using N > (N1)/2 > bytes, I would need only 1/8 of this memory. Of course this leads to > bit > manipulation and all. The final result, with a few tweaks, is the > following > function: > > 1 unsigned char is_set(int i, int j) > 2 { > 3 register unsigned int idx; > 4 > 5 if (i > j) { > 6 register int temp = i; > 7 i = j; > 8 j = temp; > 9 } > 10 > 11 idx = ((i * ((N << 1)  i  1)) >> 1) + j; > 16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned > char) 0x80); > 17 } > > (I reordered the expression to reduce the number of multiplications > and use > shift operations.) > > This function is called a lot of times, and this is having some > noticeable > impact in the overall performance. The memory saving is really huge, > but > the performance is making me wonder if this is really the best choice. > > Does anyone know how could I make it faster, or a clever way to solve > this > problem, or any suggestions on the code I posted above? Only > limitation > is that they only work properly if a char is 8 bits long, but I can > live with that. > > Thanks, > Gustavo > > PS. Sorry if the schematics look messed up, with a monospace font they > looked nice. Hi Gustavo, The bitshifts and ANDs will be very fast so probably the problem is the overhead of a function call. Try declaring your function as inline (or use a Macro). Cheers, Joe  ...................... o _______________ _, ` Good Evening! , /\_ _  .'_ `................, _\__`[_______________ _ (_ ] [ \, ][ ][ (_ 




kid joe 


 
Barry Schwarz
Guest
Posts: n/a

On Sat, 27 Jun 2009 14:07:55 0700 (PDT), Gustavo Rondina
<(EMail Removed)> wrote: >Hello all, > >First of all, this might be somewhat offtopic for some of you, so let >me >apology in advance. > >I am writing a program that requires a huge 2D symmetric matrix of >booleans >(i.e., either 0 or 1). My first thought was to store it in a packed >way, so >the memory needed would be about half of the amount required to store >the >whole array in a traditional fashion. My packing is described as >follows. snip conventional 2d array description >I am storing only the upper triangular portion plus the diagonal >continuously >in memory, so I have: snip description of converted linear array >Then I realized it is a waste to use a char to store a single >information, so >I decided to the bits of a char to do this, then instead of using N >(N1)/2 >bytes, I would need only 1/8 of this memory. Of course this leads to >bit >manipulation and all. The final result, with a few tweaks, is the >following >function: > > 1 unsigned char is_set(int i, int j) > 2 { > 3 register unsigned int idx; > 4 > 5 if (i > j) { > 6 register int temp = i; > 7 i = j; > 8 j = temp; > 9 } >10 >11 idx = ((i * ((N << 1)  i  1)) >> 1) + j; >16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned >char) 0x80); What happened to lines 12 through 15? >17 } > >(I reordered the expression to reduce the number of multiplications >and use >shift operations.) > >This function is called a lot of times, and this is having some >noticeable >impact in the overall performance. The memory saving is really huge, >but >the performance is making me wonder if this is really the best choice. > Welcome to the common experience of being able to optimize for time or space but not both. >Does anyone know how could I make it faster, or a clever way to solve >this >problem, or any suggestions on the code I posted above? Only >limitation >is that they only work properly if a char is 8 bits long, but I can >live with that. You need to look at the generated assembly code. N<<1 is a compiletime constant and should not be computed each time. If your compiler is not optimizing this, you might want to compute it once and store it in a static variable. Assuming packed_array has type unsigned char, the arithmetic in the return statement is still performed using int and then converted to unsigned. Is the generated code spending a lot of time converting between the types? If so, you might consider using unsigned int instead of unsigned char and adjusting your shifts for 32 bit objects instead of 8 bit ones. In my experience, multiplication is the long tent pole in your expression. Consider your expression i * ((N << 1)  i  1) This is the same as i * (N << 1)  i *(i+1) If you increase N until it is a power of 2, the first multiplication can be replaced by another shift. The second multiplication depends only on i. i will always be between 0 and N1. You could build an N element array of int where element y is set to y*y+y. Your multiplication expression then becomes (i << NN)  array[i] with no multiplication involved. (The array reference should result in a shift, not a multiply, but you do need to check the assembly code.)  Remove del for email 




Barry Schwarz 
Ben Bacarisse
Guest
Posts: n/a

Gustavo Rondina <(EMail Removed)> writes:
<snip> > I am storing only the upper triangular portion plus the diagonal > continuously > in memory, so I have: > > 00 01 02 03 04 11 12 13 14 22 23 24 33 34 44 > ++++++++++++++++ >  1  0  0  1  1  0  1  1  0  1  1  1  1  1  0  > ++++++++++++++++ > > The two digits numbers above the schematic correspond to the indexes > in the > original matrix. Assuming each position is one char, instead of N^2 > bytes the > packed way is using only N*(N+1)/2 bytes. The tradeoff is of course > having a > more costly way to access the matrix. Instead of simply doing M[i][J], <snip> > Then I realized it is a waste to use a char to store a single > information, so > I decided to the bits of a char to do this, then instead of using N > (N1)/2 > bytes, I would need only 1/8 of this memory. Of course this leads to > bit > manipulation and all. The final result, with a few tweaks, is the > following > function: > > 1 unsigned char is_set(int i, int j) > 2 { > 3 register unsigned int idx; > 4 > 5 if (i > j) { > 6 register int temp = i; > 7 i = j; > 8 j = temp; > 9 } > 10 > 11 idx = ((i * ((N << 1)  i  1)) >> 1) + j; > 16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned > char) 0x80); > 17 } > > (I reordered the expression to reduce the number of multiplications > and use > shift operations.) > > This function is called a lot of times, and this is having some > noticeable > impact in the overall performance. The memory saving is really huge, > but > the performance is making me wonder if this is really the best > choice. If you've coded the rest cleanly, it should be very simple to switch in a plain array and see how much better/worse it is. My only alternative is to use an array of pointers rather than a packed array. You arrange for M[i] to be a pointer to row i. This replaces one part of the index calculation with an indirection. You do this along with any of your other plans  storing only one half of the matrix and/or using bits rather than bytes. You might also want to try this: allocate the diagonals and not the rows. Why? Because you can make the symmetry explicit that way. For an NxN matrix you allocate an array of 2N1 pointers P. P[0] points to an array of 1 element (M[0][N1]), P[1] to two elements (M[0][N2] and M[1][N1]) and so on up to P[N1] which points to the N elements of the main diagonal. The remaining Ps pointer to the *same* arrays but in reverse order (P[N] = P[N2], P[N+1] = P[N3] and so on). Now, make a new pointertopointer variable D and set it to &P[N1]. D[0] is now the main diagonal and both D[1] and D[1] are the (same) diagonal just above and below it. Element M[i][j] is now indexed as D[ij][min(i,j)] i.e: D[ij][i < j ? i : j] As with all optimisations, this may or may not pay off but I'd want to try it out. You can use the bit allocation trick here as well, of course. Note that when using an array of pointers you don't have to allocate the smaller arrays separately. You can so it all on one go and simply point to the appropriate part. <snip>  Ben. 




Ben Bacarisse 
Thomas Matthews
Guest
Posts: n/a

Gustavo Rondina wrote:
> Hello all, > > First of all, this might be somewhat offtopic for some of you, so let > me > apology in advance. > > I am writing a program that requires a huge 2D symmetric matrix of > booleans > (i.e., either 0 or 1). My first thought was to store it in a packed > way, so > the memory needed would be about half of the amount required to store > the > whole array in a traditional fashion. My packing is described as > follows. > > Given this symmetric matrix: > > 0 1 2 3 4 > ++++++ > 0  1  0  0  1  1  > ++++++ > 1  0  0  1  1  0  > ++++++ > 2  0  1  1  1  1  > ++++++ > 3  1  1  1  1  1  > ++++++ > 4  1  0  1  1  0  > ++++++ > > I am storing only the upper triangular portion plus the diagonal > continuously > in memory, so I have: > > 00 01 02 03 04 11 12 13 14 22 23 24 33 34 44 > ++++++++++++++++ >  1  0  0  1  1  0  1  1  0  1  1  1  1  1  0  > ++++++++++++++++ > > The two digits numbers above the schematic correspond to the indexes > in the > original matrix. Assuming each position is one char, instead of N^2 > bytes the > packed way is using only N*(N+1)/2 bytes. The tradeoff is of course > having a > more costly way to access the matrix. Instead of simply doing M[i][J], > I need > a function like this: > > #define N 10000 /* size of matrix */ > ... > 1 unsigned char is_set(unsigned int i, unsigned int j) > 2 { > 3 unsigned int tmp; > 4 unsigned int real_index; > 5 > 6 if (i > j) { > 7 tmp = j; > 8 j = i; > 9 i = tmp; > 10 } > 11 > 12 real_index = (i * N)  (i * (i  1) / 2) + j  i; > 13 return packed_array[real_index]; > 14 } > > I came up with the formula to convert the indexes from the traditional > array > to the packed array and it seems to work. > > Then I realized it is a waste to use a char to store a single > information, so > I decided to the bits of a char to do this, then instead of using N > (N1)/2 > bytes, I would need only 1/8 of this memory. Of course this leads to > bit > manipulation and all. The final result, with a few tweaks, is the > following > function: > > 1 unsigned char is_set(int i, int j) > 2 { > 3 register unsigned int idx; > 4 > 5 if (i > j) { > 6 register int temp = i; > 7 i = j; > 8 j = temp; > 9 } > 10 > 11 idx = ((i * ((N << 1)  i  1)) >> 1) + j; > 16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned > char) 0x80); > 17 } > > (I reordered the expression to reduce the number of multiplications > and use > shift operations.) > > This function is called a lot of times, and this is having some > noticeable > impact in the overall performance. The memory saving is really huge, > but > the performance is making me wonder if this is really the best choice. > > Does anyone know how could I make it faster, or a clever way to solve > this > problem, or any suggestions on the code I posted above? Only > limitation > is that they only work properly if a char is 8 bits long, but I can > live with that. > > Thanks, > Gustavo > > PS. Sorry if the schematics look messed up, with a monospace font they > looked nice. Gustavo, I suggest you use an unsigned int as your base (foundation) type rather than char. Many processors are now geared up to fetch integers at a fast rate. Some processors actually slow down when processing chars. For example, if a processor has a 32bit integer, and you are using 8bit chars, it would fetch 8 bits at a time using char and 32bits in one fetch using integer. Also, you may want to ask yourself if the space saving is necessary and actually benefits the product. Many "desktop" applications are not memory sensitive, so more time should be spent on correctness of the program.  Thomas Matthews C++ newsgroup welcome message: http://www.slack.net/~shiva/welcome.txt C++ Faq: http://www.parashift.com/c++faqlite C Faq: http://www.eskimo.com/~scs/cfaq/top.html alt.comp.lang.learn.cc++ faq: http://www.comeaucomputing.com/learn/faq/ Other sites: http://www.josuttis.com  C++ STL Library book http://www.sgi.com/tech/stl  Standard Template Library 




Thomas Matthews 
Nobody
Guest
Posts: n/a

On Sat, 27 Jun 2009 14:07:55 0700, Gustavo Rondina wrote:
> (I reordered the expression to reduce the number of multiplications > and use shift operations.) The compiler ought to be able to do this for you. Certainly, multiplication or division by 2 will be translated to a shift for any sane compiler. > This function is called a lot of times, and this is having some > noticeable impact in the overall performance. The memory saving is > really huge, but the performance is making me wonder if this is really > the best choice. > > Does anyone know how could I make it faster, or a clever way to solve this > problem, or any suggestions on the code I posted above? Only limitation > is that they only work properly if a char is 8 bits long, but I can live > with that. If the multiplication (i*(N/2i1)) is the bottleneck, using a precalculated array of row pointers may be faster. Also try: Returning "int" rather than "char". Using an array of "int"s (or "long"s) rather than "char"s. Returning 0/1 rather than 0/0x80, e.g.: return (packed_array[idx>>3] >> (idx&7)) & 1; Returning zero/nonzero, e.g.: return (packed_array[idx>>3] & (1 << (idx&7))); Defining the function as "static inline ..." in a header file. Finally: storing values as bits rather than bytes will use 1/8th the memory, but using a triangular matrix will save less than a half, so maybe the latter isn't worth it. 




Nobody 
Gustavo Rondina
Guest
Posts: n/a

On Sat, 27 Jun 2009 16:36:06 0700, Barry Schwarz wrote:
[snip code from previous post] > What happened to lines 12 through 15? Sorry, I had copied a previous version of the function with a few extra unnecessary lines which were removed only after I inserted the line numbers by hand  next time 'cat n' certainly will be used. > You need to look at the generated assembly code. > > N<<1 is a compiletime constant and should not be computed > each time. If your compiler is not optimizing this, you might want to > compute it once and store it in a static variable. I did this and it made the function a bit faster, thanks for the tip. > Assuming packed_array has type unsigned char, the arithmetic > in the return statement is still performed using int and then converted > to unsigned. Is the generated code spending a lot of time converting > between the types? If so, you might consider using unsigned int instead > of unsigned char and adjusting your shifts for 32 bit objects instead of > 8 bit ones. Just now Thomas Matthews suggested this as well in another post, thank you both. I am using integers now, and it is indeed a little faster. > In my experience, multiplication is the long tent pole in your > expression. Consider your expression > i * ((N << 1)  i  1) > This is the same as > i * (N << 1)  i *(i+1) > If you increase N until it is a power of 2, the first > multiplication can be replaced by another shift. > The second multiplication depends only on i. i will always be > between 0 and N1. You could build an N element array of int where > element y is set to y*y+y. This is interesting, I will give it a try. I don't know though if I can afford increasing N to a power of 2. Only tests will tell. For now I have the following. The computations involving constants have been removed from the function and the result is stored somewhere else. Again, this assumes that a unsigned int is 32 bits long. 1 unsigned int is_set (unsigned int i, unsigned int j) 2 { 3 register unsigned int idx; 4 5 if (i > j) 6 { 7 register int temp = i; 8 i = j; 9 j = temp; 10 } 11 12 /* Original expression: 13 14 idx = i * N  ((i * (i  1)) / 2) + (j  i) 15 16 In the code below NN == (N << 1)  1 is a variable 17 initialized somwhere else. 18 */ 19 20 idx = NN  i; 21 idx *= i; 22 idx >>= 1; 23 idx += j; 24 25 /* packed_array is a array of unsigned integers */ 26 return ((packed_array[idx >> 5] << (idx & 31)) & 0x80000000); 27 } Again, thanks for your insights. Cheers,  Gustavo 




Gustavo Rondina 
Gustavo Rondina
Guest
Posts: n/a

On Sun, 28 Jun 2009 00:18:18 0700, Thomas Matthews wrote:
> I suggest you use an unsigned int as your base (foundation) type rather > than char. Many processors are now geared up to fetch integers at a > fast rate. Some processors actually slow down when processing chars. I did this modification and indeed it is a bit faster (something around 10% to 15%, but I haven't run extensive benchmarks yet.) [snip] > Also, you may want to ask yourself if the space saving is necessary and > actually benefits the product. Many "desktop" applications are not > memory sensitive, so more time should be spent on correctness of the > program. I see. Well, this is a scientific application for simulating the dynamics of system of molecules. The memory requirement scales with N, which is the number of molecules/atoms/beads one wants to simulate. In the future the plan is to port this application to run on GPUs, where memory is not as abundant as in the regular PC, so I am trying to design the application to use as less memory as possible so large systems can be simulated. With the highly parallel design of GPUs the expensive computations of calculating indexes and such might be hidden by the enormous amount of threads one can run, but nonetheless I would like to get a fast ``serial'' (i.e., CPU) version as well. The replies to my original question were very good and gave me some ideas I might want to try. Cheers,  Gustavo 




Gustavo Rondina 
Barry Schwarz
Guest
Posts: n/a

On Sun, 28 Jun 2009 08:49:37 +0000 (UTC), Gustavo Rondina
<(EMail Removed)> wrote: >On Sat, 27 Jun 2009 16:36:06 0700, Barry Schwarz wrote: snip >> In my experience, multiplication is the long tent pole in your >> expression. Consider your expression >> i * ((N << 1)  i  1) >> This is the same as >> i * (N << 1)  i *(i+1) >> If you increase N until it is a power of 2, the first >> multiplication can be replaced by another shift. >> The second multiplication depends only on i. i will always be >> between 0 and N1. You could build an N element array of int where >> element y is set to y*y+y. > >This is interesting, I will give it a try. I don't know though if I can >afford increasing N to a power of 2. Only tests will tell. Your N was defined as 10,000. If this was just an approximation as opposed to a precise requirement, why not use 8,192 as your approximation. Otherwise your are up to 16,384. The extra bits in packed_array won't matter that much. Even the extra 6K elements of the int array should not be that big a deal given how much space you saved by packing the boolean values. > >For now I have the following. The computations involving constants have >been removed from the function and the result is stored somewhere else. > >Again, this assumes that a unsigned int is 32 bits long. > > 1 unsigned int is_set (unsigned int i, unsigned int j) > 2 { > 3 register unsigned int idx; > 4 > 5 if (i > j) > 6 { > 7 register int temp = i; This should be unsigned also. > 8 i = j; > 9 j = temp; > 10 } snip  Remove del for email 




Barry Schwarz 
Flash Gordon
Guest
Posts: n/a

Gustavo Rondina wrote:
<snip> > I see. Well, this is a scientific application for simulating the dynamics > of system of molecules. The memory requirement scales with N, which is the > number of molecules/atoms/beads one wants to simulate. In the future the > plan is to port this application to run on GPUs, where memory is not as > abundant as in the regular PC, so I am trying to design the application to > use as less memory as possible so large systems can be simulated. > > With the highly parallel design of GPUs the expensive computations of > calculating indexes and such might be hidden by the enormous amount of > threads one can run, but nonetheless I would like to get a fast ``serial'' > (i.e., CPU) version as well. > > The replies to my original question were very good and gave me some ideas > I might want to try. A few points you might want to consider as you will be switching to a different processor... The size of unsigned in could well be different on the final target processor. There are differences between compilers on what they can optimise efficiently. There are difference between processors on what opperations are efficient. So, all your careful benchmarking and speed optimisations (and some space optimisations) could well be undone by switching to a different processor. So I would recommend keeping all of the different versions of the code and getting hold of the real target (or a simulator for it) and redoing the work on the final target system!  Flash Gordon 




Flash Gordon 


 
Thread Tools  


Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
compare several boolean matrix’s  keijaf2011@gmail.com  Java  32  02102013 06:54 PM 
Subtle difference between boolean value and boolean comparison?  Metre Meter  Javascript  7  08062010 08:40 PM 
difference between 'boolean' and 'java.lang.Boolean'  J Leonard  Java  4  01192008 02:56 AM 
Funny problem about symmetric encryption  Alexandre Oberlin  Computer Security  5  11152006 11:48 PM 
Frequency Doubler in VHDL with symmetric duty cycle  Gazelle  VHDL  0  11122003 07:35 PM 
Powered by vBulletin®. Copyright ©2000  2014, vBulletin Solutions, Inc..
SEO by vBSEO ©2010, Crawlability, Inc. 