Velocity Reviews > Storage of symmetric boolean matrix

# Storage of symmetric boolean matrix

Gustavo Rondina
Guest
Posts: n/a

 06-27-2009
Hello all,

First of all, this might be somewhat off-topic for some of you, so let
me

I am writing a program that requires a huge 2D symmetric matrix of
booleans
(i.e., either 0 or 1). My first thought was to store it in a packed
way, so
the memory needed would be about half of the amount required to store
the
whole array in a traditional fashion. My packing is described as
follows.

Given this symmetric matrix:

0 1 2 3 4
+---+---+---+---+---+
0 | 1 | 0 | 0 | 1 | 1 |
+---+---+---+---+---+
1 | 0 | 0 | 1 | 1 | 0 |
+---+---+---+---+---+
2 | 0 | 1 | 1 | 1 | 1 |
+---+---+---+---+---+
3 | 1 | 1 | 1 | 1 | 1 |
+---+---+---+---+---+
4 | 1 | 0 | 1 | 1 | 0 |
+---+---+---+---+---+

I am storing only the upper triangular portion plus the diagonal
continuously
in memory, so I have:

00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The two digits numbers above the schematic correspond to the indexes
in the
original matrix. Assuming each position is one char, instead of N^2
bytes the
packed way is using only N*(N+1)/2 bytes. The trade-off is of course
having a
more costly way to access the matrix. Instead of simply doing M[i][J],
I need
a function like this:

#define N 10000 /* size of matrix */
...
1 unsigned char is_set(unsigned int i, unsigned int j)
2 {
3 unsigned int tmp;
4 unsigned int real_index;
5
6 if (i > j) {
7 tmp = j;
8 j = i;
9 i = tmp;
10 }
11
12 real_index = (i * N) - (i * (i - 1) / 2) + j - i;
13 return packed_array[real_index];
14 }

I came up with the formula to convert the indexes from the traditional
array
to the packed array and it seems to work.

Then I realized it is a waste to use a char to store a single
information, so
I decided to the bits of a char to do this, then instead of using N
(N-1)/2
bytes, I would need only 1/8 of this memory. Of course this leads to
bit
manipulation and all. The final result, with a few tweaks, is the
following
function:

1 unsigned char is_set(int i, int j)
2 {
3 register unsigned int idx;
4
5 if (i > j) {
6 register int temp = i;
7 i = j;
8 j = temp;
9 }
10
11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
char) 0x80);
17 }

(I reordered the expression to reduce the number of multiplications
and use
shift operations.)

This function is called a lot of times, and this is having some
noticeable
impact in the overall performance. The memory saving is really huge,
but
the performance is making me wonder if this is really the best choice.

Does anyone know how could I make it faster, or a clever way to solve
this
problem, or any suggestions on the code I posted above? Only
limitation
is that they only work properly if a char is 8 bits long, but I can
live with that.

Thanks,
Gustavo

PS. Sorry if the schematics look messed up, with a monospace font they
looked nice.

kid joe
Guest
Posts: n/a

 06-27-2009
On Sat, 27 Jun 2009 14:07:55 -0700, Gustavo Rondina wrote:
> Hello all,
>
> First of all, this might be somewhat off-topic for some of you, so let
> me
>
> I am writing a program that requires a huge 2D symmetric matrix of
> booleans
> (i.e., either 0 or 1). My first thought was to store it in a packed
> way, so
> the memory needed would be about half of the amount required to store
> the
> whole array in a traditional fashion. My packing is described as
> follows.
>
> Given this symmetric matrix:
>
> 0 1 2 3 4
> +---+---+---+---+---+
> 0 | 1 | 0 | 0 | 1 | 1 |
> +---+---+---+---+---+
> 1 | 0 | 0 | 1 | 1 | 0 |
> +---+---+---+---+---+
> 2 | 0 | 1 | 1 | 1 | 1 |
> +---+---+---+---+---+
> 3 | 1 | 1 | 1 | 1 | 1 |
> +---+---+---+---+---+
> 4 | 1 | 0 | 1 | 1 | 0 |
> +---+---+---+---+---+
>
> I am storing only the upper triangular portion plus the diagonal
> continuously
> in memory, so I have:
>
> 00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
> +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
> | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
> +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
>
> The two digits numbers above the schematic correspond to the indexes
> in the
> original matrix. Assuming each position is one char, instead of N^2
> bytes the
> packed way is using only N*(N+1)/2 bytes. The trade-off is of course
> having a
> more costly way to access the matrix. Instead of simply doing M[i][J],
> I need
> a function like this:
>
> #define N 10000 /* size of matrix */
> ...
> 1 unsigned char is_set(unsigned int i, unsigned int j)
> 2 {
> 3 unsigned int tmp;
> 4 unsigned int real_index;
> 5
> 6 if (i > j) {
> 7 tmp = j;
> 8 j = i;
> 9 i = tmp;
> 10 }
> 11
> 12 real_index = (i * N) - (i * (i - 1) / 2) + j - i;
> 13 return packed_array[real_index];
> 14 }
>
> I came up with the formula to convert the indexes from the traditional
> array
> to the packed array and it seems to work.
>
> Then I realized it is a waste to use a char to store a single
> information, so
> I decided to the bits of a char to do this, then instead of using N
> (N-1)/2
> bytes, I would need only 1/8 of this memory. Of course this leads to
> bit
> manipulation and all. The final result, with a few tweaks, is the
> following
> function:
>
> 1 unsigned char is_set(int i, int j)
> 2 {
> 3 register unsigned int idx;
> 4
> 5 if (i > j) {
> 6 register int temp = i;
> 7 i = j;
> 8 j = temp;
> 9 }
> 10
> 11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
> 16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
> char) 0x80);
> 17 }
>
> (I reordered the expression to reduce the number of multiplications
> and use
> shift operations.)
>
> This function is called a lot of times, and this is having some
> noticeable
> impact in the overall performance. The memory saving is really huge,
> but
> the performance is making me wonder if this is really the best choice.
>
> Does anyone know how could I make it faster, or a clever way to solve
> this
> problem, or any suggestions on the code I posted above? Only
> limitation
> is that they only work properly if a char is 8 bits long, but I can
> live with that.
>
> Thanks,
> Gustavo
>
> PS. Sorry if the schematics look messed up, with a monospace font they
> looked nice.

Hi Gustavo,

The bitshifts and ANDs will be very fast so probably the problem is the
overhead of a function call. Try declaring your function as inline (or use
a Macro).

Cheers,
Joe

--
...................... o _______________ _,
` Good Evening! , /\_ _| | .-'_|
`................, _\__`[_______________| _| (_|
] [ \, ][ ][ (_|

Barry Schwarz
Guest
Posts: n/a

 06-27-2009
On Sat, 27 Jun 2009 14:07:55 -0700 (PDT), Gustavo Rondina
<(E-Mail Removed)> wrote:

>Hello all,
>
>First of all, this might be somewhat off-topic for some of you, so let
>me
>
>I am writing a program that requires a huge 2D symmetric matrix of
>booleans
>(i.e., either 0 or 1). My first thought was to store it in a packed
>way, so
>the memory needed would be about half of the amount required to store
>the
>whole array in a traditional fashion. My packing is described as
>follows.

snip conventional 2-d array description

>I am storing only the upper triangular portion plus the diagonal
>continuously
>in memory, so I have:

snip description of converted linear array

>Then I realized it is a waste to use a char to store a single
>information, so
>I decided to the bits of a char to do this, then instead of using N
>(N-1)/2
>bytes, I would need only 1/8 of this memory. Of course this leads to
>bit
>manipulation and all. The final result, with a few tweaks, is the
>following
>function:
>
> 1 unsigned char is_set(int i, int j)
> 2 {
> 3 register unsigned int idx;
> 4
> 5 if (i > j) {
> 6 register int temp = i;
> 7 i = j;
> 8 j = temp;
> 9 }
>10
>11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
>16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
>char) 0x80);

What happened to lines 12 through 15?

>17 }
>
>(I reordered the expression to reduce the number of multiplications
>and use
>shift operations.)

>
>This function is called a lot of times, and this is having some
>noticeable
>impact in the overall performance. The memory saving is really huge,
>but
>the performance is making me wonder if this is really the best choice.
>

Welcome to the common experience of being able to optimize for time or
space but not both.

>Does anyone know how could I make it faster, or a clever way to solve
>this
>problem, or any suggestions on the code I posted above? Only
>limitation
>is that they only work properly if a char is 8 bits long, but I can
>live with that.

You need to look at the generated assembly code.

N<<1 is a compile-time constant and should not be computed
each time. If your compiler is not optimizing this, you might want to
compute it once and store it in a static variable.

Assuming packed_array has type unsigned char, the arithmetic
in the return statement is still performed using int and then
converted to unsigned. Is the generated code spending a lot of time
converting between the types? If so, you might consider using
bit objects instead of 8 bit ones.

In my experience, multiplication is the long tent pole in your
i * ((N << 1) - i - 1)
This is the same as
i * (N << 1) - i *(i+1)
If you increase N until it is a power of 2, the first
multiplication can be replaced by another shift.
The second multiplication depends only on i. i will always be
between 0 and N-1. You could build an N element array of int where
element y is set to y*y+y.
(i << NN) - array[i]
with no multiplication involved. (The array reference should result
in a shift, not a multiply, but you do need to check the assembly
code.)

--
Remove del for email

Ben Bacarisse
Guest
Posts: n/a

 06-28-2009
Gustavo Rondina <(E-Mail Removed)> writes:
<snip>
> I am storing only the upper triangular portion plus the diagonal
> continuously
> in memory, so I have:
>
> 00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
> +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
> | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
> +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
>
> The two digits numbers above the schematic correspond to the indexes
> in the
> original matrix. Assuming each position is one char, instead of N^2
> bytes the
> packed way is using only N*(N+1)/2 bytes. The trade-off is of course
> having a
> more costly way to access the matrix. Instead of simply doing M[i][J],

<snip>
> Then I realized it is a waste to use a char to store a single
> information, so
> I decided to the bits of a char to do this, then instead of using N
> (N-1)/2
> bytes, I would need only 1/8 of this memory. Of course this leads to
> bit
> manipulation and all. The final result, with a few tweaks, is the
> following
> function:
>
> 1 unsigned char is_set(int i, int j)
> 2 {
> 3 register unsigned int idx;
> 4
> 5 if (i > j) {
> 6 register int temp = i;
> 7 i = j;
> 8 j = temp;
> 9 }
> 10
> 11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
> 16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
> char) 0x80);
> 17 }
>
> (I reordered the expression to reduce the number of multiplications
> and use
> shift operations.)
>
> This function is called a lot of times, and this is having some
> noticeable
> impact in the overall performance. The memory saving is really huge,
> but
> the performance is making me wonder if this is really the best
> choice.

If you've coded the rest cleanly, it should be very simple to switch in
a plain array and see how much better/worse it is.

My only alternative is to use an array of pointers rather than a
packed array. You arrange for M[i] to be a pointer to row i. This
replaces one part of the index calculation with an indirection. You
do this along with any of your other plans -- storing only one half of
the matrix and/or using bits rather than bytes.

You might also want to try this: allocate the diagonals and not the
rows. Why? Because you can make the symmetry explicit that way. For
an NxN matrix you allocate an array of 2N-1 pointers P. P[0] points
to an array of 1 element (M[0][N-1]), P[1] to two elements (M[0][N-2]
and M[1][N-1]) and so on up to P[N-1] which points to the N elements
of the main diagonal. The remaining Ps pointer to the *same* arrays
but in reverse order (P[N] = P[N-2], P[N+1] = P[N-3] and so on).

Now, make a new pointer-to-pointer variable D and set it to &P[N-1].
D[0] is now the main diagonal and both D[1] and D[-1] are the (same)
diagonal just above and below it. Element M[i][j] is now indexed as

D[i-j][min(i,j)] i.e: D[i-j][i < j ? i : j]

As with all optimisations, this may or may not pay off but I'd want to
try it out.

You can use the bit allocation trick here as well, of course.

Note that when using an array of pointers you don't have to allocate
the smaller arrays separately. You can so it all on one go and simply
point to the appropriate part.

<snip>
--
Ben.

Thomas Matthews
Guest
Posts: n/a

 06-28-2009
Gustavo Rondina wrote:
> Hello all,
>
> First of all, this might be somewhat off-topic for some of you, so let
> me
>
> I am writing a program that requires a huge 2D symmetric matrix of
> booleans
> (i.e., either 0 or 1). My first thought was to store it in a packed
> way, so
> the memory needed would be about half of the amount required to store
> the
> whole array in a traditional fashion. My packing is described as
> follows.
>
> Given this symmetric matrix:
>
> 0 1 2 3 4
> +---+---+---+---+---+
> 0 | 1 | 0 | 0 | 1 | 1 |
> +---+---+---+---+---+
> 1 | 0 | 0 | 1 | 1 | 0 |
> +---+---+---+---+---+
> 2 | 0 | 1 | 1 | 1 | 1 |
> +---+---+---+---+---+
> 3 | 1 | 1 | 1 | 1 | 1 |
> +---+---+---+---+---+
> 4 | 1 | 0 | 1 | 1 | 0 |
> +---+---+---+---+---+
>
> I am storing only the upper triangular portion plus the diagonal
> continuously
> in memory, so I have:
>
> 00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
> +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
> | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
> +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
>
> The two digits numbers above the schematic correspond to the indexes
> in the
> original matrix. Assuming each position is one char, instead of N^2
> bytes the
> packed way is using only N*(N+1)/2 bytes. The trade-off is of course
> having a
> more costly way to access the matrix. Instead of simply doing M[i][J],
> I need
> a function like this:
>
> #define N 10000 /* size of matrix */
> ...
> 1 unsigned char is_set(unsigned int i, unsigned int j)
> 2 {
> 3 unsigned int tmp;
> 4 unsigned int real_index;
> 5
> 6 if (i > j) {
> 7 tmp = j;
> 8 j = i;
> 9 i = tmp;
> 10 }
> 11
> 12 real_index = (i * N) - (i * (i - 1) / 2) + j - i;
> 13 return packed_array[real_index];
> 14 }
>
> I came up with the formula to convert the indexes from the traditional
> array
> to the packed array and it seems to work.
>
> Then I realized it is a waste to use a char to store a single
> information, so
> I decided to the bits of a char to do this, then instead of using N
> (N-1)/2
> bytes, I would need only 1/8 of this memory. Of course this leads to
> bit
> manipulation and all. The final result, with a few tweaks, is the
> following
> function:
>
> 1 unsigned char is_set(int i, int j)
> 2 {
> 3 register unsigned int idx;
> 4
> 5 if (i > j) {
> 6 register int temp = i;
> 7 i = j;
> 8 j = temp;
> 9 }
> 10
> 11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
> 16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
> char) 0x80);
> 17 }
>
> (I reordered the expression to reduce the number of multiplications
> and use
> shift operations.)
>
> This function is called a lot of times, and this is having some
> noticeable
> impact in the overall performance. The memory saving is really huge,
> but
> the performance is making me wonder if this is really the best choice.
>
> Does anyone know how could I make it faster, or a clever way to solve
> this
> problem, or any suggestions on the code I posted above? Only
> limitation
> is that they only work properly if a char is 8 bits long, but I can
> live with that.
>
> Thanks,
> Gustavo
>
> PS. Sorry if the schematics look messed up, with a monospace font they
> looked nice.

Gustavo,

I suggest you use an unsigned int as your base (foundation) type rather
than char. Many processors are now geared up to fetch integers at a
fast rate. Some processors actually slow down when processing chars.

For example, if a processor has a 32-bit integer, and you are using
8-bit chars, it would fetch 8 bits at a time using char and 32-bits in
one fetch using integer.

Also, you may want to ask yourself if the space saving is necessary
and actually benefits the product. Many "desktop" applications are not
memory sensitive, so more time should be spent on correctness of the
program.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book
http://www.sgi.com/tech/stl -- Standard Template Library

Nobody
Guest
Posts: n/a

 06-28-2009
On Sat, 27 Jun 2009 14:07:55 -0700, Gustavo Rondina wrote:

> (I reordered the expression to reduce the number of multiplications
> and use shift operations.)

The compiler ought to be able to do this for you. Certainly,
multiplication or division by 2 will be translated to a shift for any sane
compiler.

> This function is called a lot of times, and this is having some
> noticeable impact in the overall performance. The memory saving is
> really huge, but the performance is making me wonder if this is really
> the best choice.
>
> Does anyone know how could I make it faster, or a clever way to solve this
> problem, or any suggestions on the code I posted above? Only limitation
> is that they only work properly if a char is 8 bits long, but I can live
> with that.

If the multiplication (i*(N/2-i-1)) is the bottleneck, using a
pre-calculated array of row pointers may be faster.

Also try:

Returning "int" rather than "char".

Using an array of "int"s (or "long"s) rather than "char"s.

Returning 0/1 rather than 0/0x80, e.g.:

return (packed_array[idx>>3] >> (idx&7)) & 1;

Returning zero/non-zero, e.g.:

return (packed_array[idx>>3] & (1 << (idx&7)));

Defining the function as "static inline ..." in a header file.

Finally: storing values as bits rather than bytes will use 1/8th the
memory, but using a triangular matrix will save less than a half, so maybe
the latter isn't worth it.

Gustavo Rondina
Guest
Posts: n/a

 06-28-2009
On Sat, 27 Jun 2009 16:36:06 -0700, Barry Schwarz wrote:

[snip code from previous post]

> What happened to lines 12 through 15?

Sorry, I had copied a previous version of the function with a few extra
unnecessary lines which were removed only after I inserted the line
numbers by hand -- next time 'cat -n' certainly will be used.

> You need to look at the generated assembly code.
>
> N<<1 is a compile-time constant and should not be computed
> each time. If your compiler is not optimizing this, you might want to
> compute it once and store it in a static variable.

I did this and it made the function a bit faster, thanks for the tip.

> Assuming packed_array has type unsigned char, the arithmetic
> in the return statement is still performed using int and then converted
> to unsigned. Is the generated code spending a lot of time converting
> between the types? If so, you might consider using unsigned int instead
> 8 bit ones.

Just now Thomas Matthews suggested this as well in another post, thank
you both. I am using integers now, and it is indeed a little faster.

> In my experience, multiplication is the long tent pole in your
> i * ((N << 1) - i - 1)
> This is the same as
> i * (N << 1) - i *(i+1)
> If you increase N until it is a power of 2, the first
> multiplication can be replaced by another shift.
> The second multiplication depends only on i. i will always be
> between 0 and N-1. You could build an N element array of int where
> element y is set to y*y+y.

This is interesting, I will give it a try. I don't know though if I can
afford increasing N to a power of 2. Only tests will tell.

For now I have the following. The computations involving constants have
been removed from the function and the result is stored somewhere else.

Again, this assumes that a unsigned int is 32 bits long.

1 unsigned int is_set (unsigned int i, unsigned int j)
2 {
3 register unsigned int idx;
4
5 if (i > j)
6 {
7 register int temp = i;
8 i = j;
9 j = temp;
10 }
11
12 /* Original expression:
13
14 idx = i * N - ((i * (i - 1)) / 2) + (j - i)
15
16 In the code below NN == (N << 1) - 1 is a variable
17 initialized somwhere else.
18 */
19
20 idx = NN - i;
21 idx *= i;
22 idx >>= 1;
23 idx += j;
24
25 /* packed_array is a array of unsigned integers */
26 return ((packed_array[idx >> 5] << (idx & 31)) & 0x80000000);
27 }

Cheers,

--
Gustavo

Gustavo Rondina
Guest
Posts: n/a

 06-28-2009
On Sun, 28 Jun 2009 00:18:18 -0700, Thomas Matthews wrote:

> I suggest you use an unsigned int as your base (foundation) type rather
> than char. Many processors are now geared up to fetch integers at a
> fast rate. Some processors actually slow down when processing chars.

I did this modification and indeed it is a bit faster (something around
10% to 15%, but I haven't run extensive benchmarks yet.)

[snip]

> Also, you may want to ask yourself if the space saving is necessary and
> actually benefits the product. Many "desktop" applications are not
> memory sensitive, so more time should be spent on correctness of the
> program.

I see. Well, this is a scientific application for simulating the dynamics
of system of molecules. The memory requirement scales with N, which is the
number of molecules/atoms/beads one wants to simulate. In the future the
plan is to port this application to run on GPUs, where memory is not as
abundant as in the regular PC, so I am trying to design the application to
use as less memory as possible so large systems can be simulated.

With the highly parallel design of GPUs the expensive computations of
calculating indexes and such might be hidden by the enormous amount of
threads one can run, but nonetheless I would like to get a fast ``serial''
(i.e., CPU) version as well.

The replies to my original question were very good and gave me some ideas
I might want to try.

Cheers,

--
Gustavo

Barry Schwarz
Guest
Posts: n/a

 06-28-2009
On Sun, 28 Jun 2009 08:49:37 +0000 (UTC), Gustavo Rondina
<(E-Mail Removed)> wrote:

>On Sat, 27 Jun 2009 16:36:06 -0700, Barry Schwarz wrote:

snip

>> In my experience, multiplication is the long tent pole in your
>> i * ((N << 1) - i - 1)
>> This is the same as
>> i * (N << 1) - i *(i+1)
>> If you increase N until it is a power of 2, the first
>> multiplication can be replaced by another shift.
>> The second multiplication depends only on i. i will always be
>> between 0 and N-1. You could build an N element array of int where
>> element y is set to y*y+y.

>
>This is interesting, I will give it a try. I don't know though if I can
>afford increasing N to a power of 2. Only tests will tell.

Your N was defined as 10,000. If this was just an approximation as
opposed to a precise requirement, why not use 8,192 as your
approximation. Otherwise your are up to 16,384. The extra bits in
packed_array won't matter that much. Even the extra 6K elements of
the int array should not be that big a deal given how much space you
saved by packing the boolean values.

>
>For now I have the following. The computations involving constants have
>been removed from the function and the result is stored somewhere else.
>
>Again, this assumes that a unsigned int is 32 bits long.
>
> 1 unsigned int is_set (unsigned int i, unsigned int j)
> 2 {
> 3 register unsigned int idx;
> 4
> 5 if (i > j)
> 6 {
> 7 register int temp = i;

This should be unsigned also.

> 8 i = j;
> 9 j = temp;
> 10 }

snip

--
Remove del for email

Flash Gordon
Guest
Posts: n/a

 06-29-2009
Gustavo Rondina wrote:

<snip>

> I see. Well, this is a scientific application for simulating the dynamics
> of system of molecules. The memory requirement scales with N, which is the
> number of molecules/atoms/beads one wants to simulate. In the future the
> plan is to port this application to run on GPUs, where memory is not as
> abundant as in the regular PC, so I am trying to design the application to
> use as less memory as possible so large systems can be simulated.
>
> With the highly parallel design of GPUs the expensive computations of
> calculating indexes and such might be hidden by the enormous amount of
> threads one can run, but nonetheless I would like to get a fast ``serial''
> (i.e., CPU) version as well.
>
> The replies to my original question were very good and gave me some ideas
> I might want to try.

A few points you might want to consider as you will be switching to a
different processor...

The size of unsigned in could well be different on the final target
processor.

There are differences between compilers on what they can optimise
efficiently.

There are difference between processors on what opperations are efficient.

So, all your careful benchmarking and speed optimisations (and some
space optimisations) could well be undone by switching to a different
processor.

So I would recommend keeping all of the different versions of the code
and getting hold of the real target (or a simulator for it) and re-doing
the work on the final target system!
--
Flash Gordon