Velocity Reviews > simd-style coding

# simd-style coding

ben-goldberg@hotmail.com
Guest
Posts: n/a

 02-07-2013
I've got some code which was written to make use of some integers as vectors of bits, which deals with only a couple of bits at a time.

I'd like to speed it up, and use whole words at a time... similar to how SIMD coding works.

Here's some pseudocode for how the original bit-at-a-time code works:

outer_loop_start:

old_bits[0] = lfsr_lsb();
have_old[0] = true;
new_bit = lfsr_lsb();
i = 0;

inner_loop_start:

next_bit = new_bit != old_bits[i];
if( next_bit ) output(new_bit);
new_bit = next_bit;
have_old[i] = false;
i = i + 1;
if( i >= M ) goto outer_loop_start;

if( have_old[i] ) goto inner_loop_start;

have_old[i] = true;
old_bits[i] = new_bit;

goto outer_loop_start;

An LFSR is a Linear Feedback Shift Register, a type of pseudo-random bit generator which produces statistically decent output. This pseudocode takes the bits and hopefully makes them into something cryptographically secure.

But if I can't make it go fast, then it's not worthwhile to try to cryptanalyse it, because there are so many fast, well studied (and believed to be secure) algorithms out there.

So, I want to completely eliminate the inner loop, by keeping an extra unsigned int, and replacing the current inner loop with a few operations which perform, in effect, 32 simultaneous passes through the original inner loop using ands, ors, nots, xors, etc.

What I have so far is:

static unsigned lfsr = 1, have_old = 0, old_bits = 0, new_bits = 0;

void output(unsigned which, unsigned thebits);

void make_some_bits() {
unsigned bit = lfsr & 1;
lfsr = (lfsr >> 1) ^ (0xD0000001u & -(uint32_t)bit);
old_bits += bit;
have_old += 1;
bit = lfsr & 1;
lfsr = (lfsr >> 1) ^ (0xD0000001u & -(uint32_t)bit);
new_bits += bit;

unsigned next_bit = new_bits ^ old_bits;
unsigned output_these = have_old & next_bit;
output( output_these, new_bits );
new_bits = next_bit << 1;

/* Help? */
}

void output(unsigned which, unsigned thebits) {
while( which ) {
int next = which & (which-1);
int lsb = which - next;
printf("%d", (thebits & lsb) ? 1 : 0 );
which = next;
}
}

But I'm not sure what to do next.

Shao Miller
Guest
Posts: n/a

 02-07-2013
On 2/7/2013 01:02, http://www.velocityreviews.com/forums/(E-Mail Removed) wrote:
> I've got some code which was written to make use of some integers as vectors of bits, which deals with only a couple of bits at a time.
>

Ok.

> I'd like to speed it up, and use whole words at a time... similar to how SIMD coding works.
>

C doesn't define the speed of operations. If you want extremes of
speed, you probably don't want to code in C. In C, we can talk about
minimizing the number of side effects and sequence points, but these
aren't exactly related to speed.

> Here's some pseudocode for how the original bit-at-a-time code works:
>
> outer_loop_start:
>
> old_bits[0] = lfsr_lsb();
> have_old[0] = true;
> new_bit = lfsr_lsb();
> i = 0;
>
> inner_loop_start:
>
> next_bit = new_bit != old_bits[i];
> if( next_bit ) output(new_bit);
> new_bit = next_bit;
> have_old[i] = false;
> i = i + 1;
> if( i >= M ) goto outer_loop_start;
>
> if( have_old[i] ) goto inner_loop_start;
>
> have_old[i] = true;
> old_bits[i] = new_bit;
>
> goto outer_loop_start;
>
> An LFSR is a Linear Feedback Shift Register, a type of pseudo-random bit generator which produces statistically decent output. This pseudocode takes the bits and hopefully makes them into something cryptographically secure.
>
> But if I can't make it go fast, then it's not worthwhile to try to cryptanalyse it, because there are so many fast, well studied (and believed to be secure) algorithms out there.
>

Are you thinking about gathering statistics of the code's output and
scrutinizing those statistics? Is a mathematical analysis (without C
code involvement) not possible?

> So, I want to completely eliminate the inner loop, by keeping an extra unsigned int, and replacing the current inner loop with a few operations which perform, in effect, 32 simultaneous passes through the original inner loop using ands, ors, nots, xors, etc.
>

You might be interested in "loop unrolling". Is that what you're

> What I have so far is:
>
>
> static unsigned lfsr = 1, have_old = 0, old_bits = 0, new_bits = 0;
>
> void output(unsigned which, unsigned thebits);
>
> void make_some_bits() {
> unsigned bit = lfsr & 1;
> lfsr = (lfsr >> 1) ^ (0xD0000001u & -(uint32_t)bit);
> old_bits += bit;
> have_old += 1;
> bit = lfsr & 1;
> lfsr = (lfsr >> 1) ^ (0xD0000001u & -(uint32_t)bit);
> new_bits += bit;
>
> unsigned next_bit = new_bits ^ old_bits;
> unsigned output_these = have_old & next_bit;
> output( output_these, new_bits );
> new_bits = next_bit << 1;
>
> /* Help? */
> }
>
> void output(unsigned which, unsigned thebits) {
> while( which ) {
> int next = which & (which-1);
> int lsb = which - next;
> printf("%d", (thebits & lsb) ? 1 : 0 );
> which = next;
> }
> }
>
> But I'm not sure what to do next.
>

Is the goal to rewrite the original pseudo-code in C, or to speed up
some existing C code, or neither of these, or both?

--
- Shao Miller
--
"Thank you for the kind words; those are the kind of words I like to hear.

Cheerily," -- Richard Harter

glen herrmannsfeldt
Guest
Posts: n/a

 02-07-2013
Shao Miller <(E-Mail Removed)> wrote:
> On 2/7/2013 01:02, (E-Mail Removed) wrote:
>> I've got some code which was written to make use of some integers as
>> vectors of bits, which deals with only a couple of bits at a time.

> Ok.

>> I'd like to speed it up, and use whole words at a time... similar
>> to how SIMD coding works.

> C doesn't define the speed of operations. If you want extremes of
> speed, you probably don't want to code in C. In C, we can talk about
> minimizing the number of side effects and sequence points, but these
> aren't exactly related to speed.

Well, yes, but if you do bit manipulation sizeof(int)*CHAR_BIT bits
at a time instead of one bit at a time, it is likely faster.

(snip)

>> An LFSR is a Linear Feedback Shift Register, a type
>> of pseudo-random bit generator which produces statistically
>> decent output. This pseudocode takes the bits and hopefully
>> makes them into something cryptographically secure.

>> But if I can't make it go fast, then it's not worthwhile to try
>> to cryptanalyse it, because there are so many fast, well studied
>> (and believed to be secure) algorithms out there.

The LFSRs used for CRC calculation are commonly done in software
one byte at a time. That is, the whole 32 bit CRC is processed
with a table lookup, one 32 bit XOR, and a shift. (Not necessarily
in that order.)

Here is a CRC calculating routine and table generating routine.
Though it is more usual to put in a static 256 entry table
initialized with constants.

#define POLYNOMIAL 0x04c11db7L
static unsigned long crc_table[256];
void gen_crc_table()
/* generate the table of CRC remainders for all possible bytes */
{ register int i, j; register unsigned long crc_accum;
for ( i = 0; i < 256; i++ )
{ crc_accum = ( (unsigned long) i << 24 );
for ( j = 0; j < 8; j++ )
{ if ( crc_accum & 0x80000000L )
crc_accum =
( crc_accum << 1 ) ^ POLYNOMIAL;
else
crc_accum =
( crc_accum << 1 ); }
crc_table[i] = crc_accum; }
return; }

unsigned long update_crc(unsigned long crc_accum, char *data_blk_ptr,
int data_blk_size)
/* update the CRC on the data block one byte at a time */
{ register int i, j;
for ( j = 0; j < data_blk_size; j++ )
{ i = ( (int) ( crc_accum >> 24) ^ *data_blk_ptr++ ) & 0xff;
crc_accum = ( crc_accum << 8 ) ^ crc_table[i]; }
return crc_accum; }

-- glen

ben-goldberg@hotmail.com
Guest
Posts: n/a

 02-08-2013
On Thursday, February 7, 2013 2:00:31 AM UTC-5, Shao Miller wrote:
> On 2/7/2013 01:02, (E-Mail Removed) wrote:
>
> > I've got some code which was written to make use of some integers
> > as vectors of bits, which deals with only a couple of bits at a time.

>
>
> Ok.
>
>
> > I'd like to speed it up, and use whole words at a
> > time... similar to how SIMD coding works.

>
> C doesn't define the speed of operations. If you want extremes of
> speed, you probably don't want to code in C. In C, we can talk about
> minimizing the number of side effects and sequence points, but these
> aren't exactly related to speed.

True, C doesn't define how long any particular operation takes.
But generally, fewer operations takes less time.

For example, suppose my original code were this:

for( int i = 0; i < sizeof(unsigned)*CHAR_BIT; ++i ) {
int bit_from_a = (a >> i) & 1;
int bit_from_b = (b >> i) & 1;
unsigned bit_for_c = bit_from_a ^ bit_from_b;
c = (c & ~(1<<i)) + (bit_for_c << i);
}

and I replaced it with this:
c = a ^ b;

The latter is unquestionably faster, because the compiler isn't smart
enough to convert the first into the second.

> > Here's some pseudocode for how the original bit-at-a-time code works:

>

[snip]
> > An LFSR is a Linear Feedback Shift Register, a type of pseudo-random bit
> > generator which produces statistically decent output. This pseudocode
> > takes the bits and hopefully makes them into something cryptographically
> > secure.

>
> >

>
> > But if I can't make it go fast, then it's not worthwhile to try to
> > cryptanalyse it, because there are so many fast, well studied (and believed
> > to be secure) algorithms out there.

>
>
> Are you thinking about gathering statistics of the code's output and
> scrutinizing those statistics?

Oh, I've already done that. I know that it's output is statistically good.

having a statistically good pseudorandom generator doesn't guarantee
cryptographic goodness.

For example, MT19937 produces statistically excellent output,
and it is very fast, but you can't use it (by itself) as a stream
cipher, since one can very easily use a known plaintext attack
compute MT's internal state.

> Is a mathematical analysis (without C
> code involvement) not possible?

Determining whether something is cryptographically secure does not require C code, you're quite right in your implication.

However, cryptographic analysis is difficult in general, and quite time consuming.

But even supposing I do analyse the algorithm, and find it to be secure, it's going to be fairly useless if I can't make it run faster than the BBS cipher.

> > So, I want to completely eliminate the inner loop, by keeping an extra
> > unsigned int, and replacing the current inner loop with a few operations
> > which perform, in effect, 32 simultaneous passes through the original inner
> > loop using ands, ors, nots, xors, etc.

>
>
> You might be interested in "loop unrolling". Is that what you're

Simple loop unrolling would convert the inner loop of the algorithm
into 32 or 64 repetitions of the inside of that loop.

What I want is to eliminate the inner loop entirely,
in the same way the example loop I showed earlier was
converted to the much, much, simpler code, c = a ^ b.

> > What I have so far is:

>

[snip]
>
> Is the goal to rewrite the original pseudo-code in C, or to speed up
> some existing C code, or neither of these, or both?

I already have some code in C, which does the same as the pseudo-code.

Here's a working code snippet:

But it's still much much much too slow to be a practical cipher.

I want to turn the entire inner loop into three
or four xors, ands, etc, with no loop at all, and
no branching in the code that replaces that loop.

I'm perfectly willing to have the output bits come
out in a different order, if that's necessary for the
code to be sped up.

Pierre Asselin
Guest
Posts: n/a

 02-08-2013
(E-Mail Removed) wrote:

> [ actual C question snipped ]
>
> An LFSR is a Linear Feedback Shift Register, a type of pseudo-random
> bit generator which produces statistically decent output. This
> pseudocode takes the bits and hopefully makes them into something
> cryptographically secure.

No! These things are not designed for cryptography.

For the C question I was going to suggest you look at CRC routines
on the net, but Glen already did --and provided code. If I remember
correctly there is also a good discussion in the Wikipedia article
(which one is for you to search !)

--
pa at panix dot com

Bart van Ingen Schenau
Guest
Posts: n/a

 02-08-2013
On Wed, 06 Feb 2013 22:02:20 -0800, ben-goldberg wrote:

> I've got some code which was written to make use of some integers as
> vectors of bits, which deals with only a couple of bits at a time.
>
> I'd like to speed it up, and use whole words at a time... similar to how
> SIMD coding works.

<snip>
> But I'm not sure what to do next.

As your algorithm works on a different number of bits each time around
the outer loop, your best bet for letting it operate on words (or bytes)
at a time is to use a helper table that stores the values of the masks
and shifts needed for each iteration to compare the right number of bits
and to insert the new bit in the right location.

Bart v Ingen Schenau