Velocity Reviews > Optimize a discovery algorithm

# Optimize a discovery algorithm

pozz
Guest
Posts: n/a

 01-01-2014
I'm programming a small 8-bit microcontroller (AVR from Atmel). This is
the single master in a 2-wires RS485 network where N slaves (AVR based)
are connected. In RS485 networks, only one transmitter can be active at
a time, otherwise the transmitted characters will arrive corrupted to
the receivers. Moreover, the characters transmitted from the active
network that are listening if they aren't transmitting).

I want to design and implement a discovery technique: the master should
be able to understand how many slaves are connected and assign them a
I can use the 32-bits serial number stored in a non volatile memory in
the slaves and master (each node on the bus has a different serial number).

The discovery algorithm I was thinking of is based on the following
consideration: if two or more slaves answer to a query originated by the
master, two cases could happen.

1. the answer messages are separated in time, so they arrive
uncorrupted to the master;
2. two or more answer messages overlap, so they could arrive
corrupted to the master.

The master is able to detect a corrupted message (through some
mechanisms, such as CRC).

The algorithm I'm implementing is based on binary search: the master
sends a broadcast query with an interval of serial numbers. All the
slaves with a serial number in the interval answers to the master. As
the slave. When a slave receives a new address, it stops responding to
discovery queries from the master.

A test code for the algorithm is:

uint32_t rx_sn;

int
autodetect(void)
{
uint32_t value = 0;
uint32_t vbit = 0;
uint32_t mbit = 0;
int result = 0;
while (result == 0) {
vbit = 0;
mbit = 0;
continue;
} else if (r == NO_ANSWER) {
if (vbit == 0) {
result = 1; /* No more slaves to detect */
} else {
value |= mbit;
vbit = 1;
continue;
}
} else {
result = -1; /* Error */
}
} else { /* r == GARBAGE */
result = -2; /* Error */
} else {
if (mbit) mbit <<= 1; else mbit = 1;
vbit = 0;
}
}
}
return result;
}

The function query() simulates the transmitting on the bus of the
message "Are there any slave with serial numbers in the range defined by
this mask and this value?". The return value of query() could be
NO_ANSWER (no slave has responded at all), GARBAGE (two or more slave
If the result is ANSWER_OK, the serial number of the slave that has
responded is stored in global variable rx_sn.

The function assign_new_address() simulates the transmitting on the bus
of the message "The slave with this serial number will use this address
for further communication. From now on, it won't answer anymore to
discovery queries".

The interval of serial numbers is defined by a 32-bits MASK and a
32-bits VALUE: a serial number SN is in the range if

If garbage is received, the algorithm tries to reduce the interval
fixing the value of the least significant bit, first to 0 and after to
1. If more than one slave answer, the successive left bit is fixed.

I think this algorithm works, but it is slow and not optimized. As you
can note, when a new slave is discovered, it starts from the beginning.
This approach will loose some time to sends the same query many times.

Suppose to have three slaves on the network with serial numbers
00000001, 00000003 and 00000007. The previous algorithm will produce 12
queries:

Assigning new address for SN 0x00000001
Assigning new address for SN 0x00000003
Assigning new address for SN 0x00000007

The query 2 has no answer, so it's a lost of time to send again during
the algorithm (see query 6): it will never be answered.

I think this could be solved avoiding to restart the algorithm from the
beginning every time a new slave is found. I have the impression it
could be optimized with recursion, but I can't use recursion on AVR
small microcontroller (stack space is very small). I know every
recursive algorithm can be written without recursion, so it could be my
solution.

Is someone suggest an optimization of my technique?

Thank you.

pozz
Guest
Posts: n/a

 01-02-2014
Il 02/01/2014 03:05, Robert Wessel ha scritto:
> I've posted this before in comp.arch.embedded,

Yes, I remember it. The thread you wrote in was originated always from
but now I've problem in implementing the algorithm in C.

> but don't use a mask
> and value, use a range instead:

With mask and value you don't have a mathematical range (min..max), but
always a group of serial numbers (serial numbers that have certain bits
in certain positions). Both tecnique are identical, I think.

I hope binary operations (AND) on 32-bits variables will be much more
simple for 8-bit core than mathematical comparisons..

> It's only necessary to be able to reasonably reliably detect that
> someone has answered. The proper way to do the query is to specify a
> range in which you'd like devices to respond. For example, let's say
> you have a sixteen bit serial number (too short, but it keeps the
> example short). The first "Are You There" query would specify
> 0x0000-0xffff, so any device would respond. If the host got no
> response, it's done. If it got a good response* (say 0x1234: "I'm
> Here"), then it would record that and then re-query for the ranges
> left and right of the response (IOW, 0x000-0x1233 and 0x1235-0xffff).
> If it got a gibberish response (a collision), it would simply split
> the range in two (IOW, a binary search) and re-query those subranges
> (IOW, 0x0000-0x7fff and 0x8000-0xffff). The process repeats until all
> queried subranges return nothing.

It's very similar to my algorithm except for two things.
You use a numerical min..max range and I use bitmask, but it is
equivalent (see above).
You don't repeat the entire process from the beginning, but you continue
querying remaining subranges until the last one. This is the complex
job I'm not able to make and it is my true question of my original post:
how to implement the discovery algorithm without needing to restart from
the beginning (and without using recursion). It's strictly a
programming problem, related to C language.

> The performance is usually only slightly worse than linear with the
> number of devices to find, although you can construct pathological
> cases where it's O(n*lg(m)), where m is the number of possible serial
> numbers. So much longer serial numbers have only modest impact.
>
> Repeating the whole process several times can help compensate for less
> than perfect collision detection.
>
> If you don't have a reasonable chance of detecting a collision, the
> problem is much harder.
>
>
> *Either because that was the only device in the subrange, or it
> "overpowered" all the other devices in the subrange, and the host got
> a good message from it.

I agree with you. Indeed I have chosen to use your tecnique, but now I
have to implement it

pozz
Guest
Posts: n/a

 01-02-2014
Il 02/01/2014 02:51, Sam ha scritto:
> On 01/01/2014 23:51, pozz wrote:
>> I'm programming a small 8-bit microcontroller (AVR from Atmel). This is
>> the single master in a 2-wires RS485 network where N slaves (AVR based)
>> are connected. In RS485 networks, only one transmitter can be active at
>> a time, otherwise the transmitted characters will arrive corrupted to
>> the receivers. Moreover, the characters transmitted from the active
>> transmitter are received by all receivers (all other nodes in the
>> network that are listening if they aren't transmitting).
>>
>> I want to design and implement a discovery technique:

>
> FYI there was a thread in comp.arch.embedded in mid-December (e.g.
> around 18th) called: "Re: RS485 bus and auto-addressing". You might
> find some value in reading that one - even if that thread doesn't
> completely match your intended algorithm (e.g. that was discussing
> addresses), some of the issues described there might be relevant.

Yes, I know very well that thread, because it was originated by me.

I learned many things by reading the answers to my post, and now I want
to implement the technique described here, inspired from the thread in
comp.arch.embedded.

Now the problem is strictly related to programming: how to optimize the
algorithm I have chosen, so reducing the number of queries on the bus
that are time consuming, without using recursion.

pozz
Guest
Posts: n/a

 01-02-2014
Il 02/01/2014 03:05, Robert Wessel ha scritto:
> I've posted this before in comp.arch.embedded, but don't use a mask
> and value, use a range instead:

(transformed with bitmask operations as I'm trying to do) and another
approach.

Because collision detection is problematic in RS485 bus, slaves answer
with a break character to queries and not with a data message. If there
are two or more slaves that answer, the break character is normally
detected by the master as there was just one slave.

This approach is more reliable for sure, but it costs much more queries
to detect the exact serial number. For example, if I have just one
slave on the bus, with your (and mine) approach a single query is
sufficient. With break characters I need about 32 queries, until I
remain with a one-element range.

I think our approach is good if auto-detection is controlled by the
user: a new slave is installed, the user push a button to start
auto-detection and check the result, otherwise it could restart again
the process.

The second approach could be better in situations where the
auto-detection isn't initiated by a user but automatically with a
regular frequency, it isn't user controlled (the result isn't checked by
a user) and the time needed to discover the new slave isn't critical.

Johannes Bauer
Guest
Posts: n/a

 01-02-2014
On 02.01.2014 08:51, pozz wrote:

> I hope binary operations (AND) on 32-bits variables will be much more
> simple for 8-bit core than mathematical comparisons..

Definitely not. I'd guess exactly the same amount of clock cycles per
operation (four instructions and one conditional branch). You'll see it
if you look at the assembly output via objdump.

Regards,
Johannes

--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

> Zumindest nicht öffentlich!

Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <hidbv3\$om2\$(E-Mail Removed)>

pozz
Guest
Posts: n/a

 01-02-2014
Il 02/01/2014 10:23, Johannes Bauer ha scritto:
>> I hope binary operations (AND) on 32-bits variables will be much more
>> simple for 8-bit core than mathematical comparisons..

>
> Definitely not. I'd guess exactly the same amount of clock cycles per
> operation (four instructions and one conditional branch). You'll see it
> if you look at the assembly output via objdump.

You are right... maybe bitmask operations are even less efficient.

Anyway the optimization problem is the same... or similar.

Keith Thompson
Guest
Posts: n/a

 01-02-2014
Johannes Bauer <(E-Mail Removed)> writes:
> On 02.01.2014 08:51, pozz wrote:
>> I hope binary operations (AND) on 32-bits variables will be much more
>> simple for 8-bit core than mathematical comparisons..

>
> Definitely not. I'd guess exactly the same amount of clock cycles per
> operation (four instructions and one conditional branch). You'll see it
> if you look at the assembly output via objdump.

I'm not so sure about that.

If the CPU can only perform 8-bit operations (I have no idea whether
that's the case or not), it can do a 32-bit AND, OR, or XOR in 4 8-bit
operations. For arithmetic operations, it has to deal with carry bits
for addition and subtraction, and more complex interactions for
multiplication and division.

Bitwise operations are, on a small scale, "embarassingly parallel".
https://en.wikipedia.org/wiki/Embarassingly_parallel

--
Keith Thompson (The_Other_Keith) http://www.velocityreviews.com/forums/(E-Mail Removed) <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Willem
Guest
Posts: n/a

 01-02-2014
Keith Thompson wrote:
) Johannes Bauer <(E-Mail Removed)> writes:
)> On 02.01.2014 08:51, pozz wrote:
)>> I hope binary operations (AND) on 32-bits variables will be much more
)>> simple for 8-bit core than mathematical comparisons..
)>
)> Definitely not. I'd guess exactly the same amount of clock cycles per
)> operation (four instructions and one conditional branch). You'll see it
)> if you look at the assembly output via objdump.
)
) I'm not so sure about that.
)
) If the CPU can only perform 8-bit operations (I have no idea whether
) that's the case or not), it can do a 32-bit AND, OR, or XOR in 4 8-bit
) operations. For arithmetic operations, it has to deal with carry bits
) for addition and subtraction, and more complex interactions for
) multiplication and division.

The addition and subtraction operations on 8-bit CPU's (or any CPU for that
matter) already takes the carry bits into account, so 'it has to deal with
carry bits' has been built into the hardware.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

glen herrmannsfeldt
Guest
Posts: n/a

 01-02-2014
Keith Thompson <(E-Mail Removed)> wrote:
> Johannes Bauer <(E-Mail Removed)> writes:
>> On 02.01.2014 08:51, pozz wrote:
>>> I hope binary operations (AND) on 32-bits variables will be much more
>>> simple for 8-bit core than mathematical comparisons..

>> Definitely not. I'd guess exactly the same amount of clock cycles
>> per operation (four instructions and one conditional branch).
>> You'll see it if you look at the assembly output via objdump.

> I'm not so sure about that.

> If the CPU can only perform 8-bit operations (I have no idea whether
> that's the case or not), it can do a 32-bit AND, OR, or XOR in 4 8-bit
> operations. For arithmetic operations, it has to deal with carry bits
> for addition and subtraction, and more complex interactions for
> multiplication and division.

This is why some people like litte-endian systems. In this particular
case, of you execute four successive operations, each using the carry
generated by the previous (you also have a carry-in if needed)
You only need to increment an address, never decrement.

But past the 6502, I haven't been convinced. Note that the 6502
doesn't push the address of the following instruction on the stack
in the case of a subroutine call. (It hasn't computes the address
yet.)

But for anything past the 6502, and pretty much anything with
multiply, the advantage pretty much goes away. Well, if you
want the full description "Blaauw and Brooks" in their book
on computer architecture, explain it.

> Bitwise operations are, on a small scale, "embarassingly parallel".
> https://en.wikipedia.org/wiki/Embarassingly_parallel

If you are doing the operations sequentially, though, they aren't
parallel anyway.

-- glen

pozz
Guest
Posts: n/a

 01-02-2014
Il 02/01/2014 17:38, Keith Thompson ha scritto:
> Johannes Bauer <(E-Mail Removed)> writes:
>> On 02.01.2014 08:51, pozz wrote:
>>> I hope binary operations (AND) on 32-bits variables will be much more
>>> simple for 8-bit core than mathematical comparisons..

>>
>> Definitely not. I'd guess exactly the same amount of clock cycles per
>> operation (four instructions and one conditional branch). You'll see it
>> if you look at the assembly output via objdump.

>
> I'm not so sure about that.
>
> If the CPU can only perform 8-bit operations (I have no idea whether
> that's the case or not), it can do a 32-bit AND, OR, or XOR in 4 8-bit
> operations. For arithmetic operations, it has to deal with carry bits
> for addition and subtraction, and more complex interactions for
> multiplication and division.
>
> Bitwise operations are, on a small scale, "embarassingly parallel".
> https://en.wikipedia.org/wiki/Embarassingly_parallel

Those are similar arguments I have to use bitwise operations. I tried
the following instructions with avr-gcc compiler (it's a compiler for
AVR8 core) and the result is the following.

eecfg.x, eecfg.y and eecfg.z are defined as uint32_t.

In the "bitwise method", the test is:

if ((eecfg.x & eecfg.y) == eecfg.z) {
13e: 40 91 da 00 lds r20, 0x00DA
142: 50 91 db 00 lds r21, 0x00DB
146: 60 91 dc 00 lds r22, 0x00DC
14a: 70 91 dd 00 lds r23, 0x00DD
14e: 80 91 d6 00 lds r24, 0x00D6
152: 90 91 d7 00 lds r25, 0x00D7
156: a0 91 d8 00 lds r26, 0x00D8
15a: b0 91 d9 00 lds r27, 0x00D9
15e: 48 23 and r20, r24
160: 59 23 and r21, r25
162: 6a 23 and r22, r26
164: 7b 23 and r23, r27
166: 80 91 de 00 lds r24, 0x00DE
16a: 90 91 df 00 lds r25, 0x00DF
16e: a0 91 e0 00 lds r26, 0x00E0
172: b0 91 e1 00 lds r27, 0x00E1
176: 48 17 cp r20, r24
178: 59 07 cpc r21, r25
17a: 6a 07 cpc r22, r26
17c: 7b 07 cpc r23, r27
17e: 09 f0 breq .+2 ; 0x182 <__stack+0x23>

In the "mathematical method", the test is:

if ((eecfg.x >= eecfg.y) && (eecfg.x <= eecfg.z)) {
13e: 40 91 d6 00 lds r20, 0x00D6
142: 50 91 d7 00 lds r21, 0x00D7
146: 60 91 d8 00 lds r22, 0x00D8
14a: 70 91 d9 00 lds r23, 0x00D9
14e: 80 91 da 00 lds r24, 0x00DA
152: 90 91 db 00 lds r25, 0x00DB
156: a0 91 dc 00 lds r26, 0x00DC
15a: b0 91 dd 00 lds r27, 0x00DD
15e: 48 17 cp r20, r24
160: 59 07 cpc r21, r25
162: 6a 07 cpc r22, r26
164: 7b 07 cpc r23, r27
166: 08 f4 brcc .+2 ; 0x16a <__stack+0xb>
168: a4 c0 rjmp .+328 ; 0x2b2 <__stack+0x153>
16a: 80 91 de 00 lds r24, 0x00DE
16e: 90 91 df 00 lds r25, 0x00DF
172: a0 91 e0 00 lds r26, 0x00E0
176: b0 91 e1 00 lds r27, 0x00E1
17a: 84 17 cp r24, r20
17c: 95 07 cpc r25, r21
17e: a6 07 cpc r26, r22
180: b7 07 cpc r27, r23
182: 08 f4 brcc .+2 ; 0x186 <__stack+0x27>

21 instructions for bitwise comparison versus 23 instructions for
mathematical comparison. There isn't a good reason to choose one method
for reducing instructions.