Can we assume the inputs are all present at once (in parallel)? Since there
are only 5 inputs (5 x 8 = 40 bits), is that a reasonable assumption?
The required output throughput is 33-50 MHz, i.e., it should output 33
million to 50 million 12-bit elements per second,
and each set of 5 inputs corresponds to 10x10 = 100 such 12-bit output
elements...
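
To put numbers on that, here is a rough Python sketch of the rates implied by
the figures above. The function name and constants are only illustrative;
nothing here is specific to any particular design.

```python
# Back-of-the-envelope rates implied by the quoted requirement:
# 33-50 M output elements/s, 100 output elements per set of 5 inputs.
OUTPUTS_PER_SET = 10 * 10   # one 10x10 output matrix per input set
INPUTS_PER_SET = 5          # x1..x5


def input_rates(output_rate_hz):
    """Return (input sets per second, input elements per second)."""
    sets_per_s = output_rate_hz / OUTPUTS_PER_SET
    inputs_per_s = sets_per_s * INPUTS_PER_SET
    return sets_per_s, inputs_per_s


for rate in (33e6, 50e6):
    sets_per_s, inputs_per_s = input_rates(rate)
    print(f"{rate / 1e6:.0f} M outputs/s -> "
          f"{sets_per_s / 1e3:.0f} k sets/s, "
          f"{inputs_per_s / 1e6:.2f} M inputs/s")
```

By element count, each new set of 5 inputs has to be held for the time it
takes to produce 100 outputs.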
The technology I am going to use is 0.25u.
I think the inputs are naturally serial, but again, I am not sure how to do
the parallel/serial partition of the internal MACs... and how to pace the
outputs...
It seems the inputs are faster than the outputs; maybe I should let the
input wait after it is fed into the unit?
Can you give some further advice on how to do this architecture and how to
do the timing? I think it is really difficult... Could you also point me to
some resources?
Thanks very much,
Walala
The inputs are 5 elements (either sequential or parallel), each having 8
bits.
It needs to multiply each of these 5 inputs with a predefined constant
matrix (10x10, floating point scaled and rounded to integer). The output
will be a 10x10 matrix summing the above five matrices up, each element
having 12 bits. So for each element of the matrix, I can have a MAC unit.
The internal computation will be 16 bits.
Hence for each set of 5 inputs x1, x2, x3, x4, x5, the output matrix is
Y = x1*C1 + x2*C2 + x3*C3 + x4*C4 + x5*C5, where Y, C1, C2, C3, C4, C5 are
10x10 matrices.
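
A small behavioural model of that sum can be handy as a reference to check
hardware against. The sketch below is only an illustration under assumptions
that are not in the post: unsigned 8-bit inputs, small unsigned placeholder
constants (the real C1..C5 come from the design), and a 12-bit result taken
as the top 12 bits of the 16-bit accumulator.

```python
# Behavioural reference model of Y = x1*C1 + ... + x5*C5.
# Assumptions (not from the post): unsigned inputs and constants, and the
# 12-bit output is the 16-bit accumulator shifted right by 4.
N = 10            # matrices are N x N
NUM_INPUTS = 5    # x1..x5


def make_placeholder_constants():
    """Hypothetical constants in 0..51 so 5 products fit in 16 bits."""
    return [[[(k + r + c) % 52 for c in range(N)] for r in range(N)]
            for k in range(NUM_INPUTS)]


def mac_matrix(x, consts, acc_bits=16, out_bits=12):
    """Compute the N x N output matrix for one set of inputs x[0..4]."""
    assert len(x) == NUM_INPUTS and all(0 <= v < 256 for v in x)
    y = [[0] * N for _ in range(N)]
    for r in range(N):
        for c in range(N):
            acc = 0
            for k in range(NUM_INPUTS):
                acc += x[k] * consts[k][r][c]   # one multiply-accumulate
            assert acc < (1 << acc_bits), "accumulator overflow"
            y[r][c] = acc >> (acc_bits - out_bits)  # keep the top 12 bits
    return y


consts = make_placeholder_constants()
y = mac_matrix([1, 2, 3, 4, 5], consts)
```

The overflow assertion is there because whether 16 bits is enough depends on
the range of the scaled constants, which the post does not give.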
What is your throughput requirement and what technology are you using?
That will determine the amount of parallelism that you need.
If the requirement is low enough, then only one MAC unit will be required.
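
As a rough illustration of how throughput and clock rate set the
parallelism, the sketch below counts MAC units assuming each unit is fully
pipelined and completes one multiply-accumulate per clock, with 5 MACs per
output element. The clock frequencies used are hypothetical and say nothing
about what a given technology can reach.

```python
def mac_units_needed(output_rate_hz, clock_hz, macs_per_output=5):
    """Minimum number of MAC units, each doing one MAC per clock cycle."""
    required = output_rate_hz * macs_per_output   # MAC operations per second
    return -(-required // clock_hz)               # integer ceiling division


# Hypothetical clock rates, for illustration only.
for clock in (100_000_000, 250_000_000):
    for rate in (33_000_000, 50_000_000):
        print(clock, rate, mac_units_needed(rate, clock))
```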
Next, you must define the timing of the inputs. If they are serial, then
it's easy: stuff the data into the MAC unit. Being pipelined (right?),
the MAC unit will output the answer N clocks later.
If you have more parallelism in your input data than you want in your
MAC units, then you will need to buffer the data. This circuit will be
easy to design once you define the timing requirements.
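
One way to picture the buffering is a cycle schedule: the 5 inputs sit in a
small register buffer while the MAC units work through the 100 output
elements. The sketch below is one possible partition, not the only one. It
assumes output elements are dealt round-robin to the units and each unit
spends 5 consecutive cycles per element; pipeline latency is ignored.

```python
def schedule(num_units, n_outputs=100, n_inputs=5):
    """Yield (cycle, unit, output_index, input_index) for one input set.

    Unit u handles output elements u, u + num_units, ... and spends
    n_inputs consecutive cycles accumulating each of them.
    """
    for unit in range(num_units):
        cycle = 0
        for out in range(unit, n_outputs, num_units):
            for k in range(n_inputs):
                yield cycle, unit, out, k
                cycle += 1


def cycles_per_set(num_units, n_outputs=100, n_inputs=5):
    """Cycles for which the buffered inputs must be held stable."""
    last = max(c for c, _, _, _ in schedule(num_units, n_outputs, n_inputs))
    return last + 1


for units in (1, 2, 3, 100):
    print(units, "unit(s):", cycles_per_set(units), "cycles per input set")
```

With a second buffer the next 5 inputs could be loaded while the current set
is still being processed, which is one way to let the input "wait".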
