VHDL - Synthesis of Concurrent Statements for FIR Filter
#1
Dear List,

I am trying to implement a 16-tap FIR low-pass filter and have written the convolution in VHDL (of which I am a beginner). The input sequence 'x' is a 1-bit sequence of 1's and 0's. This is to be converted to 1's and -1's and convolved with the impulse response 'h'. My goal is for the convolution portion of the filter to be completely asynchronous and parallel. That is, with each clock cycle 16 bits of the input sequence are convolved with the impulse response, providing a single 12-bit output. Each element of the impulse response 'h' is a 10-bit signed integer. The input sequence 'x' is a known sequence and I am sure the output sequence 'y' will always fit into 12 bits.

Here is the code:

    -- 16-tap FIR Low-Pass Filter Convolution Function
    --
    --
    -- When convolved with the code it will produce a maximum value that
    -- will fit into 12-bits

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.std_logic_arith.all;
    use ieee.numeric_std.all;

    entity fir_lpf_conv is
        port (
            x: in std_logic_vector(15 downto 0);
            y: out std_logic_vector(11 downto 0)
        );
    end fir_lpf_conv;

    architecture fir_lpf_conv_arch of fir_lpf_conv is
        type coef_type is array(0 to 15) of integer range -511 to 511;
        constant h: coef_type :=
            (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);
        signal mult: coef_type;
        signal sum: integer range -2047 to 2047;
    begin
        blabla: for i in x'range generate
            mult(i) <= h(i) when x(i)='1' else -h(i);
        end generate;
        sum <= mult(0) + mult(1) + mult(2) + mult(3) + mult(4) + mult(5) +
               mult(6) + mult(7) + mult(8) + mult(9) + mult(10) + mult(11) +
               mult(12) + mult(13) + mult(14) + mult(15);
        y <= std_logic_vector(to_signed(sum,12));
    end fir_lpf_conv_arch;

I haven't simulated it yet, but I have a sneaky feeling it will not do what I expect. Even if it does do what I want, I'd like to understand why.

The code should multiply each h element by the corresponding x element (with zeros converted to -1s) in parallel, AND THEN sum the result into 'sum' AND THEN put the 'sum' result into 'y'. My use of AND THEN in that statement makes me think I need sequential code; that is, the multiply should be done in parallel, and the sum should be done in parallel, but the sum should use the results of the multiply. However, when I look at sequential code it is always clock or event driven and I don't think that's what I need. All this should be done in less than 1/2 clock cycle.

I could see the compiler synthesizing the above code in two different ways:

1. Multiply in parallel AND THEN add the results in parallel. (this would be good)
2. Multiply in parallel and add in parallel. The parallel sum will use the previous values stored in 'mult', and possibly some updated values in 'mult' depending on the exact timing. (this would be bad)

So my question is: if the code is correct, then what is the rule for synthesis? How does the compiler know that I want 'AND THEN' behavior? If the code is incorrect, what do I write to get 'AND THEN' behavior that is not clock driven?

I also have a couple of less important questions:

Is there a better way to write my sum using a for loop? I couldn't get it to compile.

I really don't need the intermediate signal 'sum'. I'd like to just sum into 'y' but I get a type error because the synthesizer doesn't know if the stuff on the right is signed or unsigned.

Thank You!
Brian
heilig.brian@gmail.com
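On the two side questions (a loop for the sum, and summing into 'y' without an integer intermediate), one possible shape is sketched below. It is not code from the thread: the architecture name `loop_sum_arch` and the variable `acc` are made up for illustration, and only `ieee.numeric_std` is used for arithmetic. The loop is unrolled by synthesis, so it describes the same long adder chain as the hand-written sum unless the tool rebalances it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_lpf_conv is
    port (
        x : in  std_logic_vector(15 downto 0);
        y : out std_logic_vector(11 downto 0)
    );
end fir_lpf_conv;

-- Hypothetical architecture name, chosen for this sketch only.
architecture loop_sum_arch of fir_lpf_conv is
    type coef_type is array (0 to 15) of integer range -511 to 511;
    constant h : coef_type :=
        (4, -2, -28, -53, -17, 128, 345, 511, 511, 345, 128, -17, -53, -28, -2, 4);
begin
    -- Combinational process: no clock, re-evaluated whenever x changes.
    conv : process (x)
        -- A signed accumulator makes the signedness explicit, so the
        -- result can be converted to std_logic_vector directly.
        variable acc : signed(11 downto 0);
    begin
        acc := (others => '0');
        for i in h'range loop
            if x(i) = '1' then
                acc := acc + to_signed(h(i), acc'length);   -- tap times +1
            else
                acc := acc - to_signed(h(i), acc'length);   -- tap times -1
            end if;
        end loop;
        y <= std_logic_vector(acc);
    end process conv;
end loop_sum_arch;
```

One difference from the integer version is worth keeping in mind: a 12-bit `signed` wraps silently on overflow, whereas the ranged integer raises a simulation error. The coefficient magnitudes add up to 2176, which is above 2047, so the 12-bit output relies on 'x' being a known sequence as stated in the post.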

#2
Posts: n/a

On 24 Mar, 15:46, heilig.br...@gmail.com wrote:
> [...]

What you have written contains 0 multipliers, 15 x 2-1 muxes, no registers and a very long adder chain. It is very, very unlikely that this will work. You will HAVE to break up the adder chain and pipeline it - 16 adds just isn't going to work without pipelining. You normally only want to add 2-3 numbers in a single clock cycle.

You also say "x" is a 1-bit sequence? Is it coming in serially, or is it really coming in as a bus like you've written? As it stands, it expects all the X bits to be there at the same time.

I suggest you read up on digital design. This code is in no way synthesisable.

Here is a hint (I'm going to assume that X is a synchronous input and not asynchronous like you said). It should give you a latency of 4 clock cycles:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity fir_lpf_conv is
        port (
            clk : in std_logic;

            x: in std_logic_vector(15 downto 0);
            y: out std_logic_vector(11 downto 0)
        );
    end fir_lpf_conv;

    architecture fir_lpf_conv_arch of fir_lpf_conv is
        type coef_type is array(0 to 15) of integer range -511 to 511;
        constant h: coef_type :=
            (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);
        signal mult: coef_type;

        subtype sum_range_t is integer range -2047 to 2047;

        signal sum: sum_range_t;

        signal sum01 : sum_range_t;
        signal sum23 : sum_range_t;
        -- .....etc
    begin
        blabla: for i in x'range generate
            mult(i) <= h(i) when x(i)='1' else -h(i);
        end generate;

        sum_proc : process(clk)
            variable sum_total : integer;
        begin
            if rising_edge(clk) then

                sum01 <= mult(0) + mult(1);
                sum23 <= mult(2) + mult(3);
                -- ........etc

                sum0123 <= sum01 + sum23;
                -- .......etc

                sum_total := sum0to7 + sum8to15;
                y <= std_logic_vector( to_signed( sum_total, 12) );

            end if;
        end process;

    end fir_lpf_conv_arch;

Also - delete std_logic_arith from the code. It clashes with numeric_std. Always use the numeric_std package (which you have).

Tricky
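For readers who want to see the elided "etc" stages spelled out, here is one way the same four-stage pipeline could be written with arrays and loops. This is a sketch under assumptions, not the code Tricky compiled: the names `pipelined_arch`, `s1`, `s2`, `s3` and the stage array types are invented here, and the signals are given initial values only to keep simulation start-up clean.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_lpf_conv is
    port (
        clk : in  std_logic;
        x   : in  std_logic_vector(15 downto 0);
        y   : out std_logic_vector(11 downto 0)
    );
end fir_lpf_conv;

-- Hypothetical architecture name, chosen for this sketch only.
architecture pipelined_arch of fir_lpf_conv is
    type coef_type is array (0 to 15) of integer range -511 to 511;
    constant h : coef_type :=
        (4, -2, -28, -53, -17, 128, 345, 511, 511, 345, 128, -17, -53, -28, -2, 4);
    signal mult : coef_type;

    -- One register array per pipeline stage (names are assumptions).
    type stage1_t is array (0 to 7) of integer range -1023 to 1023;
    type stage2_t is array (0 to 3) of integer range -2047 to 2047;
    type stage3_t is array (0 to 1) of integer range -4095 to 4095;
    signal s1 : stage1_t := (others => 0);
    signal s2 : stage2_t := (others => 0);
    signal s3 : stage3_t := (others => 0);
begin
    -- Combinational selection of +h(i) or -h(i), as in the thread.
    sel : for i in x'range generate
        mult(i) <= h(i) when x(i) = '1' else -h(i);
    end generate;

    sum_proc : process (clk)
    begin
        if rising_edge(clk) then
            -- Stage 1: eight pairwise sums of the selected taps.
            for i in s1'range loop
                s1(i) <= mult(2*i) + mult(2*i + 1);
            end loop;
            -- Stage 2: four sums of stage-1 registers.
            for i in s2'range loop
                s2(i) <= s1(2*i) + s1(2*i + 1);
            end loop;
            -- Stage 3: two sums of stage-2 registers.
            for i in s3'range loop
                s3(i) <= s2(2*i) + s2(2*i + 1);
            end loop;
            -- Stage 4: final sum, registered on the output.
            y <= std_logic_vector(to_signed(s3(0) + s3(1), 12));
        end if;
    end process sum_proc;
end pipelined_arch;
```

Because every assignment inside the clocked process is a signal assignment, each stage reads the values registered on the previous edge, which gives the four cycles of latency mentioned above while still accepting a new 'x' every clock.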

#3
Member
Join Date: Jan 2009
Posts: 52

VHDL (simulation) works with "delta" delays:
- each mult(i) is set.
- after a "delta" delay, all mult(i) are summed.
- after a "delta" delay, y is assigned.

The synthesis tool will take care of keeping these "delta" semantics intact. I think this will do what you want.

Though you could also write this as a process:
Code:

The best way to calculate sum would be something like:
Code:

This isn't too hard to rewrite into a few loops to make this dynamic. You might even write it as a nested loop, using a double-indexed array sum(i,j) (or sum(i)(j), depending on how you declare it).

I wrote it like that to make the pattern clear for looping. It isn't actually necessary to write it like that to get parallelism. You can simply do:
Code:

joris
Last edited by joris : 03-24-2009 at 04:28 PM.
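The delta-delay behaviour described above can be watched in a simulator. The testbench below is a minimal sketch rather than anything posted in the thread: `delta_demo`, `a`, `b`, `m0`, `m1` and `s` are made-up names, with `m0`/`m1` standing in for two of the mult(i) signals and `s` for the sum. It uses `wait for 0 ns` to step one delta cycle at a time.

```vhdl
entity delta_demo is
end delta_demo;

architecture sim of delta_demo is
    signal a, b   : integer := 0;
    signal m0, m1 : integer := 0;  -- stand-ins for two mult(i) signals
    signal s      : integer := 0;  -- stand-in for sum
begin
    -- Concurrent statements: textual order is irrelevant, so the
    -- consumer is deliberately written before its producers.
    s  <= m0 + m1;
    m0 <= 2 * a;
    m1 <= 3 * b;

    stim : process
    begin
        a <= 5;
        b <= 7;
        wait for 10 ns;
        assert s = 31
            report "s should have settled to 2*5 + 3*7" severity failure;

        a <= 1;
        wait for 0 ns;  -- delta 1: a is updated, m0 and s are not yet
        assert a = 1 and m0 = 10 and s = 31
            report "unexpected values after delta 1" severity failure;
        wait for 0 ns;  -- delta 2: m0 is updated, s is not yet
        assert m0 = 2 and s = 31
            report "unexpected values after delta 2" severity failure;
        wait for 0 ns;  -- delta 3: s is updated
        assert s = 23
            report "unexpected value after delta 3" severity failure;
        assert now = 10 ns
            report "no simulated time should have passed" severity failure;

        report "delta demo passed";
        wait;
    end process stim;
end sim;
```

The stale intermediate values exist only between delta cycles, at a single point of simulated time. In hardware the counterpart is the propagation delay through the combinational logic: the output is valid once that delay has elapsed, which is why no clock is needed to get the "AND THEN" ordering.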

#4
Posts: n/a

> What you have written contains 0 multipliers,

The following line...

    mult(i) <= h(i) when x(i)='1' else -h(i);

...is a 1-bit multiplier where a 1 means 'multiply by 1' and a 0 means 'multiply by -1'. When x(i)='1' then mult(i) <= h(i) * 1, else mult(i) <= h(i) * -1.

> 15 x 2-1 muxes, no
> registers and a very long adder chain. It is very very unlikely that
> this will work. you will HAVE to break up the adder chain and pipeline
> it - 16 adds just isnt going to work without pipelining. You normally
> only want to add 2-3 numbers in a single clock cycle.

Because of the propagation delay? The Quartus II software I'm using has a parallel_add megafunction (if you're not familiar with Quartus II, a megafunction is like a parameterized logical element) that can add up to 128 32-bit integers in parallel! Well, at least that's what it says.

> You also say "x" is a 1 bit sequence? is it coming in serially? or is
> it really coming in as a bus like you've written. As it stands, it
> expects all the X bits to be there at the same time.

It is a 1-bit sequence that is initially serial, but through a series of external D flip-flops I am converting it to 16 bits in parallel. However, each of these bits represents one element of the x sequence. It is not converted to a 16-bit word.

> I suggest you read up on digital design. this code is no way
> synthesisable.

Ouch. Well, you caught me. I bought "Circuit Design with VHDL" a few days ago and it is on its way. I thought, "How hard can this be?"

> Here is a hint (Im going to assume that X is a
> synchronous input and not asynchronous like you said):

X is a synchronous input. The problem is I could draw a working logic diagram that would perform the 16 1-bit multiplies in parallel and then sum all the results in parallel. In fact I started off this way but then figured it's a good time to learn VHDL. So if I know that it can be represented as a bunch of logic gates, then the problem is to write VHDL code that will synthesize those gates for me.

> It should give you a latency of 4 clock cycles:
>
> [...]
>
> Also - delete std_logic_arith from the code. It clashes with
> numeric_std. always use the numeric_std package (which you have).

Ok. Thanks for the help.

heilig.brian@gmail.com

#5
Posts: n/a

On 24 Mar, 16:53, heilig.br...@gmail.com wrote:
> > What you have written contains 0 multipliers,
>
> The following line...
>
> mult(i) <= h(i) when x(i)='1' else -h(i);
>
> ...is a 1 bit multiplier where a 1 means 'multiply by 1' and a 0 means
> 'multiply by -1'. When x(i)='1' then mult(i) <= h(i) * 1, else mult(i)
> <= h(i) * -1.

That's probably the way you intend it, but in reality you've just written a mux with 2 constant inputs that are selected via the appropriate bit on X. Looking at the RTL viewer, the constants you have chosen make it even less complicated, making each input just a function of X. You could completely change the constants, and you will never get a hardware multiply; you will always get a mux.

> > 15 x 2-1 muxes, no
> > registers and a very long adder chain. It is very very unlikely that
> > this will work. you will HAVE to break up the adder chain and pipeline
> > it - 16 adds just isnt going to work without pipelining. You normally
> > only want to add 2-3 numbers in a single clock cycle.
>
> Because of the propagation delay? The Quartus II software I'm using
> has a parallel_add megafunction (if you're not familiar with Quartus
> II a megafunction is like a parameterized logical element) that can
> add up to 128 32-bit integers in parallel! Well, at least that's what
> it says.

What's wrong with a propagation delay? FPGAs are great for massively parallel processing, but there is normally a latency involved. Pipelining still means you can get 1 result per clock cycle, but you have to wait n clock cycles of latency before the first result arrives. n is ALWAYS fixed, so you know when the output is valid, and from then on every clock cycle yields a valid result. I fear if latency is your biggest worry, you're coming at FPGA design from the wrong angle.

Yes, Altera do provide a parallel_add megafunction, but it looks horrible to use (the data input is based on their own 2D array of std_logic for a start, not the best way to encourage use!). But I could do a parallel add without their megafunction, and add 256 x 64-bit numbers in parallel if I want, just using the "+" sign. That doesn't mean it'll make good hardware/firmware though. You'll also find that there is a "pipeline" parameter on the parallel_add megafunction.

OK, I've compiled some stuff, and here are the results. As a quick reference, I ran everything through TimeQuest on a Stratix II (putting registers in at the mux stage and the output, so TimeQuest could actually work):

- your initial massive add: FMax = 94 MHz
- the massive add with a parallel_add component, 0 latency: FMax = 200 MHz
- parallel adder, pipeline length of 4: FMax = 320 MHz
- pipelining it the way I did in my previous post: FMax = 360 MHz

Remember this has been done on a large device with no additional logic, so the FMax reports may be artificially high. But I know which method I'd rather use!

To get hold of the parallel add, you have to actually instantiate it. Converting the data input into the right format is a bit of an arse:

        signal data : altera_mf_logic_2D(15 downto 0, 9 downto 0);
    begin
        i_gen : for i in data'range(1) generate
            j_gen : for j in data'range(2) generate
                data(i, j) <= std_logic_vector( to_signed(mult(i), 10) )(j);
            end generate j_gen;
        end generate i_gen;

        par_add : parallel_add
            generic map (
                width => 10,
                size => 16,
                widthr => 12,
                pipeline => 0,
                representation => "SIGNED"
            )
            port map (
                data => data,
                result => result
            );

> > You also say "x" is a 1 bit sequence? is it coming in serially? or is
> > it really coming in as a bus like you've written. As it stands, it
> > expects all the X bits to be there at the same time.
>
> It is a 1-bit sequence that is initially serial but through a series
> of external d flip flops I am converting it to 16 bits in parallel.
> However each of these bits represents one element of the x sequence.
> It is not converted to a 16 bit word.

Well, you have x coming in as a 16-bit bus. And you have 16 "multiplies" in parallel.

Another question - how fast is the serial bus? 16x the main clock speed? If it isn't, how do you know when any of the X bits are valid?

> > Here is a hint (Im going to assume that X is a
> > synchronous input and not asynchronous like you said):
>
> X is a synchronous input. The problem is I could draw a working logic
> diagram that would perform the 16 1-bit multiplies in parallel and
> then sum all the results in parallel. In fact I started off this way
> but then figured it's a good time to learn VHDL. So if I know that it
> can be represented as a bunch of logic gates then the problem is to
> write VHDL code that will synthesize those gates for me.

But I'd recommend you do it that way, especially as a VHDL beginner. VHDL is a description language, not a programming language. It is meant for describing digital hardware. You can write whatever you want in VHDL (to a point), and it may simulate how you intend, giving the results you wanted in the way you specified, but that doesn't mean it's any good as a hardware description.

Tricky
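The mux-versus-multiplier disagreement is about hardware structure, not arithmetic: selecting between h(i) and -h(i) produces the same number as multiplying h(i) by +1 or -1. A small self-checking simulation can confirm the arithmetic side. This is an illustrative sketch only; the entity name `mux_vs_mult` and the variables `bit_val`, `factor`, `sel_result` and `mult_result` are invented for it.

```vhdl
entity mux_vs_mult is
end mux_vs_mult;

architecture sim of mux_vs_mult is
    type coef_type is array (0 to 15) of integer range -511 to 511;
    constant h : coef_type :=
        (4, -2, -28, -53, -17, 128, 345, 511, 511, 345, 128, -17, -53, -28, -2, 4);
begin
    check : process
        variable sel_result  : integer;
        variable mult_result : integer;
        variable factor      : integer;
    begin
        for i in h'range loop
            for bit_val in 0 to 1 loop
                -- Selection form, as written in the thread.
                if bit_val = 1 then
                    sel_result := h(i);
                else
                    sel_result := -h(i);
                end if;

                -- Arithmetic form: map the bit 0/1 to -1/+1 and multiply.
                factor      := 2 * bit_val - 1;
                mult_result := h(i) * factor;

                assert sel_result = mult_result
                    report "selection and multiplication disagree"
                    severity failure;
            end loop;
        end loop;

        report "selection and multiply-by-+/-1 agree for all 16 taps";
        wait;
    end process check;
end sim;
```

Since the coefficients are constants, a synthesis tool has no reason to build a multiplier for either form; the selection form simply states the intended structure directly.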

#6
Posts: n/a

> Thats probably the way you intend it, but in reality you've just
> written a mux with 2 constant inputs that are selected via the
> appropriate bit on X. on looking at the RTL viewer, that constants you
> have chosen make it even less complicated, making each input input
> just a function of X. You could completly change the constants, and
> you will never get a hardware multiply, you will always get a mux.

I think this is good. It is equivalent to a multiply by 1 or -1, right?

> Whats wrong with a propgation delay? FPGAs are great for massive
> parrallel processing, but there is normally a latency involved.
> Pipelining still means you can get 1 result/clock cycle, but you have
> to wait n clock cycles of latency before the first result arrives. n
> is ALWAYS fixed, so you know when the output is valid, and from then
> on every clock cycle yields a valid result. I fear if latency is your
> bigest worry, you're coming at FPGA design from the wrong angle.

You are right. Latency is hardly a concern. I guess my line of thinking was that I could imagine the logic diagram, now if I could just write the VHDL to make that logic diagram a reality.

But I wasn't asking about throughput delay, which I think is (or I'll define as) the time through the entire device. Rather, I was asking about the delay between when the x elements are available on the rising edge of the clock and when the next y outputs are available to be sampled. If this time is greater than half a clock cycle then I will get garbage out. I think you summarized this in your discussion below determining FMax.

> ok, Ive compiled some stuff, and heres the results:

Again, thank you for your help.

> As a quick reference, I ran your initial massive add through
> timequest, on a stratix 2 (putting registers in at the mux stage and
> the output, so timequest could actually work) - FMax = 94Mhz
> Doing the massive add with a parallel add component, 0 latency FMax =
> 200MHz
> parallel adder, pipeline length of 4, FMax = 320Mhz
> Pipelining it the way I did in previous post : FMax = 360MHz.

I see. My sample clock is 20 MHz so that's ok. But I see your point and will add pipelining.

> Well, you have x coming in as a 16 bit bus. And you have 16
> "multiplies" in parallel.
> Another question - how fast is the serial bus? 16x the main clock
> speed? if it isnt, how do you know when any of the X bit are valid?

The serial bus is 20 MHz, as is the sample clock. Every time a new x bit is shifted in I process the entire 16-bit sequence again. So bits 0-14 in the last interval become bits 1-15 in this one.

> But Id recommend you do it that way, especially as a VHDL beginner.
> VHDL is a description language, not a programming language. It is
> meant for describing digital hardware. You can write whatever you want
> in VHDL (to a point), and it may simulate how you intend giving the
> results you wanted in the way you specified, but that doesnt mean its
> any good as a hardware description.

You caught me again. I am a programmer with some hardware experience. This small exercise is only the beginning; I'll soon need to know VHDL well. So I guess I'll start reading!

Thanks,
Brian
heilig.brian@gmail.com
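The half-cycle concern above can be put in rough numbers. This is only a back-of-envelope check, and it assumes that the 94 MHz figure quoted for the unpipelined adder (measured register to register on a Stratix II) carries over to the actual target device, which it may not:

```latex
\begin{align*}
T_{\text{clk}} &= \frac{1}{20\ \text{MHz}} = 50\ \text{ns},
  & \tfrac{1}{2}\,T_{\text{clk}} &= 25\ \text{ns}, \\
T_{\min} &= \frac{1}{F_{\max}} = \frac{1}{94\ \text{MHz}} \approx 10.6\ \text{ns}
  & &< 25\ \text{ns}.
\end{align*}
```

Under that assumption even the single long adder chain settles well inside half of a 20 MHz period, so at this clock rate pipelining buys timing margin rather than correctness.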

#7
Posts: n/a

This is the code I finally settled on:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity code_filter is
        port (
            x0: in std_logic;
            y: out std_logic_vector(11 downto 0);
            clk: in std_logic
        );
    end code_filter;

    architecture code_filter_arch of code_filter is
        type coef_type is array(0 to 15) of integer range -511 to 511;
        constant h: coef_type :=
            (4,-2,-28,-53,-17,128,345,511,511,345,128,-17,-53,-28,-2,4);
        signal mult: coef_type;
        signal x: std_logic_vector(15 downto 0);
        signal sum0_1, sum2_3, sum4_5, sum6_7: integer range -1023 to 1023;
        signal sum8_9, sum10_11, sum12_13, sum14_15: integer range -1023 to 1023;
        signal sum0_3, sum4_7, sum8_11, sum12_15: integer range -2047 to 2047;
        signal sum0_7, sum8_15: integer range -4095 to 4095;
        signal sum_total: integer range -8191 to 8191;
    begin
        -- 16-bit shift register: the newest sample enters at x(0)
        process (clk)
        begin
            if rising_edge(clk) then
                x(x'high downto 1) <= x((x'high-1) downto 0);
                x(0) <= x0;
            end if;
        end process;

        one_bit_multiply: for i in x'range generate
            mult(i) <= h(i) when x(i)='1' else -h(i);
        end generate;

        sum0_1 <= mult(0) + mult(1);
        sum2_3 <= mult(2) + mult(3);
        sum4_5 <= mult(4) + mult(5);
        sum6_7 <= mult(6) + mult(7);
        sum8_9 <= mult(8) + mult(9);
        sum10_11 <= mult(10) + mult(11);
        sum12_13 <= mult(12) + mult(13);
        sum14_15 <= mult(14) + mult(15);

        sum0_3 <= sum0_1 + sum2_3;
        sum4_7 <= sum4_5 + sum6_7;
        sum8_11 <= sum8_9 + sum10_11;
        sum12_15 <= sum12_13 + sum14_15;

        sum0_7 <= sum0_3 + sum4_7;
        sum8_15 <= sum8_11 + sum12_15;

        sum_total <= sum0_7 + sum8_15;

        y <= std_logic_vector(to_signed(sum_total,12));
    end code_filter_arch;

The entire filter is now contained in this code, including the shift registers (which used to be in another file). It has been simulated and it works great.

The major difference between this version and what I had before is the processing of the add. The lesson I learned here is that VHDL produces a result that closely matches the code, unlike C, which will perform aggressive optimizations. My previous version resulted in 15 adders in one long chain (exactly as the code was written) whereas the current version resulted in 15 adders in a hierarchical structure (again exactly as it is written). This resulted in a reduction of the propagation delay by a factor of log2(16).

Anyway, it's good to know my initial design actually did work, even though it wasn't optimal. After your first scathing response I felt like I should turn in my degree and restart a career in some liberal arts field. But your reply is greatly appreciated, as I understand what is going on much better.

Brian
heilig.brian@gmail.com
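The depth argument can be made explicit by counting adder levels between mult and sum_total. The count below treats every adder as having the same delay, which is an idealisation: carry chains and routing make real figures differ, and a synthesis tool may rebalance an expression on its own.

```latex
\begin{align*}
\text{chain depth} &= N - 1 = 15, \\
\text{tree depth}  &= \lceil \log_2 N \rceil = 4, \qquad N = 16, \\
\frac{\text{chain depth}}{\text{tree depth}} &= \frac{15}{4} = 3.75 .
\end{align*}
```

In this idealised count the tree is 4 levels deep instead of 15, a reduction of a little under 4x, which is close to the log2(16) figure given in the post.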

#8
Member
Join Date: Dec 2008
Posts: 85

Brian, thanks for closing out with your working solution! It is nice to see the design process from beginning to end.

John
JohnDuq