Home  Forums  Reviews  Guides  Newsgroups  Register  Search 
Thread Tools 
Luis Cupido 


 
Anton Gunman
Guest
Posts: n/a

Hello,
That's an interesting one! I think that you first need to decide, or at least estimate, how much precision you will need.  How many bits can you truncate before doing the SQRT?  How many bits will the SQRT output? (Typically it's a fixed point notation)  How much resources can you afford to use?  Is your design fully pipelined (perform a new calculation at every clock cycle), or can you afford to have some sort of "load"/"done" signals to perform your calculations? Afterwards, here are some ideas you can look into: 1 Sometimes you don't really need to do the SQRT, you can just use the result prior to the SQRT and keep some of the MSBs (it is not as accurate). 2 Do you really need it to be done in a single clock cycle? Sometimes a pipelined design could be used (takes several clock cycles to output the first result, but can accept and output a new result at every other clock cycle). 3 You can try using your tool's core generator (e.g. Xilinx does have some SQRT cores that use the CORDIC algorithm, you can generate them using the wizard).  You might be able to choose between a slow core (area efficient and several clock cycles per operation) and a fast core (fully pipelined and consumes more resources). 4 You can also take a look at the following links (I have not tested them):  https://groups.google.com/forum/?fromgroups=#!searchin/comp.lang.vhdl/square$20root/comp.lang.vhdl/BQ9MypRSkhM/Qf7uLkIrmJQJ (see "function SquareRoot")  http://vhdlguru.blogspot.ca/2010/03/...uareroot.html  But these are COMBINATORIAL circuits (no clocks), they are not efficient in terms of resources and might affect timing, so be careful here, especially with large bit widths! 5 Depending on the number of bits you are using, you can think of any function (in your case the SQRT) as a large LUT and implement it using a single ROM with precalculated values. The info below roughly applies to internal block RAM/ROMs. Keep in mind that the FPGA RAM is limited (you can quickly check your FPGA's datasheet to get an estimate of how many of these block RAMs you can use). If you are using Xilinx devices, typically a single BRAM (small internal FPGA block RAM) is either 18kbits 18*1024 bits (or 36 kbits). These BRAM can be configured with various address/data configurations (for e.g. 10 address bits and 18 data bits, etc..., not all configurations might be valid) You would typically code it as an array and let the tools worry about it (but you MUST check the synthesis result to see if the BRAM has been correctly inferred or if you are using way too many BRAM) In this case you might need to change your approach. Note: The ROM/RAM has to be synchronous, otherwise the synthesizer will try using registers instead of BRAM and you will see an explosion in resource utilization. A rather "rough" way of approximating the number of BRAM that will be consumed is the following: Num. BRAM = ceil(2^addr_width * data_width/1024/1 (use 36 instead of 18 depending on your device, but you should get the point) The actual number will depend on the configuration possibilities, your coding style, and how "smart" the tools are. In your case IF you can round your result (prior to the SQRT) to something like 10 bits (or just use a generic for that), this might work for you. e.g.: type type_my_rom is array(integer range 0 to 2**101) of unsigned(181 downto 0); signal my_sqrt_rom : type_my_rom := MY_INIT_FUNCTION();  You can write an init function of some sort, or read it from file etc... ...  clocked process  ... pv <= i*i + q*q;  Calculate pv_10bit <= pv(pv'high downto pv'high10);  Truncate mv <= my_sqrt_rom(pv_10bit);  SQRT using a ROM, pv_10bit is used as the address Note: this is really going to depend on your application's needs. I would suggest you try it out in software (or nonsynthesizable code) and see how many bits you can truncate without losing too much precision. 10 bits might be too much of a truncation and if you increase your addr_width to 16 bits you will quickly run out of resources, so this might not be the best approach.  I would probably go for the core generator approach, you can check the documentation online to see if it fits your requirements, or just try generating it. (It will still probably require a few clock cycles depending on the precision).  If you can afford the LUT resources, maybe look into the combinatorial implementations #2 (you can simply try synthesizing it and checking your resource usage and timing estimate). I hope this helps a bit, Anton 




Anton Gunman 


 
Luis Cupido
Guest
Posts: n/a

> I hope this helps a bit, > Anton That helps a lot. Thanks. I can narrow down the specs and add some info. The the design can grow to the maximum I can fit on the device or up to the point of start limiting the speed. The system input has a pair of 16bit ADC. I decide on an global design CONSTANT how many bits I use, from (whatever) to 16bit At that point of the calculation I have a pair of max 17bit each so here I have, from (whatever+1) to 17bit  i^2+q^2 would make 35bit maximum so here I have, from (whatever*2+1) to 35bit Ideally I should get back to the same dynamic range and the sqrt would get me to (whatever+1) to 17bit now I'm truncating the 35bit to 16bit thus cutting dynamic range in half which is very uninteresting since I do have a 16bit transfer bus at the end and since the ENOB of the ADC's are below 15bit at that speed (120MS/s) I would aim to 15bit at ADC and 16bit output therefore a sqrt(33bit).  It is a big desgin, and YES I have it all pipelined. there are several blocks Each block even takes care of pipelining the address of the data so I don't care how many clock cycles it takes, I just have to make a pipeline for address to make it all consistent at the output. (silly me that I asked for a 1 cycle thing, that was because I had the i^2+q^2 in embedded multipliers so no need to pipeline that until now...). The ROM... don't think so... it is a sqrt( ) with too many bits I'm afraid. Tks again for your comments. If you like to comment now on my refined description I would appreciate. Thanks, Luis C. 




Luis Cupido 
Gabor
Guest
Posts: n/a

Luis Cupido wrote:
> >> I hope this helps a bit, >> Anton > > That helps a lot. Thanks. > > I can narrow down the specs and add some info. > > The the design can grow to the maximum > I can fit on the device or up to the point of start limiting > the speed. > > > The system input has a pair of 16bit ADC. > I decide on an global design CONSTANT how > many bits I use, from (whatever) to 16bit > > At that point of the calculation I have a pair of max 17bit each > so here I have, from (whatever+1) to 17bit > >  i^2+q^2 would make 35bit maximum > so here I have, from (whatever*2+1) to 35bit > > Ideally I should get back to the same dynamic range and > the sqrt would get me to (whatever+1) to 17bit > > now I'm truncating the 35bit to 16bit thus cutting dynamic range in > half which is very uninteresting > > since I do have a 16bit transfer bus at the end and since the ENOB of the > ADC's are below 15bit at that speed (120MS/s) I would aim > to 15bit at ADC and 16bit output therefore a sqrt(33bit). > >  > > It is a big desgin, and YES I have it all pipelined. there are several > blocks > Each block even takes care of pipelining the address of the data so I don't > care how many clock cycles it takes, I just have to make a pipeline for > address to make it all consistent at the output. > > (silly me that I asked for a 1 cycle thing, that was because I had the > i^2+q^2 > in embedded multipliers so no need to pipeline that until now...). > > The ROM... don't think so... it is a sqrt( ) with too many bits I'm afraid. > > Tks again for your comments. > If you like to comment now on my refined description I would appreciate. > Thanks, > > Luis C. > > You might think about ROM + linear interpolation. You can model this in an Excel spreadsheet to check the accuracy at different ROM bit depths. You can also do a normalization before using the ROM, so you only need values 0.25 <= X < 1 in the table.  Gabor 




Gabor 
Sven Lundgren
Guest
Posts: n/a

If you're looking for a relatively small, fast and simple solution and can deal with +/ 4% accuracy, check out the alpha max + beta min algorithm (http://en.wikipedia.org/wiki/Alpha_m..._min_algorithm).
Otherwise, try a variation of Anton's suggestion #3, where a CORDIC core/algorithm is used to rotate your vector to 0 degrees, then MV = I. 




Sven Lundgren 
Anton Gunman
Guest
Posts: n/a

These are some interesting suggestions by both Gabor and Sven!
If you are still considering using a fully pipelined SQRT implementation, here goes (excuse me for the long post). I personally do not like using cores (although they are often more optimized than most of the stuff that you will code). If you try synthesizing the fully combinatorial function your timing will probably be terrible (although it will be able to do it in a single clock cycle). Here below is some code that used the combinatorial function from the previous links, you can try synthesizing it to check your resource/timing. I would not recommend using that for large bit widths (unless if your clock frequency is quite low). I have added some I/O registers. (This is a synthesis resource/timing estimate for INPUT_WIDTH := 32) Number of Slice Registers : 78 Number of Slice LUTs : 731 Number of fully used LUTFF pairs: 9 Minimum period: 42.849ns (Maximum Frequency: 23.338MHz)   Combinatorial sqrt module  Note: I would not recommend using that for synthesis using large bit widths.  library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all; entity sqrt_comb is generic ( INPUT_WIDTH : integer := 32  Has to be even ); port ( clk : in std_logic; sync_reset : in std_logic; data_in : in std_logic_vector(INPUT_WIDTH1 downto 0); data_out : out std_logic_vector(INPUT_WIDTH/21 downto 0) ); end sqrt_comb; architecture rtl of sqrt_comb is   Function declaration  function SquareRoot (Arg : unsigned) return unsigned is constant AMSB : integer := Arg'length1; constant RMSB : integer := (Arg'length/2)  1; variable Root : unsigned(RMSB downto 0); variable Test : unsigned(RMSB+1 downto 0); variable Rest : unsigned(AMSB+1 downto 0); begin Root := (others => '0'); Rest := '0' & Arg; for i in RMSB downto 0 loop Test := Root(RMSB1 downto 0) & "01"; if Test(RMSBi+1 downto 0) > Rest(AMSB+1 downto 2*i) then Root := Root(RMSB1 downto 0) & '0'; else Root := Root(RMSB1 downto 0) & '1'; Rest(AMSB downto i*2) := Rest(AMSB downto i*2)  Test(RMSBi+1 downto 0); end if; end loop; return Root; end;   Signal declaration  signal data_in_reg : unsigned(INPUT_WIDTH1 downto 0); begin   Calculate sqrt  Note: Using I/O registers.  calc_sqrt : process (clk) begin if rising_edge(clk) then if sync_reset = '1' then data_in_reg <= (others => '0'); data_out <= (others => '0'); else data_in_reg <= unsigned(data_in); data_out <= std_logic_vector(SquareRoot(data_in_reg)); end if; end if; end process calc_sqrt; end rtl;  Now if you take the previous function and add some pipeline stages, you can get a fully pipelined design, the timing should be MUCH better, but you will get a latency of INPUT_WIDTH/2. I am not sure how it compares to a fully pipelined core and have not tried optimizing it. I personally might require using a SQRT function later on, so I drafted the following implementation: (You will get some warnings about constants being optimized, since not all the bits are used/set for every stage, but it should be fine). (This is a synthesis resource/timing estimate for INPUT_WIDTH := 32) Number of Slice Registers : 434 Number of Slice LUTs : 678 Number of fully used LUTFF pairs: 294 Minimum period: 3.862ns (Maximum Frequency: 258.903MHz)   Fully pipelined sqrt module   * The data INPUT_WIDTH has to be even (The output width is INPUT_WIDTH/2).  * The latency is INPUT_WIDTH/2.  Note: I have not thoroughly tested this module.  library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all; entity sqrt_pipelined is generic ( INPUT_WIDTH : integer := 32  Has to be even ); port ( clk : in std_logic; sync_reset : in std_logic; data_in : in std_logic_vector(INPUT_WIDTH1 downto 0); data_out : out std_logic_vector(INPUT_WIDTH/21 downto 0) ); end sqrt_pipelined; architecture rtl of sqrt_pipelined is   Signal declaration  constant AMSB : integer := INPUT_WIDTH  1; constant RMSB : integer := (INPUT_WIDTH/2)  1;  These are arrays for pipelining. RMSB + 1 stages (which is the latency) type type_root_array is array (0 to RMSB) of unsigned(RMSB downto 0); type type_rest_array is array (0 to RMSB) of unsigned(AMSB + 1 downto 0); signal root : type_root_array; signal rest : type_rest_array; begin   Calculate sqrt  calc_sqrt : process (clk) variable test : unsigned(RMSB+1 downto 0);  Temporary variable begin if rising_edge(clk) then if sync_reset = '1' then root <= (others => (others => '0')); rest <= (others => (others => '0')); else for i in RMSB downto 0 loop if i = RMSB then  First stage: shift in new data root(i) <= (others => '0'); rest(i) <= '0' & unsigned(data_in); else  Pipeline shift registers root(i) <= root(i+1); rest(i) <= rest(i+1); test := root(i+1)(RMSB1 downto 0) & "01"; if test(RMSBi+1 downto 0) > rest(i+1)(AMSB+1 downto 2*i) then root(i) <= root(i+1)(RMSB1 downto 0) & '0'; else root(i) <= root(i+1)(RMSB1 downto 0) & '1'; rest(i)(AMSB downto i*2) <= rest(i+1)(AMSB downto i*2)  test(RMSBi+1 downto 0); end if; end if; end loop; end if; end if; end process calc_sqrt;  The output is at the last stage of the pipe array data_out <= std_logic_vector(root(0)); end rtl;  Also here is a small testbench that you can use to test the above code.   Testbench for sqrt_pipelined  library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all; entity sqrt_pipelined_tb is end sqrt_pipelined_tb; architecture behavioral of sqrt_pipelined_tb is component sqrt_pipelined generic ( INPUT_WIDTH : integer); port ( clk : in std_logic; sync_reset : in std_logic; data_in : in std_logic_vector(INPUT_WIDTH1 downto 0); data_out : out std_logic_vector(INPUT_WIDTH/21 downto 0)); end component;  component generics constant INPUT_WIDTH : integer := 32;  component ports signal sync_reset : std_logic; signal data_in : std_logic_vector(INPUT_WIDTH1 downto 0); signal data_out : std_logic_vector(INPUT_WIDTH/21 downto 0);  clock signal clk : std_logic := '1'; begin  behavioral  component instantiation DUT : sqrt_pipelined generic map ( INPUT_WIDTH => INPUT_WIDTH) port map ( clk => clk, sync_reset => sync_reset, data_in => data_in, data_out => data_out);  clock generation clk <= not clk after 10 ns;  waveform generation WaveGen_Proc : process begin  insert signal assignments here sync_reset <= '1'; data_in <= (others => '0'); wait for 100 ns; wait until rising_edge(clk); sync_reset <= '0'; data_in <= std_logic_vector(to_unsigned(4, data_in'length)); wait until rising_edge(clk); data_in <= std_logic_vector(to_unsigned(9, data_in'length)); wait until rising_edge(clk); data_in <= std_logic_vector(to_unsigned(121, data_in'length)); wait until rising_edge(clk); data_in <= (others => '0'); wait; end process WaveGen_Proc; end behavioral;  Cheers ! By the way, does anyone know if we can paste source code in the group with proper formatting (or at least courier fonts)? Thanks ! 




Anton Gunman 
Luis Cupido
Guest
Posts: n/a

Thanks.
I cant really cope with the non pipelined suff, from Anton's values it results way too slow. So I will try the pipelined approach. Many thanks for the code Anton. In the mean time it doesn't hurt to make a component using alpha max + beta min algorithm. I have not evaluated the impact of the error but just to see all this working it will be worth a try as coding effort is very small. Tks all for the great suggestions. Luis C. 




Luis Cupido 
Luis Cupido
Guest
Posts: n/a

Anton
The pipeline code works fantastic but has a small bug in the algorithm I could not figure out what it is. it computes perfect up to 32767^2 and for any value above this the result is always 32767=7FFF For 32 bit input and 16bit out I should go up to 0xFFFF at the output. In fact the compiler warns that data(15) is stuck low. the code is from your message copy/paste. I sure you can spot what is wrong in a eye blink. Super thanks. Luis Cupido. 




Luis Cupido 
Anton Gunman
Guest
Posts: n/a

Thanks for pointing that out !
Indeed there was a bug, I added an extra pipeline stage, which increases the latency by one but should fix the problem (I was not iterating over all the bits). The current latency is INPUT_WIDTH/2 + 1. Note: If you're going to test the corner cases make sure you don't reach the INTEGER limitations inside the testbench. I was able to reproduce the problem by setting the generic INPUT_WIDTH := 8 in the testbench. Below is the modified code, let me know if there's still a problem. Anton   Fully pipelined sqrt module   * The data INPUT_WIDTH has to be even (The output width is INPUT_WIDTH/2).  * The latency is INPUT_WIDTH/2 + 1.  Note: I have not thoroughly tested this module.  library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all; entity sqrt_pipelined is generic ( INPUT_WIDTH : integer := 32  Has to be even ); port ( clk : in std_logic; sync_reset : in std_logic; data_in : in std_logic_vector(INPUT_WIDTH1 downto 0); data_out : out std_logic_vector(INPUT_WIDTH/21 downto 0) ); end sqrt_pipelined; architecture rtl of sqrt_pipelined is   Signal declaration  constant AMSB : integer := INPUT_WIDTH  1; constant RMSB : integer := (INPUT_WIDTH/2)  1;  These are arrays for pipelining. RMSB + 2 stages (which is the latency) type type_root_array is array (0 to RMSB+1) of unsigned(RMSB downto 0); type type_rest_array is array (0 to RMSB+1) of unsigned(AMSB + 1 downto 0); signal root : type_root_array; signal rest : type_rest_array; begin   Calculate sqrt  calc_sqrt : process (clk) variable test : unsigned(RMSB+1 downto 0);  Temporary variable begin if rising_edge(clk) then if sync_reset = '1' then root <= (others =>(others => '0')); rest <= (others =>(others => '0')); else  Input register root(RMSB + 1) <= (others => '0'); rest(RMSB + 1) <= '0' & unsigned(data_in); for i in RMSB downto 0 loop  Pipeline shift registers root(i) <= root(i+1); rest(i) <= rest(i+1); test := root(i+1)(RMSB1 downto 0) & "01"; if test(RMSBi+1 downto 0) > rest(i+1)(AMSB+1 downto 2*i) then root(i) <= root(i+1)(RMSB1 downto 0) & '0'; else root(i) <= root(i+1)(RMSB1 downto 0) & '1'; rest(i)(AMSB downto i*2) <= rest(i+1)(AMSB downto i*2)  test(RMSBi+1 downto 0); end if; end loop; end if; end if; end process calc_sqrt;  The output is at the last stage of the pipe array data_out <= std_logic_vector(root(0)); end rtl; 




Anton Gunman 
Anton Gunman
Guest
Posts: n/a

Thanks for pointing that out !
Indeed there was a bug, I added an extra pipeline stage, which increases the latency by one but should fix the problem. The current latency is INPUT_WIDTH/2 + 1. Note: If you're going to test the corner cases make sure you don't reach the INTEGER limitations inside the testbench. I was able to reproduce the problem by setting the generic INPUT_WIDTH := 8 in the testbench. Below is the modified code, let me know if there's still a problem. Anton   Fully pipelined sqrt module   * The data INPUT_WIDTH has to be even (The output width is INPUT_WIDTH/2).  * The latency is INPUT_WIDTH/2 + 1.  Note: I have not thoroughly tested this module.  library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all; entity sqrt_pipelined is generic ( INPUT_WIDTH : integer := 32  Has to be even ); port ( clk : in std_logic; sync_reset : in std_logic; data_in : in std_logic_vector(INPUT_WIDTH1 downto 0); data_out : out std_logic_vector(INPUT_WIDTH/21 downto 0) ); end sqrt_pipelined; architecture rtl of sqrt_pipelined is   Signal declaration  constant AMSB : integer := INPUT_WIDTH  1; constant RMSB : integer := (INPUT_WIDTH/2)  1;  These are arrays for pipelining. RMSB + 2 stages (which is the latency) type type_root_array is array (0 to RMSB+1) of unsigned(RMSB downto 0); type type_rest_array is array (0 to RMSB+1) of unsigned(AMSB + 1 downto 0); signal root : type_root_array; signal rest : type_rest_array; begin   Calculate sqrt  calc_sqrt : process (clk) variable test : unsigned(RMSB+1 downto 0);  Temporary variable begin if rising_edge(clk) then if sync_reset = '1' then root <= (others =>(others => '0')); rest <= (others =>(others => '0')); else  Input register root(RMSB + 1) <= (others => '0'); rest(RMSB + 1) <= '0' & unsigned(data_in); for i in RMSB downto 0 loop  Pipeline shift registers root(i) <= root(i+1); rest(i) <= rest(i+1); test := root(i+1)(RMSB1 downto 0) & "01"; if test(RMSBi+1 downto 0) > rest(i+1)(AMSB+1 downto 2*i) then root(i) <= root(i+1)(RMSB1 downto 0) & '0'; else root(i) <= root(i+1)(RMSB1 downto 0) & '1'; rest(i)(AMSB downto i*2) <= rest(i+1)(AMSB downto i*2)  test(RMSBi+1 downto 0); end if; end loop; end if; end if; end process calc_sqrt;  The output is at the last stage of the pipe array data_out <= std_logic_vector(root(0)); end rtl; 




Anton Gunman 


 
Thread Tools  


Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
recommendation doing cosimulation between c/c++ with vhdl  Carson  VHDL  0  10052005 07:40 PM 
What is VS Doing to my Code???  downs.matt@gmail.com  ASP .Net  1  09022005 11:38 AM 
Why is Mozilla doing this ... and how to fix it  Joe S.  Firefox  7  06072005 04:12 AM 
VB6/VB.Net Programming Question  what am i doing wrong?  Salisha Khan  ASP .Net  1  08012003 01:55 PM 
What am I doing wrong?  Keith R. Williams  VHDL  4  07152003 03:08 AM 
Powered by vBulletin®. Copyright ©2000  2014, vBulletin Solutions, Inc..
SEO by vBSEO ©2010, Crawlability, Inc. 