This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

1-cycle multiply, 64-bit result,  reciprocal?

Can someone tell me how many extra gates the 1-cycle multiply uses? If there was a 64-bit result, how many more gates would be used? Can these gates also be used to find the reciprocal of a number so instead of divides, the coder multiplies the reciprocal? In the 90s, Nvidia sent a coder to optimize Tomb Raider on their video cards. He explained that they didn't use a Z-buffer but rather a W-buffer which was the reciprocal of the Z. This meant that the draw engine used multiplies rather than divides when calculating texture coordinates, lighting and other vertex controls.. As far as I know, they still do.

MP3 on the M0 (or 2 x M0s) uses a LOT of multiplies for the FFTs. Since the BBC Microbit has a Nordic Semiconductors nRF51822 bluetooth chip. Like the CPU, it's clocked at 48MHz. If the RAM of this CPU could be mapped into the CPU address space, it would be possible to build 1 channel using the Nordic chip & the other using the CPU.

I'm looking to use 16:16 fixed-point by modification of the Minimp3 player with some extra speed/space tradeoff so that the player can go right up to 320Kb/S.

Thanks in advance.

Sean

  • Have a look with google at some images for carry look ahead adders - which are practical - and Wallace tree multipliers - which aren't but you might see some others which are slower but practical. These work in time proportional to the log of the number of bits so that has to fit into one cycle to do what you want. And whilst one can get a reciprocal using an approximation and some multiplies you'd need yet another circuit to do it much faster..Speeding up that sort of thing uses lots of gates and is left to bigger systems than an M0 core!

    I just looked at WIkipedia about the Microbit and it said the main processor ran at 16Mhz and the Bluetooth one at 48Mhz which seems a bit strange to me. On a quick look on the web at what other MP3 decoders achieve I'd guess you'd have real trouble at 16Mhz doing the highest bitrate so I can see why you want to hack the Bluetooth chip. It may well be possible but really that would be up to you I'm afraid.,

  • Yep - if the CPU is 16Mhz while the Bluetooth is 48MHz, I was considering asking for a small amount of the program-RAM for the Bluetooth CPU to do all of those FFT calculations. My other option is not good - specify Sandisk USB sticks and reprogram the program RAM in them to placed the MP3 decode inside the memory stick. Of course, nobody has specified the speed they are running the memory stick. I know that on a PC they run at 100MHz so plenty of power for MP3. If all else fails, ADPCM is simple but not so compact. I really need to find a codec that is designed to output 8-bit data...

    I wrote a multiply for the 8080 in the Gameboy in which I split the numbers into 4-bit blocks and used a lookup table. I wonder if a similar trick would beat the usual 32x32=64 bit software multiply which itself uses 31 cycles IF it has the 1-cycle multiply. Without it, it's about 250 cycles!

  • I believe both of the cores implement the single cycle multiply option but you'd better check that.

  • I'm somewhat confused with the questions in the first paragraph and I'll just try to share information the way I interpret the objective of the questions.

    I suppose you have a notion that a circuit for finding the reciprocal of an operand is coupled or integrated in a multiplier. In this case, when division is to be executed, the divisor is passed to the reciprocal circuit and the result is used as the multiplier (operand) while the dividend is applied as the multiplicand to the multiplier (circuit).

    Newer processors (e.g., ARMv7, Itanium, PowerPC - later versions, etc.) allow the reciprocal estimator (or initial estimator) to work independently and a reciprocal estimate instruction (or instruction to support reciprocal estimation) is available.

    To determine how many extra/more gates is used, I suggest you consult a book/reference in advanced digital/microprocessor/arithmetic circuit design. For our space I think it would suffice to remind us that division in digital computer circuit consumes far more resources compared to multiplication. Hence, division is avoided by transforming the operation into multiplication of the dividend and the reciprocal of the divisor. It may seem absurd because finding the reciprocal of a number is also division. However, division is truly eluded because reciprocal estimation is performed by iterating through

    Xn+1 = Xn(2 - dXn)

    which does not involve division (see for example the ARM® Architecture Reference Manual ARMv7-A and ARMv7-R edition page 86 to see a more readable equation). This sprung from Newton-Raphson root-finding method. From the function

    f(x) = d - (1/x)

    f(x) = 0 when x is the reciprocal of d.

    For details you may consult a reference in Numerical Methods/Analysis, Scientific Computing, and related subjects.

    Lastly, we have to be reminded that the method of division we are discussing does not apply to integer arithmetic circuit and the reciprocal estimate instruction belongs to floating-point operations of microprocessors. If a nonzero number is not exactly equal to 1, the number itself or its reciprocal involves a fractional part.

  • Have you considered a Log & Antilog table to narrow down your search range? Depending on size, it could, at least, tell you what size of divide you need? I know there are tricks for fast square-root so maybe the problem can be expressed differently? Please forgive me if this is useless - 15 years since I coded - I can ONLY code in assembly-language and their isn't much of a market.