Researchers Claim New Technique Slashes AI Energy Use By 95% (decrypt.co) 115
Researchers at BitEnergy AI, Inc. have developed Linear-Complexity Multiplication (L-Mul), a technique that reduces AI model power consumption by up to 95% by replacing energy-intensive floating-point multiplications with simpler integer additions. This method promises significant energy savings without compromising accuracy, but it requires specialized hardware to fully realize its benefits. Decrypt reports: L-Mul tackles the AI energy problem head-on by reimagining how AI models handle calculations. Instead of complex floating-point multiplications, L-Mul approximates these operations using integer additions. So, for example, instead of multiplying 123.45 by 67.89, L-Mul breaks it down into smaller, easier steps using addition. This makes the calculations faster and uses less energy, while still maintaining accuracy. The results seem promising. "Applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by element wise floating point tensor multiplications and 80% energy cost of dot products," the researchers claim. Without getting overly complicated, what that means is simply this: If a model used this technique, it would require 95% less energy to think, and 80% less energy to come up with new ideas, according to this research.
The algorithm's impact extends beyond energy savings. L-Mul outperforms current 8-bit standards in some cases, achieving higher precision while using significantly less bit-level computation. Tests across natural language processing, vision tasks, and symbolic reasoning showed an average performance drop of just 0.07% -- a negligible tradeoff for the potential energy savings. Transformer-based models, the backbone of large language models like GPT, could benefit greatly from L-Mul. The algorithm seamlessly integrates into the attention mechanism, a computationally intensive part of these models. Tests on popular models such as Llama, Mistral, and Gemma even revealed some accuracy gain on certain vision tasks.
At an operational level, L-Mul's advantages become even clearer. The research shows that multiplying two float8 numbers (the way AI models would operate today) requires 325 operations, while L-Mul uses only 157 -- less than half. "To summarize the error and complexity analysis, L-Mul is both more efficient and more accurate than fp8 multiplication," the study concludes. But nothing is perfect, and this technique has a major Achilles' heel: it requires a special type of hardware, so current hardware isn't optimized to take full advantage of it. Plans for specialized hardware that natively supports L-Mul calculations may already be in motion. "To unlock the full potential of our proposed method, we will implement the L-Mul and L-Matmul kernel algorithms on hardware level and develop programming APIs for high-level model design," the researchers say.
This is not a new Technique (Score:1)
Re:This is not a new Technique (Score:5, Funny)
Re: (Score:2)
That's kind of the point of sending them to college.
Re: (Score:2)
It's not their fault. They stopped teaching computer science early in the 'dot com' era, turning CS programs into four-year-long programming boot camps. Things continued to degrade to the point that there is now a taboo against actually writing code.
Re:This is not a new Technique (Score:5, Interesting)
Remember fractint? It allowed computing a bunch of different types of fractals using integer operations, making it dramatically faster than the classic floating-point implementations. Back when a naive implementation of the Mandelbrot fractal took 2h on a 286, fractint rendered almost instantly.
Tables (Score:2)
Even with FP8, you just need to generate (once) a 64k-entry table of results, and then there's no CPU/FPU FP math at all to do the two-operand "multiplication." Put 8-bit operand A in the most significant 8 bits and 8-bit operand B in the least significant 8 bits, and that gives a direct 16-bit index to the answer in the 64k table.
The significant cost of FP8 versus FP3 or FP4 is in the s
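A minimal sketch of the table-lookup idea above, in Python. The FP8 decoder here assumes a simplified E4M3-style layout (1 sign, 4 exponent, 3 mantissa bits, bias 7) and ignores NaN/Inf; all the names are mine and this is illustrative, not a production implementation.

```python
# Precompute every FP8 x FP8 product once, then "multiply" by indexing with
# the two raw bit patterns -- no FP math at lookup time.

def fp8_to_float(bits: int) -> float:
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0:                                  # crude subnormal handling
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# 64k-entry table: index = (opA << 8) | opB, value = the product.
TABLE = [fp8_to_float(a) * fp8_to_float(b) for a in range(256) for b in range(256)]

def fp8_mul(a_bits: int, b_bits: int) -> float:
    """'Multiply' two FP8 operands with a single table lookup."""
    return TABLE[(a_bits << 8) | b_bits]

print(fp8_mul(0x44, 0x3C), fp8_to_float(0x44) * fp8_to_float(0x3C))  # both 4.5
```

The table holds 65,536 precomputed products, so a "multiplication" becomes a single indexed load.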
Re: (Score:3)
64k entries is still pretty sizable for a cache that you need to have local to every part of the current layer being processed (remember that you have thousands of cores working at once). If the access time is more than several times that of a register, then the above algo is probably going to be faster.
On the other hand, if you're talking FP4, a 256-entry table (heck, should we even be calling it a cache? Might as well just be etched directly into the silicon) is really easy to have local everywhere.
A thought:
Re: (Score:2)
Re: This is not a new Technique (Score:3)
Re: This is not a new Technique (Score:3)
Re: (Score:1)
Indeed. Most of the hype is now basically a scam.
Re:This is not a new Technique (Score:5, Interesting)
To be clear: This is NOT fixed point arithmetic in some linear processor. Those techniques didn't go out of fashion on a lark, but because they became slower than floating point math, so you were slowing yourself down in order to make your math less accurate.
There's several things going on here that push the equation back in the direction of integer math. The first is that we're already dealing with greatly reduced precision. The article talks about FP8, but many people run inference with even less, as little as 3-4 bit precision (which is such a low bit precision that it makes more sense to think of it as a lookup table of exponentially growing values than as actual floating point math). So it's not a question of whether to give up precision, but rather, by what means to give up precision, and how much.
Secondly, this is not fixed point. If there's any analogies, it's to the old "fast inverse square root" trick that was beloved by game developers for several years. That is to say, it's still working with floating, not fixed, point numbers, with exponents and mantissas, but relies on approximating away part of the math.
The multiplication of x and y starts off (treating mantissas as their float fractions) as:
(1 + xm) * 2^xe * (1 + ym) * 2^ye
Where m is mantissa and e is the exponent, of each component x and y. Remember that the mantissa is a value from 0 to 1. This can be mathematically expanded to:
(1 + xm + ym + xm * ym) * 2^(xe + ye)
The 2^(xe + ye) is not a problem because multiplying by 2^a can be done with just a bitshift. The lag in multiplication is in the mantissa, because the amount of work needed to do it is O(N^2) with respect to the number of bits.
To get rid of that problematic xm * ym, and remembering that xm and ym (when thought of as floats) are less than 1 so their product is of a smaller order than xm + ym, they instead change the equation to:
(1 + xm + ym + magic number) * 2^(xe + ye) ... where the magic number chosen depends on the number of mantissa bits. The purpose of this magic number is not to get the exact value of the product right, but to keep the result in roughly the right "order."
A normal floating point operation proceeds (apart from the sign bit):
1) Calculate (1+xm) * (1+ym)
2) Separate that into the carry and the fractional part (1 + new mantissa)
3) Store the new mantissa (the right side / least significant bits of the floating point number).
4) Calculate the new exponent as xe + ye - offset + carry
5) Store it as-is to the left of the mantissa.
Their version however lets you do:
1) Calculate the mantissa as xm + ym + magic number ... and (since it's all one big integer with the exponent on the left of the mantissa), the carry happens automatically into the exponent during the addition.
2) That's it -- there is no step 2.
So it's a far more direct operation.
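A minimal sketch of that "add the packed fields and let the carry spill into the exponent" flow, in Python, assuming a toy 7-bit exponent|mantissa layout (4 exponent bits, 3 mantissa bits, sign ignored) and an illustrative MAGIC constant -- not the paper's exact kernel or offsets.

```python
# Approximate a floating-point multiply with one integer add: mantissas add,
# any mantissa carry overflows naturally into the exponent field, and a small
# magic constant stands in for the dropped xm*ym term.

EXP_BITS, MAN_BITS, BIAS = 4, 3, 7
MAGIC = 1                         # one mantissa LSB; illustrative choice only

def approx_mul(a_bits: int, b_bits: int) -> int:
    """Approximate multiply of two packed (exponent|mantissa) values."""
    packed = a_bits + b_bits + MAGIC      # mantissas add; carry enters exponent
    packed -= BIAS << MAN_BITS            # both exponents were biased; drop one bias
    return packed & 0x7F                  # keep 7 bits; no overflow handling

def decode(bits: int) -> float:
    exp = (bits >> MAN_BITS) - BIAS
    man = bits & ((1 << MAN_BITS) - 1)
    return (1.0 + man / (1 << MAN_BITS)) * 2.0 ** exp

a = b = (8 << MAN_BITS) | 6               # both encode 1.75 * 2^1 = 3.5
print(decode(approx_mul(a, b)), decode(a) * decode(b))
```

Running it prints the approximate result next to the exact one (13.0 vs 12.25 for these inputs), which shows both the mechanism and the kind of per-multiply error being traded away.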
So no, this isn't fixed point math. But it does have some "echoes" of certain earlier tricks used. And it does stress the need to potentially have some rethinks to how we do things in hardware when dealing with such low bit precisions, where you're guaranteed a lot of inaccuracy anyway. Like, if we go down to even smaller quantizations, maybe the entire process should just be a lookup table.
And then you might take it even further. Instead of summing thousands of numbers in a vector, perhaps we should break with tradition even further and go analog? E.g. have each weighted value in the vector control a variable resistor and output all the currents into a shared wire, then read the current back into a number on the other end to get the sum. Again, our precision is going to be bad to begin with, and flash attention is already ruining our determinism anyway, so what's the harm? And hey, if we're outputting analog, then why not do our weights and activation functions in analog as well, with a transistor-based exponential current source? You probably can't train a traditional network like this, but I see no reason why you can't run inference like this.
Re:This is not a new Technique (Score:4, Interesting)
> Instead of summing thousands of numbers in a vector, perhaps we should break with tradition even further and go analog?
There have been attempts in this direction, implementing neural networks in analog electronic hardware, for literally decades and a number of failed startups.
I believe a major problem is fabrication differences: one chip is different from another and you can't replicate results. So a net you train on one chip will not work properly on another. That creates so much of a human burden (figuring out how to recalibrate, if possible, between hardware units) that they are not feasible products.
If it's inference only, then it's going into a production situation and needs to be cheap; moreover, in something like that (e.g. automotive) it needs to be mass-producible and reliable and not require expert fiddling.
Re: (Score:2)
That argument doesn't stand. We haven't even implemented Transformers on ASICs yet, let alone any sort of nontraditional hardware. It's not some technological barrier that's kept us on GPUs, but rather, the combination of the size of the market (now finally getting large) times the risk of developing hardware for an architecture that's obsolete by the time it's launched.
(A) currents add
I'm surprised (Score:3)
I thought CPUs were already made to do floating operations as effectively as possible, so what makes it different for LLM?
Makes it sound as if instead of writing float multiplications in C code, it's faster to translate it into integer calculations.
Yes, I do know sometimes it is faster to convert to whole numbers and then convert the result back to float at the end.
Re: (Score:2)
Simple, float is still more effort than integer, and it is so in multiple dimensions. That will never change.
Re:I'm surprised (Score:5, Interesting)
L-Mul works by defining a new floating point format which is less capable than the existing, standard one. This works because the new format has been specifically tailored to match the needs of current AI tensor math. The existing standard float format has more features than AI requires. Since this new format is less complex, it can be implemented in fewer logic gates, which is where the power savings come from.
I suspect it would be more fruitful to increase efforts on converting existing models to quantized integer models and just stick with the existing hardware. This also massively lowers the power consumption by replacing floating point instructions with integer ones.
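For readers unfamiliar with what "quantized integer models" looks like in practice, here's a rough sketch of symmetric per-tensor int8 quantization in Python/NumPy -- my own illustration, not any particular toolkit's API; real pipelines add per-channel scales, zero points, and calibration.

```python
import numpy as np

def quantize_int8(t: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 values plus one float scale."""
    scale = max(float(np.abs(t).max()), 1e-12) / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4, 8).astype(np.float32)   # stand-in "weight" matrix
x = np.random.randn(8).astype(np.float32)      # stand-in "activation" vector

qw, sw = quantize_int8(w)
qx, sx = quantize_int8(x)

# The dot products run as integer multiply-adds (int32 accumulation);
# the two float scales are applied once per output, not once per multiply.
y_q = (qw.astype(np.int32) @ qx.astype(np.int32)) * (sw * sx)
print(y_q)        # approximately equal to the float result:
print(w @ x)
```

The point is that the inner loops become integer arithmetic, with the floating-point scaling paid only once per output element.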
Re: (Score:3)
I suspect the real purpose of all this is an attempt to prime the pump for specialized hardware sales - hardware someone affiliated with the researchers will soon release.
Re: (Score:3)
I suspect the real purpose of all this is an attempt to prime the pump for specialized hardware sales
Big tech will only buy it if it works.
If it works as promised, they'll make billions.
Re:I'm surprised (Score:4, Insightful)
I mean, it's obvious that we're going to be headed to custom hardware; it's kind of shocking to me that GPUs still dominate, even though they've been adding on some features targeted at NN processing. The greater AI compute needs grow, the more push there is going to be for custom hardware. The main thing that's been slowing it down (vs. say Bitcoin, which went to custom ASICs pretty quickly) is that the algos are still very much evolving.
Re: (Score:2)
Re: (Score:2)
Or maybe almost all new AI code is CUDA (and torch/tensorflow) and so it is best to use that instead of having your own software engineers reimplement all the libraries on them own.
The dominance of CUDA is not good, but it is understandable.
Re: (Score:2)
This is also why AMD wants to increase their market share to break this particular chicken-and-egg problem. Right now CUDA drivers always come first, with Vulkan either a little (weeks) to a lot (months or years) behind that. People who want to generate on their own hardware don't want to wait for drivers to be written, they want the best support possible, so they all buy nVidia. Certainly that's why I went that direction rather than saving ~20% with the equivalent AMD option. They're right that getting mar
Re: (Score:2)
Papers like this or Bitnet could also mean a huge chance for AMD. This is one architecture, but if the "next transformer" (that's what almost all LLMs are based on as well as a few image generators, and many recognition networks) would work on such a Bit or Integer architecture, the first one to build an efficient hardware for that could break the CUDA dominance.
Re: (Score:3)
What makes you think the top NVidia processors like the H100 and GH200 are not custom hardware tuned for NN processing? If one started from scratch, where are the main inefficiencies that would be solved, and at what cost? Would one invent something substantially better than what NVidia has now? It's doubtful.
The main use case for NN training does require highly programmable and flexible parallel processing like graphics computations do. BTC mining is a single computation but NN work is not as simply st
Re: (Score:2)
Compare inference on an H100 to Groq or Cerebras. And that's not even ASICs.
GPU architectures are structured for generalist computing. They're not optimized to train or do inference on Transformers. Adding say BF8 support to a GPU is not the same thing as having the hardware structured around executing Transformers.
Re: (Score:2)
GPU architectures are structured for generalist computing. They're not optimized to train or do inference on Transformers.
They are now, e.g. Grace Hopper was designed to improve performance for such workloads.
Re:I'm surprised (Score:4, Insightful)
converting existing models to quantized integer models and just stick with the existing hardware.
That's how Google's TPU already does it. It has 64k 8-bit integer multipliers.
I don't know how this new technique differs.
Re: (Score:3)
Using the implicit leading 1, a floating point number is stored as (s, m, e) representing (-1)^s (1+m) 2^e. The interesting part of multiplication is (1+m1)(1+m2)=(1+m1+m2+m1 m2). They approximate this as (1.0625+m1+m2). I assume that they work on the basis that the system is robust enough that the error doesn't matter, because naively I would think that the way to optimise it would be to do the multiplication m1 m2 in lower precision (e.g. using the leading 4 bits of each mantissa).
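A quick numeric check (mine, not from the paper) of how far the (1.0625 + m1 + m2) shortcut above strays from the exact (1 + m1)(1 + m2), sweeping both mantissas over a 4-bit grid. Carry/normalization effects are ignored, so this measures only the raw formula error.

```python
# Compare the approximation (1.0625 + m1 + m2) with the exact (1 + m1)*(1 + m2)
# over all 4-bit mantissa fractions (steps of 1/16), and report relative error.
grid = [i / 16 for i in range(16)]
rel_errs = [abs((1.0625 + m1 + m2) - (1 + m1) * (1 + m2)) / ((1 + m1) * (1 + m2))
            for m1 in grid for m2 in grid]
print(f"max relative error:  {max(rel_errs):.3f}")   # worst case near m1 = m2 = 1
print(f"mean relative error: {sum(rel_errs) / len(rel_errs):.3f}")
```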
Re: (Score:3)
I suggest that you take a look at the FP8 and FP4 formats which are currently the "big" thing in AI. Everything you thought you knew about floating point numbers is out the window. Basically they are lookup tables to a limited range of numbers; 256 and 16 respectively. There is also no NaN etc. in FP4.
Re: (Score:2)
FP4 is great for getting everyone to the same result as fast as possible!
Because of the lack of variety in output, I see FP4 as a toy not even fit for mass consumer use, although it may be closer to acceptable there than I give it credit for. Does it really matter if you generate a birthday card that looks just like someone else's if you'll never meet them?
Re: (Score:2)
I regularly run Q4 models. They work great.
ANNs are inherently extremely noise tolerant, as they're in effect fuzzy logic engines. What you lose in precision by going from FP16 to FP8, or FP8 to FP4, you gain in terms of doublings of the total number of weights and biases. You get less precision per "question" / less superposition of "questions" that are in effect asked and answered by a given neuron, but in exchange, you get twice as many neurons. And that's often a very good tradeoff.
Re:I'm surprised (Score:5, Informative)
Well I've skimmed the paper. From what I can tell, the method approximates floating point multiplication, using almost only addition. This is possible because floating point is a semi-logarithmic format, so adding the exponents is a good part of multiplication.
Basically, an fp number is (1+x/M)*2^e
x is an integer with 0 <= x < M, M is a constant (a power of 2), e is an integer. If you multiply 2 fp numbers (x,e and y,f) you get:
(1 + x/M + y/M + x*y/M^2) * 2^(e+f)
there's some sign bits and extra fiddling, so if the bracketed part exceeds 2 you need to knock a bit off and add 1 to the exponent, but fundamentally the cost (according to them) is dominated by the x*y integer multiplication there.
The paper more or less proposes simply discarding the x*y term and replacing it with a constant.
Then FP multiplication becomes "smoosh two numbers together in a nonlinear way that's a bit like multiplication", the latter being much much cheaper.
This to me seems pretty plausible.
Re:I'm surprised (Score:4, Interesting)
I suspect it would be more fruitful to increase efforts on converting existing models to quantized integer models and just stick with the existing hardware. This also massively lowers the power consumption by replacing floating point instructions with integer ones.
The paper mentions quantization in its related works section but doesn't elaborate on why the paper's ideas are better.
Of course, the big misdirection from the paper is that they talk about energy savings for compute but not for the entire processor. Compute, i.e., the ALU-ish part, is a small part of the total chip energy usage, and the paper's idea isn't even talking about energy for the entire ALU-ish part but a fraction of that.
Re: (Score:2)
Compute, i.e., the ALU-ish part, is a small part of the total chip energy usage
That depends on how much cache is onboard. If there's a lot of cache, that's true. If there's little cache, it isn't...
Re: (Score:2)
> I suspect it would be more fruitful to increase efforts on converting existing models to quantized integer models and just stick with the existing hardware. This also massively lowers the power consumption by replacing floating point instructions with integer ones.
This is already done. And it doesn't lower the power consumption that much.
It's useful for using already trained nets, but not for training where the dynamic range of floating point is essential. There is lots of existing work in the literat
Re: (Score:2)
I thought CPUs were already made to do floating operations as effectively as possible
These calculations are not run on CPUs.
AI has been using GPUs but is increasingly using custom silicon.
so what makes it different for LLM?
Much lower precision for starters. CPUs generally lack native support for low-precision formats like FP16 or FP8.
Re: I'm surprised (Score:2)
Inference ("running an ai model") is much less expensive than training one, LLMs seem able to run with somewhat tolerable performance on the CPU of a Raspberry Pi 5. As far as I understand they're quantized down from the original FP32?
For training they still need to run FP32
Re: (Score:2)
That's right. You really only need that level of precision during training. After that, it's just (a lot of) wasted space.
Re:I'm surprised (Score:4, Interesting)
I thought CPUs were already made to do floating operations as effectively as possible
I think it depends on the CPU.
I remember that PowerPC used to be faster with floating point operations than integer ones and Apple would occasionally suggest converting to floats from ints for various array operations. Conversely, when Apple switched to Intel, the opposite was true and Apple changed their suggestion.
Flashback (Score:5, Insightful)
So, from the explanations given by esteemed fellow nerds,
simply put, they pulled a John Carmack.
Re: (Score:2)
Re: (Score:2)
I thought CPUs were already made to do floating operations as effectively as possible, so what makes it different for LLM?
Their claim of floating-point vs. integer operations is pure nonsense. What they have done is replace (1 + x)(1 + y) with (1 + x + y) instead of (1 + x + y + xy).
This is a major loss of precision. But it seems that a multiplication isn't actually needed, just some operation that grows if either operand grows.
If we use a two-bit mantissa, after we add the leading 1 bit, we actually process a 3-bit x 3-bit product. Multiplication turns this into 9 single-bit products; one product is 1, four products are exis
Only binary is the future for AI (Score:3, Funny)
it reduces the complexity of integers even more and thus saves 4000% more energy.
You could do your training on a TI-84 running on a battery.
Re: (Score:2, Funny)
Oh look, it's mister fancy pants over here with his TI-84. In my day we did it the real way, raw-dogging it with a TI-83.
Re: (Score:3)
Oh look, it's mister fancy pants over here with his TI-84. In my day we did it the real way, raw-dogging it with a TI-83.
Amateurs. The only real way is a Casio calculator watch!
Re:Only binary is the future for AI (Score:5, Funny)
*Clacks in abacus*
Re: (Score:2)
Funniest so modded, but the story was a richer target.
Now to the second search of discussion for references to the low power PoC. The actual human brain is rated at 35 W. A Thousand Brains by Jeff Hawkins is still commended.
Re: (Score:2)
Yep. Why do vector math if you can just let physics multiply and sum your signals for you? :)
Metabolism is an extremely inefficient way to power a "computer", and our wetware has a lot of parasitic overhead costs, but passive analog mechanisms are so much vastly more efficient than digital calculations that it outweighs everything else by large margins.
Re: (Score:2)
There's an abacus (soroban) school about 2 minutes walk from here.
We live in a satisfactory sim (Score:2)
Any improvement for consumer h/w? (Score:2)
I read the preprint. I didn't quite catch whether this algorithm would deliver speed, efficiency or energy savings if implemented on existing hardware, like consumer CPUs and GPUs. The paper ends with "we will implement hardware and API". Do they mean that existing hardware beats it unless hardware is built specifically with their algorithm in mind?
Re: (Score:2)
It is hype. Expect most of the claims to be empty and the rest to be misleading.
Re: (Score:2)
I didn't quite catch whether this algorithm would deliver speed, efficiency or energy savings if implemented on existing hardware
No. It requires new silicon.
like consumer CPUs and GPUs.
Nobody uses consumer CPUs and GPUs for AI anymore. H100 tensor core GPUs are used, but most big tech companies are developing their own custom silicon designs.
Re: (Score:2)
like consumer CPUs and GPUs.
Nobody uses consumer CPUs and GPUs for AI anymore. H100 tensor core GPUs are used, but most big tech companies are developing their own custom silicon designs.
Everything bounces. There are people who prefer to run local only. Then you can't expect high end, brand new hardware for inference.
Like Microsoft's CoPilot+ stuff that expects an NPU. No idea if this is just a deal between Qualcomm (and Intel) and Microsoft to try to sell more of their hardware by saying 'we do AI too!!!' while excluding the other obvious options (CPU and GPU), which is what they seem to have chosen to do.
Re: (Score:2)
Where's the news? (Score:2, Informative)
Using integer maths instead of floating point maths, where you want a similar result but don't need as high precision, has been done for multiple decades already and is a common thing to do in e.g. resource-constrained embedded devices. Applying it to AI models doesn't magically make it a new thing, and I seem to recall having seen several articles over the years from other groups on doing it to AI models/training before this.
This just smells like an attempt at hyping things up in the hopes of grant money or
Re: (Score:2)
Did you RTFP?
Re: (Score:3)
Re: (Score:2)
That doesn't apply to NNs as a whole, though. A single neuron isn't calculating someone's odds of cancer; it's a huge number of neurons acting together. Each neuron in effect asks and answers a superposition of questions about one particular aspect of the problem. Reducing the FP precision reduces how much superposition there can be and/or how much nuance there can be in the answer of its questions, but at the same time buys you more neurons - a lot more total questions about a lot more aspects.
Re: (Score:2)
Not true (Score:3)
Re:Not true (Score:5, Interesting)
I don't think you quite understand how insanely energy intensive AI currently is. Let's say you have a system with 2000 H100 cards from NVidia. You are going to need something on the order of a 3MW grid connection, and your energy bill is going to be millions of dollars a year. Even a 10% reduction in the overall energy requirements is huge. I work in HPC and thought I was abreast of stupidly high power requirements and cooling; then we had to start doing AI, and oh boy was I mistaken.
Excellent maths (Score:2)
Now their claim seems to be that there was never any reason to calculate a product in the first place (floating point is a red herring, this is about calculating products), but that for LLMs a product is not needed, just an operation that produces larger numbers from larger inputs. And of course adding the sums in fixed-point arithmetic (which is what they are effectively doing) will be using less power.
Excellent maths (without Slashdot messing up my po (Score:2)
If you can't guess it, that was supposed to read "without Slashdot messing up my post" but Slashdot messed up the title as well.
So their invention is that to calculate a product x times y, where x = 1 + a for 0 <= a < 1 and y = 1 + b for 0 <= b < 1, it is Ok to calculate 1 + a + b instead of the correct 1 + a + b + a*b. This will give results that are always too small, by up to 25%.
Now their claim seems to be that there was never any reason to calculate a product in the first place (floating point is a red herr
RISC-V (Score:5, Interesting)
Given the past decade of arduous RISC-V infrastructure work I wouldn't be surprised to see China tape out L-Mul arch samples before New Year and have massive clusters of low-E AI in production by next summer, leaving sanctions on nVidia chips in the dust.
Definitely a potential game changer, though it's hard to imagine a lab in China isn't already doing similar work, given the momentum in the field (I think I left a similar comment last week about error-tolerant low-E AI machines).
This won't cut energy demand, though - just make it cheaper to provide which will increase demand.
Re: (Score:2)
Given the past decade of arduous RISC-V infrastructure work I wouldn't be surprised to see China tape out L-Mul arch samples before New Year and have massive clusters of low-E AI in production by next summer, leaving sanctions on nVidia chips in the dust.
Since China doesn't have advanced process technology and isn't likely to get it anytime soon as ASML only managed to create their latest machines through a multinational effort, and nVidia and everyone else is also free to implement such techniques, China will still be behind by the same amount as before.
something we aren't being told (Score:2)
I saw this removing multiplication in a paper many months ago. Considering that chipmakers are *beyond* antsy to find an advantage in LLM compute, what would stop them from building this hardware? There's something we aren't being told, and 0.07% drop in performance ain't it.
Re: (Score:2)
I think I know the paper you were talking about. But it was for ternary networks, not FP8.
Re: (Score:2)
You're probably thinking of this one [arxiv.org]
I would have guessed this one [arxiv.org]
it was for ternary networks, not FP8.
The goal is to get rid of expensive multiplications, not specifically to do cheap FP8 multiplications.
Re: (Score:2)
It very much is to do cheap FP8 multiplications. The numbers are still in FP8 form. They're just multiplying them in a method that relies on integer math. They expand out the multiplication equation, find that the biggest delay is for a low-order floating point multiplication in the mantissa that doesn't hugely affect the output, replace it with a constant magic number, add everything in the mantissa as integers, and let the carry bit overflow into the exponent. They still have an exponent and a mantissa
Re: (Score:2)
I see you're having your own personal conversation. Let me know when you want to participate in this one.
8-bit values? (Score:2)
Re: (Score:2)
Bit sizes have been trending down, not up. I commonly run FP4 models.
Basically, yes, you lose precision with floats half the size, but it means you get twice as many parameters, and when it comes to the precision loss from FP16 to FP8, having double the parameters is a no-brainer "yes, please!" choice.
The important thing to understand is that NNs are already dealing in fuzzy math, and are thus, highly resilient to noise. The noise-tolerant nature of LLMs should be obvious to anyone who's ever used one. They t
This is not multiplication (Score:2)
It doesn't matter what they call it, it is not multiplication. It is a very rough approximation to multiplication. It replaces calculating (1 + a) * (1 + b) with calculating 1 + a + b, leaving an error of ab. With a two-bit mantissa, they find 1.75 * 1.75 = 2.5 instead of 3 1/16, which would be rounded to 3. For a three-bit mantissa the worst case would be 1.875 * 1.875 = 2.75 instead of 3.515625, rounded to 3.5.
Re: (Score:2)
1) They don't leave an error of AB, they add in a magic number to approximate typical values of AB.
2) They're dealing with FP8 (and in some cases as low as *FP3*). There's already huge error.
Why should I care? (Score:2)
They're not going to use less electricity, they'll just build 20 more of the useless pieces of shit. Anybody with the money to build this kind of system is ideologically opposed to doing anything interesting or productive with it. They just see a way to burn oil faster and flood every last scrap of human existence with advertising surveillance.
Data point and question (Score:2)
Back when GPU-based crypto mining was a thing, AMD's cards were significantly faster than NVIDIA's at integer operations.
I wonder if this will shift some of the GPU business back to AMD.
environmentalists beware (Score:1)
Beware of. . . "unforeseen consequences."
technical improvements in efficiency tend to increase consumption and thereby defeat any purported benefits of economization. this is called "Jevons paradox" or "Jevons law" by some.
the actual benefit is that it will now be easier to train neural nets (good!), and that this will increase usage and thus total energy consumption.
this is also good since it provides an impetus to go nuclear, which will itself increase energy consumption. but do
So what do we do with... (Score:2)
So what do we do with all those nuclear reactors the major players are working to restart?
Re: (Score:2)
Their proponents are hoping to Jevon it. I'm dubious.
I can reduce it by 100% easily (Score:2)
Re: (Score:2)
Yeah, people are investing billions because it's all crap.
I don't believe that you really believe it's all crap. While it is overhyped and not every use is useful for everyone, such a generic statement is obviously untrue.
Re: (Score:2)
It's crap, and I'm FAR from the only person who says that, so seethe harder.
Re: (Score:2)
There are literally millions of people who find it useful. You may not like it. You do not have to use it. It may be hyped. Investments may be a bubble or not. But telling "it is all crap" is either extremely ignorant or a bad faith argument, because it is impossible not to see where these things find application and how many people use them productively. As said, you do not need to use them or like them at all. But you need to recognize that others do and they increase their productivity or do things they
Re: (Score:2)
Re: (Score:2)
Are you serious with your post? Never mind, I don't think the discussion will lead anywhere, because you not only choose to dislike AI (that's fine), but also completely ignore the reality of the people around you (that's not good for a serious discussion).
Re: (Score:2)
Are there some limited applications where it's helpful? I guess. But overall it's too much hype and
Re: (Score:2)
It is neither magic nor perfect. But it also won't kill us all and won't kill as many jobs as some doomers say. But that was all not my point here. Here, I just pointed out that one must close both eyes to be unable to see that there are people using AI in helpful ways, and so it can't be "all crap". I respect all criticism and share some of it, but I can't go along with people who argue as if they don't see the benefits others (maybe not themselves) are getting from the tools.
> Do you have some sort
Re: (Score:2)
Re: (Score:2)
SlashGPT didn't detect a dupe (Score:1)
I'm pretty sure there was already a story about this or similar roughly 2 months ago.
Re: (Score:2)
You're thinking about something else.
A few months ago there was Bitnet, which can replace dot products with binary operations by using -1, 0, 1 weights. This one is about using integers in an efficient way.
AI: Claiming Cures--Causing Problems (Score:2)
Regression testing (Score:2)
I presume that the new math comes back with answers that are similar, but not identical to, the traditional floating-point algorithms. How will these small differences affect the output of AI models? For example, will it cause answers to cluster around a smaller set of distinct results, in the way that digital audio recordings are more reproducible but less "warm" than analog recordings? It might be difficult to measure how the changes affect the final output.
Is this truly impossible to implement? (Score:2)
Meaning if you had firmware level access to a device, I wonder if their underlying engines could be nudged to attempt this. Probably depends where in the process the change is and how much is hard coded to happen.
"Perfect is the enemy of good" (enough)
This is the kind of stuff I'd hope an AGI might notice and mention.
"impossible on current hardware" (Score:2)
For clarity "impossible on current hardware" meaning without making custom hardware...
Nuclear power plant anybody? (Score:2)
Re: (Score:2)
The average person has less than five fingers. So going closer to the average from a six finger hand is beneficial. Precision allows for overfitting.
Seriously: In the best case you would have far fewer weights, but know the right numbers for them. Since that problem is NP hard, you use enough weights for a good approximation. And if you need more weights, you can use less precision, because much of the precision is only capturing noise. It's like saving images with fewer colors (and no dithering) can remove
Re: (Score:2)
It's a similar idea. Ternary would only require even simpler operations, but one does not know if this approach scales better. Both will only really thrive on specialized hardware.