Researchers Claim New Technique Slashes AI Energy Use By 95% (decrypt.co) 115
Researchers at BitEnergy AI, Inc. have developed Linear-Complexity Multiplication (L-Mul), a technique that reduces AI model power consumption by up to 95% by replacing energy-intensive floating-point multiplications with simpler integer additions. This method promises significant energy savings without compromising accuracy, but it requires specialized hardware to fully realize its benefits. Decrypt reports: L-Mul tackles the AI energy problem head-on by reimagining how AI models handle calculations. Instead of complex floating-point multiplications, L-Mul approximates these operations using integer additions. So, for example, instead of multiplying 123.45 by 67.89, L-Mul breaks it down into smaller, easier steps using addition. This makes the calculations faster and uses less energy, while still maintaining accuracy. The results seem promising. "Applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by element wise floating point tensor multiplications and 80% energy cost of dot products," the researchers claim. Without getting overly complicated, what that means is simply this: If a model used this technique, it would require 95% less energy to think, and 80% less energy to come up with new ideas, according to this research.
The algorithm's impact extends beyond energy savings. L-Mul outperforms current 8-bit standards in some cases, achieving higher precision while using significantly less bit-level computation. Tests across natural language processing, vision tasks, and symbolic reasoning showed an average performance drop of just 0.07% -- a negligible tradeoff for the potential energy savings. Transformer-based models, the backbone of large language models like GPT, could benefit greatly from L-Mul. The algorithm seamlessly integrates into the attention mechanism, a computationally intensive part of these models. Tests on popular models such as Llama, Mistral, and Gemma even revealed some accuracy gain on certain vision tasks.
At an operational level, L-Mul's advantages become even clearer. The research shows that multiplying two float8 numbers (the way AI models would operate today) requires 325 operations, while L-Mul uses only 157 -- less than half. "To summarize the error and complexity analysis, L-Mul is both more efficient and more accurate than fp8 multiplication," the study concludes. But nothing is perfect, and this technique has a major Achilles' heel: it requires a special type of hardware, so current hardware isn't optimized to take full advantage of it. Plans for specialized hardware that natively supports L-Mul calculations may already be in motion. "To unlock the full potential of our proposed method, we will implement the L-Mul and L-Matmul kernel algorithms on hardware level and develop programming APIs for high-level model design," the researchers say.
This is not a new Technique (Score:1)
Re:This is not a new Technique (Score:5, Funny)
Re: (Score:2)
That's kind of the point of sending them to college.
Re: (Score:2)
It's not their fault. They stopped teaching computer science early in the 'dot com' era, turning CS programs into four-year-long programming boot camps. Things continued to degrade to the point that there is now a taboo against actually writing code.
Re:This is not a new Technique (Score:5, Interesting)
Remember fractint? It allowed computing a bunch of different types of fractals using integer operations, making it dramatically faster than the classic floating-point implementations. Back when a naive implementation of the Mandelbrot fractal took 2h on a 286, fractint rendered almost instantly.
Tables (Score:2)
Even with FP8, you just need to generate (once) a 64k-entry table of results, and then there's no CPU/FPU FP math at all to do the two-operand "multiplication." Put 8-bit operand A in the most significant 8 bits and 8-bit operand B in the least significant 8 bits, and that gives a direct 16-bit index to the answer in the 64k table.
The significant cost of FP8 versus FP3 or FP4 is in the s
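A minimal sketch of the table-lookup idea above, in Python. The FP8 decoder here assumes a simplified E4M3-style layout (1 sign, 4 exponent, 3 mantissa bits, bias 7) and ignores NaN/Inf; all the names are mine and this is illustrative, not a production implementation.

```python
# Precompute every FP8 x FP8 product once, then "multiply" by indexing with
# the two raw bit patterns -- no FP math at lookup time.

def fp8_to_float(bits: int) -> float:
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0:                                  # crude subnormal handling
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# 64k-entry table: index = (opA << 8) | opB, value = the product.
TABLE = [fp8_to_float(a) * fp8_to_float(b) for a in range(256) for b in range(256)]

def fp8_mul(a_bits: int, b_bits: int) -> float:
    """'Multiply' two FP8 operands with a single table lookup."""
    return TABLE[(a_bits << 8) | b_bits]

print(fp8_mul(0x44, 0x3C), fp8_to_float(0x44) * fp8_to_float(0x3C))  # both 4.5
```

The table holds 65,536 precomputed products, so a "multiplication" becomes a single indexed load.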
Re: (Score:3)
64k entries is still pretty sizable for a cache that you need to have local to every part of the current layer being processed (remember that you have thousands of cores working at once). If the access time is more than several times that of a register, then the above algo is probably going to be faster.
On the other hand, if you're talking FP4, a 256-entry table (heck, should we even be calling it a cache? Might as well just be etched directly into the silicon) is really easy to have local everywhere.
A thought:
Re: (Score:2)
Re: This is not a new Technique (Score:3)
Re: This is not a new Technique (Score:3)
Re: (Score:1)
Indeed. Most of the hype is now basically a scam.
Re:This is not a new Technique (Score:5, Interesting)
To be clear: This is NOT fixed point arithmetic in some linear processor. Those techniques didn't go out of fashion on a lark, but because they became slower than floating point math, so you were slowing yourself down in order to make your math less accurate.
There's several things going on here that push the equation back in the direction of integer math. The first is that we're already dealing with greatly reduced precision. The article talks about FP8, but many people run inference with even less, as little as 3-4 bit precision (which is such a low bit precision that it makes more sense to think of it as a lookup table of exponentially growing values than as actual floating point math). So it's not a question of whether to give up precision, but rather, by what means to give up precision, and how much.
Secondly, this is not fixed point. If there's any analogies, it's to the old "fast inverse square root" trick that was beloved by game developers for several years. That is to say, it's still working with floating, not fixed, point numbers, with exponents and mantissas, but relies on approximating away part of the math.
The multiplication of x and y starts off (treating mantissas as their float fractions) as:
(1 + xm) * 2^xe * (1 + ym) * 2^ye
Where m is mantissa and e is the exponent, of each component x and y. Remember that the mantissa is a value from 0 to 1. This can be mathematically expanded to:
(1 + xm + ym + xm * ym) * 2^(xe + ye)
The 2^(xe + ye) is not a problem because multiplying by 2^a can be done with just a bitshift. The lag in multiplication is in the mantissa, because the amount of work needed to do it is O(N^2) with respect to the number of bits.
To get rid of that problematic xm * ym, and remembering that xm and ym (when thought of as floats) are less than 1 so their product is of a smaller order than xm + ym, they instead change the equation to:
(1 + xm + ym + magic number) * 2^(xe + ye) ... where the magic number chosen depends on the number of mantissa bits. The purpose of this magic number is not to get the exact value of the product right, but to keep the result in roughly the right "order."
A normal floating point operation proceeds (apart from the sign bit):
1) Calculate (1+xm) * (1+ym)
2) Separate that into the carry and the fractional part (1 + new mantissa)
3) Store the new mantissa (the right side / least significant bits of the floating point number).
4) Calculate the new exponent as xe + ye - offset + carry
5) Store it as-is to the left of the mantissa.
Their version however lets you do:
1) Calculate the mantissa as xm + ym + magic number ... and (since it's all one big integer with the exponent on the left of the mantissa), the carry happens automatically into the exponent during the addition.
2) That's it -- there is no step 2.
So it's a far more direct operation.
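A minimal sketch of that "add the packed fields and let the carry spill into the exponent" flow, in Python, assuming a toy 7-bit exponent|mantissa layout (4 exponent bits, 3 mantissa bits, sign ignored) and an illustrative MAGIC constant -- not the paper's exact kernel or offsets.

```python
# Approximate a floating-point multiply with one integer add: mantissas add,
# any mantissa carry overflows naturally into the exponent field, and a small
# magic constant stands in for the dropped xm*ym term.

EXP_BITS, MAN_BITS, BIAS = 4, 3, 7
MAGIC = 1                         # one mantissa LSB; illustrative choice only

def approx_mul(a_bits: int, b_bits: int) -> int:
    """Approximate multiply of two packed (exponent|mantissa) values."""
    packed = a_bits + b_bits + MAGIC      # mantissas add; carry enters exponent
    packed -= BIAS << MAN_BITS            # both exponents were biased; drop one bias
    return packed & 0x7F                  # keep 7 bits; no overflow handling

def decode(bits: int) -> float:
    exp = (bits >> MAN_BITS) - BIAS
    man = bits & ((1 << MAN_BITS) - 1)
    return (1.0 + man / (1 << MAN_BITS)) * 2.0 ** exp

a = b = (8 << MAN_BITS) | 6               # both encode 1.75 * 2^1 = 3.5
print(decode(approx_mul(a, b)), decode(a) * decode(b))
```

Running it prints the approximate result next to the exact one (13.0 vs 12.25 for these inputs), which shows both the mechanism and the kind of per-multiply error being traded away.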
So no, this isn't fixed point math. But it does have some "echoes" of certain earlier tricks used. And it does stress the need to potentially have some rethinks to how we do things in hardware when dealing with such low bit precisions, where you're guaranteed a lot of inaccuracy anyway. Like, if we go down to even smaller quantizations, maybe the entire process should just be a lookup table.
And then you might take it even further. Instead of summing thousands of numbers in a vector, perhaps we should break with tradition even further and go analog? E.g. have each weighted value in the vector control a variable resistor and output all the currents into a shared wire, then read the current back into a number on the other end to get the sum. Again, our precision is going to be bad to begin with, and flash attention is already ruining our determinism anyway, so what's the harm? And hey, if we're outputting analog, then why not do our weights and activation functions in analog as well, with a transistor-based exponential current source? You probably can't train a traditional network like this, but I see no reason why you can't run inference like this.
Re:This is not a new Technique (Score:4, Interesting)
> Instead of summing thousands of numbers in a vector, perhaps we should break with tradition even further and go analog?
There have been attempts in this direction, implementing neural networks in analog electronic hardware, for literally decades and a number of failed startups.
I believe a major problem is fabrication differences: one chip is different from another and you can't replicate results. So a net you train on one chip will not work properly on another. That creates so much of a human burden (figuring out how to recalibrate, if possible, between hardware units) that they are not feasible products.
If it's inference only, then it's going into a production situation and needs to be cheap; moreover, in something like that (e.g. automotive) it needs to be mass-producible and reliable and not require expert fiddling.
Re: (Score:2)
That argument doesn't stand. We haven't even implemented Transformers on ASICs yet, let alone any sort of nontraditional hardware. It's not some technological barrier that's kept us on GPUs, but rather, the combination of the size of the market (now finally getting large) times the risk of developing hardware for an architecture that's obsolete by the time it's launched.
(A) currents add
I'm surprised (Score:3)
I thought CPUs were already made to do floating operations as effectively as possible, so what makes it different for LLM?
Makes it sound as if instead of writing float multiplications in C code, it's faster to translate it into integer calculations.
Yes, I do know sometimes it is faster to convert to whole numbers and then convert the result back to float at the end.
Re: (Score:2)
Simple, float is still more effort than integer, and it is so in multiple dimensions. That will never change.
Re:I'm surprised (Score:5, Interesting)
L-Mul works by defining a new floating point format which is less capable than the existing, standard one. This works because the new format has been specifically tailored to match the needs of current AI tensor math. The existing standard float format has more features than AI requires. Since this new format is less complex, it can be implemented in fewer logic gates, which is where the power savings come from.
I suspect it would be more fruitful to increase efforts on converting existing models to quantized integer models and just stick with the existing hardware. This also massively lowers the power consumption by replacing floating point instructions with integer ones.
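For readers unfamiliar with what "quantized integer models" looks like in practice, here's a rough sketch of symmetric per-tensor int8 quantization in Python/NumPy -- my own illustration, not any particular toolkit's API; real pipelines add per-channel scales, zero points, and calibration.

```python
import numpy as np

def quantize_int8(t: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 values plus one float scale."""
    scale = max(float(np.abs(t).max()), 1e-12) / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4, 8).astype(np.float32)   # stand-in "weight" matrix
x = np.random.randn(8).astype(np.float32)      # stand-in "activation" vector

qw, sw = quantize_int8(w)
qx, sx = quantize_int8(x)

# The dot products run as integer multiply-adds (int32 accumulation);
# the two float scales are applied once per output, not once per multiply.
y_q = (qw.astype(np.int32) @ qx.astype(np.int32)) * (sw * sx)
print(y_q)        # approximately equal to the float result:
print(w @ x)
```

The point is that the inner loops become integer arithmetic, with the floating-point scaling paid only once per output element.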
Re: (Score:3)
I suspect the real purpose of all this is an attempt to prime the pump for specialized hardware sales - hardware someone affiliated with the researchers will soon release.
Re: (Score:3)
I suspect the real purpose of all this is an attempt to prime the pump for specialized hardware sales
Big tech will only buy it if it works.
If it works as promised, they'll make billions.
Re:I'm surprised (Score:4, Insightful)
I mean, it's obvious that we're going to be headed to custom hardware; it's kind of shocking to me that GPUs still dominate, even though they've been adding on some features targeted at NN processing. The greater AI compute needs grow, the more push there is going to be for custom hardware. The main thing that's been slowing it down (vs. say Bitcoin, which went to custom ASICs pretty quickly) is that the algos are still very much evolving.
Re: (Score:2)
Re: (Score:2)
Or maybe almost all new AI code is CUDA (and torch/tensorflow) and so it is best to use that instead of having your own software engineers reimplement all the libraries on them own.
The dominance of CUDA is not good, but it is understandable.
Re: (Score:2)
This is also why AMD wants to increase their market share to break this particular chicken-and-egg problem. Right now CUDA drivers always come first, with Vulkan either a little (weeks) to a lot (months or years) behind that. People who want to generate on their own hardware don't want to wait for drivers to be written, they want the best support possible, so they all buy nVidia. Certainly that's why I went that direction rather than saving ~20% with the equivalent AMD option. They're right that getting mar
Re: (Score:2)
Papers like this or Bitnet could also mean a huge chance for AMD. This is one architecture, but if the "next transformer" (that's what almost all LLMs are based on as well as a few image generators, and many recognition networks) would work on such a Bit or Integer architecture, the first one to build an efficient hardware for that could break the CUDA dominance.
Re: (Score:3)
What makes you think the top NVidia processors like the H100 and GH200 are not custom hardware tuned for NN processing? If one started from scratch, where are the main inefficiencies that would be solved, and at what cost? Would one invent something substantially better than what NVidia has now? It's doubtful.
The main use case for NN training does require highly programmable and flexible parallel processing like graphics computations do. BTC mining is a single computation but NN work is not as simply st
Re: (Score:2)
Compare inference on an H100 to Groq or Cerebras. And that's not even ASICs.
GPU architectures are structured for generalist computing. They're not optimized to train or do inference on Transformers. Adding say BF8 support to a GPU is not the same thing as having the hardware structured around executing Transformers.
Re: (Score:2)
GPU architectures are structured for generalist computing. They're not optimized to train or do inference on Transformers.
They are now, e.g. Grace Hopper was designed to improve performance for such workloads.
Re:I'm surprised (Score:4, Insightful)
converting existing models to quantized integer models and just stick with the existing hardware.
That's how Google's TPU already does it. It has 64k 8-bit integer multipliers.
I don't know how this new technique differs.
Re: (Score:3)
Using the implicit leading 1, a floating point number is stored as (s, m, e) representing (-1)^s (1+m) 2^e. The interesting part of multiplication is (1+m1)(1+m2)=(1+m1+m2+m1 m2). They approximate this as (1.0625+m1+m2). I assume that they work on the basis that the system is robust enough that the error doesn't matter, because naively I would think that the way to optimise it would be to do the multiplication m1 m2 in lower precision (e.g. using the leading 4 bits of each mantissa).
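A quick numeric check (mine, not from the paper) of how far the (1.0625 + m1 + m2) shortcut above strays from the exact (1 + m1)(1 + m2), sweeping both mantissas over a 4-bit grid. Carry/normalization effects are ignored, so this measures only the raw formula error.

```python
# Compare the approximation (1.0625 + m1 + m2) with the exact (1 + m1)*(1 + m2)
# over all 4-bit mantissa fractions (steps of 1/16), and report relative error.
grid = [i / 16 for i in range(16)]
rel_errs = [abs((1.0625 + m1 + m2) - (1 + m1) * (1 + m2)) / ((1 + m1) * (1 + m2))
            for m1 in grid for m2 in grid]
print(f"max relative error:  {max(rel_errs):.3f}")   # worst case near m1 = m2 = 1
print(f"mean relative error: {sum(rel_errs) / len(rel_errs):.3f}")
```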
Re: (Score:3)
I suggest that you take a look at the FP8 and FP4 formats which are currently the "big" thing in AI. Everything you thought you knew about floating point numbers is out the window. Basically they are lookup tables to a limited range of numbers; 256 and 16 respectively. There is also no NaN etc. in FP4.
Re: (Score:2)
FP4 is great for getting everyone to the same result as fast as possible!
Because of the lack of variety in output, I see FP4 as a toy not even fit for mass consumer use, although it may be closer to acceptable there than I give it credit for. Does it really matter if you generate a birthday card that looks just like someone else's if you'll never meet them?
Re: (Score:2)
I regularly run Q4 models. They work great.
ANNs are inherently extremely noise tolerant, as they're in effect fuzzy logic engines. What you lose in precision by going from FP16 to FP8, or FP8 to FP4, you gain in terms of doublings of the total number of weights and biases. You get less precision per "question" / less superposition of "questions" that are in effect asked and answered by a given neuron, but in exchange, you get twice as many neurons. And that's often a very good tradeoff.
Re:I'm surprised (Score:5, Informative)
Well I've skimmed the paper. From what I can tell, the method approximates floating point multiplication, using almost only addition. This is possible because floating point is a semi-logarithmic format, so adding the exponents is a good part of multiplication.
Basically, an fp number is (1+x/M)*2^e
x is an integer with 0 <= x < M, M is a constant (a power of 2), e is an integer. If you multiply 2 fp numbers (x,e and y,f) you get:
(1 + x/M + y/M + x*y/M^2) * 2^(e+f)
there's some sign bits and extra fiddling, so if the bracketed part exceeds 2 you need to knock a bit off and add 1 to the exponent, but fundamentally the cost (according to them) is dominated by the x*y integer multiplication there.
The paper more or less proposes simply discarding the x*y term and replacing it with a constant.
Then FP multiplication becomes "smoosh two numbers together in a nonlinear way that's a bit like multiplication", the latter being much much cheaper.
This to me seems pretty plausible.
Re:I'm surprised (Score:4, Interesting)
I suspect it would be more fruitful to increase efforts on converting existing models to quantized integer models and just stick with the existing hardware. This also massively lowers the power consumption by replacing floating point instructions with integer ones.
The paper mentions quantization in its related works section but doesn't elaborate on why the paper's ideas are better.
Of course, the big misdirection from the paper is that they talk about energy savings for compute but not for the entire processor. Compute, i.e., the ALU-ish part, is a small part of the total chip energy usage, and the paper's idea isn't even talking about energy for the entire ALU-ish part but a fraction of that.
Re: (Score:2)
Compute, i.e., the ALU-ish part, is a small part of the total chip energy usage
That depends on how much cache is onboard. If there's a lot of cache, that's true. If there's little cache, it isn't...
Re: (Score:2)
> I suspect it would be more fruitful to increase efforts on converting existing models to quantized integer models and just stick with the existing hardware. This also massively lowers the power consumption by replacing floating point instructions with integer ones.
This is already done. And it doesn't lower the power consumption that much.
It's useful for using already trained nets, but not for training where the dynamic range of floating point is essential. There is lots of existing work in the literat
Re: (Score:2)
I thought CPUs were already made to do floating operations as effectively as possible
These calculations are not run on CPUs.
AI has been using GPUs but is increasingly using custom silicon.
so what makes it different for LLM?
Much lower precision for starters. CPUs generally lack native support for low-precision formats like FP16 or FP8.
Re: I'm surprised (Score:2)
Inference ("running an ai model") is much less expensive than training one, LLMs seem able to run with somewhat tolerable performance on the CPU of a Raspberry Pi 5. As far as I understand they're quantized down from the original FP32?
For training they still need to run FP32
Re: (Score:2)
That's right. You really only need that level of precision during training. After that, it's just (a lot of) wasted space.
Re:I'm surprised (Score:4, Interesting)
I thought CPUs were already made to do floating operations as effectively as possible
I think it depends on the CPU.
I remember that PowerPC used to be faster with floating point operations than integer ones and Apple would occasionally suggest converting to floats from ints for various array operations. Conversely, when Apple switched to Intel, the opposite was true and Apple changed their suggestion.
Flashback (Score:5, Insightful)
So, from the explanations given by esteemed fellow nerds,
simply put, they pulled a John Carmack.
Re: (Score:2)
Re: (Score:2)
I thought CPUs were already made to do floating operations as effectively as possible, so what makes it different for LLM?
Their claim of floating-point vs. integer operations is pure nonsense. What they have done is replace (1 + x)(1 + y) with (1 + x + y) instead of (1 + x + y + xy).
This is a major loss of precision. But it seems that a multiplication isn't actually needed, just some operation that grows if either operand grows.
If we use a two-bit mantissa, after we add the leading 1 bit, we actually process a 3-bit x 3-bit product. Multiplication turns this into 9 single-bit products; one product is 1, four products are exis
Only binary is the future for AI (Score:3, Funny)
it reduces the complexity of integers even more and thus saves 4000% more energy.
You could do your training on a TI-84 running on a battery.
Re: (Score:2, Funny)
Oh look, it's mister fancy pants over here with his TI-84. In my day we did it the real way, raw-dogging it with a TI-83.
Re: (Score:3)
Oh look, it's mister fancy pants over here with his TI-84. In my day we did it the real way, raw-dogging it with a TI-83.
Amateurs. The only real way is a Casio calculator watch!
Re:Only binary is the future for AI (Score:5, Funny)
*Clacks in abacus*
Re: (Score:2)
Funniest so modded, but the story was a richer target.
Now to the second search of discussion for references to the low power PoC. The actual human brain is rated at 35 W. A Thousand Brains by Jeff Hawkins is still commended.
Re: (Score:2)
Yep. Why do vector math if you can just let physics multiply and sum your signals for you? :)
Metabolism is an extremely inefficient way to power a "computer", and our wetware has a lot of parasitic overhead costs, but passive analog mechanisms are so much vastly more efficient than digital calculations that it outweighs everything else by large margins.
Re: (Score:2)
There's an abacus (soroban) school about 2 minutes walk from here.
We live in a satisfactory sim (Score:2)
Any improvement for consumer h/w? (Score:2)
I read the preprint. I didn't quite catch whether this algorithm would deliver speed, efficiency or energy savings if implemented on existing hardware, like consumer CPUs and GPUs. The paper ends with "we will implement hardware and API". Do they mean that existing hardware beats it unless hardware is built specifically with their algorithm in mind?
Re: (Score:2)
It is hype. Expect most of the claims to be empty and the rest to be misleading.
Re: (Score:2)
I didn't quite catch whether this algorithm would deliver speed, efficiency or energy savings if implemented on existing hardware
No. It requires new silicon.
like consumer CPUs and GPUs.
Nobody uses consumer CPUs and GPUs for AI anymore. H100 tensor core GPUs are used, but most big tech companies are developing their own custom silicon designs.
Re: (Score:2)
like consumer CPUs and GPUs.
Nobody uses consumer CPUs and GPUs for AI anymore. H100 tensor core GPUs are used, but most big tech companies are developing their own custom silicon designs.
Everything bounces. There are people who prefer to run local only. Then you can't expect high end, brand new hardware for inference.
Like Microsoft's CoPilot+ stuff that expects an NPU. No idea if this is just a deal between Qualcomm (and Intel) and Microsoft to try to sell more of their hardware by saying 'we do AI too!!!' while excluding the other obvious options (CPU and GPU), which is what they seem to have chosen to do.
Re: (Score:2)
Where's the news? (Score:2, Informative)
Using integer maths instead of floating point maths, where you want a similar result but don't need as high precision, has been done for multiple decades already and is a common thing to do in e.g. resource-constrained embedded devices. Applying it to AI models doesn't magically make it a new thing, and I seem to recall having seen several articles over the years from other groups on doing it to AI models/training before this.
This just smells like an attempt at hyping things up in the hopes of grant money or
Re: (Score:2)
Did you RTFP?
Re: (Score:3)
Re: (Score:2)
That doesn't apply to NNs as a whole, though. A single neuron isn't calculating someone's odds of cancer; it's a huge number of neurons acting together. Each neuron in effect asks and answers a superposition of questions about one particular aspect of the problem. Reducing the FP precision reduces how much superposition there can be and/or how much nuance there can be in the answer of its questions, but at the same time buys you more neurons - a lot more total questions about a lot more aspects.
Re: (Score:2)
Not true (Score:3)
Re:Not true (Score:5, Interesting)
I don't think you quite understand how insanely energy intensive AI currently is. Let's say you have a system with 2000 H100 cards from NVidia. You are going to need something on the order of a 3MW grid connection, and your energy bill is going to be millions of dollars a year. Even a 10% reduction in the overall energy requirements is huge. I work in HPC and thought I was abreast of stupidly high power requirements and cooling; then we had to start doing AI, and oh boy was I mistaken.
Excellent maths (Score:2)
Now their claim seems to be that there was never any reason to calculate a product in the first place (floating point is a red herring, this is about calculating products), but that for LLMs a product is not needed, just an operation that produces larger numbers from larger inputs. And of course adding the sums in fixed-point arithmetic (which is what they are effectively doing) will be using less power.
Excellent maths (without Slashdot messing up my po (Score:2)
If you can't guess it, that was supposed to read "without Slashdot messing up my post" but Slashdot messed up the title as well.
So their invention is that to calculate a product x times y, where x = 1 + a for 0 <= a < 1 and y = 1 + b for 0 <= b < 1, it is Ok to calculate 1 + a + b instead of the correct 1 + a + b + a*b. This will give results that are always too small, by up to 25%.
Now their claim seems to be that there was never any reason to calculate a product in the first place (floating point is a red herr
RISC-V (Score:5, Interesting)
Given the past decade of arduous RISC-V infrastructure work I wouldn't be surprised to see China tape out L-Mul arch samples before New Year and have massive clusters of low-E AI in production by next summer, leaving sanctions on nVidia chips in the dust.
Definitely a potential game changer, though it's hard to imagine a lab in China isn't already doing similar work, given the momentum in the field (I think I left a similar comment last week about error-tolerant low-E AI machines).
This won't cut energy demand, though - just make it cheaper to provide which will increase demand.
Re: (Score:2)
Given the past decade of arduous RISC-V infrastructure work I wouldn't be surprised to see China tape out L-Mul arch samples before New Year and have massive clusters of low-E AI in production by next summer, leaving sanctions on nVidia chips in the dust.
Since China doesn't have advanced process technology and isn't likely to get it anytime soon as ASML only managed to create their latest machines through a multinational effort, and nVidia and everyone else is also free to implement such techniques, China will still be behind by the same amount as before.
something we aren't being told (Score:2)
I saw this removing multiplication in a paper many months ago. Considering that chipmakers are *beyond* antsy to find an advantage in LLM compute, what would stop them from building this hardware? There's something we aren't being told, and 0.07% drop in performance ain't it.
Re: (Score:2)
I think I know the paper you were talking about. But it was for ternary networks, not FP8.
Re: (Score:2)
You're probably thinking of this one [arxiv.org]
I would have guessed this one [arxiv.org]
it was for ternary networks, not FP8.
The goal is to get rid of expensive multiplications, not specifically to do cheap FP8 multiplications.
Re: (Score:2)
It very much is to do cheap FP8 multiplications. The numbers are still in FP8 form. They're just multiplying them in a method that relies on integer math. They expand out the multiplication equation, find that the biggest delay is for a low-order floating point multiplication in the mantissa that doesn't hugely affect the output, replace it with a constant magic number, add everything in the mantissa as integers, and let the carry bit overflow into the exponent. They still have an exponent and a mantissa
Re: (Score:2)
I see you're having your own personal conversation. Let me know when you want to participate in this one.
8-bit values? (Score:2)
Re: (Score:2)
Bit sizes have been trending down, not up. I commonly run FP4 models.
Basically, yes, you lose precision with floats half the size, but it means you get twice as many parameters, and when it comes to the precision loss from FP16 to FP8, having double the parameters is a no-brainer "yes, please!" choice.
The important thing to understand is that NNs are already dealing in fuzzy math, and are thus, highly resilient to noise. The noise-tolerant nature of LLMs should be obvious to anyone who's ever used one. They t
This is not multiplication (Score:2)
It doesn't matter what they call it, it is not multiplication. It is a very rough approximation to multiplication. It replaces calculating (1 + a) * (1 + b) with calculating 1 + a + b, leaving an error of ab. With a two-bit mantissa, they find 1.75 * 1.75 = 2.5 instead of 3 1/16, which would be rounded to 3. For a three-bit mantissa the worst case would be 1.875 * 1.875 = 2.75 instead of 3.515625, rounded to 3.5.
Re: (Score:2)
1) They don't leave an error of AB, they add in a magic number to approximate typical values of AB.
2) They're dealing with FP8 (and in some cases as low as *FP3*). There's already huge error.
Why should I care? (Score:2)
They're not going to use less electricity, they'll just build 20 more of the useless pieces of shit. Anybody with the money to build this kind of system is ideologically opposed to doing anything interesting or productive with it. They just see a way to burn oil faster and flood every last scrap of human existence with advertising surveillance.
Data point and question (Score:2)
Back when GPU-based crypto mining was a thing, AMD's cards were significantly faster than NVIDIA's at integer operations.
I wonder if this will shift some of the GPU business back to AMD.
environmentalists beware (Score:1)
Beware of. . . "unforeseen consequences."
technical improvements in efficiency tend to increase consumption and thereby defeat any purported benefits of economization. this is called "Jevons paradox" or "Jevons law" by some.
the actual benefit is that it will now be easier to train neural nets (good!), and that this will increase usage and thus total energy consumption.
this is also good since it provides an impetus to go nuclear, which will itself increase energy consumption. but do
So what do we do with... (Score:2)
So what do we do with all those nuclear reactors the major players are working to restart?
Re: (Score:2)
Their proponents are hoping to Jevon it. I'm dubious.
I can reduce it by 100% easily (Score:2)
Re: (Score:2)
Yeah, people are investing billions because it's all crap.
I don't believe that you really believe it's all crap. While it is overhyped and not every use is useful for everyone, such a generic statement is obviously untrue.
Re: (Score:2)
It's crap, and I'm FAR from the only person who says that, so seethe harder.
Re: (Score:2)
There are literally millions of people who find it useful. You may not like it. You do not have to use it. It may be hyped. Investments may be a bubble or not. But telling "it is all crap" is either extremely ignorant or a bad faith argument, because it is impossible not to see where these things find application and how many people use them productively. As said, you do not need to use them or like them at all. But you need to recognize that others do and they increase their productivity or do things they
Re: (Score:2)
Re: (Score:2)
Are you serious with your post? Never mind, I don't think the discussion will lead anywhere, because you not only choose to dislike AI (that's fine), but also completely ignore the reality of the people around you (that's not good for a serious discussion).
Re: (Score:2)
Are there some limited applications where it's helpful? I guess. But overall it's too much hype and
Re: (Score:2)
It is neither magic nor perfect. But it also won't kill us all and won't kill as many jobs as some doomers say. But that was all not my point here. Here, I just pointed out that one must close both eyes to be unable to see that there are people using AI in helpful ways, and so it can't be "all crap". I respect all criticism and share some of it, but I can't go along with people who argue as if they don't see the benefits others (maybe not themselves) are getting from the tools.
> Do you have some sort
Re: (Score:2)
Re: (Score:2)
SlashGPT didn't detect a dupe (Score:1)
I'm pretty sure there was already a story about this or similar roughly 2 months ago.
Re: (Score:2)
You're thinking about something else.
A few months ago there was Bitnet, which can replace dot products with binary operations by using -1, 0, 1 weights. This one is about using integers in an efficient way.
AI: Claiming Cures--Causing Problems (Score:2)
Regression testing (Score:2)
I presume that the new math comes back with answers that are similar, but not identical to, the traditional floating-point algorithms. How will these small differences affect the output of AI models? For example, will it cause answers to cluster around a smaller set of distinct results, in the way that digital audio recordings are more reproducible but less "warm" than analog recordings? It might be difficult to measure how the changes affect the final output.
Is this truly impossible to implement? (Score:2)
Meaning if you had firmware level access to a device, I wonder if their underlying engines could be nudged to attempt this. Probably depends where in the process the change is and how much is hard coded to happen.
"Perfect is the enemy of good" (enough)
This is the kind of stuff I'd hope an AGI might notice and mention.
"impossible on current hardware" (Score:2)
For clarity "impossible on current hardware" meaning without making custom hardware...
Nuclear power plant anybody? (Score:2)
Re: (Score:2)
The average person has less than five fingers. So going closer to the average from a six finger hand is beneficial. Precision allows for overfitting.
Seriously: In the best case you would have far fewer weights, but know the right numbers for them. Since that problem is NP hard, you use enough weights for a good approximation. And if you need more weights, you can use less precision, because much of the precision is only capturing noise. It's like saving images with fewer colors (and no dithering) can remove
Re: (Score:2)
It's a similar idea. Ternary would only require even simpler operations, but one does not know if this approach scales better. Both will only really thrive on specialized hardware.