• TommySoda@lemmy.world · 4 months ago

    That’s so unnecessary. The way we are going about “AI” is like brute-forcing a password. Sure, you’ll get the job done, but it’s the least efficient way to do it. I’m not saying I have the answer, but why not try to find a more efficient way to get the same results instead of building unnecessarily powerful PCs? I can only imagine how much redundancy is present in those LLM black boxes. It’s just a word calculator.

      • Ptsf@lemmy.world · 4 months ago

        Also going to go out on a limb and say both is better. Plus, AMD being a compute company has a… ahem… vested interest?… in the further utilization of ever-increasing amounts of compute.

        • IsThisAnAI@lemmy.world · 4 months ago

          You definitely sound educated and certainly not like a novice just bitching. In terms of AI research, which methods, whether currently explored or not, do you feel need more direct investment?

          • Blue_Morpho@lemmy.world · 4 months ago

            Look man, he said he doesn’t understand anything. Why don’t you just accept that everyone working in AI is stupid and that there’s a completely better way to do everything?

            /s

        • Irremarkable@fedia.io · 4 months ago

          Man, I sure do wonder which one is going to show visible returns sooner: fundamentally reworking how the models work, or simply duct-taping more processing power onto them?

          Obviously the brute-force method is going to show the most returns immediately; you’re just throwing more resources at it. Efficiency gains take time. While it’s absolutely a much bigger deal with AI, that’s pretty much the path all these things have taken: crypto mining, ray tracing, 3D graphics, hell, even all the way back to 2D graphics.

          There’s no magic “make it run more efficiently” button.

          • TommySoda@lemmy.world · 4 months ago

            While I agree with you to a certain extent, these technologies always take different paths depending on those priorities. The thing with 3D and 2D graphics was that developers were working with limited technology. In fact, I would even use that as an argument against just “building a better machine.” Back then they had to make software work within the limitations of the hardware; you couldn’t just duct-tape two SNESes together and get better performance. They had to be efficient or have no product to release at all. Nowadays you can just buy more computing power. Even when it comes to graphics, plenty of companies release unoptimized software onto the market because the consumer can just “build a better machine.” Crypto has so much unnecessary redundancy that almost all of the computations get thrown out the window, while only one computer gets to add to the blockchain and collect the reward for the actual mining.

            Those older industries had more limitations than we do, so they had to make things as efficient as possible. Now we have so much computing power that there is no incentive to make things more efficient, save for long-term viability, which none of these companies give a shit about as long as they are making money. I’m not saying they need to hit the magic “efficiency” button. I’m just saying they’re lazy and making everyone else pay the price.

    • Blue_Morpho@lemmy.world · 4 months ago

      I’m not saying I have the answer, but why not try to find a more efficient way

      ??? You don’t understand the problem, yet you claim there’s a better way that everyone has missed until now?

      Well, sure. That applies to everything.

      “Planes are so unnecessary, why haven’t they found a better way.”

      “CPUs have billions of transistors, why haven’t they found a better way.”

  • Alphane Moon@lemmy.world (OP, mod) · 4 months ago

    My question would be: why do I need to run “30B parameter models at 100 tokens per second” on my PC?

    I understand the benefits of running things locally, but why not just use Google’s or OpenAI’s LLM? You shouldn’t be sharing sensitive information with such tools in the first place, so that leaves low-impact business queries and random “lifestyle” queries. Why wouldn’t I use cloud infrastructure for such queries?

    • c10l@lemmy.world · 4 months ago

      I understand the benefits of running things locally, but why not just use Google’s or OpenAI’s LLM?

      I understand the benefits of cutting down on sugar, but why not just binge on cake and ice cream?

      Sounds like you don’t understand the benefits of running things locally, specifically LLMs and other kinds of AI models.

        • b34k@lemmy.world · 4 months ago

          If you’re doing it locally, more sensitive queries become OK, because that data never leaves your computer.

          • c10l@lemmy.world · 4 months ago

            Even when you’re not sending data that you consider sensitive, it’s helping train their models (and you’re paying for it!).

            Also what’s not sensitive to one person might be extremely sensitive to another.

            Also something you run locally, by definition, can be used with no Internet connection (like writing code on a plane or in a train tunnel).

            For me as a consultant, it means I can generally use an assistant without worrying about privacy policies on the LLM provider or client policies related to AI and third parties in general.

            For me as an individual, it means I can query away at the model without worrying that every word I send it will be used to build a profile of who I am that can later be exploited by ad companies and other adversaries.

    • j4k3@lemmy.world · 4 months ago

      Running bigger models makes a huge difference. I most like to run a quantized 70B or 8×7B. With these large models it is far easier to access their true depth, with less momentum required to find it.

      The issue is not memory bandwidth in general. The primary bottleneck is the L2-to-L1 bus width. That is the narrowest point, and it is designed for the typical pipeline of a sequential processor running at insanely fast speeds that are not possible if things get more spread out. The issues are more like radio design than typical electronics: the route lengths, capacitance, and inductance become super critical.

      It will make a major difference just to add the full AVX instruction set to consumer processors. Those instructions are already present but fused off, or simply not listed in the microcode in some instances. The full AVX instructions are not used because you would need a process scheduler that is a good bit more complicated. These kinds of complex schedulers already exist in some mobile ARM devices, but for simpler types of hardware and software systems, and without the backwards compatibility that is the whole reason x86 is a thing. The more advanced AVX instructions do things like loading a 512-bit-wide word in a single instruction.

      Hardware moves super slowly, like 10 years for a total redesign. The modular nature of ARM makes it a little easier to make some minor alterations. The market shifted substantially with AI a year and a half back, but everything we are seeing right now was already in the pipeline long before the AI demand. I expect the first real products to ship in 2 years, another 3-4 years before any are worth spending a few bucks on, and in ~8 years hardware from right now will feel as archaic as stuff from 20-30 years ago.

      Running the bigger model makes a huge difference. Saying it will run at those speeds is really a statement about agents and augmented generation. Right now I can run my largest models with text streaming barely faster than my reading pace, and I can’t do extra stuff with a model like that because the entire text has to be finished before it can be sent to other code or models for further processing. Speed like that means I could do many things, like text to speech, speech to text, augmented data retrieval for citations, function calling in code, and running several models for different tasks, all while keeping a conversational pace.
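
      A minimal sketch of that kind of local streaming loop, assuming the llama-cpp-python bindings and a quantized GGUF file on disk; the file name, prompt, and settings below are placeholders, not anything specific. It shows why downstream steps have to wait for the full output, and why higher tokens-per-second makes chaining things like text to speech or citation lookups practical at a conversational pace.

      ```python
      # Minimal sketch: stream tokens from a local quantized model with
      # llama-cpp-python (assumed installed), then pass the finished text on.
      # The model path and settings are placeholders.
      from llama_cpp import Llama

      llm = Llama(
          model_path="models/70b-q4_k_m.gguf",  # hypothetical local GGUF quant
          n_ctx=4096,                           # context window
          n_gpu_layers=-1,                      # offload as much as fits to the GPU
      )

      prompt = "Explain why memory bandwidth limits local LLM inference."
      pieces = []

      # Tokens arrive chunk by chunk; reading-pace streaming is fine for chat,
      # but any follow-up step (text to speech, citation retrieval, another
      # model) has to wait for the complete output, which is why ~100 tokens/s
      # changes what is practical to chain together.
      for chunk in llm(prompt, max_tokens=256, stream=True):
          piece = chunk["choices"][0]["text"]
          pieces.append(piece)
          print(piece, end="", flush=True)

      full_text = "".join(pieces)
      # full_text can now be handed to other code or models for further processing.
      ```

      With a fast enough model, the same loop could feed each chunk straight into a text-to-speech or retrieval step instead of waiting for the join at the end.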

    • TheGrandNagus@lemmy.world · 4 months ago

      Maybe not everyone wants to hand Google or OpenAI their data or custom?

      And this hardware will no doubt be used for more than that anyway.

      • Alphane Moon@lemmy.world (OP, mod) · 4 months ago

        Sure, but I prefaced my statement by saying that I am only looking at the subset of low-impact business queries and random “lifestyle” queries.

        • Blue_Morpho@lemmy.world · 4 months ago

          There are no low-impact queries. Wouldn’t you prefer to use Google without it data-mining you for correlations to predict your life?

          Does it not bother you that Google knows more about you than you know about yourself because it records every restaurant you eat in, everything you buy (tap pay), every question you have?

    • stoy · 4 months ago

      Unless you are a researcher, you probably won’t.