New Mistral model is out

The Hobbyist · edited 8 months ago

toothbrush@lemmy.blahaj.zone · 8 months ago

It doesn’t even have a license yet, or a blog post, announcement, etc. To me it currently looks like someone just leaked their model; time will tell …

rufus@discuss.tchncs.de · edited 8 months ago

They’ve done it that way for the previous models, too. I suppose it’s to add a bit of “mystery” around it and give people some riddle to solve.

Pennomi@lemmy.world · 8 months ago

Likely they’re trying to get in before Llama 3 drops, because I suspect that’s all people will talk about for a fair bit.

h3ndrik@feddit.de · 8 months ago

Probably not. They’ve done exactly this (dropping a magnet link out of the blue) for the third time or so in a row, so it’s more likely a deliberate scheme than a reaction to current happenings.

Fubarberry@sopuli.xyz · 8 months ago

281GB

That’s huge, I’m guessing we’ll need to use a giant swap file?

The Hobbyist · edited 8 months ago

You’re right, but the model is also not quantized, so it’s likely stored as 16-bit floats. If you quantize it, you get substantially smaller models that run faster, though they may be somewhat less accurate.

~~Knowing that the 4 bit quantized 8x7B model gets downscaled to 4.1GB, this might be roughly 3 times larger? So maybe 12GB? Let’s see.~~

Edit: sorry, those numbers were for Mistral 7B, not Mixtral. For Mixtral, the 4-bit quantized model is 26GB, so triple that would be roughly 78GB. Luckily, since it’s an MoE, not all of it has to be loaded onto the GPU simultaneously.

From what I recall, Mixtral only uses 13B parameters at once. Compare that to CodeLlama 13B, which is 7.4GB when quantized to 4 bits: triple that would be about 22GB, so it would require a 24GB GPU. Someone double-check in case I misunderstood something.

24GB GPUs include the AMD 7900 XTX and the NVIDIA RTX 4090 (desktop version, not mobile).
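As a rough cross-check of the size estimates above, here is a minimal Python sketch that scales a file size by bits per weight. It assumes the 281GB download really is stored as 16-bit weights (not confirmed anywhere in this thread), and the helper name `quantized_size_gb` is made up for illustration. Real 4-bit formats also store scaling metadata, so actual files come out somewhat larger than this naive ratio.

```python
def quantized_size_gb(size_gb, src_bits, dst_bits):
    """Naive estimate: file size scales linearly with bits per weight.

    Ignores quantization metadata (scales, zero points), so it is a
    lower bound rather than an exact figure.
    """
    return size_gb * dst_bits / src_bits


# Assumption: the 281GB release is stored as 16-bit floats.
full_size_gb = 281
estimate = quantized_size_gb(full_size_gb, src_bits=16, dst_bits=4)
print(f"Naive 4-bit estimate: {estimate:.2f} GB")

# Compare against the "triple the 26GB Mixtral file" guess from the comment.
tripled_mixtral_gb = 3 * 26
print(f"Tripled Mixtral estimate: {tripled_mixtral_gb} GB")
```

Both routes land in the 70 to 80GB range, so the two guesses are at least consistent with each other.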

Audalin@lemmy.world · edited 8 months ago

I thought MoEs had to be loaded entirely in the (V)RAM and the inference speedup was because you only need to use a fraction of layers to compute the next token (but the choice of layers can be different for each token, so you need them all ready; or keep moving data between the disk <-> RAM <-> VRAM and get reduced performance).
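To illustrate the point about per-token choices, here is a toy Python sketch of top-k expert routing. The expert count (8) and the number of experts used per token (2) are assumptions borrowed from Mixtral 8x7B, and random scores stand in for a learned router, so this is only an illustration of the access pattern, not how any real model computes its routing.

```python
import random

NUM_EXPERTS = 8  # assumed, matching the "8x" in Mixtral 8x7B
TOP_K = 2        # assumed number of experts consulted per token


def route(scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def experts_touched(num_tokens, seed=0):
    """Simulate routing for a run of tokens using random router scores."""
    rng = random.Random(seed)
    per_token = []
    for _ in range(num_tokens):
        scores = [rng.random() for _ in range(NUM_EXPERTS)]
        per_token.append(route(scores))
    # Union of every expert that any token selected.
    used = {i for choice in per_token for i in choice}
    return per_token, used


per_token, used = experts_touched(num_tokens=32)
print("First few tokens pick experts:", per_token[:4])
print(f"{len(used)} of {NUM_EXPERTS} experts used across 32 tokens")
```

Each token only runs through a couple of experts, which is where the speedup comes from, but the set of experts touched grows quickly over a sequence, which is why they all need to be resident or swapped in at a cost.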