How to run LLaMA (and other LLMs) on Android.

llama@lemmy.dbzer0.com · edited 2 hours ago

Cris16228@lemmy.today · 8 hours ago

And what’s the purpose of running it locally? Just curious. Is there anything truly libre, or better?

Is there any difference between LLaMA (or any libre model) and ChatGPT (the first and most popular one I know of)?

llama@lemmy.dbzer0.com · 35 minutes ago

For me the biggest benefits are:

Your queries don’t ever leave your computer
You don’t have to trust a third party with your data
You know exactly what you’re running
You can tweak most models to your liking
You can upload sensitive information to it and not worry about it
It works entirely offline
You can run several models

projectmoon@lemm.ee · 8 hours ago

Most open/local models require a fraction of the resources of ChatGPT. They are usually not as good in a general sense, but they are often good enough and can sometimes surpass ChatGPT in specific domains.

Cris16228@lemmy.today · edited 7 hours ago

Do you know of anything libre? I’m curious to try something, preferably self-hosted.

According to a YouTuber, DeepSeek (or whatever the name is, the Chinese open-source one) is better than ChatGPT: he tried one simple request, making a Tetris game, and ChatGPT gave a broken game while the other one didn’t.

Idk why lol

projectmoon@lemm.ee · 7 hours ago

They’re probably referring to the 671b parameter version of DeepSeek. You can indeed self-host it. But unless you’ve got a server rack full of data center class GPUs, you’ll probably set your house on fire before it generates a single token.

If you want a fully open source model, I recommend Qwen 2.5 or maybe DeepSeek V2. There’s also OLMo 2, but I haven’t really tested it.

Mistral Small 24b also just came out and is Apache licensed. That is something I’m testing now.

Cris16228@lemmy.today · 3 hours ago

“But unless you’ve got a server rack full of data center class GPUs, you’ll probably set your house on fire before it generates a single token.”

It’s cold outside and I don’t want to spend money on keeping my house warm, so I could… try.

I’ll check them out! Thank you

projectmoon@lemm.ee · 2 hours ago

Lol, there are smaller versions of DeepSeek-R1. These aren’t the “real” DeepSeek model; they are other foundation models (Qwen2.5 and Llama3 in this case) fine-tuned on output from the full model, which is what “distilled” means here.

For the 671b parameter model, the medium-quality version weighs in at 404 GB. That means you need 404 GB of RAM/VRAM just to load the thing, and preferably ALL of it in VRAM (i.e. GPU memory) to get it to generate anything fast.

For comparison, I have 16 GB of VRAM and 64 GB of RAM on my desktop. If I run the 70b parameter version of Llama3 at Q4 quant (medium quality-ish), it’s a 40 GB file. It’ll run, but mostly on the CPU, and it generates ~0.85 tokens per second. So a good response will take 10-30 minutes, which is fine if you have time to wait, but not if you want an immediate response. If I had two beefy GPUs with 24 GB of VRAM each, that’d be 48 GB total, and I could run the whole model in VRAM and it’d be very fast.
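
The arithmetic behind those numbers can be sketched roughly as follows. This is a back-of-the-envelope estimate, not how any particular tool computes file sizes: the figure of 4.5 bits per weight for a Q4-style quant and the response lengths of 500 to 1500 tokens are assumptions chosen for illustration, and both function names are made up for this example.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def generation_minutes(tokens: int, tokens_per_second: float) -> float:
    """Minutes needed to generate `tokens` at a steady generation rate."""
    return tokens / tokens_per_second / 60


# Assumption: a Q4-style quant stores roughly 4.5 bits per weight.
size_70b = model_size_gb(70, 4.5)

# Assumption: a "good response" is somewhere between 500 and 1500 tokens.
short_wait = generation_minutes(500, 0.85)
long_wait = generation_minutes(1500, 0.85)

print(f"70b at ~4.5 bits/weight: about {size_70b:.0f} GB")
print(f"At 0.85 tokens/s: about {short_wait:.0f} to {long_wait:.0f} minutes")
```

Real files also carry metadata, and some layers are often kept at higher precision, so actual sizes will differ somewhat from this estimate.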

Cris16228@lemmy.today · 1 hour ago

No house on fire :(

Thanks! I’ll check it out

Autonomous User@lemmy.world · edited 13 hours ago

Warning, Llama is not libre. llama.com/llama3_3/license

Options here (check that the license column is green): wikipedia.org/wiki/List_of_large_language_models

kekmacska · 7 hours ago

You’ll only fry your phone with this. Very bad idea.

llama@lemmy.dbzer0.com · 5 hours ago

Not true. If you load a model that is beyond your phone’s hardware capabilities, it simply won’t open. Stop spreading FUD.

projectmoon@forum.agnos.is · 19 minutes ago

Depends on the inference engine. Some of them will keep trying to load the model until the process runs out of memory and crashes, which can cause its own problems. But that won’t overheat the phone, no. If you DO use a model that the phone can run, then like any intense computation it can cause the phone to heat up. Best not to run a long inference prompt while the phone is in your pocket, I think.
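
One way to avoid the out-of-memory case is to compare the model file’s size with the memory the system reports as available before loading it. The snippet below is only a sketch under assumptions: it expects a Linux-style /proc/meminfo to be readable (which may not hold on every Android setup), the 1.2 headroom factor is an arbitrary guess rather than a measured value, and the function names are hypothetical.

```python
import os


def mem_available_bytes(meminfo_text: str) -> int:
    """Return the MemAvailable value from /proc/meminfo-style text, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:    3456789 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("MemAvailable not found")


def fits_in_memory(model_bytes: int, meminfo_text: str, headroom: float = 1.2) -> bool:
    """True if the model size times a safety factor fits in available RAM.

    `headroom` is an assumed margin for context buffers and runtime overhead.
    """
    return model_bytes * headroom <= mem_available_bytes(meminfo_text)


def check_model_file(model_path: str) -> bool:
    """Convenience wrapper: read the live system values for a model file on disk."""
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    return fits_in_memory(os.path.getsize(model_path), meminfo_text)
```

A check like this says nothing about heat; it only indicates whether loading is likely to fail for lack of memory.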

llama@lemmy.dbzer0.com · edited 2 minutes ago

Thanks for your comment. That for sure is something to look out for. It is really important to know what you’re running and what possible limitations there could be. Not what the original comment said, though.

kekmacska · 24 minutes ago

That’s not how it works. Your phone can easily overheat if you use it too much, even if your device can handle the load. Smartphones don’t have cooling like PCs and laptops (except some ROG Phone models and the like). If you don’t want to fry your processor, only run LLMs on high-end gaming PCs with all-in-one water cooling.

llama@lemmy.dbzer0.com · 4 minutes ago

Of course that is something to be mindful of, but that’s not what the person in the original comment said. It does run, but you need to be aware of the limitations and potential consequences. That goes without saying, though.

Don’t overdo it and your phone will be just fine.

How to run LLaMA (and other LLMs) on Android.

Step 1: Install Termux

Step 2: Set Up proot-distro and Install Debian

Step 3: Install Dependencies

Step 4: Install Ollama

Step 5: Download and run the Llama3.2:1B Model
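
Once the model from Step 5 is running, it can also be queried from a script instead of the interactive prompt. The sketch below assumes Python 3 is installed inside the Debian environment, that the Ollama server is running on its default local port (11434), and that the `llama3.2:1b` model has already been downloaded. The helper names `build_request` and `ask` are made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.2:1b") -> urllib.request.Request:
    """Build a POST request asking Ollama for a single, non-streamed reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.2:1b") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Why is the sky blue?")` should return the model’s answer as a string. On a phone, expect it to take a while.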