Flash-MoE: Running a 397B Parameter Model on a Laptop

(github.com)

398 points | by mft_ 22 days ago

58 comments

  • tarruda
    22 days ago
    Note that this is not the only way to run Qwen 3.5 397B on consumer devices: there are excellent ~2.5 BPW quants available that make it viable for 128G devices.

    I've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:

        mmlu: 87.86%
        gpqa diamond: 82.32%
        gsm8k: 86.43%
        ifeval: 75.90%

    More details of my experience:

    - https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

    - https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

    - https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

    Overall an excellent model to have for offline inference.

    • Aurornis
      22 days ago
      The method in this link is already using a 2-bit quant. They also reduced the number of experts per token from 10 to 4 which is another layer of quality degradation.

      In my experience the 2-bit quants can produce sensible output for short prompts, but they aren’t useful for doing work in longer sessions.

      This project couldn’t even get useful JSON out of the model because it can’t produce the right token for quotes:

      > *2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.

      • tarruda
        22 days ago
        I can't say anything about the OP's method, but I have already tested the smol-IQ2_XS quant (which is 2.46 BPW) with the pi harness. I did not do a very long session because token generation and prompt processing get very slow, but I worked with up to ~70k context and it maintained a lot of coherence in the session. IIRC GPQA diamond is supposed to exercise long chains of thought, and it scored exceptionally well at 82% (the official BF16 number is 88%: https://huggingface.co/Qwen/Qwen3.5-397B-A17B).

        Note that not all quants are the same at a certain BPW. The smol-IQ2_XS quant I linked is pretty dynamic, with some tensors having q8_0 type, some q6_k and some q4_k (while the majority is iq2_xs). In my testing, this smol-IQ2_XS quant is the best available at this BPW range.

        Eventually I might try a more practical eval such as terminal bench.

        • Aurornis
          22 days ago
          > I did not do a very long session

          This is always the problem with the 2-bit and even 3-bit quants: They look promising in short sessions but then you try to do real work and realize they’re a waste of time.

          Running a smaller dense model like 27B produces better results than 2-bit quants of larger models in my experience.

          • amelius
            22 days ago
            > This is always the problem with the 2-bit and even 3-bit quants: They look promising in short sessions but then you try to do real work and realize they’re a waste of time.

            It would be nice to see a scientific assessment of that statement.

          • singpolyma3
            22 days ago
            Lots of people seem to use 4bit. Do you think that's worth it vs a smaller model in some cases?
            • Aurornis
              22 days ago
              4 bit is as low as I like to go. There are KLD and perplexity tests that compare quantizations where you can see the curve of degradation, but perplexity and KLD numbers can be misleading compared to real world use where small errors compound over long sessions.

              In my anecdotal experience I’ve been happier with Q6 and dealing with the tradeoffs that come with it over Q4 for Qwen3.5 27B.
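
              The perplexity/KLD comparison mentioned above boils down to measuring how far the quantized model's next-token distribution drifts from the full-precision one. A minimal sketch with made-up logits (not real model outputs):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    # KL(p || q): the extra "surprise" from using the quantized
    # distribution q in place of the full-precision p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

full = softmax([2.0, 1.0, 0.1])    # hypothetical BF16 logits for one token
quant = softmax([1.8, 1.1, 0.3])   # same token's logits after quantization
d = kl_divergence(full, quant)
print(d >= 0.0)  # True; KLD is 0 only when the distributions match
```

              A benchmark averages this per-token divergence over a corpus; the caveat above is that a small average KLD can still hide errors that compound over a long session.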

            • hnfong
              22 days ago
              Generally the perplexity charts indicate that quality drops significantly below 4-bit, so in that sense 4-bit is the sweet spot if you're resource constrained.
      • simonw
        22 days ago
        The project doesn't just use 2-bit - that was one of the formats they tried, but when that didn't give good tool calls they switched to 4-bit.
        • tarruda
          22 days ago
          In my case the 2.46 BPW quant has been working flawlessly for tool calling, so I don't think 2-bit was the culprit for the JSON failures.

          They did reduce the number of experts, so maybe that was it?

      • stuaxo
        22 days ago
        There's at least one project they could use to repair the JSON and another that takes a different approach.
    • arjie
      22 days ago
      What's the tok/s you get these days? Does it actually work well when you use more of that context?

      By the way, it's been a long time since I last saw your username. You're the guy who launched Neovim! Boy what a success. Definitely the Kickstarter/Bountysource I've been a tiny part of that had the best outcome. I use it every day.

      • tarruda
        22 days ago
        > What's the tok/s you get these days?

        I ran llama-bench a couple of weeks ago when there was a big speed improvement on llama.cpp (https://github.com/ggml-org/llama.cpp/pull/20361#issuecommen...):

            % llama-bench -m ~/ml-models/huggingface/ubergarm/Qwen3.5-397B-A17B-GGUF/smol-IQ2_XS/Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000
            ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
            ggml_metal_library_init: using embedded metal library
            ggml_metal_library_init: loaded in 0.008 sec
            ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
            ggml_metal_device_init: GPU name:   MTL0
            ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
            ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
            ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
            ggml_metal_device_init: simdgroup reduction   = true
            ggml_metal_device_init: simdgroup matrix mul. = true
            ggml_metal_device_init: has unified memory    = true
            ggml_metal_device_init: has bfloat            = true
            ggml_metal_device_init: has tensor            = false
            ggml_metal_device_init: use residency sets    = true
            ggml_metal_device_init: use shared buffers    = true
            ggml_metal_device_init: recommendedMaxWorkingSetSize  = 134217.73 MB
            | model                          |       size |     params | backend    | threads | n_ubatch | fa |            test |                  t/s |
            | ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | --------------: | -------------------: |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |           pp512 |        189.67 ± 1.98 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |           tg128 |         19.98 ± 0.01 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d10000 |        168.92 ± 0.55 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d10000 |         18.93 ± 0.02 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d20000 |        152.42 ± 0.22 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d20000 |         17.87 ± 0.01 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d30000 |        139.37 ± 0.28 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d30000 |         17.12 ± 0.01 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d40000 |        128.38 ± 0.33 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d40000 |         16.38 ± 0.00 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d50000 |        118.07 ± 0.55 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d50000 |         15.66 ± 0.00 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d60000 |        108.44 ± 0.38 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d60000 |         14.98 ± 0.01 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d70000 |         98.85 ± 0.18 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d70000 |         14.36 ± 0.00 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d80000 |         91.39 ± 0.49 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d80000 |         13.84 ± 0.00 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d90000 |         85.76 ± 0.24 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d90000 |         13.30 ± 0.00 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d100000 |         80.19 ± 0.83 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d100000 |         12.82 ± 0.00 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d150000 |         54.46 ± 0.33 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d150000 |         10.17 ± 0.09 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d200000 |         47.05 ± 0.15 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d200000 |          9.04 ± 0.02 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d250000 |         40.71 ± 0.26 |
            | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d250000 |          8.01 ± 0.02 |
        
            build: d28961d81 (8299)
        
        So it starts at 20 tps tg and 190 tps pp with empty context and ends at 8 tps tg and 40 tps pp with 250k prefill.

        I suspect that there are still a lot of optimizations to be implemented for Qwen 3.5 on llama.cpp; I wouldn't be surprised to see 25 tps in a few months.

        > You're the guy who launched Neovim!

        That's me ;D

        > I use it every day.

        So do I for the past 12 years! Though I admit in the past year I greatly reduced the amount of code I write by hand :/

        • hnfong
          22 days ago
          Apologies to others for the offtopic comment, but thank you so much for neovim. I started using Vim 25 years ago and I almost don't know how to type without a proper Vi-based editor. I don't write as much code these days, but I write other stuff (which definitely needs to be mostly hand written) in neovim and I feel so grateful that this tool is still receiving love and getting new updates.
          • tarruda
            22 days ago
            > in neovim and I feel so grateful that this tool is still receiving love and getting new updates.

            @justinmk deserves the credit for this!

        • terhechte
          22 days ago
          Thank you for NeoVim! I also use it every day, mostly for thinking / text / markdown though these days.

          Have you compared against MLX? Sometimes I’m getting much faster responses but it feels like the quality is worse (eg tool calls not working, etc)

          • tarruda
            22 days ago
            > Have you compared against MLX?

            I don't think MLX supports similar 2-bit quants, so I never tried 397B with MLX.

            However I did try 4-bit MLX with other Qwen 3.5 models and yes, it is significantly faster. I still prefer llama.cpp due to it being an all-in-one package:

            - SOTA dynamic quants (especially ik_llama.cpp)
            - amazing web UI with MCP support
            - anthropic/openai compatible endpoints (meaning it can be used with virtually any harness)
            - JSON constrained output, which basically ensures tool call correctness
            - routing mode

        • arjie
          22 days ago
          That's surprisingly fast. Thanks for sharing.
    • outlog
      22 days ago
      What is the power usage? Maybe https://www.coconut-flavour.com/coconutbattery/ can give you an estimate?
      • tarruda
        22 days ago
        I don't think I've ever seen the M1 Ultra GPU exceed 80 W in asitop.

        Update: I just did a quick asitop test while inferencing and the GPU power was averaging 53.55 W

    • iwontberude
      22 days ago
      Thank you, I have been using way too many credits for my personal automation.
    • woile
      22 days ago
      Just a single M1 Ultra?
      • tarruda
        22 days ago
        Yes. Note that the only reason I acquired this device was to run LLMs, so I can dedicate its whole RAM to it. Probably not viable for a 128G device that you are actively using for other things.
  • Aurornis
    22 days ago
    Reading the details, he is using 2-bit quantization and reduced the number of experts per token from 10 down to 4 to get 5 tokens/sec. Cool proof of concept but it’s far from the quality and performance of the 397B model as normally used. Dropping the number of experts is particularly misleading.

    This is some interesting work, but applying such extreme measures to LLMs to get them to run severely degrades quality. I know he claims negligible quality loss, but in my experience 2-bit quantizations are completely useless for real work. You can get them to respond to prompts, but they lose their intelligence and will go around in circles.

    He also shows 5-6 tokens per second. Again that’s impressive for a large model on limited hardware but it’s very slow. Between the severely degraded model abilities and the extremely slow output the 397B result should be considered an attempt at proving something can technically run, not evidence that it can run well and produce output you’d expect from a 397B model.

    He even mentions the obvious problems with his changes:

    > *2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.

    So right out of the gate this isn’t useful if you want to do anything with it. He could have tried smaller models or less aggressive quantization to get actually useful output, but it wouldn’t look as impressive. It’s honestly getting kind of exhausting to read all of these AI-coded (admitted in the link) and AI-written papers made more for resume building. It would have been interesting to see this work applied to running a useful model that hadn’t been lobotomized, instead of applying tricks to get an impressive headline but useless output.

    • bearjaws
      21 days ago
      We need a rule that if your LLM benchmark is running under 20t/s it's simply unusable in any real workflow.

      6t/s is unbearable, if you used it with OpenCode you would be waiting 20+ minutes per turn.

      • zozbot234
        21 days ago
        This is not an ordinary LLM benchmark, it's streaming experts' weights from storage. It opens up running very large (near-SOTA, potentially SOTA) MoE models on very limited hardware, since you no longer need enough RAM for the entirety of the model's parameters. The comparison to 20 t/s local AI models is simply not fair.
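
        The core trick can be sketched in a few lines: memory-map the expert weights so that only the router-selected experts are ever read from storage. Dimensions and file name here are made up for illustration:

```python
import os
import numpy as np

# Fake "expert weights on SSD": 8 experts, each a 16x16 matrix.
n_experts, dim = 8, 16
rng = np.random.default_rng(0)
rng.standard_normal((n_experts, dim, dim)).astype(np.float32).tofile("experts.bin")

# Memory-map the file: no expert is read into RAM until it is touched.
experts = np.memmap("experts.bin", dtype=np.float32, mode="r",
                    shape=(n_experts, dim, dim))

def moe_forward(x, router_logits, k=4):
    """Apply only the router's top-k experts; the rest stay on disk."""
    top_k = np.argsort(router_logits)[-k:]
    gates = np.exp(router_logits[top_k] - router_logits[top_k].max())
    gates /= gates.sum()                      # softmax over selected experts
    out = np.zeros(dim, dtype=np.float32)
    for g, i in zip(gates, top_k):
        out += g * (x @ experts[i])           # pages in expert i's weights
    return out

y = moe_forward(rng.standard_normal(dim).astype(np.float32),
                rng.standard_normal(n_experts))
print(y.shape)  # (16,)
os.remove("experts.bin")
```

        The real project streams from SSD with its own I/O path rather than plain memory mapping, but the RAM requirement shrinking from "all experts" to "active experts per token" is the same effect.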
        • bearjaws
          21 days ago
          I understand that. I am saying there is a clear cliff where the value of an LLM reaches 0.

          At 1t/s you are never going to get anything done. At 6t/s, it significantly degrades the experience, with one mistake setting you back 20-30 minutes.

          At ~20t/s it's much more usable.

    • 190n
      22 days ago
      > *2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.

      I was wondering about that statement. Shouldn't it restrict sampling to only tokens that produce valid JSON matching the schema during a tool call? On the other hand, I have heard a lot about how even production LLM providers don't always call tools accurately, so I suppose either it's hard to implement what I described or there's something I haven't thought of that makes it impossible.
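
      That is roughly what grammar-constrained decoding does (llama.cpp supports this via GBNF grammars). A toy illustration with a made-up vocabulary and a crude validity oracle, not how a real implementation checks prefixes:

```python
import json

# Made-up token vocabulary; real tokenizers are subword, but the idea holds.
vocab = ['{', '}', '"name"', ':', 'null', '\\name\\']

def is_valid_json_prefix(s):
    # Crude oracle: s is a valid prefix if some short completion parses.
    for tail in ('', '"', '}', ':null}', '"}'):
        try:
            json.loads(s + tail)
            return True
        except ValueError:
            continue
    return False

def constrained_pick(prefix, logits):
    """Pick the highest-logit token that keeps the output valid JSON."""
    ranked = sorted(zip(logits, vocab), reverse=True)
    for _, tok in ranked:
        if is_valid_json_prefix(prefix + tok):
            return tok
    return None

# Quantization noise makes the model prefer the broken token '\name\' ...
logits = [0.1, 0.2, 2.5, 0.0, 0.0, 3.0]
# ... but the mask rejects it and falls back to the valid '"name"'.
print(constrained_pick('{', logits))
```

      Even when noise pushes the broken token to the top, the mask never lets it through; the engineering challenge is doing this check efficiently over a 100k+ token vocabulary at every step.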

    • kageroumado
      22 days ago
      [dead]
  • jllyhill
    22 days ago
    To be honest, I'm getting tired of the "laptop" in every one of these clickbait titles turning out to be a $3000 MacBook. Sure, it's impressive to achieve this degree of LLM compression, but I really don't like that the title implies local LLMs become viable for an average person when the actual hardware is out of reach for 99%.
    • kelnos
      22 days ago
      I enjoy them because I have a reasonably beefy laptop (non-Mac, though, so I can't try out this particular project), and it's nice to see what you can do with laptops in this space.

      I think, though, maybe consider your biases? Whenever I see a headline like this, I absolutely don't assume it's going to run on any random laptop. I don't even expect it to run well, or at all, on my laptop (for stuff that isn't Mac-only, that is), which has an iGPU. I'm generally a big LLM/AI skeptic, but I find your brand of cynicism/dismissal to be kinda boring and uninteresting.

    • freehorse
      22 days ago
      You can probably go lower than $3000. In my experience an M1 Max with 64GB RAM should have similar performance, and you can probably find one used for less than $2000 or so.

      In any case, I do not think that the range of people who have an M1/2/3/4 Max MacBook is that narrow, e.g. people who do video editing or who benefit from having one of the fastest multicore laptops. It is handy to be able to do work with a machine you may already own for separate reasons, though it is definitely more toward the side of a "pro device" than a "basic consumer device".

    • kroaton
      22 days ago
      It could just as easily be a $3000-4000 Strix Halo laptop.
    • Computer0
      22 days ago
      Yeah I understand the sentiment; I think it should’ve been “on a laptop!” instead of “on a laptop”
    • throw284959
      22 days ago
      I ran the full version of this model without any swapping on a cluster of 2x $3000 laptops (Strix Halo ZBook 128GB) at about 20 tokens per second.

      I would say it is in reach for a normal person. If anything, buying it was a great investment: it is a work tool, and I will probably sell it for more than what I bought it for :)

      • prmoustache
        22 days ago
        > I would say it is in reach for normal person.

        That is a very wealthy country centric thing to say.

        • mycall
          22 days ago
          ...and to think $2-3k is basically the minimum to really use local LLMs effectively, with prices definitely rising globally. Most people will just stick with online services for LLMs.
  • homarp
    22 days ago
  • zozbot234
    22 days ago
    The github page mentions that a naïve mmap approach is bottlenecked by per-page overhead. Can this be mitigated by setting up explicit "huge" pages? (2M using the CONT PTE feature if the "native" page size is 16k; 32M using a PMD level block mapping; or 1G using the CONT PMD feature.) Does macOS support this out of the box? Alternatively, one might use a simple mmap and then something like posix_fadvise to set up prefetching of the data.
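
    On Linux the hints are easy to experiment with from userspace; whether macOS honors equivalents is the open question. A sketch (the madvise flags are guarded because macOS's Python doesn't expose them, and kernels may reject MADV_HUGEPAGE on file-backed mappings):

```python
import mmap
import os

# Stand-in for a large weights file (1 MiB here just for the sketch).
with open("weights.bin", "wb") as f:
    f.write(os.urandom(1 << 20))

fd = os.open("weights.bin", os.O_RDONLY)
size = os.fstat(fd).st_size

# Ask the kernel to begin prefetching the file into the page cache.
if hasattr(os, "posix_fadvise"):
    os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)

mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
if hasattr(mmap, "MADV_SEQUENTIAL"):
    mm.madvise(mmap.MADV_SEQUENTIAL)      # fewer, larger readahead bursts
if hasattr(mmap, "MADV_HUGEPAGE"):
    try:
        mm.madvise(mmap.MADV_HUGEPAGE)    # THP hint; the kernel may reject
    except OSError:                       # it for file-backed mappings
        pass

chunk = mm[:4096]                         # mostly hits prefetched cache now
print(len(chunk))                         # 4096
mm.close()
os.close(fd)
os.remove("weights.bin")
```
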
  • justacatbot
    22 days ago
    The quality degradation at 2-bit is a real issue. For actual work tasks, a well-tuned 30B at 4-bit usually outperforms a 70B+ at 2-bit in my experience. The expert reduction on top of that compounds things - you're essentially running a fairly different model. Still interesting to see the upper bound of what consumer hardware can attempt, even if the result isn't production-ready.
  • bertili
    22 days ago
    Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?
    • daemonologist
      22 days ago
      Most definitely - the popular engines have extensive support for doing this and controlling exactly which weights end up where (llama.cpp: https://github.com/ggml-org/llama.cpp/blob/master/tools/cli/... , vllm: https://docs.vllm.ai/en/stable/configuration/engine_args/#of... , sglang (haven't tried this): https://docs.sglang.io/advanced_features/server_arguments.ht...).

      Even with a MoE model, which has to move a relatively small portion of the weights around, you do end up quite bandwidth constrained though.

    • zozbot234
      22 days ago
      Loading experts to system memory is supported by most local-AI frameworks. But you do not gain much by running that part of the decode on GPU, since decode is not compute-limited and the CPU-GPU transfer involves overhead. It's best to use the GPU for speeding up the shared part of the model.
    • Aurornis
      22 days ago
      Using system memory and CPU compute for some of the layers that don’t fit into GPU memory is already supported by common tools.

      It’s workable for mixture of experts models but the performance falls off a cliff as soon as the model overflows out of the GPU and into system RAM. There is another performance cliff when the model has to be fetched from disk on every pass.

      • zozbot234
        22 days ago
        It's less of a "performance falls off a cliff" problem and more of a "once you offload to RAM/storage, your bottleneck is the RAM/storage and basically everything else no longer matters". This means if you know you're going to be relying on heavy offload, you stop optimizing for e.g. lots of VRAM and GPU compute since that doesn't matter. That saves resources that you can use for scaling out.
        • Aurornis
          22 days ago
          It depends on the model and the mix. For some MoE models lately it’s been reasonably fast to offload part of the processing to CPU. The speed of the GPU still contributes a lot as long as it’s not too small of a relative portion of compute.
    • K0balt
      22 days ago
      My thoughts exactly. Something like this could make it so that modest GPU capacity, like a pair of 3090s, plus lots of RAM makes big inference more practical for personal labs.
    • kelnos
      22 days ago
      Why couldn't you take the same approach on Linux, and load from SSD?

      I assumed the only reason this particular project wouldn't be usable on Linux is because it uses Metal...

  • andai
    22 days ago
    > Metal Compute Shaders — Hand-written Metal kernels

    Hand written... by GPT? ;)

    • Aurornis
      22 days ago
      He’s very clear that it was written by AI.
      • kelnos
        22 days ago
        Then "hand-written" is a bit of a weird thing to say, no? Unless Claude has grown hands.
  • RandyOrion
    22 days ago
    This project shows an interesting automated search applied to an engineering problem, which I would like to see more of.

    The experience of utilizing tiered storage (GPU VRAM, RAM, and SSD) is generally poor for a lot of the LLM inference engines out there, e.g., llama.cpp, sglang, vllm, etc.

    My own experience is that weight and KV cache offload to RAM on sglang and vllm is unavailable or unusable: copying the extra parameters from the docs and adding them to already-working commands results in errors. Llama.cpp does support weight offload, but the experience is not pleasant: low PCIe (GPU <-> RAM) utilization, low GPU utilization, and really low tokens per second.

  • mkw
    22 days ago
    TLDR I took a stab at leveraging Dan's work and making it more practical:

    https://github.com/matt-k-wong/mlx-flash

    2 bit quantization lobotomizes the model but is impressive nonetheless! Maybe one day we'll be able to have intelligent 2 bit quants... I wonder.

    My version supports 4-bit quantization, hybrid streaming (disk + RAM), and arbitrary model compatibility (tested on Mamba2), and sets up the framework for LM Studio integration.

    I leveraged this work (credit to Danveloper) and am in the middle of making it work on more practical models and quants. It still uses flash streaming, but with a control knob so you can choose how much or how little RAM to use. In the most extreme case it uses as little RAM as possible but is very slow; in the balanced case you use some RAM and it's much faster.

    I designed it around the intelligence-dense Nemotron 3 Nano 30B and Nemotron Cascade 2 30B models (which are smaller but denser in intelligence per parameter) and it can run on low-end 16GB machines, though you can run arbitrarily large models on larger machines (designed for the very low end, but capable of the high end).

  • JSR_FDED
    22 days ago
    This is a very impressive result. If I understand correctly the bottleneck is the SSD in this architecture - the author seems to get almost 15GB/s - but I seem to remember the max b/w was about 8GB/s. What am I missing?
    • Roxxik
      22 days ago
      IO is very bursty in these setups. When the router results are in you can start loading experts from SSD. In this brief moment the SSD is saturated.

      Outside of that the SSD is idling.

      Table 3 shows for K=4 experts an IO of 943 MB/Tok at 3.15 Tok/s giving an average IO of 2970 MB/s far below what the SSD could do.
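
      The back-of-envelope math for that average:

```python
# Numbers quoted above from the project's Table 3 (K=4 experts).
io_per_token_mb = 943         # MB of expert weights read per generated token
tokens_per_sec = 3.15
avg_mb_per_sec = io_per_token_mb * tokens_per_sec
print(round(avg_mb_per_sec))  # 2970 -- well below the SSD's peak bandwidth
```
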

      I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors parallelizing compute with IO.

      Not sure if this works on Mac; I only tested my larger-than-RAM setup on Linux with io_uring O_DIRECT reads, and I saw that about 20% of total reads finish while my fused up/gate matmul is already running.

      Edit: Typos

      • zozbot234
        22 days ago
        The github page mentions that you can't overlap SSD traffic and GPU compute on Apple Silicon, you get heavy contention for the shared hardware resources.
    • rado
      22 days ago
      The MacBook Pro M5 Pro and M5 Max have SSDs that fast.
      • selimthegrim
        22 days ago
        I have an MBP M4 Pro and a WD Black SN850x in an external TB5 enclosure and I easily get 6-7 GB/s
    • Aurornis
      22 days ago
      PCIe 5 doubles the maximum throughput. That’s why the numbers for newer SSDs are about double what you recall as the old maximum.
  • druide67
    21 days ago
    The finding about removing the 9.8 GB Metal LRU cache for a 38% speedup is the most interesting part. Same lesson as PostgreSQL's advice against application-level buffer pools that compete with the OS page cache: the hardware memory compressor doing 130K decompressions/sec was pure overhead.

    Curious about the remaining gap: 5.7 tok/s vs 18.6 theoretical (from SSD bandwidth). Is the ~70% overhead mostly GPU compute on non-expert layers (attention, norm), or is there I/O scheduling room left?
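
    For what it's worth, that 18.6 tok/s ceiling is consistent with the per-token I/O figures quoted elsewhere in this thread (both inputs below come from other comments, not independent measurement):

```python
ssd_bandwidth_gb_s = 17.5  # sequential-read figure quoted in the thread
io_per_token_gb = 0.943    # ~943 MB of expert weights per token (K=4)
ceiling = ssd_bandwidth_gb_s / io_per_token_gb
print(round(ceiling, 1))   # 18.6 tok/s -- the pure-I/O upper bound
```
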

  • spwa4
    22 days ago
    Does this mean that it should be possible to load up a system with ~10 SSDs (which seems to be at least the number of active experts) to get 40 tok/s even on truly gigantic models?
    • zozbot234
      22 days ago
      SSD bandwidth will ultimately be limited by the amount of PCIe lanes you have available (for something other than the Apple Silicon internal storage). So the approach has inherent limitations. You can of course scale out to multiple systems to get more throughput.

      You can use this approach with Intel Optane, which is wearout-resistant unlike NAND and can thus substitute for RAM. Last I checked, it was available quite cheap on the secondary market, ~$1/GB as opposed to ~$15/GB or more for DRAM. (Of course that's nowhere near as cheap as NAND, which is around ~$0.1/GB but quite wearout-prone with heavy writes.)

      • spwa4
        22 days ago
        Yeah, PCIe is the bottleneck. The point being that whether the data originates from RAM or from NVME or Optane, you cannot get data to the GPU faster with RAM than with SSDs.

        Meanwhile PCIe switches exist. So why not build:

        1 CPU + memory + ...

        N PCIe switches, each with 1 low-memory GPU + 6 NVMe drives (in theory 5 can saturate the GPU)

        Each of those should only bother the CPU when they have some tokens produced and have plenty of PCIe lanes to get at their data.

        Such a setup should be able to get a 6 to 8 times speedup from the solution detailed here, and a model compute increase should make relatively little difference in performance.

    • kelnos
      22 days ago
      I'm not sure how you're going to get 10 SSDs to fit inside a laptop. And if you're not going to use a laptop, then you might as well get an expensive machine with plenty of RAM, no? Even with RAM prices as crazy as they are these days, that's probably still cheaper than a machine with enough PCIe bandwidth/lanes/ports for 10 SSDs, not to mention the cost of the SSDs themselves.

      Though I guess this could be an interesting area of research, regardless of the cost.

  • shubhamintech
    22 days ago
    4.4 tok/s with reliable structured output is a solid local benchmark, although the question is whether SSD streaming introduces per-token latency variance that messes up tool call parsing downstream. The gap between 400 GB/s unified memory bandwidth and 17.5 GB/s SSD reads means you're in the hot path pretty much every time an expert isn't cached.
  • qiine
    22 days ago
    It seems strange to me that the only way to use an LLM is to fit it entirely in volatile memory from the get-go.

    To render movies we happily wait for the computer to calculate how light bounces around, for hours, even days.

    So why not do the same with AIs? Ask big question to big models and get the answer to the universe tomorrow?

    • Aurornis
      22 days ago
      If you don’t care about turnaround time you can do that.

      Most LLM use cases are about accelerating workflows. If you have to wait all night for a response and then possibly discover that it took the wrong direction, misunderstood your intent, or your prompt was missing some key information then you have to start over.

      I don’t let LLMs write my code but I do a lot of codebase exploration, review, and throwaway prototyping. I have hundreds to maybe thousands of turns of LLM conversation each day. If I had to wait 10X or 100X as long then it wouldn’t be useful. I’d be more productive ignoring a slow LLM and doing it all myself.

      • zozbot234
        22 days ago
        > If you have to wait all night for a response and then possibly discover that it took the wrong direction, misunderstood your intent, or your prompt was missing some key information then you have to start over.

        If you have to wait overnight because the model is offloading to disk, that's a model you wouldn't have been able to run otherwise without very expensive hardware. You haven't really lost anything. If anything, it's even easier to check on what a model is doing during a partial inference or agentic workload if the inference process is slower.

      • qiine
        22 days ago
        > If you have to wait all night for a response and then possibly discover that it took the wrong direction, misunderstood your intent, or your prompt was missing some key information then you have to start over.

        This exact problem exists for rendering: when you realize after a long render that an object was missing in the background, the costly frame is now useless. To counter that you make multiple "draft" renders first to make sure everything is in the frame and your parameters are properly tuned.

    • andoando
      22 days ago
      There are definitely use cases for this for long-running tasks, like doing research, but for typical use cases they require way too much constant supervision and interaction
  • maxloh
    22 days ago
    Can you add a license to the repo? Legally, we can't run any code without a license attached to it.
    • freehorse
      22 days ago
      The author states that the code was written by Opus, and afaik AI-written code is not considered copyrightable. Without copyright on the code, there should not be something prohibiting you from running it or even redistributing it. Of course this may come down to the extent of human contribution.
    • Wowfunhappy
      22 days ago
      ...you can't redistribute code without a license, but surely you can legally run it, can't you?

      Like, if I write a blog post and put it on my blog, you're allowed to read it, right?

      Heck, if my blog contains some Javascript code I wrote, I would imagine your web browser is allowed to run that code without opening you up to copyright infringement, even if I didn't provide an explicit license.

  • haomingkoo
    22 days ago
    Really interesting approach. Curious how the 2-bit quantization affects the model's reasoning ability on longer chains of thought vs shorter prompts. The benchmarks look solid but real-world usage seems like a different story based on the comments here.
  • 999900000999
    22 days ago
    If I have a dedicated GPU with 12GB of VRAM and 32GB of system RAM, can I combine the two for LLMs?

    So far ollama will use the 12GB and then give up
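    For what it's worth, llama.cpp (which Ollama builds on) can split a model between VRAM and system RAM with the `-ngl`/`--n-gpu-layers` flag, running the remaining layers on the CPU; Ollama exposes a similar knob as `num_gpu`. A rough way to size the split — all figures below are illustrative assumptions, not measured values:

```python
# Estimating an -ngl value for partial GPU offload: assume weights are
# spread evenly across layers and reserve some VRAM for KV cache/buffers.
def gpu_layers(model_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Number of transformer layers that fit in VRAM under the
    even-spread assumption, keeping reserve_gb back for other state."""
    per_layer_gb = model_gb / n_layers
    return max(0, int((vram_gb - reserve_gb) / per_layer_gb))

# e.g. a ~24 GB quantized model with 48 layers on a 12 GB card:
n = gpu_layers(model_gb=24, n_layers=48, vram_gb=12)
print(n)  # pass as e.g.: llama-server -m model.gguf -ngl 20
```

    Expect throughput to be dominated by the CPU-resident layers, so the speedup over pure CPU is real but well short of full-GPU inference.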

  • m-hodges
    22 days ago
    As frontier models get closer and closer to consumer hardware, what’s the moat for the API-driven $trillion labs?
    • stri8ted
      22 days ago
      48 GB is not consumer hardware. But fundamentally, there are economies of scale due to batching, power distribution, better utilization, etc., that mean data center tokens will be cheaper. Also, as the cost of training (frontier) models increases, it's not clear the Chinese companies will continue open sourcing them. Notice, for example, that Qwen-Max is not open source.
      • zozbot234
        22 days ago
        Nothing obviously prevents using this approach, e.g. for 3B-active or 10B-active models, which do run on consumer hardware. I'd love to see how the 3B performs with this on the MacBook Neo, for example. More relevantly, data-center scale tokens are only cheaper for the specific type of tokens data centers sell. If you're willing to wait long enough for your inferences (and your overall volume is low enough that you can afford this) you can use approaches like OP's (offloading read-only data to storage) to handle inference on low-performing, slow "edge" devices.
      • WesolyKubeczek
        22 days ago
        It is consumer hardware in the sense that MacBook Pros come with this RAM size as base and that you can buy them as a consumer, without having to sign a special B2B contract, show that your company is big and reputable enough, and order a minimum of 10 or 100.
      • m-hodges
        22 days ago
        > 48 GB is not consumer hardware.

        It’s a MacBook.

        • kelnos
          22 days ago
          Technically that's correct (which as we all know is the best kind of correct), but really, how many consumers are buying a high-end MacBook Pro with 48GB or more of RAM? That's a very small percentage of the population. In these kinds of discussions, "consumer" is being used as a proxy for "something your average home laptop buyer might have". And a 48GB MBP is not that.

          I know it's annoying, because a 48GB MBP is indeed technically "consumer hardware", but please understand the context and don't be pedantic. You know what the GP meant. (And if not, that's... kinda on you.)

          • m-hodges
            22 days ago
            > but please understand the context and don't be pedantic.

            The context is this is something I can pick up at an Apple Store and not some rig I have to build with NVIDIA cards.

            I led with:

            > get closer and closer to consumer hardware

            I think this demonstrates getting closer, whether you think a MacBook is consumer hardware or not. But I'm the one being pedantic.

    • OJFord
      22 days ago
      Assuming 'moat' – they'll push the frontier forward; they don't really have to worry until progress levels off.

      At that point, I suppose there's still paid harnesses (people have always paid for IDEs despite FOSS options) partly for mindshare, and they could use expertise & compute capacity to provide application-specific training for enterprises that need it.

    • BoredomIsFun
      22 days ago
      > the API-driven $trillion labs?

      here we go: https://huggingface.co/collections/trillionlabs/tri-series

  • lostmsu
    22 days ago
    How large is the KV cache?
    • xbar
      22 days ago
      0.1 GB per full-attention layer and "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention." So, 1.5 GB.
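      Spelling out the arithmetic, plus one combination of dimensions that lands near the quoted 0.1 GB per layer (the dimensions below are assumptions for illustration, not Qwen3.5's actual config):

```python
# KV cache size: K and V each store one entry per (position, KV head,
# head dimension), so per-layer bytes are a simple product.
def kv_bytes_per_layer(ctx, n_kv_heads, head_dim, bytes_per_elt):
    return 2 * ctx * n_kv_heads * head_dim * bytes_per_elt

# Illustrative dims giving ~0.1 GB/layer: 100k context, 4 KV heads,
# head_dim 128, 8-bit cache entries (assumed, not the real config):
per_layer = kv_bytes_per_layer(100_000, 4, 128, 1)
total = 15 * per_layer   # only the 15 full-attention layers keep a KV cache
print(per_layer, total)  # 102400000 1536000000 -> ~0.1 GB and ~1.5 GB
```

      The 45 GatedDeltaNet layers keep a constant-size recurrent state instead of a per-token cache, which is why the total stays small even at long context.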
  • 383toast
    22 days ago
    yeah 4 tok/s is kinda unusable though
  • breakingcups
    22 days ago
    > No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.

    Welp, I know where those tokens came from.

  • mannyv
    22 days ago
    Everyone is focused on the bad 2-bit result but who cares? He says don’t use it because it’s bad.
    • Aurornis
      22 days ago
      If you don’t care about the output, why not reduce to 1-bit and only 1 active expert? It will be completely useless but it will be faster!
  • pdyc
    22 days ago
    Impressive, I wish someone would take a stab at using this technique on mobile GPUs; even if it doesn't use storage it would still be a win. I am running llama.cpp on an Adreno 830 with OpenCL and I am getting a pathetic 2-3 t/s for output tokens.
  • matchbox
    22 days ago
    this is awesome Dan!
  • utopiah
    21 days ago
    I honestly don't get "why", despite having done similar things myself, e.g. running a model on a VR headset itself.

    I mean, I've done it because I could, so I imagine others are doing that too. But then... once it's done I don't actually use it. I ticked that box, but when even SOTA models aren't that useful I have a hard time imagining actual positive use cases (... not like offline spam or naughty chat in the woods) that would benefit from such technically impressive demos.

  • NamlchakKhandro
    22 days ago
    lmao 4.4 tokens per second is hilariously and utterly bad.

    anyone suggesting that it's a reasonable speed should find another career

    • zozbot234
      22 days ago
      This reads like an unfairly shallow dismissal. 4.4 tok/s is more than par for the course (even acknowledging that the model was heavily quantized and stripped down by limiting active experts) for something that's streaming weights from storage in order to run on vastly more limited hardware than the model was designed for. It's even more impressive since the OP writeup mentions Apple Silicon hardware cannot overlap GPU compute with SSD access.
      • ActorNightly
        22 days ago
        The problem is that the hardware is still like $3000. Making anything run on Macs is an exercise in futility. And it's a shame that people get duped into buying Macs for LLM inference.
        • zozbot234
          22 days ago
          $3000 for running a 397B total parameters model is quite a bargain. The Mac is being used for its access to fast internal storage here since that's the key bottleneck, you could probably achieve similar outcomes with conventional (even fairly low-end) iGPU/APU hardware plus a fast PCIe x4 5.0 SSD (which would also allow you to overlap SSD transfers with iGPU/APU compute), but the cost would also be in a similar range. (Unless you carefully chose low-end e.g. Intel hardware with proper PCIe x4 5.0 NVMe support - which is still quite uncommon, especially for laptops.)
          • ActorNightly
            19 days ago
            If you want to flex on being able to run 397B parameter models at unusably slow tokens/second, sure.

            You can buy a 3090 for $2k and run Qwen3.5 at 50+ tokens a second, and it will do everything you need, especially if you give it enough context.

  • harshhhhhhhhh
    22 days ago
    Seems promising, this is the way, can someone benchmark this?
    • frwickst
      22 days ago
      I'm getting 6.55t/s using the Qwen3.5-397B-A17B-4bit model with the command: ./infer --prompt "Explain quantum computing" --tokens 100

      MacBook Pro M5 Pro (64GB RAM)

      • j45
        22 days ago
        Appreciate the data point. M5 Max would also be interesting to see once available in desktop form.
      • logicallee
        22 days ago
        can you post the final result (or as far as you got before you killed it) to show us how cohesive and good it is? I'd like to see an example of the output of this.
  • vilequeef
    22 days ago
    Why so much RAM?
    • vilequeef
      22 days ago
      Oh Mac, unified. Sometimes it takes a downvote
  • rvz
    22 days ago
    The technical write up is great, but Mac users should not get too excited just yet on running 300B+ parameter models locally as the TPS isn't that good.

    >...at 4.4+ tokens/second

    That is with 4-bit quantization, and it is still only at that speed.

    > The entire 209GB model streams from SSD through a custom Metal compute pipeline.

    This is my main problem.

    If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

    Can't imagine using this in the long term right now, but improvements will follow. Still a great write up anyways.

    • Roxxik
      22 days ago
      Does an SSD meaningfully degrade by read only workloads?
      • JSR_FDED
        22 days ago
        Nope, reads don’t cause wear
        • zozbot234
          22 days ago
          No appreciable wear of course, but read disturb (requiring occasional rewrites) becomes more of an issue as NAND fabrication advances.
    • etiam
      22 days ago
      > If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

      How sure are you about that? I've never looked closely at how a large mixture-of-experts LLM switches between expert modules, but for use that stays on roughly the same topic (as it often would when editing the same codebase), I wouldn't be surprised if the changes of composition were fairly rare and fairly small, and to the extent switching happens, it tends to cause repeated reads from the flash disk rather than writes.

      • frotaur
        22 days ago
        Afaik the experts are not usually very interpretable, and I'd generally be surprised if at least one did not change every token. I don't know what happens in practice, but I know that at least during training, nothing is done to minimize the number of expert switches between tokens.
        • etiam
          22 days ago
          I'd have thought at least a tiny explicit penalty term for switching, to discourage messing around with the composition without any expected gains from it.

          If one is to use these on hardware that can't keep everything loaded, I guess someone should examine how it works out in practice. Interpretability may be too much to ask, but I can't spontaneously see any reason why the experts can't at least be pushed to incorporate what's needed to remain the good choice for a longer segment.

          • zozbot234
            22 days ago
            The switching is done by layer, not just per token. Every layer is loading completely different parameters, you don't really benefit from continuity. You're generally better off shifting this work to the CPU, since CPU RAM is more abundant than the GPU's VRAM hence it matters less that so much of it is "wasted" on inactive expert layers. Disk storage is even more relatively abundant, so offloading experts to disk if you can't keep them in RAM (as OP does) is the next step.
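            A toy sketch of the routing that drives all this: each layer's router top-k's its own logits for every token, so the set of experts to fetch changes per token and per layer (the numbers below are made up, not Qwen3.5's actual router). Training typically even adds an auxiliary load-balancing loss that spreads tokens across experts, which works against the kind of switching penalty suggested upthread.

```python
# Toy top-k MoE router: every (token, layer) pair gets its own expert
# set, so the working set of expert weights changes constantly.
def top_k_experts(logits, k):
    """Indices of the k largest router logits (ties broken by index)."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]

# Two consecutive tokens at the same layer: different hidden states give
# different router logits, hence disjoint expert sets to fetch.
print(top_k_experts([0.1, 2.0, -1.0, 0.7, 1.5], k=2))  # [1, 4]
print(top_k_experts([1.9, -0.3, 0.8, 2.2, 0.1], k=2))  # [3, 0]
```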
    • Wowfunhappy
      22 days ago
      Eh. I mean, 4 tokens a second works fine if you're patient. Go do something else while you wait.

      I feel like whenever I'm trying to find information on which local models will work on my hardware, I have to overestimate because people don't know how to wait for things.

      Also, reading data doesn't cause SSD wear.

    • hrmtst93837
      22 days ago
      [flagged]
      • K0balt
        22 days ago
        Is it doing a bunch of ssd writes?
        • mkw
          22 days ago
          stream from the SSD, perform the calculation, discard, repeat
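          That read-only loop can be sketched with a memory map: the page cache absorbs repeated reads and nothing is written back, which is why the workload doesn't wear the drive. The file layout and the stand-in "compute" below are made up for illustration:

```python
# Stream -> compute -> discard, sketched with mmap: the slice read is
# the SSD I/O; after close, the mapped pages are reclaimable again.
import mmap
import os
import tempfile

def stream_expert(path, offset, size):
    """Fetch one expert's weight slice from the file and 'use' it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk = mm[offset:offset + size]  # this read triggers the I/O
            return sum(chunk)                 # stand-in for the real matmul

# Demo against a fake 1 KB "weights file":
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 4)
print(stream_expert(path, offset=256, size=256))  # sum(0..255) = 32640
```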