Two questions: 1) Anyone have any idea of VRAM requirements? 2) When will this b...

causal · on July 18, 2024

1) Rule of thumb is # of params = GB at Q8. So a 12B model generally takes up 12GB of VRAM at 8 bit precision.

But 4bit precision is still pretty good, so 6GB VRAM is viable, not counting additional space for context. Usually about an extra 20% is needed, but 128K is a pretty huge context so more will be needed if you need the whole space.

alecco · on July 18, 2024

The model has 12 billion parameters and uses FP8, so 1 byte each. With some working memory I'd bet you can run it on 24GB.

> Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU