LLMs on Apple Silicon with MLX
Unleash the power of Apple Silicon by running large language models locally with MLX.

Running Large Language Models on Apple Silicon with MLX
In this post, we’ll explore how to leverage the power of Apple Silicon hardware (M1, M2, M3) to run large language models locally using MLX. MLX is an open-source project that enables GPU acceleration on Apple’s Metal backend, allowing you to harness the unified CPU/GPU memory for efficient model execution.
Installing MLX on macOS
MLX supports GPU acceleration on Apple’s Metal backend through the mlx-lm Python package. To get started, follow the instructions provided in the mlx-lm package installation guide.
Note: MLX is currently supported only on Apple Silicon (M-series) Macs.
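Once mlx-lm (and its mlx dependency) is installed, a quick way to confirm the Metal backend is active is to check MLX’s default device from Python. This is a small sketch under the assumption of a standard install; the exact device repr may differ between MLX versions.

```python
# Quick post-install check that MLX can see the Metal GPU.
# (Install first with: pip install mlx-lm)
import mlx.core as mx

# On Apple Silicon the default device should be the GPU, e.g. Device(gpu, 0).
print(mx.default_device())
```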
Loading Models with MLX
While MLX supports common HuggingFace models directly, it is recommended to use the converted and quantized models provided by the mlx-community. These models are pre-converted and quantized for efficient performance on Apple Silicon; choose a quantization level that fits your device’s memory and capabilities.
To load a model with MLX, follow these steps:
- Browse the available models on HuggingFace.
- Copy the model identifier from the model page in the format <author>/<model_id> (e.g., mlx-community/Meta-Llama-3-8B-Instruct-4bit).
- Check the model size. Models that fit entirely in the unified CPU/GPU memory tend to perform better.
- Follow the instructions in Run OpenAI Compatible Server Locally to launch the model server:
  mlx_lm.server --model <author>/<model_id>
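Before wiring the server into a client, it can be useful to confirm that the model loads and generates on your machine. The sketch below follows the usage pattern from the mlx-lm documentation; the prompt and max_tokens value are illustrative, and argument names may vary slightly between mlx-lm releases.

```python
# Minimal sanity check: load a quantized mlx-community model and generate a reply.
# Assumes `pip install mlx-lm` has completed and the model fits in unified memory.
from mlx_lm import load, generate

model_id = "mlx-community/Meta-Llama-3-8B-Instruct-4bit"

# load() downloads the weights from Hugging Face on first use and returns
# the model together with its tokenizer.
model, tokenizer = load(model_id)

# generate() runs decoding on the Metal backend and returns the generated text.
response = generate(
    model,
    tokenizer,
    prompt="Explain unified memory on Apple Silicon in one sentence.",
    max_tokens=100,
)
print(response)
```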
Configuring RumiTalk
To use MLX with RumiTalk, you’ll need to add it as a separate endpoint in the rumitalk.yaml configuration file; an example configuration for the Llama-3 model is sketched below. Follow the Custom Endpoints & Configuration Guide for more details.
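The authoritative schema for rumitalk.yaml is the Custom Endpoints & Configuration Guide linked above; the snippet below is only a hypothetical sketch of what such an entry might look like, assuming the server was started with mlx_lm.server on its default address (127.0.0.1:8080) and that RumiTalk accepts the common name/apiKey/baseURL/models fields for OpenAI-compatible endpoints.

```yaml
# Hypothetical custom-endpoint entry for rumitalk.yaml -- verify field names
# against the Custom Endpoints & Configuration Guide before using.
endpoints:
  custom:
    - name: "MLX"                            # label shown in the UI (assumed field)
      apiKey: "mlx"                          # local server ignores the key; a placeholder is enough
      baseURL: "http://127.0.0.1:8080/v1/"   # default host/port of mlx_lm.server
      models:
        default:
          - "mlx-community/Meta-Llama-3-8B-Instruct-4bit"
```

After restarting RumiTalk with this configuration, the MLX endpoint should appear alongside your other endpoints and route requests to the local mlx_lm.server instance.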
With MLX, you can now enjoy the benefits of running large language models locally on your Apple Silicon hardware, unlocking new possibilities for efficient and powerful natural language processing tasks.