Run LLMs on the Apple Neural Engine.
Open-source library and pipeline taking HuggingFace models to on-device inference.
(pronounced like "animal")
The flagship library for porting Large Language Models to the Apple Neural Engine (ANE). Convert models directly from HuggingFace to CoreML, optimized for ANE tensor processing. Includes Swift and Python inference, iOS/macOS/visionOS sample apps, and a full conversion pipeline — all targeting low-power, on-device, fully private AI.
Optimized for Apple Neural Engine tensor processing. Specialized model splitting for iOS (1GB) and macOS (2GB) constraints.
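The 1GB/2GB figures refer to per-file size limits on compiled models, so large networks must be split into chunks that each fit the budget. A minimal sketch of the greedy chunking idea (the function name and strategy here are illustrative, not Anemll's actual splitter):

```python
def split_into_chunks(layer_sizes, budget_bytes):
    """Greedily group consecutive layers into chunks that each stay
    under budget_bytes. Illustrative only -- real splitting must also
    respect graph boundaries and I/O shapes between chunks."""
    chunks, current, current_size = [], [], 0
    for i, size in enumerate(layer_sizes):
        if size > budget_bytes:
            raise ValueError(f"layer {i} ({size} bytes) exceeds the budget alone")
        if current_size + size > budget_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

For example, seven 300 MB layers under a 1 GB iOS budget pack three layers per chunk, leaving a final chunk of one.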
Single-shot conversion from HuggingFace weights to CoreML. Auto-download, FP16, LUT quantization, monolithic and chunked modes.
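LUT (lookup-table) quantization replaces each weight with a small index into a shared table of representative values. The sketch below uses uniformly spaced levels for simplicity; production palettization (e.g. coremltools' k-means-based `palettize_weights`) chooses the table to minimize error, and this function is illustrative rather than Anemll's implementation:

```python
def lut_quantize(weights, nbits=4):
    """Map each weight to the nearest of 2**nbits uniformly spaced levels.
    Returns (indices, lut); dequantize as lut[i] for each index i.
    Assumes weights are not all identical (step would be zero)."""
    levels = 1 << nbits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1)
    lut = [lo + step * k for k in range(levels)]
    indices = [min(round((w - lo) / step), levels - 1) for w in weights]
    return indices, lut
```

With 4-bit indices the per-weight storage drops from 16 bits (FP16) to 4 bits plus one shared 16-entry table.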
Reference inference in both Swift CLI and Python. IOSurface-backed buffers, serial prediction, ring buffer patterns for ANE stability.
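The ring-buffer pattern keeps a fixed-size allocation and overwrites the oldest entries in place, which avoids reallocating buffers mid-generation. A minimal sketch of the idea (in Anemll the real buffers are IOSurface-backed MLMultiArrays driven from Swift; this pure-Python class only illustrates the indexing):

```python
class RingKVCache:
    """Fixed-capacity cache: once full, new entries overwrite the oldest.
    A fixed allocation is friendlier to ANE buffer management than
    growing a cache every token."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = [None] * capacity
        self.count = 0  # total entries ever written

    def append(self, kv):
        self.buf[self.count % self.capacity] = kv
        self.count += 1

    def window(self):
        """Return cached entries in oldest-to-newest order."""
        if self.count <= self.capacity:
            return self.buf[:self.count]
        start = self.count % self.capacity
        return self.buf[start:] + self.buf[:start]
```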
Redesigned iOS/macOS/visionOS app with voice input, AirDrop sharing, Markdown rendering, and thinking mode. On TestFlight now.
ANE CostModel profiler for CoreML models. Analyze the compute plan without Xcode — see which ops land on ANE vs CPU vs GPU, find bottlenecks, and verify hardware compatibility. Per-op runtime breakdown, GFLOP/s, bandwidth, and CPU/GPU fallback reasons. Ideal for AI agents and automated pipelines — CLI-first, no GUI required.
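At its core, a compute-plan report boils down to tallying which device each operation landed on and why some fell back. The sketch below works on a hypothetical list-of-dicts schema (the field names are assumptions for illustration, not the profiler's actual output format):

```python
from collections import Counter

def summarize_compute_plan(ops):
    """Tally where each op is scheduled and collect fallback reasons.
    `ops`: list of dicts like {"name": str, "device": "ANE"|"GPU"|"CPU",
    "fallback_reason": optional str} -- a stand-in for real
    compute-plan data, not the profiler's actual schema."""
    by_device = Counter(op["device"] for op in ops)
    fallbacks = {op["name"]: op["fallback_reason"]
                 for op in ops if op.get("fallback_reason")}
    return by_device, fallbacks
```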
Apple Neural Engine bandwidth and performance benchmarks. Measure real ANE vs GPU vs CPU throughput across Apple Silicon devices — understand what the hardware can actually deliver.
On the App Store (Beta)
On-device LLM chat powered by Apple Neural Engine. Voice input, AirDrop model sharing, Markdown rendering, thinking mode. Runs models locally with full privacy. Full source included in the Anemll project.
Join TestFlight
AI-powered coding assistant for iOS. Intelligent codebase navigation, code generation, and development tools — all on your iPhone.
Trending Now
Flash-MoE sidecar slot-bank runtime for large GGUF MoE models on Apple Silicon. A llama.cpp fork enabling massive Mixture-of-Experts models to run efficiently on devices with limited memory — experts stream from SSD while dense weights stay resident.
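The slot-bank idea is that only a bounded number of expert weight sets are resident at once; a routed-to expert that isn't loaded is streamed in, evicting a stale one. A minimal LRU sketch under assumed names (`ExpertSlotBank` and the eviction policy are illustrative, not Flash-MoE's actual runtime):

```python
from collections import OrderedDict

class ExpertSlotBank:
    """Keep at most `slots` experts in memory; evict the least
    recently used. load_fn(expert_id) stands in for reading expert
    weights from SSD."""
    def __init__(self, slots, load_fn):
        self.slots = slots
        self.load_fn = load_fn
        self.cache = OrderedDict()  # expert_id -> weights, LRU order

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as recently used
            return self.cache[expert_id]
        if len(self.cache) >= self.slots:
            self.cache.popitem(last=False)  # evict LRU expert
        weights = self.load_fn(expert_id)
        self.cache[expert_id] = weights
        return weights
```

Because MoE routing is sparse, a small slot bank can serve most tokens from cache while only occasionally paying SSD latency for a cold expert.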
Run massive Mixture-of-Experts models on Apple devices — Mac, iPhone, and more. A family of projects that make the impossible possible.
The core Flash-MoE research repo — testing different MoE algorithms, memory management strategies, and expert routing on Apple Silicon.
Flash-MoE sidecar slot-bank runtime for large GGUF MoE models on Apple Silicon — llama.cpp fork.
Flash-MoE iOS — first 400B model running on iPhone. Based on @Alexintosh's work. Adds macOS compat, iOS memory entitlements, fanout I/O, and pread chunking.
Auto-research tuned MLX fork optimized for Flash-MoE experiments — fast MoE inference and algorithm prototyping on Apple's ML stack.
Open Source
Experiments with MLX and RDMA — pushing the boundaries of distributed ML on Apple hardware.
AI-powered iOS coding assistant — intelligent codebase navigation and development tools for mobile.
Visual development agent for macOS — fast UI iterations via iPhone Mirroring, screen capture, and automated interaction loops.
Thunderbolt block floating point experiments — high-bandwidth ML interconnects.