Vladimir Malinovsky, a researcher in Yandex's research division, has developed a service that runs a large language model with 8 billion parameters on an ordinary computer or even a smartphone, directly in a web browser. Here's an overview of this innovative technology:
Accessible on Standard Devices
- The service uses Llama 3.1-8B, a large language model compressed roughly eightfold in size, from about 20 GB down to 2.5 GB.
- Users can test the service on a dedicated webpage, where the model is downloaded to their device for offline use.
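For a sense of what those sizes mean per weight, here is a back-of-envelope calculation (the 20 GB and 2.5 GB figures come from the description above; the bits-per-parameter arithmetic is our own illustration):

```rust
fn main() {
    // Sizes quoted above; the per-parameter math below is illustrative.
    let params: f64 = 8.0e9;      // Llama 3.1-8B parameter count
    let original_gb: f64 = 20.0;  // reported original footprint
    let compressed_gb: f64 = 2.5; // reported compressed footprint

    // Using decimal gigabytes (1 GB = 1e9 bytes) for simplicity.
    let bits_per_param = |gb: f64| gb * 1.0e9 * 8.0 / params;

    println!("original:   {:.1} bits/param", bits_per_param(original_gb));   // ~20.0
    println!("compressed: {:.1} bits/param", bits_per_param(compressed_gb)); // ~2.5
    println!("ratio: {:.0}x", original_gb / compressed_gb);                  // 8x
}
```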
Offline Functionality
- Once downloaded, the model runs entirely offline, with no internet connection required, keeping data on the device and removing any dependence on cloud services.
Performance
- The speed of the model depends on the device's processing power:
- For instance, on a MacBook Pro with an M1 processor, the model generates approximately 3-4 characters per second.
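Such a figure is easy to reproduce: time a generation loop and count the characters it emits. In the sketch below, `generate_next_chunk` is a hypothetical stand-in for whatever decoding call the runtime actually exposes:

```rust
use std::time::Instant;

// Hypothetical stand-in for the runtime's decoding step; the real service
// exposes its own API. Returns the next chunk of generated text.
fn generate_next_chunk() -> String {
    "token ".to_string()
}

fn main() {
    let start = Instant::now();
    let mut chars_emitted = 0usize;

    // Run the generation loop for about one second, counting characters.
    while start.elapsed().as_secs_f64() < 1.0 {
        chars_emitted += generate_next_chunk().chars().count();
    }

    let cps = chars_emitted as f64 / start.elapsed().as_secs_f64();
    println!("throughput: {:.1} chars/sec", cps);
}
```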
Built with Modern Technologies
- Rust and WebAssembly:
- The service is written in Rust and compiled to WebAssembly, a binary format that lets code written in languages such as Rust run at near-native speed inside the browser, on any platform with a modern browser.
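To illustrate the pattern (this is not the project's actual code), here is a minimal Rust function exported to the browser, assuming the widely used wasm-bindgen crate:

```rust
use wasm_bindgen::prelude::*;

// Exported to JavaScript: once the module is loaded, the page can call
// `token_count("...")`. A toy whitespace split stands in for a real tokenizer.
#[wasm_bindgen]
pub fn token_count(text: &str) -> u32 {
    text.split_whitespace().count() as u32
}
```

Built with `wasm-pack build --target web`, this compiles to a module that any modern browser can load, which is what lets a single Rust codebase serve every platform.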
Advanced Compression Techniques
- The service employs cutting-edge methods developed collaboratively by:
- Yandex Research
- Institute of Science and Technology Austria (ISTA)
- King Abdullah University of Science and Technology (KAUST)
Two Core Tools
- Model Compression:
- Compresses models up to eightfold, allowing them to run on a single GPU instead of multiple GPUs (a simplified sketch of the underlying idea follows this list).
- Error Correction:
- Mitigates the errors introduced during compression, preserving the quality of the neural network's responses.
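The announcement does not spell out the algorithms, but extreme compression of this kind is typically built on codebook (vector) quantization: weights are replaced by short indices into a small learned codebook, and a correction stage then minimizes the reconstruction error that quantization introduces. A deliberately simplified sketch of that idea, not the actual method:

```rust
// Toy codebook quantization: each weight becomes the index of its nearest
// codebook entry. Real methods quantize groups of weights with learned
// codebooks and fine-tune afterwards; this only shows the core idea.
fn quantize(weights: &[f32], codebook: &[f32]) -> Vec<u8> {
    weights
        .iter()
        .map(|w| {
            codebook
                .iter()
                .enumerate()
                .min_by(|(_, a), (_, b)| {
                    (w - *a).abs().partial_cmp(&(w - *b).abs()).unwrap()
                })
                .map(|(i, _)| i as u8)
                .unwrap()
        })
        .collect()
}

fn dequantize(indices: &[u8], codebook: &[f32]) -> Vec<f32> {
    indices.iter().map(|&i| codebook[i as usize]).collect()
}

fn main() {
    let weights = [0.12f32, -0.48, 0.33, 0.02, -0.19];
    // A 4-entry codebook means 2 bits per weight instead of 32.
    let codebook = [-0.4f32, -0.1, 0.1, 0.4];

    let indices = quantize(&weights, &codebook);
    let recon = dequantize(&indices, &codebook);

    // Reconstruction error: the quantity an error-correction stage
    // (e.g. fine-tuning the codebooks) works to drive down.
    let mse: f32 = weights
        .iter()
        .zip(&recon)
        .map(|(w, r)| (w - r).powi(2))
        .sum::<f32>()
        / weights.len() as f32;

    println!("indices: {:?}", indices);
    println!("mse: {:.4}", mse);
}
```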
Launch and Open Source
- The project was first presented in summer 2024 and has since been made available to the public.
- The source code is openly accessible on GitHub, inviting developers to explore and build upon this innovation.