Vladimir Malinovsky, a researcher at Yandex Research, has developed a service that runs a large language model with 8 billion parameters on ordinary computers, and even smartphones, directly in a web browser. Here's an overview of the technology:

Accessible on Standard Devices

  • The service uses Llama 3.1-8B, a large language model compressed to one-eighth of its original size, from 20 GB down to 2.5 GB.
  • Users can test the service on a dedicated webpage, where the model is downloaded to their device for offline use.
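As a back-of-envelope check on those figures (taking the quoted sizes and the 8-billion-parameter count as given), the compression works out to roughly 2.5 bits stored per weight:

```rust
// Back-of-envelope: bits per parameter before and after compression,
// using the sizes quoted above (20 GB -> 2.5 GB for ~8e9 parameters).
// 1 GB is taken as 1e9 bytes for simplicity.
fn bits_per_param(size_gb: f64, params: f64) -> f64 {
    size_gb * 8e9 / params // GB -> bits, divided by parameter count
}

fn main() {
    let params = 8e9;
    let before = bits_per_param(20.0, params); // ~20 bits per weight
    let after = bits_per_param(2.5, params);   // ~2.5 bits per weight
    println!("before: {before:.1} bits/param, after: {after:.1} bits/param");
    println!("compression ratio: {:.0}x", before / after);
}
```

Around 2.5 bits per weight is far below the 16 bits of a standard half-precision model, which is why the compressed model fits in a browser's memory.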

Offline Functionality

  • After downloading, the model operates entirely without requiring an internet connection, ensuring privacy and independence from cloud services.

Performance

  • The speed of the model depends on the device's processing power:
    • For instance, on a MacBook Pro with an M1 processor, the model generates approximately 3-4 characters per second.
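For intuition, the quoted character rate can be converted into an approximate token rate. Assuming an average of about four characters per token for English text (an assumption for illustration, not a figure from the project):

```rust
fn main() {
    // Rough conversion from the quoted generation speed to tokens/sec.
    // ~4 characters per token is an assumed average, not a measured figure.
    let chars_per_sec = 3.5; // midpoint of the quoted 3-4 chars/sec
    let chars_per_token = 4.0;
    let tokens_per_sec = chars_per_sec / chars_per_token;
    println!("~{tokens_per_sec:.2} tokens per second");
}
```

Under that assumption the model produces on the order of one token per second, each token requiring a full forward pass through the network, which is why throughput scales with the device's processing power.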

Built with Modern Technologies

  • Rust and WebAssembly:
    • The service is written in Rust and compiled to WebAssembly, a portable binary format that lets code written in many languages run efficiently inside a web browser on any platform.
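As an illustration of the approach (not the project's actual code), Rust functions are typically exposed to browser JavaScript through the `wasm-bindgen` crate. The function name and signature below are hypothetical, and the snippet requires the `wasm-bindgen` crate and a `wasm32-unknown-unknown` build target, so it is a sketch rather than a runnable program:

```rust
// Hypothetical sketch: exporting a Rust entry point to browser JavaScript
// via wasm-bindgen. In the real service, a function like this would drive
// the compressed model's token-by-token generation loop.
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn generate(prompt: &str, max_tokens: u32) -> String {
    // Placeholder body; the actual implementation runs the model's
    // forward pass for each generated token.
    format!("(generation for '{prompt}', up to {max_tokens} tokens)")
}
```

Once compiled to WebAssembly, the exported function can be called from ordinary JavaScript on the page, which is what lets the whole model run client-side.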

Advanced Compression Techniques

  • The service employs cutting-edge methods developed collaboratively by:
    • Yandex Research
    • Institute of Science and Technology Austria (ISTA)
    • King Abdullah University of Science and Technology (KAUST)

Two Core Tools

  1. Model Compression:
    • Compresses models by up to a factor of eight, enabling a model to run on a single GPU instead of several.
  2. Error Correction:
    • Mitigates errors introduced during compression, ensuring the high quality of the neural network's responses.
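A heavily simplified sketch of the idea behind the first tool: each weight is replaced by the index of the nearest entry in a small shared codebook, so only a few bits per weight need to be stored. The real methods use far more sophisticated vector quantization; this toy scalar version, with illustrative names and values, only shows where the compression error that the second tool must correct comes from:

```rust
// Toy codebook quantization: store each weight as an index into a small
// shared codebook instead of a full float. Illustrative only; not the
// project's actual algorithm.

/// Index of the codebook entry closest to weight `w`.
fn nearest(codebook: &[f32], w: f32) -> u8 {
    let mut best = 0;
    for i in 1..codebook.len() {
        if (w - codebook[i]).abs() < (w - codebook[best]).abs() {
            best = i;
        }
    }
    best as u8
}

fn quantize(weights: &[f32], codebook: &[f32]) -> Vec<u8> {
    weights.iter().map(|&w| nearest(codebook, w)).collect()
}

fn dequantize(indices: &[u8], codebook: &[f32]) -> Vec<f32> {
    indices.iter().map(|&i| codebook[i as usize]).collect()
}

fn main() {
    // A 2-bit codebook: 4 entries shared by all weights.
    let codebook = [-0.5f32, -0.1, 0.1, 0.5];
    let weights = [0.43f32, -0.12, 0.08, -0.55];

    let idx = quantize(&weights, &codebook);
    let approx = dequantize(&idx, &codebook);

    // The gap between each weight and its reconstruction is the
    // compression error the correction step must compensate for.
    for (w, a) in weights.iter().zip(&approx) {
        println!("{w:+.2} -> {a:+.2} (err {:+.2})", w - a);
    }
}
```

Storing 2-bit indices plus one small codebook, rather than 16-bit floats, is what yields the large size reduction; the residual errors visible above are what the error-correction step is designed to mitigate.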

Launch and Open Source

  • The project was first presented in summer 2024 and has since been made available to the public.
  • The source code is openly accessible on GitHub, inviting developers to explore and build upon this innovation.