How can we run inference with a language model efficiently?

What techniques can reduce the size of a model?


  1. Fine-tune a Teacher model
  2. Get completions from both models based on the training data
  3. Compute the Distillation Loss between the Teacher's and the Student's output distributions
  4. Update the weights of the Student model based on the Student Loss
  5. Use the Student model for deployment
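The distillation step above can be sketched in PyTorch. This is a minimal illustration, not the course's implementation: the function name, the temperature `T`, and the mixing weight `alpha` are illustrative assumptions. The soft (distillation) loss compares temperature-scaled teacher and student distributions; the hard (student) loss is ordinary cross-entropy on the labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 so gradients keep a comparable magnitude (hypothetical defaults)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth labels (the "student loss")
    hard = F.cross_entropy(student_logits, labels)
    # Blend the two losses; only the student's weights are updated with this loss
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a batch of 4 examples over 10 classes
torch.manual_seed(0)
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.tensor([1, 3, 5, 7])
loss = distillation_loss(student_logits, teacher_logits, labels)
```

In practice the teacher is frozen and only the student receives gradients from `loss`.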

Distillation is more effective for encoder models than for decoder-only generative models.


PTQ (Post-Training Quantization)
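As one concrete flavor of PTQ, PyTorch's dynamic quantization converts a trained float32 model's linear-layer weights to int8 without retraining, quantizing activations on the fly at inference time. The toy model and layer sizes below are assumptions for illustration only:

```python
import torch
import torch.nn as nn

# A small float32 model standing in for a trained network (sizes are arbitrary)
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
model.eval()

# Post-training dynamic quantization: Linear weights stored as int8,
# activations quantized dynamically during each forward pass
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Inference with the quantized model
x = torch.randn(1, 64)
with torch.no_grad():
    out = quantized(x)
```

Static quantization (calibrating activation ranges on sample data) and weight-only int4 schemes are other PTQ variants; all trade a small accuracy loss for a smaller, faster model.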


GGML - AI at the edge - https://ggml.ai/

ggml is a tensor library for machine learning that enables large models and high performance on commodity hardware. It is used by llama.cpp and whisper.cpp.