How can we run inference with a language model efficiently?

What techniques can reduce the size of a model?


  1. Fine-tune a Teacher model
  2. Get completions from both models based on the training data
  3. Compute the Distillation Loss between the Teacher's and the Student's output distributions
  4. Update the weights of the Student model based on the Student Loss
  5. Use the Student model for deployment
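The distillation step above can be sketched in PyTorch. This is a minimal illustration, not the course's implementation: the function name, the temperature `T`, and the mixing weight `alpha` are illustrative assumptions. The soft (distillation) loss compares temperature-scaled teacher and student distributions; the hard (student) loss is ordinary cross-entropy on the labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 so gradients keep a comparable magnitude (hypothetical defaults)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth labels (the "student loss")
    hard = F.cross_entropy(student_logits, labels)
    # Blend the two losses; only the student's weights are updated with this loss
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a batch of 4 examples over 10 classes
torch.manual_seed(0)
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.tensor([1, 3, 5, 7])
loss = distillation_loss(student_logits, teacher_logits, labels)
```

In practice the teacher is frozen and only the student receives gradients from `loss`.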

Distillation is more effective for encoder models than for decoder-only generative models.


PTQ (Post-Training Quantization)
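As one concrete flavor of PTQ, PyTorch's dynamic quantization converts a trained float32 model's linear-layer weights to int8 without retraining, quantizing activations on the fly at inference time. The toy model and layer sizes below are assumptions for illustration only:

```python
import torch
import torch.nn as nn

# A small float32 model standing in for a trained network (sizes are arbitrary)
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
model.eval()

# Post-training dynamic quantization: Linear weights stored as int8,
# activations quantized dynamically during each forward pass
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Inference with the quantized model
x = torch.randn(1, 64)
with torch.no_grad():
    out = quantized(x)
```

Static quantization (calibrating activation ranges on sample data) and weight-only int4 schemes are other PTQ variants; all trade a small accuracy loss for a smaller, faster model.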


GGML - AI at the edge - https://ggml.ai/

ggml is a tensor library for machine learning that enables large models and high performance on commodity hardware. It is used by llama.cpp and whisper.cpp.