
AI inferencing is at the core of the current AI boom. Also known as machine learning inference, the term describes the process whereby a trained AI model makes predictions or decisions on data it has not seen before. For example, a model trained on a billion images of animals will be able to look at a photo of your cat that you just took and correctly classify it as a cat.
The very act of asking an AI model a question or submitting a request requires it to perform inferencing to respond, and this inferencing is what continuously uses resources at AI data centres and keeps running up the bills. All those investments we hear about in AI data centres, the surge in demand for GPUs, RAM, and the like, are increasingly a result of companies attempting to speed up inference.
Optimising AI inference is critical: it lets AI companies respond faster, scale more easily, and cut costs. Training a new AI model requires enormous amounts of data and compute, but it is a one-time investment relative to inferencing's ongoing cost.
Engineers largely rely on the following techniques to speed up inferencing for their models.
Quantisation
Think of quantisation as a compression technique that reduces the resource usage of an AI model. Much like how JPG compression shrinks an image by throwing away perceptually irrelevant detail, quantisation does the same for AI models. Models are generally trained and stored using high-precision numbers (32-bit floats) but can be quantised to 8-bit integers, reducing the memory needed by about 75% without sacrificing significant accuracy. Effectively, you end up with a simpler model that runs significantly faster using a fraction of the resources it needed previously.
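The core idea can be sketched in a few lines of NumPy. This is a minimal illustration of symmetric int8 quantisation, not any particular framework's implementation; the function names are illustrative:

```python
import numpy as np

def quantise_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a single scale factor."""
    # The largest-magnitude weight maps to the int8 extreme (127).
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for computation."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, s = quantise_int8(w)
# int8 storage needs 1 byte per weight vs 4 for float32 (~75% smaller);
# the price is a small rounding error per weight.
error = np.abs(dequantise(q, s) - w)
```

The rounding error is bounded by half the scale factor per weight, which is why accuracy loss is usually modest for well-behaved weight distributions.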
Pruning
Where quantisation reduces the precision of the model, pruning removes redundant and non-critical parameters. Models are often trained with excess parameters to ensure stability and accuracy, but a significant portion of these can be discarded once training is complete. This is a delicate process, as the model must be continually re-tested against its baseline accuracy while parameters are cut away. The result, however, is a model that is lighter and faster.
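The simplest form of this is magnitude pruning: zero out the weights with the smallest absolute values on the assumption that they contribute least. A rough sketch (illustrative names, and skipping the re-testing loop the text describes):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of smallest-magnitude weights."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest absolute value.
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([0.01, -0.5, 0.02, 1.0, -0.003, 0.8], dtype=np.float32)
pruned = magnitude_prune(w, 0.5)  # drops the 3 smallest-magnitude weights
```

In practice, pruning is done iteratively, with accuracy checks after each round of cuts, and the zeroed weights only save time if the hardware or runtime can exploit the resulting sparsity.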
Knowledge distillation
This is another compression technique. Rather than mathematically simplifying a model, knowledge distillation uses a student–teacher approach: an entirely new, smaller model (the student) is trained to replicate the capabilities of the larger model (the teacher) by ‘learning’ from its outputs. The process has limitations where complex tasks are involved and requires more supervision, but it can prove very effective when implemented properly.
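At the heart of the standard recipe is a loss that pushes the student's output distribution towards a ‘softened’ version of the teacher's. A minimal NumPy sketch of that loss, assuming both models expose raw logits (the temperature and scaling follow the common formulation from Hinton et al.):

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      temperature: float = 2.0) -> float:
    """KL divergence from the teacher's softened distribution to the
    student's, scaled by T^2 so gradients stay comparable across T."""
    p = softmax(teacher_logits, temperature)   # soft targets
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))) * temperature ** 2)

teacher = np.array([3.0, 1.0, 0.2])
student = np.array([2.5, 1.2, 0.3])
loss = distillation_loss(student, teacher)
```

A higher temperature exposes more of the teacher's relative preferences among wrong answers, which is precisely the ‘dark knowledge’ the student learns from. In real training this term is usually blended with an ordinary cross-entropy loss on the true labels.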
Hardware-level optimisations
Choosing the right hardware also plays a key role in optimising AI inference. AI models are very memory-intensive, processing and moving large amounts of data very quickly, so optimising the data layout to minimise unnecessary transfers and buffering can speed up inferencing. It is equally important to use the right processor for the task. A neural processing unit (NPU), for example, is significantly faster than a CPU for neural workloads and is ideal for small models. A GPU is well-suited to training and larger models, while a tensor processing unit (TPU) can be critical for the largest and most advanced ones. The hardware you choose directly impacts inferencing speed and power consumption.
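The data-layout point is easy to demonstrate. A tensor stored non-contiguously (here, a transposed view) forces strided memory access; packing it into one contiguous buffer before inference lets the hardware stream it efficiently. A small NumPy illustration, with an illustrative helper name:

```python
import numpy as np

def pack_for_inference(weights: np.ndarray) -> np.ndarray:
    """Copy a possibly-strided weight tensor into a single contiguous
    buffer so it can be streamed without scattered memory reads."""
    return np.ascontiguousarray(weights)

# A transposed view: same values, but laid out column-wise in memory.
w = np.arange(16, dtype=np.float32).reshape(4, 4).T
packed = pack_for_inference(w)   # identical values, contiguous layout
```

The same principle, at much larger scale, is why inference runtimes reorder weights into hardware-friendly layouts ahead of time rather than paying the cost on every request.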
Optimising AI inferencing is an ongoing challenge that only gets harder as models grow more complex. Fundamentally, however, quantisation, pruning, and knowledge distillation remain the key techniques for speeding up the process. Hardware constraints have recently become more significant, as supply has struggled to meet demand, and are likely to play a key role in shaping how future models are trained and developed.
