Google Sample Question 10 of 15

You downloaded a TensorFlow language model pre-trained on a proprietary dataset by another company, and you tuned the model with Vertex AI Training by replacing the last layer with a custom dense layer. The model achieves the expected offline accuracy; however, it exceeds the required online prediction latency by 20ms. You want to reduce latency while minimizing the offline performance drop and modifications to the model before deploying the model to production. What should you do?

Source: Google Cloud OFFICIAL

Official sample question published by Google Cloud. WiseOwlLearns is not affiliated with Google LLC.

All explanations and Option Analyzer™ content are generated by WiseOwlLearns and are not endorsed by Google Cloud.

A Apply post-training quantization on the tuned model, and serve the quantized model. ✓ Correct
B Apply knowledge distillation to train a new, smaller "student" model that mimics the behavior of the larger, fine-tuned model.
C Use pruning to tune the pre-trained model on your dataset, and serve the pruned model after stripping it of training variables.
D Use clustering to tune the pre-trained model on your dataset, and serve the clustered model after stripping it of training variables.
🦉 Explanation by WiseOwl Tutor™ — not endorsed by Google

Post-training quantization is the recommended option for reducing model latency when re-training is not possible, and it typically causes only a minimal drop in model performance. It also leaves the model architecture unchanged, which keeps modifications to a minimum. Distillation, pruning, and clustering all involve tuning the whole model; doing that with only the custom dataset (the original pre-training data is proprietary) either causes a drop in offline performance or requires significant re-training.
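To make the "minimal performance drop" point concrete, here is a small sketch of the arithmetic behind symmetric int8 weight quantization. It is an illustration only, not TensorFlow's implementation: the function names and the sample weights are hypothetical, and real toolchains typically add per-channel scales, zero points, and activation calibration.

```python
# Illustrative sketch of symmetric per-tensor int8 quantization.
# All names and values are hypothetical; this is not the TensorFlow code path.

def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in q]

# Hypothetical weights from a trained layer (no re-training is involved).
weights = [0.42, -1.27, 0.003, 0.88, -0.51]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight moves by at most half a quantization step (`scale / 2`), which is why accuracy usually changes little while the model becomes smaller and integer arithmetic can speed up inference. For a TensorFlow SavedModel, the same idea is commonly applied through the TensorFlow Lite converter by setting `converter.optimizations = [tf.lite.Optimize.DEFAULT]` before calling `convert()`. Whether that serving path fits a given deployment is a separate question this sketch does not address.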
