Question

You are developing a custom image classification model in Python. You plan to run your training application on Vertex AI. Your input dataset contains several hundred thousand small images. You need to determine how to store and access the images for training. You want to maximize data throughput and minimize training time while reducing the amount of additional code. What should you do?

Accepted Answer

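Filestore gives the training job a mounted NFS file share, which typically serves file reads with lower latency than Cloud Storage, and serialized records (such as TFRecord or WebDataset shards) feed a training pipeline faster than individual image files. Reading several hundred thousand small images one at a time from Cloud Storage incurs a request and a metadata lookup per file, so per-file overhead dominates the read time. Serialized records on Cloud Storage already reduce that overhead, but Filestore combined with serialized records gives the highest data throughput, and the standard readers for these formats keep the additional code small.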
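
To illustrate why serialized records help, here is a minimal sketch that packs many small images into a few tar shards and streams them back sequentially. It follows the WebDataset convention of grouping files that share a basename (for example `img0001.jpg` and `img0001.cls`) into one sample. The function names `write_shards` and `read_shard`, the `shard-00000.tar` naming, and the shard size are assumptions made for this example, not part of any Vertex AI or Filestore API. It uses only the Python standard library; a real pipeline would more likely use an existing TFRecord or WebDataset reader.

```python
import io
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, samples_per_shard=1000):
    """Pack (key, image_bytes, label) samples into numbered tar shards.

    Hypothetical helper: out_dir could be a directory on a mounted
    Filestore share, but any local directory works the same way.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    for index, (key, image_bytes, label) in enumerate(samples):
        # Start a new shard every samples_per_shard samples.
        if index % samples_per_shard == 0:
            if tar is not None:
                tar.close()
            path = out_dir / f"shard-{len(shard_paths):05d}.tar"
            shard_paths.append(path)
            tar = tarfile.open(path, "w")
        # Store the image and its label as two members sharing one basename.
        for suffix, payload in (("jpg", image_bytes), ("cls", str(label).encode())):
            info = tarfile.TarInfo(name=f"{key}.{suffix}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return shard_paths


def read_shard(path):
    """Yield (key, image_bytes, label) by reading one shard front to back."""
    with tarfile.open(path, "r") as tar:
        pending = {}
        for member in tar:
            key, suffix = member.name.rsplit(".", 1)
            pending.setdefault(key, {})[suffix] = tar.extractfile(member).read()
            # Emit the sample once both of its members have been seen.
            if pending[key].keys() >= {"jpg", "cls"}:
                fields = pending.pop(key)
                yield key, fields["jpg"], int(fields["cls"])
```

With this layout, the training job opens a handful of large files and reads them sequentially instead of issuing one request per image, which is the access pattern the answer relies on.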
