Notes on the GUIE competition

Brief post on the 1st place solution of the Google Universal Image Embedding competition on Kaggle

[source]

The Google Universal Image Embedding (GUIE) competition is, as reported on the competition description page, the first competition on image representations that should work across many object types. Image representations are a key element of computer vision models. In the past, embedding learning techniques were applied to each domain separately, rather than developing generic embedding models that could be applied to all domains combined.

Representations are very useful. As a simple example, it is well known that autoencoders learn representations of images. These representations are usually much smaller than the images from which they originate, so one can work directly on them (for example, compare them) without going back to the original images.
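As a toy illustration (not part of the winning solution), here is a minimal PyTorch autoencoder whose encoder output serves as a compact representation; the architecture and sizes are made up for the example:

```python
import torch
import torch.nn as nn

# Toy convolutional autoencoder: the encoder output is the compact representation.
class TinyAutoencoder(nn.Module):
    def __init__(self, embedding_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, embedding_dim),                 # 8192 -> 64
        )
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, 32 * 16 * 16),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)              # compact representation of the image
        return self.decoder(z), z

model = TinyAutoencoder()
images = torch.rand(2, 3, 64, 64)        # two dummy 64x64 RGB images
_, embeddings = model(images)
# Compare the two images directly in representation space.
similarity = torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)
```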

Some of the types of images evaluated in this competition: apparel & accessories, packaged goods, landmarks, furniture & home decor, storefronts, dishes, artwork, toys, memes, illustrations, and cars. The competition requires contestants to develop a model that generates a 64-dimensional embedding for each image. The back-end server then retrieves images of the same instance via a kNN search (k = 5).
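A minimal sketch of that retrieval step, assuming L2-normalized 64-dimensional embeddings and cosine similarity (the actual back-end metric is not specified here); the index and query embeddings are random stand-ins:

```python
import torch
import torch.nn.functional as F

# Toy kNN retrieval over 64-dimensional embeddings, mimicking the evaluation
# setup (k = 5). The index and query embeddings are random stand-ins.
index_embeddings = F.normalize(torch.randn(1000, 64), dim=1)   # "database" images
query_embedding = F.normalize(torch.randn(1, 64), dim=1)       # one query image

# Cosine similarity on L2-normalized vectors, then take the 5 nearest neighbours.
scores = query_embedding @ index_embeddings.T                  # shape (1, 1000)
top_scores, top_ids = scores.topk(k=5, dim=1)
print(top_ids)                                                 # indices of the retrieved images
```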

The competition ended in October 2022. In this post we examine the 1st place solution by Qinghua Cui and Shihao Shao, and report some of Shao’s comments on the strategies and development that led to the winning model.

First attempts

The competition is a bit atypical in that no dataset is provided. From the discussion it emerged that larger datasets lead to better scores, since weights trained on ImageNet-22K perform better than those trained on ImageNet-1K. So the first strategy was to search for weights pre-trained on very large datasets. A good choice to begin with was CLIP, whose code is available in the OpenCLIP repo (see Useful links).

Cui and Shao took the weights of ViT-H pre-trained on Laion-2B, a subset of Laion-5B, as their baseline model. They added a linear projection layer to squeeze the embedding into 64 dimensions, together with an ArcFace head on top of this modification. A Dropout layer with a drop rate of 0.2 was inserted between the last and the second-to-last linear layers. SGD with momentum was chosen as the optimizer, with an L2 weight decay rate of 1.5e-4.
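A rough PyTorch sketch of this baseline follows, not the winners’ actual code. The OpenCLIP checkpoint name, the ArcFace hyper-parameters (s, m), the learning rate, and the momentum value are assumptions; only the 64-d projection, the 0.2 dropout, and SGD with momentum and weight decay 1.5e-4 come from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import open_clip  # https://github.com/mlfoundations/open_clip

class ArcFaceHead(nn.Module):
    """ArcFace logits: cos(theta + m) for the target class, scaled by s (values assumed)."""
    def __init__(self, embedding_dim: int, num_classes: int, s: float = 30.0, m: float = 0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        return self.s * torch.where(target, torch.cos(theta + self.m), cosine)

class BaselineEmbedder(nn.Module):
    def __init__(self, num_classes: int, embedding_dim: int = 64):
        super().__init__()
        # CLIP image tower pre-trained on Laion-2B (OpenCLIP checkpoint name assumed).
        clip_model, _, _ = open_clip.create_model_and_transforms(
            "ViT-H-14", pretrained="laion2b_s32b_b79k")
        self.backbone = clip_model.visual
        feat_dim = self.backbone.output_dim            # 1024 for ViT-H-14
        self.dropout = nn.Dropout(p=0.2)               # drop rate 0.2, as in the post
        self.proj = nn.Linear(feat_dim, embedding_dim) # squeeze the embedding into 64-d
        self.head = ArcFaceHead(embedding_dim, num_classes)

    def forward(self, images, labels=None):
        emb = self.proj(self.dropout(self.backbone(images)))
        if labels is None:
            return emb                                 # inference: the 64-d embedding
        return self.head(emb, labels)                  # training: ArcFace logits

# SGD with momentum and L2 weight decay 1.5e-4 (lr and momentum values assumed).
# model = BaselineEmbedder(num_classes=10000)
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
#                             momentum=0.9, weight_decay=1.5e-4)
```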

Dimensionality reduction algorithms such as random projections, PCA, and t-SNE did not work.
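For reference, such an attempt might have looked like the following sketch, which fits PCA and a random projection down to 64 dimensions on pre-computed backbone features; all names and shapes are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

# Stand-in for pre-computed backbone features (e.g. 1024-d CLIP features).
features = np.random.randn(5000, 1024).astype(np.float32)

pca_64 = PCA(n_components=64).fit(features)
rp_64 = GaussianRandomProjection(n_components=64).fit(features)

compact_pca = pca_64.transform(features)   # (5000, 64)
compact_rp = rp_64.transform(features)     # (5000, 64)
```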

Training

One of the first issues participants became concerned about was the strict competition rule that banned datasets without a commercial license. Later, the rules were updated: licensing was no longer required for the winning model itself, only for the source code used to generate it. All publicly available datasets were fine for model training, as long as they were publicly disclosed on the forum.

Some attempts followed the scheme “choose the datasets, then decide on the model and training details”. So the winners tested various datasets: Products-10k, Shopee, MET Artwork Dataset, Alibaba goods, H&M Personalized Fashion, GPR1200, GLDv2-Full, and the DeepFashion Consumer-to-Shop Clothes Retrieval Benchmark. Datasets were added to the training list iteratively instead of training on all of them from the very beginning. This pushed the score above 0.610.

The winners decided not to follow the usual training scheme of LP-FT (linear probing, then full fine-tuning). Instead, they first trained the last 2 fully-connected layers to full convergence for 6 epochs, then froze the linear layer and trained only the backbone for 3 epochs. Some of the reasons for that decision, and a sketch of the schedule, follow below.

They noted that the weights of the last layer changed rapidly when training all the layers. Furthermore, the central embedding of each class changed rapidly, and so did the Euclidean distances between classes. Hence, they decided to

(a) freeze the final FC layer while training the rest of the model (the backbone);

(b) add dropout to the fully-connected layer, a well-known trick to avoid over-fitting; it does not always work, but it did in this case.
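Here is a minimal sketch of the resulting two-stage schedule, using a stand-in model with the same three parts as the baseline sketch above (backbone, 64-d projection, classification head); the learning rates are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in model with the same three parts as the baseline sketch above:
# a backbone, a 64-d projection, and a classification (ArcFace) head.
model = nn.ModuleDict({
    "backbone": nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1024)),
    "proj":     nn.Linear(1024, 64),
    "head":     nn.Linear(64, 1000),
})

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1 (~6 epochs): freeze the backbone, train the last two FC layers to convergence.
set_requires_grad(model["backbone"], False)
set_requires_grad(model["proj"], True)
set_requires_grad(model["head"], True)
opt_fc = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                         lr=1e-3, momentum=0.9, weight_decay=1.5e-4)

# Stage 2 (~3 epochs): freeze the FC layers, fine-tune only the backbone.
set_requires_grad(model["backbone"], True)
set_requires_grad(model["proj"], False)
set_requires_grad(model["head"], False)
opt_backbone = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                               lr=1e-4, momentum=0.9, weight_decay=1.5e-4)
```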

Products-10k gave the most improvement, so it was used for fine-tuning, respecting the order “first the FC layers, then the backbone”, and reaching a score of 0.671.

Ensemble strategy

Another odd fact was that model ensembling (averaging the outputs) did not give a superior result, as also noted by other participants.

Even when two models are trained on the same datasets, their outputs can differ greatly because of the noise of random mini-batch selection and the randomness in augmentation. Presumably this is driven by the oscillating weights of the final FC layers, so plain ensembling does not work in most cases.

However, ensembling should work when the final FC layers of the different models produce “similar” results. Therefore, the winners tried to apply ensembling while keeping the final two FC layers frozen.
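One way to read this, sketched below, is to map the outputs of two backbones through the same frozen projection, so their 64-d embeddings live in a shared space before averaging; the modules and shapes are made up for illustration and may differ from the winners’ setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two stand-in backbones whose features go through the *same* frozen projection,
# so their 64-d embeddings live in a shared space and can be averaged.
shared_proj = nn.Linear(1024, 64)
for p in shared_proj.parameters():
    p.requires_grad = False                      # frozen final FC layer

backbone_a = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1024))
backbone_b = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1024))

images = torch.rand(4, 3, 64, 64)
emb_a = F.normalize(shared_proj(backbone_a(images)), dim=1)
emb_b = F.normalize(shared_proj(backbone_b(images)), dim=1)
ensemble = F.normalize((emb_a + emb_b) / 2, dim=1)   # averaged 64-d embedding
```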

Finally, Shao wrote:

“I need to redo EVERY THING MENTIONED ABOVE to the new laion-2b VIT-H model thanks to this weight:(, except several changes: 1) drop model ensemble, VIT-H is really a huge guy 2) train on all the datasets at the same time, drop products-10k, leave products-10k as the final fine-tuning datasets.”

Overlapping patches are known to help Vision Transformer models, for example in image segmentation: the last trick was to use a 4-pixel overlap between patches at 290 x 290 resolution.
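In a standard ViT the patch embedding is a convolution with stride equal to the patch size; a 4-pixel overlap can be sketched by reducing the stride, as below. The patch size of 14 and the output width of 1280 follow ViT-H-14 and are assumptions about the winning setup.

```python
import torch
import torch.nn as nn

# A standard ViT patchifies with stride == patch size; reducing the stride by 4
# pixels makes neighbouring patches overlap by 4 pixels.
patch_size, overlap = 14, 4                      # patch size 14 as in ViT-H-14 (assumed)
patch_embed = nn.Conv2d(in_channels=3, out_channels=1280,   # 1280 = ViT-H width
                        kernel_size=patch_size,
                        stride=patch_size - overlap)         # stride 10

images = torch.rand(1, 3, 290, 290)              # the 290 x 290 resolution from the post
tokens = patch_embed(images)                     # (1, 1280, 28, 28) overlapping patch tokens
tokens = tokens.flatten(2).transpose(1, 2)       # (1, 784, 1280) token sequence
print(tokens.shape)
```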

The final performance was 0.732 on the public leaderboard and 0.728 on the private leaderboard.

Useful links

GUIE competition overview page on Kaggle.

1st Place Solution in Google Universal Images Embedding paper.

1st place solution comments by S. Shao.

1st place solution Github repository.

OpenCLIP repo.

ArcFace paper.

Laion-5B dataset page.
