TAPIR: The Secret Behind Google’s Incredible New Image Recognition AI!

Md. Mahmudun Nabi

Researchers at Google DeepMind have created a new AI model known as TAPIR and it's honestly one of the coolest things I've seen in the field of computer vision.

[Computer vision is a branch of artificial intelligence that deals with understanding and analyzing visual information, such as images, videos, or live streams.]


Computer vision models can do amazing things, like recognizing faces, detecting objects, segmenting scenes, generating captions, and much more. These models can help us derive meaningful insights from different types of media and use them for various applications, such as security, entertainment, education, healthcare, and so on.

How do computer vision systems function?

Computer vision systems use deep learning techniques to learn from large amounts of data and extract features that are relevant to the task at hand. For example, if you want to recognize a person's face in a photo, you need a model that can learn to identify the key characteristics of a face, such as the shape of the eyes, nose, and mouth.

Then you need a model that can compare the features of the face in the photo with the features of the faces in your database and find the best match.
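To make the "find the best match" step concrete, here is a minimal sketch. It assumes the faces have already been turned into feature vectors by some model. The names `euclidean_distance`, `best_match`, and the toy `database` are hypothetical and only for illustration; real systems use much longer vectors and often a different distance measure.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(query_features, database):
    """Return the name whose stored features are closest to the query."""
    return min(
        database,
        key=lambda name: euclidean_distance(query_features, database[name]),
    )


# Toy 3-dimensional "face features" (made-up numbers, not real embeddings).
database = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}

# Features extracted from the face in the new photo (also made up).
match = best_match([0.85, 0.15, 0.35], database)
```

In practice you would also want a distance threshold, so that a face that is not in the database at all is not forced onto its nearest neighbor.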

What if you want to track the tip of someone's nose or the center of their pupil as they move around in a video?

This is where things get tricky. Tracking a point in a video is not as easy as finding a point in a single image. You have to deal with challenges such as:

occlusion: when the point is hidden by another object or part of the object
motion blur: when the point becomes blurry due to fast movement
illumination changes: when the brightness or color of the point changes due to lighting conditions
scale variations: when the size or shape of the point changes due to perspective or distance.

These factors can make it hard for the model to keep track of the point as it moves across different frames. Now this is where TAPIR comes into the picture.

TAPIR

TAPIR stands for Tracking Any Point with per-frame Initialization and temporal Refinement. It's a new model that can effectively track any point on any physical surface throughout a video sequence. It doesn't matter if the point is on a person's face, a car's wheel, a bird's wing, or anything else. It can handle it all. It was developed by a team of researchers from Google DeepMind and the Visual Geometry Group (VGG) in the Department of Engineering Science at the University of Oxford. They published their paper on arXiv on June 14, 2023, and they also open-sourced their code and pre-trained models on GitHub.

How does TAPIR work?

It uses a two-stage algorithm that consists of:

1. A matching stage

2. A refinement stage

Matching stage

The matching stage is where it analyzes each video frame separately and tries to find a suitable candidate point match for the query point. The query point is the point that you want to track in the video sequence. For example, if you want to track the tip of someone's nose in a video, then that'll be your query point.

To find the candidate point match for the query point in each frame, it uses a deep neural network that takes as input an image patch around the query point and outputs a feature vector that represents its appearance. Then it compares this feature vector with the feature vectors of all possible locations in each frame using cosine similarity and picks the most similar one as the candidate point match. This way, TAPIR can find the most likely match for the query point in each frame independently. This makes it robust to occlusion and motion blur, because even if the query point is not visible or clear in some frames, it can still find its best approximation based on its appearance. But finding candidate point matches is not enough to track the query point accurately.
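Here is a rough sketch of the per-frame matching idea described above, in plain Python. It assumes each frame has already been turned into a set of feature vectors, one per location; the network that produces them is left out. The function names and the toy numbers are my own, and the released TAPIR code may well compute the match differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_in_frame(query_feature, frame_features):
    """Pick the location in one frame whose feature is most similar to the query.

    frame_features maps an (x, y) location to its feature vector.
    """
    best_location, best_score = None, -math.inf
    for location, feature in frame_features.items():
        score = cosine_similarity(query_feature, feature)
        if score > best_score:
            best_location, best_score = location, score
    return best_location, best_score


def match_all_frames(query_feature, video_features):
    """Run the matching independently on every frame of the video."""
    return [match_in_frame(query_feature, frame) for frame in video_features]


# Toy example: 2-dimensional features at two locations per frame (made-up numbers).
query = [1.0, 0.0]
frame0 = {(0, 0): [0.0, 1.0], (1, 0): [2.0, 0.1]}
frame1 = {(0, 0): [0.5, 0.5], (0, 1): [3.0, 0.0]}
candidates = match_all_frames(query, [frame0, frame1])
```

Notice that each frame is handled on its own. No frame depends on the result from the previous one, which is why a bad frame does not derail the rest of the track.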

Refinement stage

You also need to take into account how the query point moves over time and how its appearance changes due to factors like illumination or scale variations. This is where the refinement stage comes in. The refinement stage is where TAPIR updates both the trajectory and the query features based on local correlations.

The trajectory is the path followed by the query point throughout the video sequence, and the query features are the feature vectors that represent its appearance. To update the trajectory and the query features, it uses another deep neural network. That network takes as input a small image patch around the candidate point match in each frame and outputs a displacement vector, which indicates how much the candidate point match should be shifted to match the query point more precisely. Then it applies this displacement vector to the candidate point match to obtain a refined point match that is closer to the true query point.

In simple terms, the system works by examining small parts of an image, figuring out how much to adjust a selected point to match a target point, and then moving the selected point closer to that target. TAPIR also updates the query features by averaging the feature vectors of the refined point matches over time. This way it can adapt to changes in the query point's appearance and maintain a consistent representation of it. By combining these two stages, it can track any point in a video sequence with high accuracy and precision. It can handle videos of various sizes and quality. It can also track multiple points simultaneously.
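The refinement step can be sketched in a few lines too. In this sketch the displacement vectors and the features of the refined matches are hard-coded placeholders, since in the real model they come from a neural network. The helper names are hypothetical, and this is only an illustration of "shift the point, then average the features", not the actual TAPIR implementation.

```python
def refine_point(candidate, displacement):
    """Shift a candidate (x, y) location by a predicted (dx, dy) displacement."""
    return (candidate[0] + displacement[0], candidate[1] + displacement[1])


def average_features(feature_vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    count = len(feature_vectors)
    return [sum(values) / count for values in zip(*feature_vectors)]


def refine_track(candidates, displacements, refined_features):
    """Apply one refinement pass: move every candidate, then update the query feature."""
    trajectory = [refine_point(c, d) for c, d in zip(candidates, displacements)]
    query_feature = average_features(refined_features)
    return trajectory, query_feature


# Placeholder inputs standing in for network outputs (made-up numbers).
candidates = [(10.0, 5.0), (12.0, 6.0)]
displacements = [(0.5, -0.25), (-1.0, 0.5)]
refined_features = [[1.0, 0.0], [0.0, 1.0]]

trajectory, query_feature = refine_track(candidates, displacements, refined_features)
```

Running a pass like this several times, each time with fresh displacements, is one way to picture how the track gets closer to the true point step by step.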

How does TAPIR perform on benchmarks and demos?

The researchers evaluated TAPIR using the TAP-Vid benchmark, which is a standardized evaluation dataset for video tracking tasks. It contains 50 video sequences with different types of objects and scenes and provides ground truth annotations for 10 points per video. They compared TAPIR with several baseline methods, such as SIFT, ORB, KLT, SuperPoint, and D2Net.

They measured the performance using a metric called Average Jaccard (AJ). AJ is the average intersection over union between the predicted point locations and the ground truth point locations. The results showed that TAPIR outperformed all the baseline methods by a significant margin on the TAP-Vid benchmark. It achieved an AJ score of 0.64, which is 0.20 higher (about 45% higher in relative terms) than the second-best method, D2Net, which scored 0.44. This means that TAPIR was able to track the points more closely to their true locations than any other method.
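If you are wondering what a Jaccard-style score looks like for tracked points, here is a simplified sketch. It assumes the convention I understand TAP-Vid to use: a prediction counts as a true positive when the point is predicted visible, is actually visible, and lands within a pixel threshold of the ground truth, and the score is averaged over thresholds of 1, 2, 4, 8, and 16 pixels. Treat those details, and all the function names, as assumptions rather than the official evaluation code.

```python
import math


def jaccard_at_threshold(pred_points, pred_visible, gt_points, gt_visible, threshold):
    """Jaccard score for one distance threshold: TP / (ground truth positives + FP)."""
    true_pos = false_pos = gt_pos = 0
    for pred, pred_vis, gt, gt_vis in zip(pred_points, pred_visible, gt_points, gt_visible):
        within = math.dist(pred, gt) < threshold
        if gt_vis:
            gt_pos += 1
        if pred_vis and gt_vis and within:
            true_pos += 1
        elif pred_vis:
            # Predicted visible, but the point is occluded or too far from the truth.
            false_pos += 1
    denominator = gt_pos + false_pos
    return true_pos / denominator if denominator else 0.0


def average_jaccard(pred_points, pred_visible, gt_points, gt_visible,
                    thresholds=(1, 2, 4, 8, 16)):
    """Mean of the per-threshold Jaccard scores."""
    scores = [
        jaccard_at_threshold(pred_points, pred_visible, gt_points, gt_visible, t)
        for t in thresholds
    ]
    return sum(scores) / len(scores)


# Toy track of three frames (made-up numbers).
pred_points = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
pred_visible = [True, True, False]
gt_points = [(0.0, 0.5), (13.0, 10.0), (5.0, 5.0)]
gt_visible = [True, True, True]

aj = average_jaccard(pred_points, pred_visible, gt_points, gt_visible)
```

The nice thing about a score like this is that it punishes two kinds of mistakes at once: putting the point in the wrong place, and being wrong about whether the point is visible at all.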

It also performed well on another benchmark called DAVIS. DAVIS is a dataset for video segmentation tasks. It contains 150 video sequences with different types of objects and scenes, and it provides ground truth annotations for pixel-level segmentation masks. The researchers used TAPIR to track 10 points per video on DAVIS and computed the AJ score as before. They found that TAPIR achieved an AJ score of 0.59, which is again 0.20 higher (about 51% higher in relative terms) than the second-best method, D2Net, which scored 0.39. This means that it was able to track the points more consistently across different frames than any other method. But benchmarks are not enough to show you how awesome TAPIR is.

You need to see it in action yourself. Luckily, the researchers have provided two online Google Colab demos that you can use to run TAPIR on your own videos. The first demo is called the TAP-Vid demo. It allows you to upload your own video or choose one from YouTube and then select any point on any object in the first frame that you want to track throughout the video. Then it runs TAPIR on your video and shows you the results in real time.

The second demo is called the Webcam demo. It allows you to use your own webcam as the input source and then select any point on your face or any other object in front of you that you want to track live as you move around. Then it runs TAPIR on your webcam feed and shows you the results in real time.

My point of view

I have to say the demos are insane. You can see how TAPIR can track any point on any object with amazing accuracy and precision, even when there is occlusion, motion blur, illumination changes, scale variations, and so on. I think this is a great breakthrough for computer vision.
