Researchers at Google DeepMind have created a new AI model known as TAPIR, and it's honestly one of the coolest things I've seen in the field of computer vision.
[Computer vision is a branch of artificial intelligence that deals with understanding and analyzing visual information, such as images, videos, or live streams.]
Computer vision models can do amazing things, like recognizing faces, detecting objects, segmenting scenes, generating captions, and much more. These models can help us derive meaningful insights from different types of media and use them for various applications, such as security, entertainment, education, healthcare, and so on.
How do computer vision systems function?
Computer vision systems use deep learning techniques to learn from large amounts of data and extract features that are relevant to the task at hand. For example, if you want to recognize a person's face in a photo, you need a model that can learn to identify the key characteristics of a face, such as the shape of the eyes, nose, mouth, etc.
Then you need a model that can compare the features of the face in the photo with the features of the faces in your database and find the best match.
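To make this comparison concrete, here's a minimal sketch of matching a face's feature vector against a database using cosine similarity. The embeddings here are tiny made-up vectors purely for illustration; real face-recognition systems use learned embeddings with hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_feature, database_features):
    """Return the index of the database face whose feature vector
    is most similar to the query face's feature vector."""
    scores = [cosine_similarity(query_feature, f) for f in database_features]
    return int(np.argmax(scores))

# Toy example: three hypothetical "face embeddings" of dimension 4.
db = [np.array([1.0, 0.0, 0.0, 0.0]),
      np.array([0.0, 1.0, 0.0, 0.0]),
      np.array([0.7, 0.7, 0.0, 0.0])]
query = np.array([0.6, 0.8, 0.0, 0.0])
print(best_match(query, db))  # → 2 (the third face is the closest match)
```

The same compare-and-pick-the-best idea shows up again later in TAPIR's matching stage, just applied per video frame instead of per database entry.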
What if you want to track the tip of someone's nose or the center of their pupil as they move around in a video?
This is where things get tricky. Tracking a point in a video is not as easy as finding a point in a single image. You have to deal with challenges like:
- occlusion: when the point is hidden by another object or part of the object
- motion blur: when the point becomes blurry due to fast movement
- illumination changes: when the brightness or color of the point changes due to lighting conditions
- scale variations: when the size or shape of the point changes due to perspective or distance.
These factors can make it hard for the model to keep track of the point as it moves across different frames. Now this is where TAPIR comes into the picture.
TAPIR stands for Tracking Any Point with per-frame Initialization and Temporal Refinement. It's a new model that can effectively track any point on any physical surface throughout a video sequence. It doesn't matter if the point is on a person's face, a car's wheel, a bird's wing, or anything else. It can handle it all. It was developed by a team of researchers from Google DeepMind and the Visual Geometry Group (VGG) at the University of Oxford's Department of Engineering Science. They published their paper on arXiv on June 14, 2023, and they also open-sourced their code and pre-trained models on GitHub.
How does TAPIR work?
It uses a two-stage algorithm that consists of:
1. A matching stage
2. A refinement stage
The matching stage is where TAPIR analyzes each video frame separately and tries to find a suitable candidate point match for the query point. The query point is the point that you want to track in the video sequence. For example, if you want to track the tip of someone's nose in a video, that's your query point. To find the candidate point match for the query point in each frame, TAPIR uses a deep neural network that takes as input an image patch around the query point and outputs a feature vector representing its appearance. It then compares this feature vector with the feature vectors of all candidate points in each frame using cosine similarity and picks the most similar one as the candidate point match. This way, TAPIR can find the most likely matching point for the query point in each frame independently, which makes it robust to occlusion and motion blur: even if the query point is not visible or clear in some frames, it can still find its best approximation based on appearance. But finding candidate point matches is not enough to track the query point accurately.
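Here's a rough sketch of that per-frame matching idea. The shapes and the feature map are hypothetical stand-ins; in the real model, the features come from a learned neural network rather than random numbers.

```python
import numpy as np

def match_query_point(query_feature, frame_features):
    """Per-frame matching sketch: compare the query point's feature
    vector against every candidate location's feature vector in one
    frame and return the (row, col) of the most similar location.

    frame_features: (H, W, D) array of per-location feature vectors.
    """
    h, w, d = frame_features.shape
    flat = frame_features.reshape(-1, d)
    # Cosine similarity = dot product of L2-normalized vectors.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    q = query_feature / (np.linalg.norm(query_feature) + 1e-8)
    scores = flat @ q
    best = int(np.argmax(scores))
    return divmod(best, w)  # (row, col) of the candidate match

# Toy frame: a 4x4 feature map where one location matches the query.
frame = np.random.default_rng(0).normal(size=(4, 4, 8))
query = frame[2, 3].copy()  # pretend the query point lives at (2, 3)
print(match_query_point(query, frame))  # → (2, 3)
```

Because each frame is matched independently like this, a bad frame (occlusion, blur) only affects that frame's candidate, not the whole track.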
You also need to take into account how the query point moves over time and how its appearance changes due to factors like illumination or scale variations. This is where the refinement stage comes in. The refinement stage is where TAPIR updates both the trajectory and the query features based on local correlations.
The trajectory is the path followed by the query point throughout the video sequence, and the query features are the feature vectors that represent its appearance. To update the trajectory and the query features, TAPIR uses another deep neural network. That network takes as input a small image patch around the candidate point match in each frame and outputs a displacement vector indicating how much the candidate point match should be shifted to match the query point more precisely. TAPIR then applies this displacement vector to the candidate point match to obtain a refined point match that is closer to the true query point.
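The refinement step can be sketched like this. The "network" below is a fake stand-in function that returns a fixed correction, just to show the mechanics of shifting candidate matches; the real displacement comes from a learned model looking at local image patches.

```python
import numpy as np

def refine_track(candidate_points, predict_displacement):
    """Refinement sketch: shift each per-frame candidate match by a
    predicted displacement vector to get a more precise trajectory.

    candidate_points: (T, 2) array of (x, y) matches, one per frame.
    predict_displacement: stand-in for the refinement network; maps a
    candidate point to an (dx, dy) correction.
    """
    refined = []
    for p in candidate_points:
        dx, dy = predict_displacement(p)
        refined.append((p[0] + dx, p[1] + dy))
    return np.array(refined)

# Toy example: every candidate is 1 pixel right of the true point,
# and our fake "network" predicts the -1 pixel correction.
candidates = np.array([[11.0, 5.0], [12.0, 5.0], [13.0, 5.0]])
fake_net = lambda p: (-1.0, 0.0)
refined = refine_track(candidates, fake_net)
print(refined)  # each point shifted to (10,5), (11,5), (12,5)
```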
In simple terms, the system works by examining small parts of an image, figuring out how much to adjust a selected point to match a target point, and then moving the selected point closer to that target. TAPIR also updates the query features by averaging the feature vectors of the refined point matches over time. This way it can adapt to changes in the query point's appearance and maintain a consistent representation of it. By combining these two stages, it can track any point in a video sequence with high accuracy and precision. It can handle videos of various sizes and quality. It can also track multiple points simultaneously.
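The feature-update part of that paragraph can be illustrated with a simple running mean. This is only a sketch of the idea of averaging refined-match features over time; TAPIR's actual update is learned, not a plain mean.

```python
import numpy as np

def update_query_features(feature_history):
    """Average the feature vectors of the refined matches over time,
    so the query point's appearance representation adapts to gradual
    changes (lighting, scale) while staying consistent."""
    return np.mean(np.stack(feature_history), axis=0)

# Toy example: the point's appearance drifts slightly over 3 frames.
history = [np.array([1.0, 0.0]),
           np.array([0.8, 0.2]),
           np.array([0.6, 0.4])]
print(update_query_features(history))  # → [0.8 0.2]
```

Averaging smooths out frame-to-frame noise in appearance while still tracking slow drift, which is exactly why a consistent representation helps the matcher in later frames.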
How does TAPIR perform on benchmarks and demos?
The researchers evaluated TAPIR using the TAP-Vid benchmark, which is a standardized evaluation dataset for video tracking tasks. It contains 50 video sequences with different types of objects and scenes and provides ground-truth annotations for 10 points per video. They compared TAPIR with several baseline methods, such as SIFT, ORB, KLT, SuperPoint, and D2Net.
They measured the performance using a metric called Average Jaccard (AJ). AJ is the average intersection over union between the predicted point locations and the ground-truth point locations. The results showed that TAPIR outperformed all the baseline methods by a significant margin on the TAP-Vid benchmark. It achieved an AJ score of 0.64, about 20 percentage points higher than the second-best method, D2Net, which scored 0.44. This means that TAPIR was able to track the points more closely to their true locations than any other method.
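To give a feel for what a Jaccard-style point metric measures, here's a heavily simplified sketch: the fraction of frames where the prediction lands within a pixel threshold of the ground truth. The real AJ metric is more involved (it also scores occlusion predictions and averages over several thresholds), so treat this as an illustration only.

```python
import numpy as np

def point_accuracy(pred, gt, threshold=2.0):
    """Simplified stand-in for a point-tracking score: the fraction of
    frames where the predicted (x, y) lies within `threshold` pixels
    of the ground truth. Not the full Average Jaccard metric.

    pred, gt: (T, 2) arrays of per-frame point locations.
    """
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= threshold))

# Toy example: 3 frames; the last prediction drifts far off target.
pred = np.array([[10.0, 5.0], [11.5, 5.0], [20.0, 5.0]])
gt = np.array([[10.0, 5.0], [11.0, 5.0], [12.0, 5.0]])
print(point_accuracy(pred, gt))  # → 0.666... (2 of 3 frames within 2 px)
```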
It also performed well on another benchmark called DAVIS. DAVIS is a dataset for video segmentation tasks. It contains 150 video sequences with different types of objects and scenes, and provides ground-truth annotations for pixel-level segmentation masks. The researchers used TAPIR to track 10 points per video on DAVIS and computed the AJ score as before. They found that TAPIR achieved an AJ score of 0.59, again about 20 percentage points higher than the second-best method, D2Net, which scored 0.39. This means that it was able to track the points more consistently across different frames than any other method. But benchmarks alone aren't enough to show you how awesome TAPIR is.
You need to see it in action yourself. Luckily, the researchers have provided two online Google Colab demos that you can use to run TAPIR on your own videos. The first demo is the TAP-Vid demo. It allows you to upload your own video or choose one from YouTube and then select any point on any object in the first frame that you want to track throughout the video. It then runs TAPIR on your video and shows you the results in real time.
The second demo is the Webcam demo. It allows you to use your own webcam as the input source and then select any point on your face, or on any other object in front of you, that you want to track live as you move around. It then runs TAPIR on your webcam feed and shows you the results in real time.
My point of view
I have to say, the demos are insane. You can see how TAPIR can track any point on any object with amazing accuracy and precision, even when there's occlusion, motion blur, illumination changes, scale variations, and so on. I think this is a great breakthrough for computer vision.