# Computer Vision

## Semantic Segmentation
- Tumor Segmentation
- Panoptic Segmentation
- 3D Semantic Segmentation
- Weakly-Supervised Semantic Segmentation
## Representation Learning
- Disentanglement
- Graph Representation Learning
- Sentence Embeddings
- Network Embedding
## Classification
- Text Classification
- Graph Classification
- Audio Classification
- Medical Image Classification
## Object Detection
- 3D Object Detection
- Real-Time Object Detection
- RGB Salient Object Detection
- Few-Shot Object Detection
## Image Classification
- Out of Distribution (OOD) Detection
- Few-Shot Image Classification
- Fine-Grained Image Classification
- Semi-Supervised Image Classification
## 2D Object Detection
- Edge Detection
- Thermal Image Segmentation
- Open Vocabulary Object Detection
## Reinforcement Learning (RL)
- Off-Policy Evaluation
- Multi-Objective Reinforcement Learning
- 3D Point Cloud Reinforcement Learning
- Deep Hashing
- Table Retrieval

## Domain Adaptation
- Unsupervised Domain Adaptation
- Domain Generalization
- Test-time Adaptation
- Source-Free Domain Adaptation

## Image Generation
- Image-to-Image Translation
- Text-to-Image Generation
- Image Inpainting
- Conditional Image Generation
## Data Augmentation
- Image Augmentation
- Text Augmentation
## Autonomous Vehicles
- Autonomous Driving
- Self-Driving Cars
- Simultaneous Localization and Mapping
- Autonomous Navigation

## Denoising
- Image Denoising
- Color Image Denoising
- SAR Image Despeckling
- Grayscale Image Denoising

## Meta-Learning
- Few-Shot Learning
- Sample Probing
- Universal Meta-Learning

## Contrastive Learning
## Super-Resolution
- Image Super-Resolution
- Video Super-Resolution
- Multi-Frame Super-Resolution
- Reference-based Super-Resolution
## Pose Estimation
- 3D Human Pose Estimation
- Keypoint Detection
- 3D Pose Estimation
- 6D Pose Estimation
## Self-Supervised Learning
- Point Cloud Pre-training
- Unsupervised Video Clustering
- 2D Semantic Segmentation
- Image Segmentation

## Text Style Transfer
- Scene Parsing
- Reflection Removal
## Visual Question Answering (VQA)
- Visual Question Answering
- Machine Reading Comprehension
- Chart Question Answering
- Embodied Question Answering
## Depth Estimation
- 3D Reconstruction
- Neural Rendering
- 3D Face Reconstruction
## Sentiment Analysis
- Aspect-Based Sentiment Analysis (ABSA)
- Multimodal Sentiment Analysis
- Aspect Sentiment Triplet Extraction
- Twitter Sentiment Analysis
## Anomaly Detection
- Unsupervised Anomaly Detection
- One-Class Classification
- Supervised Anomaly Detection
- Anomaly Detection in Surveillance Videos
- Temporal Action Localization
- Video Understanding

## Video Generation
- Video Object Segmentation
- Action Classification
## Activity Recognition
- Action Recognition
- Human Activity Recognition
- Egocentric Activity Recognition
- Group Activity Recognition
- 3D Object Super-Resolution

## Few-Shot Learning
- One-Shot Learning
- Few-Shot Semantic Segmentation
- Cross-Domain Few-Shot
- Unsupervised Few-Shot Learning
## Medical Image Segmentation
- Lesion Segmentation
- Brain Tumor Segmentation
- Cell Segmentation
- Skin Lesion Segmentation

## Monocular Depth Estimation
- Stereo Depth Estimation
- Depth and Camera Motion
- 3D Depth Estimation
- Exposure Fairness

## Optical Character Recognition (OCR)
- Active Learning
- Handwriting Recognition
- Handwritten Digit Recognition
- Irregular Text Recognition

## Instance Segmentation
- Referring Expression Segmentation
- 3D Instance Segmentation
- Real-time Instance Segmentation
- Unsupervised Object Segmentation
## Facial Recognition and Modelling
- Face Recognition
- Face Swapping
- Face Detection
- Facial Expression Recognition (FER)
- Face Verification
## Object Tracking
- Multi-Object Tracking
- Visual Object Tracking
- Multiple Object Tracking
- Cell Tracking
## Zero-Shot Learning
- Generalized Zero-Shot Learning
- Compositional Zero-Shot Learning
- Multi-Label Zero-Shot Learning

## Quantization
- Data Free Quantization
- UNet Quantization

## Continual Learning
- Class Incremental Learning
- Continual Named Entity Recognition
- Unsupervised Class-Incremental Learning
## Action Recognition
- Action Recognition In Videos
- 3D Action Recognition
- Self-Supervised Action Recognition
- Few-Shot Action Recognition
## Scene Understanding
- Scene Text Recognition
- Scene Graph Generation
- Scene Recognition
## Adversarial Attack
- Backdoor Attack
- Adversarial Text
- Adversarial Attack Detection
- Real-World Adversarial Attack
- Active Object Detection

## Image Retrieval
- Sketch-Based Image Retrieval
- Content-Based Image Retrieval
- Composed Image Retrieval (CoIR)
- Medical Image Retrieval
## Dimensionality Reduction
- Supervised Dimensionality Reduction
- Online Nonnegative CP Decomposition

## Emotion Recognition
- Speech Emotion Recognition
- Emotion Recognition in Conversation
- Multimodal Emotion Recognition
- Emotion-Cause Pair Extraction
## 3D Object Detection
- Monocular 3D Object Detection
- 3D Object Detection From Stereo Images
- Multiview Detection
- Robust 3D Object Detection

## Image Reconstruction
- MRI Reconstruction
- Film Removal
## Style Transfer
- Image Stylization
- Font Style Transfer
- Style Generalization
- Face Transfer

## Optical Flow Estimation
- Video Stabilization
## Image Captioning
- 3D Dense Captioning
- Controllable Image Captioning
- Aesthetic Image Captioning
- Relational Captioning
## Action Localization
- Action Segmentation
- Spatio-Temporal Action Localization

## Person Re-Identification
- Unsupervised Person Re-Identification
- Video-Based Person Re-Identification
- Generalizable Person Re-Identification
- Cloth-Changing Person Re-Identification

## Image Restoration
- Demosaicking
- Spectral Reconstruction
- Underwater Image Restoration
- JPEG Artifact Correction
- Visual Relationship Detection

## Lighting Estimation
- 3D Room Layouts From A Single RGB Panorama
- Road Scene Understanding

## Action Detection
- Skeleton Based Action Recognition
- Online Action Detection
- Audio-Visual Active Speaker Detection

## Metric Learning
## Object Recognition
- 3D Object Recognition
- Continuous Object Recognition
- Depiction Invariant Object Recognition
- Monocular 3D Human Pose Estimation
- Pose Prediction
- 3D Multi-Person Pose Estimation
- 3D Human Pose and Shape Estimation

## Image Enhancement
- Low-Light Image Enhancement
- Image Relighting
- De-Aliasing

## Multi-Label Classification
- Missing Labels
- Extreme Multi-Label Classification
- Hierarchical Multi-Label Classification
- Medical Code Prediction

## Continuous Control
- Steering Control
- Drone Controller
## Video Object Segmentation
- Semi-Supervised Video Object Segmentation
- Unsupervised Video Object Segmentation
- Referring Video Object Segmentation
- Video Salient Object Detection
- 3D Face Modelling
## Trajectory Prediction
- Trajectory Forecasting
- Human Motion Prediction
- Out-of-Sight Trajectory Prediction
- Multivariate Time Series Imputation

## Image Quality Assessment
- No-Reference Image Quality Assessment
- Blind Image Quality Assessment
- Aesthetics Quality Assessment
- Stereoscopic Image Quality Assessment

## Object Localization
- Weakly-Supervised Object Localization
- Image-Based Localization
- Unsupervised Object Localization
- Monocular 3D Object Localization

## Novel View Synthesis
- Novel LiDAR View Synthesis
- Ground Video Synthesis from Satellite Image
- Blind Image Deblurring
- Single-Image Blind Deblurring
- Out-of-Distribution Detection

## Video Semantic Segmentation
- Camera Shot Segmentation
Cloud removal.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000444-66c74076.jpg)
Facial Inpainting
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/2c8222e3-b09e-4ffb-b284-211afa8086a8.jpg)
Fine-Grained Image Inpainting
Instruction following, visual instruction following, change detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/42790a97-c01a-4afa-aa7c-3b72d5a52296.jpg)
Semi-supervised Change Detection
Saliency detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000333-b088b240.jpg)
Saliency Prediction
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000331-df0072c9.jpg)
Co-Salient Object Detection
Video saliency detection, unsupervised saliency detection, image compression.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000730-aec83530.jpg)
Feature Compression
Jpeg compression artifact reduction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000729-ed7408b1.jpg)
Lossy-Compression Artifact Reduction
Color image compression artifact reduction, explainable artificial intelligence, explainable models, explanation fidelity evaluation, fad curve analysis, prompt engineering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/118d0d79-54e4-49a7-ab9a-e3606e82103d.jpg)
Visual Prompting
Image registration.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000522-69cd01de_zgJiQk0.jpg)
Unsupervised Image Registration
Ensemble learning, visual reasoning.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000242-3efd20a2.jpg)
Visual Commonsense Reasoning
Salient object detection, saliency ranking, visual tracking.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000550-8dc245dd.jpg)
Point Tracking
Rgb-t tracking, real-time visual tracking.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001734-32c857c9.jpg)
RF-based Visual Tracking
3d point cloud classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/0eeeb7f4-67f5-4ec3-a76b-20ba51efae6a.jpg)
3D Object Classification
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/a967964d-f619-4d68-b753-c0acae259fa0.jpg)
Few-Shot 3D Point Cloud Classification
Supervised-only 3d point cloud classification, zero-shot transfer 3d point cloud classification, motion estimation, 2d classification.
![research paper about computer vision research paper about computer vision](https://raw.githubusercontent.com/cambridge-mlg/miracle/master/figures/mnist_comp.png)
Neural Network Compression
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000139-7e4d4874.jpg)
Music Source Separation
Cell detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/61a791e1-5ac0-40cb-9cdf-5422434bbbfe.jpg)
Plant Phenotyping
Open-set classification, image manipulation detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000625-bb786447.jpg)
Zero Shot Skeletal Action Recognition
Generalized zero shot skeletal action recognition, whole slide images, activity prediction, motion prediction, cyber attack detection, sequential skip prediction, gesture recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000632-7fc5c90c.jpg)
Hand Gesture Recognition
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000631-4ed1fa07.jpg)
Hand-Gesture Recognition
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001732-5fa30f4b.jpg)
RF-based Gesture Recognition
Video captioning.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000542-44908c53.jpg)
Dense Video Captioning
Boundary captioning, visual text correction, audio-visual video captioning, video question answering.
![research paper about computer vision research paper about computer vision](https://raw.githubusercontent.com/jayleicn/TVQA/master/./imgs/example_main.png)
Zero-Shot Video Question Answer
Few-shot video question answering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/1ebcc422-6179-47b1-b0d5-d4883043a38a.jpg)
Robust 3D Semantic Segmentation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/ed554fc5-ed7e-4bb2-9abb-ea0c385e0892.jpg)
Real-Time 3D Semantic Segmentation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/8666f2ea-9285-487a-b0b1-5b462667f66e.jpg)
Unsupervised 3D Semantic Segmentation
Furniture segmentation, point cloud registration.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000520-849c9f31.jpg)
Image to Point Cloud Registration
Text detection, medical diagnosis.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000293-ad63354d.jpg)
Alzheimer's Disease Detection
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002230-9dfeec51.jpg)
Retinal OCT Disease Classification
Blood cell count, thoracic disease classification, 3d point cloud interpolation, visual grounding.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/8bdd62b4-c02c-43ea-bbb5-77882024286e.jpg)
Person-centric Visual Grounding
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/441e442f-a774-406d-89d8-4a103876ad91.jpg)
Phrase Extraction and Grounding (PEG)
Visual odometry.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000070-1705b341.jpg)
Face Anti-Spoofing
Monocular visual odometry.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000771-3b256a6c.jpg)
Hand Pose Estimation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000884-78baab10.jpg)
Hand Segmentation
Gesture-to-gesture translation, rain removal.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000440-be3759b0.jpg)
Single Image Deraining
Image clustering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000702-79f3c03f.jpg)
Online Clustering
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000702-3c2f553a.jpg)
Face Clustering
Multi-view subspace clustering, multi-modal subspace clustering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000710-8c508a28.jpg)
Image Dehazing
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000100-8c508a28.jpg)
Single Image Dehazing
Colorization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/34171ffd-b64f-4c99-a8ae-25db8fc5921d.jpg)
Line Art Colorization
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/ea20e877-5ede-4a85-9b69-b2e91ff6838f.jpg)
Point-interactive Image Colorization
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/eb621262-556c-40c9-b6e3-71475de11079.jpg)
Color Mismatch Correction
Robot navigation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000547-5ff26267.jpg)
PointGoal Navigation
Social navigation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002538-5c74c6f5.jpg)
Sequential Place Learning
Image manipulation, conformal prediction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000108-08b670c5.jpg)
Unsupervised Image-To-Image Translation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001067-0d49abc9.jpg)
Synthetic-to-Real Translation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000109-90fbb2e0.jpg)
Multimodal Unsupervised Image-To-Image Translation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/e1f688c7-99a7-4198-8cf4-7ffcc0ffde8b.jpg)
Cross-View Image-to-Image Translation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002229-c773675f.jpg)
Fundus to Angiography Generation
Visual place recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000663-4df8d036.jpg)
Indoor Localization
3d place recognition, image editing, rolling shutter correction, shadow removal, multimodel-guided image editing, joint deblur and frame interpolation, multimodal fashion image editing, visual localization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000391-9bc2256b.jpg)
DeepFake Detection
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001775-6d14e362.jpg)
Synthetic Speech Detection
Human detection of deepfakes, multimodal forgery detection, stereo matching, object reconstruction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000750-67b42af7.jpg)
3D Object Reconstruction
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000471-ceeed704.jpg)
Crowd Counting
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000239-ab2f099a.jpg)
Visual Crowd Analysis
Group detection in crowds, human-object interaction detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000617-9644872d.jpg)
Affordance Recognition
Image deblurring, low-light image deblurring and enhancement, earth observation, video quality assessment, video alignment, temporal sentence grounding, long-video activity recognition, point cloud classification, jet tagging, few-shot point cloud classification, image matching.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/4810ca04-9af3-4e79-9bf9-54f786052a27.jpg)
Semantic correspondence
Patch matching, set matching.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/b478c024-2fd4-46ce-8128-0da3bb642b75.jpg)
Matching Disparate Images
Hyperspectral.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000818-f54abafb.jpg)
Hyperspectral Image Classification
Hyperspectral unmixing, hyperspectral image segmentation, classification of hyperspectral images, document text classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/562d955c-73b9-4bab-b731-53860c9e03bc.jpg)
Learning with noisy labels
Multi-label classification of biomedical texts, political salient issue orientation detection, 3d point cloud reconstruction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/21b7189c-5d64-46b0-aa33-3da648560eaa.jpg)
Weakly Supervised Action Localization
Weakly-supervised temporal action localization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000210-b1ee5c73.jpg)
Temporal Action Proposal Generation
Activity recognition in videos, scene classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000413-143ff75c.jpg)
2D Human Pose Estimation
Action anticipation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001498-3f9c6ea2_65GTaFO.jpg)
3D Face Animation
Semi-supervised human pose estimation, point cloud generation, point cloud completion, referring expression, reconstruction, 3d human reconstruction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000045-5164a6fa.jpg)
Single-View 3D Reconstruction
4d reconstruction, single-image-based hdr reconstruction, compressive sensing, keyword spotting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000085-ed8952fd.jpg)
Small-Footprint Keyword Spotting
Visual keyword spotting, scene text detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000507-55533bc2.jpg)
Curved Text Detection
Multi-oriented scene text detection, boundary detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000356-04900360.jpg)
Junction Detection
Camera calibration, image matting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/e3b47721-0830-4233-9253-d455d3be1c59.jpg)
Semantic Image Matting
Video retrieval, video-text retrieval, video grounding, video-adverb retrieval, replay grounding, composed video retrieval (covr), motion synthesis.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/cf274dc2-54c8-4d2b-b3db-fc42338ebb3e.jpg)
Motion Style Transfer
Temporal human motion composition, emotion classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000563-59ba915b.jpg)
Video Summarization
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/f39b8716-7e56-40a4-a528-823472ba7bfc.jpg)
Unsupervised Video Summarization
Supervised video summarization, document ai, document understanding, sensor fusion, superpixels, point cloud segmentation, remote sensing.
![research paper about computer vision research paper about computer vision](https://raw.githubusercontent.com/lehaifeng/RSI-CB/master/osm%E5%88%86%E5%B8%83%E5%9B%BE.png)
Remote Sensing Image Classification
Change detection for remote sensing images, building change detection for remote sensing images.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000484-0b85d8ec.jpg)
Segmentation Of Remote Sensing Imagery
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000485-0b85d8ec.jpg)
The Semantic Segmentation Of Remote Sensing Imagery
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002170-963c86db.jpg)
Few-Shot Transfer Learning for Saliency Prediction
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000847-b12abf24_wqnD1AJ.jpg)
Aerial Video Saliency Prediction
Document layout analysis.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/a118a5c0-8617-4374-ada1-a93e748ca0b5.jpg)
3D Anomaly Detection
Video anomaly detection, artifact detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000780-3d4e01ee.jpg)
Point cloud reconstruction
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/e19e9626-0c29-4ac5-a1e8-a65170ad7c2d.jpg)
3D Semantic Scene Completion
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/c6f9dd48-eb74-459b-9143-4b1d30b11dc8.jpg)
3D Semantic Scene Completion from a single RGB image
Garment reconstruction, face generation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/2ce93ab0-992a-429f-8005-09601dddcb1f.jpg)
Talking Head Generation
Talking face generation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000990-2d591218_2jUWb8G.jpg)
Face Age Editing
Facial expression generation, kinship face generation, cross-modal retrieval, image-text matching, multilingual cross-modal retrieval.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/dcfded35-e2e1-44e8-86da-c09af9cfafa9.jpg)
Zero-shot Composed Person Retrieval
Cross-modal retrieval on rsitmd, video instance segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002416-bc645653.jpg)
Privacy Preserving Deep Learning
Membership inference attack, human detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000497-6d1a4ae6.jpg)
Generalized Few-Shot Semantic Segmentation
Virtual try-on, scene flow estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/3cc22ae6-195d-4ff1-8dda-4271733f2ed6.jpg)
Self-supervised Scene Flow Estimation
3d classification, depth completion.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001115-db64b3a0.jpg)
Motion Forecasting
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000363-06d10c79.jpg)
Multi-Person Pose forecasting
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001736-1612b5d3.jpg)
Multiple Object Forecasting
Video editing, video temporal consistency, face reconstruction, object discovery, carla map leaderboard, dead-reckoning prediction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/33c33ac1-c08c-4d2c-925e-303af2f00b9c.jpg)
Generalized Referring Expression Segmentation
Gaze estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000848-7a7f5179.jpg)
Texture Synthesis
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/d5f1e0b9-1215-4f4f-ab92-8ea1e4f036d7.jpg)
Text-based Image Editing
Text-guided-image-editing.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/5f077531-cc40-4eec-be00-a6337e075dfe.jpg)
Zero-Shot Text-to-Image Generation
Concept alignment, conditional text-to-image synthesis, machine unlearning, continual forgetting, sign language recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000649-69922a8d.jpg)
Image Recognition
Fine-grained image recognition, license plate recognition, material recognition, multi-view learning, incomplete multi-view clustering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000287-fc5b698e.jpg)
Breast Cancer Detection
Skin cancer classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000288-c86f61d3.jpg)
Breast Cancer Histology Image Classification
Lung cancer diagnosis, classification of breast cancer histology images, gait recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/c4fc8e61-f5b3-4b2a-87a4-71293efe2e7c.jpg)
Multiview Gait Recognition
Gait recognition in the wild, human parsing.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001200-fb55e254.jpg)
Multi-Human Parsing
Pose tracking.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001299-897c396f.jpg)
3D Human Pose Tracking
Interactive segmentation, scene generation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001652-7e12f901.jpg)
3D Multi-Person Pose Estimation (absolute)
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001651-edc0c2f2.jpg)
3D Multi-Person Pose Estimation (root-relative)
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001649-ecb41cf2.jpg)
3D Multi-Person Mesh Recovery
Event-based vision.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/2eedb385-fb67-459a-a2c0-f7cc8feabc55.jpg)
Event-based Optical Flow
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/6cb48677-e623-4a5c-b329-5a974f537f44.jpg)
Event-Based Video Reconstruction
Event-based motion estimation, disease prediction, disease trajectory forecasting, object counting, training-free object counting, open-vocabulary object counting, interest point detection, homography estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002042-ac2cbf8e.jpg)
3D Hand Pose Estimation
Weakly supervised segmentation, facial landmark detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000441-787de252_HtXStMs.jpg)
Unsupervised Facial Landmark Detection
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001971-ec7de5c2.jpg)
3D Facial Landmark Localization
3d character animation from a single photo, scene segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/84b92b4e-836f-4a64-8e53-a972dc5dc618.jpg)
Dichotomous Image Segmentation
Activity detection, inverse rendering, temporal localization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000389-ae6f548d.jpg)
Language-Based Temporal Localization
Temporal defect localization, multi-label image classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/4eee2add-78aa-46df-b8a3-296919b49cb5.jpg)
Multi-label Image Recognition with Partial Labels
3d object tracking.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/d79d32e0-35b4-4fc9-a69c-d373e3097a39.jpg)
3D Single Object Tracking
Template matching, text-to-video generation, text-to-video editing, subject-driven video generation, camera localization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000474-5eb20b1e.jpg)
Camera Relocalization
Lidar semantic segmentation, visual dialog.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000243-1146f2d1.jpg)
Motion Segmentation
Relation network, intelligent surveillance.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000129-bd0ee47a.jpg)
Vehicle Re-Identification
Text spotting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000086-2094f367.jpg)
Disparity Estimation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000849-9022569c.jpg)
Few-Shot Class-Incremental Learning
Class-incremental semantic segmentation, non-exemplar-based class incremental learning, handwritten text recognition, handwritten document recognition, unsupervised text recognition, knowledge distillation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/tasks/d0f2dad1-32df-46e2-8686-ce09e263353c.png)
Data-free Knowledge Distillation
Self-knowledge distillation, moment retrieval.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002886-eacfc398.jpg)
Zero-shot Moment Retrieval
Text to video retrieval, partially relevant video retrieval, person search, decision making under uncertainty.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000177-776f95bc.jpg)
Uncertainty Visualization
Semi-supervised object detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/dfedaa2c-0eb8-4247-9cc0-db856cbf64ad.jpg)
Shadow Detection
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000504-1345a6a4.jpg)
Shadow Detection And Removal
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/01d8624d-00e2-4e32-8159-ae45c4a3edd5.jpg)
Unconstrained Lip-synchronization
Mixed reality, video inpainting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001621-e4fa630c.jpg)
Cross-corpus
Micro-expression recognition, micro-expression spotting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000448-d9c5224c.jpg)
3D Facial Expression Recognition
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000630-844318ed.jpg)
Smile Recognition
Future prediction, human mesh recovery, video enhancement.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000458-a47f7d65.jpg)
Face Image Quality Assessment
Lightweight face recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000452-a95d6931.jpg)
Age-Invariant Face Recognition
Synthetic face recognition, face quality assessment.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001626-3b0fd806.jpg)
3D Multi-Object Tracking
Real-time multi-object tracking, multi-animal tracking with identification, trajectory long-tail distribution for multi-object tracking, grounded multiple object tracking, image categorization, fine-grained visual categorization, overlapped 10-1, overlapped 15-1, overlapped 15-5, disjoint 10-1, disjoint 15-1.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/0d291bd4-583c-4a65-ac28-ef63253420fe.jpg)
Burst Image Super-Resolution
Stereo image super-resolution, satellite image super-resolution, multispectral image super-resolution, color constancy.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000724-0c23f7fd.jpg)
Few-Shot Camera-Adaptive Color Constancy
Hdr reconstruction, multi-exposure image fusion, open vocabulary semantic segmentation, zero-guidance segmentation, physics-informed machine learning, soil moisture estimation, deep attention, line detection, video reconstruction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001827-3fb659a2_8tfLn4X.jpg)
Zero Shot Segmentation
Visual recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000718-75613b53.jpg)
Fine-Grained Visual Recognition
Image cropping, sign language translation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001449-2d0892c0.jpg)
Stereo Matching Hand
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000904-a0e8fdfb.jpg)
3D Absolute Human Pose Estimation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001751-381bcadd_22ti1hO.jpg)
Text-to-Face Generation
Image forensics, tone mapping, zero-shot action recognition, natural language transduction, video restoration.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/82695719-6c02-4632-9c43-ef66a18ab565.jpg)
Analog Video Restoration
Novel class discovery.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/8bec3e03-fc06-4d78-96bb-6a6722672130.jpg)
Transparent Object Detection
Transparent objects, surface normals estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001360-1808c9b9.jpg)
Hand-Object Pose
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000002042-8135da6d.jpg)
Grasp Generation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002043-e749e306.jpg)
3D Canonical Hand Pose Estimation
Breast cancer histology image classification (20% labels), cross-domain few-shot learning, texture classification, vision-language navigation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000067-333d5dfa.jpg)
Abnormal Event Detection In Video
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002615-8ca44059.jpg)
Semi-supervised Anomaly Detection
Infrared and visible image fusion.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000487-58d20eaf.jpg)
Image Animation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/5e0606c2-1094-410e-a125-c9e3e166aab2.jpg)
Image to 3D
Probabilistic deep learning, unsupervised few-shot image classification, generalized few-shot classification, pedestrian attribute recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/948c0e0f-81d8-4c08-b8d9-e504cb114382.jpg)
Steganalysis
![research paper about computer vision research paper about computer vision](https://raw.githubusercontent.com/tosmaster/imagevision/master/images/architecture.png)
Sketch Recognition
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000088-1f169c25.jpg)
Face Sketch Synthesis
Drawing pictures.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000053-9e6cc36d.jpg)
Photo-To-Caricature Translation
Spoof detection, face presentation attack detection, detecting image manipulation, cross-domain iris presentation attack detection, finger dorsal image spoof detection, computer vision techniques adopted in 3d cryogenic electron microscopy, single particle analysis, cryogenic electron tomography, highlight detection, iris recognition, pupil dilation, action quality assessment.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001368-1082b77e.jpg)
One-shot visual object segmentation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/23137faf-69e4-4e35-acec-5b817c16c737.jpg)
Unbiased Scene Graph Generation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/2a773854-1b01-410f-aaf3-a064d695d590.jpg)
Panoptic Scene Graph Generation
Image to video generation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/4a2d703b-aab3-4ce2-9cfe-9c742289b269.jpg)
Unconditional Video Generation
Automatic post-editing.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000257-2b560008_M7RFnV9.jpg)
Dense Captioning
Image stitching.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/ac8ffe2d-fdb4-447e-88e9-b4ad83e5bb38.jpg)
Multi-View 3D Reconstruction
Universal domain adaptation, action understanding, blind face restoration.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002731-0d3b184a.jpg)
Document Image Classification
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000704-356f65e7.jpg)
Face Reenactment
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/9b9603d6-560f-4a69-bc48-90b9da3cf086.jpg)
Geometric Matching
Human action generation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001744-8142135a.jpg)
Action Generation
Object categorization, person retrieval, text based person retrieval, surgical phase recognition, online surgical phase recognition, offline surgical phase recognition, human dynamics.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000619-0a3c8ab0.jpg)
3D Human Dynamics
Meme classification, hateful meme classification, severity prediction, intubation support prediction, cloud detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000502-dfb772c2.jpg)
Text-To-Image
Story visualization, complex scene breaking and synthesis, diffusion personalization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/f74a439f-308c-4f5d-89f3-0d294865d51c.jpg)
Diffusion Personalization Tuning Free
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/2d639db9-6da1-4cc0-a4bb-48efaead23f4.jpg)
Efficient Diffusion Personalization
Image fusion, pansharpening, image deconvolution.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000713-8500001e.jpg)
Image Outpainting
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/caca4ee3-d312-45f3-95bc-996f1c27034e.jpg)
Object Segmentation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002128-167778ba.jpg)
Camouflaged Object Segmentation
Landslide segmentation, text-line extraction, point clouds, point cloud video understanding, point cloud representation learning.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/b2c1d06e-20f2-4046-8329-7e239774aa84.jpg)
Semantic SLAM
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/575355c4-187c-40f0-9713-674fc2fc5cb1.jpg)
Object SLAM
Intrinsic image decomposition, line segment detection, table recognition, situation recognition, grounded situation recognition, motion detection, multi-target domain adaptation, sports analytics.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001767-3c6c5a0d.jpg)
Robot Pose Estimation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/926d9879-0d61-474f-ae7a-9c50a59ad13f.jpg)
Camouflaged Object Segmentation with a Single Task-generic Prompt
Image morphing, image shadow removal, person identification, visual prompt tuning, weakly-supervised instance segmentation, image smoothing, fake image detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001199-7f4bf1fb_0BXMP1S.jpg)
GAN image forensics
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001199-087e3c6c_qt4yHKY.jpg)
Fake Image Attribution
Image steganography, rotated mnist, contour detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000354-8ae991ad.jpg)
Face Image Quality
Lane detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/8c89efa3-0b9d-4e3e-82d9-08b38f55bcc7.jpg)
3D Lane Detection
Layout design, license plate detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/21e8bf09-6596-410d-b170-0432fcffb2b0.jpg)
Video Panoptic Segmentation
Viewpoint estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000851-d564eeeb.jpg)
Drone navigation
Drone-view target localization, value prediction, body mass index (bmi) prediction, multi-object tracking and segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/657c32e8-3131-400f-8474-224bad1a9b6e.jpg)
Occlusion Handling
Zero-shot transfer image classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/77dd7e5b-1335-4b2a-bcfd-9c388413b102.jpg)
3D Object Reconstruction From A Single Image
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000751-6cd9fb6b.jpg)
CAD Reconstruction
3d point cloud linear classification, crop classification, crop yield prediction, photo retouching, motion retargeting, shape representation of 3d point clouds, bird's-eye view semantic segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/0d834282-fd21-4e57-be69-d5c2ed538690.jpg)
Dense Pixel Correspondence Estimation
Human part segmentation.
![research paper about computer vision research paper about computer vision](https://raw.githubusercontent.com/facebookresearch/detectron/master/demo/output/33823288584_1d21cf0a26_k_example_output.jpg)
Multiview Learning
Person recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000131-3d972675.jpg)
Document Shadow Removal
Symmetry detection, traffic sign detection, video style transfer, referring image matting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/tasks/d6b93a69-819b-4ec2-af10-06ffa587bb16.jpg)
Referring Image Matting (Expression-based)
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/3733d707-0d2b-49c9-9b2b-a2a615e74ff8.jpg)
Referring Image Matting (Keyword-based)
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/db80a83a-a702-4edd-aad9-892a147df206.jpg)
Referring Image Matting (RefMatte-RW100)
Referring image matting (prompt-based), human interaction recognition, one-shot 3d action recognition, mutual gaze, affordance detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002845-bd6fff4d.jpg)
Gaze Prediction
Image forgery detection, image instance retrieval, amodal instance segmentation, image quality estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000091-e86362c7.jpg)
Image Similarity Search
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000992-fce976fd.jpg)
Precipitation Forecasting
Referring expression generation, road damage detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000480-bdfe2fa5.jpg)
Space-time Video Super-resolution
Video matting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002360-3d203358.jpg)
Open-World Semi-Supervised Learning
Semi-supervised image classification (cold start), hand detection, material classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000422-688b5ff1.jpg)
Open Vocabulary Attribute Detection
Inverse tone mapping, image/document clustering, self-organized clustering, instance search.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000082-7dbded6b.jpg)
Audio Fingerprint
3d shape modeling.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001823-9593b40a.jpg)
Action Analysis
Facial editing.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/68278bb8-5f4a-43f4-bfbb-ab950c58df74.jpg)
Food Recognition
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000634-b236aef7.jpg)
Holdout Set
Motion magnification, semi-supervised instance segmentation, binary classification, llm-generated text detection, cancer-no cancer per breast classification, cancer-no cancer per image classification, suspicious (birads 4,5)-no suspicious (birads 1,2,3) per image classification, cancer-no cancer per view classification, video segmentation, camera shot boundary detection, open-vocabulary video segmentation, open-world video segmentation, lung nodule classification, lung nodule 3d classification, lung nodule detection, lung nodule 3d detection, 3d scene reconstruction, art analysis.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/3f612eed-ab00-41f9-a6a1-91dbf5cf09f1.jpg)
Zero-Shot Composed Image Retrieval (ZS-CIR)
Event segmentation, generic event boundary detection, image retouching, image-variation, jpeg artifact removal, multispectral object detection, point cloud super resolution, skills assessment.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/697a4af1-f65e-4992-8b02-48c6bf021517.jpg)
Sensor Modeling
10-shot image generation, video prediction, earth surface forecasting, predict future video frames, ad-hoc video search, audio-visual synchronization, handwriting generation, pose retrieval, scanpath prediction, scene change detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002411-873b5588_mEzUBaG.jpg)
Sketch-to-Image Translation
Skills evaluation, synthetic image detection, highlight removal, 3d shape reconstruction from a single 2d image.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000135-9e64ef64.jpg)
Shape from Texture
Deception detection, deception detection in videos, handwriting verification, bangla spelling error correction, 3d open-vocabulary instance segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/60ff1d90-d451-4f8a-8c62-94550ec91252.jpg)
3D Shape Representation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000796-71004bae.jpg)
3D Dense Shape Correspondence
Birds eye view object detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000015-28528fd4.jpg)
Multiple People Tracking
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000368-67c60635.jpg)
Network Interpretation
Rgb-d reconstruction, seeing beyond the visible, semi-supervised domain generalization, unsupervised semantic segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001371-00a5d91b.jpg)
Unsupervised Semantic Segmentation with Language-image Pre-training
Multiple object tracking with transformer.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/e33508db-205c-4c2c-a8de-58a8a9e48a0e.jpg)
Multiple Object Track and Segmentation
Constrained lip-synchronization, face dubbing, vietnamese visual question answering, explanatory visual question answering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002904-8c6ca1c7_YGTMmks.jpg)
Video Visual Relation Detection
Human-object relationship detection, 3d shape reconstruction, defocus blur detection, event data classification, image comprehension, image manipulation localization, instance shadow detection, kinship verification, medical image enhancement, open vocabulary panoptic segmentation, single-object discovery, training-free 3d point cloud classification, video forensics.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000361-36f52818.jpg)
Sequential Place Recognition
Autonomous flight (dense forest), autonomous web navigation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/ae65120e-a9ac-4c25-8a19-5e734fd4cafc.jpg)
Generative 3D Object Classification
Cube engraving classification, multimodal machine translation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001101-fb2e2264.jpg)
Face to Face Translation
Multimodal lexical translation, 2d semantic segmentation task 3 (25 classes), document enhancement, 4d panoptic segmentation, action assessment, bokeh effect rendering, drivable area detection, face anonymization, font recognition, horizon line estimation, image imputation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001396-994a63ac.jpg)
Long Video Retrieval (Background Removed)
Medical image denoising.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/948d7c0f-96f5-43bf-9f62-8d4553eef0e2.jpg)
Occlusion Estimation
Physiological computing.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001878-93ba632b.jpg)
Lake Ice Monitoring
Short-term object interaction anticipation, spatio-temporal video grounding, unsupervised 3d point cloud linear evaluation, wireframe parsing, single-image-generation, unsupervised anomaly detection with specified settings -- 30% anomaly, root cause ranking, anomaly detection at 30% anomaly, anomaly detection at various anomaly percentages.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002676-0f8402b4.jpg)
Unsupervised Contextual Anomaly Detection
2d pose estimation, category-agnostic pose estimation, overlapping pose estimation, facial expression recognition, cross-domain facial expression recognition, zero-shot facial expression recognition, landmark tracking, muscle tendon junction identification, 3d object captioning, animated gif generation, generalized referring expression comprehension, image deblocking, infrared image super-resolution, motion disentanglement, persuasion strategies, scene text editing, traffic accident detection, accident anticipation, unsupervised landmark detection, visual speech recognition, lip to speech synthesis, continual anomaly detection, gaze redirection, weakly supervised action segmentation (transcript), weakly supervised action segmentation (action set), calving front delineation in synthetic aperture radar imagery, calving front delineation in synthetic aperture radar imagery with fixed training amount.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/8ce700a1-f5b9-4bc0-a529-c75e38f72e22.jpg)
Handwritten Line Segmentation
Handwritten word segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000002615-e42dabda.jpg)
General Action Video Anomaly Detection
Physical video anomaly detection, monocular cross-view road scene parsing (road), monocular cross-view road scene parsing (vehicle).
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000603-3510e464.jpg)
Transparent Object Depth Estimation
3d semantic occupancy prediction, 3d scene editing, age and gender estimation, data ablation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000351-f7066399.jpg)
Occluded Face Detection
Gait identification, historical color image dating, stochastic human motion prediction, image retargeting, image and video forgery detection, motion captioning, personality trait recognition, personalized segmentation, scene-aware dialogue, spatial relation recognition, spatial token mixer, steganographics, story continuation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/92539aed-3b8c-4b62-b66c-b0fbf7ef3d53.jpg)
Unsupervised Anomaly Detection with Specified Settings -- 0.1% anomaly
Unsupervised anomaly detection with specified settings -- 1% anomaly, unsupervised anomaly detection with specified settings -- 10% anomaly, unsupervised anomaly detection with specified settings -- 20% anomaly, vehicle speed estimation, visual analogies, visual social relationship recognition, zero-shot text-to-video generation, text-guided-generation, video frame interpolation, 3d video frame interpolation, unsupervised video frame interpolation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002776-51e5d023.jpg)
eXtreme-Video-Frame-Interpolation
Continual semantic segmentation, overlapped 5-3, overlapped 25-25, evolving domain generalization, source-free domain generalization, micro-expression generation, micro-expression generation (megc2021), mistake detection, online mistake detection, period estimation, art period estimation (544 artists), unsupervised panoptic segmentation, unsupervised zero-shot panoptic segmentation, 3d rotation estimation, camera auto-calibration, defocus estimation, derendering, fingertip detection, hierarchical text segmentation, human-object interaction concept discovery.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/d79f60e0-4c23-412d-8dab-81b1119620ee.jpg)
One-Shot Face Stylization
Speaker-specific lip to speech synthesis, multi-person pose estimation, neural stylization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/68b38b9c-ed05-4372-a612-01f909dec050.jpg)
Part-aware Panoptic Segmentation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002134-7dd96854.jpg)
Population Mapping
Pornography detection, prediction of occupancy grid maps, raw reconstruction, repetitive action counting, svbrdf estimation, semi-supervised video classification, spectrum cartography, supervised image retrieval, synthetic image attribution, training-free 3d part segmentation, unsupervised image decomposition, video propagation, vietnamese multimodal learning, weakly supervised 3d point cloud segmentation, weakly-supervised panoptic segmentation, drone-based object tracking, brain visual reconstruction, brain visual reconstruction from fmri.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/a34d9382-cb10-400e-a6bd-31f40c72623f.jpg)
Human-Object Interaction Generation
Image-guided composition, fashion understanding, semi-supervised fashion compatibility.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000714-068a8901_2PQwzdm.jpg)
intensity image denoising
Lifetime image denoising, observation completion, active observation completion, boundary grounding.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/5e1f2ccb-0696-44cd-a506-ad8524a59b9b.jpg)
Video Narrative Grounding
3d inpainting, 3d scene graph alignment, 4d spatio temporal semantic segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001517-bc8b5f8c.jpg)
Age Estimation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000616-5c9160ff.jpg)
Few-shot Age Estimation
Brdf estimation, camouflage segmentation, clothing attribute recognition, damaged building detection, depth image estimation, detecting shadows, dynamic texture recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000459-e2dd17f7_kfjELuH.jpg)
Disguised Face Verification
Few shot open set object detection, gaze target estimation, generalized zero-shot learning - unseen, hd semantic map learning, human-object interaction anticipation, image deep networks, keypoint detection and image matching, manufacturing quality control, materials imaging, micro-gesture recognition, multi-person pose estimation and tracking.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001240-5386a638.jpg)
Multi-modal image segmentation
Multi-object discovery, neural radiance caching.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/b5f50463-49bd-42d6-b0fc-2fa6388d11ec.jpg)
Parking Space Occupancy
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002889-9b2c229b.jpg)
Partial Video Copy Detection
Top Computer Vision Papers of All Time (Updated 2024)
![research paper about computer vision](https://viso.ai/wp-content/uploads/2024/03/best-CV-papers-cover-image-1060x439.png)
Viso Suite is the all-in-one solution for teams to build, deliver, scale computer vision applications.
Today’s boom in computer vision (CV) started at the beginning of the 21st century with the breakthrough of deep learning models and convolutional neural networks (CNNs). The main CV methods include image classification, image localization, object detection, and segmentation.
In this article, we dive into some of the most significant research papers that triggered the rapid development of computer vision. We split them into two categories – classical CV approaches, and papers based on deep-learning. We chose the following papers based on their influence, quality, and applicability.
- Gradient-Based Learning Applied to Document Recognition (1998)
- Distinctive Image Features from Scale-Invariant Keypoints (2004)
- Histograms of Oriented Gradients for Human Detection (2005)
- SURF: Speeded Up Robust Features (2006)
- ImageNet Classification with Deep Convolutional Neural Networks (2012)
- Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
- GoogLeNet – Going Deeper with Convolutions (2014)
- ResNet – Deep Residual Learning for Image Recognition (2015)
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)
- YOLO: You Only Look Once – Unified, Real-Time Object Detection (2016)
- Mask R-CNN (2017)
- EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)
About us: Viso Suite is the end-to-end computer vision solution for enterprises. With a simple interface and features that give machine learning teams control over the entire ML pipeline, Viso Suite makes it possible to achieve a 3-year ROI of 695%. Book a demo to learn more about how Viso Suite can help solve business problems.
![research paper about computer vision Viso Platform](https://viso.ai/wp-content/uploads/2024/02/viso-suite-view-1060x625-1.png)
Classic Computer Vision Papers
The authors Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner published the LeNet paper in 1998. They introduced the concept of a trainable Graph Transformer Network (GTN) for handwritten character and word recognition. They investigated discriminative, gradient-based techniques for training the recognizer globally, without manual segmentation and labeling.
![research paper about computer vision LeNet CNN architecture digits recognition](https://viso.ai/wp-content/uploads/2024/03/lenet-architecture-digits-recognition-1060x286.png)
Characteristics of the model:
- LeNet-5 is a seven-layer CNN, alternating convolutional and subsampling layers followed by fully connected layers; its first convolutional layer has 6 feature maps and 156 trainable parameters.
- The input is a 32×32 pixel image, and the output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each class.
- Training on 60,000 examples, the authors achieved a 0.35% error rate on the training set (after 19 passes).
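The 156-parameter figure for the first convolutional layer can be verified with simple arithmetic: six 5×5 filters over a one-channel input, plus one bias per filter. A minimal sketch:

```python
def conv_layer_params(n_filters, kernel_h, kernel_w, in_channels=1):
    """Trainable parameters of a convolutional layer: weights plus one bias per filter."""
    return n_filters * (kernel_h * kernel_w * in_channels + 1)

# LeNet-5's first convolutional layer: 6 feature maps, 5x5 kernels, 1-channel input.
print(conv_layer_params(6, 5, 5))  # 156
```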
Find the LeNet paper here.
David Lowe (2004) proposed a method for extracting distinctive invariant features from images and used them to perform reliable matching between different views of an object or scene. The paper introduced the Scale-Invariant Feature Transform (SIFT), which transforms image data into scale-invariant coordinates relative to local features.
![research paper about computer vision SIFT method keypoints detection](https://viso.ai/wp-content/uploads/2024/03/sift-method-keypoints-selection.jpg)
Model characteristics:
- The method generates large numbers of features that densely cover the image over the full range of scales and locations.
- The model needs to match at least 3 features from each object in order to reliably detect small objects in cluttered backgrounds.
- For image matching and recognition, the model extracts SIFT features from a set of reference images stored in a database.
- The SIFT model matches a new image by individually comparing each of its features to this database, using Euclidean distance between feature vectors.
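The matching step described above can be sketched with toy 2-D descriptors (real SIFT descriptors are 128-dimensional): each query feature is compared to every database feature by Euclidean distance, and a match is kept only if the nearest neighbor is clearly closer than the second-nearest (Lowe's ratio test).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_features(query, database, ratio=0.8):
    """Nearest-neighbor matching with Lowe's ratio test: keep a match only
    if the closest database descriptor is clearly closer than the runner-up."""
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((euclidean(q, d), di) for di, d in enumerate(database))
        (best_d, best_i), (second_d, _) = dists[0], dists[1]
        if best_d < ratio * second_d:
            matches.append((qi, best_i))
    return matches

db = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1)]
print(match_features([(9.9, 10.1)], db))  # [(0, 1)]
```

Real SIFT matching does exactly this over 128-dimensional descriptors, typically with an approximate nearest-neighbor index instead of the exhaustive scan shown here.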
Find the SIFT paper here.
The authors Navneet Dalal and Bill Triggs studied feature sets for robust visual object recognition, using linear SVM-based human detection as a test case. They experimented with grids of Histograms of Oriented Gradient (HOG) descriptors, which significantly outperformed existing feature sets for human detection.
![research paper about computer vision histogram object detection](https://viso.ai/wp-content/uploads/2024/03/histogram-feature-extraction-object-detection.jpg)
The authors' achievements:
- The histogram method gave near-perfect separation on the original MIT pedestrian database.
- For good results, the model requires fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks.
- Researchers examined a more challenging dataset containing over 1800 annotated human images with many pose variations and backgrounds.
- In the standard detector, each HOG cell appears four times with different normalizations, and including this redundancy improves performance to 89%.
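The orientation binning at the heart of HOG can be sketched in plain Python: each pixel's gradient votes into one of 9 unsigned-orientation bins, weighted by its magnitude. This is a toy version; the real descriptor adds interpolated voting and the block normalization discussed above.

```python
import math

def hog_cell_histogram(gradients, n_bins=9):
    """Orientation histogram for one HOG cell: each (gx, gy) gradient votes
    into an orientation bin (0-180 degrees, unsigned), weighted by magnitude."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for gx, gy in gradients:
        mag = math.hypot(gx, gy)
        angle = math.degrees(math.atan2(gy, gx)) % 180.0
        hist[int(angle // bin_width) % n_bins] += mag
    return hist

# two horizontal-gradient pixels (angle 0) and one vertical-gradient pixel (angle 90)
h = hog_cell_histogram([(1.0, 0.0), (2.0, 0.0), (0.0, 1.0)])
print(h[0], h[4])  # 3.0 1.0
```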
Find the HOG paper here.
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool presented a scale- and rotation-invariant interest point detector and descriptor, called SURF (Speeded Up Robust Features). It outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, while computing much faster. The authors relied on integral images for image convolutions and built on the strengths of the leading existing detectors and descriptors.
![research paper about computer vision surf detecting interest points](https://viso.ai/wp-content/uploads/2024/03/surf-detected-interest-points.jpg)
- Applied a Hessian matrix-based measure for the detector, and a distribution-based descriptor, simplifying these methods to the essential.
- Presented experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application.
- SURF showed strong performance – SURF-128 with an 85.7% recognition rate, followed by U-SURF (83.8%) and SURF (82.6%).
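SURF scores candidate interest points with the determinant of its approximated Hessian, down-weighting the mixed second-derivative term to compensate for the box-filter approximation (the 0.9 weight follows the paper; the input responses below are made-up illustrative values, not real filter outputs):

```python
def hessian_response(dxx, dyy, dxy, weight=0.9):
    """SURF's blob response: determinant of the approximated Hessian, with
    the mixed term down-weighted to balance the box-filter approximation."""
    return dxx * dyy - (weight * dxy) ** 2

# A blob-like point (strong, equal curvature in both directions) scores high;
# an edge (strong curvature in only one direction) scores low.
print(hessian_response(4.0, 4.0, 0.0))   # 16.0
print(hessian_response(4.0, 0.25, 0.0))  # 1.0
```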
Find the SURF paper here.
Papers Based on Deep-Learning Models
Alex Krizhevsky and his team won the ImageNet Challenge in 2012 with a deep convolutional neural network. They trained one of the largest CNNs to date on the ImageNet data used in the ILSVRC-2010 and ILSVRC-2012 challenges and achieved the best results reported on these datasets. They wrote a highly optimized GPU implementation of 2D convolution and all the other operations required to train a CNN, and published the results.
![research paper about computer vision alexnet CNN architecture](https://viso.ai/wp-content/uploads/2024/03/alexnet-cnn-architecture.jpg)
- The final CNN contained five convolutional and three fully connected layers, and this depth proved essential to its performance.
- They found that removing any convolutional layer (each containing less than 1% of the model’s parameters) resulted in inferior performance.
- The same CNN, with an extra sixth convolutional layer, was used to classify the entire ImageNet Fall 2011 release (15M images, 22K categories).
- After fine-tuning on ImageNet-2012 it gave an error rate of 16.6%.
Find the ImageNet paper here.
Karen Simonyan and Andrew Zisserman (Oxford University) investigated the effect of convolutional network depth on accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, the family of very deep convolutional networks now known as VGG. They showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.
![research paper about computer vision image classification CNN results VOC-2007, VOC-2012](https://viso.ai/wp-content/uploads/2024/03/VOC-2012-image-classification.jpg)
- Their ImageNet Challenge 2014 submission secured the first and second places in the localization and classification tracks respectively.
- They showed that their representations generalize well to other datasets, where they achieved state-of-the-art results.
- They made two best-performing ConvNet models publicly available, in addition to the deep visual representations in CV.
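The appeal of stacked 3×3 filters is easy to quantify: three stacked 3×3 layers cover the same 7×7 receptive field as a single 7×7 layer, with roughly half the weights. A back-of-the-envelope check (the channel count C = 256 is illustrative; biases are ignored):

```python
def conv_params(kernel, channels):
    """Weights in a conv layer with `channels` input and output channels (biases ignored)."""
    return kernel * kernel * channels * channels

C = 256
stacked = 3 * conv_params(3, C)  # three 3x3 layers: 27 * C^2 weights
single = conv_params(7, C)       # one 7x7 layer:    49 * C^2 weights
print(stacked, single)  # 1769472 3211264
```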
Find the VGG paper here.
The Google team (Christian Szegedy, Wei Liu, et al.) proposed a deep convolutional neural network architecture codenamed Inception. They intended to set the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of their architecture was the improved utilization of the computing resources inside the network.
![research paper about computer vision GoogleNet Inception CNN](https://viso.ai/wp-content/uploads/2024/03/googlenet-inception-module-dimension-reductions.jpg)
- A carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant.
- Their submission for ILSVRC14 was called GoogLeNet, a 22-layer deep network. Its quality was assessed in the context of classification and detection.
- They added 200 region proposals coming from multi-box, increasing the coverage from 92% to 93%.
- Lastly, they used an ensemble of 6 ConvNets when classifying each region, which improved results from 40% to 43.9% accuracy.
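The Inception module's dimension-reduction trick can be illustrated with a quick parameter count: inserting a 1×1 "bottleneck" before an expensive 5×5 convolution cuts the weights by roughly 10×. The channel sizes below (192 in, 16 reduced, 32 out) are illustrative; biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer mapping c_in to c_out channels (biases ignored)."""
    return k * k * c_in * c_out

c_in, c_out, reduced = 192, 32, 16
direct = conv_params(5, c_in, c_out)  # 5x5 applied straight to 192 channels
with_reduction = conv_params(1, c_in, reduced) + conv_params(5, reduced, c_out)
print(direct, with_reduction)  # 153600 15872
```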
Find the GoogLeNet paper here.
Microsoft researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun presented a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously. They reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
![research paper about computer vision resnet error rates](https://viso.ai/wp-content/uploads/2024/03/resnet-error-rates-ImageNet.jpg)
- They evaluated residual nets with a depth of up to 152 layers – 8× deeper than VGG nets, but still having lower complexity.
- This result won 1st place on the ILSVRC 2015 classification task.
- The team also analyzed CIFAR-10 with networks of 100 and 1,000 layers, and their deep residual representations yielded a 28% relative improvement on the COCO object detection dataset.
- Moreover, in the ILSVRC & COCO 2015 competitions, they won 1st place on the tasks of ImageNet detection, ImageNet localization, and COCO detection/segmentation.
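The core idea can be sketched in a few lines: a residual block adds its input back to the output of its layers, so those layers only have to model the deviation from the identity mapping. A minimal, framework-free illustration (`f` stands in for the block's learned layers):

```python
def residual_block(x, f):
    """Residual learning: output f(x) + x, so the layers model only the
    residual f(x) = h(x) - x instead of the full mapping h(x)."""
    return [fi + xi for fi, xi in zip(f(x), x)]

# If the optimal mapping is close to identity, f can simply learn ~zero,
# which is far easier for a deep stack of layers to fit.
out = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
print(out)  # [1.0, 2.0, 3.0]
```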
Find the ResNet paper here.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. They trained the RPN end-to-end to generate high-quality region proposals, which Fast R-CNN then uses for detection.
![research paper about computer vision faster R-CNN object detection](https://viso.ai/wp-content/uploads/2024/03/faster-R-CNN-unified-network.jpg)
- Merged RPN and Fast R-CNN into a single network by sharing their convolutional features. In addition, they applied neural networks with “attention” mechanisms.
- For the very deep VGG-16 model, their detection system had a frame rate of 5 fps on a GPU.
- Achieved state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.
- In ILSVRC and COCO 2015 competitions, faster R-CNN and RPN were the foundations of the 1st-place winning entries in several tracks.
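Region proposals are scored against ground-truth boxes by intersection-over-union (IoU), the overlap measure used to label anchors as positive or negative during RPN training. A minimal IoU function (boxes as (x1, y1, x2, y2) corner tuples; a sketch, not the paper's implementation):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.143
```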
Find the Faster R-CNN paper here.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi developed YOLO, an innovative approach to object detection. Instead of repurposing classifiers to perform detection, the authors framed object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
![research paper about computer vision YOLO CNN architecture](https://viso.ai/wp-content/uploads/2024/03/yolo-architecture.jpg)
- The base YOLO model processed images in real-time at 45 frames per second.
- A smaller version of the network, Fast YOLO, processed 155 frames per second, while still achieving double the mAP of other real-time detectors.
- Compared to state-of-the-art detection systems, YOLO made more localization errors but was less likely to predict false positives in the background.
- YOLO learned very general representations of objects and outperformed other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains such as artwork.
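The regression formulation fixes the size of the network's output: for an S × S grid with B boxes per cell and C classes, the final tensor has S × S × (B·5 + C) values, since each box carries (x, y, w, h, confidence). For the paper's PASCAL VOC setup (S = 7, B = 2, C = 20):

```python
def yolo_output_size(s=7, b=2, c=20):
    """YOLO's final layer: an S x S grid, each cell predicting B boxes
    of (x, y, w, h, confidence) plus C class probabilities."""
    return s * s * (b * 5 + c)

print(yolo_output_size())  # 1470, i.e. a 7 x 7 x 30 tensor
```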
Find the YOLO paper here.
Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick (Facebook) presented a conceptually simple, flexible, and general framework for object instance segmentation. Their approach detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
![research paper about computer vision mask R-CNN framework](https://viso.ai/wp-content/uploads/2024/03/mask-rcnn-framework.jpg)
- Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
- Showed strong results in all three tracks of the COCO suite of challenges: instance segmentation, bounding-box object detection, and person keypoint detection.
- Mask R-CNN outperformed all existing, single-model entries on every task, including the COCO 2016 challenge winners.
- The model served as a solid baseline and eased future research in instance-level recognition.
Find the Mask R-CNN paper here.
The authors of EfficientNet (Mingxing Tan and Quoc V. Le) studied model scaling and identified that carefully balancing network depth, width, and resolution can lead to better performance. They proposed a new scaling method that uniformly scales all dimensions of depth, width, and resolution using a simple but effective compound coefficient. They demonstrated the effectiveness of this method in scaling up MobileNets and ResNet.
![research paper about computer vision efficiennet model scaling CNN](https://viso.ai/wp-content/uploads/2024/03/efficientnet-model-scaling.jpg)
- Designed a new baseline network and scaled it up to obtain a family of models, called EfficientNets. It had much better accuracy and efficiency than previous ConvNets.
- EfficientNet-B7 achieved state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet.
- It also transferred well and achieved state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with much fewer parameters.
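The compound scaling rule can be reproduced in a few lines. Given the paper's coefficients for the B0 baseline (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution, chosen so that α·β²·γ² ≈ 2), a single exponent φ scales all three dimensions together while roughly doubling FLOPS per unit of φ:

```python
# EfficientNet compound scaling: one coefficient phi grows depth, width,
# and resolution together under the constraint alpha * beta^2 * gamma^2 ~= 2.
alpha, beta, gamma = 1.2, 1.1, 1.15

def scale(phi):
    """Multipliers for (depth, width, resolution) at scaling exponent phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

depth, width, res = scale(2)
print(round(depth, 2), round(width, 2), round(res, 2))  # 1.44 1.21 1.32
print(round(alpha * beta ** 2 * gamma ** 2, 2))         # 1.92, close to 2
```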
Find the EfficientNet paper here.
- Review Article
- Open access
- Published: 08 January 2021
Deep learning-enabled medical computer vision
- Andre Esteva ORCID: orcid.org/0000-0003-1937-9682 1 ,
- Katherine Chou 2 na1 ,
- Serena Yeung 3 na1 ,
- Nikhil Naik ORCID: orcid.org/0000-0002-5191-2726 1 na1 ,
- Ali Madani 1 na1 ,
- Ali Mottaghi 3 na1 ,
- Yun Liu ORCID: orcid.org/0000-0003-4079-8275 2 ,
- Eric Topol 4 ,
- Jeff Dean 2 &
- Richard Socher 1
npj Digital Medicine volume 4, Article number: 5 (2021)
- Computational science
- Health care
- Medical research
A decade of unprecedented progress in artificial intelligence (AI) has demonstrated the potential for many fields—including medicine—to benefit from the insights that AI techniques can extract from data. Here we survey recent progress in the development of modern computer vision techniques—powered by deep learning—for medical applications, focusing on medical imaging, medical video, and clinical deployment. We start by briefly summarizing a decade of progress in convolutional neural networks, including the vision tasks they enable, in the context of healthcare. Next, we discuss several example medical imaging applications that stand to benefit—including cardiology, pathology, dermatology, and ophthalmology—and propose new avenues for continued work. We then expand into general medical video, highlighting ways in which clinical workflows can integrate computer vision to enhance care. Finally, we discuss the challenges and hurdles required for real-world clinical deployment of these technologies.
Introduction
Computer vision (CV) has a rich history spanning decades 1 of efforts to enable computers to perceive visual stimuli meaningfully. Machine perception spans a range of levels, from low-level tasks such as identifying edges, to high-level tasks such as understanding complete scenes. Advances in the last decade have largely been due to three factors: (1) the maturation of deep learning (DL)—a type of machine learning that enables end-to-end learning of very complex functions from raw data 2 , (2) strides in localized compute power via GPUs 3 , and (3) the open-sourcing of large labeled datasets with which to train these algorithms 4 . The combination of these three elements has given individual researchers access to the resources needed to advance the field. As the research community grew exponentially, so did progress.
The growth of modern CV has overlapped with the generation of large amounts of digital data in a number of scientific fields. Recent medical advances have been prolific 5 , 6 , owing largely to DL’s remarkable ability to learn many tasks from most data sources. Using large datasets, CV models can acquire many pattern-recognition abilities—from physician-level diagnostics 7 to medical scene perception 8 . See Fig. 1 .
![research paper about computer vision figure 1](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-020-00376-2/MediaObjects/41746_2020_376_Fig1_HTML.png)
a Multimodal discriminative model. Deep learning architectures can be constructed to jointly learn from both image data, typically with convolutional networks, and non-image data, typically with general deep networks. Learned annotations can include disease diagnostics, prognostics, clinical predictions, and combinations thereof. b Generative model. Convolutional neural networks can be trained to generate images. Tasks include image-to-image regression (shown), super-resolution image enhancement, novel image generation, and others.
Here we survey the intersection of CV and medicine, focusing on research in medical imaging, medical video, and real clinical deployment. We discuss key algorithmic capabilities which unlocked these opportunities, and dive into the myriad of accomplishments from recent years. The clinical tasks suitable for CV span many categories, such as screening, diagnosis, detecting conditions, predicting future outcomes, segmenting pathologies from organs to cells, monitoring disease, and clinical research. Throughout, we consider the future growth of this technology and its implications for medicine and healthcare.
Computer vision
Object classification, localization, and detection, respectively refer to identifying the type of an object in an image, the location of objects present, and both type and location simultaneously. The ImageNet Large-Scale Visual Recognition Challenge 9 (ILSVRC) was a spearhead to progress in these tasks over the last decade. It created a large community of DL researchers competing and collaborating together to improve techniques on various CV tasks. The first contemporary, GPU-powered DL approach, in 2012 10 , yielded an inflection point in the growth of this community, heralding an era of significant year-over-year improvements 11 , 12 , 13 , 14 through the competition’s final year in 2017. Notably, classification accuracy achieved human-level performance during this period. Within medicine, fine-grained versions of these methods 15 have successfully been applied to the classification and detection of many diseases (Fig. 2 ). Given sufficient data, the accuracy often matches or surpasses the level of expert physicians 7 , 16 . Similarly, the segmentation of objects has substantially improved 17 , 18 , particularly in challenging scenarios such as the biomedical segmentation of multiple types of overlapping cells in microscopy. The key DL technique leveraged in these tasks is the convolutional neural network 19 (CNN)—a type of DL algorithm which hardcodes translational invariance, a key feature of image data. Many other CV tasks have benefited from this progress, including image registration (identifying corresponding points across similar images), image retrieval (finding similar images), and image reconstruction and enhancement. The specific challenges of working with medical data require the utilization of many types of AI models.
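The translational structure that CNNs hardcode can be demonstrated in one dimension: convolving a shifted signal produces a correspondingly shifted response (strictly, convolution is translation-equivariant; pooling layers then yield approximate invariance). A minimal sketch in plain Python:

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (correlation form), the building block CNNs tile over images."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

x = [0, 0, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 0, 0]  # the same pattern, moved one step right
kernel = [1, 2, 1]
print(conv1d(x, kernel))        # [1, 2, 1, 0]
print(conv1d(shifted, kernel))  # [0, 1, 2, 1] -- the response shifts too
```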
![research paper about computer vision figure 2](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-020-00376-2/MediaObjects/41746_2020_376_Fig2_HTML.png)
CNNs—trained to classify disease states—have been extensively tested across diseases, and benchmarked against physicians. Their performance is typically on par with experts when both are tested on the same image classification task. a Dermatology 7 and b Radiology 156 . Examples reprinted with permission and adapted for style.
These techniques largely rely on supervised learning, which leverages datasets that contain both data points (e.g. images) and data labels (e.g. object classes). Given the sparsity and access difficulties of medical data, transfer learning—in which an algorithm is first trained on a large and unrelated corpus (e.g. ImageNet 4 ), then fine-tuned on a dataset of interest (e.g. medical)—has been critical for progress. To reduce the costs associated with collecting and labeling data, techniques to generate synthetic data, such as data augmentation 20 and generative adversarial networks (GANs) 21 , are being developed. Researchers have even shown that crowd-sourcing image annotations can yield effective medical algorithms 22 , 23 . Recently, self-supervised learning 24 —in which implicit labels are extracted from data points and used to train algorithms (e.g. predicting the spatial arrangement of tiles generated from splitting an image into pieces)—has pushed the field towards fully unsupervised learning, which lacks the need for labels. Applying these techniques in medicine will reduce the barrier to development and deployment.
Medical data access is central to this field, and key ethical and legal questions must be addressed. Do patients own their de-identified data? What if methods to re-identify data improve over time? Should the community open-source large quantities of data? To date, academia and industry have largely relied on small, open-source datasets, and data collected through commercial products. Dynamics around data sharing and country-specific availability will impact deployment opportunities. The field of federated learning 25 —in which centralized algorithms can be trained on distributed data that never leaves protected enclosures—may enable a workaround in stricter jurisdictions.
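The federated learning idea can be illustrated with a toy sketch: each client runs gradient steps on its private data and sends back only model weights, which the server averages (an illustrative NumPy sketch of federated averaging on a linear model, not a production protocol):

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, steps=10):
    """One client's local gradient steps on a linear model (MSE loss).
    The raw data (X, y) never leaves the client."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    """One round of federated averaging: each client trains locally;
    the server averages the returned weights by client dataset size."""
    updates, sizes = [], []
    for X, y in clients:
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=np.array(sizes, float))

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
clients = []
for _ in range(3):  # three institutions with private datasets
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.01 * rng.normal(size=50)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
# w approaches true_w even though no client ever shared raw data.
```

Only model parameters cross institutional boundaries here, which is what makes the approach attractive in jurisdictions with strict data-protection rules.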
These advances have spurred growth in other domains of CV, such as multimodal learning, which combines vision with other modalities such as language (Fig. 1a ) 26 , time-series data, and genomic data 5 . These methods can combine with 3D vision 27 , 28 to turn depth-cameras into privacy-preserving sensors 29 , making deployment easier for patient settings such as the intensive care unit 8 . The range of tasks is even broader in video. Applications like activity recognition 30 and live scene understanding 31 are useful in detecting and responding to important or adverse clinical events 32 .
Medical imaging
In recent years, the number of publications applying computer vision techniques to static medical imagery has grown from hundreds to thousands 33 . A few areas have received substantial attention—radiology, pathology, ophthalmology, and dermatology—owing to the visual pattern-recognition nature of diagnostic tasks in these specialties, and the growing availability of highly structured images.
The unique characteristics of medical imagery pose a number of challenges to DL-based computer vision. For one, images can be massive. Digitizing histopathology slides produces gigapixel images of around 100,000 × 100,000 pixels, whereas typical CNN image inputs are around 200 × 200 pixels. Further, different chemical preparations will render different slides for the same piece of tissue, and different digitization devices or settings may produce different images for the same slide. Radiology modalities such as CT and MRI render equally massive 3D images, forcing standard CNNs to either work with a set of 2D slices or adjust their internal structure to process in 3D. Similarly, ultrasound renders a time-series of noisy 2D slices of a 3D context; these slices are spatially correlated but not aligned. DL has started to account for the unique challenges of medical data. For instance, multiple-instance learning (MIL) 34 enables learning from datasets containing massive images and few labels (e.g. histopathology). 3D convolutions in CNNs are enabling better learning from 3D volumes (e.g. MRI and CT) 35 . Spatio-temporal models 36 and image registration enable working with time-series images (e.g. ultrasound).
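The multiple-instance-learning setup mentioned above can be sketched as follows: a slide becomes a "bag" of tiles, each tile receives a score, and a permutation-invariant pooling (here, max) produces a slide-level prediction that can be trained against slide-level labels alone (a toy NumPy sketch in which mean intensity stands in for a CNN's tile score):

```python
import numpy as np

def slide_to_tiles(slide, tile=224):
    """Crop a large 2D slide into non-overlapping tiles (a 'bag')."""
    h, w = slide.shape
    return [slide[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

def mil_slide_score(tiles, tile_scorer):
    """Max-pooling MIL: a slide is positive if its most suspicious
    tile is positive, so only a slide-level label is needed."""
    return max(tile_scorer(t) for t in tiles)

# Toy scorer: mean intensity stands in for a CNN's tumor probability.
scorer = lambda t: float(t.mean())

slide = np.zeros((896, 896))          # small stand-in for a gigapixel WSI
slide[224:448, 448:672] = 1.0         # one 'tumorous' tile-sized region
tiles = slide_to_tiles(slide)
score = mil_slide_score(tiles, scorer)
```

Because the pooling only needs the bag-level label, expensive region-level annotation of each tile can be avoided.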
Dozens of companies have obtained US FDA and European CE approval for medical imaging AI 37 , and commercial markets have begun to form as sustainable business models are created. For instance, regions of high-throughput healthcare, such as India and Thailand, have welcomed the deployment of technologies such as diabetic retinopathy screening systems 38 . This rapid growth has now reached the point of directly impacting patient outcomes—the US CMS recently approved reimbursement for a radiology stroke triage use-case which reduces the time it takes for patients to receive treatment 39 .
CV in medical modalities with non-standardized data collection requires the integration of CV into existing physical systems. For instance, in otolaryngology, CNNs can be used to help primary care physicians manage patients’ ears, nose, and throat 40 , through mountable devices attached to smartphones 41 . Hematology and serology can benefit from microscope-integrated AIs 42 that diagnose common conditions 43 or count blood cells of various types 44 —repetitive tasks that are easy to augment with CNNs. AI in gastroenterology has demonstrated stunning capabilities. Video-based CNNs can be integrated into endoscopic procedures 45 for scope guidance, lesion detection, and lesion diagnosis. Applications include esophageal cancer screening 46 , detecting gastric cancer 47 , 48 , detecting stomach infections such as H. Pylori 49 , and even finding hookworms 50 . Scientists have taken this field one step further by building entire medical AI devices designed for monitoring, such as at-home smart toilets outfitted with diagnostic CNNs on cameras 51 . Beyond the analysis of disease states, CV can serve the future of human health and welfare through applications such as screening human embryos for implantation 52 .
Computer vision in radiology has quickly burgeoned into its own field of research, growing a corpus of work 53 , 54 , 55 that extends into all modalities, with a focus on X-rays, CT, and MRI. Chest X-ray analysis—a key clinical focus area 33 —has been an exemplar: the field has collected nearly 1 million annotated, open-source images 56 , 57 , 58 —the closest ImageNet 9 equivalent to date in medical CV. Analysis of brain imagery 59 (particularly for time-critical use-cases like stroke) and abdominal imagery 60 has similarly received substantial attention. Disease classification, nodule detection 61 , and region segmentation (e.g. ventricular 62 ) models have been developed for most conditions for which data can be collected. This has enabled the field to respond rapidly in times of crisis—for instance, developing and deploying COVID-19 detection models 63 . The field continues to expand with work in image translation (e.g. converting noisy ultrasound images into MRI), image reconstruction and enhancement (e.g. converting low-dosage, low-resolution CT images into high-resolution images 64 ), automated report generation, and temporal tracking (e.g. image registration to track tumor growth over time). In the sections below, we explore vision-based applications in other specialties.
Cardiac imaging is increasingly used in a wide array of clinical diagnoses and workflows. Key clinical applications for deep learning include diagnosis and screening. The most common imaging modality in cardiovascular medicine is the cardiac ultrasound, or echocardiogram. As a cost-effective, radiation-free technique, echocardiography is uniquely suited for DL due to straightforward data acquisition and interpretation—it is routinely used in most acute inpatient facilities, outpatient centers, and emergency rooms 65 . Further, 3D imaging techniques such as CT and MRI are used to understand cardiac anatomy and to better characterize supply-demand mismatch. CT segmentation algorithms have even been FDA-cleared for coronary artery visualization 66 .
There are many example applications. DL can be trained on a large database of echocardiographic studies and surpass the performance of board-certified echocardiographers in view classification 67 . Computational DL pipelines can assess hypertrophic cardiomyopathy, cardiac amyloid, and pulmonary arterial hypertension 68 . EchoNet 69 —a deep learning model that can recognize cardiac structures, estimate function, and predict systemic phenotypes that are not readily identifiable to human interpretation—has recently furthered the field.
To account for challenges around data access, data-efficient echocardiogram algorithms 70 have been developed, such as semi-supervised GANs that are effective at downstream tasks (e.g. predicting left ventricular hypertrophy). To counter the fact that most studies utilize privately held medical imaging datasets, 10,000 annotated echocardiogram videos were recently open-sourced 36 . Alongside this release, a video-based model, EchoNet-Dynamic 36 , was developed; it can estimate ejection fraction and assess cardiomyopathy, and was comprehensively evaluated against an external dataset and human experts.
Pathologists play a key role in cancer detection and treatment. Pathological analysis—based on visual inspection of tissue samples under microscope—is inherently subjective in nature. Differences in visual perception and clinical training can lead to inconsistencies in diagnostic and prognostic opinions 71 , 72 , 73 . Here, DL can support critical medical tasks, including diagnostics, prognostication of outcomes and treatment response, pathology segmentation, disease monitoring, and so forth.
Recent years have seen the adoption of sub-micron-resolution tissue scanners that capture gigapixel whole-slide images (WSI) 74 . This development, coupled with advances in CV, has led to research and commercialization activity in AI-driven digital histopathology 75 . This field has the potential to (i) overcome limitations of human visual perception and cognition by improving the efficiency and accuracy of routine tasks, (ii) develop new signatures of disease and therapy from morphological structures invisible to the human eye, and (iii) combine pathology with radiological, genomic, and proteomic measurements to improve diagnosis and prognosis 76 .
One thread of research has focused on automating the routine, time-consuming task of localization and quantification of morphological features. Examples include the detection and classification of cells, nuclei, and mitoses 77 , 78 , 79 , and the localization and segmentation of histological primitives such as nuclei, glands, ducts, and tumors 80 , 81 , 82 , 83 . These methods typically require expensive manual annotation of tissue components by pathologists as training data.
Another research avenue focuses on direct diagnostics 84 , 85 , 86 and prognostics 87 , 88 from WSI or tissue microarrays (TMA) for a variety of cancers—breast, prostate, lung cancer, etc. Studies have even shown that morphological features captured by a hematoxylin and eosin (H&E) stain are predictive of molecular biomarkers utilized in theragnosis 85 , 89 . While histopathology slides digitize into massive, data-rich gigapixel images, region-level annotations are sparse and expensive. To help overcome this challenge, the field has developed DL algorithms based on multiple-instance learning 90 that utilize slide-level “weak” annotations and exploit the sheer size of these images for improved performance.
The data abundance of this domain has further enabled tasks such as virtual staining 91 , in which models are trained to predict one type of image (e.g. a stained image) from another (e.g. a raw microscopy image). See Fig. 1b . Moving forward, AI algorithms that learn to perform diagnosis, prognosis, and theragnosis using digital pathology image archives and annotations readily available from electronic health records have the potential to transform the fields of pathology and oncology.
Dermatology
The key clinical tasks for DL in dermatology include lesion-specific differential diagnostics, finding concerning lesions amongst many benign lesions, and tracking lesion growth over time 92 . A series of works have demonstrated that CNNs can match the performance of board-certified dermatologists at distinguishing malignant from benign skin lesions 7 , 93 , 94 . These studies have sequentially tested increasing numbers of dermatologists (25 7 , 57 93 , and 157 94 ), consistently demonstrating sensitivity and specificity that match or even exceed physician levels. These studies were largely restricted to the binary classification task of discerning benign vs malignant cutaneous lesions, classifying either melanomas from nevi or carcinomas from seborrheic keratoses.
Recently, this line of work has expanded to encompass differential diagnostics across dozens of skin conditions 95 , including non-neoplastic lesions such as rashes and genetic conditions, and incorporating non-visual metadata (e.g. patient demographics) as classifier inputs 96 . These works have been catalyzed by open-access image repositories and AI challenges that encourage teams to compete on predetermined benchmarks 97 .
Incorporating these algorithms into clinical workflows would allow their utility to support other key tasks, including large-scale detection of malignancies on patients with many lesions, and tracking lesions across images in order to capture temporal features, such as growth and color changes. This area remains fairly unexplored, with initial works that jointly train CNNs to detect and track lesions 98 .
Ophthalmology
Ophthalmology has, in recent years, seen a significant uptick in AI efforts, with dozens of papers demonstrating clinical diagnostic and analytical capabilities that extend beyond current human capability 99 , 100 , 101 . The potential clinical impact is significant 102 , 103 —the portability of the machinery used to inspect the eye means that pop-up clinics and telemedicine could bring testing sites to underserved areas. The field depends largely on fundus imaging and optical coherence tomography (OCT) to diagnose and manage patients.
CNNs can accurately diagnose a number of conditions. Diabetic retinopathy—a condition in which blood vessels in the eyes of diabetic patients “leak” and can lead to blindness—has been extensively studied. CNNs consistently demonstrate physician-level grading from fundus photographs 104 , 105 , 106 , 107 , which has led to a recent US FDA-cleared system 108 . Similarly, they can diagnose or predict the progression of center-involved diabetic macular edema 109 , age-related macular degeneration 107 , 110 , glaucoma 107 , 111 , manifest visual field loss 112 , childhood blindness 113 , and others.
The eyes contain a number of non-human-interpretable features, indicative of meaningful medical information, that CNNs can pick up on. Remarkably, CNNs have been shown to classify a number of cardiovascular and diabetic risk factors from fundus photographs 114 , including age, gender, smoking status, hemoglobin A1c, body-mass index, systolic blood pressure, and diastolic blood pressure. CNNs can also detect signs of anemia 115 and chronic kidney disease 116 from fundus photographs. This presents an exciting opportunity for future AI studies predicting nonocular information from eye images, and could lead to a paradigm shift in care in which eye exams screen patients for both ocular and nonocular disease—something currently beyond the reach of human physicians.
Medical video
Surgical applications.
CV may provide significant utility in procedural fields such as surgery and endoscopy. Key clinical applications for deep learning include enhancing surgeon performance through real-time contextual awareness 117 , skills assessment, and training. Early studies have begun pursuing these objectives, primarily in video-based robotic and laparoscopic surgery—a number of works propose methods for detecting surgical tools and actions 118 , 119 , 120 , 121 , 122 , 123 , 124 . Some studies analyze tool movement or other cues to assess surgeon skill 119 , 121 , 123 , 124 , through established ratings such as the Global Operative Assessment of Laparoscopic Skills (GOALS) criteria for laparoscopic surgery 125 . Another line of work uses CV to recognize distinct phases of surgery during operations, towards developing context-aware computer assistance systems 126 , 127 . CV is also starting to emerge in open surgery settings 128 , of which there is a significant volume. The challenge here lies in the diversity of video capture viewpoints (e.g., head-mounted, side-view, and overhead cameras) and types of surgeries. For all types of surgical video, translating CV analysis into tools and applications that improve patient outcomes is a natural next direction of research.
Human activity
CV can recognize human activity in physical spaces, such as hospitals and clinics, for a range of “ambient intelligence” applications. Ambient intelligence refers to a continuous, non-invasive awareness of activity in a physical space that can provide clinicians, nurses, and other healthcare workers with assistance such as patient monitoring, automated documentation, and monitoring for protocol compliance (Fig. 3 ). In hospitals, for example, early works have demonstrated CV-based ambient intelligence in intensive care units to monitor for safety-critical behaviors such as hand hygiene activity 32 and patient mobilization 8 , 129 , 130 . CV has also been developed for the emergency department, to transcribe procedures performed during the resuscitation of a patient 131 , and for the operating room (OR), to recognize activities for workflow optimization 132 . At the hospital operations level, CV can be a scalable and detailed form of labor and resource measurement that improves resource allocation for optimal care 133 .
![Figure 3](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-020-00376-2/MediaObjects/41746_2020_376_Fig3_HTML.png)
Computer vision coupled with sensors and video streams enables a number of safety applications in clinical and home settings, enabling healthcare providers to scale their ability to monitor patients. Primarily created using models for fine-grained activity recognition, applications may include patient monitoring in ICUs, proper hand hygiene and physical action protocols in hospitals and clinics, anomalous event detection, and others.
Outside of hospitals, ambient intelligence can increase access to healthcare. For instance, it could enable at-risk seniors to live independently at home by monitoring for safety and abnormalities in daily activities (e.g. detecting falls, which are particularly dangerous for the elderly 134 , 135 ), assisted living, and physiological measurement. Related work 136 , 137 , 138 has targeted broader categories of daily activity: recognizing and computing long-term descriptive analytics of activities such as sleeping, walking, and sitting can detect clinically meaningful changes or anomalies 136 . To ensure patient privacy, researchers have developed CV algorithms that work with thermal video data 136 . Another application area of CV is assisted living and rehabilitation, such as continuous sign language recognition to assist people with communication difficulties 139 and the monitoring of physiotherapy exercises for stroke rehabilitation 140 . CV also offers potential as a tool for remote physiological measurement: for instance, systems could analyze heart and breathing rates from video 141 . As telemedicine visits increase in frequency, CV could play a role in patient triaging, particularly in times of high demand such as the COVID-19 pandemic 142 . CV-based ambient intelligence technologies offer a wide range of opportunities for increased access to quality care. However, new ethical and legal questions will arise 143 in the design of these technologies.
Clinical deployment
As medical AI advances into the clinic 144 , it will simultaneously have the power to do great good for society, and to potentially exacerbate long-standing inequalities and perpetuate errors in medicine. If done properly and ethically, medical AI can become a flywheel for more equitable care—the more it is used, the more data it acquires, the more accurate and general it becomes. The key is in understanding the data that the models are built on and the environment in which they are deployed. Here, we present four key considerations when applying ML technologies in healthcare: assessment of data, planning for model limitations, community participation, and trust building.
Data quality largely determines model quality; identifying inequities in the data and taking them into account will lead towards more equitable healthcare. Procuring the right datasets may depend on running human-in-the-loop programs or broad-reaching data collection techniques. There are a number of methods that aim to remove bias in data. Individual-level bias can be addressed via expert discussion 145 and labeling adjudication 146 . Population-level bias can be addressed via missing data supplements and distributional shifts. International multi-institutional evaluation is a robust method to determine generalizability of models across diverse populations, medical equipment, resource settings, and practice patterns. In addition, using multi-task learning 147 to train models to perform a variety of tasks rather than one narrowly defined task, such as multi-cancer detection from histopathology images 148 , makes them more generally useful and often more robust.
Transparent reporting can reveal potential weaknesses and help address model limitations. Guardrails to protect against possible worst-case scenarios—such as minority dismissal or automation bias—must be put in place. It is insufficient to report and be satisfied with strong performance measures on general datasets when delivering care for patients—there should be an understanding of the specific instances in which the model fails. One technique is to assess demographic performance in combination with saliency maps 149 , which visualize what the model pays attention to, and to check for potential biases. For instance, when using deep learning to develop a differential diagnosis for skin diseases 95 , researchers examined model performance across Fitzpatrick skin types and other demographic information to determine patient types for which there were insufficient examples and to inform future data collection. Further, they used saliency masks to verify that the model was informed by skin abnormalities and not skin type. See Fig. 4 .
![Figure 4](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-020-00376-2/MediaObjects/41746_2020_376_Fig4_HTML.png)
a Example graphic of biased training data in dermatology. AIs trained primarily on lighter skin tones may not generalize as well when tested on darker skin 157 . Models require diverse training datasets for maximal generalizability (e.g. 95 ). b Gradient Masks project the model’s attention onto the original input image, allowing practitioners to visually confirm regions that most influence predictions. Panel was reproduced from ref. 95 with permission.
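Saliency of the kind shown in Fig. 4 ranks input pixels by how sensitive the model's output is to them. A minimal numerical-gradient version (a toy sketch; real systems backpropagate through the network rather than using finite differences):

```python
import numpy as np

def saliency_map(model, x, eps=1e-4):
    """Numerical input-gradient saliency: |d score / d pixel|.
    High values mark the pixels the prediction depends on most."""
    base = model(x)
    sal = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        xp = x.copy()
        xp[idx] += eps                       # nudge one pixel
        sal[idx] = abs(model(xp) - base) / eps
    return sal

# Toy 'model' that only looks at the central 2x2 patch of a 4x4 image.
model = lambda img: img[1:3, 1:3].sum()
x = np.ones((4, 4))
sal = saliency_map(model, x)
# Saliency is ~1 inside the patch the model attends to, ~0 elsewhere --
# the kind of check used to verify a classifier looks at the lesion,
# not the surrounding skin.
```

Overlaying such a map on the input image is what lets practitioners visually confirm which regions drive a prediction.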
A known limitation of ML is its performance on out-of-distribution data: samples unlike any seen during model training. Progress has been made on out-of-distribution detection 150 and on confidence intervals that help detect anomalies. Additionally, methods are being developed to quantify the uncertainty 151 around model outputs. This is especially critical for patient-specific predictions that impact safety.
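A simple baseline for the out-of-distribution detection mentioned above is the maximum-softmax-probability score: a classifier that is not confident in any class is treated as seeing something unlike its training data (an illustrative sketch with made-up logits):

```python
import numpy as np

def msp_ood_score(logits):
    """1 - max softmax probability: higher values suggest the input
    may be out-of-distribution."""
    z = logits - logits.max()          # stabilize the exponentials
    probs = np.exp(z) / np.exp(z).sum()
    return 1.0 - probs.max()

in_dist = np.array([8.0, 0.5, 0.1])   # confident, familiar-looking input
ood = np.array([1.1, 1.0, 0.9])       # near-uniform logits: model unsure
# The unfamiliar input earns a much higher score and could be routed
# to a human reviewer instead of an automated prediction.
```

Thresholding such a score is one way to keep unsafe, low-confidence predictions out of automated clinical workflows.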
Community participation—from patients, physicians, computer scientists, and other relevant stakeholders—is paramount to successful deployment. It has helped identify structural drivers of racial bias in health diagnostics—particularly in discovering bias in datasets and identifying demographics for which models fail 152 . User-centered evaluations are a valuable tool in ensuring a system’s usability and fit into the real world. What is the best way to present a model’s output to facilitate clinical decision making? How should a mobile app system be deployed in resource-constrained environments, such as areas with intermittent connectivity? For example, when launching ML-powered diabetic retinopathy models in Thailand and India, researchers noticed that model performance was impacted by socioeconomic factors 38 , and determined that where a model is most useful may not be where the model was developed. Ophthalmology models may need to be deployed in endocrinology care, rather than eye centers, due to access issues in the specific local environment. Another effective tool for building physician trust in AI results is side-by-side deployment of ML models with existing workflows (e.g. manual grading 16 ). See Fig. 5 . Without question, AI models will require rigorous evaluation through clinical trials to gauge safety and effectiveness. Excitingly, AI and CV can also help support clinical trials 153 , 154 through a number of applications—including patient selection, tumor tracking, and adverse event detection—creating an ecosystem in which AI can help design safe AI.
![Figure 5](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-020-00376-2/MediaObjects/41746_2020_376_Fig5_HTML.png)
An example workflow showing the positive compounding effect of AI-enhanced workflows, and the resultant trust that can be built. AI predictions provide immediate value to physicians, and improve over time as bigger datasets are collected.
Trust in AI for healthcare is fundamental to its adoption 155 , both by clinical teams and by patients. The foundation of clinical trust will come in large part from rigorous prospective trials that validate AI algorithms in real-world clinical environments. These environments incorporate human and social responses, which can be hard to predict and control, but which AI technologies must account for. Whereas the randomness and human element of clinical environments are impossible to capture in retrospective studies, prospective trials that best reflect clinical practice will shift the conversation towards measurable benefits in real deployments. Here, AI interpretability will be paramount—predictive models will need the ability to describe why specific factors about the patient or environment led them to their predictions.
In addition to clinical trust, patient trust—particularly around privacy concerns—must be earned. One significant area of need is next-generation regulations that account for advances in privacy-preserving techniques. ML typically does not require traditional identifiers to produce useful results, but there are meaningful signals in data that can be considered sensitive. To unlock insights from these sensitive data types, the evolution of privacy-preserving techniques must continue, and further advances need to be made in fields such as federated learning and federated analytics.
Each technological wave affords us a chance to reshape our future. In this case, artificial intelligence, deep learning, and computer vision represent an opportunity to make healthcare far more accessible, equitable, accurate, and inclusive than it has ever been.
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Szeliski, R. Computer Vision: Algorithms and Applications (Springer Science & Business Media, 2010).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 , 436–444 (2015).
Sanders, J. & Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming (Addison-Wesley Professional, 2010).
Deng, J. et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25 , 24–29 (2019).
Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25 , 44–56 (2019).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 , 115–118 (2017).
Yeung, S. et al. A computer vision system for deep learning-based detection of patient mobilization activities in the ICU. NPJ Digit Med. 2 , 11 (2019).
Russakovsky, O. et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 115 , 211–252 (2015).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. in Advances in Neural Information Processing Systems 25 (eds Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q.) 1097–1105 (Curran Associates, Inc., 2012).
Sermanet, P. et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. Preprint at https://arxiv.org/abs/1312.6229 (2013).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at https://arxiv.org/abs/1409.1556 (2014).
Szegedy, C. et al. Going deeper with convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1–9 (2015).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
Gebru, T., Hoffman, J. & Fei-Fei, L. Fine-grained recognition in the wild: a multi-task domain adaptation approach. In 2017 IEEE International Conference on Computer Vision (ICCV) 1358–1367 (IEEE, 2017).
Gulshan, V. et al. Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. https://doi.org/10.1001/jamaophthalmol.2019.2004 (2019).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention 234–241 (Springer, Cham, 2015).
Isensee, F. et al. nnU-Net: self-adapting framework for U-Net-based medical image segmentation. Preprint at https://arxiv.org/abs/1809.10486 (2018).
LeCun, Y. & Bengio, Y. in The Handbook of Brain Theory and Neural Networks 255–258 (MIT Press, 1998).
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V. & Le, Q. V. AutoAugment: learning augmentation policies from data. Preprint at https://arxiv.org/abs/1805.09501 (2018).
Goodfellow, I. et al. Generative adversarial nets. In Advances in Neural Information Processing Systems 2672–2680 (2014).
Ørting, S. et al. A survey of Crowdsourcing in medical image analysis. Preprint at https://arxiv.org/abs/1902.09159 (2019).
Créquit, P., Mansouri, G., Benchoufi, M., Vivot, A. & Ravaud, P. Mapping of Crowdsourcing in health: systematic review. J. Med. Internet Res. 20 , e187 (2018).
Jing, L. & Tian, Y. in IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE, 2020).
McMahan, B., Moore, E., Ramage, D., Hampson, S. & y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics 1273–1282 (PMLR, 2017).
Karpathy, A. & Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3128–3137 (IEEE, 2015).
Lv, D. et al. Research on the technology of LIDAR data processing. In 2017 First International Conference on Electronics Instrumentation Information Systems (EIIS) 1–5 (IEEE, 2017).
Lillo, I., Niebles, J. C. & Soto, A. Sparse composition of body poses and atomic actions for human activity recognition in RGB-D videos. Image Vis. Comput. 59 , 63–75 (2017).
Haque, A. et al. Towards vision-based smart hospitals: a system for tracking and monitoring hand hygiene compliance. In Proceedings of the 2nd Machine Learning for Healthcare Conference , 68 , 75–87 (PMLR, 2017).
Heilbron, F. C., Escorcia, V., Ghanem, B. & Niebles, J. C. ActivityNet: a large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 961–970 (IEEE, 2015).
Liu, Y. et al. Learning to describe scenes with programs. In ICLR (Open Access, 2019).
Singh, A. et al. Automatic detection of hand hygiene using computer vision technology. J. Am. Med. Inform. Assoc. 27 , 1316–1320 (2020).
Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42 , 60–88 (2017).
Maron, O. & Lozano-Pérez, T. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems 10 (eds Jordan, M. I., Kearns, M. J. & Solla, S. A.) 570–576 (MIT Press, 1998).
Singh, S. P. et al. 3D deep learning on medical images: a review. Sensors 20 , https://doi.org/10.3390/s20185097 (2020).
Ouyang, D. et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 580 , 252–256 (2020).
Benjamens, S., Dhunnoo, P. & Meskó, B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. NPJ Digit. Med. 3 , 118 (2020).
Beede, E. et al. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In Proc. 2020 CHI Conference on Human Factors in Computing Systems 1–12 (Association for Computing Machinery, 2020).
Viz.ai Granted Medicare New Technology Add-on Payment. PR Newswire https://www.prnewswire.com/news-releases/vizai-granted-medicare-new-technology-add-on-payment-301123603.html (2020).
Crowson, M. G. et al. A contemporary review of machine learning in otolaryngology-head and neck surgery. Laryngoscope 130 , 45–51 (2020).
Livingstone, D., Talai, A. S., Chau, J. & Forkert, N. D. Building an Otoscopic screening prototype tool using deep learning. J. Otolaryngol. Head. Neck Surg. 48 , 66 (2019).
Chen, P.-H. C. et al. An augmented reality microscope with real-time artificial intelligence integration for cancer diagnosis. Nat. Med. 25 , 1453–1457 (2019).
Gunčar, G. et al. An application of machine learning to haematological diagnosis. Sci. Rep. 8 , 411 (2018).
Article PubMed PubMed Central CAS Google Scholar
Alam, M. M. & Islam, M. T. Machine learning approach of automatic identification and counting of blood cells. Health. Technol. Lett. 6 , 103–108 (2019).
El Hajjar, A. & Rey, J.-F. Artificial intelligence in gastrointestinal endoscopy: general overview. Chin. Med. J. 133 , 326–334 (2020).
Horie, Y. et al. Diagnostic outcomes of esophageal cancer by artificial intelligence using convolutional neural networks. Gastrointest. Endosc. 89 , 25–32 (2019).
Hirasawa, T. et al. Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images. Gastric Cancer 21 , 653–660 (2018).
Kubota, K., Kuroda, J., Yoshida, M., Ohta, K. & Kitajima, M. Medical image analysis: computer-aided diagnosis of gastric cancer invasion on endoscopic images. Surg. Endosc. 26 , 1485–1489 (2012).
Itoh, T., Kawahira, H., Nakashima, H. & Yata, N. Deep learning analyzes Helicobacter pylori infection by upper gastrointestinal endoscopy images. Endosc. Int Open 6 , E139–E144 (2018).
He, J.-Y., Wu, X., Jiang, Y.-G., Peng, Q. & Jain, R. Hookworm detection in wireless capsule endoscopy images with deep learning. IEEE Trans. Image Process. 27 , 2379–2392 (2018).
Park, S.-M. et al. A mountable toilet system for personalized health monitoring via the analysis of excreta. Nat. Biomed. Eng. 4 , 624–635 (2020).
VerMilyea, M. et al. Development of an artificial intelligence-based assessment model for prediction of embryo viability using static images captured by optical light microscopy during IVF. Hum. Reprod. 35 , 770–784 (2020).
Choy, G. et al. Current applications and future impact of machine learning in radiology. Radiology 288 , 318–328 (2018).
Saba, L. et al. The present and future of deep learning in radiology. Eur. J. Radiol. 114 , 14–24 (2019).
Mazurowski, M. A., Buda, M., Saha, A. & Bashir, M. R. Deep learning in radiology: an overview of the concepts and a survey of the state of the art with focus on MRI. J. Magn. Reson. Imaging 49 , 939–954 (2019).
Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6 , 317 (2019).
Irvin, J. et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proc. of the AAAI Conference on Artificial Intelligence Vol. 33, 590–597 (2019).
Wang, X. et al. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervisedclassification and localization of common thorax diseases. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2097–2106 (2017).
Chilamkurthy, S. et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet 392 , 2388–2396 (2018).
Weston, A. D. et al. Automated abdominal segmentation of CT scans for body composition analysis using deep learning. Radiology 290 , 669–679 (2019).
Ding, J., Li, A., Hu, Z. & Wang, L. in Medical Image Computing and Computer Assisted Intervention—MICCAI 2017 559–567 (Springer International Publishing, 2017).
Tan, L. K., Liew, Y. M., Lim, E. & McLaughlin, R. A. Convolutional neural network regression for short-axis left ventricle segmentation in cardiac cine MR sequences. Med. Image Anal. 39 , 78–86 (2017).
Zhang, J. et al. Viral pneumonia screening on chest X-ray images using confidence-aware anomaly detection. Preprint at https://arxiv.org/abs/2003.12338 (2020).
Zhang, X., Feng, C., Wang, A., Yang, L. & Hao, Y. CT super-resolution using multiple dense residual block based GAN. J. VLSI Signal Process. Syst. Signal Image Video Technol. , https://doi.org/10.1007/s11760-020-01790-5 (2020).
Papolos, A., Narula, J., Bavishi, C., Chaudhry, F. A. & Sengupta, P. P. U. S. Hospital use of echocardiography: insights from the nationwide inpatient sample. J. Am. Coll. Cardiol. 67 , 502–511 (2016).
HeartFlowNXT—HeartFlow Analysis of Coronary Blood Flow Using Coronary CT Angiography—Study Results—ClinicalTrials.gov. https://clinicaltrials.gov/ct2/show/results/NCT01757678 .
Madani, A., Arnaout, R., Mofrad, M. & Arnaout, R. Fast and accurate view classification of echocardiograms using deep learning. NPJ Digit. Med. 1 , 6 (2018).
Zhang, J. et al. Fully automated echocardiogram interpretation in clinical practice. Circulation 138 , 1623–1635 (2018).
Ghorbani, A. et al. Deep learning interpretation of echocardiograms. NPJ Digit. Med. 3 , 10 (2020).
Madani, A., Ong, J. R., Tibrewal, A. & Mofrad, M. R. K. Deep echocardiography: data-efficient supervised and semi-supervised deep learning towards automated diagnosis of cardiac disease. NPJ Digit. Med. 1 , 59 (2018).
Perkins, C., Balma, D. & Garcia, R. Members of the Consensus Group & Susan G. Komen for the Cure. Why current breast pathology practices must be evaluated. A Susan G. Komen for the Cure white paper: June 2006. Breast J. 13 , 443–447 (2007).
Brimo, F., Schultz, L. & Epstein, J. I. The value of mandatory second opinion pathology review of prostate needle biopsy interpretation before radical prostatectomy. J. Urol. 184 , 126–130 (2010).
Elmore, J. G. et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA 313 , 1122–1132 (2015).
Evans, A. J. et al. US food and drug administration approval of whole slide imaging for primary diagnosis: a key milestone is reached and new questions are raised. Arch. Pathol. Lab. Med. 142 , 1383–1387 (2018).
Srinidhi, C. L., Ciga, O. & Martel, A. L. Deep neural network models for computational histopathology: A survey. Medical Image Analysis . p. 101813 (2020).
Bera, K., Schalper, K. A., Rimm, D. L., Velcheti, V. & Madabhushi, A. Artificial intelligence in digital pathology - new tools for diagnosis and precision oncology. Nat. Rev. Clin. Oncol. 16 , 703–715 (2019).
Cireşan, D. C., Giusti, A., Gambardella, L. M. & Schmidhuber, J. in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013 411–418 (Springer Berlin Heidelberg, 2013).
Wang, H. et al. Mitosis detection in breast cancer pathology images by combining handcrafted and convolutional neural network features. J. Med Imaging (Bellingham) 1 , 034003 (2014).
Kashif, M. N., Ahmed Raza, S. E., Sirinukunwattana, K., Arif, M. & Rajpoot, N. Handcrafted features with convolutional neural networks for detection of tumor cells in histology images. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI) 1029–1032 (IEEE, 2016).
Wang, D., Khosla, A., Gargeya, R., Irshad, H. & Beck, A. H. Deep learning for identifying metastatic breast cancer. Preprint at https://arxiv.org/abs/1606.05718 (2016).
BenTaieb, A. & Hamarneh, G. in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2016 460–468 (Springer International Publishing, 2016).
Chen, H. et al. DCAN: Deep contour-aware networks for object instance segmentation from histology images. Med. Image Anal. 36 , 135–146 (2017).
Xu, Y. et al. Gland instance segmentation using deep multichannel neural networks. IEEE Trans. Biomed. Eng. 64 , 2901–2912 (2017).
Litjens, G. et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci. Rep. 6 , 26286 (2016).
Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24 , 1559–1567 (2018).
Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25 , 1301–1309 (2019).
Mobadersany, P. et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc. Natl Acad. Sci. U. S. A. 115 , E2970–E2979 (2018).
Courtiol, P. et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat. Med. 25 , 1519–1525 (2019).
Rawat, R. R. et al. Deep learned tissue ‘fingerprints’ classify breast cancers by ER/PR/Her2 status from H&E images. Sci. Rep. 10 , 7275 (2020).
Dietterich, T. G., Lathrop, R. H. & Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89 , 31–71 (1997).
Christiansen, E. M. et al. In silico labeling: predicting fluorescent labels in unlabeled images. Cell 173 , 792–803.e19 (2018).
Esteva, A. & Topol, E. Can skin cancer diagnosis be transformed by AI? Lancet 394 , 1795 (2019).
Haenssle, H. A. et al. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann. Oncol. 29 , 1836–1842 (2018).
Brinker, T. J. et al. Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task. Eur. J. Cancer 113 , 47–54 (2019).
Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26 , 900–908 (2020).
Yap, J., Yolland, W. & Tschandl, P. Multimodal skin lesion classification using deep learning. Exp. Dermatol. 27 , 1261–1267 (2018).
Marchetti, M. A. et al. Results of the 2016 International Skin Imaging Collaboration International Symposium on Biomedical Imaging challenge: Comparison of the accuracy of computer algorithms to dermatologists for the diagnosis of melanoma from dermoscopic images. J. Am. Acad. Dermatol. 78 , 270–277 (2018).
Li, Y. et al. Skin cancer detection and tracking using data synthesis and deep learning. Preprint at https://arxiv.org/abs/1612.01074 (2016).
Ting, D. S. W. et al. Artificial intelligence and deep learning in ophthalmology. Br. J. Ophthalmol. 103 , 167–175 (2019).
Keane, P. A. & Topol, E. J. With an eye to AI and autonomous diagnosis. NPJ Digit. Med. 1 , 40 (2018).
Keane, P. & Topol, E. Reinventing the eye exam. Lancet 394 , 2141 (2019).
De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24 , 1342–1350 (2018).
Kern, C. et al. Implementation of a cloud-based referral platform in ophthalmology: making telemedicine services a reality in eye care. Br. J. Ophthalmol. 104 , 312–317 (2020).
Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 , 2402–2410 (2016).
Raumviboonsuk, P. et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. NPJ Digit Med. 2 , 25 (2019).
Abràmoff, M. D. et al. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest. Ophthalmol. Vis. Sci. 57 , 5200–5206 (2016).
Ting, D. S. W. et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA 318 , 2211–2223 (2017).
Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit. Med. 1 , 39 (2018).
Varadarajan, A. V. et al. Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning. Nat. Commun. 11 , 130 (2020).
Yim, J. et al. Predicting conversion to wet age-related macular degeneration using deep learning. Nat. Med. 26 , 892–899 (2020).
Li, Z. et al. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology 125 , 1199–1206 (2018).
Yousefi, S. et al. Detection of longitudinal visual field progression in glaucoma using machine learning. Am. J. Ophthalmol. 193 , 71–79 (2018).
Brown, J. M. et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 136 , 803–810 (2018).
Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng. 2 , 158–164 (2018).
Mitani, A. et al. Detection of anaemia from retinal fundus images via deep learning. Nat. Biomed. Eng. 4 , 18–27 (2020).
Sabanayagam, C. et al. A deep learning algorithm to detect chronic kidney disease from retinal photographs in community-based populations. Lancet Digital Health 2 , e295–e302 (2020).
Maier-Hein, L. et al. Surgical data science for next-generation interventions. Nat. Biomed. Eng. 1 , 691–696 (2017).
García-Peraza-Herrera, L. C. et al. ToolNet: Holistically-nested real-time segmentation of robotic surgical tools. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 5717–5722 (IEEE, 2017).
Zia, A., Sharma, Y., Bettadapura, V., Sarin, E. L. & Essa, I. Video and accelerometer-based motion analysis for automated surgical skills assessment. Int. J. Comput. Assist. Radiol. Surg. 13 , 443–455 (2018).
Sarikaya, D., Corso, J. J. & Guru, K. A. Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE Trans. Med. Imaging 36 , 1542–1549 (2017).
Jin, A. et al. Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) 691–699 (IEEE, 2018).
Twinanda, A. P. et al. EndoNet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Trans. Med. Imaging 36 , 86–97 (2017).
Lin, H. C., Shafran, I., Yuh, D. & Hager, G. D. Towards automatic skill evaluation: detection and segmentation of robot-assisted surgical motions. Comput. Aided Surg. 11 , 220–230 (2006).
Khalid, S., Goldenberg, M., Grantcharov, T., Taati, B. & Rudzicz, F. Evaluation of deep learning models for identifying surgical actions and measuring performance. JAMA Netw. Open 3 , e201664 (2020).
Vassiliou, M. C. et al. A global assessment tool for evaluation of intraoperative laparoscopic skills. Am. J. Surg. 190 , 107–113 (2005).
Jin, Y. et al. SV-RCNet: Workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans. Med. Imaging 37 , 1114–1126 (2018).
Padoy, N. et al. Statistical modeling and recognition of surgical workflow. Med. Image Anal. 16 , 632–641 (2012).
Azari, D. P. et al. Modeling surgical technical skill using expert assessment for automated computer rating. Ann. Surg. 269 , 574–581 (2019).
Ma, A. J. et al. Measuring patient mobility in the ICU using a novel noninvasive sensor. Crit. Care Med. 45 , 630–636 (2017).
Davoudi, A. et al. Intelligent ICU for autonomous patient monitoring using pervasive sensing and deep learning. Sci. Rep. 9 , 8020 (2019).
Chakraborty, I., Elgammal, A. & Burd, R. S. Video based activity recognition in trauma resuscitation. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) 1–8 (IEEE, 2013).
Twinanda, A. P., Alkan, E. O., Gangi, A., de Mathelin, M. & Padoy, N. Data-driven spatio-temporal RGBD feature encoding for action recognition in operating rooms. Int. J. Comput. Assist. Radiol. Surg. 10 , 737–747 (2015).
Kaplan, R. S. & Porter, M. E. How to solve the cost crisis in health care. Harv. Bus. Rev. 89 , 46–52 (2011). 54, 56–61 passim.
PubMed Google Scholar
Wang, S., Chen, L., Zhou, Z., Sun, X. & Dong, J. Human fall detection in surveillance video based on PCANet. Multimed. Tools Appl. 75 , 11603–11613 (2016).
Núñez-Marcos, A., Azkune, G. & Arganda-Carreras, I. Vision-Based Fall Detection with Convolutional Neural Networks. In Proc. International Wireless Communications and Mobile Computing Conference 2017 (ACM, 2017).
Luo, Z. et al. Computer vision-based descriptive analytics of seniors’ daily activities for long-term health monitoring. In Machine Learning for Healthcare (MLHC) 2 (JMLR, 2018).
Zhang, C. & Tian, Y. RGB-D camera-based daily living activity recognition. J. Comput. Vis. image Process. 2 , 12 (2012).
Pirsiavash, H. & Ramanan, D. Detecting activities of daily living in first-person camera views. In 2012 IEEE Conference on Computer Vision and Pattern Recognition 2847–2854 (IEEE, 2012).
Kishore, P. V. V., Prasad, M. V. D., Kumar, D. A. & Sastry, A. S. C. S. Optical flow hand tracking and active contour hand shape features for continuous sign language recognition with artificial neural networks. In 2016 IEEE 6th International Conference on Advanced Computing (IACC) 346–351 (IEEE, 2016).
Webster, D. & Celik, O. Systematic review of Kinect applications in elderly care and stroke rehabilitation. J. Neuroeng. Rehabil. 11 , 108 (2014).
Chen, W. & McDuff, D. Deepphys: video-based physiological measurement using convolutional attention networks. In Proc. European Conference on Computer Vision (ECCV) 349–365 (Springer Science+Business Media, 2018).
Moazzami, B., Razavi-Khorasani, N., Dooghaie Moghadam, A., Farokhi, E. & Rezaei, N. COVID-19 and telemedicine: Immediate action required for maintaining healthcare providers well-being. J. Clin. Virol. 126 , 104345 (2020).
Gerke, S., Yeung, S. & Cohen, I. G. Ethical and legal aspects of ambient intelligence in hospitals. JAMA https://doi.org/10.1001/jama.2019.21699 (2020).
Young, A. T., Xiong, M., Pfau, J., Keiser, M. J. & Wei, M. L. Artificial intelligence in dermatology: a primer. J. Invest. Dermatol. 140 , 1504–1512 (2020).
Schaekermann, M., Cai, C. J., Huang, A. E. & Sayres, R. Expert discussions improve comprehension of difficult cases in medical image assessment. In Proc. 2020 CHI Conference on Human Factors in Computing Systems 1–13 (Association for Computing Machinery, 2020).
Schaekermann, M. et al. Remote tool-based adjudication for grading diabetic retinopathy. Transl. Vis. Sci. Technol. 8 , 40 (2019).
Caruana, R. Multitask learning. Mach. Learn. 28 , 41–75 (1997).
Wulczyn, E. et al. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS ONE 15 , e0233678 (2020).
Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at https://arxiv.org/abs/1312.6034 (2013).
Ren, J. et al. in Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 14707–14718 (Curran Associates, Inc., 2019).
Dusenberry, M. W. et al. Analyzing the role of model uncertainty for electronic health records. In Proc. ACM Conference on Health, Inference, and Learning 204–213 (Association for Computing Machinery, 2020).
Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366 , 447–453 (2019).
Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI Extension. BMJ 370 , m3164 (2020).
Rivera, S. C. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI Extension. BMJ 370 , m3210 (2020).
Asan, O., Bayrak, A. E. & Choudhury, A. Artificial intelligence and human trust in healthcare: focus on clinicians. J. Med. Internet Res. 22 , e15154 (2020).
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577 , 89–94 (2020).
Kamulegeya, L. H. et al. Using artificial intelligence on dermatology conditions in Uganda: a case for diversity in training data sets for machine learning. https://doi.org/10.1101/826057 (2019).
Acknowledgements
The authors would like to thank Melvin Gruesbeck for the design of the figures, and Elise Kleeman for editorial review.
Author information
These authors contributed equally: Katherine Chou, Serena Yeung, Nikhil Naik, Ali Madani, Ali Mottaghi.
Authors and Affiliations
Salesforce AI Research, San Francisco, CA, USA
Andre Esteva, Nikhil Naik, Ali Madani & Richard Socher
Google Research, Mountain View, CA, USA
Katherine Chou, Yun Liu & Jeff Dean
Stanford University, Stanford, CA, USA
Serena Yeung & Ali Mottaghi
Scripps Research Translational Institute, La Jolla, CA, USA
Contributions
A.E. organized the authors, synthesized the writing, and led the abstract, introduction, computer vision, dermatology, and ophthalmology sections. S.Y. led the medical video section. K.C. led the clinical deployment section. N.N. contributed the pathology section, Ali Madani contributed the cardiology section, Ali Mottaghi contributed to the subsections of the medical video section, and E.T. and J.D. contributed to the clinical deployment section. Y.L. significantly contributed to the figures and writing style. All authors contributed to the overall writing and storyline. E.T., J.D., and R.S. oversaw and advised the work.
Corresponding author
Correspondence to Andre Esteva .
Ethics declarations
Competing interests.
A.E., N.N., Ali Madani, and R.S. are or were employees of Salesforce.com and own Salesforce stock. K.C., Y.L., and J.D. are employees of Google, L.L.C. and own Alphabet stock. S.Y., Ali Mottaghi and E.T. have no competing interests to declare.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .
About this article
Cite this article.
Esteva, A., Chou, K., Yeung, S. et al. Deep learning-enabled medical computer vision. npj Digit. Med. 4, 5 (2021). https://doi.org/10.1038/s41746-020-00376-2
Received : 17 August 2020
Accepted : 01 December 2020
Published : 08 January 2021
DOI : https://doi.org/10.1038/s41746-020-00376-2
A curated list of the top 10 computer vision papers in 2021 with video demos, articles, code, and paper references.
louisfb01/top-10-cv-papers-2021
The Top 10 Computer Vision Papers of 2021

The top 10 computer vision papers in 2021 with video demos, articles, code, and paper references.
While the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. Moreover, many important aspects were highlighted this year, such as ethics, bias, governance, and transparency. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful about which technologies we choose to apply.
"Science cannot tell us what we ought to do, only what we can do." - Jean-Paul Sartre, Being and Nothingness
Here are my top 10 picks of the most interesting research papers of the year in computer vision, in case you missed any of them. In short, it's a curated list of the latest breakthroughs in AI and CV, each with a clear video explanation, a link to a more in-depth article, and code (where applicable). Enjoy the read, and let me know if I missed any important papers in the comments, or by contacting me directly on LinkedIn!
The complete reference to each paper is listed at the end of this repository.
Maintainer: louisfb01
Subscribe to my newsletter - The latest updates in AI explained every week.
Feel free to message me any interesting paper I may have missed to add to this repository.
Tag me on Twitter @Whats_AI or LinkedIn @Louis (What's AI) Bouchard if you share the list!
Watch the 2021 CV rewind
Missed last year? Check this out: 2020: A Year Full of Amazing AI papers- A Review
👀 If you'd like to support my work and use W&B (for free) to track your ML experiments and make your work reproducible or collaborate with a team, you can try it out by following this guide ! Since most of the code here is PyTorch-based, we thought that a QuickStart guide for using W&B on PyTorch would be most interesting to share.
👉Follow this quick guide, use the same W&B lines in your code or any of the repos below, and have all your experiments automatically tracked in your W&B account! It doesn't take more than 5 minutes to set up and will change your life as it did for me! Here's a more advanced guide for using Hyperparameter Sweeps if interested :)
🙌 Thank you to Weights & Biases for sponsoring this repository and the work I've been doing, and thanks to any of you using this link and trying W&B!
If you are interested in AI research, here is another great repository for you:
A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.
2021: A Year Full of Amazing AI papers- A Review
The Full List
- DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
- Taming Transformers for High-Resolution Image Synthesis [2]
- Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows [3]
- Deep Nets: What Have They Ever Done for Vision? [Bonus]
- Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]
- Total Relighting: Learning to Relight Portraits for Background Replacement [5]
- Animating Pictures with Eulerian Motion Fields [6]
- CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [7]
- TimeLens: Event-based Video Frame Interpolation [8]
- (Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]
- CityNeRF: Building NeRF at City Scale [10]
Paper references
OpenAI successfully trained a network able to generate images from text captions. It is very similar to GPT-3 and Image GPT and produces amazing results.
- Short read: OpenAI’s DALL·E: Text-to-Image Generation Explained
- Paper: Zero-Shot Text-to-Image Generation
- Code: Code & more information for the discrete VAE used for DALL·E
TL;DR: They combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically-guided high-quality image synthesis.
- Short read: Combining the Transformers Expressivity with the CNNs Efficiency for High-Resolution Image Synthesis
- Paper: Taming Transformers for High-Resolution Image Synthesis
- Code: Taming Transformers
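At the heart of this approach is a vector-quantization step: the encoder's continuous latents are snapped to their nearest entries in a learned codebook, so a transformer can later model the image as a sequence of discrete codes. Here's a toy NumPy sketch of that lookup (the codebook values are made up for illustration, not learned):

```python
import numpy as np

def quantize(z, codebook):
    """Map each continuous latent vector to its nearest codebook entry (VQ step)."""
    # z: (N, D) latents, codebook: (K, D) embeddings -> (N, K) squared distances
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)          # index of the nearest code for each latent
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy 2-entry codebook
z = np.array([[0.1, -0.2], [0.9, 1.2]])          # two continuous latents
zq, idx = quantize(z, codebook)
print(idx)   # [0 1]
```

The transformer then only ever sees `idx`, a short sequence of integers, which is what makes high-resolution synthesis tractable.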
Will Transformers Replace CNNs in Computer Vision? In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.
- Short read: Will Transformers Replace CNNs in Computer Vision?
- Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Code: Click here for the code
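The key idea in Swin is to restrict self-attention to small non-overlapping windows, then cyclically shift the feature map so the next layer's windows straddle the previous borders. A minimal NumPy sketch of the partitioning and shifting (shapes are illustrative; the real model works on batched tensors and adds attention masks for the shifted windows):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    # -> (num_windows, ws, ws, C): attention runs independently inside each window
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, C)

def cyclic_shift(x, ws):
    """Roll the map by ws//2 so the next round of windows crosses old borders."""
    return np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))

feat = np.random.rand(8, 8, 32)                       # toy 8x8 map, 32 channels
wins = window_partition(feat, 4)                      # 4 windows of 4x4
shifted = window_partition(cyclic_shift(feat, 4), 4)  # shifted-window variant
print(wins.shape)   # (4, 4, 4, 32)
```

Alternating plain and shifted windows gives cross-window information flow at linear cost in image size, which is why Swin scales where full global attention does not.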
"I will openly share everything about deep nets for vision applications, their successes, and the limitations we have to address."
- Short read: What is the state of AI in computer vision?
- Paper: Deep nets: What have they ever done for vision?
Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]
The next step for view synthesis: Perpetual View Generation, where the goal is to take an image, fly into it, and explore the landscape!
- Short read: Infinite Nature: Fly into an image and explore the landscape
- Paper: Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image
Properly relight any portrait based on the lighting of the new background you add. Have you ever wanted to change the background of a picture while keeping it realistic? If you've already tried, you know it isn't simple. You can't just take a picture of yourself at home and swap the background for a beach; it looks bad and unrealistic, and anyone will say "that's photoshopped" in a second. For movies and professional videos, you need perfect lighting and artists to reproduce a high-quality image, and that's super expensive. There's no way you can do that with your own pictures. Or can you?
- Short read: Realistic Lighting on Different Backgrounds
- Paper: Total Relighting: Learning to Relight Portraits for Background Replacement
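Whatever the relighting network predicts, the final step is classic alpha-matte compositing of the relit foreground over the new background. A minimal NumPy sketch of that compositing equation (the matte and colors below are toy values, not model outputs):

```python
import numpy as np

def composite(foreground, background, alpha):
    """Standard alpha-matte compositing: out = a * F + (1 - a) * B."""
    if alpha.ndim == 2:
        alpha = alpha[..., None]    # broadcast the matte across color channels
    return alpha * foreground + (1.0 - alpha) * background

fg = np.full((4, 4, 3), 0.9)        # bright relit subject
bg = np.zeros((4, 4, 3))            # dark new background
a = np.zeros((4, 4))
a[1:3, 1:3] = 1.0                   # matte: subject occupies the center
out = composite(fg, bg, a)
print(out[2, 2], out[0, 0])         # [0.9 0.9 0.9] [0. 0. 0.]
```

The hard part the paper solves is producing a clean matte and a foreground relit to match the target background; the blend itself is this one line.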
If you’d like to read more research papers as well, I recommend you read my article where I share my best tips for finding and reading more research papers.
Animating Pictures with Eulerian Motion Fields [6]
This model takes a picture, understands which particles are supposed to be moving, and realistically animates them in an infinite loop while keeping the rest of the picture entirely still, creating amazing-looking videos like this one...
- Short read: Create Realistic Animated Looping Videos from Pictures
- Paper: Animating Pictures with Eulerian Motion Fields
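The core "Eulerian" trick is that a single static flow field assigns every pixel a velocity, and a particle is animated by repeatedly sampling that field at its current position. Here is a toy nearest-neighbour sketch of that advection loop, purely illustrative and not the paper's implementation:

```python
import numpy as np

def advect(pos, flow, steps):
    """Move a particle through a static (Eulerian) flow field.

    pos:  (row, col) starting position
    flow: (H, W, 2) array; flow[i, j] is the velocity at pixel (i, j)
    """
    h, w = flow.shape[:2]
    p = np.asarray(pos, dtype=float)
    for _ in range(steps):
        # sample the SAME static field at every step (nearest neighbour)
        i = int(np.clip(np.rint(p[0]), 0, h - 1))
        j = int(np.clip(np.rint(p[1]), 0, w - 1))
        p = p + flow[i, j]
    return p
```

Because the field never changes, the motion can be replayed indefinitely, which is what makes the looping videos possible.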
CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [7]
Using a modified GAN architecture, they can move objects in the image without affecting the background or the other objects!
- Short read: CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation
- Paper: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
TimeLens can understand the movement of the particles in between the frames of a video to reconstruct what really happened at a speed even our eyes cannot see. In fact, it achieves results that neither our smartphones nor any previous model could reach!
- Short read: How to Make Slow Motion Videos With AI!
- Paper: TimeLens: Event-based Video Frame Interpolation
Subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!
(Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]
Have you ever dreamed of taking the style of a picture, like this cool TikTok drawing style on the left, and applying it to a new picture of your choice? Well, I did, and it has never been easier to do. In fact, you can even achieve that from only text, and you can try it right now with this new method and its Google Colab notebook, available for everyone (see references). Simply take a picture of the style you want to copy, enter the text you want to generate, and this algorithm will generate a new picture out of it! Just look back at the results above; such a big step forward! The results are extremely impressive, especially considering that they were made from a single line of text!
- Short read: Text-to-Drawing Synthesis With Artistic Control | CLIPDraw & StyleCLIPDraw
- Paper (CLIPDraw): CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
- Paper (StyleCLIPDraw): StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis
- CLIPDraw Colab demo
- StyleCLIPDraw Colab demo
CityNeRF: Building NeRF at City Scale [10]
The model is called CityNeRF and builds on NeRF, which I previously covered on my channel. NeRF is one of the first models to use radiance fields and machine learning to construct 3D models out of images. But NeRF is not very efficient and works at a single scale. Here, CityNeRF is applied to satellite and ground-level images at the same time to produce various 3D model scales for any viewpoint. In simple words, they bring NeRF to city scale. But how?
- Short read: CityNeRF: 3D Modelling at City Scale!
- Paper: CityNeRF: Building NeRF at City Scale
- Click here for the code (will be released soon)
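For context, a NeRF-style model renders each pixel by compositing density and color samples along a camera ray. Below is a minimal sketch of that standard volume-rendering step (simplified textbook NeRF math; CityNeRF's multi-scale machinery is not shown):

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Alpha-composite samples along one camera ray, NeRF-style.

    sigmas: (S,) volume densities at the sampled depths
    colors: (S, 3) RGB colors at the sampled depths
    deltas: (S,) distances between consecutive samples
    """
    alpha = 1.0 - np.exp(-sigmas * deltas)  # opacity of each sample
    # transmittance: fraction of light surviving to reach each sample
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = alpha * trans
    return weights @ colors                 # final pixel color
```

A fully opaque sample near the camera blocks everything behind it, while empty space contributes nothing, which is how the learned density field turns into geometry.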
If you would like to read more papers and have a broader view, here is another great repository for you covering 2020: 2020: A Year Full of Amazing AI Papers – A Review.
[1] Ramesh, A., et al., “Zero-Shot Text-to-Image Generation”, 2021, arXiv:2102.12092.
[2] Esser et al., “Taming Transformers for High-Resolution Image Synthesis”, 2020.
[3] Liu, Z., et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, 2021, arXiv:2103.14030.
[bonus] Yuille, A.L. and Liu, C., “Deep Nets: What Have They Ever Done for Vision?”, International Journal of Computer Vision, 129(3), pp. 781–802, 2021, arXiv:1805.04025.
[4] Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., “Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image”, 2020, https://arxiv.org/pdf/2012.09855.pdf.
[5] Pandey et al., “Total Relighting: Learning to Relight Portraits for Background Replacement”, 2021, doi:10.1145/3450626.3459872, https://augmentedperception.github.io/total_relighting/total_relighting_paper.pdf.
[6] Holynski, A., et al., “Animating Pictures with Eulerian Motion Fields”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[7] Niemeyer, M. and Geiger, A., “GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields”, CVPR, 2021.
[8] Tulyakov, S.*, Gehrig, D.*, Georgoulis, S., Erbach, J., Gehrig, M., Li, Y. and Scaramuzza, D., “TimeLens: Event-based Video Frame Interpolation”, CVPR, Nashville, 2021, http://rpg.ifi.uzh.ch/docs/CVPR21_Gehrig.pdf.
[9] a) “CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”; b) Schaldenbrand, P., Liu, Z. and Oh, J., “StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis”, 2021.
[10] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B. and Lin, D., “CityNeRF: Building NeRF at City Scale”, 2021.
10 Cutting Edge Research Papers In Computer Vision & Image Generation
January 24, 2019 by Mariya Yao
![Computer Vision Research Papers](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/topbots_computer_vision_research_1600px_web.jpg)
UPDATE: We’ve also summarized the top 2019 and top 2020 Computer Vision research papers.
Ever since convolutional neural networks began outperforming humans in specific image recognition tasks, research in the field of computer vision has proceeded at breakneck pace.
The basic architecture of CNNs (or ConvNets) was developed in the 1980s. Yann LeCun improved upon the original design in 1989 by using backpropagation to train models to recognize handwritten digits.
We’ve come a long way since then.
In 2018, we saw novel architecture designs that improve upon performance benchmarks and also expand the range of media that machine learning models can analyze. We also saw a number of breakthroughs with media generation which enable photorealistic style transfer, high-resolution image generation, and video-to-video synthesis.
Due to the importance and prevalence of computer vision and image generation for applied and enterprise AI, we did feature some of the papers below in our previous article summarizing the top overall machine learning papers of 2018. Since you might not have read that previous piece, we chose to highlight the vision-related research again here.
We’ve done our best to summarize these papers correctly, but if we’ve made any mistakes, please contact us to request a fix. Special thanks also go to computer vision specialist Rebecca BurWei for generously offering her expertise in editing and revising drafts of this article.
If these summaries of scientific AI research papers are useful for you, you can subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries. We’re planning to release summaries of important papers in computer vision, reinforcement learning, and conversational AI in the next few weeks.
If you’d like to skip around, here are the papers we featured:
- Spherical CNNs
- Adversarial Examples that Fool both Computer Vision and Time-Limited Humans
- A Closed-form Solution to Photorealistic Image Stylization
- Group Normalization
- Taskonomy: Disentangling Task Transfer Learning
- Self-Attention Generative Adversarial Networks
- GANimation: Anatomically-aware Facial Animation from a Single Image
- Video-to-Video Synthesis
- Everybody Dance Now
- Large Scale GAN Training for High Fidelity Natural Image Synthesis
Important Computer Vision Research Papers of 2018
1. Spherical CNNs, by Taco S. Cohen, Mario Geiger, Jonas Koehler, and Max Welling
Original Abstract
Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. However, a number of problems of recent interest have created a demand for models that can analyze spherical images. Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective.
In this paper we introduce the building blocks for constructing spherical CNNs. We propose a definition for the spherical cross-correlation that is both expressive and rotation-equivariant. The spherical correlation satisfies a generalized Fourier theorem, which allows us to compute it efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs applied to 3D model recognition and atomization energy regression.
Our Summary
Omnidirectional cameras that are already used by cars, drones, and other robots capture a spherical image of their entire surroundings. We could analyze such spherical signals by projecting them to the plane and using CNNs. However, any planar projection of a spherical signal results in distortions. To overcome this problem, the group of researchers from the University of Amsterdam introduces the theory of spherical CNNs, the networks that can analyze spherical images without being fooled by distortions. The approach demonstrates its effectiveness for classifying 3D shapes and Spherical MNIST images as well as for molecular energy regression, an important problem in computational chemistry.
What’s the core idea of this paper?
- Planar projections of spherical signals result in significant distortions as some areas look larger or smaller than they really are.
- Traditional CNNs are ineffective for spherical images because as objects move around the sphere, they also appear to shrink and stretch (think maps where Greenland looks much bigger than it actually is).
- The solution is to use a spherical CNN which is robust to spherical rotations in the input data. By preserving the original shape of the input data, spherical CNNs treat all objects on the sphere equally without distortion.
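The distortion argument is easy to quantify: in an equirectangular (latitude-longitude) projection, the true surface area covered by a pixel shrinks with the cosine of the latitude, so the same convolution filter sees wildly different amounts of the sphere at the equator versus near the poles. A quick illustrative computation (not from the paper's code):

```python
import numpy as np

def row_area_weights(height):
    """Fraction of the sphere's surface covered by each pixel row of an
    equirectangular image of the given height."""
    # latitude at the center of each row, running from -pi/2 to pi/2
    lat = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2
    w = np.cos(lat)  # area per row is proportional to cos(latitude)
    return w / w.sum()

w = row_area_weights(180)  # one row per degree of latitude
```

An equator row covers over a hundred times the surface area of the row nearest the pole, which is exactly why translational weight sharing fails on such projections and a rotation-equivariant spherical correlation is needed instead.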
What’s the key achievement?
- Introducing a mathematical framework for building spherical CNNs.
- Providing easy-to-use, fast, and memory-efficient PyTorch code implementing these CNNs.
- Demonstrating the effectiveness of spherical CNNs for:
  - classification of Spherical MNIST images,
  - classification of 3D shapes,
  - molecular energy regression.
What does the AI community think?
- The paper won the Best Paper Award at ICLR 2018, one of the leading machine learning conferences.
What are future research areas?
- Development of a Steerable CNN for the sphere to analyze sections of vector bundles over the sphere (e.g., wind directions).
- Expanding the mathematical theory from 2D spheres to 3D point clouds for classification tasks that are invariant under reflections as well as rotations.
What are possible business applications?
- omnidirectional vision for drones, robots, and autonomous cars;
- molecular regression problems in computational chemistry;
- global weather and climate modeling.
Where can you get implementation code?
- The authors provide the original implementation for this research paper on GitHub.
2. Adversarial Examples that Fool both Computer Vision and Time-Limited Humans , by Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, Jascha Sohl-Dickstein
Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.
Google Brain researchers seek an answer to the question: can adversarial examples that are not model-specific, and that fool different computer vision models without access to their parameters and architectures, also fool time-limited humans? They leverage key ideas from machine learning, neuroscience, and psychophysics to create adversarial examples that do in fact impact human perception in a time-limited setting. Thus, the paper introduces a new class of illusions that are shared between machines and humans.
![Figure from the paper](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/02_Adversarial_2_web.jpg)
- As the first step, the researchers use black-box adversarial example construction techniques that create adversarial examples without access to the model’s architecture or parameters.
- They then approximate the initial processing of the human visual system by:
  - prepending each model with a retinal layer that pre-processes the input to incorporate some of the transformations performed by the human eye;
  - performing an eccentricity-dependent blurring of the image to approximate the input received by the visual cortex of human subjects through their retinal lattice.
- Classification decisions of humans are evaluated in a time-limited setting to detect even subtle effects in human perception.
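For intuition about how an adversarial perturbation is constructed at all, here is the classic white-box fast gradient sign method on a toy logistic classifier. The paper uses black-box transfer techniques rather than direct gradients of the target model, but the underlying idea of a small, loss-increasing perturbation is the same:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, w, b, y, eps):
    """Perturb input x to increase a logistic classifier's loss.

    Each coordinate moves by eps in the direction of the sign of the
    cross-entropy loss gradient (the 'fast gradient sign method').
    """
    p = sigmoid(w @ x + b)
    grad = (p - y) * w            # d(cross-entropy)/dx
    return x + eps * np.sign(grad)
```

Even a tiny eps reliably pushes the classifier's confidence in the true label down, which is the property that transfers so well across models.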
- Showing that adversarial examples that transfer across computer vision models do also successfully influence the perception of humans.
- Demonstrating the similarity between convolutional neural networks and the human visual system.
- The paper is widely discussed by the AI community. While most researchers are stunned by the results, some argue that we need a stricter definition of an adversarial image, because if humans classify the perturbed picture of a cat as a dog, then it’s probably already a dog, not a cat.
- Researching which techniques are crucial for the transfer of adversarial examples to humans (i.e., retinal preprocessing, model ensembling).
- Practitioners should consider the risk that imagery could be manipulated to cause human observers to have unusual reactions, because adversarial images can affect us below the horizon of awareness.
3. A Closed-form Solution to Photorealistic Image Stylization , by Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, Jan Kautz
Photorealistic image stylization concerns transferring style of a reference photo to a content photo with the constraint that the stylized photo should remain photorealistic. While several photorealistic image stylization methods exist, they tend to generate spatially inconsistent stylizations with noticeable artifacts. In this paper, we propose a method to address these issues. The proposed method consists of a stylization step and a smoothing step. While the stylization step transfers the style of the reference photo to the content photo, the smoothing step ensures spatially consistent stylizations. Each of the steps has a closed-form solution and can be computed efficiently. We conduct extensive experimental validations. The results show that the proposed method generates photorealistic stylization outputs that are more preferred by human subjects as compared to those by the competing methods while running much faster. Source code and additional results are available at https://github.com/NVIDIA/FastPhotoStyle .
The team of scientists at NVIDIA and the University of California, Merced proposes a new solution to photorealistic image stylization, FastPhotoStyle. The method consists of two steps: stylization and smoothing. Extensive experiments show that the suggested approach generates more realistic and compelling images than the previous state of the art. Moreover, thanks to the closed-form solution, FastPhotoStyle can produce a stylized image 49 times faster than traditional methods.
![Figure from the paper](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/03_Stylization_web.jpg)
- The goal of photorealistic image stylization is to transfer style of a reference photo to a content photo while keeping the stylized image photorealistic.
- The stylization step is based on the whitening and coloring transform (WCT), which processes images via feature projections. However, WCT was developed for artistic image stylizations, and thus, often generates structural artifacts for photorealistic image stylization. To overcome this problem, the paper introduces PhotoWCT method, which replaces the upsampling layers in the WCT with unpooling layers, and so, preserves more spatial information.
- The smoothing step is required to solve spatially inconsistent stylizations that could arise after the first step. Smoothing is based on a manifold ranking algorithm.
- Both steps have a closed-form solution, which means that the solution can be obtained in a fixed number of operations (i.e., convolutions, max-pooling, whitening, etc.). Thus, computations are much more efficient compared to the traditional methods.
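The whitening half of the WCT that the stylization step builds on can be written in a few lines: content features are decorrelated so that the style's covariance can then be imposed on them. A simplified sketch (the actual PhotoWCT operates on VGG features inside an encoder-decoder, not on raw arrays like this):

```python
import numpy as np

def whiten(f, eps=1e-8):
    """Whiten a (channels, pixels) feature map so its channels are
    zero-mean and uncorrelated (identity covariance)."""
    f = f - f.mean(axis=1, keepdims=True)
    cov = f @ f.T / (f.shape[1] - 1)
    vals, vecs = np.linalg.eigh(cov)
    # multiply by the inverse square root of the covariance matrix
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T
    return inv_sqrt @ f
```

The coloring step is the mirror image: multiply the whitened content features by the square root of the style features' covariance, transferring the style's second-order statistics.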
- The experiments demonstrate that FastPhotoStyle:
  - outperforms artistic stylization algorithms by rendering far fewer structural artifacts and inconsistent stylizations, and
  - outperforms photorealistic stylization algorithms by synthesizing not only the colors but also the patterns of the style photos.
- The experiments demonstrate that users prefer FastPhotoStyle results over the previous state-of-the-art in terms of both stylization effects (63.1%) and photorealism (73.5%).
- FastPhotoStyle can synthesize an image of 1024 x 512 resolution in only 13 seconds, while the previous state-of-the-art method needs 650 seconds for the same task.
- The paper was presented at ECCV 2018, leading European Conference on Computer Vision.
- Finding the way to transfer small patterns from the style photo as they are smoothed away by the suggested method.
- Exploring the possibilities to further reduce the number of structural artifacts in the stylized photos.
- Content creators in the business settings can largely benefit from photorealistic image stylization as the tool basically allows you to automatically change the style of any photo based on what fits the narrative.
- The photographers also discuss the tremendous impact that this technology can have in real estate photography.
- NVIDIA team provides the original implementation for this research paper on GitHub .
4. Group Normalization , by Yuxin Wu and Kaiming He
Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems – BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN’s usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
The Facebook AI Research team suggests Group Normalization (GN) as an alternative to Batch Normalization (BN). They argue that BN’s error increases dramatically for small batch sizes, which limits its usage when working with large models to solve computer vision tasks that require small batches due to memory constraints. In contrast, Group Normalization is independent of batch size, as it divides the channels into groups and computes the mean and variance for normalization within each group. The experiments confirm that GN outperforms BN in a variety of tasks, including object detection, segmentation, and video classification.
![Figure from the paper](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/04_Group_norm_web.png)
- Group Normalization is a simple alternative to Batch Normalization, especially in the scenarios where batch size tends to be small, for example, computer vision tasks, requiring high-resolution input.
- GN explores only the layer dimensions, and thus, its computation is independent of batch size. Specifically, GN divides channels, or feature maps, into groups and normalizes the features within each group.
- Group Normalization can be easily implemented by a few lines of code in PyTorch and TensorFlow.
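The computation is simple enough to state directly. A minimal NumPy sketch of the normalization itself (omitting GN's learned per-channel scale and shift parameters):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group Normalization for an (N, C, H, W) batch.

    Channels are split into num_groups groups; mean and variance are
    computed per sample, per group -- never across the batch dimension.
    """
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
```

Because nothing is averaged over the batch axis, the result for a batch of one is identical to the per-sample result in a batch of thirty-two, which is the whole point.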
- Introducing Group Normalization, a new, effective normalization method.
- GN’s accuracy is stable in a wide range of batch sizes as its computation is independent of batch size. For example, GN demonstrated a 10.6% lower error rate than its BN-based counterpart for ResNet-50 in ImageNet with a batch size of 2.
- GN can also be transferred to fine-tuning. The experiments show that GN can outperform BN-based counterparts for object detection and segmentation on the COCO dataset and for video classification on the Kinetics dataset.
- The paper received an honorable mention at ECCV 2018, leading European Conference on Computer Vision.
- It is also the second most popular paper of 2018 based on users’ libraries at Arxiv Sanity Preserver.
- Applying group normalization to sequential or generative models.
- Investigating GN’s performance on learning representations for reinforcement learning.
- Exploring if GN combined with a suitable regularizer will improve results.
- Business applications that rely on BN-based models for object detection, segmentation, video classification and other computer vision tasks that require high-resolution input may benefit from moving to GN-based models as they are more accurate in these settings.
- Facebook AI research team provides Mask R-CNN baseline results and models trained with Group Normalization .
- PyTorch implementation of group normalization is also available on GitHub.
5. Taskonomy: Disentangling Task Transfer Learning , by Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese
Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying existence of a structure among visual tasks. Knowing this structure has notable values; it is the concept underlying transfer learning and provides a principled way for identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity.
We propose a fully computational approach for modeling the structure of space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure including a solver that users can employ to devise efficient supervision policies for their use cases.
Assertions of the existence of a structure among visual tasks have been made by many researchers since the early years of modern computer science. And now Amir Zamir and his team make an attempt to actually find this structure. They model it using a fully computational approach and discover lots of useful relationships between different visual tasks, including the nontrivial ones. They also show that by taking advantage of these interdependencies, it is possible to achieve the same model performance with the labeled data requirements reduced by roughly ⅔.
![Figure from the paper](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/05_taskonomy_web.jpg)
- A model aware of the relationships among different visual tasks demands less supervision, uses less computation, and behaves in more predictable ways.
- A fully computational approach to discovering the relationships between visual tasks is preferable because it avoids imposing prior, and possibly incorrect, assumptions: the priors are derived from either human intuition or analytical knowledge, while neural networks might operate on different principles.
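In miniature, the computational approach amounts to measuring how well each task transfers to each other task and reading relationships off the resulting affinity matrix. A toy sketch with three tasks and invented scores (the paper's actual solver further normalizes these affinities and selects source tasks via a budgeted optimization):

```python
import numpy as np

# affinity[i, j] = measured transfer performance from source task i to
# target task j (numbers here are made up purely for illustration)
affinity = np.array([
    # depth  normals  edges    <- targets
    [0.0,    0.8,     0.3],    # transfer from depth
    [0.7,    0.0,     0.4],    # transfer from surface normals
    [0.2,    0.3,     0.0],    # transfer from edges
])
tasks = ["depth", "normals", "edges"]

# for each target task, pick the source whose features transfer best
best_source = [tasks[i] for i in affinity.argmax(axis=0)]
```

Even this toy version shows the payoff: a target task can inherit supervision from its best-related source instead of requiring its own large labeled dataset.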
- Identifying relationships between 26 common visual tasks.
- Showing how this structure helps in discovering types of transfer learning that will be most effective for each visual task.
- Creating a new dataset of 4 million images of indoor scenes including 600 buildings annotated with 26 tasks.
- The paper won the Best Paper Award at CVPR 2018, the key conference on computer vision and pattern recognition.
- The results are very important, as large-scale labeled datasets are not available for most real-world tasks.
- To move from a model where common visual tasks are entirely defined by humans and try an approach where human-defined visual tasks are viewed as observed samples which are composed of computationally found latent subtasks.
- Exploring the possibility to transfer the findings to not entirely visual tasks, e.g. robotic manipulation.
- Relationships discovered in this paper can be used to build more effective visual systems that will require less labeled data and lower computational costs.
6. Self-Attention Generative Adversarial Networks , by Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena
In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves the state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.
Traditional convolutional GANs demonstrated some very promising results with respect to image synthesis. However, they have at least one important weakness: convolutional layers alone fail to capture geometrical and structural patterns in the images. Since convolution is a local operation, it is hardly possible for an output at the top-left position to have any relation to the output at the bottom-right. The paper introduces a simple solution to this problem: incorporating the self-attention mechanism into the GAN framework. This solution, combined with several stabilization techniques, helps the Self-Attention Generative Adversarial Networks (SAGANs) achieve state-of-the-art results in image synthesis.
![Figure from the paper](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/06_SAGAN_2_web.jpg)
- Convolutional layers alone are computationally inefficient for modeling long-range dependencies in images. On the contrary, a self-attention mechanism incorporated into the GAN framework will enable both the generator and the discriminator to efficiently model relationships between widely separated spatial regions.
- The self-attention module calculates response at a position as a weighted sum of the features at all positions.
- Applying spectral normalization for both generator and discriminator – the researchers argue that not only the discriminator but also the generator can benefit from spectral normalization, as it can prevent the escalation of parameter magnitudes and avoid unusual gradients.
- Using separate learning rates for the generator and the discriminator to compensate for the problem of slow learning in a regularized discriminator and make it possible to use fewer generator steps per discriminator step.
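The weighted-sum formulation of self-attention can be sketched in a few lines for a single head (illustrative only; the paper's module computes its projections with 1x1 convolutions and adds the attention output back through a learned residual connection):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (positions, channels). Each output position is a weighted sum
    of projected features at ALL positions, so distant regions of the
    image can influence each other directly."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T                             # (positions, positions)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax: rows sum to 1
    return attn @ v
```

Unlike a convolution, whose receptive field is a fixed local window, every position here attends over the entire feature map in a single step.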
- Showing that a self-attention module incorporated into the GAN framework is, in fact, effective in modeling long-range dependencies.
- Demonstrating that:
  - spectral normalization applied to the generator stabilizes GAN training;
  - utilizing imbalanced learning rates speeds up the training of regularized discriminators.
- Achieving state-of-the-art results in image synthesis by boosting the Inception Score from 36.8 to 52.52 and reducing Fréchet Inception Distance from 27.62 to 18.65.
- “The idea is simple and intuitive yet very effective, plus easy to implement.” – Sebastian Raschka, assistant professor of Statistics at the University of Wisconsin-Madison.
- Exploring the possibilities to reduce the number of weird samples generated by GANs.
- Image synthesis with GANs can replace expensive manual media creation for advertising and e-commerce purposes.
- PyTorch and TensorFlow implementations of Self-Attention GANs are available on GitHub.
7. GANimation: Anatomically-aware Facial Animation from a Single Image , by Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, Francesc Moreno-Noguer
Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for task of facial expression synthesis. The most successful architecture is StarGAN, that conditions GANs generation process with images of a specific domain, namely a set of images of persons sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content of the dataset. To address this limitation, in this paper, we introduce a novel GAN conditioning scheme based on Action Units (AU) annotations, which describes in a continuous manifold the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combine several of them. Additionally, we propose a fully unsupervised strategy to train the model, that only requires images annotated with their activated AUs, and exploit attention mechanisms that make our network robust to changing backgrounds and lighting conditions. Extensive evaluation show that our approach goes beyond competing conditional generators both in the capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements, as in the capacity of dealing with images in the wild.
The paper introduces a novel GAN model that is able to generate anatomically-aware facial animations from a single image under changing backgrounds and illumination conditions. It advances current works, which had only addressed the problem for discrete emotions category editing and portrait images. The approach renders a wide range of emotions by encoding facial deformations as Action Units. The resulting animations demonstrate a remarkably smooth and consistent transformation across frames even with challenging light conditions and backgrounds.
![GANimation: anatomically-aware facial animation results](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/07_GANimation_web.jpg)
- Facial expressions can be described in terms of Action Units (AUs), which anatomically describe the contractions of specific facial muscles. For example, the facial expression for ‘fear’ is generally produced with the following activations: Inner Brow Raiser (AU1), Outer Brow Raiser (AU2), Brow Lowerer (AU4), Upper Lid Raiser (AU5), Lid Tightener (AU7), Lip Stretcher (AU20) and Jaw Drop (AU26). The magnitude of each AU defines the extent of emotion.
- A model for synthetic facial animation is based on the GAN architecture, which is conditioned on a one-dimensional vector indicating the presence/absence and the magnitude of each Action Unit.
- To circumvent the need for pairs of training images of the same person under different expressions, a bidirectional generator is used to both transform an image into a desired expression and transform the synthesized image back into the original pose.
- To handle images under changing backgrounds and illumination conditions, the model includes an attention layer that focuses the action of the network only in those regions of the image that are relevant to convey the novel expression.
- Introducing a novel GAN model for face animation in the wild that can be trained in a fully unsupervised manner and generate visually compelling images with remarkably smooth and consistent transformation across frames even with challenging light conditions and non-real world data.
- Demonstrating how a wider range of emotions can be generated by interpolating between emotions the GAN has already seen.
- Applying the introduced approach to video sequences.
- The technology that automatically animates the facial expression from a single image can be applied in several areas including the fashion and e-commerce business, the movie industry, photography technologies.
- The authors provide the original implementation of this research paper on GitHub.
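The AU-conditioning scheme above lends itself to a short sketch. Below, an expression is encoded as a one-dimensional vector of AU magnitudes that could condition a generator; the AU subset and the activation values for ‘fear’ are illustrative assumptions, not the paper's exact encoding:

```python
import numpy as np

# Hypothetical subset of Action Units, matching the 'fear' example above.
AU_NAMES = ["AU1", "AU2", "AU4", "AU5", "AU7", "AU20", "AU26"]

def au_vector(activations: dict) -> np.ndarray:
    """Encode an expression as a vector of AU magnitudes in [0, 1]."""
    vec = np.zeros(len(AU_NAMES))
    for name, magnitude in activations.items():
        vec[AU_NAMES.index(name)] = np.clip(magnitude, 0.0, 1.0)
    return vec

# 'Fear' at moderate intensity: each relevant AU partially activated.
fear = au_vector({"AU1": 0.6, "AU2": 0.5, "AU4": 0.7,
                  "AU5": 0.8, "AU7": 0.4, "AU20": 0.3, "AU26": 0.5})

# Because the conditioning lives in a continuous space, scaling or
# interpolating the vector yields intermediate expression intensities.
half_fear = 0.5 * fear
```

This continuity is what lets the model render smooth transitions between expressions rather than jumping between the discrete categories a dataset happens to contain.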
8. Video-to-Video Synthesis, by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro
We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video. While its image counterpart, the image-to-image synthesis problem, is a popular topic, the video-to-video synthesis problem is less explored in the literature. Without understanding temporal dynamics, directly applying existing image synthesis approaches to an input video often results in temporally incoherent videos of low visual quality. In this paper, we propose a novel video-to-video synthesis approach under the generative adversarial learning framework. Through carefully-designed generator and discriminator architectures, coupled with a spatio-temporal adversarial objective, we achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses. Experiments on multiple benchmarks show the advantage of our method compared to strong baselines. In particular, our model is capable of synthesizing 2K resolution videos of street scenes up to 30 seconds long, which significantly advances the state-of-the-art of video synthesis. Finally, we apply our approach to future video prediction, outperforming several state-of-the-art competing systems.
Researchers from NVIDIA have introduced a novel video-to-video synthesis approach. The framework is based on conditional GANs: it couples a carefully designed generator and discriminator with a spatio-temporal adversarial objective. The experiments demonstrate that the suggested vid2vid approach can synthesize high-resolution, photorealistic, temporally coherent videos on a diverse set of input formats, including segmentation masks, sketches, and poses. It can also predict future frames with results far superior to those of the baseline models.
![vid2vid: video-to-video synthesis results](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/08_vid2vid_web.png)
- The sequential generator takes as input the current source frame, the past two source frames, and the past two generated frames.
- Conditional image discriminator ensures that each output frame resembles a real image given the same source image.
- Conditional video discriminator ensures that consecutive output frames resemble the temporal dynamics of a real video given the same optical flow.
- Foreground-background prior in the generator design further improves the synthesis performance of the proposed model.
- Using a soft occlusion mask instead of a binary one better handles the “zoom in” scenario: details can be added by gradually blending the warped pixels and the newly synthesized pixels.
- Generating high-resolution (2048×1024), photorealistic, temporally coherent videos up to 30 seconds long.
- Outputting several videos with different visual appearances by sampling different feature vectors.
- Outperforming the baseline models in future video prediction.
- Converting semantic labels into realistic real-world videos.
- Generating multiple outputs of talking people from edge maps.
- Generating an entire human body given a pose.
- “NVIDIA’s new vid2vid is the first open-source code that lets you fake anybody’s face convincingly from one source video. […] interesting times ahead…”, Gene Kogan, an artist and a programmer.
- The paper has also received some criticism over the concern that it can be used to create deepfakes or tampered videos which can deceive people.
- Using object tracking information to make sure that each object has a consistent appearance across the whole video.
- Researching if training the model with coarser semantic labels will help reduce the visible artifacts that appear after semantic manipulations (e.g., turning trees into buildings).
- Adding additional 3D cues, such as depth maps, to enable synthesis of turning cars.
- Marketing and advertising can benefit from the opportunities created by the vid2vid method (e.g., replacing the face or even the entire body in the video). However, this should be used with caution, keeping in mind the ethical considerations.
- The NVIDIA team provides the original implementation of this research paper on GitHub.
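The soft occlusion mask mentioned above reduces to a per-pixel convex blend between optically warped and newly synthesized pixels. A minimal NumPy sketch (a toy illustration, not NVIDIA's implementation):

```python
import numpy as np

def blend(warped, synthesized, mask):
    """Soft occlusion mask in [0, 1]: 1 keeps the pixel warped from the
    previous frame, 0 takes the newly synthesized pixel, and values in
    between mix them -- which is what lets 'zoom in' add detail gradually."""
    mask = np.clip(mask, 0.0, 1.0)[..., None]  # broadcast over RGB channels
    return mask * warped + (1.0 - mask) * synthesized

h, w = 4, 4
warped = np.full((h, w, 3), 0.2)        # pixels warped from the last frame
synthesized = np.full((h, w, 3), 0.8)   # freshly generated pixels
mask = np.full((h, w), 0.25)            # mostly trust the new synthesis here
frame = blend(warped, synthesized, mask)
```

A binary mask would force a hard switch between the two sources at every pixel; the soft version is what allows newly revealed detail to fade in smoothly across frames.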
9. Everybody Dance Now, by Caroline Chan, Shiry Ginosar, Tinghui Zhou, Alexei A. Efros
This paper presents a simple method for “do as I do” motion transfer: given a source video of a person dancing we can transfer that performance to a novel (amateur) target after only a few minutes of the target subject performing standard moves. We pose this problem as a per-frame image-to-image translation with spatio-temporal smoothing. Using pose detections as an intermediate representation between source and target, we learn a mapping from pose images to a target subject’s appearance. We adapt this setup for temporally coherent video generation including realistic face synthesis. Our video demo can be found at https://youtu.be/PCBTZh41Ris .
UC Berkeley researchers present a simple method for generating videos in which amateur dancers perform like professional dancers. If you want to take part in the experiment, all you need to do is record a few minutes of yourself performing some standard moves and then pick the video of the dance you want to repeat. The neural network does the main job: it treats the problem as per-frame image-to-image translation with spatio-temporal smoothing. By conditioning the prediction at each frame on that of the previous time step for temporal smoothness, and applying a specialized GAN for realistic face synthesis, the method achieves truly impressive results.
![Everybody Dance Now: motion transfer results](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/09_Dance_web.jpg)
- A pre-trained state-of-the-art pose detector creates pose stick figures from the source video.
- Global pose normalization is applied to account for differences between the source and target subjects in body shapes and locations within the frame.
- Normalized pose stick figures are mapped to the target subject.
- To make videos smooth, the researchers condition the generator on the previously generated frame and feed both images to the discriminator. Gaussian smoothing of the pose keypoints further reduces jitter.
- To generate more realistic faces, the method includes an additional face-specific GAN that brushes up the face after the main generation is finished.
- Suggesting a novel approach to motion transfer that outperforms a strong baseline (pix2pixHD), according to both qualitative and quantitative assessments.
- Demonstrating that the face-specific GAN adds considerable detail to the output video.
- “Overall I thought this was really fun and well executed. Looking forward to the code release so that I can start training my dance moves.”, Tom Brown, member of technical staff at Google Brain.
- “’Everybody Dance Now’ from Caroline Chan, Alyosha Efros and team transfers dance moves from one subject to another. The only way I’ll ever dance well. Amazing work!!!”, Soumith Chintala, AI Research Engineer at Facebook.
- Replacing pose stick figures with temporally coherent inputs and representation specifically optimized for motion transfer.
- “Do as I do” motion transfer might be applied to replace subjects when creating marketing and promotional videos.
- A PyTorch implementation of this research paper is available on GitHub.
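Global pose normalization can be sketched as a per-frame translation and scale applied to the source keypoints. The version below, which simply matches ankle position and apparent body height, is a simplified assumption, not the authors' exact scheme:

```python
import numpy as np

def normalize_pose(keypoints, src_ankle, src_height, tgt_ankle, tgt_height):
    """Translate and scale source 2D pose keypoints into the target's frame,
    so the mapped pose has the target's apparent height and stands where the
    target subject stands."""
    scale = tgt_height / src_height
    return (keypoints - src_ankle) * scale + tgt_ankle

# Hypothetical (x, y) keypoints for three joints of the source pose:
# head, hip, ankle of a subject 200 px tall in the source video.
src = np.array([[100.0, 50.0], [100.0, 150.0], [100.0, 250.0]])
src_ankle = np.array([100.0, 250.0])
tgt_ankle = np.array([200.0, 300.0])   # where the target's ankles sit
out = normalize_pose(src, src_ankle, src_height=200.0,
                     tgt_ankle=tgt_ankle, tgt_height=100.0)
```

Without this step, a tall source subject would be rendered as an implausibly stretched target; the normalized stick figure is what actually gets fed to the image-to-image generator.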
10. Large Scale GAN Training for High Fidelity Natural Image Synthesis, by Andrew Brock, Jeff Donahue, and Karen Simonyan
Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple “truncation trick”, allowing fine control over the trade-off between sample fidelity and variety by truncating the latent space. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128×128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.3 and Frechet Inception Distance (FID) of 9.6, improving over the previous best IS of 52.52 and FID of 18.65.
The DeepMind team finds that current techniques are sufficient for synthesizing high-resolution, diverse images from available datasets such as ImageNet and JFT-300M. In particular, they show that Generative Adversarial Networks (GANs) can generate images that look very realistic when trained at very large scale, i.e., with two to four times as many parameters and eight times the batch size compared to prior art. These large-scale GANs, or BigGANs, are the new state of the art in class-conditional image synthesis.
![BigGAN samples](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/10_biggan_web.jpg)
- GANs perform much better with the increased batch size and number of parameters.
- Applying orthogonal regularization to the generator makes the model responsive to a specific technique (“truncation trick”), which provides control over the trade-off between sample fidelity and variety.
- Demonstrating that GANs can benefit significantly from scaling.
- Building models that allow explicit, fine-grained control of the trade-off between sample variety and fidelity.
- Discovering instabilities of large-scale GANs and characterizing them empirically.
- Setting new state-of-the-art scores on ImageNet at 128×128 resolution: an Inception Score (IS) of 166.3 versus the previous best of 52.52, and a Fréchet Inception Distance (FID) of 9.6 versus the previous best of 18.65.
- The paper is under review for ICLR 2019.
- After BigGAN generators became available on TF Hub, AI researchers from all over the world played with BigGANs to generate dogs, watches, bikini images, the Mona Lisa, seashores and much more.
- Moving to larger datasets to mitigate GAN stability issues.
- Replacing expensive manual media creation for advertising and e-commerce purposes.
- A BigGAN demo implemented in TensorFlow is available to use on Google’s Colab tool.
- Aaron Leong has a GitHub repository with BigGAN implemented in PyTorch.
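The “truncation trick” itself is simple to sketch: sample the latent vector from a standard normal and resample any coordinate whose magnitude exceeds a threshold. Lower thresholds trade sample variety for fidelity:

```python
import numpy as np

def truncated_normal(shape, threshold, rng):
    """BigGAN-style truncated sampling: draw z ~ N(0, 1) and resample every
    coordinate whose magnitude exceeds the threshold, so the generator only
    ever sees latents from the high-density core of the prior."""
    z = rng.standard_normal(shape)
    while True:
        mask = np.abs(z) > threshold
        if not mask.any():
            return z
        z[mask] = rng.standard_normal(mask.sum())

rng = np.random.default_rng(0)
# A batch of 8 latent vectors of dimension 128, truncated at 0.5.
z = truncated_normal((8, 128), threshold=0.5, rng=rng)
```

At sampling time the threshold becomes a user-facing knob: near 0 the generator emits a few highly typical, high-fidelity images; near infinity it recovers the full (more varied, less reliable) distribution it was trained on.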
Want Deeper Dives Into Specific AI Research Topics?
Due to popular demand, we’ve released several of these easy-to-read summaries and syntheses of major research papers for different subtopics within AI and machine learning.
- Top 10 machine learning & AI research papers of 2018
- Top 10 AI fairness, accountability, transparency, and ethics (FATE) papers of 2018
- Top 14 natural language processing (NLP) research papers of 2018
- Top 10 computer vision and image generation research papers of 2018
- Top 10 conversational AI and dialog systems research papers of 2018
- Top 10 deep reinforcement learning research papers of 2018
Update: 2019 Research Summaries Are Released
- Top 10 AI & machine learning research papers from 2019
- Top 11 NLP achievements & papers from 2019
- Top 10 research papers in conversational AI from 2019
- Top 10 computer vision research papers from 2019
- Top 12 AI ethics research papers introduced in 2019
- Top 10 reinforcement learning research papers from 2019
About Mariya Yao
Mariya is the co-author of Applied AI: A Handbook For Business Leaders and former CTO at Metamaven. She "translates" arcane technical concepts into actionable business advice for executives and designs lovable products people actually want to use. Follow her on Twitter at @thinkmariya to raise your AI IQ.
MIT News | Massachusetts Institute of Technology
When computer vision works more like a brain, it sees more like people do
![Monotone image of a human eye with graphic representations of a computer network superimposed](https://news.mit.edu/sites/default/files/styles/news_article__image_gallery/public/images/202306/dicarlo_900x600.jpg?itok=mCEPNnW-)
From cameras to self-driving cars, many of today’s technologies depend on artificial intelligence to extract meaning from visual information. Today’s AI technology has artificial neural networks at its core, and most of the time we can trust these AI computer vision systems to see things the way we do — but sometimes they falter. According to MIT and IBM research scientists, one way to improve computer vision is to instruct the artificial neural networks that they rely on to deliberately mimic the way the brain’s biological neural network processes visual images.
Researchers led by MIT Professor James DiCarlo , the director of MIT’s Quest for Intelligence and member of the MIT-IBM Watson AI Lab, have made a computer vision model more robust by training it to work like a part of the brain that humans and other primates rely on for object recognition. This May, at the International Conference on Learning Representations, the team reported that when they trained an artificial neural network using neural activity patterns in the brain’s inferior temporal (IT) cortex, the artificial neural network was more robustly able to identify objects in images than a model that lacked that neural training. And the model’s interpretations of images more closely matched what humans saw, even when images included minor distortions that made the task more difficult.
Comparing neural circuits
Many of the artificial neural networks used for computer vision already resemble the multilayered brain circuits that process visual information in humans and other primates. Like the brain, they use neuron-like units that work together to process information. As they are trained for a particular task, these layered components collectively and progressively process the visual information to complete the task — determining, for example, that an image depicts a bear or a car or a tree.
DiCarlo and others previously found that when such deep-learning computer vision systems establish efficient ways to solve visual problems, they end up with artificial circuits that work similarly to the neural circuits that process visual information in our own brains. That is, they turn out to be surprisingly good scientific models of the neural mechanisms underlying primate and human vision.
That resemblance is helping neuroscientists deepen their understanding of the brain. By demonstrating ways visual information can be processed to make sense of images, computational models suggest hypotheses about how the brain might accomplish the same task. As developers continue to refine computer vision models, neuroscientists have found new ideas to explore in their own work.
“As vision systems get better at performing in the real world, some of them turn out to be more human-like in their internal processing. That’s useful from an understanding-biology point of view,” says DiCarlo, who is also a professor of brain and cognitive sciences and an investigator at the McGovern Institute for Brain Research.
Engineering a more brain-like AI
While their potential is promising, computer vision systems are not yet perfect models of human vision. DiCarlo suspected one way to improve computer vision may be to incorporate specific brain-like features into these models.
To test this idea, he and his collaborators built a computer vision model using neural data previously collected from vision-processing neurons in the monkey IT cortex — a key part of the primate ventral visual pathway involved in the recognition of objects — while the animals viewed various images. More specifically, Joel Dapello, a Harvard University graduate student and former MIT-IBM Watson AI Lab intern; and Kohitij Kar, assistant professor and Canada Research Chair (Visual Neuroscience) at York University and visiting scientist at MIT; in collaboration with David Cox, IBM Research’s vice president for AI models and IBM director of the MIT-IBM Watson AI Lab; and other researchers at IBM Research and MIT asked an artificial neural network to emulate the behavior of these primate vision-processing neurons while the network learned to identify objects in a standard computer vision task.
“In effect, we said to the network, ‘please solve this standard computer vision task, but please also make the function of one of your inside simulated “neural” layers be as similar as possible to the function of the corresponding biological neural layer,’” DiCarlo explains. “We asked it to do both of those things as best it could.” This forced the artificial neural circuits to find a different way to process visual information than the standard, computer vision approach, he says.
After training the artificial model with biological data, DiCarlo’s team compared its activity to a similarly-sized neural network model trained without neural data, using the standard approach for computer vision. They found that the new, biologically informed model IT layer was — as instructed — a better match for IT neural data. That is, for every image tested, the population of artificial IT neurons in the model responded more similarly to the corresponding population of biological IT neurons.
The researchers also found that the model IT was also a better match to IT neural data collected from another monkey, even though the model had never seen data from that animal, and even when that comparison was evaluated on that monkey’s IT responses to new images. This indicated that the team’s new, “neurally aligned” computer model may be an improved model of the neurobiological function of the primate IT cortex — an interesting finding, given that it was previously unknown whether the amount of neural data that can be currently collected from the primate visual system is capable of directly guiding model development.
With their new computer model in hand, the team asked whether the “IT neural alignment” procedure also leads to any changes in the overall behavioral performance of the model. Indeed, they found that the neurally-aligned model was more human-like in its behavior — it tended to succeed in correctly categorizing objects in images for which humans also succeed, and it tended to fail when humans also fail.
Adversarial attacks
The team also found that the neurally aligned model was more resistant to “adversarial attacks” that developers use to test computer vision and AI systems. In computer vision, adversarial attacks introduce small distortions into images that are meant to mislead an artificial neural network.
“Say that you have an image that the model identifies as a cat. Because you have the knowledge of the internal workings of the model, you can then design very small changes in the image so that the model suddenly thinks it’s no longer a cat,” DiCarlo explains.
These minor distortions don’t typically fool humans, but computer vision models struggle with these alterations. A person who looks at the subtly distorted cat still reliably and robustly reports that it’s a cat. But standard computer vision models are more likely to mistake the cat for a dog, or even a tree.
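For intuition, here is a minimal adversarial perturbation in the spirit of the fast gradient sign method (Goodfellow et al., 2015), applied to a toy linear “cat” classifier; this is a hypothetical stand-in, unrelated to the models discussed in the article:

```python
import numpy as np

# Toy linear classifier over 100 'pixels': score = w . x; positive -> 'cat'.
rng = np.random.default_rng(1)
w = rng.standard_normal(100)
x = w / np.linalg.norm(w)   # an input the model scores confidently as 'cat'

# Fast-gradient-sign step: nudge every pixel by epsilon in the direction
# that most decreases the 'cat' score. For a linear model the gradient of
# the score with respect to x is just w.
epsilon = 0.05
x_adv = x - epsilon * np.sign(w)

clean_score = w @ x
adv_score = w @ x_adv   # strictly lower: drops by epsilon * sum(|w|)
```

Each pixel moves by at most epsilon, yet the score drops by epsilon times the L1 norm of the weights; with enough pixels, an imperceptible per-pixel change can flip the decision, which is exactly the failure mode the neurally aligned models resist better.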
“There must be some internal differences in the way our brains process images that lead to our vision being more resistant to those kinds of attacks,” DiCarlo says. And indeed, the team found that when they made their model more neurally aligned, it became more robust, correctly identifying more images in the face of adversarial attacks. The model could still be fooled by stronger “attacks,” but so can people, DiCarlo says. His team is now exploring the limits of adversarial robustness in humans.
A few years ago, DiCarlo’s team found they could also improve a model’s resistance to adversarial attacks by designing the first layer of the artificial network to emulate the early visual processing layer in the brain. One key next step is to combine such approaches — making new models that are simultaneously neurally aligned at multiple visual processing layers.
The new work is further evidence that an exchange of ideas between neuroscience and computer science can drive progress in both fields. “Everybody gets something out of the exciting virtuous cycle between natural/biological intelligence and artificial intelligence,” DiCarlo says. “In this case, computer vision and AI researchers get new ways to achieve robustness, and neuroscientists and cognitive scientists get more accurate mechanistic models of human vision.”
This work was supported by the MIT-IBM Watson AI Lab, the Semiconductor Research Corporation, the U.S. Defense Advanced Research Projects Agency, the MIT Shoemaker Fellowship, the U.S. Office of Naval Research, the Simons Foundation, and the Canada Research Chair Program.
![OpenCV logo](https://opencv.org/wp-content/uploads/2022/05/logo.png)
Open Computer Vision Library
Research Areas in Computer Vision: Trends and Challenges
Farooq Alvi, February 7, 2024, AI Careers
![Research areas in computer vision](https://opencv.org/wp-content/uploads/2024/02/Research-areas-in-Computer-vision.png)
Basics of Computer Vision
Computer Vision (CV) is a field of artificial intelligence that trains computers to interpret and understand the visual world. Using digital images from cameras and videos, along with deep learning models, computers can accurately identify and classify objects, and then react to what they “see.”
Key Concepts in Computer Vision
Image Processing: At the heart of CV is image processing, which involves enhancing image data (removing noise, sharpening, or brightening an image) and preparing it for further analysis.
Feature Detection and Matching: This involves identifying and using specific features of an image, like edges, corners, or objects, to understand the content of the image.
Pattern Recognition: CV uses pattern recognition to identify patterns and regularities in data. This can be as simple as recognizing the shape of an object or as complex as identifying a person’s face.
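All three concepts meet in the 2-D convolution, the workhorse of classical image processing and of CNNs alike. A self-contained NumPy sketch applying a standard sharpening kernel to a synthetic vertical edge:

```python
import numpy as np

def convolve2d(img, kernel):
    """Minimal 'valid' 2-D correlation: slide the kernel over the image and
    take the weighted sum at each position. This same operation underlies
    denoising, sharpening, edge detection, and CNN feature extraction."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A classic 3x3 sharpening kernel: boosts the centre pixel relative to its
# four neighbours, exaggerating intensity changes such as edges.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

img = np.zeros((5, 5))
img[:, 2:] = 1.0                 # a vertical edge between columns 1 and 2
edges = convolve2d(img, sharpen)  # overshoot/undershoot appears at the edge
```

In practice you would call an optimized routine such as OpenCV's `cv2.filter2D` rather than Python loops; the loop version is only meant to expose the arithmetic behind the terms above.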
Core Technologies Powering Computer Vision
Machine Learning and Deep Learning: These are crucial for teaching computers to recognize patterns in visual data. Deep learning, especially, has been a game-changer, enabling advancements in facial recognition, object detection, and more.
Neural Networks: A type of machine learning, neural networks, particularly Convolutional Neural Networks (CNNs), are pivotal in analyzing visual imagery.
Image Recognition and Classification: This is the process of identifying and labeling objects within an image. It’s one of the most common applications of CV.
Object Detection: This goes a step further than image classification by not only identifying objects in images but also locating them.
Applications of Basic Computer Vision
Automated Inspection: Used in manufacturing to identify defects.
Surveillance: Helps in monitoring activities for security purposes.
Retail: For example, in cashier-less stores where CV tracks what customers pick up.
Healthcare: Assisting in diagnostic procedures through medical image analysis.
Challenges and Limitations
Data Quality and Quantity: The accuracy of a computer vision system is highly dependent on the quality and quantity of the data it’s trained on.
Computational Requirements: Advanced CV models require significant computational power, making them resource-intensive.
Ethical and Privacy Concerns: The use of CV in surveillance and data collection raises ethical and privacy issues that need to be addressed.
Key Research Areas in Computer Vision
![Research areas in computer vision](https://opencv.org/wp-content/uploads/2024/02/Research-areas-in-CV-1024x576.png)
Augmented Reality: The Convergence with Computer Vision
In 2024, Augmented Reality (AR) continues to make significant strides, increasingly integrating with computer vision (CV) to create more immersive and interactive experiences across various sectors. This integration is crucial as AR requires understanding and interacting with the real world through visual information, a capability at the core of CV.
Manufacturing, Retail, and Education: Transformative Sectors
Manufacturing: AR devices enable manufacturing workers to access real-time instructional and administrative information. This integration significantly enhances efficiency and accuracy in production processes.
Retail: In the retail sector, AR is revolutionizing the shopping experience. Consumers can now visualize products in great detail, including pricing and features, right from their AR devices, offering a more engaging and informed shopping experience.
Education: The impact of AR in education is substantial. Traditional teaching methods are being supplemented with immersive and interactive AR experiences, making learning more engaging and effective for students.
Technological Advances in AR
The advancement in AR technology, backed by major companies like Apple and Meta, is seeing a surge of consumer-grade AR devices entering the market. These devices are set to become more widely available, making AR more integral to daily life and work.
The development of sophisticated AR gaming is a testament to this growth. AR games now offer realistic gameplay, integrating virtual objects and characters into the real world, enhancing player engagement, and creating new possibilities in gaming and non-gaming applications. Startups like Mohx-games and smar.toys are at the forefront of this innovation, developing platforms and controllers that elevate the AR gaming experience.
Mobile AR tools are another significant advancement. These tools utilize the increasing capabilities of smartphone cameras and sensors to enhance the realism and immersion of AR interactions. Platforms like Phantom Technology’s PhantomEngine enable developers to create more sophisticated and context-aware AR applications.
Wearables with AR capabilities, such as those developed by ARKH and Wavelens, are offering hands-free experiences, further expanding the usability and applications of AR in various industries, including manufacturing and logistics. These wearables provide real-time guidance and information directly in the user’s field of view, enhancing convenience and efficiency.
3D design and prototyping in AR, as exemplified by Virtualist’s building design platform, are enabling industries like architecture and automotive to visualize products and designs in real-world contexts, significantly improving the decision-making process and reducing design errors.
Robotic Language-Vision Models (RLVM)
Integration of vision and language in robotics.
In 2024, the field of robotics is witnessing a significant shift with the integration of Robotic Language-Vision Models (RLVMs), which are transforming how robots understand and interact with their environment. This blend of visual comprehension and language interpretation is paving the way for a new era of intelligent, responsive robotics.
Advancements in Robotic Language-Vision Models
Enhanced Learning Capabilities: Research and development efforts are increasingly focusing on using generative AI to make robots faster learners, especially for complex manipulation tasks. This advancement is likely to continue throughout 2024, potentially leading to commercial applications in robotics.
Natural Language Understanding:
Robots are becoming more personable, thanks to their improved ability to understand natural language instructions. This evolution is exemplified by projects where robots, such as Boston Dynamics’ Spot, are turned into interactive agents like tour guides.
Wider Application Spectrum:
Robots are moving beyond traditional environments like warehouses and manufacturing into public-facing roles in restaurants, hotels, hospitals, and more. Enabled by generative AI, these robots are expected to interact more naturally with people, enhancing their utility in these new roles.
Autonomous Mobile Robots (AMRs):
AMRs, combining sensors, AI, and computer vision, are increasingly used in varied settings, from factory floors to hospital corridors, for tasks like material handling, disinfection, and delivery services.
Intelligent Robotics:
Integration of AI in robotics is allowing robots to use real-time information to optimize tasks. This includes leveraging computer vision and machine learning for improved accuracy and performance in applications such as manufacturing automation and customer service in retail and hospitality.
Collaborative Robots (Cobots):
Cobots are being designed to safely interact and work alongside humans, augmenting human efforts in various industrial processes. Advances in sensor technology and software are enabling these robots to perform tasks more safely and efficiently alongside human workers.
Robotics as a Service (RaaS):
RaaS models are becoming more popular, providing businesses with flexible and scalable access to robotic solutions. This approach is particularly beneficial for small and medium-sized enterprises that can leverage robotic technology without incurring significant upfront costs.
Robotics Cybersecurity:
As robotics systems become more interconnected, the importance of cybersecurity in robotics is growing. Solutions are being developed to protect robotic systems from cyber threats, ensuring the safety and reliability of these systems in various applications.
Advanced Satellite Vision:
Monitoring environmental and urban changes.
In 2024, the capabilities of satellite imagery have been significantly enhanced by advancements in computer vision (CV), leading to more effective monitoring of environmental and urban changes.
Satellite Imagery and Computer Vision
High-Resolution Monitoring: CV-powered satellite imagery provides high-resolution monitoring of various terrestrial phenomena. This includes tracking urban sprawl, deforestation, and changes in marine environments.
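A representative technique behind such monitoring is a per-pixel spectral index. The classic example is NDVI, which compares near-infrared and red reflectance to map vegetation; the band values below are made up for illustration:

```python
import numpy as np

# Toy 3x3 scene: per-pixel red and near-infrared (NIR) reflectance.
# Healthy vegetation reflects strongly in NIR and absorbs red light.
red = np.array([[0.10, 0.10, 0.40],
                [0.10, 0.12, 0.42],
                [0.11, 0.40, 0.45]])
nir = np.array([[0.60, 0.62, 0.42],
                [0.58, 0.60, 0.44],
                [0.59, 0.41, 0.46]])

# Normalized Difference Vegetation Index, always in [-1, 1]:
#   NDVI = (NIR - Red) / (NIR + Red)
ndvi = (nir - red) / (nir + red)

vegetated = ndvi > 0.3   # simple per-pixel classification threshold
print(f"vegetated pixels: {vegetated.sum()} of {ndvi.size}")
```

Comparing such index maps across acquisition dates is one way deforestation and urban-sprawl trends are tracked over time.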
Environmental Management
These technological advancements are crucial for environmental monitoring and management. The detailed data from satellite imagery enables the study of ecological and climatic changes with unprecedented precision.
Urban Planning and Development
In urban areas, satellite vision assists in planning and development, providing critical data for infrastructure development, land use planning, and resource management.
Disaster Response and Management
Advanced satellite vision plays a key role in disaster management. It helps in assessing the impact of natural disasters and planning effective response strategies.
Agricultural Applications
In agriculture, satellite imagery helps in monitoring crop health, soil conditions, and water resources, enabling more efficient and sustainable farming practices.
Climate Change Analysis
Satellite vision is instrumental in understanding and monitoring the effects of climate change globally, including polar ice melt, sea-level rise, and changes in weather patterns.
3D Computer Vision: Enhancing Autonomous Vehicles and Digital Twin Modeling
In 2024, 3D Computer Vision (3D CV) is playing a pivotal role in advancing technologies in various sectors, particularly in autonomous vehicles and digital twin modeling.
3D Computer Vision in Autonomous Vehicles
Depth Perception: 3D CV enables autonomous vehicles to accurately perceive depth and distance. This is crucial for navigating complex environments and ensuring safety on the roads.
Object Detection and Tracking: It allows for precise detection and tracking of objects around the vehicle, including other vehicles, pedestrians, and road obstacles.
Environment Mapping: Advanced 3D imaging and processing help in creating detailed maps of the vehicle’s surroundings, essential for route planning and navigation.
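For a stereo camera pair, the depth perception mentioned above reduces to one formula: depth Z = f * B / d, where f is the focal length in pixels, B the baseline between the cameras, and d the pixel disparity between the two views. A minimal sketch with illustrative (not real-sensor) numbers:

```python
import numpy as np

f = 700.0   # focal length in pixels (illustrative value)
B = 0.54    # camera baseline in metres (illustrative value)

# Matched points that shift a lot between the two views are close;
# points that barely shift are far away.
disparity = np.array([70.0, 35.0, 7.0])   # pixels
depth = f * B / disparity                 # metres

for d, z in zip(disparity, depth):
    print(f"disparity {d:5.1f} px -> depth {z:6.2f} m")
```

Real pipelines spend most of their effort on the hard part this sketch skips: reliably matching pixels between the two images to obtain the disparity map.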
Digital Twin Modeling with 3D Computer Vision
Accurate Replication: 3D CV is integral in creating accurate digital replicas of physical objects, buildings, or even entire cities for digital twin applications.
Simulation and Analysis: These digital twins are used for simulations, allowing for analysis and optimization of systems in a virtual environment before actual implementation.
Predictive Maintenance and Planning: In industries such as manufacturing and urban planning, digital twins aid in predictive maintenance and strategic planning, minimizing risks and enhancing efficiency.
Ethics in Computer Vision: Navigating Bias and Privacy Concerns
As computer vision (CV) technologies become increasingly integrated into various aspects of life, ethical considerations, particularly related to bias and privacy, are gaining prominence.
Addressing Bias in Computer Vision
Data Diversity: One major ethical challenge in CV is the bias in algorithms, often stemming from non-representative training data. Efforts are being made to create more diverse and inclusive datasets to help overcome biases related to race, gender, and other factors.
Fairness in Algorithms: There is a growing focus on developing algorithms that are fair and non-discriminatory. This includes techniques to detect and correct biases in CV systems.
Transparent and Explainable AI: Transparency in how CV models are built and function is crucial. There’s an emphasis on explainable AI, where the decision-making process of CV systems can be understood and interrogated by users.
Ensuring Privacy in Computer Vision
Consent and Anonymity: With CV technologies being used in public spaces, ensuring individual privacy is paramount. Techniques like face-blurring in videos and images are being adopted to protect identities.
Regulatory Compliance: Governments and regulatory bodies are proposing strict regulations to ensure responsible development and use of AI and CV technologies. This includes guidelines for data collection, processing, and storage to protect individual privacy.
Ethical Design and Deployment: Ethical considerations are increasingly becoming a part of the design and deployment process of CV technologies. This involves assessing the potential impact on society and individuals and ensuring that privacy and individual rights are safeguarded.
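The face-blurring mentioned above can be approximated by pixelating a region of each frame. This is a minimal stand-in (a real system would first run a face detector to find the region to anonymize):

```python
import numpy as np

def blur_region(img, top, left, h, w, k=4):
    """Anonymize a region by pixelating it (a simple face-blur stand-in).

    Each k x k block inside the region is replaced by its mean, which
    destroys identifying detail while leaving the rest of the frame intact.
    """
    out = img.copy()
    for r in range(top, top + h, k):
        for c in range(left, left + w, k):
            block = out[r:min(r + k, top + h), c:min(c + k, left + w)]
            block[:] = block.mean()
    return out

frame = np.arange(64, dtype=float).reshape(8, 8)   # toy video frame
anon = blur_region(frame, 2, 2, 4, 4)              # "face" at rows/cols 2-5
print(anon)
```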
Synthetic Data and Generative AI in Computer Vision
The role of generative AI in creating synthetic data has become increasingly significant in developing and improving computer vision (CV) systems.
Generative AI and Synthetic Data Creation
Enhancing Training of CV Models: Generative AI algorithms can create realistic, high-quality synthetic data. This data is particularly valuable for training CV models, especially when real-world data is scarce, sensitive, or difficult to obtain.
Diversity and Volume: Synthetic data generated by AI can encompass various scenarios and variations, offering a rich and diverse dataset. This diversity is crucial for training robust CV models capable of performing accurately in various real-world conditions.
Privacy and Ethical Compliance: Using synthetic data mitigates privacy concerns associated with using real data, especially in sensitive areas like healthcare and security. It offers a way to train effective CV models without compromising individual privacy.
Cost-Effectiveness and Efficiency: Generating synthetic data can be more cost-effective and efficient than collecting and labeling vast amounts of real-world data. It also speeds up the iterative process of training and refining CV models.
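The core appeal of synthetic data, cheap samples with perfect labels, can be sketched without any generative model at all. The procedural generator below is a hypothetical stand-in: it emits images together with exact ground-truth boxes, the way a synthetic-data pipeline would.

```python
import numpy as np

rng = np.random.default_rng(0)

def synth_sample(size=32):
    """Generate one synthetic training image with a perfect label.

    A procedural stand-in for a generative model: a random bright
    rectangle on a noisy background, with its bounding box known
    exactly, so no manual annotation is needed.
    """
    img = rng.normal(0.1, 0.05, (size, size))   # background noise
    h, w = rng.integers(5, 12, size=2)
    top = rng.integers(0, size - h)
    left = rng.integers(0, size - w)
    img[top:top + h, left:left + w] += 0.8      # the "object"
    label = (top, left, h, w)                   # free ground truth
    return img, label

dataset = [synth_sample() for _ in range(100)]  # cheap, varied, private
img, (top, left, h, w) = dataset[0]
print(img.shape, (top, left, h, w))
```

Generative models play the same role at much higher fidelity, producing photorealistic scenes whose annotations are still known by construction.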
Computer Vision in Edge Computing
In 2024, the trend of integrating Computer Vision (CV) with edge computing is becoming increasingly prominent, revolutionizing how data is processed in various applications.
The Shift to On-Device Processing
Reduced Latency: By processing visual data directly on the device (edge computing), response times are significantly decreased. This is vital in applications where real-time analysis is crucial, such as in autonomous vehicles or real-time monitoring systems.
Improved Privacy and Security: Edge computing allows for sensitive data to be processed locally, reducing the risk of data breaches during transmission to cloud-based servers. This is particularly important in applications involving personal or sensitive information.
Enhanced Efficiency: Local data processing minimizes the need to transfer large volumes of data to the cloud, thereby reducing bandwidth usage and associated costs. This is beneficial for devices operating in remote or bandwidth-constrained environments.
Scalability: Edge computing enables scalability in CV applications. Devices can process data independently, alleviating the load on central servers and allowing for the deployment of more devices without a proportional increase in central processing requirements.
Applications in Diverse Fields
Intelligent Security Systems: In security and surveillance, edge computing allows for immediate processing and analysis of visual data, enabling quicker response to potential security threats.
Healthcare: Portable medical devices with integrated CV can process data on the edge, aiding in immediate diagnostic procedures and patient monitoring.
Retail and Consumer Applications: In retail, edge computing enables smart shelves and inventory management systems to process visual data in real time, improving efficiency and customer experience.
Industrial and Manufacturing: In industrial settings, edge computing facilitates real-time monitoring and quality inspection, improving operational efficiency and safety.
Computer Vision in Healthcare
Computer Vision (CV) is significantly impacting the healthcare sector, offering innovative solutions for medical image analysis, surgical assistance, and patient monitoring.
Medical Image Analysis
Diagnostic Accuracy: CV algorithms are increasingly used to analyze medical images such as X-rays, MRIs, and CT scans. They assist in identifying abnormalities, leading to quicker and more accurate diagnoses.
Cancer Detection: In oncology, CV aids in the early detection of cancers, such as breast or skin cancer, through detailed analysis of medical imagery.
Automated Analysis: Automated image analysis can handle large volumes of medical images, reducing the workload on radiologists and increasing efficiency.
Aiding Surgeries
Surgical Robotics: CV is integral to the functioning of surgical robots, providing them with the necessary visual information to assist surgeons in performing precise and minimally invasive procedures.
Real-Time Navigation: During surgeries, CV provides real-time imaging, aiding surgeons in navigating complex procedures and avoiding critical structures.
Training and Simulation: CV technologies are used in surgical training, providing simulations that help surgeons hone their skills in a risk-free environment.
Patient Monitoring
Remote Monitoring: CV enables remote patient monitoring, allowing healthcare providers to observe patients’ physical condition and movements without being physically present. This is particularly beneficial for elderly care and monitoring patients in intensive care units.
Fall Detection and Prevention: In elderly care, CV systems can detect falls or unusual behaviors, alerting caregivers to potential emergencies.
Behavioral Analysis: CV is also used in analyzing patients’ behaviors and movements, which can be vital in psychiatric care and physical therapy.
Challenges and Future Directions
While CV is bringing transformative changes to healthcare, it also presents challenges such as data privacy concerns, the need for large annotated datasets, and ensuring the accuracy and reliability of algorithms. The future of CV in healthcare is promising, with ongoing research and development aimed at addressing these challenges and expanding its applications.
Detecting Deepfakes: The Crucial Role of Computer Vision
As AI-generated deepfakes become increasingly realistic and pervasive, the importance of Computer Vision (CV) in detecting and combating them has become more critical.
The Challenge of Deepfakes
Realism and Proliferation: Deepfakes, synthesized using advanced AI algorithms, are becoming more sophisticated, making them harder to distinguish from real footage. Their potential use in spreading misinformation or malicious content poses significant challenges.
Misinformation and Security Threats: The use of deepfakes in spreading false information can have serious implications in various spheres, including politics, security, and personal privacy.
CV’s Role in Deepfake Detection
Analyzing Visual Inconsistencies: CV algorithms are trained to detect subtle inconsistencies in videos and images that are typically overlooked by the human eye. This includes irregularities in facial expressions, lip movements, and eye blinking patterns.
Temporal and Spatial Analysis: CV techniques analyze both spatial features (like facial features) and temporal features (like movement over time) in videos to identify anomalies that suggest manipulation.
Training on Diverse Data Sets: To improve the accuracy of deepfake detection, CV systems are trained on diverse datasets that include various types of manipulations and original content.
The importance of CV in identifying deepfakes cannot be overstated, as it stands at the forefront of preserving information integrity in the digital age. The advancements in this field will be instrumental in maintaining trust and authenticity in digital media.
Real-Time Computer Vision
Enhancing security, crowd monitoring, and industrial safety.
Real-time computer vision (CV) technologies are increasingly being deployed in various fields like security, crowd monitoring, and industrial safety, offering dynamic and immediate data analysis for enhanced operational efficiency and safety.
Applications in Security
Surveillance Systems: Real-time CV is revolutionizing surveillance by enabling immediate identification and alerting of security breaches or unusual activities. This includes facial recognition, intrusion detection, and unauthorized access alerts.
Automated Threat Detection: CV systems can detect potential threats in real-time, such as identifying unattended bags in public areas or spotting unusual behaviors that could indicate criminal activities.
Crowd Monitoring and Management
Public Safety: In large public gatherings, real-time CV aids in crowd density analysis, helping to prevent stampedes or accidents by alerting authorities to potential dangers due to overcrowding.
Traffic Management: In urban settings, CV systems monitor and analyze traffic flow in real time, helping in congestion management and accident prevention.
Event Management: For events like concerts or sports games, real-time CV can assist in crowd control, ensuring that safety regulations are adhered to and identifying potential bottlenecks or overcrowding situations.
Industrial Safety
Workplace Monitoring: CV systems monitor industrial environments in real time, detecting potential hazards like equipment malfunctions or unsafe worker behavior, thus preventing accidents and ensuring compliance with safety protocols.
Quality Control: In manufacturing, real-time CV assists in continuous monitoring of production lines, instantly identifying defects or deviations from standard protocols.
Equipment Maintenance: CV can help in predictive maintenance by detecting early signs of wear and tear in machinery, preventing costly downtime and accidents.
Conclusion: Navigating the Future of Computer Vision
From enhancing healthcare and security to revolutionizing interactive technologies like AR, CV is reshaping our interaction with the digital world. Its advancements, including AI integration and edge computing, highlight a future rich with potential.
Yet, this journey forward isn’t without challenges. Balancing innovation with ethical responsibility, privacy, and fairness remains crucial. As CV becomes more embedded in our lives, it calls for a collaborative approach among technologists, ethicists, and policymakers to ensure it benefits society responsibly and equitably.
In essence, CV’s future is not just about technological growth but also about addressing ethical and societal needs, marking an exciting, transformative journey ahead.
Computer Vision: 10 Papers to Start
Dec 25, 2015
“How do I know what papers to read in computer vision? There are so many. And they are so different.” Graduate student, Xi’an, China, November 2011.
This is a quote from an opinion paper by my advisor. Having worked on computer vision for nearly two years, I can absolutely relate to this comment. The diversity of computer vision can be especially confusing for beginners.
This post serves as a humble attempt to answer the opening question. Of course it is subjective, but a good starting point for sure.
This post is intended for newcomers to computer vision, mostly undergraduate students. An important lesson is that, unlike in undergraduate education, in research you learn primarily from reading papers, which is why I am recommending 10 to start.
Before getting to the list, it is good to know where CV papers are usually published. CV researchers like to publish in conferences. The three top-tier CV conferences are CVPR (every year), ICCV (odd years), and ECCV (even years). Since CV is an application of machine learning, people also publish in NIPS and ICML. ICLR is new but rapidly rising to the top tier. As for journals, PAMI and IJCV are the best.
I am partitioning the 10 papers into 5 categories, and the list is loosely sorted by publication time. Here it goes!
Features
Finding good features has always been a core problem of computer vision. A good feature summarizes the information in an image and enables the subsequent use of powerful mathematical tools. In the 2000s, many feature designs were proposed.
Distinctive Image Features from Scale-Invariant Keypoints, IJCV 2004
SIFT feature is designed to establish correspondence between two images. Its most important applications are in reconstruction and tracking.
Histograms of Oriented Gradients for Human Detection, CVPR 2005
HOG shares the same feature-design philosophy as SIFT but is even simpler. While SIFT targets low-level correspondence between images, HOG is used for higher-level recognition tasks such as human detection.
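The idea both papers share, describing a patch by a histogram of its gradient orientations, can be sketched as follows. This is a heavily simplified version; the real descriptors add keypoint scale selection (SIFT) or cells, blocks, and normalization schemes (HOG) on top:

```python
import numpy as np

def orientation_histogram(patch, bins=8):
    """Simplified HOG/SIFT-style descriptor for one image patch.

    Compute per-pixel gradients, then accumulate gradient magnitude
    into unsigned orientation bins, and normalize the result.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi           # unsigned orientation [0, pi)
    hist = np.zeros(bins)
    idx = (ang / np.pi * bins).astype(int) % bins
    for b in range(bins):
        hist[b] = mag[idx == b].sum()
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist         # invariant to contrast scaling

# A patch with purely vertical stripes: horizontal gradients dominate,
# so all the mass lands in the 0-radian orientation bin.
patch = np.tile([0.0, 0.0, 1.0, 1.0], (8, 2))
h = orientation_histogram(patch)
print(h)
```

Because the histogram pools many pixels, it is robust to small deformations and, after normalization, to illumination changes, which is exactly what makes these descriptors useful for matching and detection.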
Reconstruction
Reconstruction is an important branch of computer vision. Since the 2000s, structure from motion (SfM) has been formalized and is still the standard practice today.
Photo Tourism: Exploring Photo Collections in 3D, ACM Transactions on Graphics 2006
This paper uses SfM to reconstruct scenes from photos collected from the internet. Since then, the core pipeline has remained more or less the same, and people seek improvements in, for instance, scalability and visualization. An extended IJCV version was also published later.
Graphical Models
A graphical model is a machine learning tool that captures the relationships between random variables. It is quite general in nature and suitable for many computer vision tasks.
Structured Learning and Prediction in Computer Vision, Foundations and Trends in Computer Graphics and Vision 2011
This 180+ page monograph is one of the first papers I read, and it remains my personal favourite. It is a comprehensive overview of both the theory and applications of graphical models in various computer vision tasks.
Datasets
Advances in computer vision can hardly happen without good datasets. Evaluation on a suitable, unbiased dataset is the real proof of a proposed algorithm. Interestingly, the evolution of datasets also reflects the progress of computer vision research.
The PASCAL Visual Object Classes (VOC) Challenge, IJCV 2010
PASCAL VOC is the standard evaluation dataset for semantic segmentation and object detection. While the annual challenge has ended, the evaluation server is still open, and the leaderboard is definitely worth checking to find state-of-the-art results and algorithms. There is also a recent retrospective paper in IJCV.
ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009
ImageNet is the first truly large-scale dataset, containing millions of images organized by the WordNet hierarchy (its widely used ILSVRC subset covers 1,000 categories). It is the standard evaluation dataset for classification and one of the driving forces behind the recent success of deep convolutional neural networks. There is also a recent retrospective paper in IJCV.
Microsoft COCO: Common Objects in Context, ECCV 2014
This dataset is relatively new. Similar to PASCAL VOC, it aims at instance segmentation and object detection, but the number of images is much larger. More interestingly, it contains language descriptions for each image, bridging computer vision with natural language processing.
Deep Learning
I am sure you have heard of deep learning. It is an end-to-end hierarchical model optimized simply by the chain rule and gradient descent. What makes it powerful is its huge number of parameters, which enables unprecedented representation capacity.
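That recipe, the chain rule plus gradient descent, fits in a few lines of NumPy. The two-layer network below is a minimal sketch (the layer sizes, learning rate, and toy regression task are arbitrary choices, not anything from a particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression: learn y = x^2 on [-1, 1] with a tiny two-layer net.
X = rng.uniform(-1, 1, (64, 1))
Y = X ** 2

W1 = rng.normal(0, 0.5, (1, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.3

losses = []
for step in range(500):
    # Forward pass: linear -> tanh -> linear.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    loss = np.mean((pred - Y) ** 2)
    losses.append(loss)

    # Backward pass: the chain rule, applied layer by layer.
    g_pred = 2 * (pred - Y) / len(X)
    gW2 = h.T @ g_pred
    gb2 = g_pred.sum(0)
    g_h = g_pred @ W2.T * (1 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    gW1 = X.T @ g_h
    gb1 = g_h.sum(0)

    # Gradient descent: step every parameter downhill.
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

A deep CNN is the same loop with more layers, convolutional weights, and far more parameters; frameworks merely automate the backward pass.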
ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
This paper marks the big breakthrough of applying deep learning to computer vision. Made possible by the large ImageNet dataset and fast GPUs, the model took about a week to train and beat the best traditional method on image classification by roughly 10 percentage points.
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, ICML 2014
This paper shows that while the model mentioned above is trained for image classification, its intermediate representation is a powerful feature that can transfer to other tasks. This comes back to finding good features for images. In high-level tasks, deep features consistently show superiority over traditional features.
Visualizing and Understanding Convolutional Networks, ECCV 2014
Understanding what is actually going on inside a deep neural network remains a challenging task. This paper is perhaps the most famous and important work towards this goal: it looks at individual neurons and uses deconvolution to visualize what they respond to. However, there is still much to be done.
Again, this has been a humble attempt to address the opening question. Hope these excellent papers can kindle your enthusiasm for computer vision!
Merry Christmas!
10 Research Papers Accepted to CVPR 2023
Research from the department has been accepted to the 2023 Computer Vision and Pattern Recognition (CVPR) Conference. The annual event explores machine learning, artificial intelligence, and computer vision research and its applications.
CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
Samir Yitzhak Gadre (Columbia University), Mitchell Wortsman (University of Washington), Gabriel Ilharco (University of Washington), Ludwig Schmidt (University of Washington), Shuran Song (Columbia University)
For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference). We explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, we investigate a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, we introduce the Pasture benchmark, which considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects. We conduct an in-depth empirical study by directly deploying 21 CoW baselines across Habitat, RoboTHOR, and Pasture. In total, we evaluate over 90k navigation episodes and find that (1) CoW baselines often struggle to leverage language descriptions, but are proficient at finding uncommon objects. (2) A simple CoW, with CLIP-based object localization and classical exploration — and no additional training — matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data. This same CoW provides a 15.6 percentage point improvement in success over a state-of-the-art RoboTHOR ZSON model.
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval
Xudong Lin (Columbia University), Simran Tiwari (Columbia University), Shiyuan Huang (Columbia University), Manling Li (UIUC), Mike Zheng Shou (National University of Singapore), Heng Ji (UIUC), Shih-Fu Chang (Columbia University)
Multi-channel video-language retrieval requires models to understand information from different channels (e.g. video+question, video+speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text, e.g., CLIP; text contrastive models have been extensively studied recently for their strong ability to produce discriminative sentence embeddings, e.g., SimCSE. However, there is no clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or discrete text tokens; for the fusion method, we explore the use of a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yield the best performance, which can even outperform the state of the art on the iVQA and How2QA datasets without additional training on millions of video-text data. Further analysis shows that this is because representing videos as text tokens captures the key visual information, and text tokens are naturally aligned with text models that are strong retrievers after the contrastive pretraining process. All of this empirical analysis establishes a solid foundation for future research on affordable and upgradable multimodal intelligence.
DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection. Jiawei Ma (Columbia University), Yulei Niu (Columbia University), Jincheng Xu (Columbia University), Shiyuan Huang (Columbia University), Guangxing Han (Columbia University), Shih-Fu Chang (Columbia University)
Generalized few-shot object detection aims to achieve precise detection both on base classes with abundant annotations and on novel classes with limited training data. Existing approaches either enhance few-shot generalization at the cost of base-class performance, or maintain high precision on base classes with only limited improvement in novel-class adaptation. In this paper, we point out that the reason is insufficient Discriminative feature learning for all of the classes. We therefore propose a new training framework, DiGeo, to learn Geometry-aware features with inter-class separation and intra-class compactness. To guide the separation of feature clusters, we derive an offline simplex equiangular tight frame (ETF) classifier whose weights serve as class centers and are maximally and equally separated. To tighten the cluster for each class, we include adaptive class-specific margins in the classification loss and encourage the features to stay close to the class centers. Experimental studies on two few-shot benchmark datasets (VOC, COCO) and one long-tail dataset (LVIS) demonstrate that, with a single model, our method can effectively improve generalization on novel classes without hurting detection of base classes.
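The simplex ETF mentioned above is a concrete geometric object: K unit-norm class centers in R^d whose pairwise cosine similarity is the maximal equal separation, -1/(K-1). A minimal sketch of its construction follows; the use of QR to obtain the orthonormal basis is our assumption for illustration, not necessarily the paper's exact recipe.

```python
import numpy as np

def simplex_etf(K, d, seed=0):
    """K maximally-and-equally-separated unit vectors in R^d (d >= K)."""
    assert d >= K, "need ambient dimension >= number of classes"
    rng = np.random.default_rng(seed)
    # Orthonormal basis U in R^{d x K} via QR of a random Gaussian matrix.
    U, _ = np.linalg.qr(rng.standard_normal((d, K)))
    # Center and rescale so every column has unit norm and every pair of
    # columns has inner product exactly -1/(K-1).
    M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)
    return (U @ M).T  # shape (K, d): one fixed class center per row

W = simplex_etf(4, 16)
G = W @ W.T
print(np.round(G, 3))  # 1.0 on the diagonal, -1/3 everywhere else
```

Freezing such weights as the classifier, as DiGeo does offline, fixes the geometry of the feature space in advance: no matter how imbalanced base and novel classes are, their centers cannot collapse toward one another.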
Supervised Masked Knowledge Distillation for Few-Shot Transformers. Han Lin (Columbia University), Guangxing Han (Columbia University), Jiawei Ma (Columbia University), Shiyuan Huang (Columbia University), Xudong Lin (Columbia University), Shih-Fu Chang (Columbia University)
Vision Transformers (ViTs) achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. However, under few-shot learning (FSL) settings on small datasets with only a few labeled examples, ViTs tend to overfit and suffer severe performance degradation due to the absence of CNN-like inductive biases. Previous works in FSL avoid this problem either with the help of self-supervised auxiliary losses or through dexterous use of label information in supervised settings. But the gap between self-supervised and supervised few-shot Transformers is still unfilled. Inspired by recent advances in self-supervised knowledge distillation and masked image modeling (MIM), we propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers which incorporates label information into self-distillation frameworks. Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens, and introduce the challenging task of masked patch token reconstruction across intra-class images. Experimental results on four few-shot classification benchmark datasets show that our method, despite its simple design, outperforms previous methods by a large margin and achieves a new state of the art. Detailed ablation studies confirm the effectiveness of each component of our model. Code for this paper is available here: this https URL .
FLEX: Full-Body Grasping Without Full-Body Grasps. Purva Tendulkar (Columbia University), Dídac Surís (Columbia University), Carl Vondrick (Columbia University)
Synthesizing 3D human avatars interacting realistically with a scene is an important problem with applications in AR/VR, video games and robotics. Towards this goal, we address the task of generating a virtual human — hands and full body — grasping everyday objects. Existing methods approach this problem by collecting a 3D dataset of humans interacting with objects and training on this data. However, 1) these methods do not generalize to different object positions and orientations, or to the presence of furniture in the scene, and 2) the diversity of their generated full-body poses is very limited. In this work, we address all the above challenges to generate realistic, diverse full-body grasps in everyday scenes without requiring any 3D full-body grasping data. Our key insight is to leverage the existence of both full-body pose and hand grasping priors, composing them using 3D geometrical constraints to obtain full-body grasps. We empirically validate that these constraints can generate a variety of feasible human grasps that are superior to baselines both quantitatively and qualitatively. See our webpage for more details: this https URL .
Humans As Light Bulbs: 3D Human Reconstruction From Thermal Reflection. Ruoshi Liu (Columbia University), Carl Vondrick (Columbia University)
The relatively hot temperature of the human body causes people to turn into long-wave infrared light sources. Since this emitted light has a larger wavelength than visible light, many surfaces in typical scenes act as infrared mirrors with strong specular reflections. We exploit the thermal reflections of a person onto objects in order to locate their position and reconstruct their pose, even if they are not visible to a normal camera. We propose an analysis-by-synthesis framework that jointly models the objects, people, and their thermal reflections, which combines generative models with differentiable rendering of reflections. Quantitative and qualitative experiments show our approach works in highly challenging cases, such as with curved mirrors or when the person is completely unseen by a normal camera.
Tracking Through Containers and Occluders in the Wild. Basile Van Hoorick (Columbia University), Pavel Tokmakov (Toyota Research Institute), Simon Stent (Woven Planet), Jie Li (Toyota Research Institute), Carl Vondrick (Columbia University)
Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment. We set up a task where the goal is, given a video sequence, to segment both the projected extent of the target object and the surrounding container or occluder whenever one exists. To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment. We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable performance gap before we can claim a tracking model to have acquired a true notion of object permanence.
Doubly Right Object Recognition: A Why Prompt for Visual Rationales. Chengzhi Mao (Columbia University), Revant Teotia (Columbia University), Amrutha Sundar (Columbia University), Sachit Menon (Columbia University), Junfeng Yang (Columbia University), Xin Wang (Microsoft Research), Carl Vondrick (Columbia University)
Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a “doubly right” object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels as well as the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a “why prompt,” which adapts large visual representations to produce correct rationales. Visualizations and empirical experiments show that our prompts significantly improve performance on doubly right object recognition, in addition to zero-shot transfer to unseen tasks and datasets.
What You Can Reconstruct From a Shadow. Ruoshi Liu (Columbia University), Sachit Menon (Columbia University), Chengzhi Mao (Columbia University), Dennis Park (Toyota Research Institute), Simon Stent (Woven Planet), Carl Vondrick (Columbia University)
3D reconstruction is a fundamental problem in computer vision, and the task is especially challenging when the object to reconstruct is partially or fully occluded. We introduce a method that uses the shadows cast by an unobserved object to infer the possible 3D volumes under occlusion. We create a differentiable image formation model that allows us to jointly infer the 3D shape of an object, its pose, and the position of a light source. Since the approach is end-to-end differentiable, we are able to integrate learned priors of object geometry to generate realistic 3D shapes of different object categories. Experiments and visualizations show that the method is able to generate multiple possible solutions that are consistent with the observed shadow. Our approach works even when the position of the light source and the object pose are both unknown, and it is robust to real-world images where the ground-truth shadow mask is unknown.
CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language. Aditya Sanghi (Autodesk Research), Rao Fu (Brown University), Vivian Liu (Columbia University), Karl D.D. Willis (Autodesk Research), Hooman Shayani (Autodesk Research), Amir H. Khasahmadi (Autodesk Research), Srinath Sridhar (Brown University), Daniel Ritchie (Brown University)
Recent works have demonstrated that natural language can be used to generate and edit 3D shapes. However, these methods generate shapes with limited fidelity and diversity. We introduce CLIP-Sculptor, a method to address these constraints by producing high-fidelity and diverse 3D shapes without the need for (text, shape) pairs during training. CLIP-Sculptor achieves this through a multi-resolution approach that first generates in a low-dimensional latent space and then upscales to a higher resolution for improved shape fidelity. For improved shape diversity, we use a discrete latent space which is modeled using a transformer conditioned on CLIP’s image-text embedding space. We also present a novel variant of classifier-free guidance, which improves the accuracy-diversity trade-off. Finally, we perform extensive experiments demonstrating that CLIP-Sculptor outperforms state-of-the-art baselines.
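The abstract does not spell out its classifier-free guidance variant, so as context here is only the standard classifier-free guidance update it builds on: the model is queried with and without the condition, and the two predictions are extrapolated by a guidance scale. A minimal sketch:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    # Standard classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one. scale = 1 recovers the plain
    # conditional prediction; scale > 1 trades diversity for fidelity to
    # the condition (here, CLIP's text/image embedding).
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # model output without the condition
eps_c = np.array([1.0, -1.0])  # model output with the condition
print(cfg_combine(eps_u, eps_c, 2.0))  # [ 2. -2.]
```

The accuracy-diversity trade-off the abstract mentions is exactly the choice of `scale`: larger values follow the text prompt more faithfully but concentrate samples on fewer modes.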
Multi-Constraint Transferable Generative Adversarial Networks for Cross-Modal Brain Image Synthesis
- Published: 28 May 2024
- Yawen Huang 1,
- Hao Zheng 1,
- Yuexiang Li 2,
- Feng Zheng 3,
- Xiantong Zhen 4,
- GuoJun Qi 5,
- Ling Shao 6 &
- Yefeng Zheng 1
Recent progress in generative models has led to drastic growth of research in image generation. Existing approaches show visually compelling results by learning multi-modal distributions, but they still lack realism, especially in certain scenarios such as medical image synthesis. In this paper, we propose a novel Brain Generative Adversarial Network (BrainGAN) that explores GANs with multi-constraint and transferable properties for cross-modal brain image synthesis. We formulate BrainGAN by introducing a unified framework with new constraints that simultaneously enhance modal matching, texture details, and anatomical structure. We show how BrainGAN can learn meaningful tissue representations with the rich variability of brain images. In addition to generating 3D volumes that are visually indistinguishable from real ones, we model adversarial discriminators and segmentors jointly, along with the proposed cost functions, which force our networks to synthesize brain MRIs with realistic textures conditioned on anatomical structures. BrainGAN is evaluated on three public datasets, where it consistently outperforms other state-of-the-art approaches by a large margin, advancing cross-modal synthesis of brain images both visually and practically.
https://brain-development.org/ixi-dataset/ .
https://insight-journal.org/midas/collection/view/190 .
https://www.med.upenn.edu/sbia/brats2018/data.html .
Note that the segmentation mask is not true “ground truth”; the Dice score is therefore calculated against noisy labels.
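For reference, the Dice score used in the footnote above compares a predicted binary mask against a reference mask as Dice = 2|A ∩ B| / (|A| + |B|); a minimal sketch, where the empty-vs-empty convention is our assumption:

```python
import numpy as np

def dice_score(pred, ref):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, ref = np.asarray(pred, dtype=bool), np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 1.0  # both masks empty: conventionally treated as perfect
    return 2.0 * np.logical_and(pred, ref).sum() / denom

# One overlapping voxel out of 2 predicted + 1 reference: 2*1/(2+1)
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.6666666666666666
```

When the reference mask is itself noisy, as the footnote warns, the score bounds agreement with the noisy labels rather than true segmentation quality.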
Author information
Authors and affiliations.
Jarvis Research Center, Tencent YouTu Lab, Shenzhen, China
Yawen Huang, Hao Zheng & Yefeng Zheng
Medical AI ReSearch (MARS) Group, Guangxi Key Laboratory for Genomic and Personalized Medicine, Guangxi Medical University, Nanning, 530021, Guangxi, China
Yuexiang Li
Southern University of Science and Technology, Shenzhen, China
Central Research Institute, United Imaging Healthcare Co., Ltd., Beijing, China
Xiantong Zhen
University of Central Florida, Orlando, FL, USA
UCAS-Terminus AI Lab, University of Chinese Academy of Sciences, Beijing, 100049, China
Corresponding authors
Correspondence to Yuexiang Li or Yefeng Zheng .
Additional information
Communicated by Paolo Rota.
About this article
Huang, Y., Zheng, H., Li, Y. et al. Multi-Constraint Transferable Generative Adversarial Networks for Cross-Modal Brain Image Synthesis. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02109-4
Received : 03 April 2023
Accepted : 22 April 2024
Published : 28 May 2024
DOI : https://doi.org/10.1007/s11263-024-02109-4
- Image synthesis
- Cross-modal
- Multi-constraint
- Generative adversarial network
Two big computer vision papers boost prospect of safer self-driving vehicles
New chip and camera technology bring closer potential of hands-free road time.
Like nuclear fusion and jet-packs, the self-driving car is a long-promised technology that has stalled for years. Yet, armed with new research, boffins think they have found potential improvements.…
Citizens of Phoenix, San Francisco, and Los Angeles are able to take one of Waymo's self-driving taxis, first introduced to the public in December 2020. But they have not been without their glitches. Just last month in San Francisco, for example, one of the taxi service's autonomous vehicles drove down the wrong side of the street to pass a unicycle. In December last year, a Waymo vehicle hit a backwards-facing pickup truck, resulting in a report with the US National Highway Traffic Safety Administration (NHTSA) and a software update.
But this week, not one but two groups of researchers bidding to improve the performance of self-driving cars and other autonomous vehicles have published papers in the international science journal Nature.
A design for a new chip geared towards autonomous vehicles has arrived from China. Tsinghua University's Luping Shi and colleagues have taken inspiration from the human visual system, combining fast but low-accuracy event-based detection with slower but more accurate visualization of a full image.
The researchers were able to show the chip — dubbed Tianmouc — could process pixel arrays quickly and robustly in an automotive driving perception system.
In a paper published today, the authors said: "We demonstrate the integration of a Tianmouc chip into an autonomous driving system, showcasing its abilities to enable accurate, fast and robust perception, even in challenging corner cases on open roads. The primitive-based complementary sensing paradigm helps in overcoming fundamental limitations in developing vision systems for diverse open-world applications."
In a separate paper, Davide Scaramuzza, University of Zurich robotics and perception professor, and his colleagues adopt a similar hybrid approach but apply it to camera technologies.
Cameras for self-driving vehicles navigate a trade-off between bandwidth and latency. While high-res color cameras have good resolution, they require high bandwidth to detect rapid changes. Conversely, reducing the bandwidth increases latency, affecting the timely processing of data for potentially life-saving decision making.
To get out of this bind, the Swiss-based researchers developed a hybrid camera combining event processing with high-bandwidth image processing. Event cameras only record intensity changes and report them as sparse measurements, so the system does not suffer from the bandwidth/latency trade-off.
The event camera is used to detect changes in the blind time between image frames. The event data are converted into a graph that changes over time and connects nearby points, and this graph is processed locally. The resulting hybrid object detector reduces detection time in dangerous high-speed situations, according to an explanatory video.
In their paper, the authors say: "Our method exploits the high temporal resolution and sparsity of events and the rich but low temporal resolution information in standard images to generate efficient, high-rate object detections, reducing perceptual and computational latency."
They argue their use of a 20 frames per second RGB camera plus an event camera can achieve the same latency as a 5,000-fps camera with the bandwidth of a 45-fps camera without compromising accuracy.
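The latency and bandwidth figures above can be sanity-checked with back-of-the-envelope arithmetic. The resolution (640x480) and 3 bytes per pixel below are illustrative assumptions, not figures from the paper; for a frame camera, worst-case latency is one frame period and bandwidth scales linearly with frame rate.

```python
# Back-of-the-envelope frame-camera trade-off: latency falls with frame
# rate, but bandwidth grows with it. Resolution and pixel depth are
# assumed values for illustration only.
def frame_camera(fps, width=640, height=480, bytes_per_pixel=3):
    latency_ms = 1000.0 / fps                       # worst case: one frame period
    bandwidth_mb_s = fps * width * height * bytes_per_pixel / 1e6
    return latency_ms, bandwidth_mb_s

for fps in (20, 45, 5000):
    latency, bw = frame_camera(fps)
    print(f"{fps:>5} fps: {latency:8.2f} ms latency, {bw:10.1f} MB/s")
```

Under these assumptions a 5,000-fps camera offers 0.2 ms latency but needs roughly a hundred times the bandwidth of a 45-fps camera, which is the gap the hybrid event/frame design claims to close.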
"Our approach paves the way for efficient and robust perception in edge-case scenarios by uncovering the potential of event cameras," the authors write.
With a hybrid approach to both cameras and data processing in the offing, more widespread adoption of self-driving vehicles may be just around the corner. ®
Title: Controllable Longer Image Animation with Diffusion Models
Abstract: Generating realistic animated videos from static images is an important area of research in computer vision. Methods based on physical simulation and motion prediction have achieved notable advances, but they are often limited to specific object textures and motion trajectories, failing to exhibit highly complex environments and physical dynamics. In this paper, we introduce an open-domain controllable image animation method using motion priors with video diffusion models. Our method achieves precise control over the direction and speed of motion in the movable region by extracting the motion field information from videos and learning moving trajectories and strengths. Current pretrained video generation models are typically limited to producing very short videos, typically less than 30 frames. In contrast, we propose an efficient long-duration video generation method based on noise rescheduling specifically tailored for image animation tasks, facilitating the creation of videos over 100 frames in length while maintaining consistency in scene content and motion coordination. Specifically, we decompose the denoising process into two distinct phases: the shaping of scene contours and the refining of motion details. Then we reschedule the noise so that the generated frame sequences maintain long-distance noise correlation. We conducted extensive experiments with 10 baselines, encompassing both commercial tools and academic methodologies, which demonstrate the superiority of our method. Our project page: this https URL
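The abstract's key idea of keeping long-distance noise correlation across frames can be illustrated with a toy mixing scheme. This is an assumed construction for demonstration, not the paper's actual noise reschedule: each frame's noise blends a shared base component with fresh per-frame noise, so even distant frames remain statistically correlated.

```python
# Toy illustration of correlated per-frame noise (an assumed mixing
# scheme, not the paper's reschedule): every frame shares a common base
# noise component, so distant frames stay correlated.
import random

random.seed(0)

def correlated_noise(n_frames, n_dims, alpha=0.8):
    """alpha is the fraction of each frame's variance from the shared base."""
    base = [random.gauss(0, 1) for _ in range(n_dims)]
    frames = []
    for _ in range(n_frames):
        fresh = [random.gauss(0, 1) for _ in range(n_dims)]
        # Mix so each frame keeps unit variance overall.
        frames.append([(alpha ** 0.5) * b + ((1 - alpha) ** 0.5) * f
                       for b, f in zip(base, fresh)])
    return frames

def correlation(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

frames = correlated_noise(n_frames=100, n_dims=512, alpha=0.8)
# First and last frame remain correlated through the shared base,
# with sample correlation near alpha.
print(round(correlation(frames[0], frames[-1]), 2))
```

Independent noise per frame would give correlation near zero between distant frames; the shared base keeps it near alpha, which is the property that preserves scene consistency over a 100-plus-frame sequence.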