# Computer Vision

## Semantic Segmentation
- Tumor Segmentation
- Panoptic Segmentation
- 3D Semantic Segmentation
- Weakly-Supervised Semantic Segmentation
## Representation Learning
- Disentanglement
- Graph Representation Learning
- Sentence Embeddings
- Network Embedding
## Classification
- Text Classification
- Graph Classification
- Audio Classification
- Medical Image Classification
## Object Detection
- 3D Object Detection
- Real-Time Object Detection
- RGB Salient Object Detection
- Few-Shot Object Detection
## Image Classification
- Out of Distribution (OOD) Detection
- Few-Shot Image Classification
- Fine-Grained Image Classification
- Semi-Supervised Image Classification
## 2D Object Detection
- Edge Detection
- Thermal Image Segmentation
- Open Vocabulary Object Detection
## Reinforcement Learning (RL)
- Off-Policy Evaluation
- Multi-Objective Reinforcement Learning
- 3D Point Cloud Reinforcement Learning
- Deep Hashing
- Table Retrieval

## Domain Adaptation
- Unsupervised Domain Adaptation
- Domain Generalization
- Test-time Adaptation
- Source-Free Domain Adaptation

## Image Generation
- Image-to-Image Translation
- Text-to-Image Generation
- Image Inpainting
- Conditional Image Generation
## Data Augmentation
- Image Augmentation
- Text Augmentation
## Autonomous Vehicles
- Autonomous Driving
- Self-Driving Cars
- Simultaneous Localization and Mapping
- Autonomous Navigation

## Denoising
- Image Denoising
- Color Image Denoising
- SAR Image Despeckling
- Grayscale Image Denoising

## Meta-Learning
- Few-Shot Learning
- Sample Probing
- Universal Meta-Learning

## Contrastive Learning
## Super-Resolution
- Image Super-Resolution
- Video Super-Resolution
- Multi-Frame Super-Resolution
- Reference-based Super-Resolution
## Pose Estimation
- 3D Human Pose Estimation
- Keypoint Detection
- 3D Pose Estimation
- 6D Pose Estimation
## Self-Supervised Learning
- Point Cloud Pre-training
- Unsupervised Video Clustering
- 2D Semantic Segmentation
- Image Segmentation

## Text Style Transfer
- Scene Parsing
- Reflection Removal
## Visual Question Answering (VQA)
- Visual Question Answering
- Machine Reading Comprehension
- Chart Question Answering
- Embodied Question Answering
## Depth Estimation
- 3D Reconstruction
- Neural Rendering
- 3D Face Reconstruction
## Sentiment Analysis
- Aspect-Based Sentiment Analysis (ABSA)
- Multimodal Sentiment Analysis
- Aspect Sentiment Triplet Extraction
- Twitter Sentiment Analysis
## Anomaly Detection
- Unsupervised Anomaly Detection
- One-Class Classification
- Supervised Anomaly Detection
- Anomaly Detection in Surveillance Videos
- Temporal Action Localization
- Video Understanding

## Video Generation
- Video Object Segmentation
- Action Classification
## Activity Recognition
- Action Recognition
- Human Activity Recognition
- Egocentric Activity Recognition
- Group Activity Recognition
- 3D Object Super-Resolution

## Few-Shot Learning
- One-Shot Learning
- Few-Shot Semantic Segmentation
- Cross-Domain Few-Shot
- Unsupervised Few-Shot Learning
## Medical Image Segmentation
- Lesion Segmentation
- Brain Tumor Segmentation
- Cell Segmentation
- Skin Lesion Segmentation

## Monocular Depth Estimation
- Stereo Depth Estimation
- Depth and Camera Motion
- 3D Depth Estimation
- Exposure Fairness

## Optical Character Recognition (OCR)
- Active Learning
- Handwriting Recognition
- Handwritten Digit Recognition
- Irregular Text Recognition

## Instance Segmentation
- Referring Expression Segmentation
- 3D Instance Segmentation
- Real-time Instance Segmentation
- Unsupervised Object Segmentation
## Facial Recognition and Modelling
- Face Recognition
- Face Swapping
- Face Detection
- Facial Expression Recognition (FER)
- Face Verification
## Object Tracking
- Multi-Object Tracking
- Visual Object Tracking
- Multiple Object Tracking
- Cell Tracking
## Zero-Shot Learning
- Generalized Zero-Shot Learning
- Compositional Zero-Shot Learning
- Multi-Label Zero-Shot Learning

## Quantization
- Data Free Quantization
- UNet Quantization

## Continual Learning
- Class Incremental Learning
- Continual Named Entity Recognition
- Unsupervised Class-Incremental Learning
## Action Recognition
- Action Recognition In Videos
- 3D Action Recognition
- Self-Supervised Action Recognition
- Few-Shot Action Recognition
## Scene Understanding
- Scene Text Recognition
- Scene Graph Generation
- Scene Recognition
## Adversarial Attack
- Backdoor Attack
- Adversarial Text
- Adversarial Attack Detection
- Real-World Adversarial Attack
- Active Object Detection

## Image Retrieval
- Sketch-Based Image Retrieval
- Content-Based Image Retrieval
- Composed Image Retrieval (CoIR)
- Medical Image Retrieval
## Dimensionality Reduction
- Supervised Dimensionality Reduction
- Online Nonnegative CP Decomposition

## Emotion Recognition
- Speech Emotion Recognition
- Emotion Recognition in Conversation
- Multimodal Emotion Recognition
- Emotion-Cause Pair Extraction
## 3D Object Detection
- Monocular 3D Object Detection
- 3D Object Detection From Stereo Images
- Multiview Detection
- Robust 3D Object Detection

## Image Reconstruction
- MRI Reconstruction
- Film Removal
## Style Transfer
- Image Stylization
- Font Style Transfer
- Style Generalization
- Face Transfer

## Optical Flow Estimation
- Video Stabilization
## Image Captioning
- 3D Dense Captioning
- Controllable Image Captioning
- Aesthetic Image Captioning
- Relational Captioning
## Action Localization
- Action Segmentation
- Spatio-Temporal Action Localization

## Person Re-Identification
- Unsupervised Person Re-Identification
- Video-Based Person Re-Identification
- Generalizable Person Re-Identification
- Cloth-Changing Person Re-Identification

## Image Restoration
- Demosaicking
- Spectral Reconstruction
- Underwater Image Restoration
- JPEG Artifact Correction
- Visual Relationship Detection

## Lighting Estimation
- 3D Room Layouts From A Single RGB Panorama
- Road Scene Understanding

## Action Detection
- Skeleton Based Action Recognition
- Online Action Detection
- Audio-Visual Active Speaker Detection

## Metric Learning
## Object Recognition
- 3D Object Recognition
- Continuous Object Recognition
- Depiction Invariant Object Recognition
- Monocular 3D Human Pose Estimation
- Pose Prediction
- 3D Multi-Person Pose Estimation
- 3D Human Pose and Shape Estimation

## Image Enhancement
- Low-Light Image Enhancement
- Image Relighting
- De-Aliasing

## Multi-Label Classification
- Missing Labels
- Extreme Multi-Label Classification
- Hierarchical Multi-Label Classification
- Medical Code Prediction

## Continuous Control
- Steering Control
- Drone Controller
## Video Object Segmentation
- Semi-Supervised Video Object Segmentation
- Unsupervised Video Object Segmentation
- Referring Video Object Segmentation
- Video Salient Object Detection
- 3D Face Modelling
## Trajectory Prediction
- Trajectory Forecasting
- Human Motion Prediction
- Out-of-Sight Trajectory Prediction
- Multivariate Time Series Imputation

## Image Quality Assessment
- No-Reference Image Quality Assessment
- Blind Image Quality Assessment
- Aesthetics Quality Assessment
- Stereoscopic Image Quality Assessment

## Object Localization
- Weakly-Supervised Object Localization
- Image-Based Localization
- Unsupervised Object Localization
- Monocular 3D Object Localization

## Novel View Synthesis
- Novel LiDAR View Synthesis
- Ground Video Synthesis from Satellite Image
- Blind Image Deblurring
- Single-Image Blind Deblurring
- Out-of-Distribution Detection

## Video Semantic Segmentation
- Camera Shot Segmentation
Cloud removal.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000444-66c74076.jpg)
Facial Inpainting
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/2c8222e3-b09e-4ffb-b284-211afa8086a8.jpg)
Fine-Grained Image Inpainting
Instruction following, visual instruction following, change detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/42790a97-c01a-4afa-aa7c-3b72d5a52296.jpg)
Semi-supervised Change Detection
Saliency detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000333-b088b240.jpg)
Saliency Prediction
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000331-df0072c9.jpg)
Co-Salient Object Detection
Video saliency detection, unsupervised saliency detection, image compression.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000730-aec83530.jpg)
Feature Compression
Jpeg compression artifact reduction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000729-ed7408b1.jpg)
Lossy-Compression Artifact Reduction
Color image compression artifact reduction, explainable artificial intelligence, explainable models, explanation fidelity evaluation, fad curve analysis, prompt engineering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/118d0d79-54e4-49a7-ab9a-e3606e82103d.jpg)
Visual Prompting
Image registration.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000522-69cd01de_zgJiQk0.jpg)
Unsupervised Image Registration
Ensemble learning, visual reasoning.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000242-3efd20a2.jpg)
Visual Commonsense Reasoning
Salient object detection, saliency ranking, visual tracking.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000550-8dc245dd.jpg)
Point Tracking
Rgb-t tracking, real-time visual tracking.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001734-32c857c9.jpg)
RF-based Visual Tracking
3d point cloud classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/0eeeb7f4-67f5-4ec3-a76b-20ba51efae6a.jpg)
3D Object Classification
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/a967964d-f619-4d68-b753-c0acae259fa0.jpg)
Few-Shot 3D Point Cloud Classification
Supervised-only 3d point cloud classification, zero-shot transfer 3d point cloud classification, motion estimation, 2d classification.
![research paper about computer vision research paper about computer vision](https://raw.githubusercontent.com/cambridge-mlg/miracle/master/figures/mnist_comp.png)
Neural Network Compression
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000139-7e4d4874.jpg)
Music Source Separation
Cell detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/61a791e1-5ac0-40cb-9cdf-5422434bbbfe.jpg)
Plant Phenotyping
Open-set classification, image manipulation detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000625-bb786447.jpg)
Zero Shot Skeletal Action Recognition
Generalized zero shot skeletal action recognition, whole slide images, activity prediction, motion prediction, cyber attack detection, sequential skip prediction, gesture recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000632-7fc5c90c.jpg)
Hand Gesture Recognition
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000631-4ed1fa07.jpg)
Hand-Gesture Recognition
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001732-5fa30f4b.jpg)
RF-based Gesture Recognition
Video captioning.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000542-44908c53.jpg)
Dense Video Captioning
Boundary captioning, visual text correction, audio-visual video captioning, video question answering.
![research paper about computer vision research paper about computer vision](https://raw.githubusercontent.com/jayleicn/TVQA/master/./imgs/example_main.png)
Zero-Shot Video Question Answer
Few-shot video question answering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/1ebcc422-6179-47b1-b0d5-d4883043a38a.jpg)
Robust 3D Semantic Segmentation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/ed554fc5-ed7e-4bb2-9abb-ea0c385e0892.jpg)
Real-Time 3D Semantic Segmentation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/8666f2ea-9285-487a-b0b1-5b462667f66e.jpg)
Unsupervised 3D Semantic Segmentation
Furniture segmentation, point cloud registration.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000520-849c9f31.jpg)
Image to Point Cloud Registration
Text detection, medical diagnosis.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000293-ad63354d.jpg)
Alzheimer's Disease Detection
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002230-9dfeec51.jpg)
Retinal OCT Disease Classification
Blood cell count, thoracic disease classification, 3d point cloud interpolation, visual grounding.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/8bdd62b4-c02c-43ea-bbb5-77882024286e.jpg)
Person-centric Visual Grounding
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/441e442f-a774-406d-89d8-4a103876ad91.jpg)
Phrase Extraction and Grounding (PEG)
Visual odometry.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000070-1705b341.jpg)
Face Anti-Spoofing
Monocular visual odometry.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000771-3b256a6c.jpg)
Hand Pose Estimation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000884-78baab10.jpg)
Hand Segmentation
Gesture-to-gesture translation, rain removal.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000440-be3759b0.jpg)
Single Image Deraining
Image clustering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000702-79f3c03f.jpg)
Online Clustering
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000702-3c2f553a.jpg)
Face Clustering
Multi-view subspace clustering, multi-modal subspace clustering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000710-8c508a28.jpg)
Image Dehazing
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000100-8c508a28.jpg)
Single Image Dehazing
Colorization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/34171ffd-b64f-4c99-a8ae-25db8fc5921d.jpg)
Line Art Colorization
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/ea20e877-5ede-4a85-9b69-b2e91ff6838f.jpg)
Point-interactive Image Colorization
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/eb621262-556c-40c9-b6e3-71475de11079.jpg)
Color Mismatch Correction
Robot navigation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000547-5ff26267.jpg)
PointGoal Navigation
Social navigation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002538-5c74c6f5.jpg)
Sequential Place Learning
Image manipulation, conformal prediction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000108-08b670c5.jpg)
Unsupervised Image-To-Image Translation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001067-0d49abc9.jpg)
Synthetic-to-Real Translation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000109-90fbb2e0.jpg)
Multimodal Unsupervised Image-To-Image Translation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/e1f688c7-99a7-4198-8cf4-7ffcc0ffde8b.jpg)
Cross-View Image-to-Image Translation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002229-c773675f.jpg)
Fundus to Angiography Generation
Visual place recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000663-4df8d036.jpg)
Indoor Localization
3d place recognition, image editing, rolling shutter correction, shadow removal, multimodel-guided image editing, joint deblur and frame interpolation, multimodal fashion image editing, visual localization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000391-9bc2256b.jpg)
DeepFake Detection
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001775-6d14e362.jpg)
Synthetic Speech Detection
Human detection of deepfakes, multimodal forgery detection, stereo matching, object reconstruction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000750-67b42af7.jpg)
3D Object Reconstruction
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000471-ceeed704.jpg)
Crowd Counting
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000239-ab2f099a.jpg)
Visual Crowd Analysis
Group detection in crowds, human-object interaction detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000617-9644872d.jpg)
Affordance Recognition
Image deblurring, low-light image deblurring and enhancement, earth observation, video quality assessment, video alignment, temporal sentence grounding, long-video activity recognition, point cloud classification, jet tagging, few-shot point cloud classification, image matching.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/4810ca04-9af3-4e79-9bf9-54f786052a27.jpg)
Semantic correspondence
Patch matching, set matching.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/b478c024-2fd4-46ce-8128-0da3bb642b75.jpg)
Matching Disparate Images
Hyperspectral.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000818-f54abafb.jpg)
Hyperspectral Image Classification
Hyperspectral unmixing, hyperspectral image segmentation, classification of hyperspectral images, document text classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/562d955c-73b9-4bab-b731-53860c9e03bc.jpg)
Learning with noisy labels
Multi-label classification of biomedical texts, political salient issue orientation detection, 3d point cloud reconstruction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/21b7189c-5d64-46b0-aa33-3da648560eaa.jpg)
Weakly Supervised Action Localization
Weakly-supervised temporal action localization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000210-b1ee5c73.jpg)
Temporal Action Proposal Generation
Activity recognition in videos, scene classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000413-143ff75c.jpg)
2D Human Pose Estimation
Action anticipation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001498-3f9c6ea2_65GTaFO.jpg)
3D Face Animation
Semi-supervised human pose estimation, point cloud generation, point cloud completion, referring expression, reconstruction, 3d human reconstruction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000045-5164a6fa.jpg)
Single-View 3D Reconstruction
4d reconstruction, single-image-based hdr reconstruction, compressive sensing, keyword spotting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000085-ed8952fd.jpg)
Small-Footprint Keyword Spotting
Visual keyword spotting, scene text detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000507-55533bc2.jpg)
Curved Text Detection
Multi-oriented scene text detection, boundary detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000356-04900360.jpg)
Junction Detection
Camera calibration, image matting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/e3b47721-0830-4233-9253-d455d3be1c59.jpg)
Semantic Image Matting
Video retrieval, video-text retrieval, video grounding, video-adverb retrieval, replay grounding, composed video retrieval (covr), motion synthesis.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/cf274dc2-54c8-4d2b-b3db-fc42338ebb3e.jpg)
Motion Style Transfer
Temporal human motion composition, emotion classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000563-59ba915b.jpg)
Video Summarization
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/f39b8716-7e56-40a4-a528-823472ba7bfc.jpg)
Unsupervised Video Summarization
Supervised video summarization, document ai, document understanding, sensor fusion, superpixels, point cloud segmentation, remote sensing.
![research paper about computer vision research paper about computer vision](https://raw.githubusercontent.com/lehaifeng/RSI-CB/master/osm%E5%88%86%E5%B8%83%E5%9B%BE.png)
Remote Sensing Image Classification
Change detection for remote sensing images, building change detection for remote sensing images.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000484-0b85d8ec.jpg)
Segmentation Of Remote Sensing Imagery
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000485-0b85d8ec.jpg)
The Semantic Segmentation Of Remote Sensing Imagery
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002170-963c86db.jpg)
Few-Shot Transfer Learning for Saliency Prediction
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000847-b12abf24_wqnD1AJ.jpg)
Aerial Video Saliency Prediction
Document layout analysis.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/a118a5c0-8617-4374-ada1-a93e748ca0b5.jpg)
3D Anomaly Detection
Video anomaly detection, artifact detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000780-3d4e01ee.jpg)
Point cloud reconstruction
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/e19e9626-0c29-4ac5-a1e8-a65170ad7c2d.jpg)
3D Semantic Scene Completion
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/c6f9dd48-eb74-459b-9143-4b1d30b11dc8.jpg)
3D Semantic Scene Completion from a single RGB image
Garment reconstruction, face generation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/2ce93ab0-992a-429f-8005-09601dddcb1f.jpg)
Talking Head Generation
Talking face generation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000990-2d591218_2jUWb8G.jpg)
Face Age Editing
Facial expression generation, kinship face generation, cross-modal retrieval, image-text matching, multilingual cross-modal retrieval.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/dcfded35-e2e1-44e8-86da-c09af9cfafa9.jpg)
Zero-shot Composed Person Retrieval
Cross-modal retrieval on rsitmd, video instance segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002416-bc645653.jpg)
Privacy Preserving Deep Learning
Membership inference attack, human detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000497-6d1a4ae6.jpg)
Generalized Few-Shot Semantic Segmentation
Virtual try-on, scene flow estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/3cc22ae6-195d-4ff1-8dda-4271733f2ed6.jpg)
Self-supervised Scene Flow Estimation
3d classification, depth completion.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001115-db64b3a0.jpg)
Motion Forecasting
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000363-06d10c79.jpg)
Multi-Person Pose forecasting
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001736-1612b5d3.jpg)
Multiple Object Forecasting
Video editing, video temporal consistency, face reconstruction, object discovery, carla map leaderboard, dead-reckoning prediction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/33c33ac1-c08c-4d2c-925e-303af2f00b9c.jpg)
Generalized Referring Expression Segmentation
Gaze estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000848-7a7f5179.jpg)
Texture Synthesis
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/d5f1e0b9-1215-4f4f-ab92-8ea1e4f036d7.jpg)
Text-based Image Editing
Text-guided-image-editing.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/5f077531-cc40-4eec-be00-a6337e075dfe.jpg)
Zero-Shot Text-to-Image Generation
Concept alignment, conditional text-to-image synthesis, machine unlearning, continual forgetting, sign language recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000649-69922a8d.jpg)
Image Recognition
Fine-grained image recognition, license plate recognition, material recognition, multi-view learning, incomplete multi-view clustering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000287-fc5b698e.jpg)
Breast Cancer Detection
Skin cancer classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000288-c86f61d3.jpg)
Breast Cancer Histology Image Classification
Lung cancer diagnosis, classification of breast cancer histology images, gait recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/c4fc8e61-f5b3-4b2a-87a4-71293efe2e7c.jpg)
Multiview Gait Recognition
Gait recognition in the wild, human parsing.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001200-fb55e254.jpg)
Multi-Human Parsing
Pose tracking.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001299-897c396f.jpg)
3D Human Pose Tracking
Interactive segmentation, scene generation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001652-7e12f901.jpg)
3D Multi-Person Pose Estimation (absolute)
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001651-edc0c2f2.jpg)
3D Multi-Person Pose Estimation (root-relative)
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001649-ecb41cf2.jpg)
3D Multi-Person Mesh Recovery
Event-based vision.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/2eedb385-fb67-459a-a2c0-f7cc8feabc55.jpg)
Event-based Optical Flow
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/6cb48677-e623-4a5c-b329-5a974f537f44.jpg)
Event-Based Video Reconstruction
Event-based motion estimation, disease prediction, disease trajectory forecasting, object counting, training-free object counting, open-vocabulary object counting, interest point detection, homography estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002042-ac2cbf8e.jpg)
3D Hand Pose Estimation
Weakly supervised segmentation, facial landmark detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000441-787de252_HtXStMs.jpg)
Unsupervised Facial Landmark Detection
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001971-ec7de5c2.jpg)
3D Facial Landmark Localization
3d character animation from a single photo, scene segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/84b92b4e-836f-4a64-8e53-a972dc5dc618.jpg)
Dichotomous Image Segmentation
Activity detection, inverse rendering, temporal localization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000389-ae6f548d.jpg)
Language-Based Temporal Localization
Temporal defect localization, multi-label image classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/4eee2add-78aa-46df-b8a3-296919b49cb5.jpg)
Multi-label Image Recognition with Partial Labels
3d object tracking.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/d79d32e0-35b4-4fc9-a69c-d373e3097a39.jpg)
3D Single Object Tracking
Template matching, text-to-video generation, text-to-video editing, subject-driven video generation, camera localization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000474-5eb20b1e.jpg)
Camera Relocalization
Lidar semantic segmentation, visual dialog.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000243-1146f2d1.jpg)
Motion Segmentation
Relation network, intelligent surveillance.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000129-bd0ee47a.jpg)
Vehicle Re-Identification
Text spotting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000086-2094f367.jpg)
Disparity Estimation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000849-9022569c.jpg)
Few-Shot Class-Incremental Learning
Class-incremental semantic segmentation, non-exemplar-based class incremental learning, handwritten text recognition, handwritten document recognition, unsupervised text recognition, knowledge distillation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/tasks/d0f2dad1-32df-46e2-8686-ce09e263353c.png)
Data-free Knowledge Distillation
Self-knowledge distillation, moment retrieval.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002886-eacfc398.jpg)
Zero-shot Moment Retrieval
Text to video retrieval, partially relevant video retrieval, person search, decision making under uncertainty.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000177-776f95bc.jpg)
Uncertainty Visualization
Semi-supervised object detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/dfedaa2c-0eb8-4247-9cc0-db856cbf64ad.jpg)
Shadow Detection
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000504-1345a6a4.jpg)
Shadow Detection And Removal
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/01d8624d-00e2-4e32-8159-ae45c4a3edd5.jpg)
Unconstrained Lip-synchronization
Mixed reality, video inpainting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001621-e4fa630c.jpg)
Cross-corpus
Micro-expression recognition, micro-expression spotting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000448-d9c5224c.jpg)
3D Facial Expression Recognition
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000630-844318ed.jpg)
Smile Recognition
Future prediction, human mesh recovery, video enhancement.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000458-a47f7d65.jpg)
Face Image Quality Assessment
Lightweight face recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000452-a95d6931.jpg)
Age-Invariant Face Recognition
Synthetic face recognition, face quality assessment.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001626-3b0fd806.jpg)
3D Multi-Object Tracking
Real-time multi-object tracking, multi-animal tracking with identification, trajectory long-tail distribution for multi-object tracking, grounded multiple object tracking, image categorization, fine-grained visual categorization, overlapped 10-1, overlapped 15-1, overlapped 15-5, disjoint 10-1, disjoint 15-1.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/0d291bd4-583c-4a65-ac28-ef63253420fe.jpg)
Burst Image Super-Resolution
Stereo image super-resolution, satellite image super-resolution, multispectral image super-resolution, color constancy.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000724-0c23f7fd.jpg)
Few-Shot Camera-Adaptive Color Constancy
Hdr reconstruction, multi-exposure image fusion, open vocabulary semantic segmentation, zero-guidance segmentation, physics-informed machine learning, soil moisture estimation, deep attention, line detection, video reconstruction.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001827-3fb659a2_8tfLn4X.jpg)
Zero Shot Segmentation
Visual recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000718-75613b53.jpg)
Fine-Grained Visual Recognition
Image cropping, sign language translation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001449-2d0892c0.jpg)
Stereo Matching Hand
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000904-a0e8fdfb.jpg)
3D Absolute Human Pose Estimation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001751-381bcadd_22ti1hO.jpg)
Text-to-Face Generation
Image forensics, tone mapping, zero-shot action recognition, natural language transduction, video restoration.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/82695719-6c02-4632-9c43-ef66a18ab565.jpg)
Analog Video Restoration
Novel class discovery.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/8bec3e03-fc06-4d78-96bb-6a6722672130.jpg)
Transparent Object Detection
Transparent objects, surface normals estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001360-1808c9b9.jpg)
Hand-Object Pose
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000002042-8135da6d.jpg)
Grasp Generation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002043-e749e306.jpg)
3D Canonical Hand Pose Estimation
Breast cancer histology image classification (20% labels), cross-domain few-shot learning, texture classification, vision-language navigation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000067-333d5dfa.jpg)
Abnormal Event Detection In Video
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002615-8ca44059.jpg)
Semi-supervised Anomaly Detection
Infrared and visible image fusion.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000487-58d20eaf.jpg)
Image Animation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/5e0606c2-1094-410e-a125-c9e3e166aab2.jpg)
Image to 3D
Probabilistic deep learning, unsupervised few-shot image classification, generalized few-shot classification, pedestrian attribute recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/948c0e0f-81d8-4c08-b8d9-e504cb114382.jpg)
Steganalysis
![research paper about computer vision research paper about computer vision](https://raw.githubusercontent.com/tosmaster/imagevision/master/images/architecture.png)
Sketch Recognition
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000088-1f169c25.jpg)
Face Sketch Synthesis
Drawing pictures.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000053-9e6cc36d.jpg)
Photo-To-Caricature Translation
Spoof detection, face presentation attack detection, detecting image manipulation, cross-domain iris presentation attack detection, finger dorsal image spoof detection, computer vision techniques adopted in 3d cryogenic electron microscopy, single particle analysis, cryogenic electron tomography, highlight detection, iris recognition, pupil dilation, action quality assessment.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001368-1082b77e.jpg)
One-shot visual object segmentation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/23137faf-69e4-4e35-acec-5b817c16c737.jpg)
Unbiased Scene Graph Generation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/2a773854-1b01-410f-aaf3-a064d695d590.jpg)
Panoptic Scene Graph Generation
Image to video generation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/4a2d703b-aab3-4ce2-9cfe-9c742289b269.jpg)
Unconditional Video Generation
Automatic post-editing.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000257-2b560008_M7RFnV9.jpg)
Dense Captioning
Image stitching.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/ac8ffe2d-fdb4-447e-88e9-b4ad83e5bb38.jpg)
Multi-View 3D Reconstruction
Universal domain adaptation, action understanding, blind face restoration.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002731-0d3b184a.jpg)
Document Image Classification
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000704-356f65e7.jpg)
Face Reenactment
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/9b9603d6-560f-4a69-bc48-90b9da3cf086.jpg)
Geometric Matching
Human action generation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001744-8142135a.jpg)
Action Generation
Object categorization, person retrieval, text based person retrieval, surgical phase recognition, online surgical phase recognition, offline surgical phase recognition, human dynamics.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000619-0a3c8ab0.jpg)
3D Human Dynamics
Meme classification, hateful meme classification, severity prediction, intubation support prediction, cloud detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000502-dfb772c2.jpg)
Text-To-Image
Story visualization, complex scene breaking and synthesis, diffusion personalization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/f74a439f-308c-4f5d-89f3-0d294865d51c.jpg)
Diffusion Personalization Tuning Free
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/2d639db9-6da1-4cc0-a4bb-48efaead23f4.jpg)
Efficient Diffusion Personalization
Image fusion, pansharpening, image deconvolution.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000713-8500001e.jpg)
Image Outpainting
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/caca4ee3-d312-45f3-95bc-996f1c27034e.jpg)
Object Segmentation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002128-167778ba.jpg)
Camouflaged Object Segmentation
Landslide segmentation, text-line extraction, point clouds, point cloud video understanding, point cloud representation learning.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/b2c1d06e-20f2-4046-8329-7e239774aa84.jpg)
Semantic SLAM
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/575355c4-187c-40f0-9713-674fc2fc5cb1.jpg)
Object SLAM
Intrinsic image decomposition, line segment detection, table recognition, situation recognition, grounded situation recognition, motion detection, multi-target domain adaptation, sports analytics.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001767-3c6c5a0d.jpg)
Robot Pose Estimation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/926d9879-0d61-474f-ae7a-9c50a59ad13f.jpg)
Camouflaged Object Segmentation with a Single Task-generic Prompt
Image morphing, image shadow removal, person identification, visual prompt tuning, weakly-supervised instance segmentation, image smoothing, fake image detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001199-7f4bf1fb_0BXMP1S.jpg)
GAN image forensics
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001199-087e3c6c_qt4yHKY.jpg)
Fake Image Attribution
Image steganography, rotated mnist, contour detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000354-8ae991ad.jpg)
Face Image Quality
Lane detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/8c89efa3-0b9d-4e3e-82d9-08b38f55bcc7.jpg)
3D Lane Detection
Layout design, license plate detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/21e8bf09-6596-410d-b170-0432fcffb2b0.jpg)
Video Panoptic Segmentation
Viewpoint estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000851-d564eeeb.jpg)
Drone navigation
Drone-view target localization, value prediction, body mass index (bmi) prediction, multi-object tracking and segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/657c32e8-3131-400f-8474-224bad1a9b6e.jpg)
Occlusion Handling
Zero-shot transfer image classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/77dd7e5b-1335-4b2a-bcfd-9c388413b102.jpg)
3D Object Reconstruction From A Single Image
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000751-6cd9fb6b.jpg)
CAD Reconstruction
3d point cloud linear classification, crop classification, crop yield prediction, photo retouching, motion retargeting, shape representation of 3d point clouds, bird's-eye view semantic segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/0d834282-fd21-4e57-be69-d5c2ed538690.jpg)
Dense Pixel Correspondence Estimation
Human part segmentation.
![research paper about computer vision research paper about computer vision](https://raw.githubusercontent.com/facebookresearch/detectron/master/demo/output/33823288584_1d21cf0a26_k_example_output.jpg)
Multiview Learning
Person recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000131-3d972675.jpg)
Document Shadow Removal
Symmetry detection, traffic sign detection, video style transfer, referring image matting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/tasks/d6b93a69-819b-4ec2-af10-06ffa587bb16.jpg)
Referring Image Matting (Expression-based)
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/3733d707-0d2b-49c9-9b2b-a2a615e74ff8.jpg)
Referring Image Matting (Keyword-based)
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/db80a83a-a702-4edd-aad9-892a147df206.jpg)
Referring Image Matting (RefMatte-RW100)
Referring image matting (prompt-based), human interaction recognition, one-shot 3d action recognition, mutual gaze, affordance detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002845-bd6fff4d.jpg)
Gaze Prediction
Image forgery detection, image instance retrieval, amodal instance segmentation, image quality estimation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000091-e86362c7.jpg)
Image Similarity Search
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000992-fce976fd.jpg)
Precipitation Forecasting
Referring expression generation, road damage detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000480-bdfe2fa5.jpg)
Space-time Video Super-resolution
Video matting.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002360-3d203358.jpg)
Open-World Semi-Supervised Learning
Semi-supervised image classification (cold start), hand detection, material classification.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000422-688b5ff1.jpg)
Open Vocabulary Attribute Detection
Inverse tone mapping, image/document clustering, self-organized clustering, instance search.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000082-7dbded6b.jpg)
Audio Fingerprint
3d shape modeling.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001823-9593b40a.jpg)
Action Analysis
Facial editing.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/68278bb8-5f4a-43f4-bfbb-ab950c58df74.jpg)
Food Recognition
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000634-b236aef7.jpg)
Holdout Set
Motion magnification, semi-supervised instance segmentation, binary classification, llm-generated text detection, cancer-no cancer per breast classification, cancer-no cancer per image classification, suspicious (birads 4,5)-no suspicious (birads 1,2,3) per image classification, cancer-no cancer per view classification, video segmentation, camera shot boundary detection, open-vocabulary video segmentation, open-world video segmentation, lung nodule classification, lung nodule 3d classification, lung nodule detection, lung nodule 3d detection, 3d scene reconstruction, art analysis.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/3f612eed-ab00-41f9-a6a1-91dbf5cf09f1.jpg)
Zero-Shot Composed Image Retrieval (ZS-CIR)
Event segmentation, generic event boundary detection, image retouching, image-variation, jpeg artifact removal, multispectral object detection, point cloud super resolution, skills assessment.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/697a4af1-f65e-4992-8b02-48c6bf021517.jpg)
Sensor Modeling
10-shot image generation, video prediction, earth surface forecasting, predict future video frames, ad-hoc video search, audio-visual synchronization, handwriting generation, pose retrieval, scanpath prediction, scene change detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002411-873b5588_mEzUBaG.jpg)
Sketch-to-Image Translation
Skills evaluation, synthetic image detection, highlight removal, 3d shape reconstruction from a single 2d image.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000135-9e64ef64.jpg)
Shape from Texture
Deception detection, deception detection in videos, handwriting verification, bangla spelling error correction, 3d open-vocabulary instance segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/60ff1d90-d451-4f8a-8c62-94550ec91252.jpg)
3D Shape Representation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000796-71004bae.jpg)
3D Dense Shape Correspondence
Birds eye view object detection.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000015-28528fd4.jpg)
Multiple People Tracking
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000000368-67c60635.jpg)
Network Interpretation
Rgb-d reconstruction, seeing beyond the visible, semi-supervised domain generalization, unsupervised semantic segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001371-00a5d91b.jpg)
Unsupervised Semantic Segmentation with Language-image Pre-training
Multiple object tracking with transformer.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/e33508db-205c-4c2c-a8de-58a8a9e48a0e.jpg)
Multiple Object Track and Segmentation
Constrained lip-synchronization, face dubbing, vietnamese visual question answering, explanatory visual question answering.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002904-8c6ca1c7_YGTMmks.jpg)
Video Visual Relation Detection
Human-object relationship detection, 3d shape reconstruction, defocus blur detection, event data classification, image comprehension, image manipulation localization, instance shadow detection, kinship verification, medical image enhancement, open vocabulary panoptic segmentation, single-object discovery, training-free 3d point cloud classification, video forensics.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000361-36f52818.jpg)
Sequential Place Recognition
Autonomous flight (dense forest), autonomous web navigation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/ae65120e-a9ac-4c25-8a19-5e734fd4cafc.jpg)
Generative 3D Object Classification
Cube engraving classification, multimodal machine translation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000001101-fb2e2264.jpg)
Face to Face Translation
Multimodal lexical translation, 2d semantic segmentation task 3 (25 classes), document enhancement, 4d panoptic segmentation, action assessment, bokeh effect rendering, drivable area detection, face anonymization, font recognition, horizon line estimation, image imputation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001396-994a63ac.jpg)
Long Video Retrieval (Background Removed)
Medical image denoising.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/948d7c0f-96f5-43bf-9f62-8d4553eef0e2.jpg)
Occlusion Estimation
Physiological computing.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001878-93ba632b.jpg)
Lake Ice Monitoring
Short-term object interaction anticipation, spatio-temporal video grounding, unsupervised 3d point cloud linear evaluation, wireframe parsing, single-image-generation, unsupervised anomaly detection with specified settings -- 30% anomaly, root cause ranking, anomaly detection at 30% anomaly, anomaly detection at various anomaly percentages.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002676-0f8402b4.jpg)
Unsupervised Contextual Anomaly Detection
2d pose estimation, category-agnostic pose estimation, overlapping pose estimation, facial expression recognition, cross-domain facial expression recognition, zero-shot facial expression recognition, landmark tracking, muscle tendon junction identification, 3d object captioning, animated gif generation, generalized referring expression comprehension, image deblocking, infrared image super-resolution, motion disentanglement, persuasion strategies, scene text editing, traffic accident detection, accident anticipation, unsupervised landmark detection, visual speech recognition, lip to speech synthesis, continual anomaly detection, gaze redirection, weakly supervised action segmentation (transcript), weakly supervised action segmentation (action set), calving front delineation in synthetic aperture radar imagery, calving front delineation in synthetic aperture radar imagery with fixed training amount.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/8ce700a1-f5b9-4bc0-a529-c75e38f72e22.jpg)
Handwritten Line Segmentation
Handwritten word segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000002615-e42dabda.jpg)
General Action Video Anomaly Detection
Physical video anomaly detection, monocular cross-view road scene parsing (road), monocular cross-view road scene parsing (vehicle).
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000603-3510e464.jpg)
Transparent Object Depth Estimation
3d semantic occupancy prediction, 3d scene editing, age and gender estimation, data ablation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000351-f7066399.jpg)
Occluded Face Detection
Gait identification, historical color image dating, stochastic human motion prediction, image retargeting, image and video forgery detection, motion captioning, personality trait recognition, personalized segmentation, scene-aware dialogue, spatial relation recognition, spatial token mixer, steganographics, story continuation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/92539aed-3b8c-4b62-b66c-b0fbf7ef3d53.jpg)
Unsupervised Anomaly Detection with Specified Settings -- 0.1% anomaly
Unsupervised anomaly detection with specified settings -- 1% anomaly, unsupervised anomaly detection with specified settings -- 10% anomaly, unsupervised anomaly detection with specified settings -- 20% anomaly, vehicle speed estimation, visual analogies, visual social relationship recognition, zero-shot text-to-video generation, text-guided-generation, video frame interpolation, 3d video frame interpolation, unsupervised video frame interpolation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002776-51e5d023.jpg)
eXtreme-Video-Frame-Interpolation
Continual semantic segmentation, overlapped 5-3, overlapped 25-25, evolving domain generalization, source-free domain generalization, micro-expression generation, micro-expression generation (megc2021), mistake detection, online mistake detection, period estimation, art period estimation (544 artists), unsupervised panoptic segmentation, unsupervised zero-shot panoptic segmentation, 3d rotation estimation, camera auto-calibration, defocus estimation, derendering, fingertip detection, hierarchical text segmentation, human-object interaction concept discovery.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/d79f60e0-4c23-412d-8dab-81b1119620ee.jpg)
One-Shot Face Stylization
Speaker-specific lip to speech synthesis, multi-person pose estimation, neural stylization.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/68b38b9c-ed05-4372-a612-01f909dec050.jpg)
Part-aware Panoptic Segmentation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002134-7dd96854.jpg)
Population Mapping
Pornography detection, prediction of occupancy grid maps, raw reconstruction, repetitive action counting, svbrdf estimation, semi-supervised video classification, spectrum cartography, supervised image retrieval, synthetic image attribution, training-free 3d part segmentation, unsupervised image decomposition, video propagation, vietnamese multimodal learning, weakly supervised 3d point cloud segmentation, weakly-supervised panoptic segmentation, drone-based object tracking, brain visual reconstruction, brain visual reconstruction from fmri.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/a34d9382-cb10-400e-a6bd-31f40c72623f.jpg)
Human-Object Interaction Generation
Image-guided composition, fashion understanding, semi-supervised fashion compatibility.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000714-068a8901_2PQwzdm.jpg)
intensity image denoising
Lifetime image denoising, observation completion, active observation completion, boundary grounding.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/5e1f2ccb-0696-44cd-a506-ad8524a59b9b.jpg)
Video Narrative Grounding
3d inpainting, 3d scene graph alignment, 4d spatio temporal semantic segmentation.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001517-bc8b5f8c.jpg)
Age Estimation
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000616-5c9160ff.jpg)
Few-shot Age Estimation
Brdf estimation, camouflage segmentation, clothing attribute recognition, damaged building detection, depth image estimation, detecting shadows, dynamic texture recognition.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/thumbnails/task/task-0000000459-e2dd17f7_kfjELuH.jpg)
Disguised Face Verification
Few shot open set object detection, gaze target estimation, generalized zero-shot learning - unseen, hd semantic map learning, human-object interaction anticipation, image deep networks, keypoint detection and image matching, manufacturing quality control, materials imaging, micro-gesture recognition, multi-person pose estimation and tracking.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000001240-5386a638.jpg)
Multi-modal image segmentation
Multi-object discovery, neural radiance caching.
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/b5f50463-49bd-42d6-b0fc-2fa6388d11ec.jpg)
Parking Space Occupancy
![research paper about computer vision research paper about computer vision](https://production-media.paperswithcode.com/icons/task/task-0000002889-9b2c229b.jpg)
Partial Video Copy Detection
Top Computer Vision Papers of All Time (Updated 2024)
![research paper about computer vision](https://viso.ai/wp-content/uploads/2024/03/best-CV-papers-cover-image-1060x439.png)
Viso Suite is the all-in-one solution for teams to build, deliver, scale computer vision applications.
Today’s boom in computer vision (CV) started at the beginning of the 21st century with the breakthrough of deep learning models and convolutional neural networks (CNNs). The main CV methods include image classification, image localization, object detection, and segmentation.
In this article, we dive into some of the most significant research papers that triggered the rapid development of computer vision. We split them into two categories – classical CV approaches, and papers based on deep-learning. We chose the following papers based on their influence, quality, and applicability.
- Gradient-Based Learning Applied to Document Recognition (1998)
- Distinctive Image Features from Scale-Invariant Keypoints (2004)
- Histograms of Oriented Gradients for Human Detection (2005)
- SURF: Speeded Up Robust Features (2006)
- ImageNet Classification with Deep Convolutional Neural Networks (2012)
- Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
- GoogLeNet – Going Deeper with Convolutions (2014)
- ResNet – Deep Residual Learning for Image Recognition (2015)
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)
- YOLO: You Only Look Once – Unified, Real-Time Object Detection (2016)
- Mask R-CNN (2017)
- EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)
About us: Viso Suite is the end-to-end computer vision solution for enterprises. With a simple interface and features that give machine learning teams control over the entire ML pipeline, Viso Suite makes it possible to achieve a 3-year ROI of 695%. Book a demo to learn more about how Viso Suite can help solve business problems.
![research paper about computer vision Viso Platform](https://viso.ai/wp-content/uploads/2024/02/viso-suite-view-1060x625-1.png)
Classic Computer Vision Papers
The authors Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner published the LeNet paper in 1998. They introduced the concept of a trainable Graph Transformer Network (GTN) for handwritten character and word recognition. They investigated discriminative, gradient-based techniques for training the recognizer globally, without manual segmentation and labeling.
![research paper about computer vision LeNet CNN architecture digits recognition](https://viso.ai/wp-content/uploads/2024/03/lenet-architecture-digits-recognition-1060x286.png)
Characteristics of the model:
- LeNet-5 is a seven-layer CNN, alternating convolutional and subsampling layers followed by fully connected layers; its first convolutional layer has 6 feature maps and 156 trainable parameters.
- The input is a 32×32 pixel image, and the output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each class.
- Training on 60,000 examples, the authors achieved a 0.35% error rate on the training set (after 19 passes).
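The 156-parameter figure for the first convolutional layer can be verified with simple arithmetic: six 5×5 filters over a one-channel input, plus one bias per filter. A minimal sketch:

```python
def conv_layer_params(n_filters, kernel_h, kernel_w, in_channels=1):
    """Trainable parameters of a convolutional layer: weights plus one bias per filter."""
    return n_filters * (kernel_h * kernel_w * in_channels + 1)

# LeNet-5's first convolutional layer: 6 feature maps, 5x5 kernels, 1-channel input.
print(conv_layer_params(6, 5, 5))  # 156
```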
Find the LeNet paper here.
David Lowe (2004) proposed a method for extracting distinctive invariant features from images and used them to perform reliable matching between different views of an object or scene. The paper introduced the Scale-Invariant Feature Transform (SIFT), which transforms image data into scale-invariant coordinates relative to local features.
![research paper about computer vision SIFT method keypoints detection](https://viso.ai/wp-content/uploads/2024/03/sift-method-keypoints-selection.jpg)
Model characteristics:
- The method generates large numbers of features that densely cover the image over the full range of scales and locations.
- The model needs to match at least 3 features from each object in order to reliably detect small objects in cluttered backgrounds.
- For image matching and recognition, the model extracts SIFT features from a set of reference images stored in a database.
- The SIFT model matches a new image by individually comparing each of its features to this database, using Euclidean distance between feature vectors.
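The matching step described above can be sketched with toy 2-D descriptors (real SIFT descriptors are 128-dimensional): each query feature is compared to every database feature by Euclidean distance, and a match is kept only if the nearest neighbor is clearly closer than the second-nearest (Lowe's ratio test).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_features(query, database, ratio=0.8):
    """Nearest-neighbor matching with Lowe's ratio test: keep a match only
    if the closest database descriptor is clearly closer than the runner-up."""
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((euclidean(q, d), di) for di, d in enumerate(database))
        (best_d, best_i), (second_d, _) = dists[0], dists[1]
        if best_d < ratio * second_d:
            matches.append((qi, best_i))
    return matches

db = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1)]
print(match_features([(9.9, 10.1)], db))  # [(0, 1)]
```

Real SIFT matching does exactly this over 128-dimensional descriptors, typically with an approximate nearest-neighbor index instead of the exhaustive scan shown here.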
Find the SIFT paper here.
The authors Navneet Dalal and Bill Triggs studied feature sets for robust visual object recognition, using linear SVM-based human detection as a test case. They experimented with grids of Histograms of Oriented Gradient (HOG) descriptors, which significantly outperformed existing feature sets for human detection.
![research paper about computer vision histogram object detection](https://viso.ai/wp-content/uploads/2024/03/histogram-feature-extraction-object-detection.jpg)
The authors' achievements:
- The histogram method gave near-perfect separation on the original MIT pedestrian database.
- For good results, the model requires fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks.
- Researchers examined a more challenging dataset containing over 1800 annotated human images with many pose variations and backgrounds.
- In the standard detector, each HOG cell appears four times with different normalizations, and including this redundancy improves performance to 89%.
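The orientation binning at the heart of HOG can be sketched in plain Python: each pixel's gradient votes into one of 9 unsigned-orientation bins, weighted by its magnitude. This is a toy version; the real descriptor adds interpolated voting and the block normalization discussed above.

```python
import math

def hog_cell_histogram(gradients, n_bins=9):
    """Orientation histogram for one HOG cell: each (gx, gy) gradient votes
    into an orientation bin (0-180 degrees, unsigned), weighted by magnitude."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for gx, gy in gradients:
        mag = math.hypot(gx, gy)
        angle = math.degrees(math.atan2(gy, gx)) % 180.0
        hist[int(angle // bin_width) % n_bins] += mag
    return hist

# two horizontal-gradient pixels (angle 0) and one vertical-gradient pixel (angle 90)
h = hog_cell_histogram([(1.0, 0.0), (2.0, 0.0), (0.0, 1.0)])
print(h[0], h[4])  # 3.0 1.0
```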
Find the HOG paper here.
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool presented a scale- and rotation-invariant interest point detector and descriptor, called SURF (Speeded Up Robust Features). It outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, while computing much faster. The authors relied on integral images for image convolutions and built on the strengths of the leading existing detectors and descriptors.
![research paper about computer vision surf detecting interest points](https://viso.ai/wp-content/uploads/2024/03/surf-detected-interest-points.jpg)
- Applied a Hessian matrix-based measure for the detector, and a distribution-based descriptor, simplifying these methods to the essential.
- Presented experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application.
- SURF showed strong performance – SURF-128 with an 85.7% recognition rate, followed by U-SURF (83.8%) and SURF (82.6%).
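SURF scores candidate interest points with the determinant of its approximated Hessian, down-weighting the mixed second-derivative term to compensate for the box-filter approximation (the 0.9 weight follows the paper; the input responses below are made-up illustrative values, not real filter outputs):

```python
def hessian_response(dxx, dyy, dxy, weight=0.9):
    """SURF's blob response: determinant of the approximated Hessian, with
    the mixed term down-weighted to balance the box-filter approximation."""
    return dxx * dyy - (weight * dxy) ** 2

# A blob-like point (strong, equal curvature in both directions) scores high;
# an edge (strong curvature in only one direction) scores low.
print(hessian_response(4.0, 4.0, 0.0))   # 16.0
print(hessian_response(4.0, 0.25, 0.0))  # 1.0
```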
Find the SURF paper here.
Papers Based on Deep-Learning Models
Alex Krizhevsky and his team won the ImageNet Challenge in 2012 with a deep convolutional neural network. They trained one of the largest CNNs to date on the ImageNet data used in the ILSVRC-2010 and ILSVRC-2012 challenges and achieved the best results reported on these datasets. They wrote a highly optimized GPU implementation of 2D convolution and all the other operations required to train a CNN, and published the results.
![research paper about computer vision alexnet CNN architecture](https://viso.ai/wp-content/uploads/2024/03/alexnet-cnn-architecture.jpg)
- The final CNN contained five convolutional and three fully connected layers, and this depth proved essential to its performance.
- They found that removing any convolutional layer (each containing less than 1% of the model’s parameters) resulted in inferior performance.
- The same CNN, with an extra sixth convolutional layer, was used to classify the entire ImageNet Fall 2011 release (15M images, 22K categories).
- After fine-tuning on ImageNet-2012 it gave an error rate of 16.6%.
Find the ImageNet paper here.
Karen Simonyan and Andrew Zisserman (Oxford University) investigated the effect of convolutional network depth on accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, the family of very deep convolutional networks now known as VGG. They showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.
![research paper about computer vision image classification CNN results VOC-2007, VOC-2012](https://viso.ai/wp-content/uploads/2024/03/VOC-2012-image-classification.jpg)
- Their ImageNet Challenge 2014 submission secured the first and second places in the localization and classification tracks respectively.
- They showed that their representations generalize well to other datasets, where they achieved state-of-the-art results.
- They made two best-performing ConvNet models publicly available, in addition to the deep visual representations in CV.
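The appeal of stacked 3×3 filters is easy to quantify: three stacked 3×3 layers cover the same 7×7 receptive field as a single 7×7 layer, with roughly half the weights. A back-of-the-envelope check (the channel count C = 256 is illustrative; biases are ignored):

```python
def conv_params(kernel, channels):
    """Weights in a conv layer with `channels` input and output channels (biases ignored)."""
    return kernel * kernel * channels * channels

C = 256
stacked = 3 * conv_params(3, C)  # three 3x3 layers: 27 * C^2 weights
single = conv_params(7, C)       # one 7x7 layer:    49 * C^2 weights
print(stacked, single)  # 1769472 3211264
```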
Find the VGG paper here.
The Google team (Christian Szegedy, Wei Liu, et al.) proposed a deep convolutional neural network architecture codenamed Inception. They intended to set the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of their architecture was the improved utilization of the computing resources inside the network.
![research paper about computer vision GoogleNet Inception CNN](https://viso.ai/wp-content/uploads/2024/03/googlenet-inception-module-dimension-reductions.jpg)
- A carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant.
- Their submission for ILSVRC14 was called GoogLeNet, a 22-layer deep network. Its quality was assessed in the context of classification and detection.
- They added 200 region proposals coming from multi-box, increasing the coverage from 92% to 93%.
- Lastly, they used an ensemble of 6 ConvNets when classifying each region, which improved results from 40% to 43.9% accuracy.
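The Inception module's dimension-reduction trick can be illustrated with a quick parameter count: inserting a 1×1 "bottleneck" before an expensive 5×5 convolution cuts the weights by roughly 10×. The channel sizes below (192 in, 16 reduced, 32 out) are illustrative; biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer mapping c_in to c_out channels (biases ignored)."""
    return k * k * c_in * c_out

c_in, c_out, reduced = 192, 32, 16
direct = conv_params(5, c_in, c_out)  # 5x5 applied straight to 192 channels
with_reduction = conv_params(1, c_in, reduced) + conv_params(5, reduced, c_out)
print(direct, with_reduction)  # 153600 15872
```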
Find the GoogLeNet paper here.
Microsoft researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun presented a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously. They reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
![research paper about computer vision resnet error rates](https://viso.ai/wp-content/uploads/2024/03/resnet-error-rates-ImageNet.jpg)
- They evaluated residual nets with a depth of up to 152 layers – 8× deeper than VGG nets, but still having lower complexity.
- This result won 1st place on the ILSVRC 2015 classification task.
- The team also analyzed CIFAR-10 with networks of 100 and 1,000 layers, and their deep residual representations yielded a 28% relative improvement on the COCO object detection dataset.
- Moreover, in the ILSVRC & COCO 2015 competitions, they won 1st place on the tasks of ImageNet detection, ImageNet localization, and COCO detection/segmentation.
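The core idea can be sketched in a few lines: a residual block adds its input back to the output of its layers, so those layers only have to model the deviation from the identity mapping. A minimal, framework-free illustration (`f` stands in for the block's learned layers):

```python
def residual_block(x, f):
    """Residual learning: output f(x) + x, so the layers model only the
    residual f(x) = h(x) - x instead of the full mapping h(x)."""
    return [fi + xi for fi, xi in zip(f(x), x)]

# If the optimal mapping is close to identity, f can simply learn ~zero,
# which is far easier for a deep stack of layers to fit.
out = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
print(out)  # [1.0, 2.0, 3.0]
```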
Find the ResNet paper here.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. They trained the RPN end-to-end to generate high-quality region proposals, which Fast R-CNN then uses for detection.
![research paper about computer vision faster R-CNN object detection](https://viso.ai/wp-content/uploads/2024/03/faster-R-CNN-unified-network.jpg)
- Merged RPN and Fast R-CNN into a single network by sharing their convolutional features. In addition, they applied neural networks with “attention” mechanisms.
- For the very deep VGG-16 model, their detection system had a frame rate of 5 fps on a GPU.
- Achieved state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.
- In ILSVRC and COCO 2015 competitions, faster R-CNN and RPN were the foundations of the 1st-place winning entries in several tracks.
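Region proposals are scored against ground-truth boxes by intersection-over-union (IoU), the overlap measure used to label anchors as positive or negative during RPN training. A minimal IoU function (boxes as (x1, y1, x2, y2) corner tuples; a sketch, not the paper's implementation):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.143
```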
Find the Faster R-CNN paper here.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi developed YOLO, an innovative approach to object detection. Instead of repurposing classifiers to perform detection, the authors framed object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
![research paper about computer vision YOLO CNN architecture](https://viso.ai/wp-content/uploads/2024/03/yolo-architecture.jpg)
- The base YOLO model processed images in real-time at 45 frames per second.
- A smaller version of the network, Fast YOLO, processed 155 frames per second, while still achieving double the mAP of other real-time detectors.
- Compared to state-of-the-art detection systems, YOLO made more localization errors but was less likely to predict false positives in the background.
- YOLO learned very general representations of objects and outperformed other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains such as artwork.
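The regression formulation fixes the size of the network's output: for an S × S grid with B boxes per cell and C classes, the final tensor has S × S × (B·5 + C) values, since each box carries (x, y, w, h, confidence). For the paper's PASCAL VOC setup (S = 7, B = 2, C = 20):

```python
def yolo_output_size(s=7, b=2, c=20):
    """YOLO's final layer: an S x S grid, each cell predicting B boxes
    of (x, y, w, h, confidence) plus C class probabilities."""
    return s * s * (b * 5 + c)

print(yolo_output_size())  # 1470, i.e. a 7 x 7 x 30 tensor
```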
Find the YOLO paper here.
Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick (Facebook) presented a conceptually simple, flexible, and general framework for object instance segmentation. Their approach detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
![research paper about computer vision mask R-CNN framework](https://viso.ai/wp-content/uploads/2024/03/mask-rcnn-framework.jpg)
- Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
- Showed strong results in all three tracks of the COCO suite of challenges: instance segmentation, bounding-box object detection, and person keypoint detection.
- Mask R-CNN outperformed all existing, single-model entries on every task, including the COCO 2016 challenge winners.
- The model served as a solid baseline and eased future research in instance-level recognition.
Find the Mask R-CNN paper here.
The authors of EfficientNet (Mingxing Tan and Quoc V. Le) studied model scaling and identified that carefully balancing network depth, width, and resolution can lead to better performance. They proposed a new scaling method that uniformly scales all dimensions of depth, width, and resolution using a simple but effective compound coefficient. They demonstrated the effectiveness of this method in scaling up MobileNets and ResNet.
![research paper about computer vision efficiennet model scaling CNN](https://viso.ai/wp-content/uploads/2024/03/efficientnet-model-scaling.jpg)
- Designed a new baseline network and scaled it up to obtain a family of models, called EfficientNets. It had much better accuracy and efficiency than previous ConvNets.
- EfficientNet-B7 achieved state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet.
- It also transferred well and achieved state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with much fewer parameters.
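The compound scaling rule can be reproduced in a few lines. Given the paper's coefficients for the B0 baseline (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution, chosen so that α·β²·γ² ≈ 2), a single exponent φ scales all three dimensions together while roughly doubling FLOPS per unit of φ:

```python
# EfficientNet compound scaling: one coefficient phi grows depth, width,
# and resolution together under the constraint alpha * beta^2 * gamma^2 ~= 2.
alpha, beta, gamma = 1.2, 1.1, 1.15

def scale(phi):
    """Multipliers for (depth, width, resolution) at scaling exponent phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

depth, width, res = scale(2)
print(round(depth, 2), round(width, 2), round(res, 2))  # 1.44 1.21 1.32
print(round(alpha * beta ** 2 * gamma ** 2, 2))         # 1.92, close to 2
```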
Find the EfficientNet paper here.
- Review Article
- Open access
- Published: 08 January 2021
Deep learning-enabled medical computer vision
- Andre Esteva ORCID: orcid.org/0000-0003-1937-9682 1 ,
- Katherine Chou 2 na1 ,
- Serena Yeung 3 na1 ,
- Nikhil Naik ORCID: orcid.org/0000-0002-5191-2726 1 na1 ,
- Ali Madani 1 na1 ,
- Ali Mottaghi 3 na1 ,
- Yun Liu ORCID: orcid.org/0000-0003-4079-8275 2 ,
- Eric Topol 4 ,
- Jeff Dean 2 &
- Richard Socher 1
npj Digital Medicine volume 4, Article number: 5 (2021)
- Computational science
- Health care
- Medical research
A decade of unprecedented progress in artificial intelligence (AI) has demonstrated the potential for many fields—including medicine—to benefit from the insights that AI techniques can extract from data. Here we survey recent progress in the development of modern computer vision techniques—powered by deep learning—for medical applications, focusing on medical imaging, medical video, and clinical deployment. We start by briefly summarizing a decade of progress in convolutional neural networks, including the vision tasks they enable, in the context of healthcare. Next, we discuss several example medical imaging applications that stand to benefit—including cardiology, pathology, dermatology, and ophthalmology—and propose new avenues for continued work. We then expand into general medical video, highlighting ways in which clinical workflows can integrate computer vision to enhance care. Finally, we discuss the challenges and hurdles required for real-world clinical deployment of these technologies.
Introduction
Computer vision (CV) has a rich history spanning decades 1 of efforts to enable computers to perceive visual stimuli meaningfully. Machine perception spans a range of levels, from low-level tasks such as identifying edges, to high-level tasks such as understanding complete scenes. Advances in the last decade have largely been due to three factors: (1) the maturation of deep learning (DL)—a type of machine learning that enables end-to-end learning of very complex functions from raw data 2 , (2) strides in localized compute power via GPUs 3 , and (3) the open-sourcing of large labeled datasets with which to train these algorithms 4 . The combination of these three elements has given individual researchers access to the resources needed to advance the field. As the research community grew exponentially, so did progress.
The growth of modern CV has overlapped with the generation of large amounts of digital data in a number of scientific fields. Recent medical advances have been prolific 5 , 6 , owing largely to DL’s remarkable ability to learn many tasks from most data sources. Using large datasets, CV models can acquire many pattern-recognition abilities—from physician-level diagnostics 7 to medical scene perception 8 . See Fig. 1 .
![research paper about computer vision figure 1](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-020-00376-2/MediaObjects/41746_2020_376_Fig1_HTML.png)
a Multimodal discriminative model. Deep learning architectures can be constructed to jointly learn from both image data, typically with convolutional networks, and non-image data, typically with general deep networks. Learned annotations can include disease diagnostics, prognostics, clinical predictions, and combinations thereof. b Generative model. Convolutional neural networks can be trained to generate images. Tasks include image-to-image regression (shown), super-resolution image enhancement, novel image generation, and others.
Here we survey the intersection of CV and medicine, focusing on research in medical imaging, medical video, and real clinical deployment. We discuss key algorithmic capabilities which unlocked these opportunities, and dive into the myriad of accomplishments from recent years. The clinical tasks suitable for CV span many categories, such as screening, diagnosis, detecting conditions, predicting future outcomes, segmenting pathologies from organs to cells, monitoring disease, and clinical research. Throughout, we consider the future growth of this technology and its implications for medicine and healthcare.
Computer vision
Object classification, localization, and detection, respectively refer to identifying the type of an object in an image, the location of objects present, and both type and location simultaneously. The ImageNet Large-Scale Visual Recognition Challenge 9 (ILSVRC) was a spearhead to progress in these tasks over the last decade. It created a large community of DL researchers competing and collaborating together to improve techniques on various CV tasks. The first contemporary, GPU-powered DL approach, in 2012 10 , yielded an inflection point in the growth of this community, heralding an era of significant year-over-year improvements 11 , 12 , 13 , 14 through the competition’s final year in 2017. Notably, classification accuracy achieved human-level performance during this period. Within medicine, fine-grained versions of these methods 15 have successfully been applied to the classification and detection of many diseases (Fig. 2 ). Given sufficient data, the accuracy often matches or surpasses the level of expert physicians 7 , 16 . Similarly, the segmentation of objects has substantially improved 17 , 18 , particularly in challenging scenarios such as the biomedical segmentation of multiple types of overlapping cells in microscopy. The key DL technique leveraged in these tasks is the convolutional neural network 19 (CNN)—a type of DL algorithm which hardcodes translational invariance, a key feature of image data. Many other CV tasks have benefited from this progress, including image registration (identifying corresponding points across similar images), image retrieval (finding similar images), and image reconstruction and enhancement. The specific challenges of working with medical data require the utilization of many types of AI models.
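The translational structure that CNNs hardcode can be demonstrated in one dimension: convolving a shifted signal produces a correspondingly shifted response (strictly, convolution is translation-equivariant; pooling layers then yield approximate invariance). A minimal sketch in plain Python:

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (correlation form), the building block CNNs tile over images."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

x = [0, 0, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 0, 0]  # the same pattern, moved one step right
kernel = [1, 2, 1]
print(conv1d(x, kernel))        # [1, 2, 1, 0]
print(conv1d(shifted, kernel))  # [0, 1, 2, 1] -- the response shifts too
```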
![research paper about computer vision figure 2](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-020-00376-2/MediaObjects/41746_2020_376_Fig2_HTML.png)
CNNs—trained to classify disease states—have been extensively tested across diseases, and benchmarked against physicians. Their performance is typically on par with experts when both are tested on the same image classification task. a Dermatology 7 and b Radiology 156 . Examples reprinted with permission and adapted for style.
These techniques largely rely on supervised learning, which leverages datasets that contain both data points (e.g. images) and data labels (e.g. object classes). Given the sparsity and access difficulties of medical data, transfer learning—in which an algorithm is first trained on a large and unrelated corpus (e.g. ImageNet 4 ), then fine-tuned on a dataset of interest (e.g. medical)—has been critical for progress. To reduce the costs associated with collecting and labeling data, techniques to generate synthetic data, such as data augmentation 20 and generative adversarial networks (GANs) 21 , are being developed. Researchers have even shown that crowd-sourcing image annotations can yield effective medical algorithms 22 , 23 . Recently, self-supervised learning 24 —in which implicit labels are extracted from data points and used to train algorithms (e.g. predicting the spatial arrangement of tiles generated from splitting an image into pieces)—has pushed the field towards fully unsupervised learning, which lacks the need for labels. Applying these techniques in medicine will reduce the barrier to development and deployment.
Medical data access is central to this field, and key ethical and legal questions must be addressed. Do patients own their de-identified data? What if methods to re-identify data improve over time? Should the community open-source large quantities of data? To date, academia and industry have largely relied on small, open-source datasets, and data collected through commercial products. Dynamics around data sharing and country-specific availability will impact deployment opportunities. The field of federated learning 25 —in which centralized algorithms can be trained on distributed data that never leaves protected enclosures—may enable a workaround in stricter jurisdictions.
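The federated learning idea can be illustrated with a toy sketch: each client runs gradient steps on its private data and sends back only model weights, which the server averages (an illustrative NumPy sketch of federated averaging on a linear model, not a production protocol):

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, steps=10):
    """One client's local gradient steps on a linear model (MSE loss).
    The raw data (X, y) never leaves the client."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    """One round of federated averaging: each client trains locally;
    the server averages the returned weights by client dataset size."""
    updates, sizes = [], []
    for X, y in clients:
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=np.array(sizes, float))

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
clients = []
for _ in range(3):  # three institutions with private datasets
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.01 * rng.normal(size=50)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
# w approaches true_w even though no client ever shared raw data.
```

Only model parameters cross institutional boundaries here, which is what makes the approach attractive in jurisdictions with strict data-protection rules.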
These advances have spurred growth in other domains of CV, such as multimodal learning, which combines vision with other modalities such as language (Fig. 1a ) 26 , time-series data, and genomic data 5 . These methods can combine with 3D vision 27 , 28 to turn depth-cameras into privacy-preserving sensors 29 , making deployment easier for patient settings such as the intensive care unit 8 . The range of tasks is even broader in video. Applications like activity recognition 30 and live scene understanding 31 are useful in detecting and responding to important or adverse clinical events 32 .
Medical imaging
In recent years, the number of publications applying computer vision techniques to static medical imagery has grown from hundreds to thousands 33 . A few areas have received substantial attention—radiology, pathology, ophthalmology, and dermatology—owing to the visual pattern-recognition nature of diagnostic tasks in these specialties, and the growing availability of highly structured images.
The unique characteristics of medical imagery pose a number of challenges to DL-based computer vision. For one, images can be massive. Digitizing histopathology slides produces gigapixel images of around 100,000 × 100,000 pixels, whereas typical CNN image inputs are around 200 × 200 pixels. Further, different chemical preparations will render different slides for the same piece of tissue, and different digitization devices or settings may produce different images for the same slide. Radiology modalities such as CT and MRI render equally massive 3D images, forcing standard CNNs to either work with a set of 2D slices or adjust their internal structure to process in 3D. Similarly, ultrasound renders a time-series of noisy 2D slices of a 3D context; these slices are spatially correlated but not aligned. DL has started to account for the unique challenges of medical data. For instance, multiple-instance learning (MIL) 34 enables learning from datasets containing massive images and few labels (e.g. histopathology). 3D convolutions in CNNs are enabling better learning from 3D volumes (e.g. MRI and CT) 35 . Spatio-temporal models 36 and image registration enable working with time-series images (e.g. ultrasound).
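The multiple-instance-learning setup mentioned above can be sketched as follows: a slide becomes a "bag" of tiles, each tile receives a score, and a permutation-invariant pooling (here, max) produces a slide-level prediction that can be trained against slide-level labels alone (a toy NumPy sketch in which mean intensity stands in for a CNN's tile score):

```python
import numpy as np

def slide_to_tiles(slide, tile=224):
    """Crop a large 2D slide into non-overlapping tiles (a 'bag')."""
    h, w = slide.shape
    return [slide[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

def mil_slide_score(tiles, tile_scorer):
    """Max-pooling MIL: a slide is positive if its most suspicious
    tile is positive, so only a slide-level label is needed."""
    return max(tile_scorer(t) for t in tiles)

# Toy scorer: mean intensity stands in for a CNN's tumor probability.
scorer = lambda t: float(t.mean())

slide = np.zeros((896, 896))          # small stand-in for a gigapixel WSI
slide[224:448, 448:672] = 1.0         # one 'tumorous' tile-sized region
tiles = slide_to_tiles(slide)
score = mil_slide_score(tiles, scorer)
```

Because the pooling only needs the bag-level label, expensive region-level annotation of each tile can be avoided.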
Dozens of companies have obtained US FDA and European CE approval for medical imaging AI 37 , and commercial markets have begun to form as sustainable business models are created. For instance, regions of high-throughput healthcare, such as India and Thailand, have welcomed the deployment of technologies such as diabetic retinopathy screening systems 38 . This rapid growth has now reached the point of directly impacting patient outcomes—the US CMS recently approved reimbursement for a radiology stroke triage use-case which reduces the time it takes for patients to receive treatment 39 .
CV in medical modalities with non-standardized data collection requires the integration of CV into existing physical systems. For instance, in otolaryngology, CNNs can be used to help primary care physicians manage patients’ ears, nose, and throat 40 , through mountable devices attached to smartphones 41 . Hematology and serology can benefit from microscope-integrated AIs 42 that diagnose common conditions 43 or count blood cells of various types 44 —repetitive tasks that are easy to augment with CNNs. AI in gastroenterology has demonstrated stunning capabilities. Video-based CNNs can be integrated into endoscopic procedures 45 for scope guidance, lesion detection, and lesion diagnosis. Applications include esophageal cancer screening 46 , detecting gastric cancer 47 , 48 , detecting stomach infections such as H. Pylori 49 , and even finding hookworms 50 . Scientists have taken this field one step further by building entire medical AI devices designed for monitoring, such as at-home smart toilets outfitted with diagnostic CNNs on cameras 51 . Beyond the analysis of disease states, CV can serve the future of human health and welfare through applications such as screening human embryos for implantation 52 .
Computer vision in radiology has quickly burgeoned into its own field of research, growing a corpus of work 53 , 54 , 55 that extends into all modalities, with a focus on X-rays, CT, and MRI. Chest X-ray analysis—a key clinical focus area 33 —has been an exemplar: the field has collected nearly 1 million annotated, open-source images 56 , 57 , 58 —the closest ImageNet 9 equivalent to date in medical CV. Analysis of brain imagery 59 (particularly for time-critical use-cases like stroke) and abdominal imagery 60 has similarly received substantial attention. Disease classification, nodule detection 61 , and region segmentation (e.g. ventricular 62 ) models have been developed for most conditions for which data can be collected. This has enabled the field to respond rapidly in times of crisis—for instance, developing and deploying COVID-19 detection models 63 . The field continues to expand with work in image translation (e.g. converting noisy ultrasound images into MRI), image reconstruction and enhancement (e.g. converting low-dosage, low-resolution CT images into high-resolution images 64 ), automated report generation, and temporal tracking (e.g. image registration to track tumor growth over time). In the sections below, we explore vision-based applications in other specialties.
Cardiac imaging is increasingly used in a wide array of clinical diagnoses and workflows. Key clinical applications for deep learning include diagnosis and screening. The most common imaging modality in cardiovascular medicine is the cardiac ultrasound, or echocardiogram. As a cost-effective, radiation-free technique, echocardiography is uniquely suited for DL due to straightforward data acquisition and interpretation—it is routinely used in most acute inpatient facilities, outpatient centers, and emergency rooms 65 . Further, 3D imaging techniques such as CT and MRI are used to understand cardiac anatomy and to better characterize supply-demand mismatch. CT segmentation algorithms have even been FDA-cleared for coronary artery visualization 66 .
There are many example applications. DL can be trained on a large database of echocardiographic studies and surpass the performance of board-certified echocardiographers in view classification 67 . Computational DL pipelines can assess hypertrophic cardiomyopathy, cardiac amyloid, and pulmonary arterial hypertension 68 . EchoNet 69 —a deep learning model that can recognize cardiac structures, estimate function, and predict systemic phenotypes that are not readily identifiable to human interpretation—has recently furthered the field.
To account for challenges around data access, data-efficient echocardiogram algorithms 70 have been developed, such as semi-supervised GANs that are effective at downstream tasks (e.g. predicting left ventricular hypertrophy). To counter the fact that most studies utilize privately held medical imaging datasets, 10,000 annotated echocardiogram videos were recently open-sourced 36 . Alongside this release, a video-based model, EchoNet-Dynamic 36 , was developed; it can estimate ejection fraction and assess cardiomyopathy, and was comprehensively evaluated against an external dataset and human experts.
Pathologists play a key role in cancer detection and treatment. Pathological analysis—based on visual inspection of tissue samples under microscope—is inherently subjective in nature. Differences in visual perception and clinical training can lead to inconsistencies in diagnostic and prognostic opinions 71 , 72 , 73 . Here, DL can support critical medical tasks, including diagnostics, prognostication of outcomes and treatment response, pathology segmentation, disease monitoring, and so forth.
Recent years have seen the adoption of sub-micron-resolution tissue scanners that capture gigapixel whole-slide images (WSI) 74 . This development, coupled with advances in CV, has led to research and commercialization activity in AI-driven digital histopathology 75 . This field has the potential to (i) overcome limitations of human visual perception and cognition by improving the efficiency and accuracy of routine tasks, (ii) develop new signatures of disease and therapy from morphological structures invisible to the human eye, and (iii) combine pathology with radiological, genomic, and proteomic measurements to improve diagnosis and prognosis 76 .
One thread of research has focused on automating the routine, time-consuming task of localization and quantification of morphological features. Examples include the detection and classification of cells, nuclei, and mitoses 77 , 78 , 79 , and the localization and segmentation of histological primitives such as nuclei, glands, ducts, and tumors 80 , 81 , 82 , 83 . These methods typically require expensive manual annotation of tissue components by pathologists as training data.
Another research avenue focuses on direct diagnostics 84 , 85 , 86 and prognostics 87 , 88 from WSI or tissue microarrays (TMA) for a variety of cancers—breast, prostate, lung cancer, etc. Studies have even shown that morphological features captured by a hematoxylin and eosin (H&E) stain are predictive of molecular biomarkers utilized in theragnosis 85 , 89 . While histopathology slides digitize into massive, data-rich gigapixel images, region-level annotations are sparse and expensive. To help overcome this challenge, the field has developed DL algorithms based on multiple-instance learning 90 that utilize slide-level “weak” annotations and exploit the sheer size of these images for improved performance.
The data abundance of this domain has further enabled tasks such as virtual staining 91 , in which models are trained to predict one type of image (e.g. a stained image) from another (e.g. a raw microscopy image). See Fig. 1b . Moving forward, AI algorithms that learn to perform diagnosis, prognosis, and theragnosis using digital pathology image archives and annotations readily available from electronic health records have the potential to transform the fields of pathology and oncology.
Dermatology
The key clinical tasks for DL in dermatology include lesion-specific differential diagnostics, finding concerning lesions amongst many benign lesions, and tracking lesion growth over time 92 . A series of works have demonstrated that CNNs can match the performance of board-certified dermatologists at distinguishing malignant from benign skin lesions 7 , 93 , 94 . These studies have sequentially tested increasing numbers of dermatologists (25 7 , 57 93 , and 157 94 ), consistently demonstrating sensitivity and specificity that match or even exceed physician levels. These studies were largely restricted to the binary classification task of discerning benign vs malignant cutaneous lesions, classifying either melanomas from nevi or carcinomas from seborrheic keratoses.
Recently, this line of work has expanded to encompass differential diagnostics across dozens of skin conditions 95 , including non-neoplastic lesions such as rashes and genetic conditions, and incorporating non-visual metadata (e.g. patient demographics) as classifier inputs 96 . These works have been catalyzed by open-access image repositories and AI challenges that encourage teams to compete on predetermined benchmarks 97 .
Incorporating these algorithms into clinical workflows would allow their utility to support other key tasks, including large-scale detection of malignancies on patients with many lesions, and tracking lesions across images in order to capture temporal features, such as growth and color changes. This area remains fairly unexplored, with initial works that jointly train CNNs to detect and track lesions 98 .
Ophthalmology
Ophthalmology has, in recent years, seen a significant uptick in AI efforts, with dozens of papers demonstrating clinical diagnostic and analytical capabilities that extend beyond current human capability 99 , 100 , 101 . The potential clinical impact is significant 102 , 103 —the portability of the machinery used to inspect the eye means that pop-up clinics and telemedicine could bring testing sites to underserved areas. The field depends largely on fundus imaging and optical coherence tomography (OCT) to diagnose and manage patients.
CNNs can accurately diagnose a number of conditions. Diabetic retinopathy—a condition in which blood vessels in the eyes of diabetic patients “leak” and can lead to blindness—has been extensively studied. CNNs consistently demonstrate physician-level grading from fundus photographs 104 , 105 , 106 , 107 , which has led to a recent US FDA-cleared system 108 . Similarly, they can diagnose or predict the progression of center-involved diabetic macular edema 109 , age-related macular degeneration 107 , 110 , glaucoma 107 , 111 , manifest visual field loss 112 , childhood blindness 113 , and others.
The eyes contain a number of non-human-interpretable features, indicative of meaningful medical information, that CNNs can pick up on. Remarkably, CNNs have been shown to classify a number of cardiovascular and diabetic risk factors from fundus photographs 114 , including age, gender, smoking status, hemoglobin A1c, body-mass index, systolic blood pressure, and diastolic blood pressure. CNNs can also detect signs of anemia 115 and chronic kidney disease 116 from fundus photographs. This presents an exciting opportunity for future AI studies predicting nonocular information from eye images, and could lead to a paradigm shift in care in which eye exams screen patients for both ocular and nonocular disease—something currently beyond the reach of human physicians.
Medical video
Surgical applications.
CV may provide significant utility in procedural fields such as surgery and endoscopy. Key clinical applications for deep learning include enhancing surgeon performance through real-time contextual awareness 117 , skills assessment, and training. Early studies have begun pursuing these objectives, primarily in video-based robotic and laparoscopic surgery—a number of works propose methods for detecting surgical tools and actions 118 , 119 , 120 , 121 , 122 , 123 , 124 . Some studies analyze tool movement or other cues to assess surgeon skill 119 , 121 , 123 , 124 , through established ratings such as the Global Operative Assessment of Laparoscopic Skills (GOALS) criteria for laparoscopic surgery 125 . Another line of work uses CV to recognize distinct phases of surgery during operations, towards developing context-aware computer assistance systems 126 , 127 . CV is also starting to emerge in open surgery settings 128 , of which there is a significant volume. The challenge here lies in the diversity of video capture viewpoints (e.g., head-mounted, side-view, and overhead cameras) and types of surgeries. For all types of surgical video, translating CV analysis into tools and applications that improve patient outcomes is a natural next direction of research.
Human activity
CV can recognize human activity in physical spaces, such as hospitals and clinics, for a range of “ambient intelligence” applications. Ambient intelligence refers to a continuous, non-invasive awareness of activity in a physical space that can provide clinicians, nurses, and other healthcare workers with assistance such as patient monitoring, automated documentation, and monitoring for protocol compliance (Fig. 3 ). In hospitals, for example, early works have demonstrated CV-based ambient intelligence in intensive care units to monitor for safety-critical behaviors such as hand hygiene activity 32 and patient mobilization 8 , 129 , 130 . CV has also been developed for the emergency department, to transcribe procedures performed during the resuscitation of a patient 131 , and for the operating room (OR), to recognize activities for workflow optimization 132 . At the hospital operations level, CV can be a scalable and detailed form of labor and resource measurement that improves resource allocation for optimal care 133 .
![Figure 3](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-020-00376-2/MediaObjects/41746_2020_376_Fig3_HTML.png)
Computer vision coupled with sensors and video streams enables a number of safety applications in clinical and home settings, enabling healthcare providers to scale their ability to monitor patients. Primarily created using models for fine-grained activity recognition, applications may include patient monitoring in ICUs, proper hand hygiene and physical action protocols in hospitals and clinics, anomalous event detection, and others.
Outside of hospitals, ambient intelligence can increase access to healthcare. For instance, it could enable at-risk seniors to live independently at home by monitoring for safety and abnormalities in daily activities (e.g. detecting falls, which are particularly dangerous for the elderly 134 , 135 ), assisted living, and physiological measurement. Related work 136 , 137 , 138 has targeted broader categories of daily activity: recognizing and computing long-term descriptive analytics of activities such as sleeping, walking, and sitting can detect clinically meaningful changes or anomalies 136 . To ensure patient privacy, researchers have developed CV algorithms that work with thermal video data 136 . Another application area of CV is assisted living and rehabilitation, such as continuous sign language recognition to assist people with communication difficulties 139 and the monitoring of physiotherapy exercises for stroke rehabilitation 140 . CV also offers potential as a tool for remote physiological measurement: for instance, systems could analyze heart and breathing rates from video 141 . As telemedicine visits increase in frequency, CV could play a role in patient triaging, particularly in times of high demand such as the COVID-19 pandemic 142 . CV-based ambient intelligence technologies offer a wide range of opportunities for increased access to quality care. However, new ethical and legal questions will arise 143 in the design of these technologies.
Clinical deployment
As medical AI advances into the clinic 144 , it will simultaneously have the power to do great good for society, and to potentially exacerbate long-standing inequalities and perpetuate errors in medicine. If done properly and ethically, medical AI can become a flywheel for more equitable care—the more it is used, the more data it acquires, the more accurate and general it becomes. The key is in understanding the data that the models are built on and the environment in which they are deployed. Here, we present four key considerations when applying ML technologies in healthcare: assessment of data, planning for model limitations, community participation, and trust building.
Data quality largely determines model quality; identifying inequities in the data and taking them into account will lead towards more equitable healthcare. Procuring the right datasets may depend on running human-in-the-loop programs or broad-reaching data collection techniques. There are a number of methods that aim to remove bias in data. Individual-level bias can be addressed via expert discussion 145 and labeling adjudication 146 . Population-level bias can be addressed via missing data supplements and distributional shifts. International multi-institutional evaluation is a robust method to determine generalizability of models across diverse populations, medical equipment, resource settings, and practice patterns. In addition, using multi-task learning 147 to train models to perform a variety of tasks rather than one narrowly defined task, such as multi-cancer detection from histopathology images 148 , makes them more generally useful and often more robust.
Transparent reporting can reveal potential weaknesses and help address model limitations. Guardrails to protect against possible worst-case scenarios—such as minority dismissal or automation bias—must be put in place. It is insufficient to report and be satisfied with strong performance measures on general datasets when delivering care for patients—there should be an understanding of the specific instances in which the model fails. One technique is to assess demographic performance in combination with saliency maps 149 , which visualize what the model pays attention to, and to check for potential biases. For instance, when using deep learning to develop a differential diagnosis for skin diseases 95 , researchers examined model performance across Fitzpatrick skin types and other demographic information to determine patient types for which there were insufficient examples and to inform future data collection. Further, they used saliency masks to verify that the model was informed by skin abnormalities and not skin type. See Fig. 4 .
![Figure 4](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-020-00376-2/MediaObjects/41746_2020_376_Fig4_HTML.png)
a Example graphic of biased training data in dermatology. AIs trained primarily on lighter skin tones may not generalize as well when tested on darker skin 157 . Models require diverse training datasets for maximal generalizability (e.g. 95 ). b Gradient Masks project the model’s attention onto the original input image, allowing practitioners to visually confirm regions that most influence predictions. Panel was reproduced from ref. 95 with permission.
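Saliency of the kind shown in Fig. 4 ranks input pixels by how sensitive the model's output is to them. A minimal numerical-gradient version (a toy sketch; real systems backpropagate through the network rather than using finite differences):

```python
import numpy as np

def saliency_map(model, x, eps=1e-4):
    """Numerical input-gradient saliency: |d score / d pixel|.
    High values mark the pixels the prediction depends on most."""
    base = model(x)
    sal = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        xp = x.copy()
        xp[idx] += eps                       # nudge one pixel
        sal[idx] = abs(model(xp) - base) / eps
    return sal

# Toy 'model' that only looks at the central 2x2 patch of a 4x4 image.
model = lambda img: img[1:3, 1:3].sum()
x = np.ones((4, 4))
sal = saliency_map(model, x)
# Saliency is ~1 inside the patch the model attends to, ~0 elsewhere --
# the kind of check used to verify a classifier looks at the lesion,
# not the surrounding skin.
```

Overlaying such a map on the input image is what lets practitioners visually confirm which regions drive a prediction.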
A known limitation of ML is its performance on out-of-distribution data: samples unlike any seen during model training. Progress has been made on out-of-distribution detection 150 and on confidence intervals that help detect anomalies. Additionally, methods are being developed to quantify the uncertainty 151 around model outputs. This is especially critical for patient-specific predictions that impact safety.
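A simple baseline for the out-of-distribution detection mentioned above is the maximum-softmax-probability score: a classifier that is not confident in any class is treated as seeing something unlike its training data (an illustrative sketch with made-up logits):

```python
import numpy as np

def msp_ood_score(logits):
    """1 - max softmax probability: higher values suggest the input
    may be out-of-distribution."""
    z = logits - logits.max()          # stabilize the exponentials
    probs = np.exp(z) / np.exp(z).sum()
    return 1.0 - probs.max()

in_dist = np.array([8.0, 0.5, 0.1])   # confident, familiar-looking input
ood = np.array([1.1, 1.0, 0.9])       # near-uniform logits: model unsure
# The unfamiliar input earns a much higher score and could be routed
# to a human reviewer instead of an automated prediction.
```

Thresholding such a score is one way to keep unsafe, low-confidence predictions out of automated clinical workflows.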
Community participation—from patients, physicians, computer scientists, and other relevant stakeholders—is paramount to successful deployment. It has helped identify structural drivers of racial bias in health diagnostics—particularly in discovering bias in datasets and identifying demographics for which models fail 152 . User-centered evaluations are a valuable tool in ensuring a system’s usability and fit into the real world. What is the best way to present a model’s output to facilitate clinical decision making? How should a mobile app system be deployed in resource-constrained environments, such as areas with intermittent connectivity? For example, when launching ML-powered diabetic retinopathy models in Thailand and India, researchers noticed that model performance was impacted by socioeconomic factors 38 , and determined that where a model is most useful may not be where the model was developed. Ophthalmology models may need to be deployed in endocrinology care, rather than eye centers, due to access issues in the specific local environment. Another effective tool for building physician trust in AI results is side-by-side deployment of ML models with existing workflows (e.g. manual grading 16 ). See Fig. 5 . Without question, AI models will require rigorous evaluation through clinical trials to gauge safety and effectiveness. Excitingly, AI and CV can also help support clinical trials 153 , 154 through a number of applications—including patient selection, tumor tracking, and adverse event detection—creating an ecosystem in which AI can help design safe AI.
![Figure 5](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41746-020-00376-2/MediaObjects/41746_2020_376_Fig5_HTML.png)
An example workflow showing the positive compounding effect of AI-enhanced workflows, and the resultant trust that can be built. AI predictions provide immediate value to physicians, and improve over time as bigger datasets are collected.
Trust in AI for healthcare is fundamental to its adoption 155 , both by clinical teams and by patients. The foundation of clinical trust will come in large part from rigorous prospective trials that validate AI algorithms in real-world clinical environments. These environments incorporate human and social responses, which can be hard to predict and control, but which AI technologies must account for. Whereas the randomness and human element of clinical environments are impossible to capture in retrospective studies, prospective trials that best reflect clinical practice will shift the conversation towards measurable benefits in real deployments. Here, AI interpretability will be paramount—predictive models will need the ability to describe why specific factors about the patient or environment led them to their predictions.
In addition to clinical trust, patient trust—particularly around privacy concerns—must be earned. One significant area of need is next-generation regulations that account for advances in privacy-preserving techniques. ML typically does not require traditional identifiers to produce useful results, but there are meaningful signals in data that can be considered sensitive. To unlock insights from these sensitive data types, the evolution of privacy-preserving techniques must continue, and further advances need to be made in fields such as federated learning and federated analytics.
Each technological wave affords us a chance to reshape our future. In this case, artificial intelligence, deep learning, and computer vision represent an opportunity to make healthcare far more accessible, equitable, accurate, and inclusive than it has ever been.
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Szeliski, R. Computer Vision: Algorithms and Applications (Springer Science & Business Media, 2010).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 , 436–444 (2015).
Sanders, J. & Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming (Addison-Wesley Professional, 2010).
Deng, J. et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25 , 24–29 (2019).
Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25 , 44–56 (2019).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 , 115–118 (2017).
Yeung, S. et al. A computer vision system for deep learning-based detection of patient mobilization activities in the ICU. NPJ Digit Med. 2 , 11 (2019).
Russakovsky, O. et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 115 , 211–252 (2015).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. in Advances in Neural Information Processing Systems 25 (eds Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q.) 1097–1105 (Curran Associates, Inc., 2012).
Sermanet, P. et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. Preprint at https://arxiv.org/abs/1312.6229 (2013).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at https://arxiv.org/abs/1409.1556 (2014).
Szegedy, C. et al. Going deeper with convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1–9 (2015).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
Gebru, T., Hoffman, J. & Fei-Fei, L. Fine-grained recognition in the wild: a multi-task domain adaptation approach. In 2017 IEEE International Conference on Computer Vision (ICCV) 1358–1367 (IEEE, 2017).
Gulshan, V. et al. Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. https://doi.org/10.1001/jamaophthalmol.2019.2004 (2019).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention 234–241 (Springer, Cham, 2015).
Isensee, F. et al. nnU-Net: self-adapting framework for U-Net-based medical image segmentation. Preprint at https://arxiv.org/abs/1809.10486 (2018).
LeCun, Y. & Bengio, Y. in The Handbook of Brain Theory and Neural Networks 255–258 (MIT Press, 1998).
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V. & Le, Q. V. AutoAugment: learning augmentation policies from data. Preprint at https://arxiv.org/abs/1805.09501 (2018).
Goodfellow, I. et al. Generative adversarial nets. In Advances in Neural Information Processing Systems 2672–2680 (2014).
Ørting, S. et al. A survey of Crowdsourcing in medical image analysis. Preprint at https://arxiv.org/abs/1902.09159 (2019).
Créquit, P., Mansouri, G., Benchoufi, M., Vivot, A. & Ravaud, P. Mapping of Crowdsourcing in health: systematic review. J. Med. Internet Res. 20 , e187 (2018).
Jing, L. & Tian, Y. in IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE, 2020).
McMahan, B., Moore, E., Ramage, D., Hampson, S. & y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics 1273–1282 (PMLR, 2017).
Karpathy, A. & Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3128–3137 (IEEE, 2015).
Lv, D. et al. Research on the technology of LIDAR data processing. In 2017 First International Conference on Electronics Instrumentation Information Systems (EIIS) 1–5 (IEEE, 2017).
Lillo, I., Niebles, J. C. & Soto, A. Sparse composition of body poses and atomic actions for human activity recognition in RGB-D videos. Image Vis. Comput. 59 , 63–75 (2017).
Haque, A. et al. Towards vision-based smart hospitals: a system for tracking and monitoring hand hygiene compliance. In Proceedings of the 2nd Machine Learning for Healthcare Conference , 68 , 75–87 (PMLR, 2017).
Heilbron, F. C., Escorcia, V., Ghanem, B. & Niebles, J. C. ActivityNet: a large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 961–970 (IEEE, 2015).
Liu, Y. et al. Learning to describe scenes with programs. In ICLR (Open Access, 2019).
Singh, A. et al. Automatic detection of hand hygiene using computer vision technology. J. Am. Med. Inform. Assoc. 27 , 1316–1320 (2020).
Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42 , 60–88 (2017).
Maron, O. & Lozano-Pérez, T. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems 10 (eds Jordan, M. I., Kearns, M. J. & Solla, S. A.) 570–576 (MIT Press, 1998).
Singh, S. P. et al. 3D deep learning on medical images: a review. Sensors 20 , https://doi.org/10.3390/s20185097 (2020).
Ouyang, D. et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 580 , 252–256 (2020).
Benjamens, S., Dhunnoo, P. & Meskó, B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. NPJ Digit. Med. 3 , 118 (2020).
Beede, E. et al. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In Proc. 2020 CHI Conference on Human Factors in Computing Systems 1–12 (Association for Computing Machinery, 2020).
Viz.ai Granted Medicare New Technology Add-on Payment. PR Newswire https://www.prnewswire.com/news-releases/vizai-granted-medicare-new-technology-add-on-payment-301123603.html (2020).
Crowson, M. G. et al. A contemporary review of machine learning in otolaryngology-head and neck surgery. Laryngoscope 130 , 45–51 (2020).
Livingstone, D., Talai, A. S., Chau, J. & Forkert, N. D. Building an Otoscopic screening prototype tool using deep learning. J. Otolaryngol. Head. Neck Surg. 48 , 66 (2019).
Chen, P.-H. C. et al. An augmented reality microscope with real-time artificial intelligence integration for cancer diagnosis. Nat. Med. 25 , 1453–1457 (2019).
Gunčar, G. et al. An application of machine learning to haematological diagnosis. Sci. Rep. 8 , 411 (2018).
Article PubMed PubMed Central CAS Google Scholar
Alam, M. M. & Islam, M. T. Machine learning approach of automatic identification and counting of blood cells. Health. Technol. Lett. 6 , 103–108 (2019).
El Hajjar, A. & Rey, J.-F. Artificial intelligence in gastrointestinal endoscopy: general overview. Chin. Med. J. 133 , 326–334 (2020).
Horie, Y. et al. Diagnostic outcomes of esophageal cancer by artificial intelligence using convolutional neural networks. Gastrointest. Endosc. 89 , 25–32 (2019).
Hirasawa, T. et al. Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images. Gastric Cancer 21 , 653–660 (2018).
Kubota, K., Kuroda, J., Yoshida, M., Ohta, K. & Kitajima, M. Medical image analysis: computer-aided diagnosis of gastric cancer invasion on endoscopic images. Surg. Endosc. 26 , 1485–1489 (2012).
Itoh, T., Kawahira, H., Nakashima, H. & Yata, N. Deep learning analyzes Helicobacter pylori infection by upper gastrointestinal endoscopy images. Endosc. Int Open 6 , E139–E144 (2018).
He, J.-Y., Wu, X., Jiang, Y.-G., Peng, Q. & Jain, R. Hookworm detection in wireless capsule endoscopy images with deep learning. IEEE Trans. Image Process. 27 , 2379–2392 (2018).
Park, S.-M. et al. A mountable toilet system for personalized health monitoring via the analysis of excreta. Nat. Biomed. Eng. 4 , 624–635 (2020).
VerMilyea, M. et al. Development of an artificial intelligence-based assessment model for prediction of embryo viability using static images captured by optical light microscopy during IVF. Hum. Reprod. 35 , 770–784 (2020).
Choy, G. et al. Current applications and future impact of machine learning in radiology. Radiology 288 , 318–328 (2018).
Saba, L. et al. The present and future of deep learning in radiology. Eur. J. Radiol. 114 , 14–24 (2019).
Mazurowski, M. A., Buda, M., Saha, A. & Bashir, M. R. Deep learning in radiology: an overview of the concepts and a survey of the state of the art with focus on MRI. J. Magn. Reson. Imaging 49 , 939–954 (2019).
Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6 , 317 (2019).
Irvin, J. et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proc. of the AAAI Conference on Artificial Intelligence Vol. 33, 590–597 (2019).
Wang, X. et al. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervisedclassification and localization of common thorax diseases. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2097–2106 (2017).
Chilamkurthy, S. et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet 392 , 2388–2396 (2018).
Weston, A. D. et al. Automated abdominal segmentation of CT scans for body composition analysis using deep learning. Radiology 290 , 669–679 (2019).
Ding, J., Li, A., Hu, Z. & Wang, L. in Medical Image Computing and Computer Assisted Intervention—MICCAI 2017 559–567 (Springer International Publishing, 2017).
Tan, L. K., Liew, Y. M., Lim, E. & McLaughlin, R. A. Convolutional neural network regression for short-axis left ventricle segmentation in cardiac cine MR sequences. Med. Image Anal. 39 , 78–86 (2017).
Zhang, J. et al. Viral pneumonia screening on chest X-ray images using confidence-aware anomaly detection. Preprint at https://arxiv.org/abs/2003.12338 (2020).
Zhang, X., Feng, C., Wang, A., Yang, L. & Hao, Y. CT super-resolution using multiple dense residual block based GAN. J. VLSI Signal Process. Syst. Signal Image Video Technol. , https://doi.org/10.1007/s11760-020-01790-5 (2020).
Papolos, A., Narula, J., Bavishi, C., Chaudhry, F. A. & Sengupta, P. P. U. S. Hospital use of echocardiography: insights from the nationwide inpatient sample. J. Am. Coll. Cardiol. 67 , 502–511 (2016).
HeartFlowNXT—HeartFlow Analysis of Coronary Blood Flow Using Coronary CT Angiography—Study Results—ClinicalTrials.gov. https://clinicaltrials.gov/ct2/show/results/NCT01757678 .
Madani, A., Arnaout, R., Mofrad, M. & Arnaout, R. Fast and accurate view classification of echocardiograms using deep learning. NPJ Digit. Med. 1 , 6 (2018).
Zhang, J. et al. Fully automated echocardiogram interpretation in clinical practice. Circulation 138 , 1623–1635 (2018).
Ghorbani, A. et al. Deep learning interpretation of echocardiograms. NPJ Digit. Med. 3 , 10 (2020).
Madani, A., Ong, J. R., Tibrewal, A. & Mofrad, M. R. K. Deep echocardiography: data-efficient supervised and semi-supervised deep learning towards automated diagnosis of cardiac disease. NPJ Digit. Med. 1 , 59 (2018).
Perkins, C., Balma, D. & Garcia, R. Members of the Consensus Group & Susan G. Komen for the Cure. Why current breast pathology practices must be evaluated. A Susan G. Komen for the Cure white paper: June 2006. Breast J. 13 , 443–447 (2007).
Brimo, F., Schultz, L. & Epstein, J. I. The value of mandatory second opinion pathology review of prostate needle biopsy interpretation before radical prostatectomy. J. Urol. 184 , 126–130 (2010).
Elmore, J. G. et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA 313 , 1122–1132 (2015).
Evans, A. J. et al. US food and drug administration approval of whole slide imaging for primary diagnosis: a key milestone is reached and new questions are raised. Arch. Pathol. Lab. Med. 142 , 1383–1387 (2018).
Srinidhi, C. L., Ciga, O. & Martel, A. L. Deep neural network models for computational histopathology: A survey. Medical Image Analysis . p. 101813 (2020).
Bera, K., Schalper, K. A., Rimm, D. L., Velcheti, V. & Madabhushi, A. Artificial intelligence in digital pathology - new tools for diagnosis and precision oncology. Nat. Rev. Clin. Oncol. 16 , 703–715 (2019).
Cireşan, D. C., Giusti, A., Gambardella, L. M. & Schmidhuber, J. in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013 411–418 (Springer Berlin Heidelberg, 2013).
Wang, H. et al. Mitosis detection in breast cancer pathology images by combining handcrafted and convolutional neural network features. J. Med Imaging (Bellingham) 1 , 034003 (2014).
Kashif, M. N., Ahmed Raza, S. E., Sirinukunwattana, K., Arif, M. & Rajpoot, N. Handcrafted features with convolutional neural networks for detection of tumor cells in histology images. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI) 1029–1032 (IEEE, 2016).
Wang, D., Khosla, A., Gargeya, R., Irshad, H. & Beck, A. H. Deep learning for identifying metastatic breast cancer. Preprint at https://arxiv.org/abs/1606.05718 (2016).
BenTaieb, A. & Hamarneh, G. in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2016 460–468 (Springer International Publishing, 2016).
Chen, H. et al. DCAN: Deep contour-aware networks for object instance segmentation from histology images. Med. Image Anal. 36 , 135–146 (2017).
Xu, Y. et al. Gland instance segmentation using deep multichannel neural networks. IEEE Trans. Biomed. Eng. 64 , 2901–2912 (2017).
Litjens, G. et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci. Rep. 6 , 26286 (2016).
Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24 , 1559–1567 (2018).
Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25 , 1301–1309 (2019).
Mobadersany, P. et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc. Natl Acad. Sci. U. S. A. 115 , E2970–E2979 (2018).
Courtiol, P. et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat. Med. 25 , 1519–1525 (2019).
Rawat, R. R. et al. Deep learned tissue ‘fingerprints’ classify breast cancers by ER/PR/Her2 status from H&E images. Sci. Rep. 10 , 7275 (2020).
Dietterich, T. G., Lathrop, R. H. & Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89 , 31–71 (1997).
Christiansen, E. M. et al. In silico labeling: predicting fluorescent labels in unlabeled images. Cell 173 , 792–803.e19 (2018).
Esteva, A. & Topol, E. Can skin cancer diagnosis be transformed by AI? Lancet 394 , 1795 (2019).
Haenssle, H. A. et al. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann. Oncol. 29 , 1836–1842 (2018).
Brinker, T. J. et al. Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task. Eur. J. Cancer 113 , 47–54 (2019).
Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26 , 900–908 (2020).
Yap, J., Yolland, W. & Tschandl, P. Multimodal skin lesion classification using deep learning. Exp. Dermatol. 27 , 1261–1267 (2018).
Marchetti, M. A. et al. Results of the 2016 International Skin Imaging Collaboration International Symposium on Biomedical Imaging challenge: Comparison of the accuracy of computer algorithms to dermatologists for the diagnosis of melanoma from dermoscopic images. J. Am. Acad. Dermatol. 78 , 270–277 (2018).
Li, Y. et al. Skin cancer detection and tracking using data synthesis and deep learning. Preprint at https://arxiv.org/abs/1612.01074 (2016).
Ting, D. S. W. et al. Artificial intelligence and deep learning in ophthalmology. Br. J. Ophthalmol. 103 , 167–175 (2019).
Keane, P. A. & Topol, E. J. With an eye to AI and autonomous diagnosis. NPJ Digit. Med. 1 , 40 (2018).
Keane, P. & Topol, E. Reinventing the eye exam. Lancet 394 , 2141 (2019).
De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24 , 1342–1350 (2018).
Kern, C. et al. Implementation of a cloud-based referral platform in ophthalmology: making telemedicine services a reality in eye care. Br. J. Ophthalmol. 104 , 312–317 (2020).
Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 , 2402–2410 (2016).
Raumviboonsuk, P. et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. NPJ Digit Med. 2 , 25 (2019).
Abràmoff, M. D. et al. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest. Ophthalmol. Vis. Sci. 57 , 5200–5206 (2016).
Ting, D. S. W. et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA 318 , 2211–2223 (2017).
Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit. Med. 1 , 39 (2018).
Varadarajan, A. V. et al. Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning. Nat. Commun. 11 , 130 (2020).
Yim, J. et al. Predicting conversion to wet age-related macular degeneration using deep learning. Nat. Med. 26 , 892–899 (2020).
Li, Z. et al. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology 125 , 1199–1206 (2018).
Yousefi, S. et al. Detection of longitudinal visual field progression in glaucoma using machine learning. Am. J. Ophthalmol. 193 , 71–79 (2018).
Brown, J. M. et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 136 , 803–810 (2018).
Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng. 2 , 158–164 (2018).
Mitani, A. et al. Detection of anaemia from retinal fundus images via deep learning. Nat. Biomed. Eng. 4 , 18–27 (2020).
Sabanayagam, C. et al. A deep learning algorithm to detect chronic kidney disease from retinal photographs in community-based populations. Lancet Digital Health 2 , e295–e302 (2020).
Maier-Hein, L. et al. Surgical data science for next-generation interventions. Nat. Biomed. Eng. 1 , 691–696 (2017).
García-Peraza-Herrera, L. C. et al. ToolNet: Holistically-nested real-time segmentation of robotic surgical tools. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 5717–5722 (IEEE, 2017).
Zia, A., Sharma, Y., Bettadapura, V., Sarin, E. L. & Essa, I. Video and accelerometer-based motion analysis for automated surgical skills assessment. Int. J. Comput. Assist. Radiol. Surg. 13 , 443–455 (2018).
Sarikaya, D., Corso, J. J. & Guru, K. A. Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE Trans. Med. Imaging 36 , 1542–1549 (2017).
Jin, A. et al. Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) 691–699 (IEEE, 2018).
Twinanda, A. P. et al. EndoNet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Trans. Med. Imaging 36 , 86–97 (2017).
Lin, H. C., Shafran, I., Yuh, D. & Hager, G. D. Towards automatic skill evaluation: detection and segmentation of robot-assisted surgical motions. Comput. Aided Surg. 11 , 220–230 (2006).
Khalid, S., Goldenberg, M., Grantcharov, T., Taati, B. & Rudzicz, F. Evaluation of deep learning models for identifying surgical actions and measuring performance. JAMA Netw. Open 3 , e201664 (2020).
Vassiliou, M. C. et al. A global assessment tool for evaluation of intraoperative laparoscopic skills. Am. J. Surg. 190 , 107–113 (2005).
Jin, Y. et al. SV-RCNet: Workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans. Med. Imaging 37 , 1114–1126 (2018).
Padoy, N. et al. Statistical modeling and recognition of surgical workflow. Med. Image Anal. 16 , 632–641 (2012).
Azari, D. P. et al. Modeling surgical technical skill using expert assessment for automated computer rating. Ann. Surg. 269 , 574–581 (2019).
Ma, A. J. et al. Measuring patient mobility in the ICU using a novel noninvasive sensor. Crit. Care Med. 45 , 630–636 (2017).
Davoudi, A. et al. Intelligent ICU for autonomous patient monitoring using pervasive sensing and deep learning. Sci. Rep. 9 , 8020 (2019).
Chakraborty, I., Elgammal, A. & Burd, R. S. Video based activity recognition in trauma resuscitation. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) 1–8 (IEEE, 2013).
Twinanda, A. P., Alkan, E. O., Gangi, A., de Mathelin, M. & Padoy, N. Data-driven spatio-temporal RGBD feature encoding for action recognition in operating rooms. Int. J. Comput. Assist. Radiol. Surg. 10 , 737–747 (2015).
Kaplan, R. S. & Porter, M. E. How to solve the cost crisis in health care. Harv. Bus. Rev. 89 , 46–52 (2011). 54, 56–61 passim.
PubMed Google Scholar
Wang, S., Chen, L., Zhou, Z., Sun, X. & Dong, J. Human fall detection in surveillance video based on PCANet. Multimed. Tools Appl. 75 , 11603–11613 (2016).
Núñez-Marcos, A., Azkune, G. & Arganda-Carreras, I. Vision-Based Fall Detection with Convolutional Neural Networks. In Proc. International Wireless Communications and Mobile Computing Conference 2017 (ACM, 2017).
Luo, Z. et al. Computer vision-based descriptive analytics of seniors’ daily activities for long-term health monitoring. In Machine Learning for Healthcare (MLHC) 2 (JMLR, 2018).
Zhang, C. & Tian, Y. RGB-D camera-based daily living activity recognition. J. Comput. Vis. image Process. 2 , 12 (2012).
Pirsiavash, H. & Ramanan, D. Detecting activities of daily living in first-person camera views. In 2012 IEEE Conference on Computer Vision and Pattern Recognition 2847–2854 (IEEE, 2012).
Kishore, P. V. V., Prasad, M. V. D., Kumar, D. A. & Sastry, A. S. C. S. Optical flow hand tracking and active contour hand shape features for continuous sign language recognition with artificial neural networks. In 2016 IEEE 6th International Conference on Advanced Computing (IACC) 346–351 (IEEE, 2016).
Webster, D. & Celik, O. Systematic review of Kinect applications in elderly care and stroke rehabilitation. J. Neuroeng. Rehabil. 11 , 108 (2014).
Chen, W. & McDuff, D. Deepphys: video-based physiological measurement using convolutional attention networks. In Proc. European Conference on Computer Vision (ECCV) 349–365 (Springer Science+Business Media, 2018).
Moazzami, B., Razavi-Khorasani, N., Dooghaie Moghadam, A., Farokhi, E. & Rezaei, N. COVID-19 and telemedicine: Immediate action required for maintaining healthcare providers well-being. J. Clin. Virol. 126 , 104345 (2020).
Gerke, S., Yeung, S. & Cohen, I. G. Ethical and legal aspects of ambient intelligence in hospitals. JAMA https://doi.org/10.1001/jama.2019.21699 (2020).
Young, A. T., Xiong, M., Pfau, J., Keiser, M. J. & Wei, M. L. Artificial intelligence in dermatology: a primer. J. Invest. Dermatol. 140 , 1504–1512 (2020).
Schaekermann, M., Cai, C. J., Huang, A. E. & Sayres, R. Expert discussions improve comprehension of difficult cases in medical image assessment. In Proc. 2020 CHI Conference on Human Factors in Computing Systems 1–13 (Association for Computing Machinery, 2020).
Schaekermann, M. et al. Remote tool-based adjudication for grading diabetic retinopathy. Transl. Vis. Sci. Technol. 8 , 40 (2019).
Caruana, R. Multitask learning. Mach. Learn. 28 , 41–75 (1997).
Wulczyn, E. et al. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS ONE 15 , e0233678 (2020).
Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at https://arxiv.org/abs/1312.6034 (2013).
Ren, J. et al. in Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 14707–14718 (Curran Associates, Inc., 2019).
Dusenberry, M. W. et al. Analyzing the role of model uncertainty for electronic health records. In Proc. ACM Conference on Health, Inference, and Learning 204–213 (Association for Computing Machinery, 2020).
Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366 , 447–453 (2019).
Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI Extension. BMJ 370 , m3164 (2020).
Rivera, S. C. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI Extension. BMJ 370 , m3210 (2020).
Asan, O., Bayrak, A. E. & Choudhury, A. Artificial intelligence and human trust in healthcare: focus on clinicians. J. Med. Internet Res. 22 , e15154 (2020).
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577 , 89–94 (2020).
Kamulegeya, L. H. et al. Using artificial intelligence on dermatology conditions in Uganda: a case for diversity in training data sets for machine learning. https://doi.org/10.1101/826057 (2019).
Acknowledgements
The authors would like to thank Melvin Gruesbeck for the design of the figures, and Elise Kleeman for editorial review.
Author information
These authors contributed equally: Katherine Chou, Serena Yeung, Nikhil Naik, Ali Madani, Ali Mottaghi.
Authors and Affiliations
Salesforce AI Research, San Francisco, CA, USA
Andre Esteva, Nikhil Naik, Ali Madani & Richard Socher
Google Research, Mountain View, CA, USA
Katherine Chou, Yun Liu & Jeff Dean
Stanford University, Stanford, CA, USA
Serena Yeung & Ali Mottaghi
Scripps Research Translational Institute, La Jolla, CA, USA
Contributions
A.E. organized the authors, synthesized the writing, and led the abstract, introduction, computer vision, dermatology, and ophthalmology sections. S.Y. led the medical video section. K.C. led the clinical deployment section. N.N. contributed the pathology section, Ali Madani contributed the cardiology section, Ali Mottaghi contributed to the subsections of the medical video section, and E.T. and J.D. contributed to the clinical deployment section. Y.L. significantly contributed to the figures and writing style. All authors contributed to the overall writing and storyline. E.T., J.D., and R.S. oversaw and advised the work.
Corresponding author
Correspondence to Andre Esteva .
Ethics declarations
Competing interests.
A.E., N.N., Ali Madani, and R.S. are or were employees of Salesforce.com and own Salesforce stock. K.C., Y.L., and J.D. are employees of Google, L.L.C. and own Alphabet stock. S.Y., Ali Mottaghi and E.T. have no competing interests to declare.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .
About this article
Cite this article.
Esteva, A., Chou, K., Yeung, S. et al. Deep learning-enabled medical computer vision. npj Digit. Med. 4, 5 (2021). https://doi.org/10.1038/s41746-020-00376-2
Received : 17 August 2020
Accepted : 01 December 2020
Published : 08 January 2021
DOI : https://doi.org/10.1038/s41746-020-00376-2
A curated list of the top 10 computer vision papers in 2021 with video demos, articles, code, and paper references.
louisfb01/top-10-cv-papers-2021
The Top 10 Computer Vision Papers of 2021

The top 10 computer vision papers in 2021 with video demos, articles, code, and paper references.
While the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. Moreover, many important aspects were highlighted this year, such as ethics, bias, governance, and transparency. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful about which technologies we choose to apply.
"Science cannot tell us what we ought to do, only what we can do." - Jean-Paul Sartre, Being and Nothingness
Here are my top 10 picks of the most interesting research papers of the year in computer vision, in case you missed any of them. In short, it's a curated list of the latest breakthroughs in AI and CV, each with a clear video explanation, a link to a more in-depth article, and code (where applicable). Enjoy the read, and let me know if I missed any important papers in the comments, or by contacting me directly on LinkedIn!
The complete reference to each paper is listed at the end of this repository.
Maintainer: louisfb01
Subscribe to my newsletter - The latest updates in AI explained every week.
Feel free to message me any interesting paper I may have missed to add to this repository.
Tag me on Twitter @Whats_AI or LinkedIn @Louis (What's AI) Bouchard if you share the list!
Watch the 2021 CV rewind
Missed last year? Check this out: 2020: A Year Full of Amazing AI papers- A Review
👀 If you'd like to support my work and use W&B (for free) to track your ML experiments and make your work reproducible or collaborate with a team, you can try it out by following this guide ! Since most of the code here is PyTorch-based, we thought that a QuickStart guide for using W&B on PyTorch would be most interesting to share.
👉Follow this quick guide, use the same W&B lines in your code or any of the repos below, and have all your experiments automatically tracked in your W&B account! It doesn't take more than 5 minutes to set up and will change your life as it did for me! Here's a more advanced guide for using Hyperparameter Sweeps if interested :)
🙌 Thank you to Weights & Biases for sponsoring this repository and the work I've been doing, and thanks to any of you using this link and trying W&B!
If you are interested in AI research, here is another great repository for you:
A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.
2021: A Year Full of Amazing AI papers- A Review
The Full List
- DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
- Taming Transformers for High-Resolution Image Synthesis [2]
- Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows [3]
- Deep Nets: What Have They Ever Done for Vision? [Bonus]
- Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]
- Total Relighting: Learning to Relight Portraits for Background Replacement [5]
- Animating Pictures with Eulerian Motion Fields [6]
- CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [7]
- TimeLens: Event-based Video Frame Interpolation [8]
- (Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]
- CityNeRF: Building NeRF at City Scale [10]
Paper references
OpenAI successfully trained a network able to generate images from text captions. It is very similar to GPT-3 and Image GPT and produces amazing results.
- Short read: OpenAI’s DALL·E: Text-to-Image Generation Explained
- Paper: Zero-Shot Text-to-Image Generation
- Code: Code & more information for the discrete VAE used for DALL·E
TL;DR: They combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically-guided high-quality image synthesis.
- Short read: Combining the Transformers Expressivity with the CNNs Efficiency for High-Resolution Image Synthesis
- Paper: Taming Transformers for High-Resolution Image Synthesis
- Code: Taming Transformers
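At the heart of this approach is a vector-quantization step: the encoder's continuous latents are snapped to their nearest entries in a learned codebook, so a transformer can later model the image as a sequence of discrete codes. Here's a toy NumPy sketch of that lookup (the codebook values are made up for illustration, not learned):

```python
import numpy as np

def quantize(z, codebook):
    """Map each continuous latent vector to its nearest codebook entry (VQ step)."""
    # z: (N, D) latents, codebook: (K, D) embeddings -> (N, K) squared distances
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)          # index of the nearest code for each latent
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy 2-entry codebook
z = np.array([[0.1, -0.2], [0.9, 1.2]])          # two continuous latents
zq, idx = quantize(z, codebook)
print(idx)   # [0 1]
```

The transformer then only ever sees `idx`, a short sequence of integers, which is what makes high-resolution synthesis tractable.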
Will Transformers Replace CNNs in Computer Vision? In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.
- Short read: Will Transformers Replace CNNs in Computer Vision?
- Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Code: Click here for the code
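The key idea in Swin is to restrict self-attention to small non-overlapping windows, then cyclically shift the feature map so the next layer's windows straddle the previous borders. A minimal NumPy sketch of the partitioning and shifting (shapes are illustrative; the real model works on batched tensors and adds attention masks for the shifted windows):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    # -> (num_windows, ws, ws, C): attention runs independently inside each window
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, C)

def cyclic_shift(x, ws):
    """Roll the map by ws//2 so the next round of windows crosses old borders."""
    return np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))

feat = np.random.rand(8, 8, 32)                       # toy 8x8 map, 32 channels
wins = window_partition(feat, 4)                      # 4 windows of 4x4
shifted = window_partition(cyclic_shift(feat, 4), 4)  # shifted-window variant
print(wins.shape)   # (4, 4, 4, 32)
```

Alternating plain and shifted windows gives cross-window information flow at linear cost in image size, which is why Swin scales where full global attention does not.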
"I will openly share everything about deep nets for vision applications, their successes, and the limitations we have to address."
- Short read: What is the state of AI in computer vision?
- Paper: Deep nets: What have they ever done for vision?
Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]
The next step for view synthesis: Perpetual View Generation, where the goal is to take an image, fly into it, and explore the landscape!
- Short read: Infinite Nature: Fly into an image and explore the landscape
- Paper: Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image
Properly relight any portrait based on the lighting of the new background you add. Have you ever wanted to change the background of a picture while keeping it realistic? If you've already tried, you know it isn't simple. You can't just take a picture of yourself at home and swap the background for a beach; it looks bad and unrealistic, and anyone will say "that's photoshopped" in a second. For movies and professional videos, you need perfect lighting and artists to reproduce a high-quality image, and that's super expensive. There's no way you can do that with your own pictures. Or can you?
- Short read: Realistic Lighting on Different Backgrounds
- Paper: Total Relighting: Learning to Relight Portraits for Background Replacement
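Whatever the relighting network predicts, the final step is classic alpha-matte compositing of the relit foreground over the new background. A minimal NumPy sketch of that compositing equation (the matte and colors below are toy values, not model outputs):

```python
import numpy as np

def composite(foreground, background, alpha):
    """Standard alpha-matte compositing: out = a * F + (1 - a) * B."""
    if alpha.ndim == 2:
        alpha = alpha[..., None]    # broadcast the matte across color channels
    return alpha * foreground + (1.0 - alpha) * background

fg = np.full((4, 4, 3), 0.9)        # bright relit subject
bg = np.zeros((4, 4, 3))            # dark new background
a = np.zeros((4, 4))
a[1:3, 1:3] = 1.0                   # matte: subject occupies the center
out = composite(fg, bg, a)
print(out[2, 2], out[0, 0])         # [0.9 0.9 0.9] [0. 0. 0.]
```

The hard part the paper solves is producing a clean matte and a foreground relit to match the target background; the blend itself is this one line.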
If you’d like to read more research papers as well, I recommend you read my article where I share my best tips for finding and reading more research papers.
Animating Pictures with Eulerian Motion Fields [6]
This model takes a picture, understands which particles are supposed to be moving, and realistically animates them in an infinite loop while keeping the rest of the picture entirely still, creating amazing-looking videos like this one...
- Short read: Create Realistic Animated Looping Videos from Pictures
- Paper: Animating Pictures with Eulerian Motion Fields
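The core "Eulerian" trick is that a single static flow field assigns every pixel a velocity, and a particle is animated by repeatedly sampling that field at its current position. Here is a toy nearest-neighbour sketch of that advection loop, purely illustrative and not the paper's implementation:

```python
import numpy as np

def advect(pos, flow, steps):
    """Move a particle through a static (Eulerian) flow field.

    pos:  (row, col) starting position
    flow: (H, W, 2) array; flow[i, j] is the velocity at pixel (i, j)
    """
    h, w = flow.shape[:2]
    p = np.asarray(pos, dtype=float)
    for _ in range(steps):
        # sample the SAME static field at every step (nearest neighbour)
        i = int(np.clip(np.rint(p[0]), 0, h - 1))
        j = int(np.clip(np.rint(p[1]), 0, w - 1))
        p = p + flow[i, j]
    return p
```

Because the field never changes, the motion can be replayed indefinitely, which is what makes the looping videos possible.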
CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [7]
Using a modified GAN architecture, they can move objects in the image without affecting the background or the other objects!
- Short read: CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation
- Paper: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
TimeLens can understand the movement of the particles in between the frames of a video to reconstruct what really happened at a speed even our eyes cannot see. In fact, it achieves results that neither our smartphones nor any previous model could reach!
- Short read: How to Make Slow Motion Videos With AI!
- Paper: TimeLens: Event-based Video Frame Interpolation
Subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!
(Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]
Have you ever dreamed of taking the style of a picture, like this cool TikTok drawing style on the left, and applying it to a new picture of your choice? Well, I did, and it has never been easier to do. In fact, you can even achieve that from only text, and you can try it right now with this new method and its Google Colab notebook, available for everyone (see references). Simply take a picture of the style you want to copy, enter the text you want to generate, and this algorithm will generate a new picture out of it! Just look back at the results above; such a big step forward! The results are extremely impressive, especially considering that they were made from a single line of text!
- Short read: Text-to-Drawing Synthesis With Artistic Control | CLIPDraw & StyleCLIPDraw
- Paper (CLIPDraw): CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
- Paper (StyleCLIPDraw): StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis
- CLIPDraw Colab demo
- StyleCLIPDraw Colab demo
CityNeRF: Building NeRF at City Scale [10]
The model is called CityNeRF and builds on NeRF, which I previously covered on my channel. NeRF is one of the first models to use radiance fields and machine learning to construct 3D models out of images. But NeRF is not very efficient and works at a single scale. Here, CityNeRF is applied to satellite and ground-level images at the same time to produce various 3D model scales for any viewpoint. In simple words, they bring NeRF to city scale. But how?
- Short read: CityNeRF: 3D Modelling at City Scale!
- Paper: CityNeRF: Building NeRF at City Scale
- Click here for the code (will be released soon)
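For context, a NeRF-style model renders each pixel by compositing density and color samples along a camera ray. Below is a minimal sketch of that standard volume-rendering step (simplified textbook NeRF math; CityNeRF's multi-scale machinery is not shown):

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Alpha-composite samples along one camera ray, NeRF-style.

    sigmas: (S,) volume densities at the sampled depths
    colors: (S, 3) RGB colors at the sampled depths
    deltas: (S,) distances between consecutive samples
    """
    alpha = 1.0 - np.exp(-sigmas * deltas)  # opacity of each sample
    # transmittance: fraction of light surviving to reach each sample
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = alpha * trans
    return weights @ colors                 # final pixel color
```

A fully opaque sample near the camera blocks everything behind it, while empty space contributes nothing, which is how the learned density field turns into geometry.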
If you would like to read more papers and have a broader view, here is another great repository for you covering 2020: 2020: A Year Full of Amazing AI Papers – A Review.
[1] Ramesh, A., et al., “Zero-Shot Text-to-Image Generation”, 2021, arXiv:2102.12092.
[2] Esser et al., “Taming Transformers for High-Resolution Image Synthesis”, 2020.
[3] Liu, Z., et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, 2021, arXiv:2103.14030.
[bonus] Yuille, A.L. and Liu, C., “Deep Nets: What Have They Ever Done for Vision?”, International Journal of Computer Vision, 129(3), pp. 781–802, 2021, arXiv:1805.04025.
[4] Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., “Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image”, 2020, https://arxiv.org/pdf/2012.09855.pdf.
[5] Pandey et al., “Total Relighting: Learning to Relight Portraits for Background Replacement”, 2021, doi:10.1145/3450626.3459872, https://augmentedperception.github.io/total_relighting/total_relighting_paper.pdf.
[6] Holynski, A., et al., “Animating Pictures with Eulerian Motion Fields”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[7] Niemeyer, M. and Geiger, A., “GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields”, CVPR, 2021.
[8] Tulyakov, S.*, Gehrig, D.*, Georgoulis, S., Erbach, J., Gehrig, M., Li, Y. and Scaramuzza, D., “TimeLens: Event-based Video Frame Interpolation”, CVPR, Nashville, 2021, http://rpg.ifi.uzh.ch/docs/CVPR21_Gehrig.pdf.
[9] a) “CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”; b) Schaldenbrand, P., Liu, Z. and Oh, J., “StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis”, 2021.
[10] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B. and Lin, D., “CityNeRF: Building NeRF at City Scale”, 2021.
10 Cutting Edge Research Papers In Computer Vision & Image Generation
January 24, 2019 by Mariya Yao
![Computer Vision Research Papers](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/topbots_computer_vision_research_1600px_web.jpg)
UPDATE: We’ve also summarized the top 2019 and top 2020 Computer Vision research papers.
Ever since convolutional neural networks began outperforming humans in specific image recognition tasks, research in the field of computer vision has proceeded at breakneck pace.
The basic architecture of CNNs (or ConvNets) was developed in the 1980s. Yann LeCun improved upon the original design in 1989 by using backpropagation to train models to recognize handwritten digits.
We’ve come a long way since then.
In 2018, we saw novel architecture designs that improve upon performance benchmarks and also expand the range of media that machine learning models can analyze. We also saw a number of breakthroughs with media generation which enable photorealistic style transfer, high-resolution image generation, and video-to-video synthesis.
Due to the importance and prevalence of computer vision and image generation for applied and enterprise AI, we did feature some of the papers below in our previous article summarizing the top overall machine learning papers of 2018. Since you might not have read that previous piece, we chose to highlight the vision-related research again here.
We’ve done our best to summarize these papers correctly, but if we’ve made any mistakes, please contact us to request a fix. Special thanks also go to computer vision specialist Rebecca BurWei for generously offering her expertise in editing and revising drafts of this article.
If these summaries of scientific AI research papers are useful for you, you can subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries. We’re planning to release summaries of important papers in computer vision, reinforcement learning, and conversational AI in the next few weeks.
If you’d like to skip around, here are the papers we featured:
- Spherical CNNs
- Adversarial Examples that Fool both Computer Vision and Time-Limited Humans
- A Closed-form Solution to Photorealistic Image Stylization
- Group Normalization
- Taskonomy: Disentangling Task Transfer Learning
- Self-Attention Generative Adversarial Networks
- GANimation: Anatomically-aware Facial Animation from a Single Image
- Video-to-Video Synthesis
- Everybody Dance Now
- Large Scale GAN Training for High Fidelity Natural Image Synthesis
Important Computer Vision Research Papers of 2018
1. Spherical CNNs, by Taco S. Cohen, Mario Geiger, Jonas Koehler, and Max Welling
Original Abstract
Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. However, a number of problems of recent interest have created a demand for models that can analyze spherical images. Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective.
In this paper we introduce the building blocks for constructing spherical CNNs. We propose a definition for the spherical cross-correlation that is both expressive and rotation-equivariant. The spherical correlation satisfies a generalized Fourier theorem, which allows us to compute it efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs applied to 3D model recognition and atomization energy regression.
Our Summary
Omnidirectional cameras that are already used by cars, drones, and other robots capture a spherical image of their entire surroundings. We could analyze such spherical signals by projecting them to the plane and using CNNs. However, any planar projection of a spherical signal results in distortions. To overcome this problem, the group of researchers from the University of Amsterdam introduces the theory of spherical CNNs, the networks that can analyze spherical images without being fooled by distortions. The approach demonstrates its effectiveness for classifying 3D shapes and Spherical MNIST images as well as for molecular energy regression, an important problem in computational chemistry.
What’s the core idea of this paper?
- Planar projections of spherical signals result in significant distortions as some areas look larger or smaller than they really are.
- Traditional CNNs are ineffective for spherical images because as objects move around the sphere, they also appear to shrink and stretch (think maps where Greenland looks much bigger than it actually is).
- The solution is to use a spherical CNN which is robust to spherical rotations in the input data. By preserving the original shape of the input data, spherical CNNs treat all objects on the sphere equally without distortion.
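The distortion argument is easy to quantify: in an equirectangular (latitude-longitude) projection, the true surface area covered by a pixel shrinks with the cosine of the latitude, so the same convolution filter sees wildly different amounts of the sphere at the equator versus near the poles. A quick illustrative computation (not from the paper's code):

```python
import numpy as np

def row_area_weights(height):
    """Fraction of the sphere's surface covered by each pixel row of an
    equirectangular image of the given height."""
    # latitude at the center of each row, running from -pi/2 to pi/2
    lat = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2
    w = np.cos(lat)  # area per row is proportional to cos(latitude)
    return w / w.sum()

w = row_area_weights(180)  # one row per degree of latitude
```

An equator row covers over a hundred times the surface area of the row nearest the pole, which is exactly why translational weight sharing fails on such projections and a rotation-equivariant spherical correlation is needed instead.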
What’s the key achievement?
- Introducing a mathematical framework for building spherical CNNs.
- Providing easy-to-use, fast, and memory-efficient PyTorch code implementing these CNNs.
- Demonstrating the effectiveness of spherical CNNs for:
  - classification of Spherical MNIST images,
  - classification of 3D shapes,
  - molecular energy regression.
What does the AI community think?
- The paper won the Best Paper Award at ICLR 2018, one of the leading machine learning conferences.
What are future research areas?
- Development of a Steerable CNN for the sphere to analyze sections of vector bundles over the sphere (e.g., wind directions).
- Expanding the mathematical theory from 2D spheres to 3D point clouds for classification tasks that are invariant under reflections as well as rotations.
What are possible business applications?
- omnidirectional vision for drones, robots, and autonomous cars;
- molecular regression problems in computational chemistry;
- global weather and climate modeling.
Where can you get implementation code?
- The authors provide the original implementation for this research paper on GitHub.
2. Adversarial Examples that Fool both Computer Vision and Time-Limited Humans , by Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, Jascha Sohl-Dickstein
Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.
Google Brain researchers seek an answer to the question: can adversarial examples that are not model-specific, and that fool different computer vision models without access to their parameters and architectures, also fool time-limited humans? They leverage key ideas from machine learning, neuroscience, and psychophysics to create adversarial examples that do in fact impact human perception in a time-limited setting. Thus, the paper introduces a new class of illusions that are shared between machines and humans.
![Figure from the paper](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/02_Adversarial_2_web.jpg)
- As the first step, the researchers use black-box adversarial example construction techniques that create adversarial examples without access to the model’s architecture or parameters.
- They then approximate the initial processing of the human visual system by:
  - prepending each model with a retinal layer that pre-processes the input to incorporate some of the transformations performed by the human eye;
  - performing an eccentricity-dependent blurring of the image to approximate the input received by the visual cortex of human subjects through their retinal lattice.
- Classification decisions of humans are evaluated in a time-limited setting to detect even subtle effects in human perception.
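For intuition about how an adversarial perturbation is constructed at all, here is the classic white-box fast gradient sign method on a toy logistic classifier. The paper uses black-box transfer techniques rather than direct gradients of the target model, but the underlying idea of a small, loss-increasing perturbation is the same:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, w, b, y, eps):
    """Perturb input x to increase a logistic classifier's loss.

    Each coordinate moves by eps in the direction of the sign of the
    cross-entropy loss gradient (the 'fast gradient sign method').
    """
    p = sigmoid(w @ x + b)
    grad = (p - y) * w            # d(cross-entropy)/dx
    return x + eps * np.sign(grad)
```

Even a tiny eps reliably pushes the classifier's confidence in the true label down, which is the property that transfers so well across models.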
- Showing that adversarial examples that transfer across computer vision models do also successfully influence the perception of humans.
- Demonstrating the similarity between convolutional neural networks and the human visual system.
- The paper is widely discussed by the AI community. While most researchers are stunned by the results, some argue that we need a stricter definition of an adversarial image, because if humans classify the perturbed picture of a cat as a dog, then it’s probably already a dog, not a cat.
- Researching which techniques are crucial for the transfer of adversarial examples to humans (i.e., retinal preprocessing, model ensembling).
- Practitioners should consider the risk that imagery could be manipulated to cause human observers to have unusual reactions, because adversarial images can affect us below the horizon of awareness.
3. A Closed-form Solution to Photorealistic Image Stylization , by Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, Jan Kautz
Photorealistic image stylization concerns transferring style of a reference photo to a content photo with the constraint that the stylized photo should remain photorealistic. While several photorealistic image stylization methods exist, they tend to generate spatially inconsistent stylizations with noticeable artifacts. In this paper, we propose a method to address these issues. The proposed method consists of a stylization step and a smoothing step. While the stylization step transfers the style of the reference photo to the content photo, the smoothing step ensures spatially consistent stylizations. Each of the steps has a closed-form solution and can be computed efficiently. We conduct extensive experimental validations. The results show that the proposed method generates photorealistic stylization outputs that are more preferred by human subjects as compared to those by the competing methods while running much faster. Source code and additional results are available at https://github.com/NVIDIA/FastPhotoStyle .
The team of scientists at NVIDIA and the University of California, Merced proposes a new solution to photorealistic image stylization, FastPhotoStyle. The method consists of two steps: stylization and smoothing. Extensive experiments show that the suggested approach generates more realistic and compelling images than the previous state of the art. Moreover, thanks to the closed-form solution, FastPhotoStyle can produce a stylized image 49 times faster than traditional methods.
![Figure from the paper](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/03_Stylization_web.jpg)
- The goal of photorealistic image stylization is to transfer style of a reference photo to a content photo while keeping the stylized image photorealistic.
- The stylization step is based on the whitening and coloring transform (WCT), which processes images via feature projections. However, WCT was developed for artistic image stylizations, and thus, often generates structural artifacts for photorealistic image stylization. To overcome this problem, the paper introduces PhotoWCT method, which replaces the upsampling layers in the WCT with unpooling layers, and so, preserves more spatial information.
- The smoothing step is required to solve spatially inconsistent stylizations that could arise after the first step. Smoothing is based on a manifold ranking algorithm.
- Both steps have a closed-form solution, which means that the solution can be obtained in a fixed number of operations (i.e., convolutions, max-pooling, whitening, etc.). Thus, computations are much more efficient compared to the traditional methods.
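The whitening half of the WCT that the stylization step builds on can be written in a few lines: content features are decorrelated so that the style's covariance can then be imposed on them. A simplified sketch (the actual PhotoWCT operates on VGG features inside an encoder-decoder, not on raw arrays like this):

```python
import numpy as np

def whiten(f, eps=1e-8):
    """Whiten a (channels, pixels) feature map so its channels are
    zero-mean and uncorrelated (identity covariance)."""
    f = f - f.mean(axis=1, keepdims=True)
    cov = f @ f.T / (f.shape[1] - 1)
    vals, vecs = np.linalg.eigh(cov)
    # multiply by the inverse square root of the covariance matrix
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T
    return inv_sqrt @ f
```

The coloring step is the mirror image: multiply the whitened content features by the square root of the style features' covariance, transferring the style's second-order statistics.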
- The experiments demonstrate that FastPhotoStyle:
  - outperforms artistic stylization algorithms by rendering far fewer structural artifacts and inconsistent stylizations, and
  - outperforms photorealistic stylization algorithms by synthesizing not only the colors but also the patterns of the style photos.
- The experiments demonstrate that users prefer FastPhotoStyle results over the previous state-of-the-art in terms of both stylization effects (63.1%) and photorealism (73.5%).
- FastPhotoStyle can synthesize an image of 1024 x 512 resolution in only 13 seconds, while the previous state-of-the-art method needs 650 seconds for the same task.
- The paper was presented at ECCV 2018, leading European Conference on Computer Vision.
- Finding the way to transfer small patterns from the style photo as they are smoothed away by the suggested method.
- Exploring the possibilities to further reduce the number of structural artifacts in the stylized photos.
- Content creators in the business settings can largely benefit from photorealistic image stylization as the tool basically allows you to automatically change the style of any photo based on what fits the narrative.
- The photographers also discuss the tremendous impact that this technology can have in real estate photography.
- NVIDIA team provides the original implementation for this research paper on GitHub .
4. Group Normalization , by Yuxin Wu and Kaiming He
Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems – BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN’s usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
The Facebook AI Research team suggests Group Normalization (GN) as an alternative to Batch Normalization (BN). They argue that BN’s error increases dramatically for small batch sizes, which limits its usage when working with large models to solve computer vision tasks that require small batches due to memory constraints. In contrast, Group Normalization is independent of batch size, as it divides the channels into groups and computes the mean and variance for normalization within each group. The experiments confirm that GN outperforms BN in a variety of tasks, including object detection, segmentation, and video classification.
![Figure from the paper](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/04_Group_norm_web.png)
- Group Normalization is a simple alternative to Batch Normalization, especially in the scenarios where batch size tends to be small, for example, computer vision tasks, requiring high-resolution input.
- GN explores only the layer dimensions, and thus, its computation is independent of batch size. Specifically, GN divides channels, or feature maps, into groups and normalizes the features within each group.
- Group Normalization can be easily implemented by a few lines of code in PyTorch and TensorFlow.
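The computation is simple enough to state directly. A minimal NumPy sketch of the normalization itself (omitting GN's learned per-channel scale and shift parameters):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group Normalization for an (N, C, H, W) batch.

    Channels are split into num_groups groups; mean and variance are
    computed per sample, per group -- never across the batch dimension.
    """
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
```

Because nothing is averaged over the batch axis, the result for a batch of one is identical to the per-sample result in a batch of thirty-two, which is the whole point.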
- Introducing Group Normalization, a new, effective normalization method.
- GN’s accuracy is stable in a wide range of batch sizes as its computation is independent of batch size. For example, GN demonstrated a 10.6% lower error rate than its BN-based counterpart for ResNet-50 in ImageNet with a batch size of 2.
- GN can also be transferred to fine-tuning. The experiments show that GN can outperform BN-based counterparts for object detection and segmentation on the COCO dataset and for video classification on the Kinetics dataset.
- The paper received an honorable mention at ECCV 2018, leading European Conference on Computer Vision.
- It is also the second most popular paper of 2018 based on users’ libraries at Arxiv Sanity Preserver.
- Applying group normalization to sequential or generative models.
- Investigating GN’s performance on learning representations for reinforcement learning.
- Exploring if GN combined with a suitable regularizer will improve results.
- Business applications that rely on BN-based models for object detection, segmentation, video classification and other computer vision tasks that require high-resolution input may benefit from moving to GN-based models as they are more accurate in these settings.
- Facebook AI research team provides Mask R-CNN baseline results and models trained with Group Normalization .
- PyTorch implementation of group normalization is also available on GitHub.
5. Taskonomy: Disentangling Task Transfer Learning , by Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese
Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying existence of a structure among visual tasks. Knowing this structure has notable values; it is the concept underlying transfer learning and provides a principled way for identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity.
We propose a fully computational approach for modeling the structure of space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure including a solver that users can employ to devise efficient supervision policies for their use cases.
Assertions of the existence of a structure among visual tasks have been made by many researchers since the early years of modern computer science. And now Amir Zamir and his team make an attempt to actually find this structure. They model it using a fully computational approach and discover lots of useful relationships between different visual tasks, including the nontrivial ones. They also show that by taking advantage of these interdependencies, it is possible to achieve the same model performance with the labeled data requirements reduced by roughly ⅔.
![Figure from the paper](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/05_taskonomy_web.jpg)
- A model aware of the relationships among different visual tasks demands less supervision, uses less computation, and behaves in more predictable ways.
- A fully computational approach to discovering the relationships between visual tasks is preferable because it avoids imposing prior, and possibly incorrect, assumptions: the priors are derived from either human intuition or analytical knowledge, while neural networks might operate on different principles.
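In miniature, the computational approach amounts to measuring how well each task transfers to each other task and reading relationships off the resulting affinity matrix. A toy sketch with three tasks and invented scores (the paper's actual solver further normalizes these affinities and selects source tasks via a budgeted optimization):

```python
import numpy as np

# affinity[i, j] = measured transfer performance from source task i to
# target task j (numbers here are made up purely for illustration)
affinity = np.array([
    # depth  normals  edges    <- targets
    [0.0,    0.8,     0.3],    # transfer from depth
    [0.7,    0.0,     0.4],    # transfer from surface normals
    [0.2,    0.3,     0.0],    # transfer from edges
])
tasks = ["depth", "normals", "edges"]

# for each target task, pick the source whose features transfer best
best_source = [tasks[i] for i in affinity.argmax(axis=0)]
```

Even this toy version shows the payoff: a target task can inherit supervision from its best-related source instead of requiring its own large labeled dataset.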
- Identifying relationships between 26 common visual tasks.
- Showing how this structure helps in discovering types of transfer learning that will be most effective for each visual task.
- Creating a new dataset of 4 million images of indoor scenes including 600 buildings annotated with 26 tasks.
- The paper won the Best Paper Award at CVPR 2018, the key conference on computer vision and pattern recognition.
- The results are very important, as large-scale labeled datasets are not available for most real-world tasks.
- To move from a model where common visual tasks are entirely defined by humans and try an approach where human-defined visual tasks are viewed as observed samples which are composed of computationally found latent subtasks.
- Exploring the possibility to transfer the findings to not entirely visual tasks, e.g. robotic manipulation.
- Relationships discovered in this paper can be used to build more effective visual systems that will require less labeled data and lower computational costs.
6. Self-Attention Generative Adversarial Networks , by Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena
In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves the state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.
Traditional convolutional GANs demonstrated some very promising results with respect to image synthesis. However, they have at least one important weakness: convolutional layers alone fail to capture geometrical and structural patterns in the images. Since convolution is a local operation, it is hardly possible for an output at the top-left position to have any relation to the output at the bottom-right. The paper introduces a simple solution to this problem: incorporating the self-attention mechanism into the GAN framework. This solution, combined with several stabilization techniques, helps the Self-Attention Generative Adversarial Networks (SAGANs) achieve state-of-the-art results in image synthesis.
![Figure from the paper](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/06_SAGAN_2_web.jpg)
- Convolutional layers alone are computationally inefficient for modeling long-range dependencies in images. On the contrary, a self-attention mechanism incorporated into the GAN framework will enable both the generator and the discriminator to efficiently model relationships between widely separated spatial regions.
- The self-attention module calculates response at a position as a weighted sum of the features at all positions.
- Applying spectral normalization for both generator and discriminator – the researchers argue that not only the discriminator but also the generator can benefit from spectral normalization, as it can prevent the escalation of parameter magnitudes and avoid unusual gradients.
- Using separate learning rates for the generator and the discriminator to compensate for the problem of slow learning in a regularized discriminator and make it possible to use fewer generator steps per discriminator step.
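The weighted-sum formulation of self-attention can be sketched in a few lines for a single head (illustrative only; the paper's module computes its projections with 1x1 convolutions and adds the attention output back through a learned residual connection):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (positions, channels). Each output position is a weighted sum
    of projected features at ALL positions, so distant regions of the
    image can influence each other directly."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T                             # (positions, positions)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax: rows sum to 1
    return attn @ v
```

Unlike a convolution, whose receptive field is a fixed local window, every position here attends over the entire feature map in a single step.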
- Showing that a self-attention module incorporated into the GAN framework is, in fact, effective in modeling long-range dependencies.
- Demonstrating that:
  - spectral normalization applied to the generator stabilizes GAN training;
  - utilizing imbalanced learning rates speeds up the training of regularized discriminators.
- Achieving state-of-the-art results in image synthesis by boosting the Inception Score from 36.8 to 52.52 and reducing Fréchet Inception Distance from 27.62 to 18.65.
- “The idea is simple and intuitive yet very effective, plus easy to implement.” – Sebastian Raschka, assistant professor of Statistics at the University of Wisconsin-Madison.
- Exploring the possibilities to reduce the number of weird samples generated by GANs.
- Image synthesis with GANs can replace expensive manual media creation for advertising and e-commerce purposes.
- PyTorch and TensorFlow implementations of Self-Attention GANs are available on GitHub.
7. GANimation: Anatomically-aware Facial Animation from a Single Image , by Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, Francesc Moreno-Noguer
Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for task of facial expression synthesis. The most successful architecture is StarGAN, that conditions GANs generation process with images of a specific domain, namely a set of images of persons sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content of the dataset. To address this limitation, in this paper, we introduce a novel GAN conditioning scheme based on Action Units (AU) annotations, which describes in a continuous manifold the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combine several of them. Additionally, we propose a fully unsupervised strategy to train the model, that only requires images annotated with their activated AUs, and exploit attention mechanisms that make our network robust to changing backgrounds and lighting conditions. Extensive evaluation show that our approach goes beyond competing conditional generators both in the capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements, as in the capacity of dealing with images in the wild.
The paper introduces a novel GAN model that is able to generate anatomically-aware facial animations from a single image under changing backgrounds and illumination conditions. It advances current works, which had only addressed the problem for discrete emotions category editing and portrait images. The approach renders a wide range of emotions by encoding facial deformations as Action Units. The resulting animations demonstrate a remarkably smooth and consistent transformation across frames even with challenging light conditions and backgrounds.
![GANimation: anatomically-aware facial animation results](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/07_GANimation_web.jpg)
- Facial expressions can be described in terms of Action Units (AUs), which anatomically describe the contractions of specific facial muscles. For example, the facial expression for ‘fear’ is generally produced with the following activations: Inner Brow Raiser (AU1), Outer Brow Raiser (AU2), Brow Lowerer (AU4), Upper Lid Raiser (AU5), Lid Tightener (AU7), Lip Stretcher (AU20) and Jaw Drop (AU26). The magnitude of each AU defines the extent of emotion.
- A model for synthetic facial animation is based on the GAN architecture, which is conditioned on a one-dimensional vector indicating the presence/absence and the magnitude of each Action Unit.
- To circumvent the need for pairs of training images of the same person under different expressions, a bidirectional generator is used to both transform an image into a desired expression and transform the synthesized image back into the original pose.
- To handle images under changing backgrounds and illumination conditions, the model includes an attention layer that focuses the action of the network only in those regions of the image that are relevant to convey the novel expression.
- Introducing a novel GAN model for face animation in the wild that can be trained in a fully unsupervised manner and generate visually compelling images with remarkably smooth and consistent transformation across frames even with challenging light conditions and non-real world data.
- Demonstrating how a wider range of emotions can be generated by interpolating between emotions the GAN has already seen.
- Applying the introduced approach to video sequences.
- The technology that automatically animates the facial expression from a single image can be applied in several areas including the fashion and e-commerce business, the movie industry, photography technologies.
- The authors provide the original implementation of this research paper on GitHub.
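The AU-conditioning scheme above lends itself to a short sketch. Below, an expression is encoded as a one-dimensional vector of AU magnitudes that could condition a generator; the AU subset and the activation values for ‘fear’ are illustrative assumptions, not the paper's exact encoding:

```python
import numpy as np

# Hypothetical subset of Action Units, matching the 'fear' example above.
AU_NAMES = ["AU1", "AU2", "AU4", "AU5", "AU7", "AU20", "AU26"]

def au_vector(activations: dict) -> np.ndarray:
    """Encode an expression as a vector of AU magnitudes in [0, 1]."""
    vec = np.zeros(len(AU_NAMES))
    for name, magnitude in activations.items():
        vec[AU_NAMES.index(name)] = np.clip(magnitude, 0.0, 1.0)
    return vec

# 'Fear' at moderate intensity: each relevant AU partially activated.
fear = au_vector({"AU1": 0.6, "AU2": 0.5, "AU4": 0.7,
                  "AU5": 0.8, "AU7": 0.4, "AU20": 0.3, "AU26": 0.5})

# Because the conditioning lives in a continuous space, scaling or
# interpolating the vector yields intermediate expression intensities.
half_fear = 0.5 * fear
```

This continuity is what lets the model render smooth transitions between expressions rather than jumping between the discrete categories a dataset happens to contain.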
8. Video-to-Video Synthesis, by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro
We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video. While its image counterpart, the image-to-image synthesis problem, is a popular topic, the video-to-video synthesis problem is less explored in the literature. Without understanding temporal dynamics, directly applying existing image synthesis approaches to an input video often results in temporally incoherent videos of low visual quality. In this paper, we propose a novel video-to-video synthesis approach under the generative adversarial learning framework. Through carefully-designed generator and discriminator architectures, coupled with a spatio-temporal adversarial objective, we achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses. Experiments on multiple benchmarks show the advantage of our method compared to strong baselines. In particular, our model is capable of synthesizing 2K resolution videos of street scenes up to 30 seconds long, which significantly advances the state-of-the-art of video synthesis. Finally, we apply our approach to future video prediction, outperforming several state-of-the-art competing systems.
Researchers from NVIDIA have introduced a novel video-to-video synthesis approach. The framework is based on conditional GANs: it couples a carefully designed generator and discriminator with a spatio-temporal adversarial objective. The experiments demonstrate that the suggested vid2vid approach can synthesize high-resolution, photorealistic, temporally coherent videos on a diverse set of input formats, including segmentation masks, sketches, and poses. It can also predict future frames with results far superior to those of the baseline models.
![vid2vid: video-to-video synthesis results](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/08_vid2vid_web.png)
- The sequential generator takes as input the current source frame, the past two source frames, and the past two generated frames.
- Conditional image discriminator ensures that each output frame resembles a real image given the same source image.
- Conditional video discriminator ensures that consecutive output frames resemble the temporal dynamics of a real video given the same optical flow.
- Foreground-background prior in the generator design further improves the synthesis performance of the proposed model.
- Using a soft occlusion mask instead of a binary one better handles the “zoom in” scenario: details can be added by gradually blending the warped pixels and the newly synthesized pixels.
- Generating high-resolution (2048×1024), photorealistic, temporally coherent videos up to 30 seconds long.
- Outputting several videos with different visual appearances by sampling different feature vectors.
- Outperforming the baseline models in future video prediction.
- Converting semantic labels into realistic real-world videos.
- Generating multiple outputs of talking people from edge maps.
- Generating an entire human body given a pose.
- “NVIDIA’s new vid2vid is the first open-source code that lets you fake anybody’s face convincingly from one source video. […] interesting times ahead…”, Gene Kogan, an artist and a programmer.
- The paper has also received some criticism over the concern that it can be used to create deepfakes or tampered videos which can deceive people.
- Using object tracking information to make sure that each object has a consistent appearance across the whole video.
- Researching if training the model with coarser semantic labels will help reduce the visible artifacts that appear after semantic manipulations (e.g., turning trees into buildings).
- Adding additional 3D cues, such as depth maps, to enable synthesis of turning cars.
- Marketing and advertising can benefit from the opportunities created by the vid2vid method (e.g., replacing the face or even the entire body in the video). However, this should be used with caution, keeping in mind the ethical considerations.
- The NVIDIA team provides the original implementation of this research paper on GitHub.
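The soft occlusion mask mentioned above reduces to a per-pixel convex blend between optically warped and newly synthesized pixels. A minimal NumPy sketch (a toy illustration, not NVIDIA's implementation):

```python
import numpy as np

def blend(warped, synthesized, mask):
    """Soft occlusion mask in [0, 1]: 1 keeps the pixel warped from the
    previous frame, 0 takes the newly synthesized pixel, and values in
    between mix them -- which is what lets 'zoom in' add detail gradually."""
    mask = np.clip(mask, 0.0, 1.0)[..., None]  # broadcast over RGB channels
    return mask * warped + (1.0 - mask) * synthesized

h, w = 4, 4
warped = np.full((h, w, 3), 0.2)        # pixels warped from the last frame
synthesized = np.full((h, w, 3), 0.8)   # freshly generated pixels
mask = np.full((h, w), 0.25)            # mostly trust the new synthesis here
frame = blend(warped, synthesized, mask)
```

A binary mask would force a hard switch between the two sources at every pixel; the soft version is what allows newly revealed detail to fade in smoothly across frames.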
9. Everybody Dance Now, by Caroline Chan, Shiry Ginosar, Tinghui Zhou, Alexei A. Efros
This paper presents a simple method for “do as I do” motion transfer: given a source video of a person dancing we can transfer that performance to a novel (amateur) target after only a few minutes of the target subject performing standard moves. We pose this problem as a per-frame image-to-image translation with spatio-temporal smoothing. Using pose detections as an intermediate representation between source and target, we learn a mapping from pose images to a target subject’s appearance. We adapt this setup for temporally coherent video generation including realistic face synthesis. Our video demo can be found at https://youtu.be/PCBTZh41Ris .
UC Berkeley researchers present a simple method for generating videos in which amateur dancers perform like professional dancers. If you want to take part in the experiment, all you need to do is record a few minutes of yourself performing some standard moves and then pick the video of the dance you want to repeat. The neural network does the main job: it treats the problem as per-frame image-to-image translation with spatio-temporal smoothing. By conditioning the prediction at each frame on that of the previous time step for temporal smoothness, and applying a specialized GAN for realistic face synthesis, the method achieves truly impressive results.
![Everybody Dance Now: motion transfer results](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/09_Dance_web.jpg)
- A pre-trained state-of-the-art pose detector creates pose stick figures from the source video.
- Global pose normalization is applied to account for differences between the source and target subjects in body shapes and locations within the frame.
- Normalized pose stick figures are mapped to the target subject.
- To make videos smooth, the researchers condition the generator on the previously generated frame and feed both images to the discriminator. Gaussian smoothing of the pose keypoints further reduces jitter.
- To generate more realistic faces, the method includes an additional face-specific GAN that brushes up the face after the main generation is finished.
- Suggesting a novel approach to motion transfer that outperforms a strong baseline (pix2pixHD), according to both qualitative and quantitative assessments.
- Demonstrating that the face-specific GAN adds considerable detail to the output video.
- “Overall I thought this was really fun and well executed. Looking forward to the code release so that I can start training my dance moves.”, Tom Brown, member of technical staff at Google Brain.
- “’Everybody Dance Now’ from Caroline Chan, Alyosha Efros and team transfers dance moves from one subject to another. The only way I’ll ever dance well. Amazing work!!!”, Soumith Chintala, AI Research Engineer at Facebook.
- Replacing pose stick figures with temporally coherent inputs and representation specifically optimized for motion transfer.
- “Do as I do” motion transfer might be applied to replace subjects when creating marketing and promotional videos.
- A PyTorch implementation of this research paper is available on GitHub.
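Global pose normalization can be sketched as a per-frame translation and scale applied to the source keypoints. The version below, which simply matches ankle position and apparent body height, is a simplified assumption, not the authors' exact scheme:

```python
import numpy as np

def normalize_pose(keypoints, src_ankle, src_height, tgt_ankle, tgt_height):
    """Translate and scale source 2D pose keypoints into the target's frame,
    so the mapped pose has the target's apparent height and stands where the
    target subject stands."""
    scale = tgt_height / src_height
    return (keypoints - src_ankle) * scale + tgt_ankle

# Hypothetical (x, y) keypoints for three joints of the source pose:
# head, hip, ankle of a subject 200 px tall in the source video.
src = np.array([[100.0, 50.0], [100.0, 150.0], [100.0, 250.0]])
src_ankle = np.array([100.0, 250.0])
tgt_ankle = np.array([200.0, 300.0])   # where the target's ankles sit
out = normalize_pose(src, src_ankle, src_height=200.0,
                     tgt_ankle=tgt_ankle, tgt_height=100.0)
```

Without this step, a tall source subject would be rendered as an implausibly stretched target; the normalized stick figure is what actually gets fed to the image-to-image generator.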
10. Large Scale GAN Training for High Fidelity Natural Image Synthesis, by Andrew Brock, Jeff Donahue, and Karen Simonyan
Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple “truncation trick”, allowing fine control over the trade-off between sample fidelity and variety by truncating the latent space. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128×128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.3 and Frechet Inception Distance (FID) of 9.6, improving over the previous best IS of 52.52 and FID of 18.65.
The DeepMind team finds that current techniques are sufficient for synthesizing high-resolution, diverse images from available datasets such as ImageNet and JFT-300M. In particular, they show that Generative Adversarial Networks (GANs) can generate images that look very realistic when trained at very large scale, i.e., with two to four times as many parameters and eight times the batch size compared to prior art. These large-scale GANs, or BigGANs, are the new state of the art in class-conditional image synthesis.
![BigGAN samples](https://topb0ts.wpenginepowered.com/wp-content/uploads/2019/01/10_biggan_web.jpg)
- GANs perform much better with the increased batch size and number of parameters.
- Applying orthogonal regularization to the generator makes the model responsive to a specific technique (“truncation trick”), which provides control over the trade-off between sample fidelity and variety.
- Demonstrating that GANs can benefit significantly from scaling.
- Building models that allow explicit, fine-grained control of the trade-off between sample variety and fidelity.
- Discovering instabilities of large-scale GANs and characterizing them empirically.
- Setting new state-of-the-art scores on ImageNet at 128×128 resolution: an Inception Score (IS) of 166.3 versus the previous best of 52.52, and a Fréchet Inception Distance (FID) of 9.6 versus the previous best of 18.65.
- The paper is under review for ICLR 2019.
- After BigGAN generators became available on TF Hub, AI researchers from all over the world played with BigGANs to generate dogs, watches, bikini images, the Mona Lisa, seashores and much more.
- Moving to larger datasets to mitigate GAN stability issues.
- Replacing expensive manual media creation for advertising and e-commerce purposes.
- A BigGAN demo implemented in TensorFlow is available to use on Google’s Colab tool.
- Aaron Leong has a GitHub repository with BigGAN implemented in PyTorch.
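The “truncation trick” itself is simple to sketch: sample the latent vector from a standard normal and resample any coordinate whose magnitude exceeds a threshold. Lower thresholds trade sample variety for fidelity:

```python
import numpy as np

def truncated_normal(shape, threshold, rng):
    """BigGAN-style truncated sampling: draw z ~ N(0, 1) and resample every
    coordinate whose magnitude exceeds the threshold, so the generator only
    ever sees latents from the high-density core of the prior."""
    z = rng.standard_normal(shape)
    while True:
        mask = np.abs(z) > threshold
        if not mask.any():
            return z
        z[mask] = rng.standard_normal(mask.sum())

rng = np.random.default_rng(0)
# A batch of 8 latent vectors of dimension 128, truncated at 0.5.
z = truncated_normal((8, 128), threshold=0.5, rng=rng)
```

At sampling time the threshold becomes a user-facing knob: near 0 the generator emits a few highly typical, high-fidelity images; near infinity it recovers the full (more varied, less reliable) distribution it was trained on.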
Want Deeper Dives Into Specific AI Research Topics?
Due to popular demand, we’ve released several of these easy-to-read summaries and syntheses of major research papers for different subtopics within AI and machine learning.
- Top 10 machine learning & AI research papers of 2018
- Top 10 AI fairness, accountability, transparency, and ethics (FATE) papers of 2018
- Top 14 natural language processing (NLP) research papers of 2018
- Top 10 computer vision and image generation research papers of 2018
- Top 10 conversational AI and dialog systems research papers of 2018
- Top 10 deep reinforcement learning research papers of 2018
Update: 2019 Research Summaries Are Released
- Top 10 AI & machine learning research papers from 2019
- Top 11 NLP achievements & papers from 2019
- Top 10 research papers in conversational AI from 2019
- Top 10 computer vision research papers from 2019
- Top 12 AI ethics research papers introduced in 2019
- Top 10 reinforcement learning research papers from 2019
About Mariya Yao
Mariya is the co-author of Applied AI: A Handbook For Business Leaders and former CTO at Metamaven. She "translates" arcane technical concepts into actionable business advice for executives and designs lovable products people actually want to use. Follow her on Twitter at @thinkmariya to raise your AI IQ.
MIT News | Massachusetts Institute of Technology
When computer vision works more like a brain, it sees more like people do
![Monotone image of a human eye with graphic representations of a computer network superimposed](https://news.mit.edu/sites/default/files/styles/news_article__image_gallery/public/images/202306/dicarlo_900x600.jpg?itok=mCEPNnW-)
From cameras to self-driving cars, many of today’s technologies depend on artificial intelligence to extract meaning from visual information. Today’s AI technology has artificial neural networks at its core, and most of the time we can trust these AI computer vision systems to see things the way we do — but sometimes they falter. According to MIT and IBM research scientists, one way to improve computer vision is to instruct the artificial neural networks that they rely on to deliberately mimic the way the brain’s biological neural network processes visual images.
Researchers led by MIT Professor James DiCarlo , the director of MIT’s Quest for Intelligence and member of the MIT-IBM Watson AI Lab, have made a computer vision model more robust by training it to work like a part of the brain that humans and other primates rely on for object recognition. This May, at the International Conference on Learning Representations, the team reported that when they trained an artificial neural network using neural activity patterns in the brain’s inferior temporal (IT) cortex, the artificial neural network was more robustly able to identify objects in images than a model that lacked that neural training. And the model’s interpretations of images more closely matched what humans saw, even when images included minor distortions that made the task more difficult.
Comparing neural circuits
Many of the artificial neural networks used for computer vision already resemble the multilayered brain circuits that process visual information in humans and other primates. Like the brain, they use neuron-like units that work together to process information. As they are trained for a particular task, these layered components collectively and progressively process the visual information to complete the task — determining, for example, that an image depicts a bear or a car or a tree.
DiCarlo and others previously found that when such deep-learning computer vision systems establish efficient ways to solve visual problems, they end up with artificial circuits that work similarly to the neural circuits that process visual information in our own brains. That is, they turn out to be surprisingly good scientific models of the neural mechanisms underlying primate and human vision.
That resemblance is helping neuroscientists deepen their understanding of the brain. By demonstrating ways visual information can be processed to make sense of images, computational models suggest hypotheses about how the brain might accomplish the same task. As developers continue to refine computer vision models, neuroscientists have found new ideas to explore in their own work.
“As vision systems get better at performing in the real world, some of them turn out to be more human-like in their internal processing. That’s useful from an understanding-biology point of view,” says DiCarlo, who is also a professor of brain and cognitive sciences and an investigator at the McGovern Institute for Brain Research.
Engineering a more brain-like AI
While their potential is promising, computer vision systems are not yet perfect models of human vision. DiCarlo suspected one way to improve computer vision may be to incorporate specific brain-like features into these models.
To test this idea, he and his collaborators built a computer vision model using neural data previously collected from vision-processing neurons in the monkey IT cortex — a key part of the primate ventral visual pathway involved in the recognition of objects — while the animals viewed various images. More specifically, Joel Dapello, a Harvard University graduate student and former MIT-IBM Watson AI Lab intern; and Kohitij Kar, assistant professor and Canada Research Chair (Visual Neuroscience) at York University and visiting scientist at MIT; in collaboration with David Cox, IBM Research’s vice president for AI models and IBM director of the MIT-IBM Watson AI Lab; and other researchers at IBM Research and MIT asked an artificial neural network to emulate the behavior of these primate vision-processing neurons while the network learned to identify objects in a standard computer vision task.
“In effect, we said to the network, ‘please solve this standard computer vision task, but please also make the function of one of your inside simulated “neural” layers be as similar as possible to the function of the corresponding biological neural layer,’” DiCarlo explains. “We asked it to do both of those things as best it could.” This forced the artificial neural circuits to find a different way to process visual information than the standard, computer vision approach, he says.
After training the artificial model with biological data, DiCarlo’s team compared its activity to a similarly-sized neural network model trained without neural data, using the standard approach for computer vision. They found that the new, biologically informed model IT layer was — as instructed — a better match for IT neural data. That is, for every image tested, the population of artificial IT neurons in the model responded more similarly to the corresponding population of biological IT neurons.
The researchers also found that the model IT was also a better match to IT neural data collected from another monkey, even though the model had never seen data from that animal, and even when that comparison was evaluated on that monkey’s IT responses to new images. This indicated that the team’s new, “neurally aligned” computer model may be an improved model of the neurobiological function of the primate IT cortex — an interesting finding, given that it was previously unknown whether the amount of neural data that can be currently collected from the primate visual system is capable of directly guiding model development.
With their new computer model in hand, the team asked whether the “IT neural alignment” procedure also leads to any changes in the overall behavioral performance of the model. Indeed, they found that the neurally-aligned model was more human-like in its behavior — it tended to succeed in correctly categorizing objects in images for which humans also succeed, and it tended to fail when humans also fail.
Adversarial attacks
The team also found that the neurally aligned model was more resistant to “adversarial attacks” that developers use to test computer vision and AI systems. In computer vision, adversarial attacks introduce small distortions into images that are meant to mislead an artificial neural network.
“Say that you have an image that the model identifies as a cat. Because you have the knowledge of the internal workings of the model, you can then design very small changes in the image so that the model suddenly thinks it’s no longer a cat,” DiCarlo explains.
These minor distortions don’t typically fool humans, but computer vision models struggle with these alterations. A person who looks at the subtly distorted cat still reliably and robustly reports that it’s a cat. But standard computer vision models are more likely to mistake the cat for a dog, or even a tree.
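For intuition, here is a minimal adversarial perturbation in the spirit of the fast gradient sign method (Goodfellow et al., 2015), applied to a toy linear “cat” classifier; this is a hypothetical stand-in, unrelated to the models discussed in the article:

```python
import numpy as np

# Toy linear classifier over 100 'pixels': score = w . x; positive -> 'cat'.
rng = np.random.default_rng(1)
w = rng.standard_normal(100)
x = w / np.linalg.norm(w)   # an input the model scores confidently as 'cat'

# Fast-gradient-sign step: nudge every pixel by epsilon in the direction
# that most decreases the 'cat' score. For a linear model the gradient of
# the score with respect to x is just w.
epsilon = 0.05
x_adv = x - epsilon * np.sign(w)

clean_score = w @ x
adv_score = w @ x_adv   # strictly lower: drops by epsilon * sum(|w|)
```

Each pixel moves by at most epsilon, yet the score drops by epsilon times the L1 norm of the weights; with enough pixels, an imperceptible per-pixel change can flip the decision, which is exactly the failure mode the neurally aligned models resist better.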
“There must be some internal differences in the way our brains process images that lead to our vision being more resistant to those kinds of attacks,” DiCarlo says. And indeed, the team found that when they made their model more neurally aligned, it became more robust, correctly identifying more images in the face of adversarial attacks. The model could still be fooled by stronger “attacks,” but so can people, DiCarlo says. His team is now exploring the limits of adversarial robustness in humans.
A few years ago, DiCarlo’s team found they could also improve a model’s resistance to adversarial attacks by designing the first layer of the artificial network to emulate the early visual processing layer in the brain. One key next step is to combine such approaches — making new models that are simultaneously neurally aligned at multiple visual processing layers.
The new work is further evidence that an exchange of ideas between neuroscience and computer science can drive progress in both fields. “Everybody gets something out of the exciting virtuous cycle between natural/biological intelligence and artificial intelligence,” DiCarlo says. “In this case, computer vision and AI researchers get new ways to achieve robustness, and neuroscientists and cognitive scientists get more accurate mechanistic models of human vision.”
This work was supported by the MIT-IBM Watson AI Lab, the Semiconductor Research Corporation, the U.S. Defense Advanced Research Projects Agency, the MIT Shoemaker Fellowship, the U.S. Office of Naval Research, the Simons Foundation, and the Canada Research Chair Program.
![OpenCV logo](https://opencv.org/wp-content/uploads/2022/05/logo.png)
Open Computer Vision Library
Research Areas in Computer Vision: Trends and Challenges
Farooq Alvi, February 7, 2024, AI Careers
![Research areas in computer vision](https://opencv.org/wp-content/uploads/2024/02/Research-areas-in-Computer-vision.png)
Basics of Computer Vision
Computer Vision (CV) is a field of artificial intelligence that trains computers to interpret and understand the visual world. Using digital images from cameras and videos, along with deep learning models, computers can accurately identify and classify objects, and then react to what they “see.”
Key Concepts in Computer Vision
Image Processing: At the heart of CV is image processing, which involves enhancing image data (removing noise, sharpening, or brightening an image) and preparing it for further analysis.
Feature Detection and Matching: This involves identifying and using specific features of an image, like edges, corners, or objects, to understand the content of the image.
Pattern Recognition: CV uses pattern recognition to identify patterns and regularities in data. This can be as simple as recognizing the shape of an object or as complex as identifying a person’s face.
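All three concepts meet in the 2-D convolution, the workhorse of classical image processing and of CNNs alike. A self-contained NumPy sketch applying a standard sharpening kernel to a synthetic vertical edge:

```python
import numpy as np

def convolve2d(img, kernel):
    """Minimal 'valid' 2-D correlation: slide the kernel over the image and
    take the weighted sum at each position. This same operation underlies
    denoising, sharpening, edge detection, and CNN feature extraction."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A classic 3x3 sharpening kernel: boosts the centre pixel relative to its
# four neighbours, exaggerating intensity changes such as edges.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

img = np.zeros((5, 5))
img[:, 2:] = 1.0                 # a vertical edge between columns 1 and 2
edges = convolve2d(img, sharpen)  # overshoot/undershoot appears at the edge
```

In practice you would call an optimized routine such as OpenCV's `cv2.filter2D` rather than Python loops; the loop version is only meant to expose the arithmetic behind the terms above.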
Core Technologies Powering Computer Vision
Machine Learning and Deep Learning: These are crucial for teaching computers to recognize patterns in visual data. Deep learning, especially, has been a game-changer, enabling advancements in facial recognition, object detection, and more.
Neural Networks: A type of machine learning, neural networks, particularly Convolutional Neural Networks (CNNs), are pivotal in analyzing visual imagery.
Image Recognition and Classification: This is the process of identifying and labeling objects within an image. It’s one of the most common applications of CV.
Object Detection: This goes a step further than image classification by not only identifying objects in images but also locating them.
Applications of Basic Computer Vision
Automated Inspection: Used in manufacturing to identify defects.
Surveillance: Helps in monitoring activities for security purposes.
Retail: For example, in cashier-less stores where CV tracks what customers pick up.
Healthcare: Assisting in diagnostic procedures through medical image analysis.
Challenges and Limitations
Data Quality and Quantity: The accuracy of a computer vision system is highly dependent on the quality and quantity of the data it’s trained on.
Computational Requirements: Advanced CV models require significant computational power, making them resource-intensive.
Ethical and Privacy Concerns: The use of CV in surveillance and data collection raises ethical and privacy issues that need to be addressed.
Key Research Areas in Computer Vision
![Research areas in computer vision](https://opencv.org/wp-content/uploads/2024/02/Research-areas-in-CV-1024x576.png)
Augmented Reality: The Convergence with Computer Vision
In 2024, Augmented Reality (AR) continues to make significant strides, increasingly integrating with computer vision (CV) to create more immersive and interactive experiences across various sectors. This integration is crucial as AR requires understanding and interacting with the real world through visual information, a capability at the core of CV.
Manufacturing, Retail, and Education: Transformative Sectors
Manufacturing: AR devices enable manufacturing workers to access real-time instructional and administrative information. This integration significantly enhances efficiency and accuracy in production processes.
Retail: In the retail sector, AR is revolutionizing the shopping experience. Consumers can now visualize products in great detail, including pricing and features, right from their AR devices, offering a more engaging and informed shopping experience.
Education: The impact of AR in education is substantial. Traditional teaching methods are being supplemented with immersive and interactive AR experiences, making learning more engaging and effective for students.
Technological Advances in AR
The advancement in AR technology, backed by major companies like Apple and Meta, is seeing a surge of consumer-grade AR devices entering the market. These devices are set to become more widely available, making AR more integral to daily life and work.
The development of sophisticated AR gaming is a testament to this growth. AR games now offer realistic gameplay, integrating virtual objects and characters into the real world, enhancing player engagement, and creating new possibilities in gaming and non-gaming applications. Startups like Mohx-games and smar.toys are at the forefront of this innovation, developing platforms and controllers that elevate the AR gaming experience.
Mobile AR tools are another significant advancement. These tools utilize the increasing capabilities of smartphone cameras and sensors to enhance the realism and immersion of AR interactions. Platforms like Phantom Technology’s PhantomEngine enable developers to create more sophisticated and context-aware AR applications.
Wearables with AR capabilities, such as those developed by ARKH and Wavelens, are offering hands-free experiences, further expanding the usability and applications of AR in various industries, including manufacturing and logistics. These wearables provide real-time guidance and information directly in the user’s field of view, enhancing convenience and efficiency.
3D design and prototyping in AR, as exemplified by Virtualist’s building design platform, are enabling industries like architecture and automotive to visualize products and designs in real-world contexts, significantly improving the decision-making process and reducing design errors.
Robotic Language-Vision Models (RLVM)
Integration of vision and language in robotics.
In 2024, the field of robotics is witnessing a significant shift with the integration of Robotic Language-Vision Models (RLVMs), which are transforming how robots understand and interact with their environment. This blend of visual comprehension and language interpretation is paving the way for a new era of intelligent, responsive robotics.
Advancements in Robotic Language-Vision Models
Enhanced Learning Capabilities: Research and development efforts are increasingly focusing on using generative AI to make robots faster learners, especially for complex manipulation tasks. This advancement is likely to continue throughout 2024, potentially leading to commercial applications in robotics.
Natural Language Understanding:
Robots are becoming more personable, thanks to their improved ability to understand natural language instructions. This evolution is exemplified by projects where robots, such as Boston Dynamics’ Spot, are turned into interactive agents like tour guides.
Wider Application Spectrum:
Robots are moving beyond traditional environments like warehouses and manufacturing into public-facing roles in restaurants, hotels, hospitals, and more. Enabled by generative AI, these robots are expected to interact more naturally with people, enhancing their utility in these new roles.
Autonomous Mobile Robots (AMRs):
AMRs, combining sensors, AI, and computer vision, are increasingly used in varied settings, from factory floors to hospital corridors, for tasks like material handling, disinfection, and delivery services.
Intelligent Robotics:
Integration of AI in robotics is allowing robots to use real-time information to optimize tasks. This includes leveraging computer vision and machine learning for improved accuracy and performance in applications such as manufacturing automation and customer service in retail and hospitality.
Collaborative Robots (Cobots):
Cobots are being designed to safely interact and work alongside humans, augmenting human efforts in various industrial processes. Advances in sensor technology and software are enabling these robots to perform tasks more safely and efficiently alongside human workers.
Robotics as a Service (RaaS):
RaaS models are becoming more popular, providing businesses with flexible and scalable access to robotic solutions. This approach is particularly beneficial for small and medium-sized enterprises that can leverage robotic technology without incurring significant upfront costs.
Robotics Cybersecurity:
As robotics systems become more interconnected, the importance of cybersecurity in robotics is growing. Solutions are being developed to protect robotic systems from cyber threats, ensuring the safety and reliability of these systems in various applications.
Advanced Satellite Vision:
Monitoring environmental and urban changes.
In 2024, the capabilities of satellite imagery have been significantly enhanced by advancements in computer vision (CV), leading to more effective monitoring of environmental and urban changes.
Satellite Imagery and Computer Vision
High-Resolution Monitoring: CV-powered satellite imagery provides high-resolution monitoring of various terrestrial phenomena. This includes tracking urban sprawl, deforestation, and changes in marine environments.
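A representative technique behind such monitoring is a per-pixel spectral index. The classic example is NDVI, which compares near-infrared and red reflectance to map vegetation; the band values below are made up for illustration:

```python
import numpy as np

# Toy 3x3 scene: per-pixel red and near-infrared (NIR) reflectance.
# Healthy vegetation reflects strongly in NIR and absorbs red light.
red = np.array([[0.10, 0.10, 0.40],
                [0.10, 0.12, 0.42],
                [0.11, 0.40, 0.45]])
nir = np.array([[0.60, 0.62, 0.42],
                [0.58, 0.60, 0.44],
                [0.59, 0.41, 0.46]])

# Normalized Difference Vegetation Index, always in [-1, 1]:
#   NDVI = (NIR - Red) / (NIR + Red)
ndvi = (nir - red) / (nir + red)

vegetated = ndvi > 0.3   # simple per-pixel classification threshold
print(f"vegetated pixels: {vegetated.sum()} of {ndvi.size}")
```

Comparing such index maps across acquisition dates is one way deforestation and urban-sprawl trends are tracked over time.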
Environmental Management
These technological advancements are crucial for environmental monitoring and management. The detailed data from satellite imagery enables the study of ecological and climatic changes with unprecedented precision.
Urban Planning and Development
In urban areas, satellite vision assists in planning and development, providing critical data for infrastructure development, land use planning, and resource management.
Disaster Response and Management
Advanced satellite vision plays a key role in disaster management. It helps in assessing the impact of natural disasters and planning effective response strategies.
Agricultural Applications
In agriculture, satellite imagery helps in monitoring crop health, soil conditions, and water resources, enabling more efficient and sustainable farming practices.
Climate Change Analysis
Satellite vision is instrumental in understanding and monitoring the effects of climate change globally, including polar ice melt, sea-level rise, and changes in weather patterns.
3D Computer Vision: Enhancing Autonomous Vehicles and Digital Twin Modeling
In 2024, 3D Computer Vision (3D CV) is playing a pivotal role in advancing technologies in various sectors, particularly in autonomous vehicles and digital twin modeling.
3D Computer Vision in Autonomous Vehicles
Depth Perception: 3D CV enables autonomous vehicles to accurately perceive depth and distance. This is crucial for navigating complex environments and ensuring safety on the roads.
Object Detection and Tracking: It allows for precise detection and tracking of objects around the vehicle, including other vehicles, pedestrians, and road obstacles.
Environment Mapping: Advanced 3D imaging and processing help in creating detailed maps of the vehicle’s surroundings, essential for route planning and navigation.
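For a stereo camera pair, the depth perception mentioned above reduces to one formula: depth Z = f * B / d, where f is the focal length in pixels, B the baseline between the cameras, and d the pixel disparity between the two views. A minimal sketch with illustrative (not real-sensor) numbers:

```python
import numpy as np

f = 700.0   # focal length in pixels (illustrative value)
B = 0.54    # camera baseline in metres (illustrative value)

# Matched points that shift a lot between the two views are close;
# points that barely shift are far away.
disparity = np.array([70.0, 35.0, 7.0])   # pixels
depth = f * B / disparity                 # metres

for d, z in zip(disparity, depth):
    print(f"disparity {d:5.1f} px -> depth {z:6.2f} m")
```

Real pipelines spend most of their effort on the hard part this sketch skips: reliably matching pixels between the two images to obtain the disparity map.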
Digital Twin Modeling with 3D Computer Vision
Accurate Replication: 3D CV is integral in creating accurate digital replicas of physical objects, buildings, or even entire cities for digital twin applications.
Simulation and Analysis: These digital twins are used for simulations, allowing for analysis and optimization of systems in a virtual environment before actual implementation.
Predictive Maintenance and Planning: In industries such as manufacturing and urban planning, digital twins aid in predictive maintenance and strategic planning, minimizing risks and enhancing efficiency.
Ethics in Computer Vision: Navigating Bias and Privacy Concerns
As computer vision (CV) technologies become increasingly integrated into various aspects of life, ethical considerations, particularly related to bias and privacy, are gaining prominence.
Addressing Bias in Computer Vision
Data Diversity: One major ethical challenge in CV is the bias in algorithms, often stemming from non-representative training data. Efforts are being made to create more diverse and inclusive datasets to help overcome biases related to race, gender, and other factors.
Fairness in Algorithms: There is a growing focus on developing algorithms that are fair and non-discriminatory. This includes techniques to detect and correct biases in CV systems.
Transparent and Explainable AI: Transparency in how CV models are built and function is crucial. There’s an emphasis on explainable AI, where the decision-making process of CV systems can be understood and interrogated by users.
Ensuring Privacy in Computer Vision
Consent and Anonymity: With CV technologies being used in public spaces, ensuring individual privacy is paramount. Techniques like face-blurring in videos and images are being adopted to protect identities.
Regulatory Compliance: Governments and regulatory bodies are proposing strict regulations to ensure responsible development and use of AI and CV technologies. This includes guidelines for data collection, processing, and storage to protect individual privacy.
Ethical Design and Deployment: Ethical considerations are increasingly becoming a part of the design and deployment process of CV technologies. This involves assessing the potential impact on society and individuals and ensuring that privacy and individual rights are safeguarded.
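The face-blurring mentioned above can be approximated by pixelating a region of each frame. This is a minimal stand-in (a real system would first run a face detector to find the region to anonymize):

```python
import numpy as np

def blur_region(img, top, left, h, w, k=4):
    """Anonymize a region by pixelating it (a simple face-blur stand-in).

    Each k x k block inside the region is replaced by its mean, which
    destroys identifying detail while leaving the rest of the frame intact.
    """
    out = img.copy()
    for r in range(top, top + h, k):
        for c in range(left, left + w, k):
            block = out[r:min(r + k, top + h), c:min(c + k, left + w)]
            block[:] = block.mean()
    return out

frame = np.arange(64, dtype=float).reshape(8, 8)   # toy video frame
anon = blur_region(frame, 2, 2, 4, 4)              # "face" at rows/cols 2-5
print(anon)
```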
Synthetic Data and Generative AI in Computer Vision
The role of generative AI in creating synthetic data has become increasingly significant in developing and improving computer vision (CV) systems.
Generative AI and Synthetic Data Creation
Enhancing Training of CV Models: Generative AI algorithms can create realistic, high-quality synthetic data. This data is particularly valuable for training CV models, especially when real-world data is scarce, sensitive, or difficult to obtain.
Diversity and Volume: Synthetic data generated by AI can encompass various scenarios and variations, offering a rich and diverse dataset. This diversity is crucial for training robust CV models capable of performing accurately in various real-world conditions.
Privacy and Ethical Compliance: Using synthetic data mitigates privacy concerns associated with using real data, especially in sensitive areas like healthcare and security. It offers a way to train effective CV models without compromising individual privacy.
Cost-Effectiveness and Efficiency: Generating synthetic data can be more cost-effective and efficient than collecting and labeling vast amounts of real-world data. It also speeds up the iterative process of training and refining CV models.
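The core appeal of synthetic data, cheap samples with perfect labels, can be sketched without any generative model at all. The procedural generator below is a hypothetical stand-in: it emits images together with exact ground-truth boxes, the way a synthetic-data pipeline would.

```python
import numpy as np

rng = np.random.default_rng(0)

def synth_sample(size=32):
    """Generate one synthetic training image with a perfect label.

    A procedural stand-in for a generative model: a random bright
    rectangle on a noisy background, with its bounding box known
    exactly, so no manual annotation is needed.
    """
    img = rng.normal(0.1, 0.05, (size, size))   # background noise
    h, w = rng.integers(5, 12, size=2)
    top = rng.integers(0, size - h)
    left = rng.integers(0, size - w)
    img[top:top + h, left:left + w] += 0.8      # the "object"
    label = (top, left, h, w)                   # free ground truth
    return img, label

dataset = [synth_sample() for _ in range(100)]  # cheap, varied, private
img, (top, left, h, w) = dataset[0]
print(img.shape, (top, left, h, w))
```

Generative models play the same role at much higher fidelity, producing photorealistic scenes whose annotations are still known by construction.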
Computer Vision in Edge Computing
In 2024, the trend of integrating Computer Vision (CV) with edge computing is becoming increasingly prominent, revolutionizing how data is processed in various applications.
The Shift to On-Device Processing
Reduced Latency: By processing visual data directly on the device (edge computing), response times are significantly decreased. This is vital in applications where real-time analysis is crucial, such as in autonomous vehicles or real-time monitoring systems.
Improved Privacy and Security: Edge computing allows for sensitive data to be processed locally, reducing the risk of data breaches during transmission to cloud-based servers. This is particularly important in applications involving personal or sensitive information.
Enhanced Efficiency: Local data processing minimizes the need to transfer large volumes of data to the cloud, thereby reducing bandwidth usage and associated costs. This is beneficial for devices operating in remote or bandwidth-constrained environments.
Scalability: Edge computing enables scalability in CV applications. Devices can process data independently, alleviating the load on central servers and allowing for the deployment of more devices without a proportional increase in central processing requirements.
Applications in Diverse Fields
Intelligent Security Systems: In security and surveillance, edge computing allows for immediate processing and analysis of visual data, enabling quicker response to potential security threats.
Healthcare: Portable medical devices with integrated CV can process data on the edge, aiding in immediate diagnostic procedures and patient monitoring.
Retail and Consumer Applications: In retail, edge computing enables smart shelves and inventory management systems to process visual data in real time, improving efficiency and customer experience.
Industrial and Manufacturing: In industrial settings, edge computing facilitates real-time monitoring and quality inspection, improving operational efficiency and safety.
Computer Vision in Healthcare
Computer Vision (CV) is significantly impacting the healthcare sector, offering innovative solutions for medical image analysis, surgical assistance, and patient monitoring.
Medical Image Analysis
Diagnostic Accuracy: CV algorithms are increasingly used to analyze medical images such as X-rays, MRIs, and CT scans. They assist in identifying abnormalities, leading to quicker and more accurate diagnoses.
Cancer Detection: In oncology, CV aids in the early detection of cancers, such as breast or skin cancer, through detailed analysis of medical imagery.
Automated Analysis: Automated image analysis can handle large volumes of medical images, reducing the workload on radiologists and increasing efficiency.
Aiding Surgeries
Surgical Robotics: CV is integral to the functioning of surgical robots, providing them with the necessary visual information to assist surgeons in performing precise and minimally invasive procedures.
Real-Time Navigation: During surgeries, CV provides real-time imaging, aiding surgeons in navigating complex procedures and avoiding critical structures.
Training and Simulation: CV technologies are used in surgical training, providing simulations that help surgeons hone their skills in a risk-free environment.
Patient Monitoring
Remote Monitoring: CV enables remote patient monitoring, allowing healthcare providers to observe patients’ physical condition and movements without being physically present. This is particularly beneficial for elderly care and monitoring patients in intensive care units.
Fall Detection and Prevention: In elderly care, CV systems can detect falls or unusual behaviors, alerting caregivers to potential emergencies.
Behavioral Analysis: CV is also used in analyzing patients’ behaviors and movements, which can be vital in psychiatric care and physical therapy.
Challenges and Future Directions
While CV is bringing transformative changes to healthcare, it also presents challenges such as data privacy concerns, the need for large annotated datasets, and ensuring the accuracy and reliability of algorithms. The future of CV in healthcare is promising, with ongoing research and development aimed at addressing these challenges and expanding its applications.
Detecting Deepfakes: The Crucial Role of Computer Vision
As AI-generated deepfakes become increasingly realistic and pervasive, the importance of Computer Vision (CV) in detecting and combating them has become more critical.
The Challenge of Deepfakes
Realism and Proliferation: Deepfakes, synthesized using advanced AI algorithms, are becoming more sophisticated, making them harder to distinguish from real footage. Their potential use in spreading misinformation or malicious content poses significant challenges.
Misinformation and Security Threats: The use of deepfakes in spreading false information can have serious implications in various spheres, including politics, security, and personal privacy.
CV’s Role in Deepfake Detection
Analyzing Visual Inconsistencies: CV algorithms are trained to detect subtle inconsistencies in videos and images that are typically overlooked by the human eye. This includes irregularities in facial expressions, lip movements, and eye blinking patterns.
Temporal and Spatial Analysis: CV techniques analyze both spatial features (like facial features) and temporal features (like movement over time) in videos to identify anomalies that suggest manipulation.
Training on Diverse Data Sets: To improve the accuracy of deepfake detection, CV systems are trained on diverse datasets that include various types of manipulations and original content.
The importance of CV in identifying deepfakes cannot be overstated, as it stands at the forefront of preserving information integrity in the digital age. The advancements in this field will be instrumental in maintaining trust and authenticity in digital media.
Real-Time Computer Vision
Enhancing security, crowd monitoring, and industrial safety.
Real-time computer vision (CV) technologies are increasingly being deployed in various fields like security, crowd monitoring, and industrial safety, offering dynamic and immediate data analysis for enhanced operational efficiency and safety.
Applications in Security
Surveillance Systems: Real-time CV is revolutionizing surveillance by enabling immediate identification and alerting of security breaches or unusual activities. This includes facial recognition, intrusion detection, and unauthorized access alerts.
Automated Threat Detection: CV systems can detect potential threats in real-time, such as identifying unattended bags in public areas or spotting unusual behaviors that could indicate criminal activities.
Crowd Monitoring and Management
Public Safety: In large public gatherings, real-time CV aids in crowd density analysis, helping to prevent stampedes or accidents by alerting authorities to potential dangers due to overcrowding.
Traffic Management: In urban settings, CV systems monitor and analyze traffic flow in real time, helping in congestion management and accident prevention.
Event Management: For events like concerts or sports games, real-time CV can assist in crowd control, ensuring that safety regulations are adhered to and identifying potential bottlenecks or overcrowding situations.
Industrial Safety
Workplace Monitoring: CV systems monitor industrial environments in real time, detecting potential hazards like equipment malfunctions or unsafe worker behavior, thus preventing accidents and ensuring compliance with safety protocols.
Quality Control: In manufacturing, real-time CV assists in continuous monitoring of production lines, instantly identifying defects or deviations from standard protocols.
Equipment Maintenance: CV can help in predictive maintenance by detecting early signs of wear and tear in machinery, preventing costly downtime and accidents.
Conclusion: Navigating the Future of Computer Vision
From enhancing healthcare and security to revolutionizing interactive technologies like AR, CV is reshaping our interaction with the digital world. Its advancements, including AI integration and edge computing, highlight a future rich with potential.
Yet, this journey forward isn’t without challenges. Balancing innovation with ethical responsibility, privacy, and fairness remains crucial. As CV becomes more embedded in our lives, it calls for a collaborative approach among technologists, ethicists, and policymakers to ensure it benefits society responsibly and equitably.
In essence, CV’s future is not just about technological growth but also about addressing ethical and societal needs, marking an exciting, transformative journey ahead.
Computer Vision: 10 Papers to Start
Dec 25, 2015
“How do I know what papers to read in computer vision? There are so many. And they are so different.” Graduate student, Xi’an, China, November 2011.
This is a quote from an opinion paper by my advisor. Having worked on computer vision for nearly two years, I can absolutely relate to this comment. The diversity of computer vision can be especially confusing for beginners.
This post serves as a humble attempt to answer the opening question. Of course it is subjective, but a good starting point for sure.
This post is intended for newcomers to computer vision, mostly undergraduate students. An important lesson is that, unlike in undergraduate education, in research you learn primarily from reading papers, which is why I am recommending 10 to start.
Before getting to the list, it is good to know where CV papers are usually published. CV researchers like to publish in conferences. The three top-tier CV conferences are CVPR (every year), ICCV (odd years), and ECCV (even years). Since CV is an application of machine learning, people also publish in NIPS and ICML. ICLR is new but rapidly rising to the top tier. As for journals, PAMI and IJCV are the best.
I am partitioning the 10 papers into 5 categories, and the list is loosely sorted by publication time. Here it goes!
Features
Finding good features has always been a core problem of computer vision. A good feature summarizes the information in an image and enables the subsequent use of powerful mathematical tools. In the 2000s, many feature designs were proposed.
Distinctive Image Features from Scale-Invariant Keypoints, IJCV 2004
SIFT feature is designed to establish correspondence between two images. Its most important applications are in reconstruction and tracking.
Histograms of Oriented Gradients for Human Detection, CVPR 2005
HOG shares the same feature-design philosophy as SIFT but is even simpler. While SIFT targets low-level correspondence between images, HOG is used for higher-level recognition tasks such as human detection.
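The idea both papers share, describing a patch by a histogram of its gradient orientations, can be sketched as follows. This is a heavily simplified version; the real descriptors add keypoint scale selection (SIFT) or cells, blocks, and normalization schemes (HOG) on top:

```python
import numpy as np

def orientation_histogram(patch, bins=8):
    """Simplified HOG/SIFT-style descriptor for one image patch.

    Compute per-pixel gradients, then accumulate gradient magnitude
    into unsigned orientation bins, and normalize the result.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi           # unsigned orientation [0, pi)
    hist = np.zeros(bins)
    idx = (ang / np.pi * bins).astype(int) % bins
    for b in range(bins):
        hist[b] = mag[idx == b].sum()
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist         # invariant to contrast scaling

# A patch with purely vertical stripes: horizontal gradients dominate,
# so all the mass lands in the 0-radian orientation bin.
patch = np.tile([0.0, 0.0, 1.0, 1.0], (8, 2))
h = orientation_histogram(patch)
print(h)
```

Because the histogram pools many pixels, it is robust to small deformations and, after normalization, to illumination changes, which is exactly what makes these descriptors useful for matching and detection.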
Reconstruction
Reconstruction is an important branch of computer vision. Since the 2000s, structure from motion (SfM) has been formalized and is still the standard practice today.
Photo Tourism: Exploring Photo Collections in 3D, ACM Transactions on Graphics 2006
This paper uses SfM to reconstruct scenes from photos collected from the internet. Since then, the core pipeline has remained more or less the same, and people seek improvements in, for instance, scalability and visualization. An extended IJCV version was also published later.
Graphical Models
A graphical model is a machine learning tool that captures the relationships between random variables. It is quite general in nature and suitable for many computer vision tasks.
Structured Learning and Prediction in Computer Vision, Foundations and Trends in Computer Graphics and Vision 2011
This 180+ page monograph is one of the first papers I read, and it remains my personal favourite. It is a comprehensive overview of both the theory and applications of graphical models in various computer vision tasks.
Datasets
Advances in computer vision can hardly happen without good datasets. Evaluation on a suitable, unbiased dataset is the real proof of a proposed algorithm. Interestingly, the evolution of datasets also reflects the progress of computer vision research.
The PASCAL Visual Object Classes (VOC) Challenge, IJCV 2010
PASCAL VOC is the standard evaluation dataset for semantic segmentation and object detection. While the annual challenge has ended, the evaluation server is still open, and the leaderboard is definitely worth checking to find state-of-the-art results and algorithms. There is also a recent retrospective paper in IJCV.
ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009
ImageNet is the first truly large-scale dataset, containing millions of images organized by the WordNet hierarchy (its widely used ILSVRC subset covers 1,000 categories). It is the standard evaluation dataset for classification and one of the driving forces behind the recent success of deep convolutional neural networks. There is also a recent retrospective paper in IJCV.
Microsoft COCO: Common Objects in Context, ECCV 2014
This dataset is relatively new. Similar to PASCAL VOC, it aims at instance segmentation and object detection, but the number of images is much larger. More interestingly, it contains language descriptions for each image, bridging computer vision with natural language processing.
Deep Learning
I am sure you have heard of deep learning. It is an end-to-end hierarchical model optimized simply by the chain rule and gradient descent. What makes it powerful is its huge number of parameters, which enables unprecedented representation capacity.
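That recipe, the chain rule plus gradient descent, fits in a few lines of NumPy. The two-layer network below is a minimal sketch (the layer sizes, learning rate, and toy regression task are arbitrary choices, not anything from a particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression: learn y = x^2 on [-1, 1] with a tiny two-layer net.
X = rng.uniform(-1, 1, (64, 1))
Y = X ** 2

W1 = rng.normal(0, 0.5, (1, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.3

losses = []
for step in range(500):
    # Forward pass: linear -> tanh -> linear.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    loss = np.mean((pred - Y) ** 2)
    losses.append(loss)

    # Backward pass: the chain rule, applied layer by layer.
    g_pred = 2 * (pred - Y) / len(X)
    gW2 = h.T @ g_pred
    gb2 = g_pred.sum(0)
    g_h = g_pred @ W2.T * (1 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    gW1 = X.T @ g_h
    gb1 = g_h.sum(0)

    # Gradient descent: step every parameter downhill.
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

A deep CNN is the same loop with more layers, convolutional weights, and far more parameters; frameworks merely automate the backward pass.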
ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
This paper marks the big breakthrough of applying deep learning to computer vision. Made possible by the large ImageNet dataset and fast GPUs, the model took about a week to train and beat the best traditional method on image classification by roughly 10 percentage points.
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, ICML 2014
This paper shows that while the model mentioned above is trained for image classification, its intermediate representation is a powerful feature that can transfer to other tasks. This comes back to finding good features for images. In high-level tasks, deep features consistently show superiority over traditional features.
Visualizing and Understanding Convolutional Networks, ECCV 2014
Understanding what is actually going on inside a deep neural network remains a challenging task. This paper is perhaps the most famous and important work towards this goal: it looks at individual neurons and uses deconvolution to visualize what they respond to. However, there is still much to be done.
Again, this has been a humble attempt to address the opening question. Hope these excellent papers can kindle your enthusiasm for computer vision!
Merry Christmas!
10 Research Papers Accepted to CVPR 2023
Research from the department has been accepted to the 2023 Computer Vision and Pattern Recognition (CVPR) Conference. The annual event explores machine learning, artificial intelligence, and computer vision research and its applications.
CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
Samir Yitzhak Gadre (Columbia University), Mitchell Wortsman (University of Washington), Gabriel Ilharco (University of Washington), Ludwig Schmidt (University of Washington), Shuran Song (Columbia University)
For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference). We explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, we investigate a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, we introduce the Pasture benchmark, which considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects. We conduct an in-depth empirical study by directly deploying 21 CoW baselines across Habitat, RoboTHOR, and Pasture. In total, we evaluate over 90k navigation episodes and find that (1) CoW baselines often struggle to leverage language descriptions, but are proficient at finding uncommon objects. (2) A simple CoW, with CLIP-based object localization and classical exploration — and no additional training — matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data. This same CoW provides a 15.6 percentage point improvement in success over a state-of-the-art RoboTHOR ZSON model.
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval
Xudong Lin (Columbia University), Simran Tiwari (Columbia University), Shiyuan Huang (Columbia University), Manling Li (UIUC), Mike Zheng Shou (National University of Singapore), Heng Ji (UIUC), Shih-Fu Chang (Columbia University)
Multi-channel video-language retrieval requires models to understand information from different channels (e.g. video+question, video+speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text, e.g., CLIP; text contrastive models have been extensively studied recently for their strong ability to produce discriminative sentence embeddings, e.g., SimCSE. However, there is no clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or discrete text tokens; for the fusion method, we explore the use of a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yield the best performance, which can even outperform the state of the art on the iVQA and How2QA datasets without additional training on millions of video-text data. Further analysis shows that this is because representing videos as text tokens captures the key visual information, and text tokens are naturally aligned with text models that are strong retrievers after the contrastive pretraining process. All of this empirical analysis establishes a solid foundation for future research on affordable and upgradable multimodal intelligence.
DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection. Jiawei Ma (Columbia University), Yulei Niu (Columbia University), Jincheng Xu (Columbia University), Shiyuan Huang (Columbia University), Guangxing Han (Columbia University), Shih-Fu Chang (Columbia University)
Generalized few-shot object detection aims to achieve precise detection both on base classes with abundant annotations and on novel classes with limited training data. Existing approaches either enhance few-shot generalization at the cost of base-class performance, or maintain high precision on base classes with only limited improvement in novel-class adaptation. In this paper, we point out that the reason is insufficient Discriminative feature learning for all of the classes. We therefore propose a new training framework, DiGeo, to learn Geometry-aware features with inter-class separation and intra-class compactness. To guide the separation of feature clusters, we derive an offline simplex equiangular tight frame (ETF) classifier whose weights serve as class centers and are maximally and equally separated. To tighten the cluster for each class, we include adaptive class-specific margins in the classification loss and encourage the features to stay close to the class centers. Experimental studies on two few-shot benchmark datasets (VOC, COCO) and one long-tail dataset (LVIS) demonstrate that, with a single model, our method can effectively improve generalization on novel classes without hurting detection of base classes.
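The simplex ETF mentioned above is a concrete geometric object: K unit-norm class centers in R^d whose pairwise cosine similarity is the maximal equal separation, -1/(K-1). A minimal sketch of its construction follows; the use of QR to obtain the orthonormal basis is our assumption for illustration, not necessarily the paper's exact recipe.

```python
import numpy as np

def simplex_etf(K, d, seed=0):
    """K maximally-and-equally-separated unit vectors in R^d (d >= K)."""
    assert d >= K, "need ambient dimension >= number of classes"
    rng = np.random.default_rng(seed)
    # Orthonormal basis U in R^{d x K} via QR of a random Gaussian matrix.
    U, _ = np.linalg.qr(rng.standard_normal((d, K)))
    # Center and rescale so every column has unit norm and every pair of
    # columns has inner product exactly -1/(K-1).
    M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)
    return (U @ M).T  # shape (K, d): one fixed class center per row

W = simplex_etf(4, 16)
G = W @ W.T
print(np.round(G, 3))  # 1.0 on the diagonal, -1/3 everywhere else
```

Freezing such weights as the classifier, as DiGeo does offline, fixes the geometry of the feature space in advance: no matter how imbalanced base and novel classes are, their centers cannot collapse toward one another.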
Supervised Masked Knowledge Distillation for Few-Shot Transformers. Han Lin (Columbia University), Guangxing Han (Columbia University), Jiawei Ma (Columbia University), Shiyuan Huang (Columbia University), Xudong Lin (Columbia University), Shih-Fu Chang (Columbia University)
Vision Transformers (ViTs) achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. However, under few-shot learning (FSL) settings on small datasets with only a few labeled examples, ViTs tend to overfit and suffer severe performance degradation due to the absence of CNN-like inductive biases. Previous works in FSL avoid this problem either with the help of self-supervised auxiliary losses or through dexterous use of label information in supervised settings. But the gap between self-supervised and supervised few-shot Transformers is still unfilled. Inspired by recent advances in self-supervised knowledge distillation and masked image modeling (MIM), we propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers which incorporates label information into self-distillation frameworks. Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens, and introduce the challenging task of masked patch token reconstruction across intra-class images. Experimental results on four few-shot classification benchmark datasets show that our method, despite its simple design, outperforms previous methods by a large margin and achieves a new state of the art. Detailed ablation studies confirm the effectiveness of each component of our model. Code for this paper is available here: this https URL .
FLEX: Full-Body Grasping Without Full-Body Grasps. Purva Tendulkar (Columbia University), Dídac Surís (Columbia University), Carl Vondrick (Columbia University)
Synthesizing 3D human avatars interacting realistically with a scene is an important problem with applications in AR/VR, video games and robotics. Towards this goal, we address the task of generating a virtual human — hands and full body — grasping everyday objects. Existing methods approach this problem by collecting a 3D dataset of humans interacting with objects and training on this data. However, 1) these methods do not generalize to different object positions and orientations, or to the presence of furniture in the scene, and 2) the diversity of their generated full-body poses is very limited. In this work, we address all the above challenges to generate realistic, diverse full-body grasps in everyday scenes without requiring any 3D full-body grasping data. Our key insight is to leverage the existence of both full-body pose and hand grasping priors, composing them using 3D geometrical constraints to obtain full-body grasps. We empirically validate that these constraints can generate a variety of feasible human grasps that are superior to baselines both quantitatively and qualitatively. See our webpage for more details: this https URL .
Humans As Light Bulbs: 3D Human Reconstruction From Thermal Reflection. Ruoshi Liu (Columbia University), Carl Vondrick (Columbia University)
The relatively hot temperature of the human body causes people to turn into long-wave infrared light sources. Since this emitted light has a larger wavelength than visible light, many surfaces in typical scenes act as infrared mirrors with strong specular reflections. We exploit the thermal reflections of a person onto objects in order to locate their position and reconstruct their pose, even if they are not visible to a normal camera. We propose an analysis-by-synthesis framework that jointly models the objects, people, and their thermal reflections, which combines generative models with differentiable rendering of reflections. Quantitative and qualitative experiments show our approach works in highly challenging cases, such as with curved mirrors or when the person is completely unseen by a normal camera.
Tracking Through Containers and Occluders in the Wild. Basile Van Hoorick (Columbia University), Pavel Tokmakov (Toyota Research Institute), Simon Stent (Woven Planet), Jie Li (Toyota Research Institute), Carl Vondrick (Columbia University)
Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment. We set up a task where the goal is, given a video sequence, to segment both the projected extent of the target object and the surrounding container or occluder whenever one exists. To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment. We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable performance gap before we can claim a tracking model to have acquired a true notion of object permanence.
Doubly Right Object Recognition: A Why Prompt for Visual Rationales. Chengzhi Mao (Columbia University), Revant Teotia (Columbia University), Amrutha Sundar (Columbia University), Sachit Menon (Columbia University), Junfeng Yang (Columbia University), Xin Wang (Microsoft Research), Carl Vondrick (Columbia University)
Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a “doubly right” object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels as well as the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a “why prompt,” which adapts large visual representations to produce correct rationales. Visualizations and empirical experiments show that our prompts significantly improve performance on doubly right object recognition, in addition to zero-shot transfer to unseen tasks and datasets.
What You Can Reconstruct From a Shadow. Ruoshi Liu (Columbia University), Sachit Menon (Columbia University), Chengzhi Mao (Columbia University), Dennis Park (Toyota Research Institute), Simon Stent (Woven Planet), Carl Vondrick (Columbia University)
3D reconstruction is a fundamental problem in computer vision, and the task is especially challenging when the object to reconstruct is partially or fully occluded. We introduce a method that uses the shadows cast by an unobserved object to infer the possible 3D volumes under occlusion. We create a differentiable image formation model that allows us to jointly infer the 3D shape of an object, its pose, and the position of a light source. Since the approach is end-to-end differentiable, we are able to integrate learned priors of object geometry to generate realistic 3D shapes of different object categories. Experiments and visualizations show that the method is able to generate multiple possible solutions that are consistent with the observed shadow. Our approach works even when the position of the light source and the object pose are both unknown, and it is robust to real-world images where the ground-truth shadow mask is unknown.
CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language. Aditya Sanghi (Autodesk Research), Rao Fu (Brown University), Vivian Liu (Columbia University), Karl D.D. Willis (Autodesk Research), Hooman Shayani (Autodesk Research), Amir H. Khasahmadi (Autodesk Research), Srinath Sridhar (Brown University), Daniel Ritchie (Brown University)
Recent works have demonstrated that natural language can be used to generate and edit 3D shapes. However, these methods generate shapes with limited fidelity and diversity. We introduce CLIP-Sculptor, a method to address these constraints by producing high-fidelity and diverse 3D shapes without the need for (text, shape) pairs during training. CLIP-Sculptor achieves this through a multi-resolution approach that first generates in a low-dimensional latent space and then upscales to a higher resolution for improved shape fidelity. For improved shape diversity, we use a discrete latent space which is modeled using a transformer conditioned on CLIP’s image-text embedding space. We also present a novel variant of classifier-free guidance, which improves the accuracy-diversity trade-off. Finally, we perform extensive experiments demonstrating that CLIP-Sculptor outperforms state-of-the-art baselines.
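The abstract does not spell out its classifier-free guidance variant, so as context here is only the standard classifier-free guidance update it builds on: the model is queried with and without the condition, and the two predictions are extrapolated by a guidance scale. A minimal sketch:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    # Standard classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one. scale = 1 recovers the plain
    # conditional prediction; scale > 1 trades diversity for fidelity to
    # the condition (here, CLIP's text/image embedding).
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # model output without the condition
eps_c = np.array([1.0, -1.0])  # model output with the condition
print(cfg_combine(eps_u, eps_c, 2.0))  # [ 2. -2.]
```

The accuracy-diversity trade-off the abstract mentions is exactly the choice of `scale`: larger values follow the text prompt more faithfully but concentrate samples on fewer modes.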
Multi-Constraint Transferable Generative Adversarial Networks for Cross-Modal Brain Image Synthesis
- Published: 28 May 2024
- Yawen Huang 1,
- Hao Zheng 1,
- Yuexiang Li 2,
- Feng Zheng 3,
- Xiantong Zhen 4,
- GuoJun Qi 5,
- Ling Shao 6 &
- Yefeng Zheng 1
Recent progress in generative models has led to drastic growth of research in image generation. Existing approaches show visually compelling results by learning multi-modal distributions, but they still lack realism, especially in certain scenarios such as medical image synthesis. In this paper, we propose a novel Brain Generative Adversarial Network (BrainGAN) that explores GANs with multi-constraint and transferable properties for cross-modal brain image synthesis. We formulate BrainGAN by introducing a unified framework with new constraints that simultaneously enhance modal matching, texture details, and anatomical structure. We show how BrainGAN can learn meaningful tissue representations with the rich variability of brain images. In addition to generating 3D volumes that are visually indistinguishable from real ones, we model adversarial discriminators and segmentors jointly, along with the proposed cost functions, which force our networks to synthesize brain MRIs with realistic textures conditioned on anatomical structures. BrainGAN is evaluated on three public datasets, where it consistently outperforms other state-of-the-art approaches by a large margin, advancing cross-modal synthesis of brain images both visually and practically.
https://brain-development.org/ixi-dataset/ .
https://insight-journal.org/midas/collection/view/190 .
https://www.med.upenn.edu/sbia/brats2018/data.html .
Note that the segmentation mask is not true “ground truth”; the Dice score is therefore calculated against noisy labels.
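For reference, the Dice score used in the footnote above compares a predicted binary mask against a reference mask as Dice = 2|A ∩ B| / (|A| + |B|); a minimal sketch, where the empty-vs-empty convention is our assumption:

```python
import numpy as np

def dice_score(pred, ref):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, ref = np.asarray(pred, dtype=bool), np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 1.0  # both masks empty: conventionally treated as perfect
    return 2.0 * np.logical_and(pred, ref).sum() / denom

# One overlapping voxel out of 2 predicted + 1 reference: 2*1/(2+1)
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.6666666666666666
```

When the reference mask is itself noisy, as the footnote warns, the score bounds agreement with the noisy labels rather than true segmentation quality.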
Author information
Authors and affiliations.
Jarvis Research Center, Tencent YouTu Lab, Shenzhen, China
Yawen Huang, Hao Zheng & Yefeng Zheng
Medical AI ReSearch (MARS) Group, Guangxi Key Laboratory for Genomic and Personalized Medicine, Guangxi Medical University, Nanning, 530021, Guangxi, China
Yuexiang Li
Southern University of Science and Technology, Shenzhen, China
Central Research Institute, United Imaging Healthcare Co., Ltd., Beijing, China
Xiantong Zhen
University of Central Florida, Orlando, FL, USA
UCAS-Terminus AI Lab, University of Chinese Academy of Sciences, Beijing, 100049, China
Corresponding authors
Correspondence to Yuexiang Li or Yefeng Zheng .
Additional information
Communicated by Paolo Rota.
About this article
Huang, Y., Zheng, H., Li, Y. et al. Multi-Constraint Transferable Generative Adversarial Networks for Cross-Modal Brain Image Synthesis. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02109-4
Received : 03 April 2023
Accepted : 22 April 2024
Published : 28 May 2024
DOI : https://doi.org/10.1007/s11263-024-02109-4
- Image synthesis
- Cross-modal
- Multi-constraint
- Generative adversarial network
Two big computer vision papers boost prospect of safer self-driving vehicles
New chip and camera technology bring closer potential of hands-free road time.
Like nuclear fusion and jet-packs, the self-driving car is a long-promised technology that has stalled for years. Yet, armed with new research, boffins think they have found potential improvements.…
Citizens of Phoenix, San Francisco, and Los Angeles are able to take one of Waymo's self-driving taxis, first introduced to the public in December 2020. But they have not been without their glitches. Just last month in San Francisco, for example, one of the taxi service's autonomous vehicles drove down the wrong side of the street to pass a unicycle. In December last year, a Waymo vehicle hit a backwards-facing pickup truck, resulting in a report with the US National Highway Traffic Safety Administration (NHTSA) and a software update.
But this week, not one but two groups of researchers bidding to improve the performance of self-driving cars and other autonomous vehicles have published papers in the international science journal Nature.
A design for a new chip geared towards autonomous vehicles has arrived from China. Tsinghua University's Luping Shi and colleagues have taken inspiration from the human visual system, combining fast but low-accuracy event-based detection with slower but more accurate visualization of a full image.
The researchers were able to show the chip — dubbed Tianmouc — could process pixel arrays quickly and robustly in an automotive driving perception system.
In a paper published today, the authors said: "We demonstrate the integration of a Tianmouc chip into an autonomous driving system, showcasing its abilities to enable accurate, fast and robust perception, even in challenging corner cases on open roads. The primitive-based complementary sensing paradigm helps in overcoming fundamental limitations in developing vision systems for diverse open-world applications."
In a separate paper, Davide Scaramuzza, University of Zurich robotics and perception professor, and his colleagues adopt a similar hybrid approach but apply it to camera technologies.
Cameras for self-driving vehicles navigate a trade-off between bandwidth and latency. While high-res color cameras have good resolution, they require high bandwidth to detect rapid changes. Conversely, reducing the bandwidth increases latency, affecting the timely processing of data for potentially life-saving decision making.
To get out of this bind, the Swiss-based researchers developed a hybrid camera combining event processing with high-bandwidth image processing. Event cameras only record intensity changes and report them as sparse measurements, so the system does not suffer from the bandwidth/latency trade-off.
The event camera is used to detect changes in the blind time between image frames. The event data are converted into a graph that changes over time and connects nearby points, and this graph is processed locally. The resulting hybrid object detector reduces detection time in dangerous high-speed situations, according to an explanatory video.
In their paper, the authors say: "Our method exploits the high temporal resolution and sparsity of events and the rich but low temporal resolution information in standard images to generate efficient, high-rate object detections, reducing perceptual and computational latency."
They argue their use of a 20 frames per second RGB camera plus an event camera can achieve the same latency as a 5,000-fps camera with the bandwidth of a 45-fps camera without compromising accuracy.
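The latency and bandwidth figures above can be sanity-checked with back-of-the-envelope arithmetic. The resolution (640x480) and 3 bytes per pixel below are illustrative assumptions, not figures from the paper; for a frame camera, worst-case latency is one frame period and bandwidth scales linearly with frame rate.

```python
# Back-of-the-envelope frame-camera trade-off: latency falls with frame
# rate, but bandwidth grows with it. Resolution and pixel depth are
# assumed values for illustration only.
def frame_camera(fps, width=640, height=480, bytes_per_pixel=3):
    latency_ms = 1000.0 / fps                       # worst case: one frame period
    bandwidth_mb_s = fps * width * height * bytes_per_pixel / 1e6
    return latency_ms, bandwidth_mb_s

for fps in (20, 45, 5000):
    latency, bw = frame_camera(fps)
    print(f"{fps:>5} fps: {latency:8.2f} ms latency, {bw:10.1f} MB/s")
```

Under these assumptions a 5,000-fps camera offers 0.2 ms latency but needs roughly a hundred times the bandwidth of a 45-fps camera, which is the gap the hybrid event/frame design claims to close.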
"Our approach paves the way for efficient and robust perception in edge-case scenarios by uncovering the potential of event cameras," the authors write.
With a hybrid approach to both cameras and data processing in the offing, more widespread adoption of self-driving vehicles may be just around the corner. ®
Title: Controllable Longer Image Animation with Diffusion Models
Abstract: Generating realistic animated videos from static images is an important area of research in computer vision. Methods based on physical simulation and motion prediction have achieved notable advances, but they are often limited to specific object textures and motion trajectories, failing to exhibit highly complex environments and physical dynamics. In this paper, we introduce an open-domain controllable image animation method using motion priors with video diffusion models. Our method achieves precise control over the direction and speed of motion in the movable region by extracting the motion field information from videos and learning moving trajectories and strengths. Current pretrained video generation models are typically limited to producing very short videos, typically less than 30 frames. In contrast, we propose an efficient long-duration video generation method based on noise rescheduling specifically tailored for image animation tasks, facilitating the creation of videos over 100 frames in length while maintaining consistency in scene content and motion coordination. Specifically, we decompose the denoising process into two distinct phases: the shaping of scene contours and the refining of motion details. Then we reschedule the noise so that the generated frame sequences maintain long-distance noise correlation. We conducted extensive experiments with 10 baselines, encompassing both commercial tools and academic methodologies, which demonstrate the superiority of our method. Our project page: this https URL
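The abstract's key idea of keeping long-distance noise correlation across frames can be illustrated with a toy mixing scheme. This is an assumed construction for demonstration, not the paper's actual noise reschedule: each frame's noise blends a shared base component with fresh per-frame noise, so even distant frames remain statistically correlated.

```python
# Toy illustration of correlated per-frame noise (an assumed mixing
# scheme, not the paper's reschedule): every frame shares a common base
# noise component, so distant frames stay correlated.
import random

random.seed(0)

def correlated_noise(n_frames, n_dims, alpha=0.8):
    """alpha is the fraction of each frame's variance from the shared base."""
    base = [random.gauss(0, 1) for _ in range(n_dims)]
    frames = []
    for _ in range(n_frames):
        fresh = [random.gauss(0, 1) for _ in range(n_dims)]
        # Mix so each frame keeps unit variance overall.
        frames.append([(alpha ** 0.5) * b + ((1 - alpha) ** 0.5) * f
                       for b, f in zip(base, fresh)])
    return frames

def correlation(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

frames = correlated_noise(n_frames=100, n_dims=512, alpha=0.8)
# First and last frame remain correlated through the shared base,
# with sample correlation near alpha.
print(round(correlation(frames[0], frames[-1]), 2))
```

Independent noise per frame would give correlation near zero between distant frames; the shared base keeps it near alpha, which is the property that preserves scene consistency over a 100-plus-frame sequence.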