Checkout-free retail represents one of the most challenging real-world computer vision deployments, requiring reliable performance across hundreds of sites and millions of interactions. This tutorial bridges academic research and production deployment by framing checkout-free retail as a canonical large-scale multi-camera vision systems problem.
The tutorial focuses on three foundational pillars, presented as generalizable computer vision problems rather than application-specific solutions:
- **Online calibration:** continuous online estimation that addresses drift and partial failures using both deep learning and conventional CV pipelines.
- **Multi-object tracking:** global data association under asynchronous, unreliable observations via integer programming and graph-based formulations (see the sketch below).
- **Event inference:** inference from partial visual evidence on edge hardware, covering Kubernetes deployment and CPU/GPU/TPU optimization.

Each component is discussed in terms of its underlying formulations, scalability constraints, failure modes, and design tradeoffs that transfer directly to autonomous driving, smart spaces, sports analytics, warehouses, and urban sensing.
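To ground the data-association pillar, here is a minimal sketch that poses frame-level association as bipartite matching with a gated appearance cost. The function name `associate`, the cosine-distance cost, and the 0.5 gating threshold are illustrative assumptions, not the deployed formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(track_feats, det_feats, max_cost=0.5):
    """Match existing tracks to new detections via bipartite assignment.

    track_feats: (T, D) appearance embeddings, one per track
    det_feats:   (N, D) appearance embeddings, one per detection
    max_cost:    gating threshold on cosine distance (assumed value)
    """
    # Cosine-distance cost matrix between every track/detection pair.
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T

    # The Hungarian algorithm solves this special case of the
    # integer-programming / graph formulations covered in the tutorial.
    rows, cols = linear_sum_assignment(cost)

    matches = []
    unmatched_tracks, unmatched_dets = set(range(len(t))), set(range(len(d)))
    for r, c in zip(rows, cols):
        if cost[r, c] <= max_cost:  # gate out implausible pairings
            matches.append((r, c))
            unmatched_tracks.discard(r)
            unmatched_dets.discard(c)
    return matches, sorted(unmatched_tracks), sorted(unmatched_dets)


# Example: 4 active tracks, 5 new detections, 128-D embeddings.
matches, lost, new_dets = associate(np.random.rand(4, 128), np.random.rand(5, 128))
```

Unmatched tracks and detections feed back into track management (coasting, birth, and death), which is where asynchrony and missing observations enter the full graph formulation discussed in the tutorial.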
A central theme is how infrastructure constraints, including limited bandwidth, latency requirements, camera reliability, and edge computing budgets, fundamentally shape algorithmic and architectural decisions. Attendees will learn how classical and modern deep learning techniques are adapted for continuous online operation, partial observability, and 99.9%+ system reliability.
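As a toy example of adapting estimation for continuous online operation, the sketch below keeps a single camera's extrinsics up to date with an exponentially weighted update and a residual gate. The class name, update rate, and gate threshold are assumptions for illustration, not parameters of any deployed system.

```python
import numpy as np


class OnlineExtrinsicEstimator:
    """Toy continuous-calibration tracker: blends fresh pose observations
    into a running estimate so slow drift is corrected while one-off bad
    measurements (partial failures) are rejected."""

    def __init__(self, init_rvec, init_tvec, alpha=0.05, gate=0.1):
        self.rvec = np.asarray(init_rvec, dtype=float)  # rotation (Rodrigues vector)
        self.tvec = np.asarray(init_tvec, dtype=float)  # translation
        self.alpha = alpha  # blend rate (assumed)
        self.gate = gate    # residual gate (assumed)

    def update(self, obs_rvec, obs_tvec):
        obs_rvec = np.asarray(obs_rvec, dtype=float)
        obs_tvec = np.asarray(obs_tvec, dtype=float)
        # Reject observations far from the current estimate (likely spurious).
        residual = (np.linalg.norm(obs_rvec - self.rvec)
                    + np.linalg.norm(obs_tvec - self.tvec))
        if residual > self.gate:
            return False
        # Exponentially weighted blend tracks slow drift without large jumps.
        # (Linear blending of Rodrigues vectors is a simplification; a real
        # system would interpolate on SO(3).)
        self.rvec = (1 - self.alpha) * self.rvec + self.alpha * obs_rvec
        self.tvec = (1 - self.alpha) * self.tvec + self.alpha * obs_tvec
        return True
```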
Attendees will interact with a live multi-camera perception system demonstrating online calibration, global multi-object tracking, and structured event inference under realistic bandwidth, latency, and hardware constraints. The system operates fully offline using local compute and networking, with pre-recorded fallback visualizations for all scenarios. Each component (calibration, tracking, event inference) can be demonstrated independently.
While checkout-free retail serves as the motivating application, the tutorial deliberately abstracts away domain-specific details to focus on generalizable formulations applicable to autonomous vehicles, warehouses, smart cities, and sports analytics: any application requiring automatic multi-camera calibration, real-time edge analysis, robust event detection with limited bandwidth, and deployment at scale. The systems discussed have contributed to large-scale commercial deployments recognized through industry innovation awards and government-recognized R&D programs.
| Duration | Topic | Presenter |
|---|---|---|
| 15 min | Introduction & Multi-Camera Vision at Scale | All |
| 60 min | Automatic Multi-Camera Calibration under Scale, Drift, and Partial Observability | Agarwal |
| 15 min | Break — Attendees interact with demo | — |
| 45 min | Real-Time Multi-Camera Tracking under Asynchrony and Missing Data | Kolluru |
| 45 min | Structured Event Inference and Reliability in Multi-Camera Vision Systems | Bangalore + All |
| 30 min | Interactive Q&A & Hands-on Demo | All |
Previously led the deployment of checkout-free shopping platforms to 250+ stores worldwide at Zippin, led visual search at Slyce, and served as Principal Architect for Self-Driving at Faraday Future. Holds an M.S. from UMass Amherst and 6 U.S. patents in computer vision, autonomous driving, and edge AI. Recipient of the Sports Business Journal Best Innovation Award (2022) and recognized as an IDC Innovator (2023).
Previously a principal computer scientist at SRI International, leading large-scale robotics and vision systems for autonomous grasping, surveillance, and biometric capture, with collaborations at Stanford, MIT, and Carnegie Mellon. Ph.D. in Computer Vision from the University of Maryland; B.S. in Computer Science from IIT Delhi. Co-author of AdaMAE (CVPR 2023) and holder of multiple patents in automated checkout.
Previously Staff Software Engineer and Team Lead for Detection & Tracking at Zippin, leading computer vision infrastructure development for checkout-free retail systems deployed across hundreds of stores. Prior experience in autonomous vehicle perception and localization at General Motors and L3Harris Technologies.
Prerequisites: Basic computer vision (CNNs, detection). Camera geometry helpful but not required.
Expected Attendance: 100–300 attendees.
Primary: Multi-object tracking and re-identification.
Secondary: Camera calibration & 3D reconstruction; action/event recognition; real-time CV systems; applications in retail automation; deep learning for correspondence.