# Stereo Vision {#sec-stereo_vision}
## Introduction
@fig-titanic (a) shows a **stereo anaglyph** of the ill-fated ocean liner,
the Titanic, from @McManus2022. This was taken with two cameras, one
displaced laterally from the other. The right camera's image appears in
red in this image, and the left camera image is in cyan. If you view
this image with a red filter covering the left eye, and a cyan filter
covering the right (@fig-titanic\[b\]) then the right eye will see only
the right camera's image (analogously for the left eye), allowing the
ship to pop out into a three-dimensional (3D) stereo image. Note that
the relative displacement between each camera's image of the smokestacks
changes as a function of their depth, which allows us to form a 3D
interpretation of the ship when viewing the image with the glasses
(@fig-titanic\[b\]).
{width="100%" #fig-titanic}
In this chapter, we study how to compute depth from a pair of images
from spatially offset cameras, such as those of @fig-titanic. There are
two parts to the depth computation, usually handled separately: (1)
analyzing the geometry of the camera projections, which allows for
triangulating depth once image offsets are known; and (2) calculating
the offset between matching parts of the objects depicted in each image.
Different techniques are used in each part. We first address the
geometry, asking where points matching those in a first image can lie in
the second image. A manipulation, called **image rectification**, is
often used to place the locus of potential matches along a pixel row.
Then, assuming a rectified image pair, we will address the second part
of the depth computation, computing the offsets between matching image
points.
## Stereo Cues
How can we estimate depth from two pictures taken by two cameras placed
at different locations? Let's first gain some intuition about the cues
that are useful to estimate distances from two views captured at two
separate observation points.
### How Far Away Is a Boat?
To give an intuition on how depth from stereo vision works, let's first
look at a simple geometry problem of practical use: How do we estimate
how far away a boat is from the coast?
:::{.column-margin}
Sailors can
use tricks to estimate distances using methods related to the ones
presented in this chapter. These techniques can be useful when the
electronics on the boat are out.
:::
We can solve this problem in several ways. We will briefly discuss two
methods. The first uses a single observation point, with the horizon
line as a reference. The second uses two observation
points. Both methods are illustrated in @fig-2boats.
{width="100%" #fig-2boats}
The first method measures the location of the boat relative to the
horizon line. An observer is standing on top of an elevated point (a
mountain, tower, or building) at a known height, $h$. The observer
measures the angle $\alpha$ between the horizontal (obtained by pointing
toward the horizon) and the position of the boat. We can then derive the
distance, $d$, from the boat to the base of the observation point by the simple
trigonometric relation: $$d = \frac{h}{\tan (\alpha)}$$ This method is not very accurate and might require high observation points, which might
not be available, but it allows us to estimate the boat's position from a
single observation point, if the observer's height above the water is
known. It also requires accounting for the Earth's curvature when the boat
is far away.
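As a quick numeric check, the first method amounts to a one-line computation (a sketch; the cliff height and angle below are made-up example values):

```python
import math

def distance_from_height(h, alpha):
    """Distance from the base of the observation point to the boat,
    given the observer's height h above the water and the angle alpha
    (in radians) between the horizontal and the direction of the boat."""
    return h / math.tan(alpha)

# Example: a 50 m cliff and a 2-degree angle below the horizontal.
d = distance_from_height(50.0, math.radians(2.0))  # about 1.4 km
```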
The second method requires two different observation points along the
coast separated by a distance $t$. At each point, we measure the angles,
$\alpha$ and $\beta$, between the direction of the boat and the line
connecting the two observation points. Then we can apply the following
relation:
$$d = t \frac{\sin (\alpha) \sin (\beta)}{\sin (\alpha+\beta)}$$ This
method is called **triangulation**. For it to work well, the baseline
$t$ must be large enough that the two angles differ appreciably from $90$
degrees.
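This relation also fits in one line of code (a sketch; the baseline and angles in the example are made-up values):

```python
import math

def triangulate_distance(t, alpha, beta):
    """Perpendicular distance from the baseline to the boat, given the
    baseline length t between the two observation points and the angles
    alpha and beta (in radians) measured at each point between the
    direction of the boat and the line connecting the two points."""
    return t * math.sin(alpha) * math.sin(beta) / math.sin(alpha + beta)

# Example: a 100 m baseline with both angles at 45 degrees
# puts the boat 50 m from the baseline.
d = triangulate_distance(100.0, math.radians(45.0), math.radians(45.0))
```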
### Depth from Image Disparities
When observing the world with two eyes we get two different views of the
world (@fig-titanic). The relative displacement of features across the
two images (**parallax effect**) is related to the distance between the
observer and the observed scene. Just as in the example of estimating
the distance to boats in the sea, our brain uses disparities across the
two eye views to measure distances to objects.
:::{.column-margin}
**Free fusion** consists of making the eyes converge or diverge in order to fuse a stereogram without needing a stereoscope. Making it work requires practice. Try it in @fig-random_dot_stereogram.
:::
Our brain will use many different cues such as the ones presented in the
previous chapter. But image disparity provides an independent additional
cue. This is illustrated by the **random dot stereograms**
(@fig-random_dot_stereogram) introduced by Bela Julesz @Julez1971.
Random dot stereograms are a beautiful demonstration that humans can
perceive depth using the relative displacement of features across two
views even in the absence of any other depth cues. The random images
contain no recognizable features and are quite different from the
statistics of the natural world. But this seems not to affect our
ability to compute the displacements between the two images and perceive
3D structure.
{width="100%" #fig-random_dot_stereogram}
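A random dot stereogram like the one in @fig-random_dot_stereogram is easy to generate: start from a random binary image, and create the second view by shifting a central region horizontally by the desired disparity (a sketch; the image size, square size, and disparity below are arbitrary choices):

```python
import random

def random_dot_stereogram(size=128, square=48, disparity=4, seed=0):
    """Return a (left, right) pair of binary images as nested lists.
    The right image equals the left one except that a central square is
    shifted `disparity` pixels to the left; the strip uncovered by the
    shift is refilled with fresh random dots. When the pair is fused,
    the square appears to float at a different depth."""
    rng = random.Random(seed)
    left = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
    right = [row[:] for row in left]           # background: zero disparity
    lo, hi = (size - square) // 2, (size + square) // 2
    for y in range(lo, hi):
        for x in range(lo, hi):
            right[y][x - disparity] = left[y][x]   # shift the square left
        for x in range(hi - disparity, hi):
            right[y][x] = rng.randint(0, 1)        # refill the uncovered strip
    return left, right
```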
### Building a Stereo Pinhole Camera
Stereo cameras are generally built from two cameras mounted on a
rigid rig so that the camera parameters do not change over time.
We can also build a **stereo pinhole camera**. As we discussed in @sec-imaging, a pinhole camera is formed by a dark
chamber with a single hole at one end that lets light in. The light
gets projected onto the opposite wall, forming an image. One interesting
aspect of pinhole cameras is that you can build different types of
cameras by playing with the shape of the pinhole. So, let's make a new
kind of pinhole camera! In particular, we can build a camera that
produces anaglyph images, like the one in @fig-titanic, in a single
shot by making two pinholes instead of just one.
We will transform the pinhole camera into an anaglyph pinhole camera by
making two holes and placing different color filters on each hole
(@fig-anaglyph). This will produce an image that will be the
superposition of two images taken from slightly different viewpoints.
Each image will have a different color. Then, to see the 3D image we can
look at the picture by placing the same color filters in front of our
eyes. In order to do this, we used color glasses with blue and red
filters.
{width="100%" #fig-anaglyph}
We used one of the anaglyph glasses to get the filters and placed them
in front of the two pinholes. @fig-anaglyph shows two pictures of the
two pinholes (@fig-anaglyph\[a\]) and the filters positioned in front of
each pinhole (@fig-anaglyph\[b\]). @fig-anaglyph (d) is a resulting
anaglyph image that appears on the projection plane
(@fig-anaglyph\[c\]); this picture was taken by placing a camera in the
observation hole seen at the right of @fig-anaglyph (c). @fig-anaglyph (d)
should give you a 3D perception of the image when viewed with anaglyph
glasses.
Here are experimental details for making the stereo pinhole camera. You
will have to experiment with the distance between the two pinholes. We
tried two distances: 3 in and 1.5 in. We found it easier to see the 3D
for the 1.5 in images. Try to take good quality images. Once you have
placed the color filters on each pinhole, you will need longer exposure
times in order to get the same image quality as with the regular
pinhole camera because the amount of light that enters the camera is
smaller. If the images are too blurry, it will be hard to perceive 3D.
Also, as you increase the distance between the pinholes, you will
increase the range of distances that will provide 3D information because
the parallax effect will increase. But if the images are too far apart,
the visual system will not be able to fuse them and you will not see a
3D image when looking with the anaglyph glasses.
Let's now look into how to use two views to estimate the 3D structure of
the observed scene.
## Model-Based Methods
Let's now make more concrete the intuitions we have built in the
previous sections. We will start describing some model-based methods to
estimate 3D from stereo images. Studying these models will help us
understand the sources of 3D information, the constraints, and
limitations that exist.
### Triangulation
The task of stereo is to triangulate the depth of points in the world by
comparing the images from two spatially offset
cameras.
:::{.column-margin}
If more than two cameras are used, the task
is referred to as multiview stereo.
:::
To triangulate depth, we
need to describe how image positions in the stereo pair relate to depth
in the 3D world.
We will start with a very simple setting where both cameras are
identical (same intrinsic parameters) and one is translated horizontally
with respect to the other so that their optical axes are parallel, as
shown in @fig-stereo. We also will assume calibrated cameras. The
geometry shown in @fig-stereo reveals the essence of depth estimation
from the two cameras. Let's assume that there is one point,
$\mathbf{P}$, in 3D space and we know the locations in which it projects
on each camera, $[x_L,y_L]^\mathsf{T}$ for the left camera and
$[x_R,y_R]^\mathsf{T}$ for the right camera, as shown in @fig-stereo (a).
Because the two cameras are aligned horizontally, the $y$-coordinates
of the projection in both cameras will be equal, $y_L=y_R$. Therefore,
we can look at the top view of the setting, as shown in @fig-stereo (b),
and consider the depth estimate within a single image row.
![Simple stereo. (a) Two cameras, of identical focal length, $f$, and separated by an offset, $T$, image the point $P$ onto the two-dimensional (2D) points $[x_L,y_L]^\mathsf{T}$ and $[x_R,y_R]^\mathsf{T}$ at each camera. (b) The similarity of the two triangles leads to @eq-depthFromDisparity](figures/3d_scene_understanding/triangularization_stereo.png){#fig-stereo}
The point $\mathbf{P}$, for which we want to estimate the depth, $Z$,
appears in the left camera at position $x_L$ and in the right camera at
position $x_R$.
:::{.column-margin}
To distinguish 2D image coordinates from 3D world
coordinates, for this chapter, we will denote 2D image coordinates by
lowercase letters, and 3D world coordinates by uppercase letters.
:::
A simple way to triangulate the depth, $Z$, of a point visible in both
images is through similar triangles. The two triangles (yellow triangle
and yellow+orange triangle) shown in @fig-stereo (b) are similar, and so
the ratios of their height to base must be the same. Thus we have:
$$\frac{T+x_R-x_L}{Z-f} = \frac{T}{Z}$$ Solving for $Z$ gives:
$$Z = \frac{fT}{x_L-x_R}
$${#eq-depthFromDisparity} The difference in horizontal position of
any given world point, $\mathbf{P}$, as seen in the left and right
camera images, $x_L - x_R$, is called the **disparity**, and is
*inversely proportional to the depth* $Z$ of the point $\mathbf{P}$ for
this parallel configuration of the two cameras. The task of
inferring the depth of any point in either image of a stereo pair thus
reduces to measuring the disparity everywhere.
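Once the disparity is known, @eq-depthFromDisparity is a one-line computation (a sketch; the focal length, baseline, and coordinates in the example are illustrative values only):

```python
def depth_from_disparity(f, T, x_left, x_right):
    """Depth Z of a point from @eq-depthFromDisparity, for two parallel
    cameras with focal length f and baseline T. The disparity
    x_left - x_right must be positive for a point at finite depth in
    front of the cameras."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("non-positive disparity: point at or beyond infinity")
    return f * T / disparity

# Example: f = 1, a 10 cm baseline, and a disparity of 0.01
# place the point at a depth of 10 units.
Z = depth_from_disparity(1.0, 0.1, 0.02, 0.01)
```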
However, we are far from having solved the problem of perceiving 3D from
two stereo images. One important assumption we have made is that we know
the two corresponding projections of the point $\mathbf{P}$. But in
reality the two images will be complex and we will not know which points
in one camera correspond to the same 3D point on the other camera. This
requires solving the **stereo matching** problem.
### Stereo Matching
First, let's examine the task visually. @fig-stereomatch shows two
images of an office, taken approximately 1 m apart. The locations of
several objects in @fig-stereomatch (a) are shown in @fig-stereomatch (b),
revealing some of the disparities that we seek to measure automatically.
As the reader can appreciate by comparing the stereo images by eye, one
can compute the disparity by first localizing a feature point in one
image and then finding the corresponding point in the other image, based
on the visual similarity of the local image region. To do this
automatically is the crux of the stereo problem.
:::{layout-ncol="2" #fig-stereomatch}
{width="47.5%"}
{width="47.5%"}
Two cameras, displaced from each other laterally, photograph the same scene, resulting in images (a) and (b). Colored rectangles show some identifiable features in image (a), and where those features appear in image (b). Arrows show the corresponding displacements, which reveal depth through @eq-depthFromDisparity
:::
A naive algorithm for finding the disparity of any pixel in the left
image is to search for a pixel of the same intensity in the
corresponding row of the right image. @fig-stereolamp, from @Hosni2013,
shows why that naive algorithm won't work and why this problem is hard.
@fig-stereolamp (a) shows the intensities of one row of the right-eye
image of a stereo pair. The squared intensity differences between that
row and offset versions of the corresponding row in the rectified
left-eye image are shown in @fig-stereolamp (b), for a
range of disparity offsets, plotted vertically, in a configuration known
as the **disparity space image** @Scharstein2002. The disparities
corresponding to the minimum squared-error matches are plotted in red.
Note that the disparities of the best pixel-intensity matches show much
more variation than would result from the smooth surfaces depicted
in the image; the local intensity matches don't reliably indicate the
disparity. Many effects lead to that unreliability, including: lack of
texture in the image, which leads to noisy matches among the many nearly
identical pixels; image intensity noise; specularities in the image
which change location from one camera view to another; and scene points
appearing in one image but not another due to occlusion.
@fig-stereolamp (d) shows an effective approach to remove the matching
noise: blurring the disparity space image using an edge-preserving
filter @Hosni2013. The best-matched disparities are plotted in red, and
compare well with the ground truth disparities (@fig-stereolamp\[f\]),
as labeled in the dataset @Scharstein2002.
{width="100%" #fig-stereolamp}
The task of finding disparity at each point is often broken into two
parts: (1) finding features, and matching the features across images,
and (2) interpolating between, or adjusting, the feature matches to
obtain a reliable disparity estimate everywhere. These two steps are
often called matching and filtering, because the interpolation can be
performed using filtering methods.
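The whole pipeline can be sketched in a few lines of numpy. The sketch below builds the disparity space image from squared differences, aggregates the cost over a plain box window (standing in for the edge-preserving filtering of @Hosni2013), and picks the minimum-cost disparity at each pixel; the window size and function names are our own choices:

```python
import numpy as np

def box_filter(img, r):
    """Mean over a (2r+1) x (2r+1) window, computed with padded
    cumulative sums (an integral image)."""
    n = 2 * r + 1
    p = np.pad(img, ((r + 1, r), (r + 1, r)), mode="edge")
    s = p.cumsum(axis=0).cumsum(axis=1)
    return (s[n:, n:] - s[:-n, n:] - s[n:, :-n] + s[:-n, :-n]) / n**2

def block_matching_disparity(left, right, max_disp, r=2):
    """Winner-take-all disparity map for a rectified stereo pair.
    For each candidate disparity d, compute per-pixel squared
    differences between the left image and the right image shifted
    by d, aggregate them over a window (the disparity space image),
    and pick the minimum-cost disparity at each pixel."""
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)  # disparity space image
    for d in range(max_disp + 1):
        sq = (left[:, d:] - right[:, : w - d]) ** 2
        cost[d, :, d:] = box_filter(sq, r)
    return cost.argmin(axis=0)
```

On a noiseless pair with constant disparity this recovers the shift exactly; on real images, texture-poor regions and occlusions produce the errors discussed above.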
We will describe next a classical method to find key points and extract
image descriptors to find image correspondences. In chapter
@sec-3D_multiview we will describe learning-based
approaches.
#### Finding image features {#sec-finding_image_features}
Good image features are positions in the image that are easy to
localize, given only the image intensities. The **Harris corner
detector** @Harris88 identifies image points that are easy to localize
by evaluating how the sum of squared intensity differences over an image
patch changes under a small translation of the image. If the squared
intensity changes quickly with patch translation in every direction,
then the region contains a good image feature.
To derive the equation of the Harris corner detector we will compute
image derivatives. Therefore, we will write an image as a continuous
function of $x$ and $y$. To compute the final quantities in practice, we
will approximate the derivatives by their discrete versions, as
discussed in chapter
@sec-image_derivatives. Let the input image be
$\ell(x,y)$, and let the small translation be $\Delta x$ in $x$ and
$\Delta y$ in $y$. Then, the squared intensity difference,
$E(\Delta x, \Delta y)$, induced by that small translation, summed over
a small image patch, $P$, is
$$E(\Delta x, \Delta y) = \sum_{(x, y) \in P} \left(
\ell(x,y) - \ell(x + \Delta x, y + \Delta y)
\right)^2
$${#eq-harris}
We expand $\ell(x + \Delta x, y + \Delta y)$ about $\ell(x, y)$ in a
Taylor series: $$\ell(x + \Delta x, y + \Delta y) \approx \ell(x, y) + \frac{\partial \ell}{\partial x} \Delta x
+ \frac{\partial \ell}{\partial y} \Delta y$$ Substituting the above
into equation (@eq-harris) and writing the result in matrix form yields
$$E(\Delta x, \Delta y) \approx
\left[
\Delta x ~~ \Delta y
\right]
\mathbf{M}
\left[ \begin{array}{c}
\Delta x \\
\Delta y
\end{array}
\right]$$ where $$\mathbf{M} = \sum_{(x, y) \in P}
\begin{bmatrix}
(\frac{\partial \ell}{\partial x})^2 &
\frac{\partial \ell}{\partial x}
\frac{\partial \ell}{\partial y} \\
\frac{\partial \ell}{\partial x}
\frac{\partial \ell}{\partial y}
& (\frac{\partial \ell}{\partial y})^2
\end{bmatrix}$$
The smallest eigenvalue, $\lambda_{\mbox{min}}$, of the matrix,
$\mathbf{M}$, indicates the smallest possible change of image
intensities under translation of the patch in any direction. Thus, to
detect good corners, one can identify all points for which
$\lambda_{\mbox{min}} > \lambda_c$, for some threshold $\lambda_c$. The
Harris corner detector has been widely used for identifying features to
match across images, although it was later supplanted by the
scale-invariant feature transform (SIFT) @Lowe04, based on maxima over
scale of Laplacian filter outputs.
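The detector follows directly from the derivation above. A minimal numpy sketch (the window radius, the edge padding, and the closed-form $2 \times 2$ eigenvalue formula are implementation choices, not part of the derivation):

```python
import numpy as np

def harris_min_eig(image, r=2):
    """Smallest eigenvalue of the Harris matrix M at every pixel.
    Derivatives are discrete central differences, and the sum over the
    patch P is a (2r+1) x (2r+1) box sum; good corners are pixels where
    the returned value exceeds a threshold lambda_c."""
    Ix = np.gradient(image.astype(float), axis=1)
    Iy = np.gradient(image.astype(float), axis=0)

    def box_sum(a):
        p = np.pad(a, r, mode="edge")
        n = 2 * r + 1
        return sum(p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
                   for dy in range(n) for dx in range(n))

    Sxx, Syy, Sxy = box_sum(Ix * Ix), box_sum(Iy * Iy), box_sum(Ix * Iy)
    # closed-form eigenvalues of the symmetric 2x2 matrix [[Sxx, Sxy], [Sxy, Syy]]
    half_tr = (Sxx + Syy) / 2
    det = Sxx * Syy - Sxy**2
    return half_tr - np.sqrt(np.maximum(half_tr**2 - det, 0.0))
```

Note how the behavior matches the derivation: in a flat region both eigenvalues are zero, along a straight edge only one is large, and only at a corner is $\lambda_{\mbox{min}}$ itself large.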
@fig-stereopoints (a and b) show the result of applying the Harris
feature detector to the images of @fig-stereomatch.
:::{#fig-stereopoints layout-ncol="3"}
{width="31%"}
{width="31%"}
{width="31%"}
The feature-based stereo approach, illustrated for the stereo pair of @fig-stereomatch. (a and b) Feature points. (c) Depth image.
:::
@fig-stereopoints (a and b) reveal the challenges of the
feature matching task: (1) not every feature is marked across both
images, and (2) we need to decide which pairs of image features go
together. Feature points are found, independently, in each of the stereo
pair images (@fig-stereopoints [a and b]). Detected Harris feature
points @Harris88 are shown as green dots. Within the marked ellipses it
is especially visible that the same set of feature points is not
detected in each image. @fig-stereopoints (c) shows a depth image resulting
from a matched and interpolated set of feature points.
#### Local image descriptors
Matching corresponding image features requires a local description of the
image region to allow matching like regions across two images. SIFT
@Lowe04 uses histograms of oriented filter responses to create an image
descriptor that is sensitive to local image structure, but that is
insensitive to minor changes in lighting or location which may occur
across the two images of a stereo pair.
@fig-SIFT shows a sketch of the use of oriented edges and SIFT
descriptors in image analysis. The top row of @fig-SIFT (a) shows two
images of a hand under different lighting conditions. Although the two
images look quite different in a pixel-intensity representation, they
look quite similar in a local-orientation representation, depicted in
the bottom row by line-segment directions. SIFT image descriptors
@Lowe04 exploit that observation. Taking histograms (@fig-SIFT\[b\]) of
local orientation over local spatial regions further allows image
descriptors to be robust to small image translations @Freeman98c, @Lowe04.
:::{#fig-SIFT layout-ncol="2"}
{width="40%" }
{width="50%"}
Orientation-based image descriptors. (a) Oriented features. Reprinted from @Freeman98c (b) SIFT image descriptors @Lowe04.
:::
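The core ingredient of such descriptors can be sketched as a magnitude-weighted histogram of gradient orientations (a toy, single-cell version; the full SIFT descriptor uses a $4 \times 4$ grid of such histograms plus interpolation and other refinements not shown here):

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """Histogram of gradient orientations over a patch, weighted by
    gradient magnitude and L2-normalized. Because it pools orientation
    over the patch, the result is insensitive to smooth lighting
    changes and small translations."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

Two patches of a stereo pair can then be compared by the distance between their histograms rather than by raw pixel intensities.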
#### Interpolation between feature matches
Methods to interpolate depth between matched feature points
include probabilistic methods such as belief propagation, discussed in
@sec-probabilistic_graphical_models. Under assumptions
about the local spatial relationships between depth values, optimal
shapes can be inferred provided the spatial structure over which
inference is performed is a chain or a tree. Note that @Hirschmuller2007
uses a related inference algorithm, dynamic programming, which has the
same constraints on spatial structure for exact inference.
Hirschmuller's algorithm, called semiglobal matching (SGM), applies the
one-dimensional processing over many different orientations to
approximate a global solution to the stereo shape inference problem.
### Constraints for Arbitrary Cameras
Until now we have considered only the case where the two cameras are
parallel, but in general the two images will be produced by cameras that
can have arbitrary orientations. Note that the disparity of the smokestacks
in @fig-titanic (a) increases with depth, rather than decreasing as
described by equation (@eq-depthFromDisparity). That is because the
cameras that took the stereo pair did not have the same orientation;
they were not related only by a translation.
Let's assume we have two calibrated cameras: we know the intrinsic and
extrinsic parameters for both. They are related by a generic
translation, $\mathbf{T}$, and a rotation, $\mathbf{R}$, such that their
optical axes are not parallel. Let's now consider that we see one point
in the image produced by camera 1 and we want to identify where the
corresponding point is in the image from camera 2. The following sketch
(@fig-epipolar_1) shows the setting of the problem we are trying to
solve.
{width="70%" #fig-epipolar_1}
Are there any geometric constraints that can help us to identify
candidate matches? We know that the observed point in camera 1,
$\mathbf{p}_1$, corresponds to a 3D point that lies on the ray that
connects the camera 1 origin and $\mathbf{p}_1$ (shown in red in
@fig-epipolar_3):
{width="70%" #fig-epipolar_3}
Therefore, $\mathbf{p}_2$ is constrained to lie on the line that
results from projecting the ray into camera 2. Remember that a line in 3D
projects into a line in 2D. There are two points that are of particular
interest in the line projected into camera 2. The first point is the
projection of the camera center of camera 1 into the camera plane 2 (the
point could lie outside the camera view, but in this example it happens
to be visible). The second point is the projection of $\mathbf{p}_1$
itself into camera 2. As we know all the intrinsic and extrinsic camera
parameters, those two points should be easy to calculate. And these two
points will define the equation of the line (there are many other ways
to think about this problem).
The following figure shows the geometry of a stereo pair. Given a 3D
point, $\mathbf{P}$, in the world, we define the following objects:
- The **epipolar plane** is the plane that contains the two camera
origins and a 3D point, $\mathbf{P}$.
- The **epipolar lines** are the intersection between the epipolar
plane and the camera planes. When the point $\mathbf{P}$ moves in
space, the epipolar plane and epipolar lines change.
- The **epipoles** are the intersection points between the line that
connects both camera origins and the camera planes. The epipoles'
location does not depend on the point $\mathbf{P}$. Therefore, all
the epipolar lines intersect at the same epipoles.
{width="80%" #fig-epipolar_geometry_terminology}
In the example shown in @fig-epipolar_geometry_terminology, as both
cameras are only rotated along their vertical axis and there is no
vertical displacement between them, the epipoles are contained in the
$x$-axes of both camera coordinate systems. The epipolar line in camera
2 corresponds to the projection of the line $O_1 P$ into the camera
plane. Therefore, given the image point $\mathbf{p}_1$, we know that the
match should be somewhere along the epipolar line.
The epipolar line in camera 2 that corresponds to $\mathbf{p}_1$ is
uniquely defined by the camera geometries and the location of image
point $\mathbf{p}_1$. So, when looking for matches of a point in one
camera, we only need to look along the corresponding epipolar line in
the other camera. The same applies for finding matches in camera 1 given
a point in camera 2.
In the next section we will see how to find the epipolar lines
mathematically.
### The Essential and Fundamental Matrices
We will follow here the derivation proposed by Longuet-Higgins
in 1981 @Longuet-Higgens1981. Let's consider a 3D point, $\mathbf{P}$,
viewed by both cameras. The coordinates of this 3D point $\mathbf{P}$
are $\mathbf{P}_1 = [X_1,Y_1,Z_1]^\mathsf{T}$ when expressed in the
coordinates system of camera 1, and
$\mathbf{P}_2= [X_2,Y_2,Z_2]^\mathsf{T}$ when expressed in the
coordinates system of camera 2. Both systems are related by (using
heterogeneous coordinates):
$$\mathbf{P}_2 = \mathbf{R} \mathbf{P}_1 + \mathbf{T}
$${#eq-relation1}
The 3D point $\mathbf{P}_1$ projects into the camera 1 at the
coordinates $\mathbf{p}_1 = f \mathbf{P}_1 /Z_1$ where $\mathbf{p}_1$ is
expressed also in 3D coordinates
$\mathbf{p}_1 = [f X_1/Z_1, f Y_1/Z_1, f]^\mathsf{T}$. Without loss of
generality, we will assume $f=1$.
Taking the cross product of both sides of equation (@eq-relation1) with
$\mathbf{T}$ we get:
$$\mathbf{T} \times \mathbf{P}_2 = \mathbf{T} \times \mathbf{R} \mathbf{P}_1$${#eq-relation2} since $\mathbf{T} \times \mathbf{T} = 0$. Now, taking
the dot product of both sides of equation (@eq-relation2) with
$\mathbf{P}_2$:
$$\mathbf{P}_2^\mathsf{T}\mathbf{T} \times \mathbf{P}_2 = \mathbf{P}_2^\mathsf{T}\mathbf{T} \times \mathbf{R} \mathbf{P}_1$$
The left side is zero because $\mathbf{T} \times \mathbf{P}_2$ results
in a vector that is orthogonal to both $\mathbf{T}$ and $\mathbf{P}_2$,
and the dot product with $\mathbf{P}_2$ is zero (dot product of two
orthogonal vectors is zero). Therefore, $\mathbf{P}_1$ and
$\mathbf{P}_2$ satisfy the following relationship:
$$\mathbf{P}_2^\mathsf{T}\left( \mathbf{T} \times \mathbf{R} \right) \mathbf{P}_1 = 0$$
:::{.column-margin}
The cross product of two vectors $\mathbf{a} \times \mathbf{b}$ can be written in matrix form as:
$$
\mathbf{a} \times \mathbf{b} =
\mathbf{a}_{\times} \mathbf{b}
$$
The special matrix $\mathbf{a}_{\times}$ is defined as:
$$
\mathbf{a}_{\times} =
\begin{bmatrix}
0 & -a_3 & a_2 \\
a_3 & 0 & -a_1 \\
-a_2 & a_1 & 0 \\
\end{bmatrix}
$$
This matrix $\mathbf{a}_{\times}$ is rank 2. In addition, it is a skew-symmetric matrix (i.e., $\mathbf{a}_{\times} = -\mathbf{a}_{\times}^\mathsf{T}$). It has two identical singular values and the third one is zero.
:::
The matrix $\mathbf{T} \times \mathbf{R}$ is called the **essential
matrix**, $\mathbf{E}$:
$$\mathbf{E} = \mathbf{T} \times \mathbf{R} = \mathbf{T}_{\times} \mathbf{R}$$
This results in the relationship:
$$\mathbf{P}_2^\mathsf{T}\mathbf{E} \mathbf{P}_1 = 0
$${#eq-relation3}
If we divide equation (@eq-relation3) by $Z_1 Z_2$ we will obtain the
same relationship but relating the image coordinates:
$$\mathbf{p}_2^\mathsf{T}\mathbf{E} \mathbf{p}_1 = 0$${#eq-relation4} This equation relates the coordinates of the two
projections of a 3D point into two cameras. If we fix the coordinates
from one of the cameras, the equation reduces to the equation of a line
for the coordinates of the other camera, which is consistent with the
geometric reasoning that we made before.
The essential matrix is a $3 \times 3$ matrix with some special
properties:
- $\mathbf{E}$ has five degrees of freedom instead of nine. To see
this, let's count degrees of freedom: the rotation and translation
have three degrees of freedom each, which gives six; then, one degree
is lost in equation (@eq-relation4) as a global scaling of the
matrix does not change the result. This results in five degrees of
freedom.
- The essential matrix has rank 2. This is because $\mathbf{E}$ is the
product of a rank 2 matrix, $\mathbf{T}_{\times}$, and a rank 3
matrix, $\mathbf{R}$.
- The essential matrix has two identical singular values and the third
one is zero. This is because $\mathbf{E}$ is the product of
$\mathbf{T}_{\times}$, which has two identical singular values, and
the rotation matrix $\mathbf{R}$ that does not change the singular
values.
As a consequence, not all $3 \times 3$ matrices correspond to an
essential matrix which is important when trying to estimate
$\mathbf{E}$.
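These properties, and the epipolar constraint itself, are easy to check numerically. A small sketch (the rotation angle, translation, and test point used in the check are arbitrary example values):

```python
import numpy as np

def cross_matrix(t):
    """Skew-symmetric matrix t_x such that cross_matrix(t) @ b = t x b."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, T):
    """E = T_x R for a rotation R and translation T taking camera-1
    coordinates to camera-2 coordinates (P2 = R P1 + T)."""
    return cross_matrix(T) @ R
```

Computing the singular values of the result confirms the rank-2 structure: two equal singular values (equal to $\|\mathbf{T}\|$) and a third equal to zero.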
#### The fundamental matrix
We have been using camera coordinates, $\mathbf{p}$. If we want to use
image coordinates, in homogeneous coordinates,
$\mathbf{c}=[n,m,1]^\mathsf{T}$, we need to also incorporate the
intrinsic camera parameters (which will also incorporate pixel size and
change of origin). Incorporating the transformation from camera
coordinates, $\mathbf{p}$, to image coordinates, $\mathbf{c}$, the
equation relating image coordinates in both cameras has the same
algebraic form as in equation (@eq-relation4):
$$\mathbf{c}_2^\mathsf{T}\mathbf{F} \mathbf{c}_1 = 0$$ where
$\mathbf{F}$ is the **fundamental matrix**.
The fundamental and essential matrices are related via the intrinsic
camera matrix (assuming both cameras share the same $\mathbf{K}$),
namely:
$$\mathbf{F} = \mathbf{K}^{-\mathsf{T}} \mathbf{E} \mathbf{K}^{-1}$$
The fundamental matrix has rank 2 (like the essential matrix) and 7
degrees of freedom (out of the 9 degrees of freedom, one is lost due to
the scale ambiguity and another is lost because the determinant has to
be zero).
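As a sanity check, both constraints can be verified numerically. The following sketch (NumPy; the pose, intrinsics, and 3D point are made-up values) projects a single 3D point into two cameras with shared intrinsics $\mathbf{K}$ and confirms that $\mathbf{p}_2^\mathsf{T}\mathbf{E}\mathbf{p}_1$ and $\mathbf{c}_2^\mathsf{T}\mathbf{F}\mathbf{c}_1$ vanish:

```python
import numpy as np

def skew(t):
    return np.array([[0., -t[2], t[1]],
                     [t[2], 0., -t[0]],
                     [-t[1], t[0], 0.]])

# Hypothetical pose (rotation about y) and shared intrinsics K.
theta = 0.1
R = np.array([[np.cos(theta), 0., np.sin(theta)],
              [0., 1., 0.],
              [-np.sin(theta), 0., np.cos(theta)]])
T = np.array([0.5, 0.0, 0.1])
K = np.array([[800., 0., 320.],
              [0., 800., 240.],
              [0., 0., 1.]])

E = skew(T) @ R
F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)  # F = K^-T E K^-1

# A 3D point in camera 1's frame; camera 2 sees P2 = R P1 + T.
P1 = np.array([0.3, -0.2, 4.0])
P2 = R @ P1 + T
p1, p2 = P1 / P1[2], P2 / P2[2]  # homogeneous camera coordinates
c1, c2 = K @ p1, K @ p2          # homogeneous image coordinates

print(p2 @ E @ p1, c2 @ F @ c1)  # both vanish up to floating-point error
```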
#### Estimation of the essential/fundamental matrix
Given a set of matches between the two views we can estimate either the
essential or the fundamental matrix by writing a linear set of
equations. Any pair of matching points satisfies
$\mathbf{p}_2^\mathsf{T}\mathbf{E} \mathbf{p}_1 = 0$, which gives one
linear equation in the coefficients of $\mathbf{E}$; stacking the
equations from many matches yields a linear system. However, the matrix
$\mathbf{E}$ has a very special structure that allows for more efficient
and robust estimation methods. The same applies to the fundamental
matrix.
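The linear formulation can be sketched as follows. This is a minimal, unnormalized variant of the classic eight-point algorithm (the function name and the synthetic data are our own; practical implementations first normalize the coordinates @Hartley2004 and then enforce the rank-2 constraint):

```python
import numpy as np

def eight_point(p1, p2):
    # p1, p2: (N, 3) arrays of matched points in homogeneous camera
    # coordinates, N >= 8. Each match gives one equation
    # p2^T E p1 = 0, linear in the 9 entries of E. Solve A vec(E) = 0
    # by taking the right singular vector of the smallest singular value.
    A = np.stack([np.kron(q2, q1) for q1, q2 in zip(p1, p2)])
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)  # defined only up to scale

# Synthetic check with a made-up pose and 12 random points.
theta = 0.3
R = np.array([[np.cos(theta), 0., np.sin(theta)],
              [0., 1., 0.],
              [-np.sin(theta), 0., np.cos(theta)]])
T = np.array([0.4, 0.1, 0.05])
E_true = np.array([[0., -T[2], T[1]],
                   [T[2], 0., -T[0]],
                   [-T[1], T[0], 0.]]) @ R

rng = np.random.default_rng(0)
P1 = rng.uniform([-1., -1., 2.], [1., 1., 5.], size=(12, 3))
P2 = P1 @ R.T + T
E_est = eight_point(P1 / P1[:, 2:3], P2 / P2[:, 2:3])
```

The estimate `E_est` should match `E_true` up to a global scale and sign.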
As both matrices have rank 2, the minimization of the linear system can
be done via constrained minimization, or by optimizing directly over
rank-2 matrices. Alternatively, one can project an unconstrained
estimate onto the valid set: for $\mathbf{F}$, set the third singular
value to zero; for $\mathbf{E}$, set the third singular value to zero
and make the first and second singular values equal.
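This projection step can be sketched as follows (NumPy; the function names are our own):

```python
import numpy as np

def enforce_rank2(F):
    # Project onto rank-2 matrices: zero out the smallest singular value.
    U, s, Vt = np.linalg.svd(F)
    s[2] = 0.0
    return U @ np.diag(s) @ Vt

def enforce_essential(E):
    # Closest matrix with singular values (s, s, 0): zero the third
    # singular value and replace the top two by their mean.
    U, s, Vt = np.linalg.svd(E)
    s_mean = (s[0] + s[1]) / 2.0
    return U @ np.diag([s_mean, s_mean, 0.0]) @ Vt

# Stand-in for a noisy, unconstrained estimate.
rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
F_hat = enforce_rank2(M)
E_hat = enforce_essential(M)
```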
Once $\mathbf{E}$ is estimated, it can be decomposed to recover the
translation (via its skew-symmetric factor) and the rotation
(orthonormal matrix) between the two cameras using the singular value
decomposition.
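A sketch of the standard SVD-based decomposition (following the construction in @Hartley2004; variable names are ours). It returns four candidate poses; in practice, the correct one is selected by checking that triangulated points lie in front of both cameras:

```python
import numpy as np

def decompose_essential(E):
    # Factor E ~ U diag(1, 1, 0) V^T and form the four (R, t) candidates.
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:   # force proper rotations (det = +1)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.],
                  [1., 0., 0.],
                  [0., 0., 1.]])
    t = U[:, 2]  # translation direction, up to sign and scale
    return [(U @ W @ Vt, t), (U @ W @ Vt, -t),
            (U @ W.T @ Vt, t), (U @ W.T @ Vt, -t)]

# Check on a made-up pose: one candidate recovers (R, T direction).
theta = 0.4
R_true = np.array([[np.cos(theta), 0., np.sin(theta)],
                   [0., 1., 0.],
                   [-np.sin(theta), 0., np.cos(theta)]])
T = np.array([1.0, 0.2, -0.3])
T_x = np.array([[0., -T[2], T[1]],
                [T[2], 0., -T[0]],
                [-T[1], T[0], 0.]])
candidates = decompose_essential(T_x @ R_true)
```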
#### Finding the epipoles
As the epipoles, $\mathbf{e}_1$ and $\mathbf{e}_2$, belong to the
epipolar lines, they satisfy
$$\mathbf{p}_2^\mathsf{T}\mathbf{E} \mathbf{e}_1 = 0$$
$$\mathbf{e}_2^\mathsf{T}\mathbf{E} \mathbf{p}_1 = 0$$
and, as the epipoles are the intersection of all the epipolar lines,
both relationships have to hold for any points
$\mathbf{p}_1$ and $\mathbf{p}_2$. Therefore,
$$\mathbf{E} \mathbf{e}_1 = 0$$
$$\mathbf{e}_2^\mathsf{T}\mathbf{E} = 0$$
As the estimated matrix $\mathbf{E}$ is likely to be noisy, we can
compute the epipoles as the singular vectors associated with the
smallest singular value of $\mathbf{E}$: the right singular vector gives
$\mathbf{e}_1$ and the left singular vector gives $\mathbf{e}_2$.
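In code, both epipoles fall out of a single SVD. The sketch below (NumPy; the pose is a made-up example) also checks the geometric interpretation: $\mathbf{e}_2$ is proportional to $\mathbf{T}$, the direction in which camera 2 sees camera 1's center:

```python
import numpy as np

def skew(t):
    return np.array([[0., -t[2], t[1]],
                     [t[2], 0., -t[0]],
                     [-t[1], t[0], 0.]])

# Hypothetical relative pose.
theta = 0.2
R = np.array([[np.cos(theta), 0., np.sin(theta)],
              [0., 1., 0.],
              [-np.sin(theta), 0., np.cos(theta)]])
T = np.array([1.0, 0.1, 0.3])
E = skew(T) @ R

U, s, Vt = np.linalg.svd(E)
e1 = Vt[-1]    # right singular vector of smallest singular value: E e1 = 0
e2 = U[:, -1]  # left singular vector: e2^T E = 0

# e2 is parallel to T (up to sign), the projection of camera 1's center.
print(np.cross(e2, T / np.linalg.norm(T)))  # ~[0, 0, 0]
```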
#### Epipolar lines: The game
In order to build an intuition of how epipolar lines work we propose to
play a game. @fig-epipolarlinesgame shows a set of camera pairs and a
set of epipolar lines. Can you guess what camera pair goes with what set
of epipolar lines?
:::{.column-margin}
Answer for the game in @fig-epipolarlinesgame: A-2, B-5, C-4,
D-6, E-3, F-1.
:::
{width="100%" #fig-epipolarlinesgame}
To solve the game, visualize mentally the cameras and think about how
rays from the origin of one camera will be seen from the other camera.
The epipole corresponds to the location where one camera sees the origin
of the other camera.
Observe that most of the sets of epipolar lines shown in the figure are
symmetric across the two cameras, but this is not always the case. What
conditions are needed for the epipolar lines to be symmetric across both
cameras? What camera arrangements break the symmetry? If you were to
place two cameras at random locations, would you expect to see symmetric
epipolar lines?
It is also useful to work with the equations defining the epipolar lines
and derive them explicitly for some of the special cases shown in
@fig-epipolarlinesgame.
:::{.column-margin}
When building systems, it is crucial to visualize intermediate results in order to find bugs in the code. This requires knowing what the results should look like. The game from @fig-epipolarlinesgame helps build that understanding.
:::
### Image Rectification
As described previously, one only needs to search along epipolar lines
to find matching points in the second image. If the sensor planes of
each camera are coplanar, and if the pixel rows are collinear across the
two cameras, then the epipolar lines will be along image scanlines,
simplifying the stereo search.
Using the homography transformations (@sec-homography), we can warp observed stereo camera
images taken under general conditions to synthesize the images that
would have been recorded had the cameras been aligned to yield epipolar
lines along scan rows.
Many configurations of the two cameras satisfy the desired image
rectification constraints @Zhang2003: (1) that epipolar lines are along
image scanlines, and (2) that matching points are within the same rows
in each camera image. One procedure that will satisfy those constraints
is as follows @wikiRectification2021. First, rotate the two cameras to
be parallel to each other, with optical axes perpendicular to the line
joining the two camera centers. Second, rotate each of the cameras about
its optical axis to align the rows of each camera with those of the
other. Third, uniformly scale one camera's image to be the same size as
the other, if necessary. The resulting image pair will thus satisfy
the two requirements above for image rectification. Image rectification
uses image warping with bilinear (or other) interpolation as described
in @sec-bilinearinterpolation.
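The first step, computing the shared orientation that places both image planes parallel to the baseline, can be sketched as follows (a Fusiello-style construction; the function and variable names are our own, and we assume the camera centers and a reference optical axis are known):

```python
import numpy as np

def rectifying_rotation(c1, c2, z_ref):
    # Build the new, shared camera orientation: new x-axis along the
    # baseline, new y perpendicular to the reference optical axis z_ref,
    # new z completing the right-handed frame.
    x = (c2 - c1) / np.linalg.norm(c2 - c1)
    y = np.cross(z_ref, x)
    y /= np.linalg.norm(y)
    z = np.cross(x, y)
    return np.stack([x, y, z])  # rows are the new axes: world -> rectified

# Hypothetical camera centers and reference optical axis.
c1 = np.array([0., 0., 0.])
c2 = np.array([0.5, 0.05, 0.02])
R_rect = rectifying_rotation(c1, c2, np.array([0., 0., 1.]))

# The baseline maps onto the new x-axis, so epipolar lines become
# horizontal scanlines after both cameras adopt this orientation.
print(R_rect @ (c2 - c1))  # ~[|c2 - c1|, 0, 0]
```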
Other algorithms are optimized to minimize image distortions @Zhang2003
and may give better performance for stereo algorithms. We assume camera
calibrations are known, but other rectification algorithms can use
weaker conditions, such as knowing the fundamental matrix
@Hartley2004, @Pollefeys99. Many software packages provide image
rectification code @Zhang99.
## Learning-Based Methods
As is the case for most other vision tasks, neural network approaches
now dominate stereo methods. The two stages of a stereo algorithm,
matching and filtering, can be implemented separately by neural networks
@Zbontar2015, or together in a single, end-to-end optimization
@Zhang2019GANet, @Chang2018, @Mayer2016, @Kendall2017.
### Output Representation
One important question is how we want to represent the output. We are
ultimately interested in estimating the 3D structure of the scene. We
can estimate the disparity or the depth. Disparity might be easier to
estimate as it is independent of the camera parameters. It is also
easier to model noise, as we can assume that noise is independent of
disparity. However, when using depth, estimation error will be larger
for distant points.
Another interesting property of disparity is that for flat surfaces, the
second-order derivative along any spatial direction, $\mathbf{x}$, is
zero: $$\frac{\partial^2 d}{\partial \mathbf{x}^2} = 0.$$ This equality
does not hold for depth. This property allows introducing a simple
regularization term that favors planarity.
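To see why, note that a plane $n_1 X + n_2 Y + n_3 Z = c$ implies $1/Z = (n_1 x + n_2 y + n_3)/c$ in image coordinates $x = X/Z$, $y = Y/Z$, so disparity (proportional to $1/Z$) is affine in the image coordinates. A quick numeric check (NumPy; the plane parameters are arbitrary, and we take the focal-length-baseline product to be 1):

```python
import numpy as np

# Hypothetical plane n . X = c in front of the camera.
n_vec = np.array([0.2, -0.1, 1.0])
c = 3.0

# Image-coordinate grid.
x, y = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))

inv_z = (n_vec[0] * x + n_vec[1] * y + n_vec[2]) / c  # 1/Z on the plane
d = inv_z          # disparity (unit focal length and baseline)
z = 1.0 / inv_z    # depth

# Second differences along a row: ~0 for disparity, nonzero for depth.
print(np.abs(np.diff(d, 2, axis=1)).max())
print(np.abs(np.diff(z, 2, axis=1)).max())
```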
To translate the returned disparity for a given pixel,
$d\left[n,m \right]$, into depth values, $z\left[n,m \right]$, we need
to invert the disparity, that is,
$$\begin{aligned}
z \left[n,m \right] = \frac{1}{ d \left[n,m \right] }
\end{aligned}$$
where we assume the disparity is expressed in units such that the
product of the focal length and the baseline equals 1 (in general,
$z = fb/d$ for focal length $f$ and baseline $b$). Now we need to
recover the world coordinates for every
pixel. Assuming a pinhole camera and using the depth,
$z \left[n,m \right]$, and the intrinsic camera parameters,
$\mathbf{K}$, we obtain the 3D coordinates for a pixel with image
coordinates $[n,m]$ in the camera reference frame as
$$\begin{aligned}
\begin{bmatrix}
X \\
Y \\
Z
\end{bmatrix}
= z\left[n,m \right] \times \mathbf{K}^{-1} \times
\begin{bmatrix}
n \\
m \\
1
\end{bmatrix}
\end{aligned}$$ Note that $\mathbf{K}$ are the intrinsic camera
parameters for the reference image.
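For example, with hypothetical intrinsics and a hypothetical depth value at one pixel:

```python
import numpy as np

# Hypothetical intrinsics: 800-pixel focal length, principal point (320, 240).
K = np.array([[800., 0., 320.],
              [0., 800., 240.],
              [0., 0., 1.]])

n, m = 400, 300   # pixel coordinates
z = 2.5           # depth at that pixel

# Back-project the pixel through K^-1 and scale by the depth.
X, Y, Z = z * np.linalg.inv(K) @ np.array([n, m, 1.0])
print(X, Y, Z)    # Z equals the depth; X, Y follow from the pixel offsets
```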
### Two-Stage Networks
Given two images, $\mathbf{v}_1$ and $\mathbf{v}_2$, the disparity at
each location can be computed by a function, that is:
$$\hat{\mathbf{d}} = f_{\theta}(\mathbf{v}_1, \mathbf{v}_2)$$
The objective is that the estimated disparities, $\hat{\mathbf{d}}$,
should be close to the ground truth disparities, $\mathbf{d}$. Examples
of training sets are the KITTI dataset @Geiger2013 and the scene flow
dataset @Mayer2016, which contain ground truth disparities from
calibrated stereo cameras. One typical loss is:
$$J(\theta) = \sum_t \left\| \mathbf{d}_t - f_{\theta} (\mathbf{v}_{1,t}, \mathbf{v}_{2,t}) \right\|^2$$
This is optimized by gradient descent as usual.
When matching and filtering are implemented separately, one common
formulation is to use one neural network to extract features from both
images, $g(\mathbf{v}_1)$ and $g(\mathbf{v}_2)$, and another neural
network to compute the matching cost and estimate the disparities at
every pixel:
$$f(\mathbf{v}_1, \mathbf{v}_2) = c \left( g(\mathbf{v}_1), g(\mathbf{v}_2) \right)$$
A common approach, depicted in block diagram form in @fig-stereoblock,
is the following:
- Rectify the stereo pair so that scanlines in each image depict the
same points in the world.
- Extract features from each image, using a pair of networks with
shared weights.
- Form a 3D cost volume indicating the local visual evidence of a
match between the two images for each possible pixel position and
each possible stereo disparity. (This 3D cost volume can be
referenced to the $H$ and $V$ positions of one of the input images.)
- Train and apply a 3D convolutional neural net (CNN) to aggregate
(process) the costs over the 3D cost volume in order to estimate a
single, best disparity estimate for each pixel position. This
performs the function of the regularization step of
nonneural-network approaches, such as SGM @Hirschmuller2007, but the
performance is better for the neural network approaches.
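The cost-volume step can be sketched in NumPy (the function name and the squared-difference cost are our own choices; learned methods typically compare feature maps by correlation or concatenation instead):

```python
import numpy as np

def cost_volume(f1, f2, max_disp):
    # f1, f2: (C, H, W) feature maps from the rectified left/right images.
    # Returns (max_disp, H, W): vol[d, y, x] compares the left feature at
    # (y, x) with the right feature at (y, x - d) using a squared
    # difference; out-of-range shifts get infinite cost.
    C, H, W = f1.shape
    vol = np.full((max_disp, H, W), np.inf)
    for d in range(max_disp):
        diff = f1[:, :, d:] - f2[:, :, :W - d]
        vol[d, :, d:] = np.sum(diff ** 2, axis=0)
    return vol

# Quick check: if the right features are the left features shifted by
# 3 pixels, the per-pixel argmin over the volume recovers disparity 3.
rng = np.random.default_rng(0)
f1 = rng.standard_normal((4, 6, 32))
f2 = np.zeros_like(f1)
f2[:, :, :-3] = f1[:, :, 3:]
vol = cost_volume(f1, f2, max_disp=8)
est = vol.argmin(axis=0)
```

Taking `vol.argmin(axis=0)` is a winner-take-all readout; the 3D CNN described above replaces that hard argmin with learned aggregation over the volume.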
{width="100%" #fig-stereoblock}
The performance of methods following the block diagram of
@fig-stereoblock surpassed that of methods with handcrafted features.
More recent work has addressed the problem of poor generalization to
out-of-training-domain test examples @zhang2019domaininvariant. Training
and running the 3D neural network that processes the cost volume can be
expensive in time and memory, and recent work has obtained both speed
and performance improvements by developing an iterative algorithm that
avoids the 3D filtering @Lipson2021.
## Evaluation
The Middlebury stereo web page @Scharstein2002 provides a comprehensive
comparison of many stereo algorithms, compared on a common dataset, with
pointers to algorithm descriptions and code. One can use the web page to
evaluate the improvement of stereo reconstruction algorithms over time.
The advancement of the field over time is very impressive.
## Concluding Remarks
Accurate depth estimation from stereo pairs remains an unsolved problem,
and the results are still not fully reliable. To obtain accurate depth
estimates, most methods require many images and the use of multiview
geometry methods to reconstruct the 3D scene.
Stereo matching is often also formulated as a problem of estimating the
image motion between two views captured at different locations. In
@sec-motion_estimation we will discuss motion estimation algorithms.
We have only provided a short introduction to stereo vision. For an
in-depth study of the geometry of stereo pairs, the reader can consult
specialized books such as @Hartley2004 and @Faugeras93.