::: {.content-hidden}
\(
\newcommand{\bbR}{\mathbb{R}}
\newcommand{\calA}{\mathcal{A}}
\newcommand{\calC}{\mathcal{C}}
\newcommand{\calD}{\mathcal{D}}
\newcommand{\calF}{\mathcal{F}}
\newcommand{\calL}{\mathcal{L}}
\newcommand{\calM}{\mathcal{M}}
\newcommand{\calO}{\mathcal{O}}
\newcommand{\calP}{\mathcal{P}}
\newcommand{\calQ}{\mathcal{Q}}
\newcommand{\calX}{\mathcal{X}}
\newcommand{\Os}{O^{\sharp}}
\newcommand{\pr}{\widehat{\text{pr}}}
\newcommand{\thn}{\widehat{\theta}_{n}}
\newcommand{\card}{\mathop{\mathrm{card}}}
\newcommand{\Class}{\mathop{\mathrm{Class}}}
\newcommand{\Corr}{\mathop{\mathrm{Corr}}}
\newcommand{\Dec}{\mathop{\mathrm{Dec}}}
\newcommand{\diag}{\mathop{\mathrm{diag}}}
\newcommand{\Enc}{\mathop{\mathrm{Enc}}}
\newcommand{\expit}{\mathop{\mathrm{expit}}}
\newcommand{\Gen}{\mathop{\mathrm{Gen}}}
\newcommand{\KL}{\mathop{\mathrm{KL}}}
\newcommand{\law}{\mathop{\mathrm{law}}}
\newcommand{\LB}{\mathop{\mathrm{LB}}}
\newcommand{\Unif}{\mathop{\mathrm{Unif}}}
\newcommand{\ReLU}{\mathop{\mathrm{ReLU}}}
\)
:::
> Asher Rubin walks out of the starosta's home and heads toward the market square. With evening, the sky has cleared, and now a million stars are shining, but their light is cold and brings down a frost upon the earth, upon Rohatyn. The first of this autumn. Rubin pulls his black wool coat tighter around him; tall and thin, he looks like a vertical line.
[@Tokarczuk, I(3)]
# Introduction
## Fiction as the original simulation
One of humanity's oldest creative endeavors, fiction represents an early form
of simulation. It extends the imaginative play where children create
scenarios, roles, or worlds that are not constrained by the rules of reality,
that is "childhood pretence" [@Carruthers] or "the make-believe games" of
children [@Walton]. Through stories, myths, and imagined worlds, humans
construct alternative realities to explore ideas, express emotions, and
reflect on their existence. By presenting hypothetical scenarios and posing
"what if things had been different" questions [@Pearl, page 34], fiction
empowers individuals to explore alternative histories, draw insights from the
experiences of others, and engage with possibilities that extend beyond the
confines of the physical world. At its core, fiction abstracts and
reconstructs elements of reality. An author selectively includes, exaggerates,
or omits aspects of the real world, creating models that serve their artistic
or thematic intentions. From Homer's *Odyssey* [@Jaccottet] to speculative
tales like Mary Shelley's *Frankenstein* [@Shelley], fiction mirrors the
complexities of human life, enabling readers to engage with an imagined
reality that resonates with their own.
The relationship between fiction and reality has long been a subject of
debate. Plato, in his critique of art, viewed fiction as a mere imitation of
the physical world, itself a flawed reflection of the ideal "Forms". By this
reasoning, fiction is a "simulation of a simulation", twice removed from truth
[@Platon, Livre X]. Aristotle, by contrast, argued that fiction, through
"mimesis", the imitation of action and life, can illuminate universal truths
[@Aristote, Chapitres 1 à 5]. By abstracting from the particular, fiction
allows exploration of broader patterns and principles.
Following Aristotle's perspective, this tradition of creating and interacting
with imagined realities provides a natural foundation for distinguishing
scientific theories from scientific models [@Barberousse] and understanding
modern simulations. While they stem from the same drive to represent and
explore, scientific theories, scientific models and modern simulations
introduce a higher degree of mathematical rigor. Nevertheless, fiction
remains their conceptual ancestor, reminding us that the human impulse to
model and engage with alternate realities is as old as storytelling itself.
## From modern simulations to computer simulations {#sec-modern-computer-simulations}
The concept of modern simulations predates the modern era. Early instances
include mechanical devices like the Antikythera mechanism, a sophisticated
analog computer from the 2nd century BCE designed to simulate celestial
movements [and the MacGuffin chased by Indiana Jones in the 2023 installment
of the franchise, @Antikythera]. The emergence of mathematical models in the works of
Galileo and Newton introduced a new form of simulation, where equations were
used to predict physical phenomena with increasing precision. By the 18th
century, probabilistic experiments like Buffon's Needle, designed to
approximate the number $\pi$ [@Proofs, section 24], demonstrated the power of
simulating complex systems. However, the advent of computer simulations, as
we understand them today, began during World War II with the work of J.\ von
Neumann and S.\ Ulam [@metropolis1949monte].
While studying neutron behavior, they faced a challenge that was too complex
for theoretical analysis and too hazardous, time-consuming, and costly to
investigate experimentally. Fundamental properties (e.g., possible events and
their probabilities) and basic quantities (e.g., the average distance a
neutron would travel before colliding with an atomic nucleus, the likelihood
of absorption or reflection, and energy loss after collisions) were known, but
predicting the outcomes of entire event sequences was infeasible. To address
this challenge, they devised a method of generating random sequences step by
step using a computer, naming it "Monte Carlo" after the casino, a suggestion
by N.\ Metropolis. Statistical analysis of the data produced by repeatedly
applying this method provided sufficiently accurate solutions to better
understand nuclear chain reactions, a crucial aspect of designing atomic bombs
and later nuclear reactors. This breakthrough marked the birth of modern
computer simulations.
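The spirit of the probabilistic experiments mentioned above can be sketched in a few lines with Buffon's Needle; the needle length `l`, line spacing `t`, and number of drops below are arbitrary choices made for illustration:

```python
import numpy as np

# Buffon's Needle: a needle of length l is dropped on a floor ruled with
# parallel lines a distance t >= l apart; it crosses a line with
# probability 2l/(pi*t), which yields a Monte Carlo estimate of pi.
rng = np.random.default_rng(2024)
l, t, n_drops = 1.0, 2.0, 10**6
x = rng.uniform(0, t / 2, n_drops)          # distance from needle center to nearest line
theta = rng.uniform(0, np.pi / 2, n_drops)  # acute angle between needle and lines
crosses = x <= (l / 2) * np.sin(theta)
pi_hat = (2 * l * n_drops) / (t * crosses.sum())
print(pi_hat)
```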
Today, computer simulations, henceforth referred to simply as *simulations*,
play a fundamental role in applied mathematics. Generally, conducting a
simulation involves running a computer program (a "simulator") designed to
represent a "system of interest" at a problem-dependent level of abstraction
(that is, with a specific degree of complexity) and collecting the numerical
output for analysis.
Examples of systems of interest are virtually limitless and highly diverse.
They can represent a real-world process in a holistic fashion, such as the
regular functioning of a person's heart at rest, or the medical trajectories
of a cohort of patients undergoing chemotherapy. Alternatively, in a more
focused fashion, they can consist of a hybrid pipeline that combines an
upstream real-world process with downstream data processing of intermediary
outputs, such as the estimation of peripheral oxygen saturation in a healthy
patient using a pulse oximeter. Regardless of the context, determining the
appropriate levels of abstraction and realism is always a significant
challenge.
Here, we focus on simulations used to evaluate the performance of statistical
procedures through simulation studies, as discussed by @MWC19 in their
excellent tutorial on the design and conduct of such studies. The interested
reader will find in their work a carefully curated list of books on simulation
methods in general and articles emphasizing rigor in specific aspects of
simulation studies. Specifically, we consider scenarios where a statistician,
interested in a real-world process, has developed an algorithm tailored to
learning a particular feature of that process from collected data and seeks to
assess the algorithm's performance through simulations.
Once the simulator is devised, the following process is repeated multiple
times. In each iteration, typically independently from previous iterations:
first, the simulator generates a synthetic data set of size $n$; second, the
algorithm is run on the generated data; third, the algorithm's output is
collected for further analysis. After completing these iterations, the next
step is to compare the outcome from one run to the algorithm's target. This is
made possible by the design of the simulator. Finally, the overall
performance of the algorithm is assessed by comparing all the results
collectively to the algorithm's target. Depending on the task, this evaluation
can involve assessing the algorithm's ability to well estimate its target, the
validity of the confidence regions it constructs for its target, the
algorithm's ability to detect whether its target lies within a specified null
domain (using an alternative domain as a reference), and more. This list is
far from exhaustive. The entire process can be repeated multiple times, for
example, to assess how the algorithm's performance depends on $n$.
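As a minimal sketch of this repeated procedure, consider a hypothetical toy setting where the simulator draws $n$ i.i.d. Gaussian observations, the algorithm estimates their mean, and the target is known by design of the simulator:

```python
import numpy as np

rng = np.random.default_rng(0)
target = 1.0                      # the algorithm's target, known by design

def simulator(n):
    # Generate a synthetic data set of size n.
    return rng.normal(loc=target, scale=2.0, size=n)

def algorithm(data):
    # Run the algorithm on the generated data.
    return data.mean()

# Repeat the generate-run-collect iteration, independently each time.
n, n_reps = 100, 1000
estimates = np.array([algorithm(simulator(n)) for _ in range(n_reps)])

# Assess the results collectively against the known target.
bias = estimates.mean() - target
mc_se = estimates.std(ddof=1) / np.sqrt(n_reps)
print(f"bias: {bias:.4f} (Monte Carlo standard error: {mc_se:.4f})")
```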
However, in order to carry out these steps, the statistician must first devise
a simulator. This simulator should ideally generate synthetic data that
resemble the real-world data in a meaningful way, a goal that is often
difficult to achieve. So, how can one design a realistic simulator, and what
does "realistic simulator" even mean in this context? These are the central
questions we explore in this work.
## A probabilistic stance
We adopt a probabilistic framework to model the data collected by observing a
real-world process. Specifically, the data are represented as a random
variable $O^{n}$ ($O$ as in *observations*) drawn from a probability law
$P^{n}$ ($P$ as in *probability*). The law $P^{n}$ is assumed to belong to a
statistical model $\calM^{n}$ ($\calM$ as in *model*), which is the set of all
probability laws on the space $\calO^{n}$ where $O^{n}$ takes its values. This
model incorporates constraints that reflect known properties of the real-world
process and, where necessary, minimal assumptions about it.
The superscript $n$ indicates an amount of information. For example, in the
context of this study, $n$ typically represents the number of elementary
observations drawn independently from a law $P$ on $\calO$ and gathered in
$O^{n}$. In this case, $\calO^{n}$ corresponds to the Cartesian product $\calO
\times \cdots\times \calO$ (repeated $n$ times) and $P^{n}$ to the product law
$P^{\otimes n}$, with $O^{n}$ decomposing as $(O_{1}, \ldots, O_{n})$.
The feature of interest is an element of a space $\calF$ (e.g., a subset of
the real line, or a set of functions). It is modeled as the value
$\Psi(P^{n})$ of a functional $\Psi:\calM^{n} \to \calF$ evaluated at $P^{n}$.
The algorithm developed to estimate this feature is modeled as a functional
$\calA : \calO^{n} \to \calF$. Training the algorithm involves applying
$\calA$ to the observed data $O^{n}$, resulting in the estimator
$\calA(O^{n})$ for the estimand $\Psi(P^{n})$.
We emphasize that we address the questions closing
@sec-modern-computer-simulations without focusing on the specific nature of
the functional of interest $\Psi$: how can one design a realistic simulator,
and what does "realistic simulator" even mean in this context?
## Draw me a simulator {#sec-draw-me-a-model}
When constructing simulators, there is a spectrum of approaches, varying in
complexity and flexibility. At one end of the spectrum, simulators are built
upon relatively simple parametric models. While these models are sometimes
more elaborate, they often rely on standard forms or recurring techniques,
which streamlines their implementation. This approach is further reinforced
by the common practice of using models proposed by others. Doing so not only
saves effort but also facilitates meaningful comparisons between studies, as
the same modeling framework is shared.
Regardless of the model's simplicity, parametric simulators are inherently
limited and unable to capture the complexity of real-world processes. The term
"unnatural" aptly describes this shortcoming, as these models are
simplifications that abstract away many intricacies of reality. Even with
sophisticated parametrizations, it is fundamentally impossible for such
simulators to convincingly replicate the multifaceted interactions and
variability inherent in "nature". Thus, parametric simulators, by their very
essence, cannot achieve realism.
At the other end of the spectrum, one can also adopt a nonparametric approach
through bootstrapping, which involves resampling data directly from the
observed dataset. This method bypasses the need to specify a parametric model
and instead leverages the structure of the real data to generate simulated
samples.
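A minimal sketch of such a bootstrapped simulator, applied to a hypothetical dataset, simply resamples rows with replacement:

```python
import numpy as np

rng = np.random.default_rng(123)
observed = rng.normal(size=(50, 3))          # stand-in for real data O^{1:n}

def bootstrap_sample(data, m):
    # Sample indices uniformly in {1, ..., n} with replacement and
    # return the corresponding rows as a synthetic dataset of size m.
    idx = rng.integers(low=0, high=data.shape[0], size=m)
    return data[idx]

synthetic = bootstrap_sample(observed, 200)
print(synthetic.shape)
```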
Bootstrapping usually refers to a self-starting process that is supposed to
continue or grow without external input. The term is sometimes attributed to
the story where Baron Münchausen pulls himself and his horse out of a swamp by
his pigtail, not by his bootstraps [@Munchausen, Chapter 4]. In France,
"bootstrap" is sometimes translated as "à la Cyrano", in reference to the
literary hero Cyrano de Bergerac, who imagined reaching the moon by standing
on a metal plate and repeatedly using a magnet to propel himself [@Cyrano, Act
III, Scene 13].
When dealing with independent and identically distributed (i.i.d.) samples,
bootstrapping generates data that closely resemble the observed data.
However, the origin of the term "bootstrapping" suggests a measure of
incompleteness, and hence dissatisfaction, which is fitting in the context of this
article. Indeed, a bootstrapped simulator can be viewed as both transparent
and opaque, depending on the perspective. Conditionally on the real data, the
simulator's behavior is transparent, as understanding it reduces to
understanding the sampling mechanism over the set of indices $\{1, \ldots,
n\}$. Unconditionally, however, one is again confronted with the limitation
of knowledge about $P^{n}$, beyond recognizing it as an element of
$\calM^{n}$.
```{bash}
cowsay -f sheep \
"I am a simulator. Press ENTER to run the synthetic experiment."
```
In *Le Petit Prince* [@St-Ex], the Little Prince dismisses the pilot's simple
drawings of a sheep as unsatisfactory. Instead, he prefers a drawing of a box,
imagining the perfect sheep inside.
```{bash}
echo \
"I am a simulator. Press ENTER to run the synthetic experiment." | \
boxes -d cc
```
Similarly, in simulations, straightforward simulators often fail to capture
the complexity we seek, while black-box simulators, though opaque, can
sometimes offer greater efficiency. Unlike the Little Prince, however, we are
not content with the box alone -- we want to look inside, to understand and
refine the mechanisms driving our simulator.
## Organization of the article
In this article, we explore an avenue to build more realistic simulators by
using real data and neural networks, more specifically, Variational
Auto-Encoders (VAEs). To illustrate our approach, we focus on a simple example
rooted in causal analysis, as the causal framework presents particularly
interesting challenges.
@sec-objectives outlines our objectives and introduces a running example that
serves as a unifying thread throughout the study. @sec-nutshell provides a
concise overview of VAEs, including their formal definition and the key ideas
behind their training. @sec-build-VAE offers an explanation of how VAEs are
constructed, while @sec-implementation-VAE presents a comprehensive
implementation tailored to the running example. Using this VAE, @sec-sim-data
describes the construction of a simulator designed to approximate the law of
simulated data and discusses methods for evaluating the simulator's
performance. @sec-IWPC extends this approach to a real-world dataset.
Finally, @sec-conclusion concludes the article with a literature review, a
discussion of the challenges encountered, the limitations of the proposed
approach, and some closing reflections.
*Note that the online version of this article is preferable to the PDF
version, as it allows readers to directly view the code.* Throughout the
article, we use a mix of `Python` [@Python] and `R` [@R-base] for
implementation, leveraging commonly used libraries in both ecosystems.
# Objective {#sec-objectives}
Suppose that we have observed $O_{1}, \ldots, O_{n}, O_{n+1}, \ldots,
O_{n+n'}$ drawn independently from $P$, with $P$ known to belong to a model
$\calP$ consisting of laws on $\calO$. For brevity, we will use the notation
$O^{1:n} = (O_{1}, \ldots, O_{n})$ and $O^{(n+1):(n+n')} = (O_{n+1}, \ldots,
O_{n+n'})$.
Suppose moreover that we are interested in a causal framework where each
$O_{i}$ is viewed as a piece of a complete data $X_{i} \in \calX$ drawn from a
law $Q$ that lives in a model $\calQ$, with $X_{1}, \ldots, X_{n}, X_{n+1},
\ldots, X_{n+n'}$ independent. The piece $O_{i}$ is expressed $\pi(X_{i})$,
with the function $\pi$ projecting a complete data $X \sim Q \in \calQ$ onto a
*coarser* real-world data $O=\pi(X) \sim P \in \calP$.
Our objective is twofold. First, we aim to build a generator that
approximates $P$, that is, an element of $\calP$ from which it is possible to
sample independent data that exhibit statistical properties similar to (or,
colloquially, "behave like") $O_{1}, \ldots, O_{n+n'}$. In other words, we
require that the generator produces data whose joint law approximates the law
of the observed data, ensuring that the generated samples reflect the same
underlying structure and dependencies as the real-world observations. Second,
we require the generator to correspond to the law of $\pi(X)$ with $X$ drawn
from an element of $\calQ$.
We use a running example throughout the document.
::: {.callout-note appearance="simple"}
## Running example.
For example, $\calP$ can be the set of all laws on $\calO := (\{0,1\}^{2}
\times \bbR^{3}) \times \{0,1\} \times \bbR$ such that
$$
O := (V,W,A,Y)\sim P\in \calP
$$
satisfies
$$
c \leq P(A=1|W,V), P(A=0|W,V) \leq 1-c
$$
$P$-almost surely for some $P$-specific constant $c \in ]0,1/2]$, and $Y$ is
$P$-integrable.
Moreover, we view $O$ as $\pi(X)$ with
$$
\begin{aligned}
X&:=(V, W, Y[0], Y[1], A) \in \calX:= ( \{0,1\}^{2} \times \bbR^{3}) \times
\bbR \times \bbR \times \{0,1\},\\
\pi&:(v,w,y[0],y[1],a) \mapsto (v,w, a, ay[1] + (1-a)y[0]),
\end{aligned}
$$
and $\calQ$ defined as the set of all laws on $\calX$ such that $X\sim Q\in
\calQ$ satisfies
$$
c' \leq Q(A=1|W,V), Q(A=0|W,V) \leq 1-c'
$$
$Q$-almost surely for some $Q$-specific constant $c' \in ]0,1/2]$, and $Y[0]$
and $Y[1]$ are $Q$-integrable.
We consider $(V, W)$ as the context in which two possible actions $a=0$ and
$a=1$ would yield the counterfactual rewards $Y[0]$ and $Y[1]$, respectively.
One of these actions, $A\in\{0,1\}$, is factually carried out, resulting in
the factual reward $Y = A Y[1] + (1-A) Y[0]$, that is, $Y[1]$ if $A=1$ and
$Y[0]$ otherwise. In the causal inference literature, this definition of $Y$
is referred to as the consistency assumption.
In the context of a randomized clinical trial, the pair $(V, W)$ represents
baseline covariates, which are measured before one of two possible treatments
is assigned. The binary variable $A$ indicates the treatment randomly assigned
to a participant. The outcome variable $Y$ measures the effect of the
treatment; it is viewed as the counterfactual outcome in a hypothetical
scenario where everyone would receive that same treatment.
:::
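For illustration, the projection $\pi$ of the running example can be coded directly; the numerical values below are arbitrary:

```python
import numpy as np

def pi(v, w, y0, y1, a):
    # Map a complete data X = (V, W, Y[0], Y[1], A) to the coarser
    # real-world data O = (V, W, A, Y).
    y = a * y1 + (1 - a) * y0      # consistency: Y = A Y[1] + (1-A) Y[0]
    return v, w, a, y

v = np.array([1, 0])
w = np.array([0.2, -1.3, 0.7])
print(pi(v, w, y0=0.1, y1=0.9, a=1))   # the factual reward is Y[1]
```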
::: {.callout-note appearance="simple"}
## Running example in action.
The `Python` function `simulate` defined in the next chunk of code
operationalizes drawing independent data from a law $P \in \calP$.
```{python function-simulate}
#| echo: true
import numpy as np

def simulate(n, dimV, dimW):
    def expit(x):
        return 1 / (1 + np.exp(-x))
    p = np.hstack((1/3 * np.ones((n, 1)), 1/2 * np.ones((n, 1))))
    V = np.random.binomial(n = 1, p = p)
    W = np.random.normal(loc = 0, scale = 1, size = (n, dimW))
    WV = np.hstack((W, V))
    pAgivenWV = np.clip(expit(0.8 * WV[:, 0]), 1e-2, 1 - 1e-2)
    A = np.random.binomial(n = 1, p = pAgivenWV)
    meanYgivenAWV = 0.5 * expit(-5 * A * (WV[:, 0] - 1)
                                + 3 * (1 - A) * (WV[:, 1] + 0.5)) \
        + 0.5 * expit(WV[:, 2])
    Y = np.random.normal(loc = meanYgivenAWV, scale = 1/25, size = n)
    dataset = np.vstack((np.transpose(WV), A, Y))
    dataset = np.transpose(dataset)
    return dataset
```
:::
Note that justifying the specific choices made while defining the function
`simulate` is unnecessary. In the context of this study, we are free from the
need for, or aspiration to, a realistic simulation scheme. Under the law $P$
that `simulate` samples from, $V$ and $W$ are independent; $V$ consists of two
independent variables $V_{1}$ and $V_{2}$ that are drawn from the Bernoulli
laws with parameters $\tfrac{1}{3}$ and $\tfrac{1}{2}$; $W$ is a standard
Gaussian random vector in $\bbR^{3}$. In addition, given $(W,V)$, $A$ is sampled from the
Bernoulli law with parameter
$$
\max\left\{0.01, \min\left[0.99, \expit(0.8 \times W_{1})\right]\right\}
$$
and, given $(A,W,V)$, $Y$ is sampled from the Gaussian law with mean
$$
\tfrac{1}{2} \expit\left[-5A\times(W_{1}-1) + 3(1-A)\times
(\tfrac{1}{2} + W_{2})\right] + \tfrac{1}{2} \expit(W_{3})
$$
and (small) standard deviation $\tfrac{1}{25}$. As noted in the introduction,
these choices rely on standard forms and recurring techniques.
::: {.callout-note appearance="simple"}
## Running example, continued.
For future use, we sample in the next chunk of code $n+n'=10^{4}$ independent
observations from $P$. Observations $O^{1:n}$ (gathered in `train`) will be
used for training and observations $O^{(n+1):(n+n')}$ (gathered in `test`)
will be used for testing.
```{python simulation}
#| echo: true
np.random.seed(54321)
dimV, dimW = 2, 3
n_train = int(5e3)
train = simulate(n_train, dimV, dimW)
test = simulate(n_train, dimV, dimW)
print("The three first observations in 'train':\n",
      " V_1 V_2 W_1 W_2 W_3 A Y\n",
      np.around(train[:3, [3, 4, 0, 1, 2, 5, 6]], decimals = 3))
## np.savetxt("data/train.csv", train, delimiter = ",")
## np.savetxt("data/test.csv", test, delimiter = ",")
```
:::
# VAE in a nutshell {#sec-nutshell}
## Formal definition {#sec-formal-definition}
In the context of this article, a Variational Auto-Encoder (VAE) [@VAE2013],
[@VAE2014] is an algorithm that, once trained, outputs a *generator*. The
generator is the law of a random variable of the form
$$
\Gen_{\theta} (Z)
$${#eq-gen-theta}
where
1. the source of randomness $Z$ in @eq-gen-theta is written as
$$
Z := (Z^{(0)}, \ldots, Z^{(d)})
$${#eq-alea}
with $Z^{(0)}, \ldots, Z^{(d)}$ independently drawn from
- the uniform law on $\{1, \ldots, n\}$ for $Z^{(0)}$
- the standard normal distribution for $Z^{(1)}, \ldots, Z^{(d)}$;
2. the function $\Gen_{\theta}$ in @eq-gen-theta is an element of a large
collection, parametrized by the finite-dimensional set $\Theta$, of functions
mapping $\bbR^{d+1}$ to $\calX$.
For simplicity, we will also refer to the function $\Gen_{\theta}$ as a
*generator*. Moreover, we emphasize that the entire parameter $\theta$ is to
be fitted, meaning there are no hyperparameters involved.
Because $\Gen_{\theta}(Z)$ belongs to $\calX$, we can evaluate $\pi \circ
\Gen_{\theta}(Z)$, hence the generator can also be used to generate random
variables in $\calO$. @fig-VAE illustrates the architecture of the VAE used
in this study. It shows the key components of the model, including the
encoder, the latent space, and the decoder, along with the flow of information
between them.
:::{#fig-VAE}
Architecture of the simulator. The figure depicts the flow of information
through the encoder, latent space, and decoder components. It emphasizes how
the input source of randomness $Z$ is transformed into a latent representation
and then reconstructed as a complete data, $X=\Gen_{\theta}(Z)$, which can be
mapped to a real-world data $O=\pi(X)$.
:::
The word "auto-encoder" reflects the nature of the parametric form of each
$\Gen_{\theta}$. We begin with a formal presentation in four steps, which is
then followed by a discussion of what each step implements. Specifically,
each $\Gen_{\theta}$ is written as a composition of four mappings $J_{n}$,
$\Enc_{\theta_{1}}$, $K$ and $\Dec_{\theta_{2}}$ with $\theta := (\theta_{1},
\theta_{2}) \in \Theta_{1} \times \Theta_{2} = \Theta$:
$$
\Gen_{\theta} = \Dec_{\theta_{2}} \circ K \circ \Enc_{\theta_{1}} \circ
J_{n}.
$$
Here,
1. $J_{n}: \{1, \ldots, n\} \times \bbR^{d} \to \calO \times \bbR^{d}$ is
such that
$$
J_{n} (Z) = (O_{i}, (Z^{(1)}, \ldots, Z^{(d)}))
$$
with $i = Z^{(0)}$;
2. $\Enc_{\theta_{1}} : \calO \times \bbR^{d} \to \bbR^{d} \times
(\bbR_{+}^{*})^{d} \times \bbR^{d}$ is such that,
if $\Enc_{\theta_{1}}(o,z) = (\mu, \sigma, z')$, then
- $z=z'$, and
- $\Enc_{\theta_{1}}(o,z'') = (\mu, \sigma, z'')$ for all $z''\in\bbR^{d}$;
3. $K : \bbR^{d} \times (\bbR_{+}^{*})^{d} \times \bbR^{d} \to \bbR^{d}$ is
given by
$$
K(\mu,\sigma,z) := \mu + \sigma \odot z,
$$
where $\odot$ denotes the componentwise product;
4. $\Dec_{\theta_{2}}$ maps $\bbR^{d}$ to $\calX$.
Conditionally on $O^{1:n}$ and $Z$, the computation of $\Gen_{\theta} (Z)$ is
deterministic. The process unfolds in four steps:
1. **Sampling and transfer.** Compute $J_{n}(Z)$, which involves sampling
one observation $O_{i}$ uniformly among all genuine observations and
transferring $(Z^{(1)}, \ldots, Z^{(d)})$ unchanged.
2. **Encoding step.** Compute $\Enc_{\theta_{1}} \circ J_{n}(Z)$, which
encodes $O_{i}$ as a vector $\mu \in \bbR^{d}$ and a $d\times d$ covariance
matrix $\diag(\sigma)^2$. This step does not modify $(Z^{(1)}, \ldots,
Z^{(d)})$, which is transferred unchanged.
3. **Gaussian sampling.** Compute $K\circ \Enc_{\theta_{1}} \circ J_{n}(Z)$ by
evaluating $\mu + \sigma \odot (Z^{(1)}, \ldots, Z^{(d)}) \in \bbR^{d}$. This
amounts to sampling from the Gaussian law with mean $\mu$ and covariance
matrix $\diag(\sigma)^2$. In other words, $K$ enables the sampling of a
neighbor of the encoded version of $O_i$ in the latent space.
4. **Decoding step.** Compute $\Dec_{\theta_{2}}\circ K\circ \Enc_{\theta_{1}}
\circ J_{n}(Z)$, which maps the encoded version of $O_{i}$, that is, $\mu +
\sigma \odot (Z^{(1)}, \ldots, Z^{(d)})$, to an element of $\calX$.
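The four-step computation can be sketched as follows, with hypothetical stand-ins for the trained encoder and decoder (in practice both are neural networks):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 100, 4
O = rng.normal(size=(n, 2))                     # stand-in for O^{1:n}

def J_n(Z):
    # Step 1: pick observation O_i with i = Z^(0); pass the rest through.
    i, gauss = Z[0], Z[1]
    return O[i], gauss

def Enc(o, z):
    # Step 2 (hypothetical encoder): map o to (mu, sigma); z is unchanged.
    mu = np.tanh(np.repeat(o.mean(), d))
    sigma = np.exp(np.repeat(o.std(), d))
    return mu, sigma, z

def K(mu, sigma, z):
    # Step 3: Gaussian sampling via the reparametrization mu + sigma * z
    # (componentwise product).
    return mu + sigma * z

def Dec(u):
    # Step 4 (hypothetical decoder): map the latent point to an element of calX.
    return np.concatenate([u, u**2])

Z = (rng.integers(n), rng.normal(size=d))       # Z^(0) uniform, rest Gaussian
X = Dec(K(*Enc(*J_n(Z))))
print(X.shape)
```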
## Formal training
Formally, training the VAE involves maximizing the likelihood of $O^{1:n}$
within a parametric model of laws by maximizing a lower bound of the
likelihood. This process begins with the introduction of a working model of
mixtures for $P$. The working model (undoubtedly flawed) postulates the
existence of a latent random variable $U\in \bbR^{d}$ and a parametric model
of *tractable* conditional densities
$$
\{o \mapsto p_{\theta_{2}} (o|u) : u \in \bbR^{d},
\theta_{2} \in \Theta_{2}\}
$$
such that
- $U$ is drawn from the standard Gaussian law on $\bbR^{d}$;
- there exists $\theta_{2} \in \Theta_{2}$ such that, given $U$, $O$ is drawn
from $p_{\theta_{2}} (\cdot|U)$.
Here, tractable densities refer to those that can be easily worked with
analytically, while in contrast, intractable densities are too complex to
handle directly.
Therefore, the working model (undoubtedly flawed) postulates the existence of
$\theta_{2} \in \Theta_{2}$ such that $P$ admits the generally *intractable*
density
$$
o \mapsto \int p_{\theta_{2}}(o|u) \phi_{d}(u)du
$$
where $\phi_{d}$ denotes the density of the standard Gaussian law on
$\bbR^{d}$. As suggested by the use of the parameter $\theta_{2}$, the
definition of the conditional densities $p_{\theta_{2}}(\cdot|u)$ ($u \in
\bbR^{d}$) involves the decoder $\Dec_{\theta_{2}}$.
Since directly maximizing the likelihood of $O^{1:n}$ under the working model
is infeasible, a *secondary* parametric model of *tractable* conditional
densities is introduced:
$$
\{u \mapsto g_{\theta_{1}} (u|O_{i}) : 1 \leq i \leq n, \theta_{1} \in
\Theta_{1}\}
$$
to model the conditional laws of $U$ given $O_{1}$, given $O_{2}$, $\ldots$,
given $O_{n}$. Here too, the use of the parameter $\theta_{1}$ indicates that
the definition of the conditional densities $g_{\theta_{1}} (\cdot|O_{i})$ ($1
\leq i \leq n$) involves the encoder $\Enc_{\theta_{1}}$.
Now, by Jensen's inequality, for any $1 \leq i \leq n$ and all
$\theta = (\theta_{1}, \theta_{2})\in \Theta$,
$$
\begin{aligned}
\log p_{\theta_{2}}(O_{i})
&= \log \int p_{\theta_{2}} (O_{i}|u)
\frac{\phi_{d}(u)}{g_{\theta_{1}}(u|O_{i})} g_{\theta_{1}}(u|O_{i}) du\\
&\geq \int \log\left(p_{\theta_{2}} (O_{i}|u)
\frac{\phi_{d}(u)}{g_{\theta_{1}}(u|O_{i})}\right) g_{\theta_{1}}(u|O_{i})
du\\
&= -\KL(g_{\theta_{1}}(\cdot|O_{i}); \phi_{d}) +
E_{U\sim g_{\theta_{1}}(\cdot|O_{i})} [\log p_{\theta_{2}}(O_{i}|U)]=:
\LB_{\theta}(O_{i}),
\end{aligned}
$${#eq-LB}
where $\KL$ denotes the Kullback-Leibler divergence and $U$ in the expectation
is drawn from the conditional law with density $g_{\theta_{1}}(\cdot |
O_{i})$. The notation $\LB$ is used to indicate that it represents a lower
bound. Thus, the log-likelihood of $O^{1:n}$ under $\theta_{2} \in \Theta_{2}$
is lower-bounded by
$$
\sum_{i=1}^{n} \LB_{\theta}(O_{i})
$$
for all $\theta_{1} \in \Theta_{1}$. As suggested earlier, training the VAE
formally consists of solving
$$
\max_{\theta\in\Theta} \left\{\sum_{i=1}^{n}\LB_{\theta}(O_{i})\right\}
$${#eq-VAE}
rather than solving
$$
\max_{\theta_{2} \in \Theta_{2}} \left\{\sum_{i=1}^{n} \log p_{\theta_{2}}
(O_{i})\right\}.
$$
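As a numerical sanity check of @eq-LB, the following sketch (illustrative only; it is not part of the VAE implementation) uses a one-dimensional toy working model where everything is tractable: $U \sim N(0,1)$, $O$ given $U=u$ drawn from $N(u,1)$, so that marginally $O \sim N(0,2)$, and an arbitrary Gaussian approximate posterior $g(\cdot|o) = N(m, s^{2})$. It verifies that $\LB_{\theta}(O_{i})$ indeed lower-bounds the log-likelihood; all numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy working model (hypothetical, for illustration only):
#   U ~ N(0, 1),  O | U=u ~ N(u, 1)  =>  marginally O ~ N(0, 2).
# Approximate posterior g(.|o) = N(m, s^2) with arbitrary (m, s).
o, m, s = 1.3, 0.6, 0.8

# Closed-form KL(N(m, s^2); N(0, 1)) (cf. @eq-LB-LHS-term with d = 1).
kl = 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))

# Monte Carlo reconstruction term E_{U ~ g(.|o)}[log p(o|U)]
# via the reparametrization U = m + s * Z, Z ~ N(0, 1).
z = rng.standard_normal(100_000)
u = m + s * z
recon = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (o - u) ** 2)

lb = -kl + recon                                # LB_theta(o), as in @eq-LB
log_p = -0.5 * np.log(4 * np.pi) - o**2 / 4.0  # exact log-likelihood of N(0, 2)

assert lb <= log_p  # Jensen's inequality: the bound holds
```

Maximizing the lower bound over $(m, s)$ would here recover the exact posterior $N(o/2, 1/2)$, for which the bound is tight.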
# How to build the VAE {#sec-build-VAE}
## A formal description {#sec-VAE-formal}
We implement the classes of encoders and decoders, that is
$\{\Enc_{\theta_{1}} : \theta_{1} \in \Theta_{1}\}$ and $\{\Dec_{\theta_{2}} :
\theta_{2} \in \Theta_{2}\}$, as neural network models. Each encoder
$\Enc_{\theta_{1}}$ and each decoder $\Dec_{\theta_{2}}$ consists of a stack of
layers of two types: *densely-connected* and *activation* layers (linear, $x
\mapsto x$; ReLU: $x \mapsto \max(0, x)$; softmax: $(x_{1}, x_{2}) \mapsto
(e^{x_{1}}, e^{x_{2}})/(e^{x_{1}} + e^{x_{2}})$). The neural networks are
rather simple in design, but nevertheless (moderately) high-dimensional and
arguably over-parametrized, as discussed in @sec-over-parametrization.
The model $\{u \mapsto g_{\theta_{1}}(u|O_{i}) : 1 \leq i \leq n, \theta_{1}
\in \Theta_{1}\}$ is chosen such that $U$ drawn from
$g_{\theta_{1}}(\cdot|O_{i})$ is a Gaussian vector with mean $\mu_{i}$ and
covariance matrix $\diag(\sigma_{i})^{2}$, where $(\mu_{i}, \sigma_{i})$ is the
$\theta_{1}$-specific encoding of $O_{i}$, that is, where
$\Enc_{\theta_{1}}(O_{i}, \cdot) = (\mu_{i}, \sigma_{i}, \cdot)$.
Remarkably, the left-hand side term in the definition of $\LB_{\theta}(O_{i})$
(@eq-LB) is then known in closed form:
$$
-\KL(g_{\theta_{1}}(\cdot|O_{i}); \phi_{d}) = \tfrac{1}{2} \sum_{j=1}^{d}
\left(1 + \log (\sigma_{i}^{2})_{j} - (\sigma_{i}^{2})_{j} - (\mu_{i}^{2})_{j}
\right),
$${#eq-LB-LHS-term}
where $(\mu_{i}^{2})_{j}$ and $(\sigma_{i}^{2})_{j}$ are the $j$-th components
of $\mu_{i} \odot \mu_{i}$ and $\sigma_{i} \odot \sigma_{i}$, respectively.
This is very convenient, because @eq-LB-LHS-term makes estimating the term
$\KL(g_{\theta_{1}}(\cdot|O_{i}); \phi_{d})$ unnecessary, a task that would
otherwise introduce more variability in the procedure.
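The closed form @eq-LB-LHS-term can itself be checked numerically. The sketch below (with arbitrary values of $\mu_{i}$ and $\sigma_{i}$) compares it with a plain Monte Carlo estimate of the Kullback-Leibler divergence between $N(\mu_{i}, \diag(\sigma_{i})^{2})$ and $N(0, I_{d})$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
mu = rng.normal(size=d)                  # arbitrary mean vector
sigma = rng.uniform(0.5, 1.5, size=d)    # arbitrary positive scales

# Closed form of -KL(N(mu, diag(sigma)^2); N(0, I_d)), as in @eq-LB-LHS-term.
neg_kl = 0.5 * np.sum(1.0 + np.log(sigma**2) - sigma**2 - mu**2)

# Monte Carlo check: KL = E_{U ~ g}[log g(U) - log phi_d(U)].
u = mu + sigma * rng.standard_normal((500_000, d))
log_g = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
               - 0.5 * ((u - mu) / sigma) ** 2, axis=1)
log_phi = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * u**2, axis=1)
kl_mc = np.mean(log_g - log_phi)

assert abs(-neg_kl - kl_mc) < 0.05  # the two estimates agree
```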
As for the model $\{o \mapsto p_{\theta_{2}}(o|u) : u \in \bbR^{d}, \theta_{2}
\in \Theta_{2}\}$, the only requirement is that it must be chosen in such a
way that $\log p_{\theta_{2}}(O_{i} | u)$ be computable for all $1\leq i \leq
n$, $\theta_{2} \in \Theta_{2}$ and $u \in \bbR^{d}$. This is not a tall
order whenever $O$ can be decomposed as a sequence of (e.g., time-ordered)
random variables that are vectors with categorical, integer, or real
entries. Indeed, it then suffices (i) to decompose the likelihood accordingly
under the form of a product of conditional likelihoods, and (ii) to choose a
tractable parametric model for each factor in the decomposition. We
illustrate the construction of $\{o \mapsto p_{\theta_{2}}(o|u) : u \in
\bbR^{d}, \theta_{2} \in \Theta_{2}\}$ in the context of our running example.
::: {.callout-note appearance="simple"}
## Running example, cted.
In the context of this example, $O=(V,W,A,Y)$ with $V \in \{0,1\}^{2}$, $W \in
\bbR^{3}$, $A\in\{0,1\}$ and $Y\in\bbR$. Since the source of randomness $Z$
has dimension $(d+1)$, $d$ must satisfy $d = d_{1} + 3$ for some integer
$d_{1} \geq 1$.
Set $\theta = (\theta_{1}, \theta_{2})\in \Theta$, $u \in \bbR^{d}$, and let
$\pi\circ \Dec_{\theta_{2}} (u) = (\tilde{v}, \tilde{w}, \tilde{a}, \tilde{y})
\in \calO$. The conditional likelihood $p_{\theta_{2}}(O|u)$ (of $O$ given
$U=u$) equals
$$
p_{\theta_{2}}(V,W|u) \times p_{\theta_{2}}(A|W,V,u) \times p_{\theta_{2}}(Y|A,W,V,u)
$$
so it suffices to define the conditional likelihoods $p_{\theta_{2}}(V,W|u)$
(of $(V,W)$ given $U=u$), $p_{\theta_{2}}(A|W,V,u)$ (of $A$ given $(W,V)$ and
$U=u$) and $p_{\theta_{2}}(Y|A,W,V,u)$ (of $Y$ given
$(A,W,V)$ and $U=u$).
- We decide that $V$ and $W$ are conditionally independent given $U$ under
$p_{\theta_{2}}(\cdot|u)$. Therefore, it suffices to characterize the
conditional likelihoods $p_{\theta_{2}}(V|u)$ (of $V$ given $U=u$) and
$p_{\theta_{2}}(W|u)$ (of $W$ given $U=u$).
- We choose $w\mapsto p_{\theta_{2}}(w|u)$ to be the Gaussian density with
mean $\tilde{w}$ and identity covariance matrix.
:::
::: {.callout-note appearance="simple"}
## Running example, cted.
- The description of the conditional law of $V$ given $U=u$ under
$p_{\theta_{2}}(\cdot|u)$ is slightly more involved. It requires that we give
more details on the encoders and decoders.
- Like every encoder, $\Enc_{\theta_{1}}$ actually maps $\calO \times
\bbR^{d}$ to $[\bbR^{d_{1}}\times \{0\}^{3}] \times
[(\bbR_{+}^{*})^{d_{1}} \times \{1\}^{3}] \times \bbR^{d}$. In words, if
$\Enc_{\theta_{1}}(o, \cdot) = (\mu, \sigma, \cdot)$, then it necessarily
holds that the last three components of $\mu$ and $\sigma$ are 0 and 1,
respectively. Therefore, the last three components of the random vector $K
\circ \Enc_{\theta_{1}} \circ J_{n}(Z)$ equal $Z^{(d-2)}, Z^{(d-1)},
Z^{(d)}$, three independent standard normal random variables.
- To compute $\Dec_{\theta_{2}}(u) = (\tilde{v}, \tilde{w}, \tilde{y}_{0},
\tilde{y}_{1}, \tilde{a}) \in \calX$, we actually compute $\tilde{w}$
*then* $\tilde{v}$, *then* $(\tilde{y}_{0}, \tilde{y}_{1},\tilde{a})$.
- The output $\tilde{w}$ is a $\theta_{2}$-specific deterministic
function of the first $d_{1}$ components of $u$.
- The output $\tilde{v}$ is a $\theta_{2}$-specific deterministic
function of the first $(d_{1}+2)$ components of $u$.
More specifically, two (latent) probabilities $\tilde{g}_{1},
\tilde{g}_{2}$ are first computed, as $\theta_{2}$-specific
deterministic functions of the first $d_{1}$ components of $u$. Then
$\tilde{v}_{1}$ and $\tilde{v}_{2}$ are set to
$\textbf{1}\{\Phi(u^{(d_{1}+1)}) \leq \tilde{g}_{1}\}$ and
$\textbf{1}\{\Phi(u^{(d_{1}+2)}) \leq \tilde{g}_{2}\}$, where $\Phi$
denotes the standard normal cumulative distribution function (c.d.f.)
and $u^{(d_{1}+1)}, u^{(d_{1}+2)}$ are the $(d_{1}+1)$-th and
$(d_{1}+2)$-th components of $u$.
For instance, $\tilde{v}_{1}$ is given the value 1 if
$\Phi(u^{(d_{1}+1)}) \leq \tilde{g}_{1}$ and 0 otherwise. Note that
$\textbf{1}\{\Phi(Z^{(d_{1}+1)}) \leq \tilde{g}_{1}\}$ follows the
Bernoulli law with parameter $\tilde{g}_{1}$ because $Z^{(d_{1}+1)}$
is drawn from the standard normal law.
- The output $(\tilde{y}_{0},
\tilde{y}_{1})$ is a $\theta_{2}$-specific deterministic function of
$(\tilde{v}, \tilde{w})$ and the first $d_{1}$ components of $u$.
- The output $\tilde{a}$ is a $\theta_{2}$-specific deterministic
function of $(\tilde{v}, \tilde{w})$ and the last component of $u$.
More specifically, a (latent) probability $\tilde{h}$ is first
computed, as a $\theta_{2}$-specific deterministic function of
$(\tilde{v}, \tilde{w})$. Then $\tilde{a}$ is set to
$\textbf{1}\{\Phi(u^{(d)}) \leq \tilde{h}\}$.
Note that $\textbf{1}\{\Phi(Z^{(d)}) \leq \tilde{h}\}$ follows the
Bernoulli law with parameter $\tilde{h}$ because $Z^{(d)}$ is drawn
from the standard normal law.
We are now in a position to describe the conditional law of $V$ given
$U=u$. We decide that, conditionally on $U=u$, under
$p_{\theta_{2}}(\cdot|u)$, $V_{1}$ and $V_{2}$ are independently drawn
from the Bernoulli laws with parameters $\tilde{g}_{1}$ and
$\tilde{g}_{2}$. Thus, $p_{\theta_{2}}(\cdot|u)$ is such that
$p_{\theta_{2}}(v|u) = [v_{1} \tilde{g}_{1} + (1-v_{1})(1-\tilde{g}_{1})]
\times [v_{2} \tilde{g}_{2} + (1-v_{2})(1-\tilde{g}_{2})]$ for $v=(v_{1},
v_{2}) \in \{0,1\}^{2}$.
:::
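The construction $\textbf{1}\{\Phi(Z) \leq \tilde{g}\}$ used above can be checked empirically: since $\Phi(Z)$ is uniformly distributed on $[0,1]$ when $Z$ is standard normal, the indicator follows the Bernoulli law with parameter $\tilde{g}$. A minimal sketch, with an arbitrary value of $\tilde{g}$:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

def Phi(x):
    """Standard normal c.d.f.: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

g_tilde = 0.3                      # a latent probability, e.g. tilde{g}_1
z = rng.standard_normal(200_000)   # the Gaussian coordinate Z^{(d_1+1)}

# Phi(Z) is uniform on [0, 1], so the indicator below is Bernoulli(g_tilde).
v = (Phi(z) <= g_tilde).astype(int)

assert abs(v.mean() - g_tilde) < 0.01
```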
::: {.callout-note appearance="simple"}
## Running example, cted.
- The description of the conditional law of $A$ given $(W, V)$ and $U=u$ under
$p_{\theta_{2}}(\cdot|u)$ is similar to that of $V$ given $U$. We decide
that, conditionally on $(W, V)$ and $U=u$, under $p_{\theta_{2}}(\cdot|W, V,
u)$, $A$ follows the Bernoulli law with parameter
$\tilde{\underline{h}}(V,W)$, where the probability
$\tilde{\underline{h}}(v,w)$ lies between $\tilde{h}$ and
$\bar{A}_{n}:=\tfrac{1}{n}\sum_{i=1}^{n} A_{i}$ and is given, for any $(v,w)
\in \{0,1\}^{2} \times \bbR^{3}$, by
$$
\begin{aligned}
\tilde{\underline{h}}(v,w) :=& t(v,w) \tilde{h} + [1 - t(v,w)] \bar{A}_{n} \quad \text{with}\\
-10\log t(v,w) =& - \left[v_{1} \log \tilde{g}_{1} + (1-v_{1}) \log
(1-\tilde{g}_{1})\right]\\
& - \left[v_{2} \log \tilde{g}_{2} + (1-v_{2}) \log (1-\tilde{g}_{2})\right]\\
&+\|w - \tilde{w}\|_{2}^{2}.
\end{aligned}
$$
Thus, $p_{\theta_{2}}(\cdot|W,V,u)$ is such that $p_{\theta_{2}}(a|W,V,u)
= a \tilde{\underline{h}}(V,W) + (1-a)(1-\tilde{\underline{h}}(V,W))$ for
$a \in \{0,1\}$.
- Finally, we choose $y \mapsto p_{\theta_{2}}(y|A,W,V,u)$ to be the
two-regime density given by
$$
p_{\theta_{2}}(y|A,W,V,u) =
\frac{\textbf{1}\{A=\tilde{a}\}}{\tilde{s}(W)} \phi_{1}
\left(\frac{y - \tilde{y}}{\tilde{s}(W)}\right) + \textbf{1}\{A\neq
\tilde{a}\} C^{-1}
$$
where $\tilde{s}(w):= \tfrac{1}{\sqrt{5}} \|w - \tilde{w}\|_{2}$ for any $w \in
\bbR^{3}$ and $C$ is the
Lebesgue measure of the support of the marginal law of $Y$ under $P$ (it
does not matter if $C$ is unknown).
Thus, two cases arise:
- If $A = \tilde{a}$, then $Y$ is conditionally drawn under
$p_{\theta_{2}}(\cdot|A,W,V,u)$ from the Gaussian law with mean $\tilde{y} =
\tilde{a} \tilde{y}_{1} + (1-\tilde{a}) \tilde{y}_{0}$
and variance $\tilde{s}(W)^{2}$.
- Otherwise, $Y$ is conditionally drawn under
$p_{\theta_{2}}(\cdot|A,W,V,u)$ from the uniform law on the support of the
marginal law of $Y$ under $P$.
Therefore, the conditional likelihood $p_{\theta_{2}}(Y|A,W,V,u)$ bears
information only if $A=\tilde{a}$ (that is, if the actions $A$ and
$\tilde{a}$ undertaken when generating $O=(V,W,A,Y)$ and computing
$\Dec_{\theta_{2}}(u)$ coincide), which can be interpreted as a necessary
condition to justify the comparison of the rewards $Y$ and
$\tilde{y}$. Moreover, when $A=\tilde{a}$, the closer the contexts $W$ and
$\tilde{w}$ are, the more relevant the comparison is and the larger the
magnitude of $p_{\theta_{2}}(Y|A,W,V,u)$ can be.
:::
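The tempered probability $\tilde{\underline{h}}(v,w)$ can be sketched as follows, with placeholder values for the decoder outputs $\tilde{g}_{1}, \tilde{g}_{2}, \tilde{w}, \tilde{h}$ and for $\bar{A}_{n}$ (all hypothetical). The sketch illustrates the intended behavior: a faithful reconstruction of $(v, w)$ keeps the Bernoulli parameter close to $\tilde{h}$, while a poor one pulls it toward the empirical frequency $\bar{A}_{n}$.

```python
import numpy as np

# Hypothetical decoder outputs and empirical treatment frequency.
g1_tilde, g2_tilde = 0.4, 0.7      # latent probabilities tilde{g}_1, tilde{g}_2
w_tilde = np.array([0.1, -0.2, 0.5])
h_tilde = 0.25                     # latent probability tilde{h}
A_bar = 0.5                        # empirical mean of A_1, ..., A_n

def h_under(v, w):
    """Tempered probability between h_tilde and A_bar, as defined in the text."""
    log_lik = (v[0] * np.log(g1_tilde) + (1 - v[0]) * np.log(1 - g1_tilde)
               + v[1] * np.log(g2_tilde) + (1 - v[1]) * np.log(1 - g2_tilde))
    # -10 log t(v, w) = -log_lik + ||w - w_tilde||^2, hence:
    t = np.exp((log_lik - np.sum((w - w_tilde) ** 2)) / 10.0)
    return t * h_tilde + (1 - t) * A_bar

# Perfect reconstruction of w versus a very poor one.
h_good = h_under(np.array([1, 1]), w_tilde)
h_bad = h_under(np.array([1, 1]), w_tilde + 10.0)

assert h_tilde <= h_good <= A_bar and h_tilde <= h_bad <= A_bar
assert abs(h_bad - A_bar) < abs(h_good - A_bar)
```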
::: {.callout-note appearance="simple"}
## Running example, cted.
In summary, the right-hand side term in the definition of
$\LB_{\theta}(O_{i})$ (@eq-LB) equals, up to a term that does not depend on
$\theta$,
$$ \begin{aligned}
\tfrac{1}{2}E_{U \sim g_{\theta_{1}}(\cdot|O_{i})} \Bigg[
& -2\left(V_{1,i} \log \tilde{G}_{1} + (1-V_{1,i}) \log (1 -
\tilde{G}_{1})\right) \\
& -2\left(V_{2,i} \log \tilde{G}_{2} + (1-V_{2,i}) \log (1 -
\tilde{G}_{2})\right) \\
& - \|W_{i} - \tilde{W}\|_{2}^{2}\\
& -2\left(A_{i} \log \tilde{\underline{H}} + (1-A_{i}) \log [1 -
\tilde{\underline{H}}]\right)\\
& -\textbf{1}\{A_{i} = \tilde{A}\} \times \left(\log \tilde{S}(W_{i})^{2} +
\frac{(Y_{i} - \tilde{Y})^{2}}{\tilde{S}(W_{i})^{2}}\right)\Bigg],
\end{aligned}
$${#eq-log-p-theta2}
with the notational conventions $\pi \circ \Dec_{\theta_{2}} (U) = (\tilde{V},
\tilde{W}, \tilde{A}, \tilde{Y})$, $V_{i} = (V_{1,i}, V_{2,i})$, and where
$\tilde{G}_{1}$, $\tilde{G}_{2}, \tilde{\underline{H}}, \tilde{S}$
are defined like the
above latent quantities $\tilde{g}_{1}$, $\tilde{g}_{2}, \tilde{\underline{h}},
\tilde{s}$
with $U$ substituted for $u$. The expression is easily interpreted: the
opposite of @eq-log-p-theta2 is an average risk that measures
- the likelihood of $V_{1,i}$ and $V_{2,i}$ from the points of view of the
Bernoulli laws with parameters $\tilde{G}_{1}$ and $\tilde{G}_{2}$ (first and
second terms),
- the average proximity between $W_{i}$ and $\tilde{W}$ (third term),
- the likelihood of $A_{i}$ from the point of view of the Bernoulli law with
parameter $\tilde{\underline{H}}$ (fourth term),
- the average proximity between $Y_{i}$ and $\tilde{Y}$ (fifth term) *only if*
$A_{i} = \tilde{A}$ (otherwise, the comparison would be meaningless).
In other words, the opposite of @eq-log-p-theta2 can be interpreted as a
measure of the average faithfulness of the reconstruction of $O_{i}$ under the
form $\pi \circ \Dec_{\theta_{2}} (U)$ with $U$ drawn from
$g_{\theta_{1}}(\cdot|O_{i})$: the larger @eq-log-p-theta2 is, the better this
reconstruction.
To conclude, note that the conditional laws of $W$ and $Y$, both Gaussian,
could easily be associated with diagonal covariance matrices different from
the identity matrix. This adjustment would be particularly relevant in
situations where $\|W\|_{2}$ and $|Y|$ are typically not of the same
magnitude, with $O=(V,W,A,Y)$ drawn from the law $P$ of the experiment of
interest. Alternatively, the genuine observations could be pre-processed to
ensure that $\|W\|_{2}$ and $|Y|$ are brought to comparable magnitudes.
:::
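For concreteness, the conditional log-likelihood $\log p_{\theta_{2}}(O|u)$ assembled from the factors described in the callouts above can be sketched as follows, with placeholder values for the decoder outputs (all hypothetical) and additive constants dropped.

```python
import numpy as np

# Sketch of the per-draw reconstruction log-likelihood log p_{theta_2}(O|u),
# up to additive constants, assembled from the conditional factors:
# p(V|u) * p(W|u) * p(A|W,V,u) * p(Y|A,W,V,u).

def recon_loglik(v, w, a, y, g_tilde, w_tilde, h_under, a_tilde, y_tilde, s_tilde):
    ll = np.sum(v * np.log(g_tilde) + (1 - v) * np.log(1 - g_tilde))  # V | u
    ll += -0.5 * np.sum((w - w_tilde) ** 2)                           # W | u
    ll += a * np.log(h_under) + (1 - a) * np.log(1 - h_under)         # A | W, V, u
    if a == a_tilde:                                                  # Y | A, W, V, u
        ll += -np.log(s_tilde) - (y - y_tilde) ** 2 / (2 * s_tilde**2)
    return ll

# One observation against hypothetical decoder outputs.
ll = recon_loglik(
    v=np.array([1, 0]), w=np.array([0.2, -0.1, 0.4]), a=1, y=1.1,
    g_tilde=np.array([0.6, 0.3]), w_tilde=np.array([0.1, 0.0, 0.5]),
    h_under=0.45, a_tilde=1, y_tilde=0.9, s_tilde=0.5,
)
assert np.isfinite(ll) and ll < 0.0
```

Averaging this quantity over draws of $U \sim g_{\theta_{1}}(\cdot|O_{i})$ yields (up to constants) the reconstruction term of $\LB_{\theta}(O_{i})$.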
The hope is that, once the VAE is trained, yielding a parameter $\thn =
((\thn)_{1}, (\thn)_{2})$, the corresponding generator $\Gen_{\thn}$ produces
a synthetic complete data $X\in\calX$ such that the law of $\pi(X) \in\calO$
closely approximates $P$. Naturally, this approximation is closely related to
the conditional densities $g_{(\thn)_{1}}(\cdot|O_{i})$ and
$p_{(\thn)_{2}}(\cdot | u)$ ($1 \leq i \leq n$, $u \in \bbR^{d}$).
For instance, in the context of the running example, if $O = (V,W, A, Y) = \pi
\circ \Gen_{\thn}(Z)$ and if $\xi, \zeta$ are independently drawn from the
centered Gaussian laws with an identity covariance matrix on $\bbR^{3}$ and
variance 1 on $\bbR$, respectively, then $(V, W + \xi, A, Y + \zeta)$ follows a
law that admits the density
$$
\begin{aligned}
(v, w, a, y)\mapsto \int & p_{(\thn)_{2}}(y|a, w, v, u) \times
\textbf{1}\{a=\tilde{a}_{(\thn)_{2}}(u)\}\\
& \times
p_{(\thn)_{2}}(w,v|u)\left(\frac{1}{n}\sum_{i=1}^{n}
g_{(\thn)_{1}}(u |O_{i}) \right) du,
\end{aligned}
$$
where $\tilde{a}_{(\thn)_{2}}(u)$ is defined as the $A$-coefficient of
$\pi\circ \Dec_{(\thn)_{2}}(u)$.
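The sampling scheme of the trained generator can be sketched as follows, with toy placeholder maps standing in for the trained encoder and decoder (they are *not* the neural networks of @sec-VAE-formal; all shapes and values are hypothetical): the first coordinate of $Z$ selects an observation $O_{i}$, the Gaussian coordinates are reparametrized through the encoding $(\mu_{i}, \sigma_{i})$, and the resulting latent point is decoded.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy stand-ins for Enc_{(thn)_1} / Dec_{(thn)_2} on a latent space of
# dimension d = 2, with n = 4 placeholder observations in R^3.
n, d = 4, 2
obs = rng.normal(size=(n, 3))  # placeholder observations O_1, ..., O_n

def encode(o):
    """Placeholder encoder: returns (mu, sigma) parametrizing g(.|o)."""
    return np.tanh(o[:d]), 0.5 + 0.5 / (1.0 + o[:d] ** 2)

def decode(u):
    """Placeholder decoder: any deterministic map from R^d to the data space."""
    return np.concatenate([u, [u @ u]])

def generate():
    """One draw of pi(Gen(Z)), Z = (uniform index, d standard Gaussians)."""
    i = rng.integers(n)              # Z's first coordinate selects O_i
    z = rng.standard_normal(d)       # Z's Gaussian coordinates
    mu, sigma = encode(obs[i])       # encode O_i
    u = mu + sigma * z               # reparametrized draw from g(.|O_i)
    return decode(u)                 # decode into a synthetic observation

sample = generate()
assert sample.shape == (3,)
```

Marginalizing over the index and the Gaussian coordinates yields the mixture density displayed above.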
## About the over-parametrization {#sec-over-parametrization}
In @sec-VAE-formal we acknowledged that the models $\{\Enc_{\theta_{1}} :
\theta_{1} \in \Theta_{1}\}$ and $\{\Dec_{\theta_{2}} : \theta_{2} \in
\Theta_{2}\}$ are over-parametrized in the sense that the dimension of the
parameter set $\Theta_{1}\times \Theta_{2}$ is potentially large. For
instance, the dimension of the model that we build in the next section is
1157. This is a common feature of neural networks.
Our models are also over-parametrized in the sense that they are not
identifiable. This is obviously the case because of the loss of information
incurred when an observation $O$ is obtained as a piece $\pi(X)$ of a complete
data $X$ that we never observe in its entirety.
::: {.callout-note appearance="simple"}
## Running example, cted.
In particular, in the context of this example, it is well known that we cannot
learn from $O_{1}, \ldots, O_{n}$ any feature of the joint law of the
counterfactual random variables $(Y[0], Y[1])$ that does not reduce to a
feature of the marginal laws of $Y[0]$ or $Y[1]$, unless we make very strong
assumptions on this joint law (e.g., that $Y[0]$ and $Y[1]$ are independent).
:::
This is not a source of concern. First, it is generally recognized that the
fitting of neural networks often benefits from the high dimensionality of the
optimization space and the presence of numerous equivalently good local
optima, resulting in a redundant optimization landscape
[@choromanska2015loss; @arora2018optimization]. Second, our objective is to
construct a generator that approximates the law $P$ of $O_{1}, \ldots, O_{n}$,
generating $O\in\calO$ by first producing $X\in\calX$ (via $\Gen_{\theta}(Z)$)
and then providing $\pi(X)$. The fact that two different generators
$\Gen_{\theta}$ and $\Gen_{\theta'}$ can perform equally well is not
problematic. Identifying *one* generator $\Gen_{\theta}$ that performs well
is sufficient.
It is possible to search for generators that satisfy user-supplied
constraints, provided these can be expressed as a real-valued criterion
$F(E[\calC(\Gen_{\theta}(Z))])$. For example, one may wish to construct a
generator $\Gen_{\theta}$ such that the components of $X$ under
$\law(\Gen_{\theta})$ exhibit a pre-specified correlation pattern (as
demonstrated in the simple example below).
To focus the optimization procedure on generators that approximately meet
these constraints, one can modify the original criterion @eq-VAE by adding a
penalty term. Specifically, given a user-supplied hyper-parameter $\lambda >
0$, we can substitute
$$
\max_{\theta\in\Theta} \left\{\sum_{i=1}^{n}\LB_{\theta}(O_{i}) +
\lambda F(E_{Z \sim \Unif\{1, \ldots, n\} \otimes N(0,1)^{\otimes
d}}[\calC(\Gen_{\theta}(Z))])\right\}
$${#eq-penalized-VAE}
for @eq-VAE. From a computational perspective, this adjustment simply
involves adding the term
$$
\lambda F\left(\frac{1}{m}\sum_{i=1}^{m} \calC(\Gen_{\theta} (Z_{m+i}))\right)
$${#eq-penalization-VAE}
to the expressions within the curly brackets in the definition of $g$ in the
algorithm described in @sec-algo.
::: {.callout-note appearance="simple"}
## Running example, cted.
In particular, in the context of this example, we could look for generators
$\Gen_{\theta}$ such that the correlation of $Y[0]$ and $Y[1]$ under
$\law(\Gen_{\theta} (Z))$ be close to a target correlation $r \in ]-1, 1[$.
In that case, we could choose $\calC(\Gen_{\theta} (Z)) := (Y[0]Y[1],
Y[0]^{2}, Y[1]^{2}, Y[0], Y[1])$ and $F : (a,b,c,d,e) \mapsto |(a -
de)/\sqrt{(b - d^2)(c - e^2)} - r|$.
:::
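The penalty ingredient $F(E[\calC(\Gen_{\theta}(Z))])$ of @eq-penalized-VAE, with the choices of $\calC$ and $F$ made in the example above, can be evaluated empirically as follows. The counterfactuals $(Y[0], Y[1])$ below are hypothetical generator outputs, here jointly Gaussian with correlation $0.8$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical synthetic counterfactuals produced by a generator:
# (Y[0], Y[1]) jointly Gaussian with correlation 0.8.
n_sim, true_corr = 200_000, 0.8
y0 = rng.standard_normal(n_sim)
y1 = true_corr * y0 + np.sqrt(1 - true_corr**2) * rng.standard_normal(n_sim)

# C(Gen(Z)) = (Y[0]Y[1], Y[0]^2, Y[1]^2, Y[0], Y[1]), averaged over draws.
a, b, c, d, e = (np.mean(y0 * y1), np.mean(y0**2), np.mean(y1**2),
                 np.mean(y0), np.mean(y1))

# F recovers |corr(Y[0], Y[1]) - r| from the averaged moments.
r = 0.5
F = abs((a - d * e) / np.sqrt((b - d**2) * (c - e**2)) - r)

assert abs(F - abs(true_corr - r)) < 0.01  # here, approximately |0.8 - 0.5|
```

During training, the same computation would be applied to the $m$ Monte Carlo draws $\Gen_{\theta}(Z_{m+1}), \ldots, \Gen_{\theta}(Z_{2m})$ appearing in @eq-penalization-VAE.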