::: {.content-hidden}
\(
\newcommand{\bbR}{\mathbb{R}}
\newcommand{\calA}{\mathcal{A}}
\newcommand{\calC}{\mathcal{C}}
\newcommand{\calD}{\mathcal{D}}
\newcommand{\calF}{\mathcal{F}}
\newcommand{\calL}{\mathcal{L}}
\newcommand{\calM}{\mathcal{M}}
\newcommand{\calO}{\mathcal{O}}
\newcommand{\calP}{\mathcal{P}}
\newcommand{\calQ}{\mathcal{Q}}
\newcommand{\calX}{\mathcal{X}}
\newcommand{\Os}{O^{\sharp}}
\newcommand{\pr}{\widehat{\text{pr}}}
\newcommand{\thn}{\widehat{\theta}_{n}}
\newcommand{\card}{\mathop{\mathrm{card}}}
\newcommand{\Class}{\mathop{\mathrm{Class}}}
\newcommand{\Corr}{\mathop{\mathrm{Corr}}}
\newcommand{\Dec}{\mathop{\mathrm{Dec}}}
\newcommand{\diag}{\mathop{\mathrm{diag}}}
\newcommand{\Enc}{\mathop{\mathrm{Enc}}}
\newcommand{\expit}{\mathop{\mathrm{expit}}}
\newcommand{\Gen}{\mathop{\mathrm{Gen}}}
\newcommand{\KL}{\mathop{\mathrm{KL}}}
\newcommand{\law}{\mathop{\mathrm{law}}}
\newcommand{\LB}{\mathop{\mathrm{LB}}}
\newcommand{\Unif}{\mathop{\mathrm{Unif}}}
\newcommand{\ReLU}{\mathop{\mathrm{ReLU}}}
\)
:::
> Asher Rubin walks out of the starosta's home and heads toward the market square. With evening, the sky has cleared, and now a million stars are shining, but their light is cold and brings down a frost upon the earth, upon Rohatyn. The first of this autumn. Rubin pulls his black wool coat tighter around him; tall and thin, he looks like a vertical line.
[@Tokarczuk, I(3)]
# Introduction
## Fiction as the original simulation
One of humanity's oldest creative endeavors, fiction represents an early form
of simulation. It extends the imaginative play where children create
scenarios, roles, or worlds that are not constrained by the rules of reality,
that is "childhood pretence" [@Carruthers] or "the make-believe games" of
children [@Walton]. Through stories, myths, and imagined worlds, humans
construct alternative realities to explore ideas, express emotions, and
reflect on their existence. By presenting hypothetical scenarios and posing
"what if things had been different" questions [@Pearl, page 34], fiction
empowers individuals to explore alternative histories, draw insights from the
experiences of others, and engage with possibilities that extend beyond the
confines of the physical world. At its core, fiction abstracts and
reconstructs elements of reality. An author selectively includes, exaggerates,
or omits aspects of the real world, creating models that serve their artistic
or thematic intentions. From Homer's *Odyssey* [@Jaccottet] to speculative
tales like Mary Shelley's *Frankenstein* [@Shelley], fiction mirrors the
complexities of human life, enabling readers to engage with an imagined
reality that resonates with their own.
The relationship between fiction and reality has long been a subject of
debate. Plato, in his critique of art, viewed fiction as a mere imitation of
the physical world, itself a flawed reflection of the ideal "Forms". By this
reasoning, fiction is a "simulation of a simulation", twice removed from truth
[@Platon, Livre X]. Aristotle, by contrast, argued that fiction, through
"mimesis", the imitation of action and life, can illuminate universal truths
[@Aristote, Chapitres 1 à 5]. By abstracting from the particular, fiction
allows exploration of broader patterns and principles.
Following Aristotle's perspective, this tradition of creating and interacting
with imagined realities provides a natural foundation for distinguishing
scientific theories from scientific models [@Barberousse] and understanding
modern simulations. While they stem from the same drive to represent and
explore, scientific theories, scientific models and modern simulations
introduce a higher degree of mathematical rigor. Nevertheless, fiction
remains their conceptual ancestor, reminding us that the human impulse to
model and engage with alternate realities is as old as storytelling itself.
## From modern simulations to computer simulations {#sec-modern-computer-simulations}
The concept of modern simulations predates the modern era. Early instances
include mechanical devices like the Antikythera mechanism, a sophisticated
analog computer from the 2nd century BCE designed to simulate celestial
movements [and the MacGuffin chased by Indiana Jones in the 2023 installment
of the franchise, @Antikythera]. The emergence of mathematical models in the works of
Galileo and Newton introduced a new form of simulation, where equations were
used to predict physical phenomena with increasing precision. By the 18th
century, probabilistic experiments like Buffon's Needle, designed to
approximate the number $\pi$ [@Proofs, section 24], demonstrated the power of
simulating complex systems. However, the advent of computer simulations, as
we understand them today, began during World War II with the work of J.\ von
Neumann and S.\ Ulam [@metropolis1949monte].
While studying neutron behavior, they faced a challenge that was too complex
for theoretical analysis and too hazardous, time-consuming, and costly to
investigate experimentally. Fundamental properties (e.g., possible events and
their probabilities) and basic quantities (e.g., the average distance a
neutron would travel before colliding with an atomic nucleus, the likelihood
of absorption or reflection, and energy loss after collisions) were known, but
predicting the outcomes of entire event sequences was infeasible. To address
this challenge, they devised a method of generating random sequences step by
step using a computer, naming it "Monte Carlo" after the casino, a suggestion
by N.\ Metropolis. Statistical analysis of the data produced by repeatedly
applying this method provided sufficiently accurate solutions to better
understand nuclear chain reactions, a crucial aspect of designing atomic bombs
and later nuclear reactors. This breakthrough marked the birth of modern
computer simulations.
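The spirit of the probabilistic experiments mentioned above can be sketched in a few lines with Buffon's Needle; the needle length `l`, line spacing `t`, and number of drops below are arbitrary choices made for illustration:

```python
import numpy as np

# Buffon's Needle: a needle of length l is dropped on a floor ruled with
# parallel lines a distance t >= l apart; it crosses a line with
# probability 2l/(pi*t), which yields a Monte Carlo estimate of pi.
rng = np.random.default_rng(2024)
l, t, n_drops = 1.0, 2.0, 10**6
x = rng.uniform(0, t / 2, n_drops)          # distance from needle center to nearest line
theta = rng.uniform(0, np.pi / 2, n_drops)  # acute angle between needle and lines
crosses = x <= (l / 2) * np.sin(theta)
pi_hat = (2 * l * n_drops) / (t * crosses.sum())
print(pi_hat)
```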
Today, computer simulations, henceforth referred to simply as *simulations*,
play a fundamental role in applied mathematics. Generally, conducting a
simulation involves running a computer program (a "simulator") designed to
represent a "system of interest" at a problem-dependent level of abstraction
(that is, with a specific degree of complexity) and collecting the numerical
output for analysis.
Examples of systems of interest are virtually limitless and highly diverse.
They can represent a real-world process in a holistic fashion, such as the
regular functioning of a person's heart at rest, or the medical trajectories
of a cohort of patients undergoing chemotherapy. Alternatively, in a more
focused fashion, they can consist of a hybrid pipeline that combines an
upstream real-world process with downstream data processing of intermediary
outputs, such as the estimation of peripheral oxygen saturation in a healthy
patient using a pulse oximeter. Regardless of the context, determining the
appropriate levels of abstraction and realism is always a significant
challenge.
Here, we focus on simulations used to evaluate the performance of statistical
procedures through simulation studies, as discussed by @MWC19 in their
excellent tutorial on the design and conduct of such studies. The interested
reader will find in their work a carefully curated list of books on simulation
methods in general and articles emphasizing rigor in specific aspects of
simulation studies. Specifically, we consider scenarios where a statistician,
interested in a real-world process, has developed an algorithm tailored to
learning a particular feature of that process from collected data and seeks to
assess the algorithm's performance through simulations.
Once the simulator is devised, the following process is repeated multiple
times. In each iteration, typically independently from previous iterations:
first, the simulator generates a synthetic data set of size $n$; second, the
algorithm is run on the generated data; third, the algorithm's output is
collected for further analysis. After completing these iterations, the next
step is to compare the outcome from one run to the algorithm's target. This is
made possible by the design of the simulator. Finally, the overall
performance of the algorithm is assessed by comparing all the results
collectively to the algorithm's target. Depending on the task, this evaluation
can involve assessing the algorithm's ability to well estimate its target, the
validity of the confidence regions it constructs for its target, the
algorithm's ability to detect whether its target lies within a specified null
domain (using an alternative domain as a reference), and more. This list is
far from exhaustive. The entire process can be repeated multiple times, for
example, to assess how the algorithm's performance depends on $n$.
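As a minimal sketch of this repeated procedure, consider a hypothetical toy setting where the simulator draws $n$ i.i.d. Gaussian observations, the algorithm estimates their mean, and the target is known by design of the simulator:

```python
import numpy as np

rng = np.random.default_rng(0)
target = 1.0                      # the algorithm's target, known by design

def simulator(n):
    # Generate a synthetic data set of size n.
    return rng.normal(loc=target, scale=2.0, size=n)

def algorithm(data):
    # Run the algorithm on the generated data.
    return data.mean()

# Repeat the generate-run-collect iteration, independently each time.
n, n_reps = 100, 1000
estimates = np.array([algorithm(simulator(n)) for _ in range(n_reps)])

# Assess the results collectively against the known target.
bias = estimates.mean() - target
mc_se = estimates.std(ddof=1) / np.sqrt(n_reps)
print(f"bias: {bias:.4f} (Monte Carlo standard error: {mc_se:.4f})")
```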
However, in order to carry out these steps, the statistician must first devise
a simulator. This simulator should ideally generate synthetic data that
resemble the real-world data in a meaningful way, a goal that is often
difficult to achieve. So, how can one design a realistic simulator, and what
does "realistic simulator" even mean in this context? These are the central
questions we explore in this work.
## A probabilistic stance
We adopt a probabilistic framework to model the data collected by observing a
real-world process. Specifically, the data are represented as a random
variable $O^{n}$ ($O$ as in *observations*) drawn from a probability law
$P^{n}$ ($P$ as in *probability*). The law $P^{n}$ is assumed to belong to a
statistical model $\calM^{n}$ ($\calM$ as in *model*), which is the set of all
probability laws on the space $\calO^{n}$ where $O^{n}$ takes its values. This
model incorporates constraints that reflect known properties of the real-world
process and, where necessary, minimal assumptions about it.
The superscript $n$ indicates an amount of information. For example, in the
context of this study, $n$ typically represents the number of elementary
observations drawn independently from a law $P$ on $\calO$ and gathered in
$O^{n}$. In this case, $\calO^{n}$ corresponds to the Cartesian product $\calO
\times \cdots\times \calO$ (repeated $n$ times) and $P^{n}$ to the product law
$P^{\otimes n}$, with $O^{n}$ decomposing as $(O_{1}, \ldots, O_{n})$.
The feature of interest is an element of a space $\calF$ (e.g., a subset of
the real line, or a set of functions). It is modeled as the value
$\Psi(P^{n})$ of a functional $\Psi:\calM^{n} \to \calF$ evaluated at $P^{n}$.
The algorithm developed to estimate this feature is modeled as a functional
$\calA : \calO^{n} \to \calF$. Training the algorithm involves applying
$\calA$ to the observed data $O^{n}$, resulting in the estimator
$\calA(O^{n})$ for the estimand $\Psi(P^{n})$.
We emphasize that we address the questions closing
@sec-modern-computer-simulations without focusing on the specific nature of
the functional of interest $\Psi$: how can one design a realistic simulator,
and what does "realistic simulator" even mean in this context?
## Draw me a simulator {#sec-draw-me-a-model}
When constructing simulators, there is a spectrum of approaches, varying in
complexity and flexibility. At one end of the spectrum, simulators are built
upon relatively simple parametric models. While these models are sometimes
more elaborate, they often rely on standard forms or recurring techniques,
which streamlines their implementation. This approach is further reinforced
by the common practice of using models proposed by others. Doing so not only
saves effort but also facilitates meaningful comparisons between studies, as
the same modeling framework is shared.
Regardless of the model's simplicity, parametric simulators are inherently
limited and unable to capture the complexity of real-world processes. The term
"unnatural" aptly describes this shortcoming, as these models are
simplifications that abstract away many intricacies of reality. Even with
sophisticated parametrizations, it is fundamentally impossible for such
simulators to convincingly replicate the multifaceted interactions and
variability inherent in "nature". Thus, parametric simulators, by their very
essence, cannot achieve realism.
At the other end of the spectrum, one can also adopt a nonparametric approach
through bootstrapping, which involves resampling data directly from the
observed dataset. This method bypasses the need to specify a parametric model
and instead leverages the structure of the real data to generate simulated
samples.
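A minimal sketch of such a bootstrapped simulator, applied to a hypothetical dataset, simply resamples rows with replacement:

```python
import numpy as np

rng = np.random.default_rng(123)
observed = rng.normal(size=(50, 3))          # stand-in for real data O^{1:n}

def bootstrap_sample(data, m):
    # Sample indices uniformly in {1, ..., n} with replacement and
    # return the corresponding rows as a synthetic dataset of size m.
    idx = rng.integers(low=0, high=data.shape[0], size=m)
    return data[idx]

synthetic = bootstrap_sample(observed, 200)
print(synthetic.shape)
```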
Bootstrapping usually refers to a self-starting process that is supposed to
continue or grow without external input. The term is sometimes attributed to
the story where Baron Münchausen pulls himself and his horse out of a swamp by
his pigtail, not by his bootstraps [@Munchausen, Chapter 4]. In France,
"bootstrap" is sometimes translated as "à la Cyrano", in reference to the
literary hero Cyrano de Bergerac, who imagined reaching the moon by standing
on a metal plate and repeatedly using a magnet to propel himself [@Cyrano, Act
III, Scene 13].
When dealing with independent and identically distributed (i.i.d.) samples,
bootstrapping generates data that closely resemble the observed data.
However, the origin of the term "bootstrapping" suggests a measure of
incompleteness, and hence dissatisfaction, which is fitting in the context of this
article. Indeed, a bootstrapped simulator can be viewed as both transparent
and opaque, depending on the perspective. Conditionally on the real data, the
simulator's behavior is transparent, as understanding it reduces to
understanding the sampling mechanism over the set of indices $\{1, \ldots,
n\}$. Unconditionally, however, one is again confronted with the limitation
of knowledge about $P^{n}$, beyond recognizing it as an element of
$\calM^{n}$.
```{bash}
cowsay -f sheep \
"I am a simulator. Press ENTER to run the synthetic experiment."
```
In *Le Petit Prince* [@St-Ex], the Little Prince dismisses the pilot's simple
drawings of a sheep as unsatisfactory. Instead, he prefers a drawing of a box,
imagining the perfect sheep inside.
```{bash}
echo \
"I am a simulator. Press ENTER to run the synthetic experiment." | \
boxes -d cc
```
Similarly, in simulations, straightforward simulators often fail to capture
the complexity we seek, while black-box simulators, though opaque, can
sometimes offer greater efficiency. Unlike the Little Prince, however, we are
not content with the box alone -- we want to look inside, to understand and
refine the mechanisms driving our simulator.
## Organization of the article
In this article, we explore an avenue to build more realistic simulators by
using real data and neural networks, more specifically, Variational
Auto-Encoders (VAEs). To illustrate our approach, we focus on a simple example
rooted in causal analysis, as the causal framework presents particularly
interesting challenges.
@sec-objectives outlines our objectives and introduces a running example that
serves as a unifying thread throughout the study. @sec-nutshell provides a
concise overview of VAEs, including their formal definition and the key ideas
behind their training. @sec-build-VAE offers an explanation of how VAEs are
constructed, while @sec-implementation-VAE presents a comprehensive
implementation tailored to the running example. Using this VAE, @sec-sim-data
describes the construction of a simulator designed to approximate the law of
simulated data and discusses methods for evaluating the simulator's
performance. @sec-IWPC extends this approach to a real-world dataset.
Finally, @sec-conclusion concludes the article with a literature review, a
discussion of the challenges encountered, the limitations of the proposed
approach, and some closing reflections.
*Note that the online version of this article is preferable to the PDF
version, as it allows readers to directly view the code.* Throughout the
article, we use a mix of `Python` [@Python] and `R` [@R-base] for
implementation, leveraging commonly used libraries in both ecosystems.
# Objective {#sec-objectives}
Suppose that we have observed $O_{1}, \ldots, O_{n}, O_{n+1}, \ldots,
O_{n+n'}$ drawn independently from $P$, with $P$ known to belong to a model
$\calP$ consisting of laws on $\calO$. For brevity, we will use the notation
$O^{1:n} = (O_{1}, \ldots, O_{n})$ and $O^{(n+1):(n+n')} = (O_{n+1}, \ldots,
O_{n+n'})$.
Suppose moreover that we are interested in a causal framework where each
$O_{i}$ is viewed as a piece of a complete data $X_{i} \in \calX$ drawn from a
law $Q$ that lives in a model $\calQ$, with $X_{1}, \ldots, X_{n}, X_{n+1},
\ldots, X_{n+n'}$ independent. The piece $O_{i}$ is expressed $\pi(X_{i})$,
with the function $\pi$ projecting a complete data $X \sim Q \in \calQ$ onto a
*coarser* real-world data $O=\pi(X) \sim P \in \calP$.
Our objective is twofold. First, we aim to build a generator that
approximates $P$, that is, an element of $\calP$ from which it is possible to
sample independent data that exhibit statistical properties similar to (or,
colloquially, "behave like") $O_{1}, \ldots, O_{n+n'}$. In other words, we
require that the generator produces data whose joint law approximates the law
of the observed data, ensuring that the generated samples reflect the same
underlying structure and dependencies as the real-world observations. Second,
we require the generator to correspond to the law of $\pi(X)$ with $X$ drawn
from an element of $\calQ$.
We use a running example throughout the document.
::: {.callout-note appearance="simple"}
## Running example.
For example, $\calP$ can be the set of all laws on $\calO := (\{0,1\}^{2}
\times \bbR^{3}) \times \{0,1\} \times \bbR$ such that
$$
O := (V,W,A,Y)\sim P\in \calP
$$
satisfies
$$
c \leq P(A=1|W,V), P(A=0|W,V) \leq 1-c
$$
$P$-almost surely for some $P$-specific constant $c \in ]0,1/2]$, and $Y$ is
$P$-integrable.
Moreover, we view $O$ as $\pi(X)$ with
$$
\begin{aligned}
X&:=(V, W, Y[0], Y[1], A) \in \calX:= ( \{0,1\}^{2} \times \bbR^{3}) \times
\bbR \times \bbR \times \{0,1\},\\
\pi&:(v,w,y[0],y[1],a) \mapsto (v,w, a, ay[1] + (1-a)y[0]),
\end{aligned}
$$
and $\calQ$ defined as the set of all laws on $\calX$ such that $X\sim Q\in
\calQ$ satisfies
$$
c' \leq Q(A=1|W,V), Q(A=0|W,V) \leq 1-c'
$$
$Q$-almost surely for some $Q$-specific constant $c' \in ]0,1/2]$, and $Y[0]$
and $Y[1]$ are $Q$-integrable.
We consider $(V, W)$ as the context in which two possible actions $a=0$ and
$a=1$ would yield the counterfactual rewards $Y[0]$ and $Y[1]$, respectively.
One of these actions, $A\in\{0,1\}$, is factually carried out, resulting in
the factual reward $Y = A Y[1] + (1-A) Y[0]$, that is, $Y[1]$ if $A=1$ and
$Y[0]$ otherwise. In the causal inference literature, this definition of $Y$
is referred to as the consistency assumption.
In the context of a randomized clinical trial, the pair $(V, W)$ represents
baseline covariates, which are measured before one of two possible treatments
is assigned. The binary variable $A$ indicates the treatment randomly assigned
to a participant. The outcome variable $Y$ measures the effect of the
treatment; it is viewed as the counterfactual outcome in a hypothetical
scenario where everyone would receive that same treatment.
:::
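For illustration, the projection $\pi$ of the running example can be coded directly; the numerical values below are arbitrary:

```python
import numpy as np

def pi(v, w, y0, y1, a):
    # Map a complete data X = (V, W, Y[0], Y[1], A) to the coarser
    # real-world data O = (V, W, A, Y).
    y = a * y1 + (1 - a) * y0      # consistency: Y = A Y[1] + (1-A) Y[0]
    return v, w, a, y

v = np.array([1, 0])
w = np.array([0.2, -1.3, 0.7])
print(pi(v, w, y0=0.1, y1=0.9, a=1))   # the factual reward is Y[1]
```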
::: {.callout-note appearance="simple"}
## Running example in action.
The `Python` function `simulate` defined in the next chunk of code
operationalizes drawing independent data from a law $P \in \calP$.
```{python function-simulate}
#| echo: true
import numpy as np

def simulate(n, dimV, dimW):
    def expit(x):
        return 1 / (1 + np.exp(-x))
    p = np.hstack((1/3 * np.ones((n, 1)), 1/2 * np.ones((n, 1))))
    V = np.random.binomial(n = 1, p = p)
    W = np.random.normal(loc = 0, scale = 1, size = (n, dimW))
    WV = np.hstack((W, V))
    pAgivenWV = np.clip(expit(0.8 * WV[:, 0]), 1e-2, 1 - 1e-2)
    A = np.random.binomial(n = 1, p = pAgivenWV)
    meanYgivenAWV = 0.5 * expit(-5 * A * (WV[:, 0] - 1)
                                + 3 * (1 - A) * (WV[:, 1] + 0.5)) \
        + 0.5 * expit(WV[:, 2])
    Y = np.random.normal(loc = meanYgivenAWV, scale = 1/25, size = n)
    dataset = np.vstack((np.transpose(WV), A, Y))
    dataset = np.transpose(dataset)
    return dataset
```
:::
Note that justifying the specific choices made while defining the function
`simulate` is unnecessary. In the context of this study, we are free from the
need for, or aspiration to, a realistic simulation scheme. Under the law $P$
that `simulate` samples from, $V$ and $W$ are independent; $V$ consists of two
independent variables $V_{1}$ and $V_{2}$ that are drawn from the Bernoulli
laws with parameters $\tfrac{1}{3}$ and $\tfrac{1}{2}$; $W$ is a standard
Gaussian random vector in $\bbR^{3}$. In addition, given $(W,V)$, $A$ is sampled from the
Bernoulli law with parameter
$$
\max\left\{0.01, \min\left[0.99, \expit(0.8 \times W_{1})\right]\right\}
$$
and, given $(A,W,V)$, $Y$ is sampled from the Gaussian law with mean
$$
\tfrac{1}{2} \expit\left[-5A\times(W_{1}-1) + 3(1-A)\times
(\tfrac{1}{2} + W_{2})\right] + \tfrac{1}{2} \expit(W_{3})
$$
and (small) standard deviation $\tfrac{1}{25}$. As noted in the introduction,
these choices rely on standard forms and recurring techniques.
::: {.callout-note appearance="simple"}
## Running example, continued.
For future use, we sample in the next chunk of code $n+n'=10^{4}$ independent
observations from $P$. Observations $O^{1:n}$ (gathered in `train`) will be
used for training and observations $O^{(n+1):(n+n')}$ (gathered in `test`)
will be used for testing.
```{python simulation}
#| echo: true
np.random.seed(54321)
dimV, dimW = 2, 3
n_train = int(5e3)
train = simulate(n_train, dimV, dimW)
test = simulate(n_train, dimV, dimW)
print("The three first observations in 'train':\n",
      " V_1 V_2 W_1 W_2 W_3 A Y\n",
      np.around(train[:3, [3, 4, 0, 1, 2, 5, 6]], decimals = 3))
## np.savetxt("data/train.csv", train, delimiter = ",")
## np.savetxt("data/test.csv", test, delimiter = ",")
```
:::
# VAE in a nutshell {#sec-nutshell}
## Formal definition {#sec-formal-definition}
In the context of this article, a Variational Auto-Encoder (VAE) [@VAE2013],
[@VAE2014] is an algorithm that, once trained, outputs a *generator*. The
generator is the law of a random variable of the form
$$
\Gen_{\theta} (Z)
$${#eq-gen-theta}
where
1. the source of randomness $Z$ in @eq-gen-theta is written as
$$
Z := (Z^{(0)}, \ldots, Z^{(d)})
$${#eq-alea}
with $Z^{(0)}, \ldots, Z^{(d)}$ independently drawn from
- the uniform law on $\{1, \ldots, n\}$ for $Z^{(0)}$
- the standard normal distribution for $Z^{(1)}, \ldots, Z^{(d)}$;
2. the function $\Gen_{\theta}$ in @eq-gen-theta is an element of a large
collection, parametrized by the finite-dimensional set $\Theta$, of functions
mapping $\bbR^{d+1}$ to $\calX$.
For simplicity, we will also refer to the function $\Gen_{\theta}$ as a
*generator*. Moreover, we emphasize that the entire parameter $\theta$ is to
be fitted, meaning there are no hyperparameters involved.
Because $\Gen_{\theta}(Z)$ belongs to $\calX$, we can evaluate $\pi \circ
\Gen_{\theta}(Z)$, hence the generator can also be used to generate random
variables in $\calO$. @fig-VAE illustrates the architecture of the VAE used
in this study. It shows the key components of the model, including the
encoder, the latent space, and the decoder, along with the flow of information
between them.
:::{#fig-VAE}
Architecture of the simulator. The figure depicts the flow of information
through the encoder, latent space, and decoder components. It emphasizes how
the input source of randomness $Z$ is transformed into a latent representation
and then reconstructed as a complete data, $X=\Gen_{\theta}(Z)$, which can be
mapped to a real-world data $O=\pi(X)$.
:::
The word "auto-encoder" reflects the nature of the parametric form of each
$\Gen_{\theta}$. We begin with a formal presentation in four steps, which is
then followed by a discussion of what each step implements. Specifically,
each $\Gen_{\theta}$ is written as a composition of four mappings $J_{n}$,
$\Enc_{\theta_{1}}$, $K$ and $\Dec_{\theta_{2}}$ with $\theta := (\theta_{1},
\theta_{2}) \in \Theta_{1} \times \Theta_{2} = \Theta$:
$$
\Gen_{\theta} = \Dec_{\theta_{2}} \circ K \circ \Enc_{\theta_{1}} \circ
J_{n}.
$$
Here,
1. $J_{n}: \{1, \ldots, n\} \times \bbR^{d} \to \calO \times \bbR^{d}$ is
such that
$$
J_{n} (Z) = (O_{i}, (Z^{(1)}, \ldots, Z^{(d)}))
$$
with $i = Z^{(0)}$;
2. $\Enc_{\theta_{1}} : \calO \times \bbR^{d} \to \bbR^{d} \times
(\bbR_{+}^{*})^{d} \times \bbR^{d}$ is such that,
if $\Enc_{\theta_{1}}(o,z) = (\mu, \sigma, z')$, then
- $z=z'$, and
- $\Enc_{\theta_{1}}(o,z'') = (\mu, \sigma, z'')$ for all $z''\in\bbR^{d}$;
3. $K : \bbR^{d} \times (\bbR_{+}^{*})^{d} \times \bbR^{d} \to \bbR^{d}$ is
given by
$$
K(\mu,\sigma,z) := \mu + \sigma \odot z,
$$
where $\odot$ denotes the componentwise product;
4. $\Dec_{\theta_{2}}$ maps $\bbR^{d}$ to $\calX$.
Conditionally on $O^{1:n}$ and $Z$, the computation of $\Gen_{\theta} (Z)$ is
deterministic. The process unfolds in four steps:
1. **Sampling and transfer.** Compute $J_{n}(Z)$, which involves sampling
one observation $O_{i}$ uniformly among all genuine observations and
transferring $(Z^{(1)}, \ldots, Z^{(d)})$ unchanged.
2. **Encoding step.** Compute $\Enc_{\theta_{1}} \circ J_{n}(Z)$, which
encodes $O_{i}$ as a vector $\mu \in \bbR^{d}$ and a $d\times d$ covariance
matrix $\diag(\sigma)^2$. This step does not modify $(Z^{(1)}, \ldots,
Z^{(d)})$, which is transferred unchanged.
3. **Gaussian sampling.** Compute $K\circ \Enc_{\theta_{1}} \circ J_{n}(Z)$ by
evaluating $\mu + \sigma \odot (Z^{(1)}, \ldots, Z^{(d)}) \in \bbR^{d}$. This
amounts to sampling from the Gaussian law with mean $\mu$ and covariance
matrix $\diag(\sigma)^2$. In other words, $K$ enables the sampling of a
neighbor of the encoded version of $O_i$ in the latent space.
4. **Decoding step.** Compute $\Dec_{\theta_{2}}\circ K\circ \Enc_{\theta_{1}}
\circ J_{n}(Z)$, which maps the encoded version of $O_{i}$, that is, $\mu +
\sigma \odot (Z^{(1)}, \ldots, Z^{(d)})$, to an element of $\calX$.
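The four-step computation can be sketched as follows, with hypothetical stand-ins for the trained encoder and decoder (in practice both are neural networks):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 100, 4
O = rng.normal(size=(n, 2))                     # stand-in for O^{1:n}

def J_n(Z):
    # Step 1: pick observation O_i with i = Z^(0); pass the rest through.
    i, gauss = Z[0], Z[1]
    return O[i], gauss

def Enc(o, z):
    # Step 2 (hypothetical encoder): map o to (mu, sigma); z is unchanged.
    mu = np.tanh(np.repeat(o.mean(), d))
    sigma = np.exp(np.repeat(o.std(), d))
    return mu, sigma, z

def K(mu, sigma, z):
    # Step 3: Gaussian sampling via the reparametrization mu + sigma * z
    # (componentwise product).
    return mu + sigma * z

def Dec(u):
    # Step 4 (hypothetical decoder): map the latent point to an element of calX.
    return np.concatenate([u, u**2])

Z = (rng.integers(n), rng.normal(size=d))       # Z^(0) uniform, rest Gaussian
X = Dec(K(*Enc(*J_n(Z))))
print(X.shape)
```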
## Formal training
Formally, training the VAE involves maximizing the likelihood of $O^{1:n}$
within a parametric model of laws by maximizing a lower bound of the
likelihood. This process begins with the introduction of a working model of
mixtures for $P$. The working model (undoubtedly flawed) postulates the
existence of a latent random variable $U\in \bbR^{d}$ and a parametric model
of *tractable* conditional densities
$$
\{o \mapsto p_{\theta_{2}} (o|u) : u \in \bbR^{d},
\theta_{2} \in \Theta_{2}\}
$$
such that
- $U$ is drawn from the standard Gaussian law on $\bbR^{d}$;
- there exists $\theta_{2} \in \Theta_{2}$ such that, given $U$, $O$ is drawn
from $p_{\theta_{2}} (\cdot|U)$.
Here, tractable densities refer to those that can be easily worked with
analytically, while in contrast, intractable densities are too complex to
handle directly.
Therefore, the working model (undoubtedly flawed) postulates the existence of
$\theta_{2} \in \Theta_{2}$ such that $P$ admits the generally *intractable*
density
$$
o \mapsto \int p_{\theta_{2}}(o|u) \phi_{d}(u)du
$$
where $\phi_{d}$ denotes the density of the standard Gaussian law on
$\bbR^{d}$. As suggested by the use of the parameter $\theta_{2}$, the
definition of the conditional densities $p_{\theta_{2}}(\cdot|u)$ ($u \in
\bbR^{d}$) involves the decoder $\Dec_{\theta_{2}}$.
Since directly maximizing the likelihood of $O^{1:n}$ under the working model
is infeasible, a *secondary* parametric model of *tractable* conditional
densities is introduced:
$$
\{u \mapsto g_{\theta_{1}} (u|O_{i}) : 1 \leq i \leq n, \theta_{1} \in
\Theta_{1}\}
$$
to model the conditional laws of $U$ given $O_{1}$, given $O_{2}$, $\ldots$,
given $O_{n}$. Here too, the use of the parameter $\theta_{1}$ indicates that
the definition of the conditional densities $g_{\theta_{1}} (\cdot|O_{i})$ ($1
\leq i \leq n$) involves the encoder $\Enc_{\theta_{1}}$.
Now, by Jensen's inequality, for any $1 \leq i \leq n$ and all
$\theta = (\theta_{1}, \theta_{2})\in \Theta$,
$$
\begin{aligned}
\log p_{\theta_{2}}(O_{i})
&= \log \int p_{\theta_{2}} (O_{i}|u)
\frac{\phi_{d}(u)}{g_{\theta_{1}}(u|O_{i})} g_{\theta_{1}}(u|O_{i}) du\\
&\geq \int \log\left(p_{\theta_{2}} (O_{i}|u)
\frac{\phi_{d}(u)}{g_{\theta_{1}}(u|O_{i})}\right) g_{\theta_{1}}(u|O_{i})
du\\
&= -\KL(g_{\theta_{1}}(\cdot|O_{i}); \phi_{d}) +
E_{U\sim g_{\theta_{1}}(\cdot|O_{i})} [\log p_{\theta_{2}}(O_{i}|U)]=:
\LB_{\theta}(O_{i}),
\end{aligned}
$${#eq-LB}
where $\KL$ denotes the Kullback-Leibler divergence and $U$ in the expectation
is drawn from the conditional law with density $g_{\theta_{1}}(\cdot |
O_{i})$. The notation $\LB$ is used to indicate that it represents a lower
bound. Thus, the log-likelihood of $O^{1:n}$ under $\theta_{2} \in \Theta_{2}$
is lower-bounded by
$$
\sum_{i=1}^{n} \LB_{\theta}(O_{i})
$$
for all $\theta_{1} \in \Theta_{1}$. As suggested earlier, training the VAE
formally consists of solving
$$
\max_{\theta\in\Theta} \left\{\sum_{i=1}^{n}\LB_{\theta}(O_{i})\right\}
$${#eq-VAE}
rather than solving
$$
\max_{\theta_{2} \in \Theta_{2}} \left\{\sum_{i=1}^{n} \log p_{\theta_{2}}
(O_{i})\right\}.
$$
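As a numerical sanity check of @eq-LB, the following sketch (illustrative only; it is not part of the VAE implementation) uses a one-dimensional toy working model where everything is tractable: $U \sim N(0,1)$, $O$ given $U=u$ drawn from $N(u,1)$, so that marginally $O \sim N(0,2)$, and an arbitrary Gaussian approximate posterior $g(\cdot|o) = N(m, s^{2})$. It verifies that $\LB_{\theta}(O_{i})$ indeed lower-bounds the log-likelihood; all numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy working model (hypothetical, for illustration only):
#   U ~ N(0, 1),  O | U=u ~ N(u, 1)  =>  marginally O ~ N(0, 2).
# Approximate posterior g(.|o) = N(m, s^2) with arbitrary (m, s).
o, m, s = 1.3, 0.6, 0.8

# Closed-form KL(N(m, s^2); N(0, 1)) (cf. @eq-LB-LHS-term with d = 1).
kl = 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))

# Monte Carlo reconstruction term E_{U ~ g(.|o)}[log p(o|U)]
# via the reparametrization U = m + s * Z, Z ~ N(0, 1).
z = rng.standard_normal(100_000)
u = m + s * z
recon = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (o - u) ** 2)

lb = -kl + recon                                # LB_theta(o), as in @eq-LB
log_p = -0.5 * np.log(4 * np.pi) - o**2 / 4.0  # exact log-likelihood of N(0, 2)

assert lb <= log_p  # Jensen's inequality: the bound holds
```

Maximizing the lower bound over $(m, s)$ would here recover the exact posterior $N(o/2, 1/2)$, for which the bound is tight.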
# How to build the VAE {#sec-build-VAE}
## A formal description {#sec-VAE-formal}
We implement the classes of encoders and decoders, that is
$\{\Enc_{\theta_{1}} : \theta_{1} \in \Theta_{1}\}$ and $\{\Dec_{\theta_{2}} :
\theta_{2} \in \Theta_{2}\}$, as neural network models. Each encoder
$\Enc_{\theta_{1}}$ and each decoder $\Dec_{\theta_{2}}$ consists of a stack of
layers of two types: *densely-connected* and *activation* layers (linear, $x
\mapsto x$; ReLU: $x \mapsto \max(0, x)$; softmax: $(x_{1}, x_{2}) \mapsto
(e^{x_{1}}, e^{x_{2}})/(e^{x_{1}} + e^{x_{2}})$). The neural networks are
rather simple in design, but nevertheless (moderately) high-dimensional and
arguably over-parametrized, as discussed in @sec-over-parametrization.
The model $\{u \mapsto g_{\theta_{1}}(u|O_{i}) : 1 \leq i \leq n, \theta_{1}
\in \Theta_{1}\}$ is chosen such that $U$ drawn from
$g_{\theta_{1}}(\cdot|O_{i})$ is a Gaussian vector with mean $\mu_{i}$ and
covariance matrix $\diag(\sigma_{i})^{2}$, where $(\mu_{i}, \sigma_{i})$ is the
$\theta_{1}$-specific encoding of $O_{i}$, that is, where
$\Enc_{\theta_{1}}(O_{i}, \cdot) = (\mu_{i}, \sigma_{i}, \cdot)$.
Remarkably, the left-hand side term in the definition of $\LB_{\theta}(O_{i})$
(@eq-LB) is then known in closed form:
$$
-\KL(g_{\theta_{1}}(\cdot|O_{i}); \phi_{d}) = \tfrac{1}{2} \sum_{j=1}^{d}
\left(1 + \log (\sigma_{i}^{2})_{j} - (\sigma_{i}^{2})_{j} - (\mu_{i}^{2})_{j}
\right),
$${#eq-LB-LHS-term}
where $(\mu_{i}^{2})_{j}$ and $(\sigma_{i}^{2})_{j}$ are the $j$-th components
of $\mu_{i} \odot \mu_{i}$ and $\sigma_{i} \odot \sigma_{i}$, respectively.
This is very convenient, because @eq-LB-LHS-term makes estimating the term
$\KL(g_{\theta_{1}}(\cdot|O_{i}); \phi_{d})$ unnecessary, a task that would
otherwise introduce more variability in the procedure.
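The closed form @eq-LB-LHS-term can itself be checked numerically. The sketch below (with arbitrary values of $\mu_{i}$ and $\sigma_{i}$) compares it with a plain Monte Carlo estimate of the Kullback-Leibler divergence between $N(\mu_{i}, \diag(\sigma_{i})^{2})$ and $N(0, I_{d})$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
mu = rng.normal(size=d)                  # arbitrary mean vector
sigma = rng.uniform(0.5, 1.5, size=d)    # arbitrary positive scales

# Closed form of -KL(N(mu, diag(sigma)^2); N(0, I_d)), as in @eq-LB-LHS-term.
neg_kl = 0.5 * np.sum(1.0 + np.log(sigma**2) - sigma**2 - mu**2)

# Monte Carlo check: KL = E_{U ~ g}[log g(U) - log phi_d(U)].
u = mu + sigma * rng.standard_normal((500_000, d))
log_g = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
               - 0.5 * ((u - mu) / sigma) ** 2, axis=1)
log_phi = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * u**2, axis=1)
kl_mc = np.mean(log_g - log_phi)

assert abs(-neg_kl - kl_mc) < 0.05  # the two estimates agree
```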
As for the model $\{o \mapsto p_{\theta_{2}}(o|u) : u \in \bbR^{d}, \theta_{2}
\in \Theta_{2}\}$, the only requirement is that it must be chosen in such a
way that $\log p_{\theta_{2}}(O_{i} | u)$ be computable for all $1\leq i \leq
n$, $\theta_{2} \in \Theta_{2}$ and $u \in \bbR^{d}$. This is not a tall
order whenever $O$ can be decomposed as a sequence of (e.g., time-ordered)
random variables that are vectors with categorical, integer, or real
entries. Indeed, it then suffices (i) to decompose the likelihood accordingly
under the form of a product of conditional likelihoods, and (ii) to choose a
tractable parametric model for each factor in the decomposition. We
illustrate the construction of $\{o \mapsto p_{\theta_{2}}(o|u) : u \in
\bbR^{d}, \theta_{2} \in \Theta_{2}\}$ in the context of our running example.
::: {.callout-note appearance="simple"}
## Running example, cted.
In the context of this example, $O=(V,W,A,Y)$ with $V \in \{0,1\}^{2}$, $W \in
\bbR^{3}$, $A\in\{0,1\}$ and $Y\in\bbR$. Since the source of randomness $Z$
has dimension $(d+1)$, $d$ must satisfy $d = d_{1} + 3$ for some integer
$d_{1} \geq 1$.
Set $\theta = (\theta_{1}, \theta_{2})\in \Theta$, $u \in \bbR^{d}$, and let
$\pi\circ \Dec_{\theta_{2}} (u) = (\tilde{v}, \tilde{w}, \tilde{a}, \tilde{y})
\in \calO$. The conditional likelihood $p_{\theta_{2}}(O|u)$ (of $O$ given
$U=u$) equals
$$
p_{\theta_{2}}(V,W|u) \times p_{\theta_{2}}(A|W,V,u) \times p_{\theta_{2}}(Y|A,W,V,u)
$$
so it suffices to define the conditional likelihoods $p_{\theta_{2}}(V,W|u)$
(of $(V,W)$ given $U=u$), $p_{\theta_{2}}(A|W,V,u)$ (of $A$ given $(W,V)$ and
$U=u$) and $p_{\theta_{2}}(Y|A,W,V,u)$ (of $Y$ given
$(A,W,V)$ and $U=u$).
- We decide that $V$ and $W$ are conditionally independent given $U$ under
$p_{\theta_{2}}(\cdot|u)$. Therefore, it suffices to characterize the
conditional likelihoods $p_{\theta_{2}}(V|u)$ (of $V$ given $U=u$) and
$p_{\theta_{2}}(W|u)$ (of $W$ given $U=u$).
- We choose $w\mapsto p_{\theta_{2}}(w|u)$ to be the Gaussian density with
mean $\tilde{w}$ and identity covariance matrix.
:::
::: {.callout-note appearance="simple"}
## Running example, cted.
- The description of the conditional law of $V$ given $U=u$ under
$p_{\theta_{2}}(\cdot|u)$ is slightly more involved. It requires that we give
more details on the encoders and decoders.
- Like every encoder, $\Enc_{\theta_{1}}$ actually maps $\calO \times
\bbR^{d}$ to $[\bbR^{d_{1}}\times \{0\}^{3}] \times
[(\bbR_{+}^{*})^{d_{1}} \times \{1\}^{3}] \times \bbR^{d}$. In words, if
$\Enc_{\theta_{1}}(o, \cdot) = (\mu, \sigma, \cdot)$, then it necessarily
holds that the last three components of $\mu$ and $\sigma$ are 0 and 1,
respectively. Therefore, the last three components of the random vector $K
\circ \Enc_{\theta_{1}} \circ J_{n}(Z)$ equal $Z^{(d-2)}, Z^{(d-1)},
Z^{(d)}$, three independent standard normal random variables.
- To compute $\Dec_{\theta_{2}}(u) = (\tilde{v}, \tilde{w}, \tilde{y}_{0},
\tilde{y}_{1}, \tilde{a}) \in \calX$, we actually compute $\tilde{w}$
*then* $\tilde{v}$, *then* $(\tilde{y}_{0}, \tilde{y}_{1},\tilde{a})$.
- The output $\tilde{w}$ is a $\theta_{2}$-specific deterministic
function of the first $d_{1}$ components of $u$.
- The output $\tilde{v}$ is a $\theta_{2}$-specific deterministic
function of the first $(d_{1}+2)$ components of $u$.
More specifically, two (latent) probabilities $\tilde{g}_{1},
\tilde{g}_{2}$ are first computed, as $\theta_{2}$-specific
deterministic functions of the first $d_{1}$ components of $u$. Then
$\tilde{v}_{1}$ and $\tilde{v}_{2}$ are set to
$\textbf{1}\{\Phi(u^{(d_{1}+1)}) \leq \tilde{g}_{1}\}$ and
$\textbf{1}\{\Phi(u^{(d_{1}+2)}) \leq \tilde{g}_{2}\}$, where $\Phi$
denotes the standard normal cumulative distribution function (c.d.f.)
and $u^{(d_{1}+1)}, u^{(d_{1}+2)}$ are the $(d_{1}+1)$-th and
$(d_{1}+2)$-th components of $u$.
For instance, $\tilde{v}_{1}$ is given the value 1 if
$\Phi(u^{(d_{1}+1)}) \leq \tilde{g}_{1}$ and 0 otherwise. Note that
$\textbf{1}\{\Phi(Z^{(d_{1}+1)}) \leq \tilde{g}_{1}\}$ follows the
Bernoulli law with parameter $\tilde{g}_{1}$ because $Z^{(d_{1}+1)}$
is drawn from the standard normal law.
- The output $(\tilde{y}_{0},
\tilde{y}_{1})$ is a $\theta_{2}$-specific deterministic function of
$(\tilde{v}, \tilde{w})$ and the first $d_{1}$ components of $u$.
- The output $\tilde{a}$ is a $\theta_{2}$-specific deterministic
function of $(\tilde{v}, \tilde{w})$ and the last component of $u$.
More specifically, a (latent) probability $\tilde{h}$ is first
computed, as a $\theta_{2}$-specific deterministic function of
$(\tilde{v}, \tilde{w})$. Then $\tilde{a}$ is set to
$\textbf{1}\{\Phi(u^{(d)}) \leq \tilde{h}\}$.
Note that $\textbf{1}\{\Phi(Z^{(d)}) \leq \tilde{h}\}$ follows the
Bernoulli law with parameter $\tilde{h}$ because $Z^{(d)}$ is drawn
from the standard normal law.
We are now in a position to describe the conditional law of $V$ given
$U=u$. We decide that, conditionally on $U=u$, under
$p_{\theta_{2}}(\cdot|u)$, $V_{1}$ and $V_{2}$ are independently drawn
from the Bernoulli laws with parameters $\tilde{g}_{1}$ and
$\tilde{g}_{2}$. Thus, $p_{\theta_{2}}(\cdot|u)$ is such that
$p_{\theta_{2}}(v|u) = [v_{1} \tilde{g}_{1} + (1-v_{1})(1-\tilde{g}_{1})]
\times [v_{2} \tilde{g}_{2} + (1-v_{2})(1-\tilde{g}_{2})]$ for $v=(v_{1},
v_{2}) \in \{0,1\}^{2}$.
:::
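The construction $\textbf{1}\{\Phi(Z) \leq \tilde{g}\}$ used above can be checked empirically: since $\Phi(Z)$ is uniformly distributed on $[0,1]$ when $Z$ is standard normal, the indicator follows the Bernoulli law with parameter $\tilde{g}$. A minimal sketch, with an arbitrary value of $\tilde{g}$:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

def Phi(x):
    """Standard normal c.d.f.: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

g_tilde = 0.3                      # a latent probability, e.g. tilde{g}_1
z = rng.standard_normal(200_000)   # the Gaussian coordinate Z^{(d_1+1)}

# Phi(Z) is uniform on [0, 1], so the indicator below is Bernoulli(g_tilde).
v = (Phi(z) <= g_tilde).astype(int)

assert abs(v.mean() - g_tilde) < 0.01
```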
::: {.callout-note appearance="simple"}
## Running example, cted.
- The description of the conditional law of $A$ given $(W, V)$ and $U=u$ under
$p_{\theta_{2}}(\cdot|u)$ is similar to that of $V$ given $U$. We decide
that, conditionally on $(W, V)$ and $U=u$, under $p_{\theta_{2}}(\cdot|W, V,
u)$, $A$ follows the Bernoulli law with parameter
$\tilde{\underline{h}}(V,W)$, where the probability
$\tilde{\underline{h}}(v,w)$ lies between $\tilde{h}$ and
$\bar{A}_{n}:=\tfrac{1}{n}\sum_{i=1}^{n} A_{i}$ and is given, for any $(v,w)
\in \{0,1\}^{2} \times \bbR^{3}$, by
$$
\begin{aligned}
\tilde{\underline{h}}(v,w) :=& t(v,w) \tilde{h} + [1 - t(v,w)] \bar{A}_{n} \quad \text{with}\\
-10\log t(v,w) =& - \left[v_{1} \log \tilde{g}_{1} + (1-v_{1}) \log
(1-\tilde{g}_{1})\right]\\
& - \left[v_{2} \log \tilde{g}_{2} + (1-v_{2}) \log (1-\tilde{g}_{2})\right]\\
&+\|w - \tilde{w}\|_{2}^{2}.
\end{aligned}
$$
Thus, $p_{\theta_{2}}(\cdot|W,V,u)$ is such that $p_{\theta_{2}}(a|W,V,u)
= a \tilde{\underline{h}}(V,W) + (1-a)(1-\tilde{\underline{h}}(V,W))$ for
$a \in \{0,1\}$.
- Finally, we choose $y \mapsto p_{\theta_{2}}(y|A,W,V,u)$ to be the
two-regime density given by
$$
p_{\theta_{2}}(y|A,W,V,u) =
\frac{\textbf{1}\{A=\tilde{a}\}}{\tilde{s}(W)} \phi_{1}
\left(\frac{y - \tilde{y}}{\tilde{s}(W)}\right) + \textbf{1}\{A\neq
\tilde{a}\} C^{-1}
$$
where $\tilde{s}(w):= \tfrac{1}{\sqrt{5}} \|w - \tilde{w}\|_{2}$ for any $w \in
\bbR^{3}$ and $C$ is the
Lebesgue measure of the support of the marginal law of $Y$ under $P$ (it
does not matter if $C$ is unknown).
Thus, two cases arise:
- If $A = \tilde{a}$, then $Y$ is conditionally drawn under
$p_{\theta_{2}}(\cdot|A,W,V,u)$ from the Gaussian law with mean $\tilde{y} =
\tilde{a} \tilde{y}_{1} + (1-\tilde{a}) \tilde{y}_{0}$
and variance $\tilde{s}(W)^{2}$.
- Otherwise, $Y$ is conditionally drawn under
$p_{\theta_{2}}(\cdot|A,W,V,u)$ from the uniform law on the support of the
marginal law of $Y$ under $P$.
Therefore, the conditional likelihood $p_{\theta_{2}}(Y|A,W,V,u)$ bears
information only if $A=\tilde{a}$ (that is, if the actions $A$ and
$\tilde{a}$ undertaken when generating $O=(V,W,A,Y)$ and computing
$\Dec_{\theta_{2}}(u)$ coincide), which can be interpreted as a necessary
condition to justify the comparison of the rewards $Y$ and
$\tilde{y}$. Moreover, when $A=\tilde{a}$, the closer the contexts $W$ and
$\tilde{w}$ are, the more relevant the comparison is and the larger the
magnitude of $p_{\theta_{2}}(Y|A,W,V,u)$ can be.
:::
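The tempered probability $\tilde{\underline{h}}(v,w)$ can be sketched as follows, with placeholder values for the decoder outputs $\tilde{g}_{1}, \tilde{g}_{2}, \tilde{w}, \tilde{h}$ and for $\bar{A}_{n}$ (all hypothetical). The sketch illustrates the intended behavior: a faithful reconstruction of $(v, w)$ keeps the Bernoulli parameter close to $\tilde{h}$, while a poor one pulls it toward the empirical frequency $\bar{A}_{n}$.

```python
import numpy as np

# Hypothetical decoder outputs and empirical treatment frequency.
g1_tilde, g2_tilde = 0.4, 0.7      # latent probabilities tilde{g}_1, tilde{g}_2
w_tilde = np.array([0.1, -0.2, 0.5])
h_tilde = 0.25                     # latent probability tilde{h}
A_bar = 0.5                        # empirical mean of A_1, ..., A_n

def h_under(v, w):
    """Tempered probability between h_tilde and A_bar, as defined in the text."""
    log_lik = (v[0] * np.log(g1_tilde) + (1 - v[0]) * np.log(1 - g1_tilde)
               + v[1] * np.log(g2_tilde) + (1 - v[1]) * np.log(1 - g2_tilde))
    # -10 log t(v, w) = -log_lik + ||w - w_tilde||^2, hence:
    t = np.exp((log_lik - np.sum((w - w_tilde) ** 2)) / 10.0)
    return t * h_tilde + (1 - t) * A_bar

# Perfect reconstruction of w versus a very poor one.
h_good = h_under(np.array([1, 1]), w_tilde)
h_bad = h_under(np.array([1, 1]), w_tilde + 10.0)

assert h_tilde <= h_good <= A_bar and h_tilde <= h_bad <= A_bar
assert abs(h_bad - A_bar) < abs(h_good - A_bar)
```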
::: {.callout-note appearance="simple"}
## Running example, cted.
In summary, the right-hand side term in the definition of
$\LB_{\theta}(O_{i})$ (@eq-LB) equals, up to a term that does not depend on
$\theta$,
$$ \begin{aligned}
\tfrac{1}{2}E_{U \sim g_{\theta_{1}}(\cdot|O_{i})} \Bigg[
& -2\left(V_{1,i} \log \tilde{G}_{1} + (1-V_{1,i}) \log (1 -
\tilde{G}_{1})\right) \\
& -2\left(V_{2,i} \log \tilde{G}_{2} + (1-V_{2,i}) \log (1 -
\tilde{G}_{2})\right) \\
& - \|W_{i} - \tilde{W}\|_{2}^{2}\\
& -2\left(A_{i} \log \tilde{\underline{H}} + (1-A_{i}) \log [1 -
\tilde{\underline{H}}]\right)\\
& -\textbf{1}\{A_{i} = \tilde{A}\} \times \left(\log \tilde{S}(W_{i})^{2} +
\frac{(Y_{i} - \tilde{Y})^{2}}{\tilde{S}(W_{i})^{2}}\right)\Bigg],
\end{aligned}
$${#eq-log-p-theta2}
with the notational conventions $\pi \circ \Dec_{\theta_{2}} (U) = (\tilde{V},
\tilde{W}, \tilde{A}, \tilde{Y})$, $V_{i} = (V_{1,i}, V_{2,i})$, and where
$\tilde{G}_{1}$, $\tilde{G}_{2}, \tilde{\underline{H}}, \tilde{S}$
are defined like the
above latent quantities $\tilde{g}_{1}$, $\tilde{g}_{2}, \tilde{\underline{h}},
\tilde{s}$
with $U$ substituted for $u$. The expression is easily interpreted: the
opposite of @eq-log-p-theta2 is an average risk that measures
- the likelihood of $V_{1,i}$ and $V_{2,i}$ from the points of view of the
Bernoulli laws with parameters $\tilde{G}_{1}$ and $\tilde{G}_{2}$ (first and
second terms),
- the average proximity between $W_{i}$ and $\tilde{W}$ (third term),
- the likelihood of $A_{i}$ from the point of view of the Bernoulli law with
parameter $\tilde{\underline{H}}$ (fourth term),
- the average proximity between $Y_{i}$ and $\tilde{Y}$ (fifth term) *only if*
$A_{i} = \tilde{A}$ (otherwise, the comparison would be meaningless).
In other words, the opposite of @eq-log-p-theta2 can be interpreted as a
measure of the average faithfulness of the reconstruction of $O_{i}$ under the
form $\pi \circ \Dec_{\theta_{2}} (U)$ with $U$ drawn from
$g_{\theta_{1}}(\cdot|O_{i})$: the larger @eq-log-p-theta2 is, the better this
reconstruction.
To conclude, note that the conditional laws of $W$ and $Y$, both Gaussian,
could easily be associated with diagonal covariance matrices different from
the identity matrix. This adjustment would be particularly relevant in
situations where $\|W\|_{2}$ and $|Y|$ are typically not of the same
magnitude, with $O=(V,W,A,Y)$ drawn from the law $P$ of the experiment of
interest. Alternatively, the genuine observations could be pre-processed to
ensure that $\|W\|_{2}$ and $|Y|$ are brought to comparable magnitudes.
:::
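For concreteness, the conditional log-likelihood $\log p_{\theta_{2}}(O|u)$ assembled from the factors described in the callouts above can be sketched as follows, with placeholder values for the decoder outputs (all hypothetical) and additive constants dropped.

```python
import numpy as np

# Sketch of the per-draw reconstruction log-likelihood log p_{theta_2}(O|u),
# up to additive constants, assembled from the conditional factors:
# p(V|u) * p(W|u) * p(A|W,V,u) * p(Y|A,W,V,u).

def recon_loglik(v, w, a, y, g_tilde, w_tilde, h_under, a_tilde, y_tilde, s_tilde):
    ll = np.sum(v * np.log(g_tilde) + (1 - v) * np.log(1 - g_tilde))  # V | u
    ll += -0.5 * np.sum((w - w_tilde) ** 2)                           # W | u
    ll += a * np.log(h_under) + (1 - a) * np.log(1 - h_under)         # A | W, V, u
    if a == a_tilde:                                                  # Y | A, W, V, u
        ll += -np.log(s_tilde) - (y - y_tilde) ** 2 / (2 * s_tilde**2)
    return ll

# One observation against hypothetical decoder outputs.
ll = recon_loglik(
    v=np.array([1, 0]), w=np.array([0.2, -0.1, 0.4]), a=1, y=1.1,
    g_tilde=np.array([0.6, 0.3]), w_tilde=np.array([0.1, 0.0, 0.5]),
    h_under=0.45, a_tilde=1, y_tilde=0.9, s_tilde=0.5,
)
assert np.isfinite(ll) and ll < 0.0
```

Averaging this quantity over draws of $U \sim g_{\theta_{1}}(\cdot|O_{i})$ yields (up to constants) the reconstruction term of $\LB_{\theta}(O_{i})$.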
The hope is that, once the VAE is trained, yielding a parameter $\thn =
((\thn)_{1}, (\thn)_{2})$, the corresponding generator $\Gen_{\thn}$ produces
a synthetic complete data $X\in\calX$ such that the law of $\pi(X) \in\calO$
closely approximates $P$. Naturally, this approximation is closely related to
the conditional densities $g_{(\thn)_{1}}(\cdot|O_{i})$ and
$p_{(\thn)_{2}}(\cdot | u)$ ($1 \leq i \leq n$, $u \in \bbR^{d}$).
For instance, in the context of the running example, if $O = (V,W, A, Y) = \pi
\circ \Gen_{\thn}(Z)$ and if $\xi, \zeta$ are independently drawn from the
centered Gaussian laws with an identity covariance matrix on $\bbR^{3}$ and
variance 1 on $\bbR$, respectively, then $(V, W + \xi, A, Y + \zeta)$ follows a
law that admits the density
$$
\begin{aligned}
(v, w, a, y)\mapsto \int & p_{(\thn)_{2}}(y|a, w, v, u) \times
\textbf{1}\{a=\tilde{a}_{(\thn)_{2}}(u)\}\\
& \times
p_{(\thn)_{2}}(w,v|u)\left(\frac{1}{n}\sum_{i=1}^{n}
g_{(\thn)_{1}}(u |O_{i}) \right) du,
\end{aligned}
$$
where $\tilde{a}_{(\thn)_{2}}(u)$ is defined as the $A$-coefficient of
$\pi\circ \Dec_{(\thn)_{2}}(u)$.
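The sampling scheme of the trained generator can be sketched as follows, with toy placeholder maps standing in for the trained encoder and decoder (they are *not* the neural networks of @sec-VAE-formal; all shapes and values are hypothetical): the first coordinate of $Z$ selects an observation $O_{i}$, the Gaussian coordinates are reparametrized through the encoding $(\mu_{i}, \sigma_{i})$, and the resulting latent point is decoded.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy stand-ins for Enc_{(thn)_1} / Dec_{(thn)_2} on a latent space of
# dimension d = 2, with n = 4 placeholder observations in R^3.
n, d = 4, 2
obs = rng.normal(size=(n, 3))  # placeholder observations O_1, ..., O_n

def encode(o):
    """Placeholder encoder: returns (mu, sigma) parametrizing g(.|o)."""
    return np.tanh(o[:d]), 0.5 + 0.5 / (1.0 + o[:d] ** 2)

def decode(u):
    """Placeholder decoder: any deterministic map from R^d to the data space."""
    return np.concatenate([u, [u @ u]])

def generate():
    """One draw of pi(Gen(Z)), Z = (uniform index, d standard Gaussians)."""
    i = rng.integers(n)              # Z's first coordinate selects O_i
    z = rng.standard_normal(d)       # Z's Gaussian coordinates
    mu, sigma = encode(obs[i])       # encode O_i
    u = mu + sigma * z               # reparametrized draw from g(.|O_i)
    return decode(u)                 # decode into a synthetic observation

sample = generate()
assert sample.shape == (3,)
```

Marginalizing over the index and the Gaussian coordinates yields the mixture density displayed above.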
## About the over-parametrization {#sec-over-parametrization}
In @sec-VAE-formal we acknowledged that the models $\{\Enc_{\theta_{1}} :
\theta_{1} \in \Theta_{1}\}$ and $\{\Dec_{\theta_{2}} : \theta_{2} \in
\Theta_{2}\}$ are over-parametrized in the sense that the dimension of the
parameter set $\Theta_{1}\times \Theta_{2}$ is potentially large. For
instance, the dimension of the model that we build in the next section is
1157. This is a common feature of neural networks.
Our models are also over-parametrized in the sense that they are not
identifiable. This is obviously the case because of the loss of information
incurred when an observation $O$ is obtained as a piece $\pi(X)$ of a complete
data $X$ that we never observe in its entirety.
::: {.callout-note appearance="simple"}
## Running example, cted.
In particular, in the context of this example, it is well known that we cannot
learn from $O_{1}, \ldots, O_{n}$ any feature of the joint law of the
counterfactual random variables $(Y[0], Y[1])$ that does not reduce to a
feature of the marginal laws of $Y[0]$ or $Y[1]$, unless we make very strong
assumptions on this joint law (e.g., that $Y[0]$ and $Y[1]$ are independent).
:::
This is not a source of concern. First, it is generally recognized that the
fitting of neural networks often benefits from the high dimensionality of the
optimization space and the presence of numerous equivalently good local
optima, resulting in a redundant optimization landscape
[@choromanska2015loss; @arora2018optimization]. Second, our objective is to
construct a generator that approximates the law $P$ of $O_{1}, \ldots, O_{n}$,
generating $O\in\calO$ by first producing $X\in\calX$ (via $\Gen_{\theta}(Z)$)
and then providing $\pi(X)$. The fact that two different generators
$\Gen_{\theta}$ and $\Gen_{\theta'}$ can perform equally well is not
problematic. Identifying *one* generator $\Gen_{\theta}$ that performs well
is sufficient.
It is possible to search for generators that satisfy user-supplied
constraints, provided these can be expressed as a real-valued criterion
$F(E[\calC(\Gen_{\theta}(Z))])$. For example, one may wish to construct a
generator $\Gen_{\theta}$ such that the components of $X$ under
$\law(\Gen_{\theta})$ exhibit a pre-specified correlation pattern (as
demonstrated in the simple example below).
To focus the optimization procedure on generators that approximately meet
these constraints, one can modify the original criterion @eq-VAE by adding a
penalty term. Specifically, given a user-supplied hyper-parameter $\lambda >
0$, we can substitute
$$
\max_{\theta\in\Theta} \left\{\sum_{i=1}^{n}\LB_{\theta}(O_{i}) +
\lambda F(E_{Z \sim \Unif\{1, \ldots, n\} \otimes N(0,1)^{\otimes
d}}[\calC(\Gen_{\theta}(Z))])\right\}
$${#eq-penalized-VAE}
for @eq-VAE. From a computational perspective, this adjustment simply
involves adding the term
$$
\lambda F\left(\frac{1}{m}\sum_{i=1}^{m} \calC(\Gen_{\theta} (Z_{m+i}))\right)
$${#eq-penalization-VAE}
to the expressions within the curly brackets in the definition of $g$ in the
algorithm described in @sec-algo.
::: {.callout-note appearance="simple"}
## Running example, cted.
In particular, in the context of this example, we could look for generators
$\Gen_{\theta}$ such that the correlation of $Y[0]$ and $Y[1]$ under
$\law(\Gen_{\theta} (Z))$ be close to a target correlation $r \in ]-1, 1[$.
In that case, we could choose $\calC(\Gen_{\theta} (Z)) := (Y[0]Y[1],
Y[0]^{2}, Y[1]^{2}, Y[0], Y[1])$ and $F : (a,b,c,d,e) \mapsto |(a -
de)/\sqrt{(b - d^2)(c - e^2)} - r|$.
:::
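The penalty ingredient $F(E[\calC(\Gen_{\theta}(Z))])$ of @eq-penalized-VAE, with the choices of $\calC$ and $F$ made in the example above, can be evaluated empirically as follows. The counterfactuals $(Y[0], Y[1])$ below are hypothetical generator outputs, here jointly Gaussian with correlation $0.8$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical synthetic counterfactuals produced by a generator:
# (Y[0], Y[1]) jointly Gaussian with correlation 0.8.
n_sim, true_corr = 200_000, 0.8
y0 = rng.standard_normal(n_sim)
y1 = true_corr * y0 + np.sqrt(1 - true_corr**2) * rng.standard_normal(n_sim)

# C(Gen(Z)) = (Y[0]Y[1], Y[0]^2, Y[1]^2, Y[0], Y[1]), averaged over draws.
a, b, c, d, e = (np.mean(y0 * y1), np.mean(y0**2), np.mean(y1**2),
                 np.mean(y0), np.mean(y1))

# F recovers |corr(Y[0], Y[1]) - r| from the averaged moments.
r = 0.5
F = abs((a - d * e) / np.sqrt((b - d**2) * (c - e**2)) - r)

assert abs(F - abs(true_corr - r)) < 0.01  # here, approximately |0.8 - 0.5|
```

During training, the same computation would be applied to the $m$ Monte Carlo draws $\Gen_{\theta}(Z_{m+1}), \ldots, \Gen_{\theta}(Z_{2m})$ appearing in @eq-penalization-VAE.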