Bachelor Thesis Dimensionality Reduction
Table of Contents
- 1. Lines of Thought
- 1.1. Dimensionality Reduction
- 1.2. Autoencoder
- 1.3. PCA
- 1.4. LDA
- 1.5. LLE
- 1.6. Isomap
- 1.7. Intermezzo
- 1.8. MDPs Generalization
- 1.9. Disentangled Representations
- 1.10. Variational inference
- 1.11. Variational Autoencoder
- 1.12. Adversarial Autoencoder
- 1.13. β-VAE
- 1.14. Info VAE
- 1.15. INPROGRESS Random thought
- 2. Overview
- 3. Ideas for Hyperparameters
- 4. ML Pipeline
- 5. Autoencoders
- 6. Papers
- 6.1. Common knowledge resources
- 6.2. General papers:
- 6.3. Autoencoders:
- 6.3.1. Recent advances in Autoencoder-based Representation Learning
- 6.3.2. β-VAE: Learning basic visual concepts with a constrained variational framework
- 6.3.3. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks
- 6.3.4. DISCARDED Generalized Autoencoder: A Neural Network Framework for Dimensionality Reduction
- 6.3.5. Tutorial on Variational Autoencoders
- 6.3.6. Auto-Encoding Variational Bayes
- 6.3.7. InfoVAE: Balancing Learning and Inference in Variational Autoencoders
- 6.3.8. Adversarial Autoencoders
- 6.3.9. Learning representations by maximizing mutual information in variational autoencoders
- 6.3.10. HOLD The information autoencoding family: A lagrangian Perspective on latent variable generative models
- 6.3.11. HOLD CausalVAE: Disentangled Representation Learning via Neural Structural Causal Models
- 6.3.12. Life-Long Disentangled representation learning with Cross-domain Latent Homologies
- 6.3.13. TODO Unsupervised model selection for variational disentangled representation learning
- 6.4. Dimensionality reduction:
- 6.4.1. Auto-encoder based dimensionality reduction
- 6.4.2. Dimensionality Reduction of SDSS Spectra with Variational Autoencoders
- 6.4.3. DISCARDED Dimensionality reduction for EEG-based sleep stage detection: comparison of autoencoders, principal component analysis and factor analysis
- 6.4.4. A deep adversarial variational autoencoder model for dimensionality reduction in single-cell RNA sequencing analysis
- 6.5. Disentangled representation:
- 6.5.1. Understanding disentangling in β-VAE
- 6.5.2. Towards a Definition of Disentangled Representations
- 6.5.3. Are Disentangled representations helpful for Abstract Visual Reasoning?
- 6.5.4. DISCARDED On the binding Problem in Artificial Neural Networks
- 6.5.5. Disentangling Disentanglement in Variational Autoencoders
- 6.5.6. TODO Unsupervised State Representation Learning in Atari
- 6.6. Continual Learning:
- 6.6.1. TODO Generative Models from the perspective of Continual Learning
- 6.6.2. TODO Embracing Change: Continual Learning in Deep Neural Networks
- 6.6.3. TODO Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges
- 6.6.4. TODO Continual Unsupervised Representation Learning
- 6.7. AE + RL:
- 6.7.1. VARL: a variational autoencoder‑based reinforcement learning Framework for vehicle routing problems
- 6.7.2. Robot skill learning in latent space of a deep autoencoder neural network
- 6.7.3. AutoEncoder-based Safe Reinforcement Learning for Power Augmentation in a Lower-limb Exoskeleton
- 6.7.4. The Dreaming Variational Autoencoder for Reinforcement Learning Environments
- 6.7.5. Deep Variational Reinforcement Learning for POMDPs
- 6.7.6. On the use of Deep Autoencoders for Efficient Embedded Reinforcement Learning
- 6.7.7. DARLA: Improving Zero-Shot Transfer in Reinforcement Learning
- 6.7.8. TODO Explainability in deep reinforcement learning
- 6.8. Old ideas
- 6.9. TODO Old work
- 6.9.1. Level ground walking for healthy and transfemoral amputee models. Deep reinforcement learning with phasic policy gradient optimization
- 6.9.2. Deep reinforcement learning for physics-based musculoskeletal model of a transfemoral amputee with a prosthesis walking on uneven terrain
- 6.9.3. Deep reinforcement learning for physics-based musculoskeletal simulations of healthy subjects and transfemoral prostheses' users during normal walking
- 6.9.4. Learning to walk: Phasic Policy Gradient for healthy and impaired musculoskeletal models
- 6.9.5. Evaluating Deep Reinforcement Learning Algorithms for Physics-Based Musculoskeletal Transfemoral Model with a Prosthetic Leg Performing Ground-Level Walking
- 6.9.6. Deep Reinforcement Learning for Physics-based Musculoskeletal Simulations of Transfemoral Prosthesis' Users during the Transition between Normal Walking and Stairs Ascending
- 6.9.7. Testing For Generality Of A Proximal Policy Optimiser For Advanced Human Locomotion Beyond Walking
- 7. Presentations
- 8. INPROGRESS Future Research
- 9. INPROGRESS Code
- 10. Contacts
- 11. Todos
- 11.1. Code
- 11.1.1. DONE Vanilla Autoencoder
- 11.1.2. DONE VAE
- 11.1.3. DONE AVB
- 11.1.4. DONE logging
- 11.1.5. DONE tensorboard
- 11.1.6. DONE data loader
- 11.1.7. DONE cross validation
- 11.1.8. DONE parse config
- 11.1.9. DONE writer
- 11.1.10. DONE validate with different loss
- 11.1.11. DONE Create config parser module
- 11.1.12. HOLD Dict -> nametuple
- 11.1.13. INPROGRESS Graph module
- 11.1.14. HOLD Collect Data from simulation
- 11.2. Paper
1. Lines of Thought
1.1. Dimensionality Reduction
The concept of dimensionality reduction is quite straightforward. The idea is to reduce the number of dimensions/features while retaining maximum information. Even though the definition is quite simple, being able to perform such transformations is not trivial.
Ideally, the reasons for performing such a process are (note this list is not complete but gives a general overview):
- Avoid the curse of dimensionality
- Reducing potential overfitting of further processing
- Reducing computation time of further processing
- Reducing storage space
- Plotting
- Noise removal
- Removing Correlated features
- Removing redundant features
Several methods can perform this transformation. They are usually divided into 2 categories:
- Linear methods
- Non-Linear methods
Of course, there can also be other types of categorizations (e.g. feature selection vs. feature extraction, neural, manifold-based, local methods, etc.). In the following sections, we will present roughly 2 approaches per category.
Figure 1: The Dimensionality reduction methods timeline
NOTE we will only focus on unsupervised methods since they are the most suitable for real-life situations, where labelled data is scarce and expensive to obtain.
1.2. Autoencoder
An Autoencoder is a special network architecture which approximates two functions, encode and decode, such that: \[decode(encode(\hat{X})) = \hat{X}\]
Note: most of the time this is not an equality (=) but an approximation (≈).
The network is therefore composed of two sub-networks. An Encoder, which can be defined as: \[encode : \mathbb{R}^n \rightarrow \mathbb{R}^m \] And a Decoder, which can be defined as: \[decode : \mathbb{R}^m \rightarrow \mathbb{R}^n \]
There are two constraints on these two functions. The first one is that decode must be approximately the inverse of encode 1. The second one is that \[ m \ll n \]
NOTE when the second constraint is satisfied, the autoencoder is considered an under-complete autoencoder. However, whenever we use the word autoencoder we will be referring to an under-complete autoencoder.
The second constraint is an architectural one, whereas the first one is a functional constraint that is achieved once the network is trained.
The error function is therefore a reconstruction error, i.e. a distance measure between the input and the output.
The layer between the Encoder and the Decoder expresses what is usually known as the Latent space, whose dimensionality is \(m\).
We will from now on refer to the latent representation as \(\hat{z}\). For clarity we can rewrite the above formulas as: \[encode(\hat{X}) = \hat{z}\] \[decode(\hat{z}) \approx \hat{X}\]
As Wang stated 2
Auto-encoder can be seen as a way to transform representation.
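To make this concrete, here is a minimal sketch of an under-complete autoencoder, assuming PyTorch; the layer sizes (n = 784, m = 32, hidden width 128) are purely illustrative choices, not prescriptions.

```python
import torch.nn as nn

# Minimal under-complete autoencoder sketch (layer sizes are illustrative):
# encode: R^784 -> R^32, decode: R^32 -> R^784, trained so that
# decode(encode(x)) ≈ x under a reconstruction loss such as MSE.
class Autoencoder(nn.Module):
    def __init__(self, n=784, m=32):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n, 128), nn.ReLU(),
                                    nn.Linear(128, m))
        self.decode = nn.Sequential(nn.Linear(m, 128), nn.ReLU(),
                                    nn.Linear(128, n))

    def forward(self, x):
        z = self.encode(x)     # latent representation z-hat in R^m
        return self.decode(z)  # reconstruction of x in R^n

# Training would minimize nn.MSELoss()(model(x), x) over the dataset.
```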
1.3. PCA
Principal Component Analysis (PCA) is a linear technique. It is probably one of the most used methods because of its reliability and explainability. Conceptually, PCA finds the directions of maximum variance in the data and projects the data onto a new space with fewer dimensions than the original.
The crucial point of PCA is to find the principal components of the data, which are completely uncorrelated with each other, while maintaining most of the variability of the data. Note the principal components are selected based on the explained variance.
Figure 2: PCA visualization
1.3.1. Assumptions/downfalls
- Linear dimensions (i.e. the variables in the dataset must combine in a linear manner)
- approximately normally distributed data
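A small usage sketch, assuming scikit-learn and its iris toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Toy example: project the 4-D iris data onto its 2 principal components
# and inspect how much variance each component explains.
X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance per component
```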
1.4. LDA
Linear Discriminant Analysis (LDA) is a linear method. In a nutshell, we want to find a new subspace to project the data onto in order to maximize class separability.
The idea for measuring such separability is to maximize the difference between the means of the classes while minimizing the spread within each class.
The main disadvantage is that LDA performs well only if the dataset is normally distributed.
Figure 3: LDA vs PCA
1.4.1. Assumptions/downfalls
- Normally distributed data
- Linear combination of features
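A small usage sketch, again assuming scikit-learn and the iris toy dataset (note that LDA, unlike the other methods here, needs labels and can project onto at most C-1 dimensions, where C is the number of classes):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy example: project the 4-D iris data (3 classes) onto the 2 axes that
# maximize between-class separation; unlike PCA, LDA requires labels y.
X, y = load_iris(return_X_y=True)
X_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X_2d.shape)  # (150, 2)
```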
1.5. LLE
Locally Linear Embedding (LLE) is a non-linear method. Conceptually, it aims to discover the underlying non-linear structure of the data set while preserving the distances within local neighborhoods.
Figure 4: LLE visualization
This technique is a 3-step procedure:
- Use a KNN approach to find the k nearest neighbors of every data point.
- Compute the weights that best reconstruct each data vector as a linear combination of its k nearest neighbors. (Note all data points which are not in a particular neighborhood get 0 weight)
- Compute the low-dimensional embedding whose vectors are best reconstructed from their neighbors by those same weights.
1.5.1. Assumptions/downfalls
- Euclidean distance to compute k-nearest neighbors
- Quite sensitive to outliers and noise
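A small usage sketch, assuming scikit-learn and its swiss-roll toy dataset:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Toy example: unroll the 3-D swiss roll into 2 dimensions.
# n_neighbors is the k of the KNN step; LLE is sensitive to this choice.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)
X_2d = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X)
print(X_2d.shape)  # (1000, 2)
```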
1.6. Isomap
Isometric Mapping (Isomap) is a non-linear method which belongs to the category of Manifold Learning.
Conceptually, it is quite similar to LLE; however, the crucial objective of this mapping is to maintain the geodesic distance between points.
Note a geodesic is the shortest path between two points along the surface itself. This is why Isomap is considered a Manifold Learning method.
This technique is also defined by a 3-step process:
- Construct a neighbourhood graph (equivalent to the first step of LLE)
- Compute the shortest path between points (using either Dijkstra's or Floyd-Warshall algorithm)
- Construct a d-dimensional embedding by a partial eigenvalue decomposition (i.e. taking the d largest eigenvalues of the kernel)
Figure 5: Isomap vs PCA vs LLE
1.6.1. Assumptions/downfalls
- Computational intensive
- Euclidean distance for k-nearest neighbors
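A small usage sketch on the same toy data as the LLE example (assuming scikit-learn):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Toy example: geodesic distances are approximated by shortest paths on the
# k-nearest-neighbor graph, then embedded via a partial eigendecomposition.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)
X_2d = Isomap(n_neighbors=12, n_components=2).fit_transform(X)
print(X_2d.shape)  # (1000, 2)
```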
1.7. Intermezzo
So, we have rapidly been through classical and more recent dimensionality reduction techniques. The main focus of this thesis, though, is to perform what is usually referred to as Representation Learning.
Of course, it is quite easy to see how Representation Learning and Dimensionality Reduction are closely related.
Indeed, a representation usually has fewer dimensions than the original input. A good representation should also retain the most important information/features of the input space.
Therefore, the two branches are closely related. However, it is important to notice that a good dimensionality reduction method does not always produce a good representation (by good we mean that it has all the important features needed to learn a mapping between states and actions).
For this reason, this thesis will mainly focus on autoencoder-based techniques to perform dimensionality reduction (and/or representation learning), since they give a good tradeoff between flexibility and accuracy. (for reference: 2, 3, 4)
It is also crucial to notice that the literature indicates that autoencoder-based latent spaces (or embeddings) usually outperform other dimensionality reduction techniques when the latent space is used as input to an RL-based framework (for reference: 5, 6, 7, 8, 9, 10)
Before jumping into more advanced autoencoder-based techniques, we will briefly introduce MDPs Generalization. This is another important point of the thesis, since the two main objectives of constructing a low-dimensional latent space for an RL algorithm are:
- Faster and more stable convergence
- Better Generalization property
The first point seems quite intuitive. Having a low-dimensional state space should result in faster and more stable convergence, since the RL algorithm needs to learn a mapping from a low-dimensional state space to actions, which should be easier than learning a mapping from a high-dimensional state space.
Another interesting point made in 5 is that every Deep Reinforcement Learning (DRL) algorithm implicitly learns a first mapping from a high-dimensional state space to a low-dimensional one, and then maps this low-dimensional state space to actions. Therefore, by performing dimensionality reduction we take the concern of learning a good representation away from the DRL algorithm, which can then focus only on learning a mapping from states to actions directly.
Other valuable properties of doing such a process are described in the next chapter.
1.8. MDPs Generalization
For a formal description of this concept, see 5 (section 2.2).
The idea, though, is quite intuitive. Let us assume that we have a natural world from which we can sample MDPs. The crucial characteristic of these MDPs is that they all have the same action space but differ in their state spaces. However, since we are sampling these MDPs from the same natural world, these state spaces must have some structural similarity (i.e. isomorphisms).
Therefore, to have good generalization properties, we need to construct a good representation that aims to represent the state space of the natural world. We cannot leave this concern to a DRL method, for the following reason.
Since DRL maximizes a task-specific objective, the representation it is guided to learn is the most MDP-entangled one. 5 Therefore, if we do not move this concern outside the DRL algorithm, we will have poor generalization ability, particularly without extensive fine-tuning.
Here, dimensionality reduction methods such as autoencoders come to the rescue. Since they do not maximize the same objective as the DRL algorithm, we can guide the representation learning process as we please. Moreover, we will discover in the next chapters how crucial it is to aim for disentangled representations.
The main downfall of moving the concern of learning a representation outside the DRL is that we need to be careful about what kind of dataset we use to train the autoencoders. It is crucial that the dataset has high variability and covers most of the "visible" state space, because many autoencoder architectures have undefined behaviour in regions of the space not explored during training, which is not desirable.
Other potential downfalls are:
- Increase overall computation time (not always true though)
- Risk of losing important information for the DRL algorithm
- Non-trivial definition of AE-hyperparameters
Since the AE objective is usually centred on the reconstruction error, it is not trivial to focus on learning useful representations, as opposed to representations that merely help the decoder achieve a lower reconstruction error. Therefore, a tradeoff must be made to achieve representations that are useful for RL. We will see in future sections how different AE architectures deal with this tradeoff.
1.9. Disentangled Representations
This is a big topic in current AI research 11 (6th big challenge)
In the literature it is not entirely clear what we mean when we talk about disentangled representations. However, some research effort has been made towards a formal definition 12. It seems, following their 12 definition, that the concept of disentanglement is quite similar to the concept of symmetry in physics. Physics, indeed, can be seen as an in-depth study of symmetries (see "More is Different" by P. W. Anderson in Science, which states "it is only slightly overstating the case to say that physics is the study of symmetry").
This is quite important because, given this point of view, it is easier to define formally (i.e. mathematically) what are the properties of disentangled representations.
As stated in 12
Intuitively, we define a vector representation as disentangled if it can be decomposed into a number of subspaces, each one of which is compatible with, and can be transformed independently by a unique symmetry transformation
The paper 12 then goes on to define this intuition more formally using group theory and the concept of symmetries.
Of course, this is only one point of view on disentangled representations; there are others. However, this is, to the best of our knowledge, the best formal attempt to define them.
Another important point is that this "new" definition tries to bring together all the different approaches/points of view that were present in the literature at that time. The 3 main characteristics that the authors identify are: modularity, compactness and explicitness. Directly quoting from 12:
- Modularity "measures whether a single latent dimension encodes no more than a single data generative factor"
- Compactness "measures whether each data generative factor is encoded by a single latent dimension"
- Explicitness "measures whether the values of all of the data generative factors can be decoded from the representation using a linear transformation"
Not all of them are explicitly required for a disentangled representation. This holds particularly for the explicitness characteristic, since linearity is not required to have a disentangled representation (given the definition in 12).
Figure 6: Comparison of entangled vs disentangled representation
So now that we have a general overview of what we mean when we talk about disentangled representations, we can move on to understanding whether having such a representation is useful, or at least better than having a "normal" representation (i.e. one obtained without enforcing any of the aforementioned properties).
Understanding whether such a representation is useful is also not a trivial task, and the literature is not clear on it. Most papers claim that it is useful for three main reasons 13:
- more sample-efficient
- less sensitive to nuisance variables
- better in terms of generalization
This has been shown to be experimentally correct 13; however, from a formal point of view, it is not clear why this is the case.
That is another motivation for conducting this thesis on disentangled representations: to see whether these findings translate to harder and more complex settings.
We will now proceed to present the different architectures that will be used in the thesis, and we will justify the choice of each one.
1.9.1. Mathematical Background and Explanation of Disentangled Representations
From a mathematical point of view, the definition of disentangled representations is based on group theory and group representation theory (all the mathematical theory in this paragraph is summarised from 12).
So first of all let us introduce the basic concepts of group theory.
A group is defined by a tuple \((\mathcal{G},\circ)\).
\(\mathcal{G}\) is a set, \(\circ\) is defined as following: \[\circ : \mathcal{G} \times \mathcal{G} \rightarrow \mathcal{G}\]
In order for the tuple to be considered a group, the binary operator \(\circ\) must have the following properties:
- Associativity \[\forall x,y,z \in \mathcal{G} : x \circ (y \circ z) = (x \circ y) \circ z\]
- Identity \[\exists e \in \mathcal{G}, \forall x \in \mathcal{G} : e \circ x = x \circ e = x \]
- Inverse \[\forall x \in \mathcal {G},\exists x^{-1} \in \mathcal{G} : x \circ x^{-1} = x^{-1}\circ x = e\]
If the tuple has these properties then it is considered a group.
Now let us define what a group action is:
Given a tuple \((\mathcal{G},\circ)\), a group action is a binary function such that
\[\cdot : \mathcal{G} \times \mathcal{X} \rightarrow \mathcal{X}\]
With the following properties:
\[e \cdot x = x \ \ \ \ \forall x \in \mathcal{X}\]
\[(g \circ h) \cdot x = g \cdot (h \cdot x) \ \ \ \ \forall g,h \in \mathcal{G}, x \in \mathcal{X}\]
\(\mathcal{X}\) can be any structured space (e.g. topological space, vector space etc.) In the case \(\mathcal{X}\) is a vector space we have the specialization of the above properties as: \[g(x + y) = gx + gy \ \ \ \ \forall g \in \mathcal{G},\forall x,y \in \mathcal{X}\] \[g(\lambda x) = \lambda (gx) \ \ \ \ \ \forall g \in \mathcal {G},\lambda \in \mathbb{R}, x \in \mathcal {X}\]
So now that we have the basic tools, let us first define what a disentangled group action is.
A group action
\[\cdot: \mathcal{G} \times \mathcal{X} \rightarrow \mathcal{X}\]
is disentangled with respect to a particular decomposition \(\mathcal{G} = \mathcal{G_1} \times \mathcal{G_2}\) if there exist a decomposition \(\mathcal{X} = \mathcal{X_1} \times \mathcal{X_2}\) and subactions \(\cdot_i : \mathcal{G_i} \times \mathcal{X_i} \rightarrow \mathcal{X_i}\), where \(i \in \{1,2\}\), such that:
\[(g_1,g_2)\cdot(v_1,v_2) = (g_1 \cdot_1 v_1, g_2 \cdot_2 v_2)\]
The important thing to notice here is that each subaction \(\cdot_i\) modifies the respective \(\mathcal{X_i}\) but does not modify the other factor, i.e. it is invariant with respect to the other \(\mathcal{X}\).
As a final note, if \(\mathcal{X}\) has additional structure we would like the subactions to preserve such structure. So, for example, if \(\cdot\) is linear we want each \(\cdot_i\) to be linear too.
So now we can see what a disentangled representation is with respect to this framework.
Let us first introduce some terminology:
\(\mathcal{W}\) is a set of world-states. There exists a generative process \(b:\mathcal{W} \rightarrow \mathcal{O}\), where \(\mathcal{O}\) are the observations. Then, there exists an inference process \(h: \mathcal{O} \rightarrow \mathcal{Z}\), where \(\mathcal{Z}\) is the agent's representation. We will assume that \(\mathcal{Z}\) is a vector space. By function composition we can then say that there exists an \(f = h \circ b\) such that \(f:\mathcal{W} \rightarrow \mathcal{Z}\).
Now let us say that there exists a group action of \(\mathcal{G}\) such that \(\cdot : \mathcal{G} \times \mathcal{W} \rightarrow \mathcal{W}\), which describes the symmetries present in \(\mathcal{W}\). We would like to find another group action, defined as \(\cdot' : \mathcal{G} \times \mathcal{Z} \rightarrow \mathcal{Z}\), that maintains these symmetries in \(\mathcal{Z}\).
Such a group action exists if and only if:
\[g \cdot' f(w) = f(g\cdot w) \ \ \ \ \forall g \in \mathcal{G}, w \in \mathcal{W}\]
If that is the case, \(f\) is called a \(\mathcal{G}\)-morphism or equivariant map.
Of course there is no guarantee that \(\cdot'\) exists.
If \(f\) is bijective, we can define \(\cdot'\) in terms of \(\cdot\) as follows:
\[g \cdot' z = f(g \cdot f^{-1}(z))\]
For the cases where \(f\) has other properties, see 12.
So we can say that \(\mathcal{Z}\) is disentangled with respect to a decomposition \(\mathcal{G} = \mathcal{G_1} \times ... \times \mathcal{G_n}\) if:
- there exists an action \(\cdot': \mathcal{G} \times \mathcal{Z} \rightarrow \mathcal{Z}\)
- there exists an \(f:\mathcal{W} \rightarrow \mathcal{Z}\) which is a \(\mathcal{G}\)-morphism or equivariant map
- there exists a decomposition \(\mathcal{Z} = \mathcal{Z_1} \times ... \times \mathcal{Z_n}\) such that the action on it is disentangled, i.e. each \(\cdot_i : \mathcal{G_i} \times \mathcal{Z_i} \rightarrow \mathcal{Z_i}\)
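To ground these abstract definitions, here is a hypothetical toy sketch in Python: \(\mathcal{G} = \mathcal{G_1} \times \mathcal{G_2}\), where each subgroup translates one coordinate of a point in \(\mathbb{R}^2\). The action is disentangled with respect to this decomposition because each subaction leaves the other coordinate invariant.

```python
# Toy disentangled group action: G = G1 x G2, where G1 translates the first
# coordinate of a point in R^2 and G2 translates the second. Each subaction
# modifies its own factor X_i and leaves the other factor invariant.

def act(g, x):
    """Action of g = (g1, g2) on x = (x1, x2): componentwise translation."""
    (g1, g2), (x1, x2) = g, x
    return (x1 + g1, x2 + g2)

x = (0.0, 0.0)
print(act((1.0, 0.0), x))  # (1.0, 0.0): G1 moves only the first coordinate
print(act((0.0, 2.0), x))  # (0.0, 2.0): G2 moves only the second coordinate
# Group axioms hold: act((0, 0), x) == x (identity), and
# act((-1.0, 0.0), act((1.0, 0.0), x)) == x (inverse).
```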
1.10. Variational inference
Before diving into the different VAE architectures, let us take a step back and understand the core concept of Variational Inference in Bayesian modelling. In general, Variational Inference comes from Variational Calculus.
The main reference for this overview is the paper 14.
The core problem is to approximate a probability density that is hard to compute or even computationally intractable.
Formally, the problem can be expressed as follows:
Let \(x=x_{1:n}\) be the dataset or observed variables and let \(z=z_{1:m}\) be the set of latent variables.
The problem is to compute the conditional density of \(z\) given \(x\):
\[p(z|x)=\frac{p(z,x)}{p(x)}\]
Rewriting the numerator using Bayes' rule we get:
\[p(z|x) = \frac{p(x|z)p(z)}{p(x)}\]
where \(p(x|z)\) is the likelihood and \(p(z)\) is the prior of the latent variable.
We can rewrite the evidence (or marginal) as:
\[p(x) =\int p(z,x) dz\]
NOTE in many cases the evidence integral is unavailable in closed form; even when we can rewrite the formula, the time needed to compute the integral grows exponentially, which is unfeasible for large data sets.
Therefore, since in the general case computing this integral is intractable, we want to find a surrogate posterior which is close enough to the real one but more tractable and easier to work with. We can express this surrogate as: \[q(z)\approx p(z|x)\]
To find such a surrogate within an optimization framework, we first need a way to assess the "goodness of the fit". In other words, we need a measure that tells us how close \(q(z)\) is to \(p(z|x)\).
The most natural measure of distance between distributions is the KL-divergence. This metric comes directly from Information theory: it is the expected log-ratio of the two densities, it is always non-negative, and it is zero iff the two distributions coincide (almost everywhere).
We can write our optimization problem in terms of the KL-divergence as follows:
\[ q^*(z)=argmin_{q(z)\in Q}KL(q(z)||p(z|x))\]
\(Q\) is the family of "simple" distributions from which we pick our \(q(z)\). Note simple here just means we have an analytical form of the distribution.
The KL-divergence is expressed as:
\[KL(q(z)||p(z|x)) = \mathbb{E}_{z \thicksim q(z)}\left[log\left(\frac{q(z)}{p(z|x)}\right)\right]\]
As we can see, we still have the intractable posterior \(p(z|x)\). We can rearrange the equation as follows (note the expectation, in the continuous case, is expressed as an integral): \[KL(q(z)||p(z|x)) = \int q(z) log\left(\frac{q(z)}{p(z|x)}\right) dz\] Using the aforementioned definition of the posterior we get \[= \int q(z) log\left(\frac{q(z)p(x)}{p(z,x)}\right) dz\] Now we can rearrange and separate the integral into two parts: \[= \int q(z) log\left(\frac{q(z)}{p(z,x)}\right)dz + \int q(z) log(p(x))dz \] Now we can see that we have two different expectations: \[= \mathbb{E}_{z \thicksim q(z)}\left[log\left(\frac{q(z)}{p(z,x)}\right)\right] + \mathbb{E}_{z \thicksim q(z)}\left[log(p(x))\right]\]
It is important to notice that the second expectation does not contain \(z\), therefore we can drop the expectation operator. Finally, we make one further rearrangement of the first expectation: we invert the numerator and the denominator, and, to keep it mathematically sound, negate it (exploiting the properties of logarithms): \[= -\mathbb{E}_{z \thicksim q(z)}\left[log\left(\frac{p(z,x)}{q(z)}\right)\right] + log(p(x))\]
Let us call the first expectation \(\mathcal{L}(q)\).
Now we have the final form of the KL-divergence:
\[KL = - \mathcal{L}(q) + log(p(x))\]
\(p(x)\) is known as the marginal probability.
\(log(p(x))\) is known as the evidence, and it will always be negative since taking the log of something between 0 and 1 always results in negative values.
Another important point is that \(log(p(x))\) is a constant, since it does not change for a given dataset (i.e. the observed variables).
The KL-divergence is by definition non-negative. Therefore, \(\mathcal{L}(q)\) must be negative for the formula to make sense.
More precisely, \(\mathcal{L}(q)\) must be smaller than the evidence (\(log(p(x))\)); therefore, \(\mathcal{L}(q)\) is also known as the Evidence Lower Bound (ELBO).
Finally, we also know that \(\mathcal{L}(q)=log(p(x))\) holds iff \(KL(q(z)||p(z|x))=0\).
Therefore, we just derived the ELBO and we know that it is tractable. So now, instead of minimizing the KL-divergence, we can maximise the ELBO.
The optimization problem we had before was: \[ q^*(z)=argmin_{q(z)\in Q}KL(q(z)||p(z|x))\] Now, we have: \[q^*(z)=argmax_{q(z)\in Q}\mathcal{L}(q)\]
To conclude, let us rewrite the ELBO (\(\mathcal{L}(q)\)) in a more tractable form, one that you will see more often in papers and the literature:
\[ELBO(q)= \mathbb{E}\left[log\left(p(z,x)\right)\right] - \mathbb{E}[log(q(z))]\]
Another important form is the one in terms of the KL-divergence. Directly quoting from 14:
Examining the \(ELBO\) gives intuitions about the optimal variational density. We rewrite the ELBO as a sum of the expected log-likelihood of the data and the KL divergence between the prior \(p(z)\) and \(q(z)\),
\[ELBO(q)=\mathbb{E}[log(p(x|z))] - KL(q(z)||p(z))\]
This will be the form we will use when optimizing Variational Autoencoders and their derivatives.
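To make the KL-divergence concrete, here is a small numerical sketch (assuming NumPy and SciPy are available; the chosen Gaussian parameters are arbitrary): for two univariate Gaussians the KL has a closed form, which we can verify against a Monte Carlo estimate of the expectation defined above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Surrogate q(z) = N(0.5, 1^2) and target p(z) = N(0, 2^2)
mu_q, s_q = 0.5, 1.0
mu_p, s_p = 0.0, 2.0

# Closed-form KL(q || p) for univariate Gaussians
kl_closed = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

# Monte Carlo estimate of E_{z~q}[log q(z) - log p(z)]
z = rng.normal(mu_q, s_q, size=100_000)
kl_mc = np.mean(norm.logpdf(z, mu_q, s_q) - norm.logpdf(z, mu_p, s_p))

print(f"closed form: {kl_closed:.4f}, Monte Carlo: {kl_mc:.4f}")  # ~equal
```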
1.11. Variational Autoencoder
The Variational Autoencoder is a variation of the standard autoencoder architecture. The first and crucial difference is that instead of mapping one data point to one latent point, it encodes the data point into a distribution, thereby making both encoder and decoder probabilistic instead of deterministic (as in standard autoencoders).
Figure 7: Variational autoencoder vs Standard Autoencoder
Therefore, instead of returning a single point, the encoder will return a mean and a variance (the log variance is used in practice for a more stable and reliable learning process).
Figure 8: Variational autoencoder architecture
Given this change in architecture, the loss function also has to change to address the new needs.
In particular, the loss function will now consist of two parts: a reconstruction error term and a regularization term.
The reconstruction term will be the same as before (e.g. the L2 distance between the input and the output).
The regularization term, on the other hand, will be the KL-divergence between the current latent space distribution and the "wanted" one.
NOTE having two terms in the loss function clearly highlights the tradeoff between forcing the latent distribution and having a good reconstruction error. This tradeoff will be the main focus of the VAE architectures discussed later.
The intuition behind the regularization term is that we want the latent space distribution to have two properties: Continuity (two close points in the latent space must be mapped to close points in the output) and Completeness (sampling from the latent distribution must return some "meaningful" output).
As we can see, we are entering a probabilistic framework. Here, the previous chapter on Variational inference comes in handy.
We can see the decoder as \(p(x|z)\) and the encoder as \(p(z|x)\). As seen before, we can express \(p(z|x)\) as: \[p(z|x) = \frac {p(x|z)p(z)}{p(x)}\]
That said, as before, \(p(z|x)\) is intractable; however, in this architecture we will assume that our surrogate posterior \(q_x(z)\) is a Gaussian specified by \[q_x(z)=\mathcal{N}(g(x),h(x))\] where \(g\in \mathcal{G}\) and \(h\in \mathcal{H}\). \(\mathcal{G}\) and \(\mathcal{H}\) represent families of functions which will be approximated by the encoder.
To find such functions we can use the last equation in the previous chapter:
\[(g^*,h^*)=argmax_{(g,h)\in\mathcal{G}\times\mathcal{H}}\mathbb{E}[log(p(x|z))] - KL(q(z)||p(z))\]
It is important to notice that \(p(x|z)\) is our decoder, and therefore we will approximate it by minimizing the reconstruction error.
So we can finally rewrite the above equation to also consider the decoder part (again \(\mathcal{F}\) is a family of functions): \[(f^*,g^*,h^*) =argmax_{(f,g,h)\in\mathcal{F}\times\mathcal{G}\times\mathcal{H}}\mathbb{E}\left[-\frac{||x-f(z)||^2}{2c}\right] - KL(q(z)||p(z))\]
Of course, this is only one possibility; we can change the reconstruction error according to the type of data or the results we want to achieve.
Some references: 15, Understanding Variational Autoencoders
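For concreteness, here is a minimal sketch of the resulting loss, assuming PyTorch, a standard normal prior \(p(z)=\mathcal{N}(0,I)\), an L2 reconstruction term, and user-defined encoder/decoder modules (all names are illustrative):

```python
import torch
import torch.nn.functional as F

# Sketch of the VAE loss: the encoder returns (mu, log_var) of a diagonal
# Gaussian q(z|x); `encoder` and `decoder` are assumed user-defined modules.
def vae_loss(encoder, decoder, x):
    mu, log_var = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    # keeps the sampling step differentiable w.r.t. the encoder parameters.
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    recon = F.mse_loss(decoder(z), x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```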
1.12. Adversarial Autoencoder
The Adversarial Autoencoder is a variation on the standard autoencoder. It aims to induce some prior distribution on the latent space, semantically pursuing the same idea as the VAE. However, this architecture exploits a completely different mechanism: it uses adversarial learning to force such a distribution. It takes the idea from Generative Adversarial Networks (GANs) (Goodfellow et al., 2014).
Figure 9: Adversarial Autoencoder structure
Therefore, the idea is to have a discriminator network (\(\mathcal{D}\)) which tries to distinguish between the real prior distribution (\(p(z)\)) and the actual latent space distribution (\(q(z|x)\)). In an adversarial learning process, the encoder is encouraged to fool the discriminator network, thereby bringing \(q(z|x)\) closer to the actual prior (\(p(z)\)). In this way, this architecture achieves results similar to the VAE architecture. The main advantage over the VAE is that we do not need a functional form of the prior to enforce it; we only need to be able to sample from it. The error function of this architecture is quite similar to the VAE's, with the main difference that instead of the regularization term (i.e. the KL-divergence) we have the adversarial training procedure.
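A minimal sketch of the adversarial regularization phase, assuming PyTorch and user-defined encoder, discriminator (outputting a probability) and sample_prior objects (all names are illustrative; the reconstruction phase is the same as in a standard autoencoder):

```python
import torch
import torch.nn.functional as F

# Sketch of the AAE regularization phase. `encoder` and `discriminator` are
# assumed user-defined modules, and `sample_prior(shape)` draws samples from
# the chosen prior p(z) — note we only need to sample from it.
def adversarial_step(encoder, discriminator, x, sample_prior):
    z_fake = encoder(x)                  # codes from q(z|x)
    z_real = sample_prior(z_fake.shape)  # samples from p(z)

    # Discriminator loss: tell prior samples (label 1) from codes (label 0)
    d_real, d_fake = discriminator(z_real), discriminator(z_fake.detach())
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

    # Encoder ("generator") loss: fool the discriminator into labelling
    # its codes as prior samples, pulling q(z|x) towards p(z)
    g_loss = F.binary_cross_entropy(discriminator(z_fake),
                                    torch.ones_like(d_fake))
    return d_loss, g_loss
```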
1.13. β-VAE
The β-VAE is a variation on the VAE architecture that slightly changes the optimization problem by modifying the objective function. In particular, it adds a hyperparameter β on the KL-divergence term in order to arbitrarily force the latent space distribution to match the prior. With β = 1 we have the same formula as the VAE, while with β >> 1 the KL constraint is enforced more strongly. In this way, following the paper 16, we obtain more disentangled representations.
In the same paper, they also show that β is not strictly bounded from above; the only limitation is that β >= 0.
Of course, with a higher β we are also trading off reconstruction error against similarity to the prior.
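Relative to the VAE loss sketched earlier, the only change is a reweighting of the KL term (a sketch; recon and kl are computed exactly as in the VAE example, and beta is the new hyperparameter):

```python
# Sketch: relative to vae_loss above, β-VAE only reweights the KL term.
# beta = 1 recovers the standard VAE; beta >> 1 enforces the prior (and,
# following 16, disentanglement) more strongly at the cost of reconstruction.
def beta_vae_loss(recon, kl, beta=4.0):
    return recon + beta * kl
```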
1.14. Info VAE
The InfoVAE was first presented in 17. This architecture tries to make explicit the ELBO tradeoffs that in other architectures are implicit. To do so, it adds a new term and 2 hyperparameters, α and λ. The new term in the ELBO objective is the amount of mutual information between \(x\) and \(z\) under \(q\); it should prevent \(x\) and \(z\) from becoming completely independent. The α hyperparameter controls this value. The λ hyperparameter, on the other hand, controls the KL divergence.
Therefore the objective function is: \[\mathcal{L}_{infoVAE} = - \lambda D_{KL}(q_\phi(z)||p(z)) - \mathbb{E}_{q(z)}\left[D_{KL}(q_\phi(x|z)||p_\theta(x|z))\right] + \alpha I_q(x;z)\] where \(I_q(x;z)\) is the mutual information between \(x\) and \(z\) under the distribution \(q_\phi(x,z)\). However, it is not possible to optimize this objective directly. Therefore an alternative but equivalent formulation is: \[\mathcal{L}_{infoVAE} = \mathbb{E}_{p_{\mathcal{D}}(x)}\mathbb{E}_{q_\phi(z|x)}[log p_\theta(x|z)] - (1-\alpha) \mathbb{E}_{p_{\mathcal{D}}(x)}D_{KL}(q_\phi(z|x)||p(z)) - (\alpha + \lambda -1) D_{KL}(q_\phi(z)||p(z))\]
It is important to notice that this formulation of the objective captures all the previously seen autoencoder architectures.
We get the standard VAE when α = 0 and λ = 1.
We get the β-VAE when λ > 0 and α + λ - 1 = 0.
Finally, we get the AAE when α = 1 and λ = 1 and \(\mathcal{D}\) is chosen to be the Jensen Shannon divergence.
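As a small sanity-check sketch (an illustrative helper, not from the paper), the weights of the three terms in the rewritten objective as a function of α and λ:

```python
# Illustrative helper (not from the paper): weights of the three terms in the
# rewritten InfoVAE objective — reconstruction, per-sample KL(q(z|x)||p(z)),
# and aggregate divergence D(q(z)||p(z)) — as a function of (alpha, lam).
def infovae_weights(alpha: float, lam: float) -> dict:
    return {"reconstruction": 1.0,
            "kl_per_sample": 1.0 - alpha,
            "divergence_aggregate": alpha + lam - 1.0}

print(infovae_weights(alpha=0.0, lam=1.0))  # standard VAE: weights 1, 1, 0
print(infovae_weights(alpha=0.5, lam=0.5))  # β-VAE case: aggregate term vanishes
```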
1.15. INPROGRESS Random thought
Based on (towards ), why not train the AE and RL together? One idea can be to mix the two objectives together via a sum (in the trivial case). Other ideas can be to scale the objective of the AE based on time, or based on how "good" the current representation is; for example, we can use one of the disentanglement metrics described in the papers. Another interesting idea is to use some metric of the representation to guide the exploration policy of the RL agent. Something like: there are points/areas of the latent space which we did not explore yet, or for which we maybe have a low "bias" (we encountered those states just a few times).
It seems that active perception is quite crucial to learn a good and useful representation of the world, and more importantly to learn the invariants of the world. Maybe this has something to do with active learning and GFlowNets by Bengio! More investigation is needed.
2. Overview
2.1. Project scope
The project aims to build an autoencoder for dimensionality reduction. In particular, this will be used to hopefully enhance the performance of a DRL algorithm for the opensim-rl simulation and to enhance the ability to generalize across different environments. In this project, different types of autoencoders will be tested.
2.2. Index:
2.2.1. Lines of Thoughts
A kind of overview of what the thesis is about (i.e. autoencoders and dimensionality reduction)
2.2.2. Papers
A list of all the background literature found. For each paper there is a short description of its aims and results.
2.2.3. Autoencoders
Contains a list of autoencoders, implemented and not
2.2.4. Ideas for Hyperparameters
Contains some ideas for hyperparameter fitting and some possibly clever ideas
2.2.5. Code
Contains the git repo and the link for the code documentation
2.2.6. Todos
Contains a list of different types of todos
2.2.7. Presentations
Contains all the presentations done or in progress
2.2.8. Contacts
Self-explanatory
3. Ideas for Hyperparameters
Since we will have to fit quite a lot of hyperparameters, we tried to come up with some clever ideas to remove some of them.
3.1. Number of neurons per layer
The first hyperparameter we would like to remove is the number of neurons per layer. Since we are building an autoencoder, and therefore trying to find a compression function f, we can assume that the number of neurons per layer is defined by some function h that, given the number of layers N, the number of dimensions of the input I and the final number of dimensions of the latent space Z, returns the number of neurons for a specific layer. This function can be either linear or non-linear. The first intuition is that if h is linear it should be somewhat easier to learn a good compression function f. However, as of now, we do not have any mathematical background for this intuition! We need to do more research!
The first possible implementation of this function is defined as follows (i is the index of the layer for which we are computing the number of neurons):
\[n_1 = I\] \[n_N = Z\] \[n_i = n_{i+1}*\lambda\] From these equations we can quite intuitively find the equation which defines the value of λ: \[ \lambda = \sqrt[N-1]{\frac{I}{Z}} \] Now we can derive the number of neurons of each layer given the number of neurons of the first and last layers: \[ n_i = n_N * \prod_{x=1}^{N-i} \lambda \] \[ n_i= n_N * \lambda^{N-i}\] Substituting λ with the previously found equation: \[ n_i = n_N * \left(\sqrt[N-1]{\frac{I}{Z}}\right)^ {N-i} \] \[ n_i = n_N * \left(\frac{I}{Z}\right)^{\frac{N-i}{N-1}} \] \[ n_i = Z * \left(\frac{I}{Z}\right)^{\frac{N-i}{N-1}} \]
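A small sketch implementing the final formula (the function name and the rounding to whole neurons are illustrative choices):

```python
def layer_sizes(input_dim: int, latent_dim: int, n_layers: int) -> list:
    """Geometric interpolation of layer widths:
    n_i = Z * (I / Z) ** ((N - i) / (N - 1)), so n_1 = I and n_N = Z."""
    assert n_layers >= 2, "need at least an input and a latent layer"
    ratio = input_dim / latent_dim
    return [round(latent_dim * ratio ** ((n_layers - i) / (n_layers - 1)))
            for i in range(1, n_layers + 1)]

print(layer_sizes(100, 4, 5))  # e.g. [100, 45, 20, 9, 4]
```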
4. ML Pipeline
5. Autoencoders
5.1. DONE Vanilla
The vanilla autoencoder is the classical one: composed of an encoder and a decoder without any kind of constraint.
5.2. DONE VAE
The Variational Autoencoder is a modified version of the vanilla AE which forces the distribution of the latent space to be a Gaussian.
5.3. DONE AVB
Adversarial Variational Bayes is a relatively new idea which exploits some Bayesian concepts to force a particular latent space distribution. Everything is done in an adversarial setting.
5.4. HOLD B-VAE
Quite interesting if the focus is on transfer learning and disentangled representations
5.5. HOLD InfoVae
6. Papers
6.1. Common knowledge resources
6.2. General papers:
6.2.1. Interpretable machine learning: Fundamental principles and 10 grand challenges11
Quite interesting paper about the 10 grand problems/challenges in the current state of machine learning research. Note the 6th one is about unsupervised disentanglement of neural networks.
6.2.2. Variational Inference: A Review for Statisticians14
Review paper on Variational Inference; explains the theory in mathematical detail and compares it with MCMC (Markov chain Monte Carlo) methods.
6.3. Autoencoders:
6.3.1. Recent advances in Autoencoder-based Representation Learning 18
Interesting paper which reviews a lot of up-to-date techniques for autoencoders and disentangled representations.
6.3.2. β-VAE: Learning basic visual concepts with a constrained variational framework16
- Summary
- introduces β-VAE
- shows how to achieve better disentangled representations
- introduces a new hyperparameter (β) which forces the posterior to be closer to the isotropic unit Gaussian (prior)
- introduces a new metric for measuring disentanglement (independence + interpretability)
- reconstruction error should not be used to discriminate between AEs (or at least it should not be the only factor), see conclusions
- Interesting references (usef
6.3.3. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks 19
- Summary
- Introduces AVB
- Adversarial procedure
- focuses on giving better flexibility to the normal VAE procedure
- Quite interesting approach; however, it does not focus on disentangling the representation, so, even though it achieves on average better results than a normal VAE, maybe it is not suitable in an RL framework. Testing is needed to assess the performance.
- Resources
6.3.4. DISCARDED Generalized Autoencoder: A Neural Network Framework for Dimensionality Reduction20
6.3.5. Tutorial on Variational Autoencoders21
Quite useful tutorial; explains informally what the VAE tries to achieve and how it does it.
6.3.6. Auto-Encoding Variational Bayes 15
6.3.7. InfoVAE: Balancing Learning and Inference in Variational Autoencoders 17
- Summary
- Shows crucial downfalls of the ELBO objective.
- Introduces a new objective.
- Quite interesting because it tries to balance the mutual information between X and Z while trying to force the posterior distribution towards a family of distributions (e.g. Gaussian)
- It is a generalization of the β-VAE, VAE and Adversarial AE
- Also shows how it is possible to change the divergence metric in this new objective function (the only requirement is that it must be a strict divergence metric (i.e. D(p,q) = 0 iff p(x) = q(x)))
- This new objective also focuses on learning disentangled representations
- REALLY INTERESTING, in particular in an RL framework, since it parametrizes both mutual information
6.3.8. Adversarial Autoencoders 22
Figure 10: Adversarial Autoencoder structure
- Summary
- introduces formally the AAE
- The main difference between VAE and AAE is that the KL term (or cross entropy) is replaced with a discriminator network to enforce an adversarial learning process
- Moreover, in contrast to the VAE, we do not need a functional form of the posterior distribution we want to force. We just need to be able to sample from it.
- The objective of the AE is both to minimize the reconstruction error and to fool the discriminator as well as possible.
- The discriminator is used to discriminate between the wanted posterior and the actual latent space distribution
- Quite interesting; however, it does not look like it focuses on disentangled representations but on the reconstruction error, which maybe is not suitable if the main point is to use it within an RL framework. However, it can be intere
6.3.9. Learning representations by maximizing mutual information in variational autoencoders 23
- Summary
- Quite interesting new architecture (similar approach to 17)
- Again marks the fact that the ELBO and KL aim to decrease the mutual information between the input and the latent representation, which can result in quite bad representations
- for future
6.3.10. HOLD The information autoencoding family: A lagrangian Perspective on latent variable generative models 24
6.3.11. HOLD CausalVAE: Disentangled Representation Learning via Neural Structural Causal Models 25
6.3.12. Life-Long Disentangled representation learning with Cross-domain Latent Homologies 26
Really interesting technique for life-long learning with AEs. It introduces the VASE architecture, which aims to be able to learn in a continuously changing environment. More specifically, it aims to transfer previously learned knowledge to new environments when possible, using new latent space when needed. It seems a really interesting architecture for active perception in an RL framework. Todo search for further research using such method. Todo read again and understand the math deeply.
6.3.13. TODO Unsupervised model selection for variational disentangled representation learning 27
6.4. Dimensionality reduction:
6.4.1. Auto-encoder based dimensionality reduction 2
Contributions
We start from auto-encoder and focus on its ability to reduce the dimensionality, trying to understand the difference between auto-encoder and state-of-the-art dimensionality reduction methods. The results show that auto-encoder indeed learn something different from other methods.
We preliminarily investigate the influence of the number of hidden layer nodes on the performance of auto-encoder on MNIST and Olivetti face datasets. The results reveal its possible relation with the intrinsic dimensionality.
- Summary
Shows a comparison of autoencoders and other dimensionality reduction methods (e.g. PCA, LLE). Notable results: autoencoders learn something different from other dimensionality reduction methods and can potentially detect repetitive structures. The dimensionality of the latent space works best when it matches the intrinsic dimensionality of the dataset.
- Opinions
This clearly shows how autoencoders can be essentially different from, and more useful than, other dimensionality reduction methods. This consolidates the choice of usi
6.4.2. Dimensionality Reduction of SDSS Spectra with Variational Autoencoders 3
- Summary
Shows how AEs were already used in astronomy for different dimensionality reduction/classification tasks with success. Moreover, it aims to address the limitations of PCA using a VAE. Results show how, on this dataset (SDSS, Sloan Digital Sky Survey), the autoencoder outperforms PCA, in particular with a low-dimensional latent space (or few components for PCA). They mainly use InfoVAE 17, a variant of the VAE focused on trying to disentangle (i.e. force mapping different inputs to disjoint distributions) the different latent dimensions.
6.4.3. DISCARDED Dimensionality reduction for EEG-based sleep stage detection: comparison of autoencoders, principal component analysis and factor analysis 28
It was DISCARDED because it contains content that is too specific, and the comparison between algorithms has multiple steps and variables which are highly specific to the task. Therefore I do not think it should b
6.4.4. A deep adversarial variational autoencoder model for dimensionality reduction in single-cell RNA sequencing analysis 4
Figure 11: Adversarial Variational Autoencoder with dual matching
- Summary
It introduces a novel adversarial autoencoder architecture named AVE-DM (Adversarial Variational autoEncoder with Dual Matching). The main difference between this new architecture and the previously proposed adversarial autoencoders 22 is that it has 2 discriminators (hence the name dual matching). Main results: shows how AVE-DM outperforms other state-of-the-art methods such as PCA, UMAP (Uniform Manifold Approximation and Projection), t-SNE (T-distributed Stochastic Neighbor Embedding) etc. Note: the interesting part of the data is that it has dropout events (the reason for them is quite technical and specific to RNA sequencing). These dropout events are zero-expression measurements that can be either biological or technical. This phenomenon results in poor results
6.5. Disentangled representation:
6.5.1. Understanding disentangling in β-VAE 29
6.5.2. Towards a Definition of Disentangled Representations 12
6.5.4. DISCARDED On the binding Problem in Artificial Neural Networks 30
Really interesting paper about the binding problem. It does talk about representation learning and disentangled representations; however, it goes way beyond the scope of this thesis, talking about hierarchical representations, object detection and classification etc. Really interesting for future research!
6.6. Continual Learning:
6.6.3. TODO Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges 35
6.7. AE + RL:
6.7.1. VARL: a variational autoencoder‑based reinforcement learning Framework for vehicle routing problems 6
Quotes
It inherits the idea of variational inference to use a distribution to approximate the posterior distribution 15. The difference is that VAE considers the posterior distribution of all data simultaneously and approximates each posterior distribution with a distribution, minimizing KL divergence.
It has many advantages, including fast training, stability, and so on, so it has a wide range of theoretical models and industry applications.
- Summary
Introduces a new variational framework for combinatorial optimization (e.g. TSP). Introduces variational inference and VAEs. Proposes VARL (Variational Autoencoder-based Reinforcement Learning), which exploits variational inference ideas to learn a solution efficiently and effectively in a graph-based framework.
- Opinions
The main downside is that the VARL architecture seems to be quite complex and specific to combinatorial optimization. That said, it also shows how variational inference can be effectively used in combination with reinforcement learning (in the paper R
6.7.2. Robot skill learning in latent space of a deep autoencoder neural network 7
- Summary
- Gaussian Process Regression (GPR) for statistical learning
- Shows that Autoencoder-based latent space is more effective than PCA-based latent space
- Tanh as hidden activation function and linear for output
- Interestingly enough, the AE with only linear activation functions still performs better than PCA (the researchers state that this may be because AE latent space dimensions do not have to be orthogonal (interesting!))
- RL converges faster and is more stable in the latent space than in the DMP space (this is true for both PCA and AE)
- Moreover, RL+AE outperforms RL+PCA! The researchers state that this is probably due to the non-linearity of the AE
- In particular, the introduction section has a lot of good references which are worth looking into
6.7.3. AutoEncoder-based Safe Reinforcement Learning for Power Augmentation in a Lower-limb Exoskeleton 1
- Summary
- GPR used to generate data given few real-world examples.
- AE both for action and state space reduction.
- Action and states are DMP
- MSE as loss function
- This paper also shows that an AE-based latent space makes the RL learning process faster, safer and more stable.
- Note: in this paper HMI (Human-Machine Interaction) played a central role in the optimizat
6.7.4. The Dreaming Variational Autoencoder for Reinforcement Learning Environments 10
- Summary
- NOTE Gaussian distributed policy for initial state-space exploration
- Introduces DVAE architecture
- Main aim is to model environments with sparse rewards in order to perform offline RL.
- Main problem: if exploration in the real environment is costly, then this technique does not behave like the environment in the unexplored states.
- Quite interesting; however, not fitting for highly complex and continuous environments where exploration is costly and/or
6.7.5. Deep Variational Reinforcement Learning for POMDPs 9
- Summary
- Defines a new RL method with an AE
- Interestingly, a new method for approximating the ELBO is introduced, using MC methods.
- NOTE an interesting part is that the latent space mapping and the policy are learned together. However, the policy is updated more frequently than the latent space, which stabilizes the learning process.
- Only tested in "trivial" environments (though most of them are continuous)
- Quite interesting, but it seems too complex, and more research must be done if we want to use this approach.
- Contains some use
6.7.6. On the use of Deep Autoencoders for Efficient Embedded Reinforcement Learning 8
- Summary
- Shows how using an AE-based latent space reduces the time to convergence. Moreover, it also produces more viable policies faster.
- The main downside with respect to the thesis is that a big part of this advantage is due to the fact that images were the input to the AE. Of course, this is the main reason why the vanilla RL performs drastically worse than the one with the convolutional AE.
- However, it seems another good source of information which again shows that an AE-based latent space increases the RL performance and decreases the time
6.7.7. DARLA: Improving Zero-Shot Transfer in Reinforcement Learning 5
- Summary
- REALLY INTERESTING PAPER!
- formally shows how to do zero-shot transfer on MDPs
- introduces the new concept DARLA (DisentAngled Representation Learning Agent)
- 3 steps:
- Learn to see: train the AE with some fixed policy. Crucially, the distribution of the data collected in this phase must be as varied as possible in order to train the AE appropriately.
- Learn to act: train the RL agent on the source domain using the latent space of the AE
- Transfer: test the RL agent on the target domain without any further fine-tuning
- Uses β-VAE which aims to force the learning of a disentangled representation
- Resources
6.8. Old ideas
6.8.1. Deep reinforcement learning for modeling human locomotion control in neuromechanical simulation 38
- Summary
Review paper; it introduces the topic in general, illustrates previous methodologies and then moves on to Deep RL. It talks about the Learn to Move competition and the different techniques used in that competition. Finally, some future directions.
- Opinions
This paper is really interesting in particular the part about Learn to Move and future directions.
- Ideas
It suggests imitation learning and hierarchical
6.9. TODO Old work
6.9.1. Level ground walking for healthy and transfemoral amputee models. Deep reinforcement learning with phasic policy gradient optimization 39
6.9.2. Deep reinforcement learning for physics-based musculoskeletal model of a transfemoral amputee with a prosthesis walking on uneven terrain 40
6.9.3. Deep reinforcement learning for physics-based musculoskeletal simulations of healthy subjects and transfemoral prostheses' users during normal walking 41
6.9.4. Learning to walk: Phasic Policy Gradient for healthy and impaired musculoskeletal models 42
6.9.5. Evaluating Deep Reinforcement Learning Algorithms for Physics-Based Musculoskeletal Transfemoral Model with a Prosthetic Leg Performing Ground-Level Walking 43
6.9.6. Deep Reinforcement Learning for Physics-based Musculoskeletal Simulations of Transfemoral Prosthesis' Users during the Transition between Normal Walking and Stairs Ascending 44
6.9.7. Testing For Generality Of A Proximal Policy Optimiser For Advanced Human Locomotion Beyond Walking 45
7. Presentations
7.1. First one
7.2. Second one
8. INPROGRESS Future Research
- test other disentanglement AE/GAN architectures (e.g. FactorVAE, CausalVAE, DreamingVAE).
- explicitly focus on transfer (maybe with fine-tuning instead of zero-shot)
- test different DRL algorithms to see how this impacts the performance
- test different methods of training the AE (for example on-line; see the active perception frameworks and/or active learning currently in development by Microsoft Research, closely followed by Yoshua Bengio)
- In general, the idea of representation learning + DRL seems to be a really interesting and not fully explored path. (see The Consciousness Prior by Yoshua Bengio)
9. INPROGRESS Code
Visit this file to see the current documentation
WORK IN PROGRESS FIXME the documentation is in progress and may potentially be out of date
9.1. REPOS
10. Contacts
10.1. Massimiliano Falzari
10.2. Professor
- Raffaella Carloni
- SkypeID rafficar
- BlueJeans https://bluejeans.com/821650990 id number = 821650990
10.3. Other Students
- [email protected]
- [email protected] -> Master student for dim red
- Chadan