Constraints as Guidance, Not Hindrance
When I first learned about the idea of constraint solving (I believe it was in a class called "Search and Optimization" taught by professor Sicun Gao in UCSD CSE), I wasn't as excited about it as I am now. I thought of constraint solving as just a small sub-branch of optimization, but I was wrong. It is actually the "class" from which any other type of optimization, reasoning, or machine learning technique can be generated when you look at it from the right perspective. Quoting professor Gao himself: "All you need is constraint solving". If I want to summarize the ideas in this article, it would be:
- Theoretically, constraints greatly reduce our search problem: among all the answers we could search for, we only care about the ones that match our constraints.
- Practically, "doability" is always part of the puzzle. In theory we can do everything, but in practice we can't, so how do we optimize with these constraints not as a hindrance but as a helper?
This article tries to go over some fairly complicated topics, focusing more on the intuition behind them and less on the mathematical derivations.
Standard Model (SM) Constraint
(I should probably have put the section describing the Variational Autoencoder first, since it would be an easier introduction to Expectation Maximization, but I think that starting with the Standard Model illustrates my points much more clearly.)
From a theoretical perspective, looking at constraints is very interesting. I come from a background of reinforcement learning, and there we like to frame our problems as search & optimization problems. How do we find the best path in a tree (the tree here is not an actual search tree but an abstract concept of one)? Or how do we find the optimal trajectory to the maximum of an objective surface that changes over time as we gain more and more knowledge of it? From a classical machine learning perspective, learning is about finding patterns in the dataset; no matter whether you are doing supervised or unsupervised learning, it is all about looking for useful patterns and characteristics in the dataset. Things got particularly interesting when I took a class named "Machine learning with few labels" taught by professor Zhiting Hu in UCSD HDSI. One of the major works he has done is establishing a standard model for machine learning, similar to how physicists established the Standard Model of particles. As he always said in this class:
We want to study machine learning like chemistry, to understand and construct upon each other, not alchemy.
The main reason for mentioning the SM is the Standard Equation (SE) proposed in the same paper, which unifies all types of learning and can be derived into any specific learning algorithm under a particular configuration. It can be written as follows, and notice that what we get is a constrained optimization problem:

\[
\min_{q,\theta,\xi}\; -\alpha\,\mathcal{H}(q) \;+\; \beta\,\mathcal{D}\big(q,\, p_{\theta}\big) \;+\; U(\xi)
\]

subject to

\[
-\mathbb{E}_{q}\big[f_k^{(\theta)}\big] \;\le\; \xi_k,\qquad \xi_k \ge 0,\qquad k = 1,\dots,K
\]
Here the first term is the Uncertainty function \(\mathcal{H}(\cdot)\), which controls the compactness of the output model, for example by regulating the amount of allowed randomness while fitting experience. The second term is the Divergence function \(\mathcal{D}(\cdot,\cdot)\), which measures the distance between the target model to be trained and the auxiliary model, facilitating a teacher–student mechanism. Finally, the third term is the Experience function, introduced through a penalty term \(U(\xi)\), which incorporates the set of "experience functions" \(f_k^{(\theta)}\) representing external experience of various kinds for training the target model. With the following assumptions, we can frame this problem as an Expectation Maximization (EM) procedure (EM is itself a very interesting topic that I want to touch on a little; for now I have attached an example of EM for a binomial mixture model below).
Example of Expectation Maximization
Uncertainty: Maximizing \(\mathcal{H}(q)\) encourages the model to maintain a certain level of 'spread' or variability, implicitly allowing for more uncertainty in the distribution \(q\).
Divergence: Minimizing \(\mathcal{D}(q,p_\theta)\) pushes \(p_\theta\) to better match the distribution \(q\), effectively acting as a divergence measure.
Experience: \(f(t)\) encodes knowledge or constraints from external sources or data, influencing the target model’s learning process.
With these assumptions, the derived update takes the following form. In EM, the parameters themselves don't matter; the hidden distribution is what carries all the important information, and optimizing the parameters with respect to this distribution is enough. The updates below illustrate the analytically ideal \(q\) distribution that comes from the SE (incorporating uncertainty, divergence, and experience); with such a \(q\) distribution, we then directly run MLE on the parameters (which is the same as minimizing the KL divergence):

\[
\text{E-step:}\qquad q^{(t+1)}(x) \;\propto\; p_{\theta^{(t)}}(x)^{\beta/\alpha}\, \exp\!\big\{ f(x)/\alpha \big\}
\]

\[
\text{M-step:}\qquad \theta^{(t+1)} \;=\; \arg\max_{\theta}\; \mathbb{E}_{q^{(t+1)}}\big[ \log p_{\theta}(x) \big]
\]
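To make the E/M alternation concrete, here is a minimal sketch of EM for a two-component binomial mixture (the classic "two coins" setup). All function and variable names here are my own illustration, not from any library:

```python
import numpy as np

def em_binomial_mixture(counts, n_flips, n_iter=100):
    """EM for a 2-component binomial mixture.

    counts[i] = number of heads observed in the i-th batch of n_flips tosses.
    Returns estimated head probabilities (one per component) and mixture weight pi.
    """
    # Initialization must break the symmetry between the two components.
    p = np.array([0.3, 0.7])
    pi = 0.5
    counts = np.asarray(counts, dtype=float)

    for _ in range(n_iter):
        # E-step: posterior responsibility q(z | x_i) for each batch, from the
        # binomial likelihoods (binomial coefficients cancel in the ratio).
        lik = np.stack([
            pi       * p[0]**counts * (1 - p[0])**(n_flips - counts),
            (1 - pi) * p[1]**counts * (1 - p[1])**(n_flips - counts),
        ])
        q = lik / lik.sum(axis=0)  # shape (2, n_batches)

        # M-step: MLE of the parameters under the q distribution.
        pi = q[0].mean()
        p = (q @ counts) / (q.sum(axis=1) * n_flips)

    return p, pi

# Batches of 10 flips: half come from a ~0.2 coin, half from a ~0.8 coin.
heads = [2, 1, 3, 2, 8, 9, 7, 8]
p_hat, pi_hat = em_binomial_mixture(heads, n_flips=10)
```

On this toy data the recovered head probabilities land near 0.2 and 0.8: the E-step computes the posterior \(q\) over the hidden component, and the M-step is exactly the "MLE under \(q\)" described above.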
We won't dive into the specific formulation and mathematical derivation of this model (I have attached my notes attempting some of the derivations below), but notice that this is a constraint solving problem! We will go over one instance of the SM (MLE) to illustrate this idea.
From SM \(\rightarrow\) MLE:
For an arbitrary configuration \((x_0,y_0)\), its probability \(p_d(x_0,y_0)\) under the data distribution can be seen as measuring the expected similarity between \((x_0,y_0)\) and true data samples \((x^*,y^*)\), and can be written as:

\[
p_d(x_0,y_0) \;=\; \mathbb{E}_{(x^*,y^*)\sim p_d}\Big[ \mathbb{I}^{(x^*,y^*)}(x_0,y_0) \Big]
\]
Here the similarity measure is \(\mathbb{I}^{(x^*,y^*)}(x,y)\), an indicator function that takes the value 1 if \((x,y)=(x^*,y^*)\) and 0 otherwise. The experience function would thus be defined as:

\[
f(x,y) \;=\; \log p_d(x,y)
\]
Plugging this into the teacher model for the expected \(q\)-distribution, we have:

\[
q(x,y) \;=\; \frac{\exp\big\{ \beta \log p_{\theta}(x,y) + f(x,y) \big\}}{Z}
\]
Notice that under this condition we don't care about the \(\beta \log p_{\theta}(x,y)\) term (derived from the divergence): we don't care about the distance between the distribution \(q\) and \(p_{\theta}(x,y)\); we just want to fit the data (which is what MLE does). Setting \(\beta = 0\):

\[
q(x,y) \;=\; \frac{\exp\{ f(x,y) \}}{Z} \;=\; p_d(x,y)
\]
Then we maximize

\[
\max_{\theta}\; \mathbb{E}_{p_d(x,y)}\big[ \log p_{\theta}(x,y) \big],
\]

where the derived \(q\) distribution is exactly the data distribution, since \(q\) is recovered from an indicator function on the data distribution. This is exactly the definition of MLE.
Essentially, the SE provides a new perspective that views all types of learning as instances of a constraint solving problem. If we say RL is a search and optimization process under the reward constraint, then we can also think of supervised learning not as finding patterns in data, but as an optimization, a search in the landscape of weights under the constraint of the data. Think of how we do projected gradient descent (hard constraint) or regularized gradient descent (Lagrangian constraint), where we project the steps onto a subspace satisfying the constraint (refer to this article). The constraint in the supervised setting is then just a strict constraint restricting all possible steps in weight space to the data space. For more information:
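As a toy illustration of the hard-versus-Lagrangian distinction, here is a small sketch (the quadratic objective and all names are hypothetical choices of mine) comparing projected gradient descent on the unit ball with its regularized counterpart:

```python
import numpy as np

# Toy objective: f(w) = ||w - target||^2, minimized subject to ||w|| <= 1.
# The unconstrained minimum (2, 0) lies outside the feasible set.
target = np.array([2.0, 0.0])

def grad(w):
    return 2 * (w - target)

# Hard constraint: projected gradient descent -- after every step,
# project w back onto the feasible set {w : ||w|| <= 1}.
w = np.zeros(2)
for _ in range(200):
    w = w - 0.1 * grad(w)
    norm = np.linalg.norm(w)
    if norm > 1.0:
        w = w / norm  # projection onto the unit ball
w_projected = w

# Soft (Lagrangian) constraint: penalize ||w||^2 instead of forbidding it,
# trading constraint satisfaction against objective value.
lam = 1.0
w = np.zeros(2)
for _ in range(200):
    w = w - 0.1 * (grad(w) + 2 * lam * w)
w_regularized = w
```

With this particular objective and \(\lambda = 1\), both variants end up near the boundary point \((1, 0)\); in general, though, projection guarantees feasibility at every step, while the Lagrangian version only balances constraint violation against the objective.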
Notes on Standard Model of Machine Learning
Variational Autoencoder (VAE) Constraint
We briefly mentioned EM in the previous section. It turns out that, in practice, how we optimize EM is also a constrained optimization process. To quickly recap, the key of vanilla EM is to gradually find a \(q\) (posterior) distribution that captures the hidden variables, and then maximize based on this \(q\) distribution. Usually, we optimize the Evidence Lower Bound (ELBO) objective (the equation shown here is the KL-divergence form):

\[
\mathrm{ELBO}(q,\theta) \;=\; \log p_{\theta}(x) \;-\; \mathrm{KL}\big( q(z) \,\|\, p_{\theta}(z\mid x) \big)
\]
I have also attached proofs deriving both the ELBO + KL divergence form and the ELBO + entropy form for reference in the links below:
Deriving such a posterior distribution in simpler examples such as the Binomial Mixture Model is tractable. However, deriving it in much more complicated situations can get very tedious (e.g., the posterior distribution of a Bayesian Mixture of Gaussians):
Naturally, a question we would ask is whether we can approximate such a posterior distribution. This is known as Variational Inference (as opposed to the exact inference we do in traditional EM). Traditional methods involve factorizing the posterior into many independent factor distributions (Mean Field) or using re-parametrization tricks to approximate any family of distributions (Black Box Inference). As neural networks became more popular, more modern approaches such as the Variational Autoencoder (VAE) emerged, which reframe the original ELBO objective in EM into a generative-model format with \(p_{\theta}(x|z)\) while keeping the same KL divergence constraint.
Thus, the whole VAE objective becomes the following:

\[
\max_{\theta,\phi}\; \mathbb{E}_{q_{\phi}(z\mid x)}\big[ \log p_{\theta}(x\mid z) \big] \;-\; \mathrm{KL}\big( q_{\phi}(z\mid x) \,\|\, p(z) \big)
\]
When taking the gradient, we can borrow the re-parametrization trick from Black Box Inference, where we treat the complicated distribution of \(z\) as a transformation of a simpler distribution that we know. We can then take the gradient of the encoder distribution \(q_{\phi}\) with respect to \(\phi\), and the gradient of the decoder distribution \(p_{\theta}\) with respect to \(\theta\).
Notice that this is a constraint solving problem again. We maximize the first term (decoder) to improve accuracy while minimizing the second term (encoder), which measures divergence. With more trials we get a better encoder, hence a better decoder, and hence a better encoder again. Following the same train of thought, further extensions of the VAE such as the Variational Information Bottleneck (VIB) can also be framed as constrained optimization.
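As a rough sketch of how the two terms and the re-parametrization trick fit together, here is a deliberately tiny numpy version with a linear Gaussian encoder and decoder on 1-D data. Every parameter name is illustrative; this is not a real VAE implementation, just the shape of the objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data; in a real VAE, x would be images or other high-dim inputs.
x = rng.normal(loc=2.0, scale=1.0, size=500)

# Encoder q_phi(z|x) = N(a*x + b, exp(log_s)^2); decoder p_theta(x|z) = N(c*z + d, 1).
a, b, log_s = 0.5, 0.0, 0.0   # "phi" parameters (assumed linear for simplicity)
c, d = 0.5, 0.0               # "theta" parameters

def elbo_terms(x, a, b, log_s, c, d, rng):
    mu = a * x + b
    s = np.exp(log_s)
    # Re-parametrization trick: z = mu + s * eps with eps ~ N(0, 1),
    # so the sampling step becomes differentiable in (a, b, log_s).
    eps = rng.normal(size=x.shape)
    z = mu + s * eps
    # Reconstruction term E_q[log p_theta(x|z)], up to additive constants.
    recon = -0.5 * np.mean((x - (c * z + d)) ** 2)
    # KL(q_phi(z|x) || p(z)) against a standard normal prior, in closed form.
    kl = np.mean(0.5 * (mu**2 + s**2 - 1.0) - log_s)
    return recon, kl

recon, kl = elbo_terms(x, a, b, log_s, c, d, rng)
elbo = recon - kl  # the VAE maximizes this; the KL acts as the constraint term
```

Here `recon` plays the role of the decoder objective and `kl` the encoder-side constraint; in a real VAE both maps would be neural networks trained by gradient ascent on `elbo`.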
Constraint Solving and Life
Maybe life itself can be framed as a constraint solving problem. Life is not a supervised learning problem --- you cannot copy someone else's "success model" and hope it works, let alone having an analytical solution. You face two kinds of constraints at every moment: the environment's constraints (the state you're in, the actions available to you, the transitions you cannot control) and the constraints you've built (your history, your experiences, the tree you have grown). Neither is a hindrance. Both are guidance.
The environment constrains you the way physics constrains optimization --- it defines the feasible region. You don't get to choose which state you're born into, which opportunities appear, or how the world responds to your actions. You have to work within what the environment gives you, just as every constraint solver must respect the boundaries of the problem. But with these constraints, the search direction is narrowed and more directed.
The tree you grow is the other constraint, and it is yours alone. You grow it independently from other trees, doing your own search and generating backtracking statistics along the way. With any new task you encounter or any new paradigm you step into, you don't try to graft yourself onto someone else's tree. Instead, you branch out a leaf from your own tree and see what the returning statistics tell you. The tree constrains your search in the best possible way: it tells you which directions have been explored, which returned high value, which led to dead ends. It narrows the search space not by limiting you but by informing you. You have traveled a long way, expanded the tree far, and everything works just fine. Trust your \(q\) function. Trust the constraint that your history and your tree have given you.
Don't be afraid to branch out to unknown places, since the key of going to the unknown is to make mistakes. No matter where you go, your tree leaves a mark in the space. The further you travel, the more marks you leave, and when they are abundant enough, they form a shape --- a more intricate structure --- that tells you who you are. The constraints don't shrink your world. They sharpen it.
References
- DSC 190, Zhiting Hu, UC San Diego
- CSE 258, Julian McAuley, UC San Diego
- Hu, Z., Xing, E.P. Toward a 'Standard Model' of Machine Learning. Harvard Data Science Review, 2021.