## Implicit function theorem and its multivariate generalization

The implicit function theorem for a single output variable can be stated as follows:

Single equation implicit function theorem. Let $F(\mathbf{x}, y)$ be a function of class $C^1$ on some neighborhood of a point $(\mathbf{a}, b) \in \mathbb{R}^{n+1}$. Suppose that $F(\mathbf{a}, b) = 0$ and $\partial_y F(\mathbf{a}, b) \neq 0$. Then there exist positive numbers $r_0, r_1$ such that the following conclusions are valid.

a. For each $\mathbf{x}$ in the ball $|\mathbf{x} - \mathbf{a}| < r_0$ there is a unique $y$ such that $|y - b| < r_1$ and $F(\mathbf{x}, y) = 0$. We denote this $y$ by $f(\mathbf{x})$; in particular, $f(\mathbf{a}) = b$.

b. The function $f$ thus defined for $|\mathbf{x} - \mathbf{a}| < r_0$ is of class $C^1$, and its partial derivatives are given by

$\partial_j f(\mathbf{x}) = -\frac{\partial_j F(\mathbf{x}, f(\mathbf{x}))}{\partial_y F(\mathbf{x}, f(\mathbf{x}))}$.
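To make the formula concrete, here is a quick numerical sanity check on a toy example of my own (the unit circle $F(x,y) = x^2 + y^2 - 1$, not from the text):

```python
import math

# Hypothetical example: F(x, y) = x^2 + y^2 - 1, with (a, b) = (0.6, 0.8),
# so F(a, b) = 0 and dF/dy = 2b = 1.6 != 0 — the theorem applies.
def F(x, y):
    return x**2 + y**2 - 1.0

def f(x):
    # The implicit function near (0.6, 0.8): the upper unit semicircle.
    return math.sqrt(1.0 - x**2)

a = 0.6
# Formula from the theorem: f'(x) = -F_x / F_y evaluated at (x, f(x)).
theorem_slope = -(2 * a) / (2 * f(a))

# Compare against a centered finite-difference quotient of f.
h = 1e-6
fd_slope = (f(a + h) - f(a - h)) / (2 * h)

assert abs(F(a, f(a))) < 1e-12
assert abs(theorem_slope - fd_slope) < 1e-8
print(round(theorem_slope, 6))  # -0.75
```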

Proof. For part (a), assume without loss of generality that $\partial_y F(\mathbf{a}, b) > 0$. By continuity, $\partial_y F > 0$ on some neighborhood of $(\mathbf{a}, b)$, so we may pick $r_1 > 0$ such that $F(\mathbf{x}, \cdot)$ is strictly increasing on $[b - r_1, b + r_1]$ for $\mathbf{x}$ near $\mathbf{a}$. In particular $F(\mathbf{a}, b - r_1) < 0 < F(\mathbf{a}, b + r_1)$, and by continuity of $F$ there is an $r_0 > 0$ such that $|\mathbf{x} - \mathbf{a}| < r_0$ still gives $F(\mathbf{x}, b - r_1) < 0 < F(\mathbf{x}, b + r_1)$. For each such $\mathbf{x}$, the intermediate value theorem together with strict monotonicity in $y$ (from positivity of $\partial_y F$) yields a unique $y$ with $|y - b| < r_1$ and $F(\mathbf{x}, y) = 0$, which defines a function $y = f(\mathbf{x})$.

To show that $f$ has partial derivatives, we must first show that it is continuous. For this, note that the construction above works equally well with any smaller $r_1$: given $\epsilon > 0$, run the same argument with $\min(r_1, \epsilon)$ in place of $r_1$ to obtain a corresponding $r_0$, which serves as our $\delta$.

For part (b), to show that the partial derivatives of $f$ exist and equal the claimed expressions, we perturb $\mathbf{x}$ by an increment $\mathbf{h}$, which WLOG (it suffices to treat one coordinate at a time) we take to be

$\mathbf{h} = (h, 0, \ldots, 0)$.

Then with $y = f(\mathbf{x})$ and $k = f(\mathbf{x}+\mathbf{h}) - f(\mathbf{x})$, we have $F(\mathbf{x} + \mathbf{h}, y+k) = F(\mathbf{x}, y) = 0$. Applying the mean value theorem to $t \mapsto F(\mathbf{x}+t\mathbf{h}, y+tk)$, we arrive at

$0 = h\partial_1F(\mathbf{x}+t\mathbf{h}, y + tk) + k\partial_y F(\mathbf{x}+t\mathbf{h}, y+tk)$

for some $t \in (0,1)$. Rearranging and taking $h \to 0$ gives us

$\partial_j f(\mathbf{x}) = -\frac{\partial_j F(\mathbf{x}, y)}{\partial_y F(\mathbf{x}, y)}$.

The foregoing can be generalized to systems of equations, with $k$ implicit functions and $k$ constraints.     ▢

Implicit function theorem for systems of equations. Let $\mathbf{F}(\mathbf{x}, \mathbf{y})$ be an $\mathbb{R}^k$-valued function of class $C^1$ on some neighborhood of a point $(\mathbf{a}, \mathbf{b}) \in \mathbb{R}^{n+k}$, and let $B_{ij} = (\partial F_i / \partial y_j)(\mathbf{a}, \mathbf{b})$. Suppose that $\mathbf{F}(\mathbf{a}, \mathbf{b}) = \mathbf{0}$ and $\det B \neq 0$. Then there exist positive numbers $r_0, r_1$ such that the following conclusions are valid.

a. For each $\mathbf{x}$ in the ball $|\mathbf{x} - \mathbf{a}| < r_0$ there is a unique $\mathbf{y}$ such that $|\mathbf{y} - \mathbf{b}| < r_1$ and $\mathbf{F}(\mathbf{x}, \mathbf{y}) = \mathbf{0}$. We denote this $\mathbf{y}$ by $\mathbf{f}(\mathbf{x})$; in particular, $\mathbf{f}(\mathbf{a}) = \mathbf{b}$.

b. The function $\mathbf{f}$ thus defined for $|\mathbf{x} - \mathbf{a}| < r_0$ is of class $C^1$, and its partial derivatives $\partial_j \mathbf{f}$ can be computed by differentiating the equations $\mathbf{F}(\mathbf{x}, \mathbf{f}(\mathbf{x})) = \mathbf{0}$ with respect to $x_j$ and solving the resulting linear system of equations for $\partial_j f_1, \ldots, \partial_j f_k$.
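As an illustration of part (b), here is a small worked example of my own (a hypothetical system with $n = 1$, $k = 2$, not from the text) where we differentiate $\mathbf{F}(x, \mathbf{f}(x)) = \mathbf{0}$ and solve the resulting linear system numerically:

```python
# Hypothetical system:
#   F1(x, y1, y2) = x^2 + y1^2 + y2^2 - 3
#   F2(x, y1, y2) = y1 - y2
# At (a; b) = (1; 1, 1) both equations vanish, and
#   B = [[dF1/dy1, dF1/dy2], [dF2/dy1, dF2/dy2]] = [[2, 2], [1, -1]]
# has det B = -4 != 0, so the theorem applies.
x, y1, y2 = 1.0, 1.0, 1.0
B = [[2 * y1, 2 * y2],
     [1.0, -1.0]]
dFdx = [2 * x, 0.0]  # (dF1/dx, dF2/dx)

# Differentiating F(x, f(x)) = 0 in x gives  B @ f'(x) = -dF/dx;
# solve this 2x2 system by hand.
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
rhs = [-dFdx[0], -dFdx[1]]
df1 = (rhs[0] * B[1][1] - B[0][1] * rhs[1]) / det
df2 = (B[0][0] * rhs[1] - rhs[0] * B[1][0]) / det

# Sanity check: eliminating by hand, f1(x) = f2(x) = sqrt((3 - x^2)/2),
# so f1'(1) = -1/(2*f1(1)) = -0.5.
assert abs(df1 + 0.5) < 1e-12 and abs(df2 + 0.5) < 1e-12
```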

Proof: For this we will be using Cramer’s rule: one can solve a linear system $A\mathbf{x} = \mathbf{y}$ (provided of course that $A$ is non-singular) by letting $x_i$ be the determinant of the matrix obtained by substituting $\mathbf{y}$ for the $i$th column of $A$, divided by the determinant of $A$.
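For the record, a minimal sketch of Cramer's rule as just described (with naive Laplace-expansion determinants, fine for small systems; the example matrix is my own):

```python
# Cramer's rule: to solve A x = y with A non-singular,
# x_i = det(A_i) / det(A), where A_i is A with its i-th column replaced by y.
def det(A):
    # Laplace expansion along the first row (fine for small matrices).
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * det(minor)
    return total

def cramer_solve(A, y):
    d = det(A)
    sol = []
    for i in range(len(A)):
        Ai = [row[:i] + [y[r]] + row[i + 1:] for r, row in enumerate(A)]
        sol.append(det(Ai) / d)
    return sol

# Check on a 3x3 system with known solution x = (1, 2, 3).
A = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 3.0]]
x_true = [1.0, 2.0, 3.0]
y = [sum(a * b for a, b in zip(row, x_true)) for row in A]
x = cramer_solve(A, y)
assert all(abs(u - v) < 1e-9 for u, v in zip(x, x_true))
```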

From this, we are somewhat hinted that induction on $k$ is in order. If $B$ is invertible, then at least one of the $(k-1) \times (k-1)$ submatrices obtained by deleting its last row and one of its columns is invertible, since otherwise every term of the cofactor expansion of $\det B$ along the last row would vanish. Assume WLOG (relabeling the $y_j$ if necessary) that this holds for the submatrix obtained by deleting row $k$ and column $k$. With this in mind, we can via our inductive hypothesis have

$F_1(\mathbf{x}, \mathbf{y}) = F_2(\mathbf{x}, \mathbf{y}) = \cdots = F_{k-1}(\mathbf{x}, \mathbf{y}) = 0$

determine $y_j = g_j(\mathbf{x}, y_k)$ for $j = 1, 2, \ldots, k-1$. Here we promote $y_k$ to an independent variable, which is legitimate because we are inducting on the number of constraints (and implicit functions): the first $k-1$ equations form a system of $k-1$ constraints in the $n+1$ independent variables $(\mathbf{x}, y_k)$. Substituting this into the remaining constraint $F_k$ reduces the problem to the single-equation case, with

$G(\mathbf{x}, y_k) = F_k(\mathbf{x}, \mathbf{g}(\mathbf{x}, y_k), y_k) = 0$.

It suffices now to show, via our $\det B \neq 0$ hypothesis, that $\frac{\partial G}{\partial y_k}(\mathbf{a}, b_k) \neq 0$. Routine application of the chain rule gives

$\frac{\partial G}{\partial y_k} = \displaystyle\sum_{j=1}^{k-1} \frac{\partial F_k}{\partial y_j} \frac{\partial g_j}{\partial y_k} + \frac{\partial F_k}{\partial y_k} = \displaystyle\sum_{j=1}^{k-1} B_{kj} \frac{\partial g_j}{\partial y_k} + B_{kk}. \ \ \ \ (1)$

The $\frac{\partial g_j}{\partial y_k}$s are the solution to the following linear system, obtained by differentiating the identities $F_i(\mathbf{x}, \mathbf{g}(\mathbf{x}, y_k), y_k) = 0$, $i = 1, \ldots, k-1$, with respect to $y_k$:

$\begin{pmatrix} \frac{\partial F_1}{\partial y_1} & \cdots & \frac{\partial F_1}{\partial y_{k-1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial F_{k-1}}{\partial y_1} & \cdots & \frac{\partial F_{k-1}}{\partial y_{k-1}} \end{pmatrix} \begin{pmatrix} \frac{\partial g_1}{\partial y_k} \\ \vdots \\ \frac{\partial g_{k-1}}{\partial y_k} \end{pmatrix} = \begin{pmatrix} -\frac{\partial F_1}{\partial y_k} \\ \vdots \\ -\frac{\partial F_{k-1}}{\partial y_k} \end{pmatrix}$.

Let $M^{ij}$ denote the $(k-1) \times (k-1)$ submatrix of $B$ obtained by deleting row $i$ and column $j$. We see then that in the replacement for Cramer’s rule, we arrive at what is $M^{kj}$ with its last column negated and swapped to the left $k-j-1$ times so that it lands in the $j$th column; each swap (and the negation) flips the sign of the determinant, which means

$\frac{\partial g_j}{\partial y_k}(\mathbf{a}, b_k) = (-1)^{k-j} \frac{\det M^{kj}}{\det M^{kk}}$.

Now, we substitute this into $(1)$ to get

\begin{aligned}\frac{\partial G}{\partial y_k}(\mathbf{a}, b_k) &= \displaystyle\sum_{j=1}^{k-1} (-1)^{k-j}B_{kj}\frac{\det M^{kj}}{\det M^{kk}} + B_{kk} \\ &= \frac{\sum_{j=1}^k (-1)^{j+k} B_{kj}\det M^{kj}}{\det M^{kk}} \\ &= \frac{\det B}{\det M^{kk}} \\ &\neq 0,\end{aligned}

where the second equality absorbs $B_{kk}$ as the $j = k$ term (note $(-1)^{k-j} = (-1)^{j+k}$) and the third is the cofactor expansion of $\det B$ along its $k$th row.
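As a sanity check on the cofactor bookkeeping, here is a quick numerical verification, on an arbitrary $3 \times 3$ matrix of my own choosing, that $\det B = \sum_{j=1}^k (-1)^{j+k} B_{kj} \det M^{kj}$:

```python
# Check the cofactor expansion along the last row: M^kj deletes row k and
# column j. Indices are 1-based in the text, 0-based in the code.
def det(A):
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] *
               det([row[:j] + row[j + 1:] for row in A[1:]])
               for j in range(n))

B = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
k = len(B)  # expand along the last row (index k-1 in 0-based terms)
expansion = sum(
    (-1) ** ((j + 1) + k) * B[k - 1][j] *        # sign (-1)^(j+k), 1-based j
    det([row[:j] + row[j + 1:] for row in B[:k - 1]])  # det M^kj
    for j in range(k)
)
assert abs(expansion - det(B)) < 1e-12
```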

Finally, we apply the single-equation implicit function theorem to $G$ to solve for the $y_k$ that remains, and substitute back into $\mathbf{g}$ to recover the full $\mathbf{y} = \mathbf{f}(\mathbf{x})$.     ▢

References

• Gerald B. Folland, Advanced Calculus, Prentice Hall, Upper Saddle River, NJ, 2002, pp. 114–116, 420–422.

## A nice consequence of Baire category theorem

In a complete metric space $X$, we call a point $x$ for which $\{x\}$ is open an isolated point. If $X$ is countable and has no isolated points, we can take $\displaystyle\bigcap_{x \in X} (X \setminus \{x\}) = \emptyset$, with each $X \setminus \{x\}$ open (singletons are closed in a metric space) and dense (since $x$ is not isolated), to violate the Baire category theorem. From that, we arrive at the proposition that a nonempty complete metric space with no isolated points is uncountable, and equivalently, that a countable complete metric space has an isolated point. For example, this gives another proof that $\mathbb{R}$ is uncountable, and shows that $\mathbb{Q}$ admits no complete metric compatible with its usual topology.

## Arzela-Ascoli theorem and delta epsilons

I always like to think of understanding the delta-epsilon definition of a limit as somewhat of an ideal dividing line on the cognitive hierarchy, between actually smart and pseudo smart. I still remember vividly struggling to grok it back in high school, when I first saw it junior year; by the summer after, though, it made sense why it was reasonable to define it that way. That such a definition was only established in the 19th century goes to show how unnatural abstract, precise definitions are for the human brain (more reason to use cognitive genomics to enhance it 😉 ). At that time, I could not have easily imagined that this limit definition could be generalized further, discarding the deltas and epsilons (which presuppose and restrict to real numbers), as it already felt abstract enough. Don’t even get me started on topological spaces, nets, filters, and ultrafilters; my understanding of them is still cursory at best, but someday I will fully internalize them.

Fortunately, I have improved significantly since then, both in experience and in biological intelligence, so much so that last night I was able to reconstruct in my head the proof of the Arzela-Ascoli theorem, which had also once upon a time been mind-boggling to me. Remarkably, that proof, viewed appropriately, comes naturally out of just a few simple, critical observations.

The statement of the theorem is as follows.

Arzela-Ascoli theorem. Let $F$ be a family of functions from $I = [a,b]$ to $\mathbb{R}$ that is uniformly bounded and equicontinuous. Then there is a sequence $\{f_n\}$ of elements of $F$ that converges uniformly on $I$.

The rationals are dense in the reals and make an excellent starting point. Uniform boundedness enables us to employ Bolzano-Weierstrass to construct a subsequence convergent at any single rational in $I$. Since there are only countably many rationals in $I$, we can iteratively filter to obtain convergence at every rational in $I$. Essentially, we fix an enumeration of the rationals in $I$ and build a chain of subsequences based on it, where the $n$th subsequence converges at the first $n$ rationals of the enumeration, and Bolzano-Weierstrass is applied to the values of that subsequence's functions at the $(n+1)$th rational to yield the next subsequence. Taking, diagonally, the $n$th function of the $n$th subsequence gives a sequence, which we call $\{f_i\}$, that converges at every rational in $I$; it remains to upgrade this to uniform convergence.

To show uniform convergence, it suffices to show that $\{f_i\}$ is uniformly Cauchy, namely that for any $\epsilon > 0$ there is an $N$ such that $n, m > N$ implies $|f_m(x) - f_n(x)| < \epsilon$ for all $x \in I$. Equicontinuity gives each point of $I$ a sufficiently small open neighborhood, and by compactness this open cover has a finite subcover. By density we can pick a rational $x'$ inside each element of the subcover, and across this finite set of rationals we can, for any $\epsilon > 0$, take the maximum of the $N$s for convergence at each. Then for any point $x \in I$, letting $x'$ be the chosen rational of a subcover element containing $x$, we connect $f_m(x)$ to $f_m(x')$ by equicontinuity, connect that to $f_n(x')$ by convergence at $x'$ using our maximum $N$ (a maximum over finitely many elements), and connect that to $f_n(x)$ by equicontinuity again. Thus the triangle inequality over three terms of $\epsilon/3$ suffices.
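Spelled out, the chain of estimates is

$|f_m(x) - f_n(x)| \leq |f_m(x) - f_m(x')| + |f_m(x') - f_n(x')| + |f_n(x') - f_n(x)| < \frac{\epsilon}{3} + \frac{\epsilon}{3} + \frac{\epsilon}{3} = \epsilon$,

where the two outer terms are controlled by equicontinuity and the middle term by convergence at the rational $x'$ for $n, m > N$.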

More explicitly, on the equicontinuity side, we have for every $x$ some open neighborhood $U_x$ such that $s, t \in U_x$ implies $|f(s) - f(t)| < \epsilon/3$ for every $f \in F$.