不知为何,突然想起测度论里的不可测度的维塔利集合

复制以我写的知乎文章

我在知乎上写的目前竟是一些有关于美国华人和ABC和犹太人的政治话题,自己快成了民族活动家了,其实对于学理工科的人而言,民族活动家比较贬义。民族活动家似的言论与活动,尤其在美国,其实是自然被有能力的人所藐视的,这原则很简单,它根本就是不“专业”的表现,甚至可以说是一种流氓耍诬赖的作为。在美国,中国人政治上都是特别老实的,从来不闹事儿,不抗议,就服从性的低调的埋头苦干。相反,我看到过一位根本不黑但有黑人血统的数学研究生,他的数学水平其实很差的,与其他人相比,可是他却公开的支持Black Lives Matter,然后学校媒体却非常支持他,以他宣传自己的diversity,公布的视频里还有他说I didn’t have to think about race。有意思的是他根本不黑,要他不说,其实都看不出来他是黑人。所预料,这些在学校没人敢说的,说了都怕给自己惹麻烦,其实好多人都为此感到不满,但不得不不了了之,最终政治赢者是谁就毫无疑问了。

我为什么开始写这些东西具体愿意也很难说,一个根本是我天性特别讨厌装傻扯淡的表现,而美国的政治在我眼中就是个大装傻戏(当然,未避免是好多人真的那么傻,或者自己太聪明了哈哈)。反正美国人关于中国和中国人的看法好多实在太傻,在那儿的中国人大多也懒得去“纠正“,大多也是不了了之,我也是,则好多负面的又不太符合客观事实的刻板印象逐渐形成。好多这些我也在我博客用英文写了,美国人爱看他们可以看,知识让一部人知道,并留个记录,也给中国人一点启发。

好,说起数学,我想稍微写写关于我对某非常具有纯数学性质又非常基础重要及美妙的一个观念和例子,那就是不可测度集合的体会。我记得曾经把他的英文维基百科页发给某北大物理毕业的人看了,他的反应就是这种鬼东西只有脱离现实的数学家才会在乎。其实测度这个观念大家是有的,即使不喜欢数学的人,长度宽度这些都是对人很直觉的。形式化一些,我们以实数集合模拟,一个实数集合的子集的度量,勒贝格度量(Lebesgue measure)定义为

\mu^*(E) = \inf\{\displaystyle\sum_{k=1}^{\infty} I_k : (I_k)_{n \in \mathbb{N}} \text{as open intervals}, \displaystyle\bigcup_{k=1}^{\infty} I_k \supset E\}

这个其实是勒贝格外测度,是否可勒贝格测度有条件,那就是所有实数子集A 符合

\displaystyle \mu ^{*}(A)=\mu ^{*}(A\cap E)+\mu ^{*}(A\cap E^{c})

若可勒贝格测度,度值为上的外测度值。

好,我们去找一个非可勒贝格测度的集合。为此,我们将引用三个观察。

  • 测度平移守恒,那就是 \mu(S) = \mu(x+S), \forall x \in \mathbb{R}
  • \{A_k\}_{k \in \mathbb{N}} 互相不交则 \mu(\bigcup_{k=1}^{\infty} A_k) = \sum_{k=1}^{\infty} \mu(A_k)
  • S \subset T\mu(S) \leq \mu(T)

这些直觉上显而易见,形式化证也不难。

以前两点,我们发觉任何可以表示为可数无限多个平移同等的不交集的测度比为 0\infty ,因为所有不交集测度一样所以总测度必然是无限个零或无限个有限数。
那,若我们找到那样的一个集合,并且通过第三点,把他的测度加以有限上限及下限,则得以矛盾,则无可测度。

我们先取商加法群 \mathbb{R} / \mathbb{Q} ,并以选择公理在其所有同价类选在 [0,1] 范围内的一个元素构建一个集合,称之 V 。我们在以每一个 \mathbb{Q} \cap [-1,1] 的元素将 V 平移,这些平移互相不交叉,其并集又包含 [0,1] (也就是我们选择不大不小 [-1,1] ),但 [-1,2] 之内,所以以第三点,他的测度又在 13 之间。所以他若可测度就矛盾了。

我第一次看到这个好像是大三看到的,是自己在英文维基百科页看到的,当时,想这个脑子都有点晕了,还是太不数学成熟,使劲想了但还为此感到迷惑,这个构建的要素脑子里还未看透。可是,后来数学有了大的进步,今年初,我又在没有查任何资料的情况下十分钟左右就从新想了起来,接着把它板上解释给了一位芝加哥大学经济系毕业的人,可是这次,那个人却难以理解。

前天晚上,又想起这个了,感到它非常奇妙,尽然在数学的脱离于物质世界的抽象化及形式化存在这样度不可测的集合。当然,有一点是他依赖有一定哲学争议的选择公理,这我现在也没资格谈。这次根本没有想就能够清晰在脑子里看到这个构建,自然就回顾到了与其恰恰相反的无法理解之的曾经,觉得那时候自己脑子还处于一种半沉睡的状态。这也是数学一种奇妙之处吧。有的定理无论如何证出来不得不有点复杂繁琐,但也有一些定理或观念虽然简单但是从数学思想上却是天才般的,革命性的,之所以那么久才能被人发现到,之所以当初理解困难而经过正确深思后却一目了然永不忘。

纯数学我觉得还是最需要智力的学科,主要是他那种抽象度啊,是很少一部分人脑子先天条件足以接受的,像这种东西他跟计算机科学那些算法就很不一样了,算法还是相对具体的,离我们日常生活不太远,我当初接触有一定难度的算法题没问题,但是某些抽象的数学观念总是吃不透,让我感到自己就是不够聪明,天分有限,可以后来,突然就数学觉醒了。这些客观可严谨证明的抽象数学真理终于在我脑海里实现了,而之前虽然一直存在,对我当时还有问题的头脑却是不存在的。主要还是这些定理的构建与证明都有一些简单的抽象数学观念为基础,这些却很难抓住,在没有掌握的时候,你再费脑也无用,但一旦看透了就觉得其实很简单。所以我很佩服数学天才,他们的结果更多是靠一种天才般的智力和想象力,而非仅仅刻苦,他们能够看到一个远远更高的境界,而这不是什么毛泽东思想或耶稣这种人为的信仰世界,而是一种绝对的科学真理。

Implicit function theorem and its multivariate generalization

The implicit function theorem for a single output variable can be stated as follows:

Single equation implicit function theorem. Let F(\mathbf{x}, y) be a function of class C^1 on some neighborhood of a point (\mathbf{a}, b) \in \mathbb{R}^{n+1}. Suppose that F(\mathbf{a}, b) = 0 and \partial_y F(\mathbf{a}, b) \neq 0. Then there exist positive numbers r_0, r_1 such that the following conclusions are valid.

a. For each \mathbf{x} in the ball |\mathbf{x} - \mathbf{a}| < r_0 there is a unique y such that |y - b| < r_1 and F(\mathbf{x}, y) = 0. We denote this y by f(\mathbf{x}); in particular, f(\mathbf{a}) = b.

b. The function f thus defined for |\mathbf{x} - \mathbf{a}| < r_0 is of class C^1, and its partial derivatives are given by

\partial_j f(\mathbf{x}) = -\frac{\partial_j F(\mathbf{x}, f(\mathbf{x}))}{\partial_y F(\mathbf{x}, f(\mathbf{x}))}.

Proof. For part (a), assume without loss of generality positive \partial_y F(\mathbf{a}, b). By continuity of that partial derivative, we have that in some neighborhood of (\mathbf{a}, b) it is positive and thus for some r_1 > 0, r_0 > 0 there exists f such that |\mathbf{x} - \mathbf{a}| < r_0 implies that there exists a unique y (by intermediate value theorem along with positivity of \partial_y F) such that |y - b| < r_1 with F(\mathbf{x}, y) = 0, which defines some function y = f(\mathbf{x}).

To show that f has partial derivatives, we must first show that it is continuous. To do so, we can let r_1 be our \epsilon and use the same process to arrive at our \delta, which corresponds to r_0.

For part (b), to show that its partial derivatives exist and are equal to what we desire, we perturb \mathbf{x} with an \mathbf{h} that we let WLOG be

\mathbf{h} = (h, 0, \ldots, 0).

Then with k = f(\mathbf{x}+\mathbf{h}) - f(\mathbf{x}), we have F(\mathbf{x} + \mathbf{h}, y+k) = F(\mathbf{x}, y) = 0. From the mean value theorem, we can arrive at

0 = h\partial_1F(\mathbf{x}+t\mathbf{h}, y + tk) + k\partial_y F(\mathbf{x}+t\mathbf{h}, y+tk)

for some t \in (0,1). Rearranging and taking h \to 0 gives us

\partial_j f(\mathbf{x}) = -\frac{\partial_j F(\mathbf{x}, y)}{\partial_y F(\mathbf{x}, y)}.

The following can be generalized to multiple variables, with k implicit functions and k constraints.     ▢

Implicit function theorem for systems of equations. Let \mathbf{F}(\mathbf{x}, \mathbf{y}) be an \mathbb{R}^k valued functions of class C^1 on some neighborhood of a point \mathbf{F}(\mathbf{a}, \mathbf{b}) \in \mathbb{R}^{n+k} and let B_{ij} = (\partial F_i / \partial y_j)(\mathbf{a}, \mathbf{b}). Suppose that \mathbf{F}(\mathbf{x}, \mathbf{y}) = \mathbf{0} and \det B \neq 0. Then there exist positive numbers r_0, r_1 such that the following conclusions are valid.

a. For each \mathbf{x} in the ball |\mathbf{x} - \mathbf{a}| < r_0 there is a unique \mathbf{y} such that |\mathbf{y} - \mathbf{b}| < r_1 and \mathbf{F}(\mathbf{x}, \mathbf{y}) = 0. We denote this \mathbf{y} by \mathbf{f}(\mathbf{x}); in particular, \mathbf{f}(\mathbf{a}) = \mathbf{b}.

b. The function \mathbf{f} thus defined for |\mathbf{x} - \mathbf{a}| < r_0 is of class C^1, and its partial derivatives \partial_j \mathbf{f} can be computed by differentiating the equations \mathbf{F}(\mathbf{x}, \mathbf{f}(\mathbf{x})) = \mathbf{0} with respect to x_j and solving the resulting linear system of equations for \partial_j f_1, \ldots, \partial_j f_k.

Proof: For this we will be using Cramer’s rule, which is that one can solve a linear system Ax = y (provided of course that A is non-singular) by taking matrix obtained from substituting the kth column of A with y and letting x_k be the determinant of that matrix divided by the determinant of A.

From this, we are somewhat hinted that induction is in order. If B is invertible, then one of its k-1 \times k-1 submatrices is invertible. Assume WLOG that such applies to the one determined by B^{kk}. With this in mind, we can via our inductive hypothesis have

F_1(\mathbf{x}, \mathbf{y}) = F_2(\mathbf{x}, \mathbf{y}) = \cdots = F_{k-1}(\mathbf{x}, \mathbf{y}) = 0

determine y_j = g_j(\mathbf{x}, y_k) for j = 1,2,\ldots,k-1. Here we are making y_k an independent variable and we can totally do that because we are inducting on the number of outputs (and also constraints). Substituting this into the F_k constraint, this reduces to the single variable case, with

G(\mathbf{x}, y_k) = F_k(\mathbf{x}, \mathbf{g}(\mathbf{x}, y_k), y_k) = 0.

It suffices now to show via our \det B \neq 0 hypothesis that \frac{\partial G}{\partial y_k} \neq 0. Routine application of the chain rule gives

\frac{\partial G}{\partial y_k} = \displaystyle\sum_{j=1}^{k-1} \frac{\partial F_k}{\partial y_j} \frac{\partial g_j}{\partial y_k} + \frac{\partial F_k}{\partial y_k} = \displaystyle\sum_{j=1}^{k-1} B^{kj} \frac{\partial g_j}{\partial y_k} + B^{kk}. \ \ \ \ (1)

The \frac{\partial g_j}{\partial y_k}s are the solution to the following linear system:

\begin{pmatrix} \frac{\partial F_1}{\partial y_1}  & \dots & \frac{\partial F_1}{\partial y_{k-1}} \\ \; & \ddots \; \\ \frac{\partial F_{k-1}}{\partial y_1} & \dots & \frac{\partial F_{k-1}}{\partial y_{k-1}} \end{pmatrix} \begin{pmatrix} \frac{\partial g_1}{\partial y_k} \\ \vdots \\ \frac{\partial g_{k-1}}{\partial y_k} \end{pmatrix} = \begin{pmatrix} \frac{-\partial F_1}{\partial y_k} \\ \vdots \\ \frac{-\partial F_{k-1}}{\partial y_k} \end{pmatrix} .

Let M^{ij} denote the k-1 \times k-1 submatrix induced by B_{ij}. We see then that in the replacement for Cramer’s rule, we arrive at what is M^{kj} but with the last column swapped to the left k-j-1 times such that it lands in the jth column and also with a negative sign, which means

\frac{\partial g_j}{\partial y_k}(\mathbf{a}, b_k) = (-1)^{k-j} \frac{\det M^{jk}}{\det M^{kk}}.

Now, we substitute this into (1) to get

\begin{aligned}\frac{\partial G}{\partial y_k}(\mathbf{a}, b_k) &= \displaystyle_{j=1}^{k-1} (-1)^{k-j}B_{kj}\frac{\det M^{kj}}{\det M^{kk}} + B_kk \\ &= \frac{\sum_{j=1}^k (-1)^{j+k} B_{kj}\det M^{kj}}{\det M^{kk}} \\ &= \frac{\det B}{\det M^{kk}} \\ &\neq 0. \end{aligned}

Finally, we apply the implicit function theorem for one variable for the y_k that remains.     ▢

References

  • Gerald B. Folland, Advanced Calculus, Prentice Hall, Upper Saddle River, NJ, 2002, pp. 114–116, 420–422.

 

A nice consequence of Baire category theorem

In a complete metric space X, we call a point x for which \{x\} is open an isolated point. If X is countable and there are no isolated points, we can take \displaystyle\cap_{x \in X} X \setminus x = \emptyset, with each of the X \setminus x open and dense, to violate the Baire category theorem. From that, we can arrive at the proposition that in a complete metric space, no isolated points implies that the space uncountable, and similarly, that countable implies there is an isolated point.

 

Arzela-Ascoli theorem and delta epsilons

I always like to think of understanding of the delta epsilon definition of limit as somewhat of an ideal dividing line on the cognitive hierarchy, between actually smart and pseudo smart. I still remember vividly struggling to grok that back in high school when I first saw it junior year, though summer after, it made sense, as for why it was reasonable to define it that way. That such was only established in the 19th century goes to show how unnatural such abstract precise definitions are for the human brain (more reason to use cognitive genomics to enhance it 😉 ). At that time, I would not have imagined easily that this limit definition could be generalized further, discarding the deltas and epsilons, which presumes and restricts to real numbers, as it already felt abstract enough. Don’t even get me started on topological spaces, nets, filters, and ultrafilters; my understanding of them is still cursory at best, but someday I will fully internalize them.

Fortunately, I have improved significantly since then, both in terms of experience and in terms of my biological intelligence, that last night, I was able to reconstruct in my head the proof of the Arzela-Ascoli theorem, which also had been once upon a time mind-boggling. Remarkably, that proof, viewed appropriately, comes naturally out of just a few simple, critical observations.

The statement of the theorem is as follows.

Arzela-Ascoli theorem Let F be a family of functions from I = [a,b] to \mathbb{R} that are uniformly bounded and equicontinuous. Then there is a sequence of f_n of elements in F that converges uniformly in I.

The rationals are dense in the reals and make an excellent starting point. Uniform boundedness enables us employ Bolzano-Weierstrass to construct a sequence of functions convergent at any rational in I. With a countable number of such rationals, we can iteratively filter this sequence to construct one that converges at every rational in I. Essentially, we have an enumeration of the rationals in I and a chain of subsequences based on that, where in the nth subsequence is convergent at the first n rationals in the enumeration, and Bolzano-Weierstrass is applied onto the results of the application of the functions of that subsequence on the n+1th rational to yield another subsequence. Take, diagonally, the nth function in the nth subsequence, to derive our desired uniformly convergent sequence, which we call \{f_i\}.

To show uniform convergence, it suffices to show uniform Cauchyness, namely that for any \epsilon > 0, there is an N such that n,m > N implies |f_m(x) - f_n(x)| < \epsilon for all x \in I. By compactness, open neighborhoods of all rationals of I, as an open cover, has a finite subcover. Each element of the subcover comes from some rational of I and across that finite subset of I we can for any \epsilon > 0 take the max of all the Ns for convergence. This means that so long as our neighborhoods are sufficiently small, we can for any point x \in I have some x' that is the point of focus of the neighborhood of our finite subcover containing x and thereby connect f_m(x) to f_m(x') by equicontinuity and use our maximum N (over finite elements) to connect that to f_n(x') and use equicontinuity again to connect that to f_n(x). Thus, triangle inequality over three of \epsilon/3 suffices.

More explicitly, equicontinuity-wise, we have for every x some open neighborhood of U_x such that s,t \in U_x implies that |f(s) - f(t)| < \epsilon.