Cayley-Hamilton theorem and Nakayama’s lemma

The Cayley-Hamilton theorem states that every square matrix over a commutative ring A satisfies its own characteristic equation. That is, with I_n the n \times n identity matrix, the characteristic polynomial of A

p(\lambda) = \det (\lambda I_n - A)

is such that p(A) = 0. I recalled that in a post a while ago, I mentioned that for any matrix A, A(\mathrm{adj}(A)) = (\det A) I_n, a fact that is not hard to visualize based on calculation of determinants via minors, which is in fact much of what brings the existence of this adjugate to reason in some sense. This can be used to prove the Cayley-Hamilton theorem.

So we have

(\lambda I_n - A)\mathrm{adj}(\lambda I_n - A) = p(\lambda)I_n,

where p is the characteristic polynomial of A. The adjugate in the above is a matrix of polynomials in t with coefficients that are matrices which are polynomials in A, which we can represent in the form \displaystyle\sum_{i=0}^{n-1}t^i B_i.

We have

\displaystyle {\begin{aligned}p(\lambda)I_{n} &= (\lambda I_n - A)\displaystyle\sum_{i=0}^{n-1}\lambda^i B_i \\ &= \displaystyle\sum_{i=0}^{n-1}\lambda^{i+1}B_{i}-\sum _{i=0}^{n-1}\lambda^{i}AB_{i} \\ &= \lambda^{n}B_{n-1}+\sum _{i=1}^{n-1}\lambda^{i}(B_{i-1}-AB_{i})-AB_{0}.\end{aligned}}

Equating coefficients gives us

B_{n-1} = I_n, \qquad B_{i-1} - AB_i = c_i I_n, 1 \leq i \leq n-1, \qquad -AB_0 = c_0I_0.

With this, we have

A^n + c_{n-1}A^{n-1} + \cdots + c_1A + c_0I_n = A^nB_{n-1} + \displaystyle\sum_{i=1}^{n-1} (A^iB_{i-1} - A^{i+1}B_i) - AB_0 = 0,

with the RHS telescoping and annihilating itself to 0.

There is generalized version of this for a module over a ring, which goes as follows.

Cayley-Hamilton theorem (for modules) Let A be a commutative ring with unity, M a finitely generated A-module, I an ideal of A, \phi an endomorphism of M with \phi M \subset IM.

Proof: It’s mostly the same. Let \{m_i\} \subset M be a generating set. Then for every i, \phi(m_i) \in IM, with \phi(m_i) = \displaystyle\sum_{j=1}^n a_{ij}m_j, with the a_{ij}s in I. This means by closure properties of ideals the polynomial coefficients in the above will stay in I.     ▢

From this follows easily a statement of Nakayama’s lemma, ubiquitous in commutative algebra.

Nakayama’s lemma  Let I be an ideal in R, and M a finitely-generated module over R. If IM = M, then there exists an r \in R with r \equiv 1 \pmod{I}, such that rM = 0.

Proof: With reference to the Cayley-Hamilton theorem, take \phi = I_M, the identity map on M, and define the polynomial p as above. Then

rI_M = p(I_M) = (1 + c_{n-1} + c_{n-2} + \cdots + c_0)I_M = 0

both annihilates the c_is, coefficients residing in I, so that r \equiv 1 \pmod{I} and gives the zero map on M in order for rM = 0.     ▢


Jordan normal form

Jordan normal form states that every square matrix is similar to a Jordan normal form matrix, one of the form

J = \begin{bmatrix}J_1 & \; & \; \\ \; & \ddots & \; \\\; & \; & J_p \\ \end{bmatrix}

where each of the J_i is square of the form

J_i = \begin{bmatrix}\lambda_i & 1 & \; & \; \\ \; & \lambda_i \; & \ddots & \; \\ \; & \; & \ddots & 1 \\ \; & \; & \; & \lambda_i \\ \end{bmatrix}.

This is constructed via generalized eigenvectors. One can observe that each block matrix corresponds to an invariant subspace, and generalized eigenvectors (of a matrix) are a set of chains, each of which is its own invariant subspace.

We let A be any linear transformation from V to V, where V is of course a vector space.

It is more common knowledge that Ker(A - \lambda I) is the mechanism used to solve for eigenvectors. Let us first observe that v \in Ker(A - \lambda I) is such that also Av \in Ker(A - \lambda I) since A commutes with A - \lambda I and that this extends to Ker(A - \lambda I)^t for any natural number t. This gives us a way to identify larger invariant subspaces.

Let W_i = Ker(A - \lambda I)^i. W_i \subset W_{i+1} is obvious, and there will be some and a smallest t at which W_t = W_{t+1}. Afterwards, the W_i must all be equal. If not, there will be in the intermediary on iterating A - \lambda I against a vector from which W_t = W_{t+1} is contradicted.

Next, we observe that Ker(A - \lambda I)^t \cap Im(A - \lambda I)^t = \emptyset. Suppose not. Then, we would have some w \in W_{2t} but not in W_t.

Rank nullity theorem says that the remainder after pulling out Ker(A - \lambda I)^t for some eigenvalue \lambda is Im(A - \lambda I)^t. We can run the same algorithm on that for another eigenvalue. So this is resolved by induction.

The result is that if A has distinct eigenvalues \lambda_1, \lambda_2, \ldots, \lambda_k, there are a_1, a_2, \ldots, a_k such that the domain of A is

(A - \lambda_1 I)^{a_1} \oplus (A - \lambda_2 I)^{a_2} \oplus \cdots \oplus (A - \lambda_k I)^{a_k}.

Now does Ker(A - \lambda I)^t correspond to a irreducible invariant subspace necessarily? No, as there is a difference between algebraic and geometric multiplicity.

Now we will show, as the second part of the proof, to be invoked on the components in the direct sum decomposition from the preceding first part of the proof that if T is nilpotent, meaning that T^n = 0 for some n, then there are v_1, \ldots, v_k and a_1, \ldots, a_k such that \{v_1, Tv_1, \ldots, T^{a_1-1}v_1, v_2, \ldots, v_k, Tv_k, \ldots, T^{a_k-1}v_k\} is a basis (linearly independent by definition) for the domain of T, with \sum a_k = \dim V and \max(a_1, \ldots, a_k) = n. (Note that here n is the smallest with T^n = 0.)

For any eigenvalue, there is an eigenvector space associated with it. Take its preimage with respect to A. Do this successively until nothing remains, which will be at the n-1th iteration. In particular, take u_1, \ldots, u_k to be the basis of the eigenvector space. For each one of these that has non-empty preimage, we take the element with the kernel projected out. This accumulates a set of vectors of the format specified. It has to be exhaustive with respect to the vector space under the nilpotence assumption, from which termination is also guaranteed. It remains to show that these are linearly independent. We can using our eigenvector space as our base case take an inductive hypothesis where the vectors accumulated prior to the kth iteration are linearly independent. Now we show that the vector set remains so after adding in the ones obtained from taking preimage. We note that first, the added ones are linearly independent themselves (if a nontrivial linear combinations gives zero, applying A would violate our inductive hypothesis). There is also that a nontrivial linear combination of the newly added ones cannot equal a linear combination of the rest. To show this, assume otherwise, and apply A just enough times (k times) for one side to disappear. The other side should be a linear combination with respect to our designated basis of the eigenvector space, which cannot disappear. This concludes our construction.

Essentially we have chains (of vectors) which terminate when an element is no longer found after our preimage operation. Applying to this T = A - \lambda I, we see that for some element u in our chain, Au = \lambda u + v, where v is the previous element of the chain, with 0 signifying that we are the at an eigenvector (non-generalized), at the front. Ones along the superdiagonal correspond to the 1 coefficient of v above.



Math sunday

I had a chill day thinking about math today without any pressure whatsoever. First I figured out, calculating inductively, that the order of GL_n(\mathbb{F}_p) is (p^n - 1)(p^n - p)(p^n - p^2)\cdots (p^n - p^{n-1}). You calculate the number of k-tuples of column vectors linear independent and from there derive p^k as the number of vectors that cannot be appended if linear independence is to be preserved. A Sylow p-group of that is the group of upper triangular matrices with ones on the diagonal, which has the order p^{n(n-1)/2} that we want.

I also find the proof of the first Sylow theorem much easier to understand now, the inspiration of it. I had always remembered that the Sylow p-group we are looking for can be the stabilizer subgroup of some set of p^k elements of the group where p^k divides the order of the group. By the pigeonhole principle, there can be no more than p^k elements in it. The part to prove that kept boggling my mind was the reverse inequality via orbits. It turns out that that can be viewed in a way that makes its logic feel much more natural than it did before, which like many a proof not understood, seems to spring out of the blue.

We wish to show that the number of times, letting p^r be the largest pth power dividing n, that the order of some orbit is divided by p is no more than r-k. To do that it suffices to show that the sum of the orders of the orbits, \binom{n}{p^k} is divided by p no more than that many times. To show that is very mechanical. Write out as m\displaystyle\prod_{j = 1}^{p^k-1} \frac{p^k m - j}{p^k - j} and divide out each element of the product on both the numerator and denominator by p to the number of times j divides it. With this, the denominator of the product is not a multiple of p, which means the number of times p divides the sum of the orders of the orbits is the number of times it divides m, which is r-k.

Following this, Brian Bi told me about this problem, starred in Artin, which means it was considered by the author to be difficult, that he was stuck on. To my great surprise, I managed to solve it under half an hour. The problem is:

Let H be a proper subgroup of a finite group G. Prove that the conjugate subgroups of H don’t cover G.

For this, I remembered the relation |G| = |N(H)||Cl(H)|, where Cl(H) denotes the number of conjugate subgroups of H, which is a special case of the orbit-stabilizer theorem, as conjugation is a group action after all. With this, given that |N(H)| \geq |H| and that conjugate subgroups share the identity, the union of them has less than |G| elements.

I remember Jonah Sinick’s once saying that finite group theory is one of the most g-loaded parts of math. I’m not sure what his rationale is for that exactly. I’ll say that I have a taste for finite group theory though I can’t say I’m a freak at it, unlike Aschbacher, but I guess I’m not bad at it either. Sure, it requires some form of pattern recognition and abstraction visualization that is not so loaded on the prior knowledge front. Brian Bi keeps telling me about how hard finite group theory is, relative to the continuous version of group theory, the Lie groups, which I know next to nothing about at present.

Oleg Olegovich, who told me today that he had proved “some generalization of something to semi-simple groups,” but needs a bit more to earn the label of Permanent Head Damage, suggested upon my asking him what he considers as good mathematics that I look into Arnold’s classic on classical mechanics, which was first to come to mind on his response of “stuff that is geometric and springs out of classical mechanics.” I found a PDF of it online and browsed through it but did not feel it was that tasteful, perhaps because I’m been a bit immersed lately in the number theoretic and abstract algebraic side of math that intersects not with physics, though I had before an inclination towards more physicsy math. I thought of possibly learning PDEs and some physics as a byproduct of it, but I’m also worried about lack of focus. Maybe eventually I can do that casually without having to try too hard as I have done lately for number theory. At least, I have not the right combination of brainpower and interest sufficient for that in my current state of mind.

一说起偏微分方程,想到此行有不少杰出的浙江裔学者,最典型的可以说是谷超豪。想起,华盛顿大学一位做非交换代数几何的教授,浙江裔也,的儿子,曾经说起他们回国时谷超豪,复旦的,如他父亲一样,逝世了,又半开玩笑言,“据说谷超豪被选为院士,是因为他曾经当过地下党。”记得看到杨振宁对谷超豪有极高的评价,大大出于谷超豪在杨七十年代访问复旦的促动下解决了一系列有关于杨-米尔斯理论的数学问题。之外,还有林芳华,陈贵强,都是非常有名气的这套数学的教授,也都是浙江人。我们都知道浙江人是中国的犹太人,昨天Brian Bi还在说”there are four times more Zhejiangnese than Jews.” 可惜我不是浙江人,所以成为数学家可能希望不大了。:(

A recurrence relation

I noticed that

(x_1 - x_k)\displaystyle\sum_{i_1+\cdots+i_k=n} x_1^{i_1}\cdots x_k^{i_k} = \displaystyle\sum_{i_1+\cdots+i_{k-1}=n+1} x_1^{i_1}\cdots x_{k-1}^{i_{k-1}} - \displaystyle\sum_{i_2+\cdots+i_k=n+1} x_2^{i_2}\cdots x_k^{i_k}.

In the difference on the RHS, it is apparent that terms without x_1 or x_k will vanish. Thus, all the negative terms which are not cancelled out have a x_k and all such positive terms have a x_1. Combinatorially, all terms of degree n+1 with x_k can be generated by multiplying x_k on all terms of degree n. Analogous holds for the positive terms. The terms with only x_1 and x_k are cancelled out with the exception of the x_1^{n+1} - x_k^{n+1} that remains.

This recurrence appears in calculation of the determinant of the Vandermonde matrix.

On the adjugate

I learned that the adjugate is the transpose of the matrix with the minors with the appropriate sign, that as we all know, alternates along rows and columns, corresponding to each element of the matrix on which the adjugate is taken. The matrix, multiplied with its adjugate, in fact, yields the determinant of that matrix, times the identity of course, to matrix it. Note that the diagonal elements of its result is exactly what one gets from applying the minors algorithm for calculating the determinant along each row. The other terms vanish. There are n(n-1) of them, where n is the number of rows (and columns) of the (square) matrix. They are, for each column of the adjugate and each column of it not equal to the current column, the sum of each entry in the column times the minor (with sign applied) determined by the removal of the other selected column (constant throughout the sum) and the row of the current entry. In the permutation expansion of this summation, each element has a (unique) sister element, with the sisterhood relation symmetric, determined by taking the entry of the adjugate matrix in the same column as the non minor element to which the permutation belongs and retrieving in the permutation expansion of the element times minor for that element the permutation, the product representing which contains the exact same terms of the matrix. Note that shift in position of the swapped element in the minor matrix is one less than that in the adjugate matrix. Thus, the signs of the permutations cancel. From this, we arrive at that the entire sum of entry times corresponding minor across the column is zero.

A corollary of this is that \mathrm{adj}(\mathbf{AB}) = \mathrm{adj}(\mathbf{B})\mathrm{adj}(\mathbf{A}).

More math

Last night, I learned, once more, the definition of absolute continuity. Formally, a function f : X \to Y‘s being absolutely continuous is its for any \epsilon > 0, having a \delta > 0 such that for any finite number of pairs of points (x_k, y_k) with \sum |x_k - y_k| < \delta implies \sum |f(x_k) - f(y_k)| < \epsilon. It is stronger than uniform continuity, a special case of it. I saw that it implied almost everywhere differentiability and is intimately related to the Radon-Nikodym derivative. A canonical example of a function not absolute continuous but uniformly continuous, to my learning last night afterwards, is the Cantor function, this wacky function still to be understood by myself.

I have no textbook on this or on anything measure theoretic, and though I could learn it from reading online, I thought I might as well buy a hard copy of Rudin that I can scribble over to assist my learning of this core material, as I do with the math textbooks I own. Then, it occurred to me to consult my math PhD student friend Oleg Olegovich on this, which I did through Skype this morning.

He explained very articulately absolute continuity as a statement on bounded variation. It’s like you take any set of measure less than \delta and the total variation of that function on that set is no more than \epsilon. It is a guarantee of a stronger degree of tightness of the function than uniform continuity, which is violated by functions such as x^2 on reals, the continuity requirements of which increases indefinitely as one goes to infinity and is thereby not uniformly continuous.

Our conversation then drifted to some lighter topics, lasting in aggregate almost 2 hours. We talked jokingly about IQ and cultures and politics and national and ethnic stereotypes. In the end, he told me that введите общение meant “input message”, in the imperative, and gave me a helping hand with the plural genitive conjugation, specifically for “советские коммунистические песни”. Earlier this week, he asked me how to go about learning Chinese, for which I gave no good answer. I did, on this occasion, tell him that with all the assistance he’s provided me with my Russian learning, I could do reciprocally for Chinese, and then the two of us would become like Москва-Пекин, the lullaby of which I sang to him for laughs.

Back to math, he gave me the problem of proving that for any group G, a subgroup H of index p, the smallest prime divisor of |G|, is normal. The proof is quite tricky. Note that the action of G on G / H induces a map \rho : G \to S_p, the kernel of which we call N. The image’s order, as a subgroup of S_p must divide p!, and as an isomorphism of a quotient group of G must divide n. Here is where the smallest prime divisor hypothesis is used. The greatest common divisor of n and p! cannot not p or not 1. It can’t be 1 because not everything in G is a self map on H. N \leq H as everything in N must take H to itself, which only holds for elements of H. By that, [G:N] \geq [G:H] = p which means N = H. The desired result thus follows from NgH = gH for all g \in G.

Later on, I looked at some random linear algebra problems, such as proving that an invertible matrix A is normal iff A^*A^{-1} is unitary, and that the spectrum of A^* is the complex conjugate of the spectrum of A, which can be shown via examination of A^* - \lambda I. Following that, I stumbled across some text involving minors of matrices, which reminded me of the definition of determinant, the most formal one of which is \sum_{\sigma \in S_n}\mathrm{sgn}(\sigma)\prod_{i=1}^{n}a_{i,\sigma_{i}}. In school though we learn its computation via minors with alternating signs as one goes along. Well, why not relate the two formulas.

In this computation, we are partitioning based on the element that 1 or any specific element of [n] = \{1, 2, \ldots, n\}, with a corresponding row in the matrix, maps to. How is the sign determined for each? Why does it alternate. Well, with the mapping for 1 already determined in each case, it remains to determine the mapping for the remainder, 2 through n. There are (n-1)! of them, from \{2, 3, \ldots, n\} to [n] \setminus \sigma_1. If we were to treat 1 through i-1 as shifted up by one so as to make it a self map on \{2, 3, \ldots, n\} then each entry in the sum of the determinant of the minor would have its sign as the sign of the number of two cycles between consecutive elements (which generate the symmetric group). Following that, we’d need to shift back down \{2, 3, \ldots, i\}, the presentation of which, in generator decomposition, would be (i\ i+1)(i-1\ i) \ldots (1\ 2), which has sign equal to the sign of i, which is one minus the column we’re at, thereby explaining why we alternate, starting with positive.