using LinearAlgebra, Plots
gr(size=(500, 500), legend=false)
An affine transformation converts a standard normal distribution into a possibly non-centered and correlated normal distribution.
xx = 0:0.01:2π
d = randn(30, 2)
plot(sin.(xx), cos.(xx))
plot!(2 .* sin.(xx), 2 .* cos.(xx))
plot!(3 .* sin.(xx), 3 .* cos.(xx))
scatter!(d[:,1], d[:,2])
transformation(x) = (x * [1. 3; 2 1] .+ [1. 2])
x1 = transformation([sin.(xx) cos.(xx)])
x2 = transformation(2 .*[sin.(xx) cos.(xx)])
x3 = transformation(3 .*[sin.(xx) cos.(xx)])
plot(x1[:,1], x1[:,2])
plot!(x2[:,1], x2[:,2])
plot!(x3[:,1], x3[:,2])
dt = transformation(d)
scatter!(dt[:,1], dt[:,2])
First goals of PCA:
This boils down to the task of transforming the data back into standard normally distributed data.
Final goal of PCA:
PCA relies on the assumption that the data points are independent draws from a transformed normal distribution.
Population data (including language data) are not independent.
Before we can apply PCA to the columns, we have to decorrelate the rows!
Let us assume a phylogeny $\mathcal{T}$ with $n$ branches and $m$ tips, and a character $x$ evolving along $\mathcal{T}$ according to a Brownian motion. Evolution along each branch $i$ ($1\leq i\leq n$) is drawn from an independent normal distribution with mean 0 and variance $l_i$, leading to a vector $B$ of length $n$ which is multivariate normal with mean $\mathbf{0}$ and covariance matrix $\mathrm{diag}(l_1\cdots l_n) \doteq \Lambda$.
The state of $x$ at the root is $x_r$.
The topology of $\mathcal{T}$ can be represented by an $n\times n$ matrix $T$ where
$$ t_{ij} = \left\{\begin{align}1 &\mbox{ if branch }j\mbox{ is on the path from}\\ &\mbox{ the root to the node at the bottom of branch }i\\0&\mbox{ else}\end{align}\right. $$(Alternatively, one can identify the rows and columns with the non-root nodes, where $t_{ij}=1$ means that $j$ is on the path from the root to $i$, including $i$ itself. So $T$ is essentially the transitive closure of the adjacency matrix of $\mathcal{T}$.)
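To make the definition concrete, here is a small sketch (the `parent` vector below encodes a hypothetical tree: `parent[i]` is the node above node $i$, with `0` marking children of the root; nodes 1 and 2 are tips below node 3) that builds $T$ by walking from each node up to the root:

```julia
# parent[i] is the node above node i; 0 marks a child of the root.
# Nodes 1 and 2 are tips below the inner node 3.
parent = [3, 3, 0]
n = length(parent)
T = zeros(Int, n, n)
for i in 1:n
    j = i
    while j != 0        # walk from node i up to the root
        T[i, j] = 1
        j = parent[j]
    end
end
T
```

For this tree the result is exactly the first example matrix below.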
Let us consider a few examples.
using SymPy
sy(A) = convert(Array{Sym, 2}, A)
function sySVD(A)
    # symbolic SVD via the eigendecomposition of A'A:
    # V holds the eigenvectors, D the eigenvalues
    V, D = (A.T * A).diagonalize(normalize=true, sort=true);
    # singular values: square roots of the positive eigenvalues, largest first
    S = reverse(sqrt.(D[D .> 0]))
    V = V[:, reverse(1:size(V, 2))]
    # keep only the columns belonging to non-zero singular values
    # (A is assumed to have full row rank)
    V = V[:, 1:size(A, 1)]
    U = A * V * Diagonal(1 ./ S)
    return (U, S, V)
end;
# Moore–Penrose pseudoinverse computed from the symbolic SVD
function syPinv(A)
    U, S, V = sySVD(A)
    return V * Diagonal(1 ./ S) * U.T
end;
T = sy([
1 0 1
0 1 1
0 0 1
])
T = sy([
1 0 0
0 1 0
0 0 1
])
T = sy([
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 0 1
0 0 0 1 0 1
0 0 0 0 1 0
0 0 0 0 0 1
])
Let $b$ be a vector of branch-wise Brownian motion steps, i.e., a random vector drawn from $$ b \sim \mathcal{N}(\mathbf{0}, \Lambda) $$ Then $x=Tb + x_r\mathbf{1}$ is the vector of values of $x$ at the nodes (including the internal ones). $x$ is distributed according to a multivariate normal distribution $$ x \sim \mathcal{N}(x_r\mathbf{1}, \Sigma), $$ where $\sigma_{ij} = \sum_k t_{ik}t_{jk}l_k$. Proof:
$$ \begin{align} Cov((Tb+x_r\mathbf{1})_i, (Tb+x_r\mathbf{1})_j) &= E((Tb+x_r\mathbf{1})_i (Tb+x_r\mathbf{1})_j) - E((Tb+x_r\mathbf{1})_i)E((Tb+x_r\mathbf{1})_j)\\ &= E(((Tb)_i+x_r) ((Tb)_j+x_r)) - E((Tb)_i+x_r)E((Tb)_j+x_r)\\ &= E((Tb)_i(Tb)_j + ((Tb)_i+(Tb)_j)x_r +x_r^2) - x_r^2\\ &= E((Tb)_i(Tb)_j) + x_rE((Tb)_i+(Tb)_j) \\ &= E((Tb)_i(Tb)_j)\\ &= E(\sum_k t_{ik}b_k\sum_l t_{jl}b_l)\\ &= \sum_{kl}t_{ik}t_{jl}E(b_kb_l)\\ &= \sum_k t_{ik}t_{jk}l_k \end{align} $$It is easy to see that $$ \begin{align} \Sigma &= T\Lambda T'\\ &=(T\Lambda^{\frac{1}{2}})(T\Lambda^{\frac{1}{2}})' \end{align} $$
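As a quick numerical sanity check of $\Sigma = T\Lambda T'$ (the tree and branch lengths below are arbitrary), we can simulate many draws of $b$ and compare the empirical covariance of $Tb$ with the theoretical one:

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(42)
T = [1 0 1; 0 1 1; 0 0 1]             # topology matrix of a 3-branch tree
l = [1.0, 2.0, 0.5]                   # hypothetical branch lengths
Λ = Diagonal(l)

# 10^5 draws of b ~ N(0, Λ): each column k scaled by √l_k
B = randn(100_000, 3) .* sqrt.(l)'
X = B * T'                            # each row is one draw of Tb

Σ_empirical = cov(X)
Σ_theory = T * Λ * T'
maximum(abs.(Σ_empirical - Σ_theory)) # small sampling error
```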
Let $x_t$ be the restriction of $x$ to the tips, and $T_t$ the restriction of $T$ to the tip rows. Then by the same token,
$$ x_t \sim \mathcal{N}(x_r\mathbf{1}, (T_t\Lambda^{\frac{1}{2}})(T_t\Lambda^{\frac{1}{2}})') $$For simplicity's sake, I define $$ \Phi \doteq T_t\Lambda^{\frac{1}{2}} $$
$\Phi$ is the result of multiplying each column of the topology matrix $T$ by the square root of the corresponding branch length and restricting the rows to the tips.
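In code, this construction is a one-liner. The sketch below assumes the first example tree, unit branch lengths, and a `tips` vector marking which rows of $T$ are tips:

```julia
using LinearAlgebra

# topology matrix of the first example tree; nodes 1 and 2 are the tips
T = [1 0 1; 0 1 1; 0 0 1]
l = [1.0, 1.0, 1.0]      # branch lengths (unit lengths assumed here)
tips = [1, 2]            # rows of T that correspond to tips

# scale each column by the square root of its branch length,
# keep only the tip rows
Φ = T[tips, :] * Diagonal(sqrt.(l))
```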
Examples:
Φ = sy([
1 0 1
0 1 1
])
Φ = sy([
1 0 0
0 1 0
0 0 1
])
Φ = sy([
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 0 1
0 0 0 1 0 1
])
Φ = sy([
1 0 0 1
0 1 0 1
0 0 sympy.sqrt(2) 0
])
$\Phi\Phi'$ is the covariance matrix of $x_t$. It is easy to see that $(\Phi\Phi')_{ij}$ is the length of the path from the root to the most recent common ancestor of tips $i$ and $j$.
Φ * Φ.T
$\Phi$ can be seen as a linear transformation that maps a standard normally distributed vector to $x_t-x_r\mathbf{1}$. The Singular Value Decomposition of $\Phi$ decomposes this transformation into three components: a rotation/reflection ($V'$), a scaling along the coordinate axes ($S$), and another rotation/reflection ($U$).
It holds that
$$ \Phi = USV^T, $$and this factorization is always possible.
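The factorization can also be checked numerically with `LinearAlgebra.svd` (floating point instead of exact symbolic arithmetic):

```julia
using LinearAlgebra

Φ = [1.0 0 1; 0 1 1]
F = svd(Φ)                  # thin SVD: F.U, F.S, F.Vt
# reconstruction error is zero up to floating-point rounding
maximum(abs.(F.U * Diagonal(F.S) * F.Vt - Φ))
```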
Φ = sy([
1 0 1
0 1 1
])
U,S,V = sySVD(Φ);
U
Diagonal(S)
V
Φ = sy([
1 0 0
0 1 0
0 0 1
])
U, S, V = sySVD(Φ);
U
diagm(S)
V
Φ = sy([
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 0 1
0 0 0 1 0 1
])
U, S, V = sySVD(Φ);
U
diagm(S)
V
Φ = sy([
1 0 0 1
0 1 0 1
0 0 sympy.sqrt(2) 0
])
U, S, V = sySVD(Φ);
U
Diagonal(S)
V
Φ = sy([
sympy.sqrt(2) 0 0 0 0 1 sympy.sqrt(2) 0
0 sympy.sqrt(2) 0 0 0 1 sympy.sqrt(2) 0
0 0 sympy.sqrt(3) 0 0 0 sympy.sqrt(2) 0
0 0 0 1 0 0 0 2
0 0 0 0 1 0 0 2
])
Φ * Φ.T
U, S, V = sySVD(Φ);
U
S
V
U, S, V = svd(convert(Matrix{Float64}, Φ));
U
S
V
Φ = sy([
1 0 0 0 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 0 1 0 0
0 0 0 1 0 0 0 0 0 0 1 0
0 0 0 0 1 0 0 0 0 0 1 0
0 0 0 0 0 1 0 0 0 0 1 0
0 0 0 0 0 0 1 0 0 0 0 1
0 0 0 0 0 0 0 1 0 0 0 1
0 0 0 0 0 0 0 0 1 0 0 0
])
Φ * Φ.T
U, S, V = sySVD(Φ);
U
S
V
There is one eigenvector/singular value for each daughter branch of the root. Additionally, for all non-root nodes with $k$ daughters, there are $k-1$ eigenvectors/singular values.
@vars a b c
Φ = sy([
sympy.sqrt(b) 0 sympy.sqrt(a)
0 sympy.sqrt(c) sympy.sqrt(a)
])
U, D = (Φ*Φ.T).diagonalize()
U
S = sqrt.(diag(D))
The values $x_t$ at the tips are observed. We want to estimate the value $x_r$ at the root.
Recall that $x_t$ is distributed according to a multivariate normal distribution with mean $x_r\mathbf{1}$ and covariance matrix $\Phi\Phi'$. According to the SVD of $\Phi$, we have
$$ \begin{align}\Phi &= USV'\\\Phi\Phi' &= USV'VSU'\\&= (US)(US)'\end{align} $$Therefore the distribution of $x_t$ can be represented as
$$ \begin{align}z_i &\sim \mathcal{N}(0, 1)\\x_t &:= USz + x_r\mathbf{1}\end{align} $$$\hat x_r$, the maximum likelihood estimate of $x_r$, is the value that maximizes the joint likelihood of $z$. This is equivalent to demanding that $z'z$ — the sum of squares of the components of $z$ — is minimized.
Using calculus to minimize $z'z$, we have
$$ \begin{align}\frac{dz'z}{dx_r} &= \frac{d}{dx_r}\left(x_t'US^{-2}U'x_t - 2x_rx_t'US^{-2}U'\mathbf{1} + x_r^2\mathbf{1}'US^{-2}U'\mathbf{1}\right)\\&= -2x_t'US^{-2}U'\mathbf{1} + 2 x_r\mathbf{1}'US^{-2}U'\mathbf{1}\end{align} $$Setting the derivative to 0:
$$ \begin{align}2\hat x_r\mathbf{1}'US^{-2}U'\mathbf{1} &= 2x_t'US^{-2}U'\mathbf{1}\\\hat x_r &= \frac{x_t'US^{-2}U'\mathbf{1}}{\mathbf{1}'US^{-2}U'\mathbf{1}}\end{align} $$Suppose we know the phylogeny, but the time unit of the branch lengths is not identical to the time unit of the Brownian motion. Then we have another parameter $\sigma\in \mathbb R^+$, the standard deviation of $x$ after one unit of time.
Then the branch-wise Brownian motion steps are distributed as
$$ b \sim \mathcal{N}(\mathbf{0}, \sigma^2\Lambda), $$and $x_t$ as
$$ x_t \sim \mathcal{N}(x_r\mathbf{1}, \sigma^2\Phi\Phi') $$The distribution of $x_t$ can be represented as
$$ \begin{align}z_i &\sim \mathcal{N}(0, \sigma^2)\\x_t &:= USz + x_r\mathbf{1}\end{align} $$Therefore
$$ z = S^{-1}U'(x_t-x_r\mathbf{1}) $$The log-likelihood of $z$ is
$$ \mathcal{LL}(z|x_r,\sigma) = -\frac{n}{2}\log{2\pi} -n\log\sigma -\frac{z'z}{2\sigma^2} $$Then the log-likelihood of $x_t$ is
$$ \begin{align}\mathcal{LL}(x_t|x_r,\sigma) &= -\frac{n}{2}\log{2\pi} -n\log\sigma -\frac{z'z}{2\sigma^2} + \log\left|\frac{\partial z}{\partial x_t}\right|\\&= -\frac{n}{2}\log 2\pi - n\log\sigma - \frac{(x_t'-x_r\mathbf{1}')US^{-2}U'(x_t-x_r\mathbf{1})}{2\sigma^2} - \log \left|S\right|\end{align} $$The gradient with respect to $x_r$ and $\sigma$ is
$$ \begin{align}\frac{\partial}{\partial x_r}\mathcal{LL}(x_t|x_r,\sigma) &= \sigma^{-2}(x'_tUS^{-2}U'\mathbf{1} - x_r\mathbf{1}'US^{-2}U'\mathbf{1})\\\frac{\partial}{\partial \sigma}\mathcal{LL}(x_t|x_r,\sigma) &= -\frac{n}{\sigma} +\frac{(x_t'-x_r\mathbf{1}')US^{-2}U'(x_t-x_r\mathbf{1})}{\sigma^3} \end{align} $$Setting the gradient to 0, we get the maximum likelihood estimates
$$ \begin{align}\hat x_r &= \frac{x_t'US^{-2}U'\mathbf{1}}{\mathbf{1}'US^{-2}U'\mathbf{1}}\\\hat \sigma &= \sqrt{\frac{(x_t'-\hat x_r\mathbf{1}')US^{-2}U'(x_t-\hat x_r\mathbf{1})}{n}}\end{align} $$
Φ = sy([
1 0 0 1
0 1 0 1
0 0 sympy.sqrt(2) 0
])
U, S, V = sySVD(Φ);
U
diagm(S)
xt = [3, 3, -1]
invCov = U * diagm(S.^-2) * U'
xr = (xt' * invCov * ones(Int, 3))/(ones(Int, 3)' * invCov*ones(Int, 3))
float(xr)
residuals = xt .- xr
σ = sqrt((residuals' * invCov * residuals)/length(xt))
float(σ)
float(residuals/σ)
z = convert(Vector{Sym}, diagm(1 ./ S) * U' * (xt .- xr))
float.(z)
The phylogenetic mean is a weighted mean of the values at the tips. The weights only depend on $\Phi$; they are given by the formula
$$ w = \frac{US^{-2}U'\mathbf{1}}{\mathbf{1}'US^{-2}U'\mathbf{1}} $$
w = (invCov * ones(Int, 3)) / (ones(Int, 3)' * invCov * ones(Int, 3))
Another example:
Φ = [
sqrt(2) 0 0 0 0 1 sqrt(2) 0
0 sqrt(2) 0 0 0 1 sqrt(2) 0
0 0 sqrt(3) 0 0 0 sqrt(2) 0
0 0 0 1 0 0 0 2
0 0 0 0 1 0 0 2
]
invCov = inv(Φ*Φ')
w = (invCov * ones(Int, 5)) / (ones(Int, 5)' * invCov * ones(Int, 5))
We assume that we are dealing with several characters that evolve according to a correlated Brownian motion along the branches of a phylogeny. This means that the evolution on each branch is drawn from a centered but correlated normal distribution.
Let $Z$ be an $n\times k$ matrix, where $n$ is the number of branches of the phylogeny and $k$ the number of features. We assume that the rows of $Z$ are stochastically independent, and each row is drawn from a multivariate normal distribution:
$$ Z_i \sim \mathcal{N}(\mathbf{0}, \Sigma) $$$\Phi$ is an $m\times n$ phylogenetic matrix, with the SVD
$$ \Phi = USV' $$$V$ is the matrix of eigenvectors of $\Phi' \Phi$, so its columns are orthonormal.
$$ \begin{align} (V'Z)_{ij} &= \sum_a V'_{ia} Z_{aj}\\ &= \sum_a V_{ai}Z_{aj}\\ E(V'Z)_{ij} &= 0\\ Cov((V'Z)_{ij}, (V'Z)_{kl}) &= \sum_{a,b} V_{ai}V_{bk}Cov(Z_{aj}, Z_{bl})\\ &= \sum_a V_{ai}V_{ak}\sigma_{jl}\\ &= \delta_{ik}\sigma_{jl} \end{align} $$So the rows of $V'Z$ are mutually independent and drawn from the same distribution as the rows of $Z$: the second-to-last step uses the independence of the rows of $Z$ (so that $Cov(Z_{aj}, Z_{bl}) = \delta_{ab}\sigma_{jl}$), and the last step the orthonormality of the columns of $V$.
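This can also be checked numerically (the orthogonal matrix and the feature covariance below are arbitrary): estimating the covariance of the first row of $V'Z$ over many independent replicates recovers the row covariance of $Z$.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(1)
n, k = 4, 2
Σ = [1.0 0.6; 0.6 1.0]                 # assumed row covariance of Z
A = cholesky(Σ).L                      # A*g has covariance Σ for g ~ N(0, I)
V = Matrix(qr(randn(n, n)).Q)          # an arbitrary orthogonal matrix

R = 50_000
firstrows = zeros(R, k)
for r in 1:R
    Z = randn(n, k) * A'               # rows of Z are iid N(0, Σ)
    firstrows[r, :] = (V' * Z)[1, :]   # first row of V'Z
end
cov(firstrows)                         # ≈ Σ
```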
The observed data $X$ form an $m\times k$-matrix ($m$ being the number of tips and $k$ the number of features). It is generated by the process
$$ X_t := \Phi Z\,\mathrm{diag}(\vec\sigma) + \mathbf{1}\vec x_r' $$According to the SVD theorem,
$$ \begin{align} (X_t -\mathbf{1}\vec x_r')\,\mathrm{diag}(\vec\sigma)^{-1} &= \Phi Z\\ &= USV' Z\\ S^{-1}U' (X_t-\mathbf{1}\vec x_r')\,\mathrm{diag}(\vec\sigma)^{-1} &= V'Z \end{align} $$We can estimate the phylogenetic means $\vec x_r$ and the rates of evolution $\vec\sigma$. After normalizing each of the columns of $X_t$ (subtracting the phylogenetic mean and dividing by the rate of evolution), we apply the transformation $S^{-1}U'$ to obtain an estimate of $V'Z$. Since the rows of this matrix are drawn from the same distribution as the rows of $Z$, we can apply regular PCA to this matrix.
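Putting the pieces together, here is a minimal numerical sketch of the whole procedure (the phylogenetic matrix reuses the example above; the first data column is the `xt` from the text, the second column is made up): estimate the phylogenetic mean and rate for each column, decorrelate the rows with $S^{-1}U'$, and run ordinary PCA on the result.

```julia
using LinearAlgebra, Statistics

# phylogenetic matrix from the example above; data matrix X with two
# characters at the three tips (first column as in the text, second made up)
Φ = [1.0 0 0 1; 0 1 0 1; 0 0 sqrt(2) 0]
X = [3.0 1.0; 3.0 0.5; -1.0 -0.2]

F = svd(Φ)
U, S = F.U, F.S
invCov = U * Diagonal(S .^ -2) * U'    # (ΦΦ')⁻¹
m = size(X, 1)
one_m = ones(m)

# column-wise phylogenetic means and rates of evolution
xr = (X' * invCov * one_m) ./ (one_m' * invCov * one_m)
C = X .- xr'                            # centered data
σ = [sqrt((C[:, j]' * invCov * C[:, j]) / m) for j in 1:size(X, 2)]

# decorrelate the rows: estimate of V'Z, whose rows are iid
Y = Diagonal(S .^ -1) * U' * (C ./ σ')
# ordinary PCA on the decorrelated data: principal axes in pca.V
pca = svd(Y .- mean(Y, dims=1))
```

For the first column this reproduces the phylogenetic mean and rate computed symbolically above.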