Discrete Cosine Transform

[This should work fine with recent desktop versions of Chrome and Firefox. I haven't tested with other browsers.]

The discrete cosine transform (DCT) is used in many areas, the most prominent one probably being lossy compresion of audio and images. If one leaves out the mathematical derivation and the proofs, then the basic idea isn't that hard to explain visually.

We start by drawing the cosine function between $0$ and $\pi$. We then sample the function at certain points: We divide the interval from $0$ to $\pi$ into three parts of equal size and take their centers, i.e. $x=\frac\pi6$, $x=\frac\pi2$, and $x=\frac{5\pi}6$. Now we compute the values of the cosine function at these points. For image processing, we translate the results to different shades of gray, from black ($-1$) to white ($1$), so that $0$ is medium gray.

We not only do this with the function $\color{blue}{x \mapsto \cos(x) = \cos(1 \cdot x)}$, but also with $\color{green}{x \mapsto \cos(2 \cdot x)}$ and with $\color{brown}{x \mapsto \cos(0 \cdot x) = 1}$. This gives us three vectors of three numbers, which we can interpret as three rectangles, each consisting of three gray blocks:

It is now not hard to prove that these three vectors $\color{brown}{(1, 1, 1)}$, $\color{blue}{(\frac{\sqrt3}2, 0, -\frac{\sqrt3}2)}$, and $\color{green}{(\frac12, -1, \frac12)}$ are linearly independent, which in turn implies that every vector of three numbers can be expressed as a linear combination of them. Or, in other words, each image consisting of three grayscale pixels can be pieced together using the three "Lego bricks" , , and .

What does that mean? Suppose we want to "build" the "picture" , which has the gray values $\color{red}{0.9294}$, $\color{red}{0.2000}$, and $\color{red}{0.0667}$ (from left to right). We want to express it as $\mathrel{=} \color{brown}{\alpha} \cdot$ $\mathop{+} \color{blue}{\beta} \cdot$ $\mathop{+} \color{green}{\gamma} \cdot$ .

Using numbers, this yields a simple system of linear equations \[\begin{align*} \color{red}{0.9294} &= \color{brown}{\alpha} + \color{blue}{\frac{\sqrt3}2 \beta} + \color{green}{\frac12 \gamma} \\ \color{red}{0.2000} &= \color{brown}{\alpha} + \color{white}{\frac{\sqrt3}2 \beta} - \color{green}{\gamma} \\ \color{red}{0.0667} &= \color{brown}{\alpha} \color{blue}{- \frac{\sqrt3}2 \beta} + \color{green}{\frac12 \gamma} \\ \end{align*}\] with solutions $\color{brown}{\alpha\approx0.3987}$, $\color{blue}{\beta\approx0.4981}$, and $\color{green}{\gamma\approx0.1987}$.

Our "graphical equation" from above thus has this solution: $\mathrel{=}$ $\mathop{+}$ $\mathop{+}$ . Note that adding something that's lighter than medium gray makes the overall result lighter while adding something darker makes the result darker (with medium gray itself being the identity element).

We can now go on and add the two functions $\color{orange}{x \mapsto \cos(3x)}$ and $\color{purple}{x \mapsto \cos(4x)}$ and sample all five functions at five (instead of three) equidistant points. That'll give us more and larger "Lego bricks" which we can use to construct all pictures consisting of five pixels:

And we can of course do this for ten, or fifty, or many, many more pixels. But there's something even more interesting we can do. We can arrange the pixels in two dimensions! For example, we can arrange nine pixels together with their "sample points" in a three-by-three square. We then let one cosine function run in one direction and another one perpendicular to it. The product of the two functions will yield the values determining the gray tones.

You can see what we get below. Click on of the numbers at the bottom of the picture to see different patterns. And within the picture, move your mouse with the button pressed to see the cosine curves from different angles.

These are the patterns we get (and you can click on them to see the curves they were generated from):

Note that we have arranged the patterns in a certain way. If you go through this table from left to right, you're observing increasing horizontal frequencies in the patterns. If you go from top to bottom, you're seeing increasing vertical frequencies. In other words, if you traverse the table diagonally (from the upper left to the lower right corner), the way the patterns are numbered, you move from the pattern with no variation at all towards more "turbulent" patterns.

It's again easy to see mathematically that all three-by-three pictures can be represented (uniquely, by the way) as combinations of these nine patterns. And it's instructive to look at a few examples.

	1	2	3	4	5	6	7	8	9

A	0	0	0	0	0	0	0	0	0
B	0	0	0	0	0	0	0	0	0
C	0	0	0	0	0	0	0	0	0
D	0	0	0	0	0	0	0	0	0
E	0	0	0	0	0	0	0	0	0
F	0	0	0	0	0	0	0	0	0

Picture A is identical to pattern 3, so it is obvious that we only need this pattern to reconstruct the picture.
Picture B is similar to pattern 4, but it is lighter. To reconstruct it, we use pattern 1 for the "basic tint" and we then subtract a fraction of pattern 4 to get it right.
Picture C has only one horizontal frequency (from light to dark) in it, but apart from patterns 1 and 3 we also need a bit of pattern 6 to achieve the correct grays.
Picture D is a mix of vertical and horizontal frequencies and the most complicated picture to reconstruct so far. But note that we still didn't need patterns 7 to 9.
Picture E is a very "turbulent" picture and to reconstruct it, we need to combine all nine patterns.
However, we needed only a very small amount of pattern 9. Picture F is the result of reconstructing picture E from only the first eight patterns. It is pretty hard, if not impossible, to perceive a difference between F and E.

Note that the blue squares on the right visualize how important the individual patterns are to reconstruct a picture. One can see that the less "turbulent" a picture is, the more "energy" is concentrated in the upper left of the square.

Finally, let's look at how this works with a real picture. Below is a detail of a former German bank note showing the famous mathematician Carl Friedrich Gauß. It's a $32\times32$ image blown up by a factor of four in each dimension so that we can see the individual pixels.

With a $32 \times 32$ grid, we can build every picture as a combination of $32\cdot32=1024$ patterns. If you move the slider from left to right, you will see how the Gauß portrait is successively reconstructed from more and more patterns, starting with the constant pattern and moving towards patterns with more variation. The blue square on the right has the same meaning as the blue squares above. Here's what the three squares in the middle show:

The left one shows the basic patterns.
The middle one shows the same pattern as the left one, but already multiplied with the coefficient needed for reconstruction. If the factor was very small, this will be almost homogenous medium gray.
The right square shows the reconstruction so far, i.e. the more you move the slider to the right, the more patterns will have been added up.

If you look at the pictures in their real size, you'll notice that once the slider is at about 50%, it is already hard to see a difference between the original picture and the reconstructed one. That's the main ingredient of compression techniques like JPEG: In typical pictures, almost all of the "energy" is concentrated in the lower frequencies (in the upper left part of the blue square), so the remaining information can safely be thrown away without a noticable effect.

(If the result at the far right end doesn't look exactly like the picture we started with, that's due to rounding errors. The algorithm itself is exact.)

Here are two more examples. The picture of the football has less detail than the Gauß portrait. We can see a much more pronounced concentration of the "energy" in the upper left corner and thus you will already see an acceptable rendition of the football once the slider has traveled about a third of the full distance.

The final picture is just white noise. The coefficients are all over the place and it takes a long time until the reconstructed image resembles the original.

Some final remarks: We said above that the vectors were linearly independent. In fact, they are even pairwise orthogonal. That's also the reason why most libraries that implemenent the DCT will likely compute coefficients which are different from the ones shown here (for didactical reasons). They'll make sure the vectors derived from "sampling" the various cosine functions will form a orthonormal basis. In practice, this doesn't make a difference, though, as you will almost always only be interested in the ratios of the coefficients and not in the individual values.

	1	2	3	4	5	6	7	8	9

A	0	0	0	0	0	0	0	0	0
B	0	0	0	0	0	0	0	0	0
C	0	0	0	0	0	0	0	0	0
D	0	0	0	0	0	0	0	0	0
E	0	0	0	0	0	0	0	0	0
F	0	0	0	0	0	0	0	0	0

	1	2	3	4	5	6	7	8	9

A	0	0	0	0	0	0	0	0	0
B	0	0	0	0	0	0	0	0	0
C	0	0	0	0	0	0	0	0	0
D	0	0	0	0	0	0	0	0	0
E	0	0	0	0	0	0	0	0	0
F	0	0	0	0	0	0	0	0	0

	1	2	3	4	5	6	7	8	9

A	0	0	0	0	0	0	0	0	0
B	0	0	0	0	0	0	0	0	0
C	0	0	0	0	0	0	0	0	0
D	0	0	0	0	0	0	0	0	0
E	0	0	0	0	0	0	0	0	0
F	0	0	0	0	0	0	0	0	0