Almost Orthogonal Vectors
First published: 24th April 2025
Last updated: 30th April 2025
I came across this claim in a machine learning paper:
As a result, an n-dimensional representation may encode features not with the n basis directions (neurons) but with the $\exp(n)$ possible almost orthogonal directions (Elhage et al., 2022b), leading to polysemanticity.
All that the reference had to say about it was this:
Although it's only possible to have $n$ orthogonal vectors in an n-dimensional space, it's possible to have $\exp(n)$ many "almost orthogonal" ($<\epsilon$ cosine similarity) vectors in high-dimensional spaces. See the Johnson–Lindenstrauss lemma.
For us mere mortals, let's fill in some of the gaps here.
Almost Orthogonal
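What does it mean to be "almost orthogonal" anyway? Two vectors are orthogonal if the angle between them is a right angle, or equivalently, if their dot product is $0$.
The angle $\theta$ between two non-zero vectors $u$ and $v$ is given by $u \cdot v = \|u\| \|v\| \cos\theta$, or
$$\cos\theta = \frac{u \cdot v}{\|u\| \|v\|}$$
Thus we can pick a small $\epsilon > 0$ and say that two vectors are almost orthogonal if
$$|\cos\theta| < \epsilon$$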
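To make the definition concrete, here is a minimal Python sketch of the test $|\cos\theta| < \epsilon$. The function names `cosine_similarity` and `almost_orthogonal` are my own, chosen just for illustration, and vectors are plain lists of floats.

```python
import math

def cosine_similarity(u, v):
    """Return cos(theta) for the angle theta between non-zero vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def almost_orthogonal(u, v, eps):
    """True if |cos(theta)| < eps, where the tolerance eps is chosen by the caller."""
    return abs(cosine_similarity(u, v)) < eps

# Exactly orthogonal, slightly tilted, and clearly non-orthogonal pairs.
print(almost_orthogonal([1.0, 0.0], [0.0, 1.0], eps=0.1))
print(almost_orthogonal([1.0, 0.05], [0.0, 1.0], eps=0.1))
print(almost_orthogonal([1.0, 0.0], [1.0, 1.0], eps=0.1))
```

The first two pairs pass the test with `eps=0.1`, while the third (at 45 degrees) does not.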
Proof using the Johnson–Lindenstrauss Lemma
The lemma states:
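Given $0 < \epsilon < 1$, a set $X$ of $N$ points in $\mathbb{R}^n$, and an integer $k > 8(\ln N)/\epsilon^2$, there is a linear map $f : \mathbb{R}^n \to \mathbb{R}^k$ such that $(1-\epsilon)\|u-v\|^2 \le \|f(u)-f(v)\|^2 \le (1+\epsilon)\|u-v\|^2$ for all $u, v \in X$.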
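To translate this into an expression about angles, note that
$$\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\, u \cdot v$$
We can rearrange this to get an expression for $u \cdot v$
$$u \cdot v = \tfrac{1}{2}\left(\|u\|^2 + \|v\|^2 - \|u - v\|^2\right)$$
As the transformation is linear, $f(0) = 0$, so putting $v = 0$ in the lemma's inequality (adding the origin to $X$ if it isn't already there), we get
$$(1-\epsilon)\|u\|^2 \le \|f(u)\|^2 \le (1+\epsilon)\|u\|^2$$
Now, let $\theta'$ be the angle between $f(u)$ and $f(v)$
$$\cos\theta' = \frac{f(u)\cdot f(v)}{\|f(u)\|\|f(v)\|} = \frac{\|f(u)\|^2 + \|f(v)\|^2 - \|f(u) - f(v)\|^2}{2\|f(u)\|\|f(v)\|}$$
We can now compare the angles. For unit vectors $u$ and $v$ at angle $\theta$, bounding each squared norm on the right using the lemma gives
$$|f(u)\cdot f(v) - \cos\theta| \le \epsilon(2 - \cos\theta), \qquad \|f(u)\|\|f(v)\| \ge 1 - \epsilon$$
and so, when $\cos\theta = 0$,
$$|\cos\theta'| \le \frac{2\epsilon}{1-\epsilon}$$
So if we take our points to be the origin together with $N - 1$ exactly orthogonal unit vectors in $\mathbb{R}^n$ (which needs $n \ge N - 1$), they will be mapped to almost-orthogonal vectors in $\mathbb{R}^k$. Here $k$ plays the role of the dimension called $n$ in the quotes above.
Since the lemma holds for any $k > 8(\ln N)/\epsilon^2$, we can fix $\epsilon$ and have $N$ grow exponentially with $k$: any $N < e^{k\epsilon^2/8}$ will do. QED.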
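The lemma only says that a suitable linear map exists. As an illustration, the sketch below swaps in a random Gaussian matrix, which is the usual construction in proofs of the lemma but is my assumption here rather than anything the argument above depends on. It maps 200 exactly orthogonal vectors (the standard basis of $\mathbb{R}^{200}$) into $\mathbb{R}^{100}$ and reports the worst cosine similarity between any two images. The names `project_standard_basis` and `max_abs_cosine` are made up for this example.

```python
import math
import random

def project_standard_basis(num_vectors, k, rng):
    """Images of the standard basis e_1..e_N of R^N under a random k x N matrix.

    Assumption: the matrix has independent N(0, 1/k) entries (a random
    Gaussian projection). The image of e_i is simply column i of the matrix.
    """
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)]
            for _ in range(num_vectors)]

def max_abs_cosine(vectors):
    """Largest |cos(theta)| over all pairs of distinct vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot) / (norms[i] * norms[j]))
    return worst

rng = random.Random(0)  # fixed seed so the run is repeatable
images = project_standard_basis(num_vectors=200, k=100, rng=rng)
worst = max_abs_cosine(images)
print(f"200 vectors in R^100: largest |cos| between any pair = {worst:.3f}")
```

There are twice as many vectors as dimensions, so they cannot all be exactly orthogonal, yet no pair ends up anywhere near parallel.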
Thinking Geometrically
I found the above argument persuasive but not very intuitive. Here is a scheme I came up with for getting a better feeling of why this happens in high dimensions even though it isn't the case in 2 or 3 dimensions.
We are going to pick points on the unit sphere one by one. Points on the sphere obviously correspond to unit vectors. After picking a point $p$, we will shade out on the sphere all points $q$ that are "not orthogonal enough" to $p$ (i.e. $|p \cdot q|$ is too large).

Figure 1: After we have picked our first point $p$ (shown here at the north pole), we shade out the areas (grey) that are too close to $p$ and $-p$.
At each step, we will only be allowed to pick a point if it isn't shaded in. As we add more and more points to our set, more and more of the area of the sphere will be shaded in, until eventually we run out of space.

Figure 2: After adding each point, we shade out more of the sphere.
As a very rough lower bound, we know that the number of points we can pick before this happens is at least the total surface area of the sphere divided by the area that each point will cause to be shaded. In practice, this is a huge underestimate because there will be a lot of overlap in the shaded regions.
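The picking scheme is easy to simulate. The sketch below is a rough experiment under my own assumptions: candidates are drawn uniformly at random (by normalising Gaussian samples), a candidate counts as shaded if $|p \cdot q| \ge 0.3$ for some point $q$ already kept, and we stop after 300 candidates. The names `random_unit_vector` and `greedy_pick` are hypothetical.

```python
import math
import random

def random_unit_vector(n, rng):
    """Uniform random point on the unit sphere in R^n (normalised Gaussian sample)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def greedy_pick(n, eps, tries, rng):
    """Keep each random candidate p only if |p . q| < eps for every kept point q."""
    kept = []
    for _ in range(tries):
        p = random_unit_vector(n, rng)
        if all(abs(sum(a * b for a, b in zip(p, q))) < eps for q in kept):
            kept.append(p)
    return kept

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {n: len(greedy_pick(n, eps=0.3, tries=300, rng=rng)) for n in (3, 10, 30)}
for n, count in counts.items():
    print(f"n = {n:2d}: kept {count} almost-orthogonal points")
```

Random candidates are far from an optimal packing, so the counts say little about the true maximum. They do show how much more room there is as $n$ grows.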
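Let $\theta$ be the angle between a point on the sphere and $p$. The area of a strip on an (n-1)-sphere between $\theta$ and $\theta + d\theta$ is the area of an (n-2)-sphere with radius $\sin\theta$, multiplied by $d\theta$.
The area of an (n-2)-sphere with radius $r$ is some constant times $r^{n-2}$.
Thus, if we shade every point whose angle to $p$ is more than $\epsilon$ away from a right angle, the proportion of the sphere that is not shaded grey after we have added the first point is
$$\frac{\int_{\pi/2-\epsilon}^{\pi/2+\epsilon} \sin^{n-2}\theta \, d\theta}{\int_0^{\pi} \sin^{n-2}\theta \, d\theta}$$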
Now, as $n$ gets big, $\sin^{n-2}\theta$ gets very sharply peaked around $\theta = \pi/2$, so most of the area is concentrated in that central white band.
Here are plots of $\sin^{n-2}\theta$ against $\theta$ for two values of $n$:

[Plots of $\sin^{n-2}\theta$ against $\theta$]
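The concentration can also be checked numerically. The sketch below integrates $\sin^{n-2}\theta$ with a plain midpoint rule and reports the fraction of the sphere whose angle to $p$ lies within $\epsilon$ of a right angle. The choice $\epsilon = 0.2$ radians and the helper names `integrate` and `white_fraction` are mine, picked only for illustration.

```python
import math

def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def white_fraction(n, eps):
    """Fraction of the unit sphere in R^n whose angle to a fixed point p
    is within eps of a right angle."""
    def density(theta):
        return math.sin(theta) ** (n - 2)
    band = integrate(density, math.pi / 2 - eps, math.pi / 2 + eps)
    total = integrate(density, 0.0, math.pi)
    return band / total

for n in (3, 10, 100, 1000):
    print(f"n = {n:4d}: white band holds {white_fraction(n, eps=0.2):.4f} of the area")
```

For $n = 3$ the integral can be done by hand and the fraction is exactly $\sin\epsilon$, which makes a handy sanity check on the numerics.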

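The proportion of the grey area is (using the symmetry about $\theta = \pi/2$):
$$\frac{\int_0^{\pi/2-\epsilon} \sin^{n-2}\theta \, d\theta}{\int_0^{\pi/2} \sin^{n-2}\theta \, d\theta}$$
In the range $0 \le \theta \le \pi/2 - \epsilon$, we use $\sin\theta \le \sin(\pi/2 - \epsilon) = \cos\epsilon$.
This tells us that
$$\int_0^{\pi/2-\epsilon} \sin^{n-2}\theta \, d\theta \le \frac{\pi}{2} \cos^{n-2}\epsilon$$
In the range $\pi/2 - \epsilon \le \theta \le \pi/2$, we let $\phi = \pi/2 - \theta$ and use $\sin\theta = \cos\phi \ge e^{-\phi^2}$, which holds for $0 \le \phi \le 1$ (see graph).

This tells us that
$$\int_0^{\pi/2} \sin^{n-2}\theta \, d\theta \ge \int_0^{\epsilon} e^{-(n-2)\phi^2} \, d\phi = \sqrt{\frac{\pi}{n-2}} \left( \Phi\!\left(\epsilon\sqrt{2(n-2)}\right) - \frac{1}{2} \right)$$
Where $\Phi$ is the integral of the Gaussian normal. As $n \to \infty$, $\Phi\!\left(\epsilon\sqrt{2(n-2)}\right)$ tends to 1.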
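As a sanity check on the Gaussian step, the sketch below compares a numerical value of $\int_0^{\epsilon} \cos^{n-2}\phi \, d\phi$ with the closed-form lower bound $\sqrt{\pi/(n-2)}\,(\Phi(\epsilon\sqrt{2(n-2)}) - \tfrac{1}{2})$. It assumes the comparison $\cos\phi \ge e^{-\phi^2}$ on $0 \le \phi \le 1$, and the function names are my own. `math.erf` is used to evaluate $\Phi$.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def white_half_integral(n, eps, steps=20000):
    """Midpoint-rule value of the integral of cos(phi)^(n-2) over [0, eps]."""
    h = eps / steps
    return h * sum(math.cos((i + 0.5) * h) ** (n - 2) for i in range(steps))

def gaussian_lower_bound(n, eps):
    """Integral of exp(-(n-2) phi^2) over [0, eps], via the normal CDF.

    Assumes cos(phi) >= exp(-phi^2), which holds for 0 <= phi <= 1.
    """
    m = n - 2
    return math.sqrt(math.pi / m) * (normal_cdf(eps * math.sqrt(2 * m)) - 0.5)

for n in (10, 100, 1000):
    exact = white_half_integral(n, eps=0.2)
    bound = gaussian_lower_bound(n, eps=0.2)
    limit = 0.5 * math.sqrt(math.pi / (n - 2))
    print(f"n = {n:4d}: integral {exact:.5f} >= bound {bound:.5f} (large-n value {limit:.5f})")
```

For large $n$ the bound settles at $\tfrac{1}{2}\sqrt{\pi/(n-2)}$, so it only shrinks like $1/\sqrt{n}$.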
As $\cos^{n-2}\epsilon$ shrinks exponentially in $n$ but the integral over the white band only shrinks like $1/\sqrt{n}$ for large $n$, the fraction of the sphere that is shaded grey at each step shrinks exponentially in $n$.
So there you have it. For fixed $\epsilon$, the number of vectors we can find where every pair has an angle within $\epsilon$ of a right angle is exponential in $n$. This is because the area of the sphere is being concentrated around the equator.
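To put numbers on the rough lower bound (total area divided by the area shaded per point), here is a short sketch that evaluates it by numerical integration. The value $\epsilon = 0.2$ radians and the names `grey_fraction` and `count_lower_bound` are assumptions made for this example.

```python
import math

def grey_fraction(n, eps, steps=20000):
    """Fraction of the unit sphere in R^n shaded by one point p: all angles
    to p more than eps away from a right angle (computed on one hemisphere,
    by symmetry)."""
    def integral(a, b):
        h = (b - a) / steps
        return h * sum(math.sin(a + (i + 0.5) * h) ** (n - 2) for i in range(steps))
    return integral(0.0, math.pi / 2 - eps) / integral(0.0, math.pi / 2)

def count_lower_bound(n, eps):
    """Rough lower bound on the number of points: total area / area shaded per point."""
    return 1.0 / grey_fraction(n, eps)

log_bounds = {n: math.log(count_lower_bound(n, eps=0.2)) for n in (100, 200, 400, 800)}
for n, value in log_bounds.items():
    print(f"n = {n:3d}: ln(lower bound) = {value:.2f}")
```

The logarithm of the bound keeps climbing as $n$ doubles, which is what exponential growth in $n$ looks like.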