Uniform Distribution on a Sphere: Part 0 (Preliminaries)

(Note that this topic has been covered in detail by several well-known websites. While their material is sound, I feel that some of the interesting details in this problem are obscured by  multivariate calculus. In this series of posts, I will attempt to explain this problem assuming limited prior mathematical knowledge. Readers with a good grasp of probability will want to skip past the preliminaries.)

In this series of posts we will answer the question: how do we choose a point uniformly at random on the surface of a sphere? (Think: how do we pick a random place on a globe without bias to any particular location(s)?)

Before we get to the solution, we need to address a few basics. First consider the phrase “uniformly at random.” What does this mean?

The Simple Definition

Consider any surface. It can be a sheet of paper, or a globe, or a pyramid, or a… whatever. On your surface, pick any two non-overlapping regions that have equal area. Now pick 1,000 “random” points on the surface. Count the number of points that end up in each region. Erase all the points. Then pick another 1,000 points and count again. Imagine that we keep repeating this process. These points are distributed randomly if and only if both regions contain the same number of points on average. That is, the points appear to be evenly spread out over the surface; the points do not cluster anywhere. We will use this definition in the following posts and refer to it as the “equal volume/equal mass” rule.[1]

With this simple definition of uniformity, you can continue to Part 1. I have provided the following more technical definition of uniformity for those who are interested. It will be needed for Parts 3b and 4, but is not required to understand the solution of the problem in Part 3a.

The Technical Definition

Let’s say that we run an experiment with two different outcomes—for instance, flipping a coin. We can call these outcomes H and T and the set \{H,T\} our sample space. In order to define a probability distribution on this space—and note that we always define a distribution on a particular space—we will assign a probability to each outcome in the space denoting its likelihood of occurrence. These probabilities must satisfy two rules:

  1. Each probability must be non-negative.
  2. The probabilities must sum to 1—i.e., our distribution must completely account for the possible outcomes of our experiment.

In the case of our coin flipping experiment, we might say that the probability of seeing heads is 1/2. We write this as P(H)=1/2. If we also say that P(T)=1/2, then we have satisfied the two conditions above and have defined a valid probability distribution on our sample space. Notice that in this distribution both outcomes have the same assigned probability; they are equally likely. We call this distribution the uniform distribution. Remember: a distribution only makes sense when defined on a sample space. This means that there are many uniform distributions! Usually the sample space is clear in context, and this doesn’t present a problem.

To illustrate this, let’s consider another simple experiment. Now we will throw a six-sided die and define the sample space as \{1,2,\ldots,6\}. We assume (and casinos hope) that each of these six outcomes is equally likely. Therefore each outcome must have a probability of 1/6. This is clearly a different distribution than above, but we say that both are uniform. A general rule for defining uniform distributions should be becoming clear, but let’s introduce one more concept before we state it explicitly.

Often we will want to be able to refer to the unknown outcome of an experiment in a concise way. For instance, let’s say that X refers to the result of one roll of our six-sided die. Until we roll the die, we do not know X. It could be 1, or 2, or 3, etc. Therefore we call X a random variable. It stands for something—just like variables in algebra—but that thing is determined randomly. Now we can speak about a probability distribution for X, which is simply the rule above that assigns outcomes to probabilities. In mathematics we call a rule of this sort a function. Here we will call the function f_X (where we use the subscript X to make clear which random variable the function corresponds to), and say that f_X(A)=P(X=A) where A is some event in our sample space. Since the probabilities must follow the two rules above, our function must follow these two rules:

  1. f_X(\omega)>0 where \omega is an outcome in the sample space.
  2. \sum_{\omega\in\Omega}f_X(\omega)=1 where \Omega is the sample space, \in is the mathematical symbol for “in”, and \sum means “sum”.

Any function that follows these two rules is known as a (probability) mass function. Since this function corresponds directly to the assigned probabilities, we can see that it completely determines the distribution of X.

Now we can define the uniform distribution in a more general way. Take a random variable X and a sample space \Omega. We denote the number of elements in \Omega as |\Omega|. X follows the uniform distribution if and only if f_X(\omega)=1/|\Omega| for all \omega in \Omega. It is clear that each outcome is equally likely and the sum of the mass function for all outcomes is 1.

What happens if I choose my sample space to be all real numbers between 0 and 1? We denote this interval as [0,1] and our sample space \Omega=[0,1]. Now what is |\Omega|? There are infinitely many real numbers between 0 and 1. In fact, there are so many they are literally uncountable! So |\Omega|=\infty and our rule of 1/|\Omega| no longer makes sense. Can we actually choose from an uncountable number of possibilities uniformly at random?

Yes, but we will first need to talk about an important distinction between two types of random variables. Up until now we have dealt with discrete random variables. These are random variables that take on a countable number of possible outcomes—i.e., we can arrange all the outcomes in an ordered list. To handle an uncountable sample space we will need to look at continuous random variables.

Consider a continuous random variable that takes values in the interval [a,b]. I will put forth this claim: only a countable number of outcomes can have mass—i.e., P(\omega)>0 for only a countable number of \omega in [a,b]. As a consequence, we still must have that P(\omega)=0 for an infinite (in fact, uncountable) number of \omega. (Take this as a given for now, I will post a proof on the Wiki later.)

This means that we can always divide a probability space into a countable discrete part (where outcomes have mass) and an uncountable continuous part. So let’s just assume that our sample space is continuous and P(\omega)=0 for all \omega. How is it possible for us to have a meaningful random variable over this space?

We must introduce the concept of probability density. Think back to grade school science class where you defined density as mass divided by volume. When we think of real objects it makes intuitive sense that mass must be spread out over some physical space—i.e., over some volume. The same Is true for continuous probability distributions. Instead of having the mass concentrated exactly at some number of points, we have it spread out over the sample space. Let’s rewrite our density formula as mass = density \cdot volume. If we consider the one-dimensional interval [0,1], our notion of “volume” is really length. If the probability density is constant over the interval, then our formula provides a simple way to find the mass contained in any subinterval—i.e., the probability that a random sample will fall in that subinterval. For example, let’s take the density on this interval to be 1. Then the probability that a random sample falls between 0.25 and 0.75 is 0.5 (mass=1\cdot(0.75-0.25)=0.5).

What happens if the density of a random variable is non-constant? We will consider the function f_X(x) that gives the density of the random variable X at each point along the interval. This is called the probability density function (pdf). Say we want to figure out the mass contained in the interval [0.25,0.75] for some continuous probability distribution with a non-constant density. Since the density is changing over this interval, let’s split the interval into slivers of length \mathrm{d}x. These slivers will be small so that the density is approximately constant over each sliver. Now we can apply our formula for constant density on each sliver to approximate its mass. If we choose f_X(x) as the density over the sliver from x to x+\mathrm{d}x, we find that the mass is approximately f_X(x)\cdot\mathrm{d}x.[2] See Figure 1 for the graphical interpretation of this problem.

Approximating the mass of a sliver
Figure 1: The area of the shaded rectangle approximates the mass of the sliver from x to x+\mathrm{d}x. As we make \mathrm{d}x smaller, the approximation becomes more accurate. By adding up all such slivers in an interval, we will find the mass contained in that interval (i.e., the area under the curve).

By adding up the mass of all the slivers, we will approximate the mass in the entire interval. We see that we could improve our approximation by adding the mass over a larger number of smaller slivers—i.e., our approximation improves as we make \mathrm{d}x smaller. Thus we want the length \mathrm{d}x to be as small as possible. We cannot actually pick a smallest (non-zero) length, so we say that we want to find the sum of the masses in our slivers as \mathrm{d}x approaches 0. In order to represent this we use the symbol \int[3], which will indicate that we want the sum in the limit as the length of our slivers goes to 0.[4] Using this notation, our calculation becomes

 P(0.25 < X < 0.75)=\int_{0.25}^{0.75}\!f_X(x)\,\mathrm{d}x

. (For those of you who know calculus, you will recognize this entire exercise as taking the integral of the density over the interval [0.25, 0.75].)

Above we had two rules that probability mass functions needed to satisfy to represent probability distributions. Here we can recast those rules for a continuous random variable X that takes values on the interval [a,b]:

  1. f_X(x)>0 where x is in the interval [a,b].
  2. \int_a^b\!f_X(x)\,\mathrm{d}x=1.

Now, we have all the pieces we need to define the continuous uniform distribution. In the discrete setting, we required that all outcomes are equally likely (i.e., they have equal mass). Here we see that when we divide our interval into these infinitesimal slivers, they will all need to have the same mass. According to our formula above each sliver has mass f(x)\cdot\mathrm{d}x, so we see that we will need to ensure that f(x) is constant. Therefore uniform distributions have constant probability density functions. In order to satisfy the second rule above, we need that f(x)=1/(b-a) for a uniform distribution on the interval [a,b]. (Using our simple formula for constant density we have that the mass contained in the entire interval is mass=1/(b-a)\cdot(b-a)=1.)

Note that we can (and will need to) extend this concept to higher dimensions. For instance, if we take two random variables X and Y, we can consider their joint distribution. That is, we can consider our samples as a pair of numbers (x,y) with a density function f_{X,Y}(x,y). If we want to figure out the mass contained in any two-dimensional region we now consider splitting it into tiny rectangles instead of slivers. The  little rectangles have area \mathrm{d}x\cdot\mathrm{d}y. So the mass contained in a region A is

 P( (X,Y)\in A)=\iint_A\! f_{X,Y}(x,y)\,\mathrm{d}x\mathrm{d}y

where this is now a sum over two-dimensional rectangles with area approaching zero. Again, we are using f_{X,Y}(x,y) to approximate the density over the entire rectangle with corners (x,y) and (x+\mathrm{d}x,y+\mathrm{d}y). Convince yourself that uniform distributions in higher dimensions must also have constant density.

Before we move on, there is one last thing to notice. Given a uniform probability distribution, subintervals (think: slivers or rectangles) of equal “volume” have equal mass. This is an equivalent definition of the uniform distribution. Remember this, because it will be critical later!

OK. That was a lot of material, but now we are finally ready to address the actual problem. Continue on to Part 1 for more.

  1. [1] Volume refers to the area of the path. Mass refers to the expected number of points. See the technical definition for a more detailed explanation of probability mass.
  2. [2] Note that we could choose the density at any point in the sliver to make this approximation. Here we just happen to choose the left endpoint x.
  3. [3] Like the summation symbol \Sigma that we used before, this is just another fancy “S”.
  4. [4] It is not immediately clear that this sum actually approaches a unique number as we decrease the length of our slivers. You will have to trust that it does in our problem, and you can learn how to prove this if you take real analysis.