
Foundations of Probability & Random Variables

Learning Objectives

Motivation

Engineering, physics, and applied mathematics constantly face uncertainty. Probability provides a quantitative language for describing, analyzing, and predicting outcomes in the presence of uncertainty.

Some physicists believe that the universe itself is fundamentally probabilistic at the quantum level. Einstein famously protested: “God does not play dice with the universe.” But modern physics suggests that, in some situations, nature does roll the dice. Even if we prepare the same system in the same way, the outcome may vary. You don’t need to know any quantum mechanics for this course—the point is simply: Some randomness is believed to be built into the laws of physics.

Other kinds of uncertainty come from systems that are deterministic but too complex to predict exactly. Example: rolling a die. A die obeys Newton’s laws, but tiny changes in position, velocity, friction, or air flow make the outcome unpredictable, so we treat the result as a random variable. Geophysics examples:

Even when the physics is deterministic, we model our lack of knowledge using probability.

In summary, probability is useful whether nature is truly random or merely too complicated. Either way, we have to “play dice” in our models.

Probability Spaces and Axioms

This section introduces the mathematical foundation of probability.

Sample Space $\Omega$

The sample space is the set of all possible outcomes of an experiment.

The sample space should include every outcome that could happen, and nothing else.

Also note that $\Omega$ can contain non-numerical values; it is a description of the real-world experiment we wish to model.

Events as Subsets of $\Omega$

An event is any subset of the sample space.
Examples (die roll):

$$\Omega = \{1,2,3,4,5,6\}$$

Operations on events follow set operations:

In a finite or countable setting, every subset is a valid event.

Probability Measure

A probability measure $\mathbb{P}$ assigns a number $\mathbb{P}(A)$ to each event $A$, following three axioms (Kolmogorov, simplified). Note that $\mathbb{P}$ takes a subset of $\Omega$ (an event) as input, not individual elements $\omega \in \Omega$.
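As a quick illustration (a minimal sketch, not part of the formal development), we can check the axioms numerically for a fair die, where the uniform measure assigns $|A|/6$ to each event $A$; the function name `P` and the specific events are our own choices:

```python
from fractions import Fraction

# Fair die: the measure P assigns |A| / |Omega| to each event A (a subset of Omega).
omega = frozenset({1, 2, 3, 4, 5, 6})

def P(event):
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # "roll is even"
C = {1, 3}      # "roll is 1 or 3" (disjoint from A)

assert P(omega) == 1                          # normalization: P(Omega) = 1
assert all(P({w}) >= 0 for w in omega)        # non-negativity
assert P(A | C) == P(A) + P(C)                # additivity for disjoint events
```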

A Note on the Uncountable Case

So far we have focused on finite or countable sample spaces, where probabilities are assigned by listing the probability of each individual outcome. However, many real-world quantities in geophysics (and physics in general) take real-number values, so the sample space becomes uncountable, typically a subset of $\mathbb{R}$ or $\mathbb{R}^n$.

This requires more structure than the discrete case. If done without care, one can run into contradictions and paradoxes. We will not deal with these mathematical complexities in this course. Despite skipping the full measure‑theoretic machinery, everything we do is rigorous for the kinds of distributions and integrals used in geophysics and engineering; the advanced theory is only needed for pathological cases that we will rarely encounter.

Random Variables

Random variables allow us to translate uncertain outcomes into numerical quantities that we can analyze mathematically. They provide the connection between abstract probability spaces and the real-valued measurements used in geophysics and engineering. Although $\omega$ may be abstract, $X(\omega)$ is a real number we can compute with.

Definition

A random variable is a function

$$X:\Omega \to \mathbb{R},$$

assigning a real number to each outcome $\omega \in \Omega$.

A random variable is often misunderstood because of the word random. Mathematically, a random variable is not itself random—it is a deterministic function from the sample space to the real numbers,

$$X : \Omega \to \mathbb{R}.$$

The randomness comes entirely from the underlying experiment, which selects an outcome $\omega$ according to the probability measure $\mathbb{P}$. Once $\omega$ is fixed, the value $X(\omega)$ is completely determined. The role of a random variable is simply to assign numerical values to outcomes. All probability statements about a random variable $X$ (such as $\mathbb{P}(X \le x)$) are really probability statements about the underlying set of outcomes

$$\{\omega \in \Omega : X(\omega) \le x\}.$$

Thus, a random variable is simply a function—its “randomness” comes entirely from the randomness of the experiment that selects $\omega$, not from the function $X$ itself.

Connecting Outcomes to Probabilities

Once we have a random variable $X$, statements about real numbers like “$X \le x$” or “$X \in [a,b]$” correspond to events in the original probability space:

$$\mathbb{P}(X \in B) = \mathbb{P}\big(\{\omega \in \Omega:\; X(\omega) \in B\}\big)$$

The probability of these events is inherited from $\mathbb{P}$ on $\Omega$. This is how we build a distribution for $X$—the assignment of probability to intervals and sets of real numbers. The formal description of this distribution is the subject of the next section.

Distribution Functions

The distribution of a random variable describes how probability is assigned to different real values. We formalize this through the cumulative distribution function (CDF) and, when it exists, the probability density function (PDF).

Cumulative Distribution Function (CDF)

For any random variable $X$, the cumulative distribution function (CDF) is defined by

$$F_X(x) = \mathbb{P}(X \le x).$$

Universal Properties:

Every random variable — discrete, continuous, or mixed — has a CDF. The CDF fully determines the probability distribution.
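To make the definition concrete, here is a minimal sketch of the CDF of a fair die roll — a step function that jumps by $1/6$ at each atom. The function name `F` is ours:

```python
import math

# CDF of a fair die: F(x) = P(X <= x) = (number of faces <= x) / 6.
def F(x):
    return min(max(math.floor(x), 0), 6) / 6

assert F(0.5) == 0.0          # no probability below the smallest outcome
assert F(1.0) == 1 / 6        # jump of size 1/6 at the atom x = 1
assert F(3.5) == 0.5          # three of the six faces are <= 3.5
assert F(100.0) == 1.0        # F(x) -> 1 as x -> infinity
assert F(2.0) >= F(1.9)       # monotone non-decreasing
```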

Probability Density Function (PDF)

If the CDF of $X$ is differentiable, $X$ can be described by a probability density function (PDF) $f_X(x)$ satisfying

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\, \mathrm{d}t$$

and

$$f_X(x) = \frac{\mathrm{d}}{\mathrm{d}x} F_X(x).$$

Computing Probabilities from PDF:
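One way to see this relationship numerically: for an illustrative Exponential(1) density $f(t) = e^{-t}$, $t \ge 0$ (our choice, not from the text), $\mathbb{P}(a \le X \le b) = F_X(b) - F_X(a) = \int_a^b f(t)\,\mathrm{d}t$, which a simple midpoint rule reproduces:

```python
import math

# Illustrative Exponential(1) density: f(t) = e^{-t} for t >= 0, with CDF F(x) = 1 - e^{-x}.
def pdf(t):
    return math.exp(-t) if t >= 0 else 0.0

def prob_interval(a, b, n=10_000):
    """Approximate P(a <= X <= b) = integral of the PDF, via the midpoint rule."""
    h = (b - a) / n
    return h * sum(pdf(a + (i + 0.5) * h) for i in range(n))

exact = (1 - math.exp(-2.0)) - (1 - math.exp(-0.5))   # F(2) - F(0.5)
assert abs(prob_interval(0.5, 2.0) - exact) < 1e-8
```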

Discrete, Continuous, and Mixed Random Variables

We classify $X$ by how its probability is distributed over $\mathbb{R}$, as revealed by its CDF.

Discrete

$X$ is discrete if it places probability on a finite or countable set $\{x_1,x_2,\dots\}$:

$$\mathbb{P}(X \in B) = \sum_{x_i \in B} \mathbb{P}(X = x_i).$$

The CDF is a step function with jumps at the atoms $\{x_i\}$, where each jump has size

$$\mathbb{P}(X = x_i) = F_X(x_i) - \lim_{x \uparrow x_i} F_X(x).$$

Continuous

$X$ is continuous if its CDF is differentiable, with a probability density function $f_X(x)$ such that

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\, \mathrm{d}t.$$

For such a distribution, all individual points have zero probability:

$$\mathbb{P}(X = x) = 0 \quad \text{for all } x \in \mathbb{R}.$$

Mixed

$X$ is mixed if it has both discrete (jump) and continuous parts. The CDF has the form

$$F_X(x) = \sum_{i} \mathbb{P}(X = x_i)\,\mathbf{1}_{\{x_i \le x\}} + \int_{-\infty}^{x} f_X(t)\, \mathrm{d}t,$$

where:

Example: Travel time in seismology may be continuous when a wave arrives, but with some probability the wave fails to arrive, contributing a discrete mass at $+\infty$ (or equivalently, a point mass for “no arrival”).

Moments

Moments summarize key numerical properties of a random variable. They describe averages, variability, and dependence, and are central to modeling uncertainty in geophysics and engineering.

Expectation

The expectation (or mean) of a random variable $X$, written $\mathbb{E}[X]$, is the “average value” of $X$ under repeated sampling.

For any random variable,

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \, \mathrm{d}F_X(x),$$

where $F_X$ is the CDF of $X$.

Here $\mathrm{d}F_X$ is interpreted as a Riemann–Stieltjes integral, which automatically covers the discrete, continuous, and mixed cases. We expand on these cases below.

Discrete case

If $X$ takes values $\{x_i\}_{i\in I}$ (finite or countable) with point probabilities $p_i = \mathbb{P}(X = x_i)$, then

$$\mathbb{E}[X] = \sum_{i\in I} x_i \, p_i.$$

Continuous case (with a PDF)

If $X$ has a probability density function $f_X$, then

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x\, f_X(x)\, \mathrm{d}x.$$

Mixed case

If $X$ has atoms at $\{x_i\}$ with masses $p_i = \mathbb{P}(X=x_i)$ and a continuous part with density $f_X$, then

$$\mathbb{E}[X] = \sum_i x_i\, p_i + \int_{-\infty}^{\infty} x\, f_X(x)\, \mathrm{d}x.$$
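A minimal numerical check of the discrete and continuous formulas; the fair die and the Exponential(1) density are our illustrative choices:

```python
import math
from fractions import Fraction

# Discrete: fair die, E[X] = sum of x_i * p_i = 3.5.
E_die = sum(Fraction(x, 6) for x in range(1, 7))
assert E_die == Fraction(7, 2)

# Continuous: Exponential(1), E[X] = integral of x * e^{-x} dx = 1 (midpoint rule on [0, 50]).
n, b = 20_000, 50.0
h = b / n
E_exp = h * sum((t := (i + 0.5) * h) * math.exp(-t) for i in range(n))
assert abs(E_exp - 1.0) < 1e-3
```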

Expectation is linear

That is:

$$\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y].$$

Second Moment

One can easily generalize the expectation to define the second moment of $X$,

$$\mathbb{E}[X^2].$$

For discrete $X$:

$$\mathbb{E}[X^2] = \sum_i x_i^2 \, p_i.$$

For continuous $X$ with PDF $f_X$:

$$\mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 \, f_X(x)\, \mathrm{d}x.$$

The second moment captures the “typical squared magnitude” of $X$. It plays a central role in defining variance.

Variance and Standard Deviation

The variance measures spread around the mean. It is the second central moment (i.e., the second moment about the mean):

$$\mathrm{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right].$$

By expanding the square and using linearity of expectation, we obtain a useful computational formula in terms of the first and second moments:

$$\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2.$$

Interpretation: Variance quantifies how “spread out” the values of $X$ are around the mean. A small variance means values cluster tightly; a large variance means wider dispersion.

The standard deviation is the square root:

$$\sigma_X = \sqrt{\mathrm{Var}(X)}.$$

Standard deviation has the same units as $X$, making it easier to interpret physically (e.g., “typical deviation from the mean”).

Scaling Properties of Variance and Standard Deviation

An important and frequently used property is how variance and standard deviation scale when a random variable is multiplied by a constant.

For any constant $c$ and random variable $X$:

$$\mathrm{Var}(cX) = c^2 \, \mathrm{Var}(X),$$

$$\sigma_{cX} = |c| \, \sigma_X.$$

Example: If a measurement has standard deviation $\sigma = 5$ meters, and you measure the same quantity in centimeters (multiplying by $c = 100$), the new standard deviation is $100 \times 5 = 500$ centimeters, and the variance becomes $100^2 \times \mathrm{Var}(X) = 10000 \times \mathrm{Var}(X)$.
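Both the computational formula and the scaling rule are easy to verify on synthetic data; the normal samples below are a sketch of our own choosing, not a distribution from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=5.0, size=200_000)   # measurements with sigma = 5 "meters"

# Computational formula: Var(X) = E[X^2] - (E[X])^2
var_formula = np.mean(x**2) - np.mean(x)**2
assert abs(var_formula - np.var(x)) < 1e-6

# Scaling: converting meters to centimeters (c = 100) scales sigma by 100, variance by 100^2
cm = 100.0 * x
assert abs(np.std(cm) - 100.0 * np.std(x)) < 1e-6
assert abs(np.var(cm) - 100.0**2 * np.var(x)) < 1e-2
```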

Higher Moments

More generally, the $k$-th moment of $X$ is

$$\mathbb{E}[X^k].$$

The $k$-th central moment (moment about the mean) is

$$\mathbb{E}\!\left[(X - \mathbb{E}[X])^k\right].$$

Examples:

Higher moments quantify shape features of the distribution beyond location and spread.

Covariance and Correlation

For random variables $X$ and $Y$ with finite second moments, the covariance is

$$\mathrm{Cov}(X, Y) = \mathbb{E}\!\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right].$$

Equivalent form:

$$\mathrm{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y].$$

Covariance measures whether $X$ and $Y$ tend to increase together (positive), move oppositely (negative), or show no linear association (zero covariance, which does not by itself imply independence).

The correlation coefficient is the normalized covariance:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y},$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations.
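As a hedged numerical sketch (the synthetic model $Y = 2X + \varepsilon$ with $X, \varepsilon$ standard normal is our choice), the sample estimates should match the exact values $\mathrm{Cov}(X,Y) = 2$ and $\rho = 2/\sqrt{5} \approx 0.894$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
eps = rng.normal(size=100_000)
y = 2.0 * x + eps          # Cov(X,Y) = 2 Var(X) = 2;  Var(Y) = 4 + 1 = 5

cov = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[XY] - E[X] E[Y]
rho = cov / (np.std(x) * np.std(y))

assert abs(cov - 2.0) < 0.05
assert abs(rho - 2.0 / np.sqrt(5.0)) < 0.01
```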

Correlation satisfies:

Random Variables in Multiple Dimensions

Many problems in geophysics and applied mathematics involve multiple uncertain quantities at once. For example:

To model such situations, we introduce random vectors and their joint distributions.

Random Vectors

A random vector is a function

$$(X_1, X_2, \dots, X_n): \Omega \to \mathbb{R}^n.$$

Just as a single random variable is a deterministic function of $\omega \in \Omega$, a random vector assigns an $n$-tuple of real numbers to each outcome. All randomness still comes from the selection of $\omega$.

We will focus mainly on the two-dimensional case $(X,Y)$, but all definitions extend naturally to higher dimensions.

Joint Distributions

Joint Cumulative Distribution Function (Joint CDF)

For two random variables $(X, Y)$, the joint CDF is

$$F_{X,Y}(x,y) = \mathbb{P}(X \le x,\; Y \le y).$$

This fully characterizes the joint distribution of $(X,Y)$.

Properties:

Joint Probability Density Function (Joint PDF)

If the joint CDF is differentiable, we can define the joint PDF

$$f_{X,Y}(x,y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x,y).$$

The joint PDF satisfies:

$$\mathbb{P}\big((X,Y) \in A\big) = \iint_A f_{X,Y}(x,y)\, \mathrm{d}x\, \mathrm{d}y \quad \text{for any region } A \subset \mathbb{R}^2,$$

and the normalization condition:

$$\iint_{\mathbb{R}^2} f_{X,Y}(x,y)\, \mathrm{d}x\, \mathrm{d}y = 1.$$

Marginal Distributions

The individual (1D) distributions of $X$ and $Y$ are obtained by integrating out the other variable from the joint PDF.

Marginal PDFs

If $(X,Y)$ has joint PDF $f_{X,Y}$, then the marginal PDFs are

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, \mathrm{d}y, \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, \mathrm{d}x.$$

Interpretation:

The joint PDF describes the full 2D distribution, while a marginal PDF gives the distribution of one variable considered on its own.

Independence

Definition via Joint CDF

Random variables $X$ and $Y$ are independent if and only if

$$F_{X,Y}(x,y) = F_X(x)\, F_Y(y) \quad \text{for all } x,y.$$

Equivalent Definition via Joint PDF

If densities exist, independence is equivalent to:

$$f_{X,Y}(x,y) = f_X(x)\, f_Y(y).$$

Independence means: knowing one variable tells you nothing about the other.

Important Example of Dependence Without Correlation

It is possible for $X$ and $Y$ to be dependent even if $\mathrm{Cov}(X,Y)=0$. Thus:

Zero covariance does not imply independence.
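The classic counterexample is $X$ uniform on $[-1,1]$ with $Y = X^2$: by symmetry, $\mathrm{Cov}(X,Y) = \mathbb{E}[X^3] - \mathbb{E}[X]\,\mathbb{E}[X^2] = 0$, yet $Y$ is a deterministic function of $X$. A hedged simulation of this example:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=200_000)
y = x**2                                   # Y is a deterministic function of X

# Cov(X,Y) = E[X^3] - E[X] E[X^2] = 0 by the symmetry of X around 0
cov = np.mean(x * y) - np.mean(x) * np.mean(y)
assert abs(cov) < 0.01

# ...yet X and Y are clearly dependent: |X| > 0.5 forces Y > 0.25,
# while unconditionally P(Y > 0.25) = P(|X| > 0.5) = 0.5
assert np.all(y[np.abs(x) > 0.5] > 0.25)
assert abs(np.mean(y > 0.25) - 0.5) < 0.01
```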

Consequences of Independence

Independence dramatically simplifies many computations.

Expectation of Sums

Using linearity of expectation:

$$\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y] \quad \text{(always true, no independence needed).}$$

Expectation of Products (requires independence)

If $X$ and $Y$ are independent:

$$\mathbb{E}[XY] = \mathbb{E}[X]\, \mathbb{E}[Y].$$

This is not generally true without independence.

Variance of sum of random variables (requires independence)

If random variables $X_1, X_2, \dots, X_n$ are independent, then the variance of their sum is the sum of their variances:

$$\mathrm{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \mathrm{Var}(X_i).$$

This is not generally true without independence.
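A quick Monte Carlo sanity check of the additivity rule; the three independent Uniform(0,1) variables (each with variance $1/12$) are our illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
xs = rng.uniform(0.0, 1.0, size=(3, 500_000))   # three independent Uniform(0,1) samples

s = xs.sum(axis=0)
# Each Uniform(0,1) has variance 1/12, so Var(X1 + X2 + X3) = 3/12 = 0.25
assert abs(np.var(s) - 0.25) < 0.005
# Sample check: variance of the sum matches the sum of the per-variable variances
assert abs(np.var(s) - np.var(xs, axis=1).sum()) < 0.005
```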

Conditional Probability

Conditional probability allows us to update or refine probability assessments when new information becomes available. It is fundamental in statistics, inversion theory, and Bayesian inference used throughout geophysics.

Conditional Probability of Events

For events $A$ and $B$ with $\mathbb{P}(B) > 0$, the conditional probability of $A$ given $B$ is

$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.$$

Interpretation: We restrict our sample space to $B$ (the information we now know has occurred), then renormalize the probability of $A$ within this reduced space.

Equivalently,

$$\mathbb{P}(A \cap B) = \mathbb{P}(A \mid B)\,\mathbb{P}(B).$$

This identity is often used as the starting point for defining conditional densities.
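A small sketch with two fair dice (the predicate-style encoding of events is our own): conditioning on “the first die is even” changes the probability that the sum is 8 from $5/36$ to $1/6$:

```python
from fractions import Fraction
from itertools import product

# Two fair dice: Omega = {1,...,6}^2 with the uniform measure.
omega = list(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] + w[1] == 8     # "the sum is 8"
B = lambda w: w[0] % 2 == 0        # "the first die is even"
A_and_B = lambda w: A(w) and B(w)

# P(A | B) = P(A ∩ B) / P(B)
assert P(A) == Fraction(5, 36)
assert P(B) == Fraction(1, 2)
assert P(A_and_B) / P(B) == Fraction(1, 6)
```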

Conditional PDFs

Let $(X,Y)$ be jointly continuous with joint PDF $f_{X,Y}(x,y)$ and marginal PDF $f_Y(y) > 0$.
The conditional PDF of $X$ given $Y=y$ is defined by

$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}.$$

This parallels the discrete formula
$\mathbb{P}(A \mid B) = \mathbb{P}(A \cap B)/\mathbb{P}(B)$
but in density form.

It has the properties

  1. Normalization

     $$\int_{-\infty}^{+\infty} f_{X \mid Y}(x \mid y)\, \mathrm{d}x = 1.$$
  2. Relation to joint and marginal PDFs

     $$f_{X,Y}(x,y) = f_{X \mid Y}(x \mid y)\, f_Y(y) = f_{Y \mid X}(y \mid x)\, f_X(x).$$
  3. Independence: If $X$ and $Y$ are independent, then

     $$f_{X \mid Y}(x \mid y) = f_X(x),$$

     so conditioning on $Y$ has no effect on the PDF of $X$.

Conditional Expectation

The conditional expectation of $X$ given $Y=y$ is the expectation taken with respect to the conditional PDF:

$$\mathbb{E}[X \mid Y=y] = \int_{-\infty}^{+\infty} x\, f_{X \mid Y}(x \mid y)\, \mathrm{d}x.$$

This defines a function of the variable $y$.
In many statistical applications (including inversion and filtering), this function acts as the “best predictor” of $X$ given knowledge of $Y$.

We note the key identities

Bayes’ Rule

Conditional probability allows us to update beliefs when new information becomes available.
Bayes’ Rule formalizes how to reverse the conditioning: it expresses $\mathbb{P}(A \mid B)$ in terms of $\mathbb{P}(B \mid A)$.

Bayes’ Rule

For events $A$ and $B$ with $\mathbb{P}(B) > 0$,

$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A)\,\mathbb{P}(A)}{\mathbb{P}(B)}.$$

This follows directly from the definition:

$$\mathbb{P}(A \cap B) = \mathbb{P}(A \mid B)\,\mathbb{P}(B) = \mathbb{P}(B \mid A)\,\mathbb{P}(A).$$

Likelihood, Prior, and Posterior

Bayes’ Rule is often interpreted in terms of three components: the prior $\mathbb{P}(A)$ (our belief before seeing the data), the likelihood $\mathbb{P}(B \mid A)$ (how probable the observation $B$ is when $A$ holds), and the posterior $\mathbb{P}(A \mid B)$ (our updated belief after observing $B$).

Bayes’ Rule becomes:

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}.$$

The denominator

$$\mathbb{P}(B) = \mathbb{P}(B\mid A)\,\mathbb{P}(A) + \mathbb{P}(B\mid A^c)\,\mathbb{P}(A^c)$$

acts as a normalization constant.

Example: Medical Testing and Base Rates

Medical tests illustrate Bayes’ Rule clearly—especially how rare events can dramatically affect posterior probabilities.

Suppose the disease is rare, affecting 1% of the population ($\mathbb{P}(D) = 0.01$); the test detects the disease 95% of the time ($\mathbb{P}(+ \mid D) = 0.95$); and the false-positive rate is 5% ($\mathbb{P}(+ \mid D^c) = 0.05$).

Posterior: We want the probability the patient actually has the disease given a positive test:

$$\mathbb{P}(D \mid +) = \frac{\mathbb{P}(+\mid D)\,\mathbb{P}(D)}{\mathbb{P}(+\mid D)\,\mathbb{P}(D) + \mathbb{P}(+\mid D^c)\,\mathbb{P}(D^c)}.$$

Plugging in the numbers:

$$\mathbb{P}(D \mid +) = \frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} = \frac{0.0095}{0.059} \approx 0.161.$$

Interpretation

Even though the test is quite accurate, a positive result gives only a 16% chance of actually having the disease.
This is because the disease is rare, and most positives come from false positives among the 99% of healthy patients.

This phenomenon—counterintuitive but ubiquitous—is the base-rate effect.
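The computation above can be reproduced in a few lines (the variable names are ours):

```python
# Base-rate example: prior 1%, sensitivity 95%, false-positive rate 5%.
prior = 0.01        # P(D)
sens = 0.95         # P(+ | D)
fpr = 0.05          # P(+ | D^c)

evidence = sens * prior + fpr * (1 - prior)   # P(+), by total probability
posterior = sens * prior / evidence           # Bayes' rule: P(D | +)

assert abs(evidence - 0.059) < 1e-12
assert round(posterior, 3) == 0.161           # only ~16% despite an "accurate" test
```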

Key Points

Foundations

Random Variables and Distributions

Moments and Dependence

Conditioning and Bayes’ Rule