Information Theory: Complete Notes
1. Introduction to Information Theory
Information theory is a mathematical framework for quantifying, storing, and communicating information. Developed by Claude Shannon in 1948, it has become fundamental to data compression, communication systems, cryptography, and many other fields.
Core Concept: Information
Information is a measure of the reduction in uncertainty. The more unlikely an event, the more information it provides when it occurs. Information theory provides mathematical tools to quantify this concept.
2. Entropy: The Measure of Information
Definition
Entropy (H) is the average amount of information produced by a stochastic source of data. For a discrete random variable X with possible values {x₁, x₂, ..., xₙ} and probability mass function P(X), the entropy H(X) is given by:
H(X) = -∑ᵢ P(xᵢ) log₂ P(xᵢ)
Where:
- The summation is over all possible values of X
- log₂ is the logarithm base 2, giving entropy in bits
- By convention, 0 × log₂(0) = 0
Example: Entropy Calculation
Consider tossing a fair coin with P(heads) = P(tails) = 0.5:
H(X) = -[0.5 × log₂(0.5) + 0.5 × log₂(0.5)]
= -[0.5 × (-1) + 0.5 × (-1)]
= 1 bit
This means a fair coin toss provides exactly 1 bit of information.
Now, consider a biased coin with P(heads) = 0.9 and P(tails) = 0.1:
H(X) = -[0.9 × log₂(0.9) + 0.1 × log₂(0.1)]
= -[0.9 × (-0.152) + 0.1 × (-3.322)]
= 0.469 bits
This biased coin provides less information because its outcome is more predictable.
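The calculations above can be reproduced with a minimal Python sketch of the entropy formula; the entropy helper name here is just illustrative, not a standard library call:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits, using the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # fair coin   -> 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin -> ~0.469 bits
print(entropy([1/6] * 6))    # fair die    -> ~2.585 bits
```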
3. Information Properties
Key Properties of Information
- Non-negativity: Entropy is always non-negative, H(X) ≥ 0
- Maximum entropy: For a discrete random variable with n possible values, entropy is maximized when all outcomes are equally likely, with H(X) = log₂(n)
- Additivity: For independent random variables X and Y, H(X,Y) = H(X) + H(Y)
- Conditioning reduces entropy: H(X|Y) ≤ H(X); on average, knowledge of another variable never increases uncertainty
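As a quick check of the maximum-entropy property, the following sketch sweeps P(Heads) for a biased coin; the binary_entropy helper name is illustrative:

```python
import math

def binary_entropy(p):
    # H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy peaks at P(Heads) = 0.5 and falls to 0 as the outcome becomes certain.
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"P(Heads) = {p:.2f}  ->  H = {binary_entropy(p):.3f} bits")
```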
4. Mutual Information
Definition
Mutual information I(X;Y) measures how much information one variable X provides about another variable Y. It quantifies the reduction in uncertainty about one variable given knowledge of the other:
I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
Where:
- H(X) and H(Y) are the individual entropies
- H(X|Y) is the conditional entropy of X given Y
- H(X,Y) is the joint entropy of X and Y
Example: Mutual Information
Consider two events:
- X: Whether it rains today (Rain or No Rain)
- Y: Whether someone carries an umbrella (Umbrella or No Umbrella)
Suppose we have the following joint probability distribution:
| P(X,Y)  | Umbrella | No Umbrella |
|---------|----------|-------------|
| Rain    | 0.4      | 0.1         |
| No Rain | 0.1      | 0.4         |
Calculate the marginal probabilities:
P(Rain) = 0.5, P(No Rain) = 0.5
P(Umbrella) = 0.5, P(No Umbrella) = 0.5
H(X) = -[0.5 × log₂(0.5) + 0.5 × log₂(0.5)] = 1 bit
H(Y) = -[0.5 × log₂(0.5) + 0.5 × log₂(0.5)] = 1 bit
H(X,Y) = -[0.4 × log₂(0.4) + 0.1 × log₂(0.1) + 0.1 × log₂(0.1) + 0.4 × log₂(0.4)]
= -[0.4 × (-1.322) + 0.1 × (-3.322) + 0.1 × (-3.322) + 0.4 × (-1.322)]
= 1.722 bits
I(X;Y) = H(X) + H(Y) - H(X,Y) = 1 + 1 - 1.722 = 0.278 bits
This positive mutual information indicates that knowing whether someone carries an umbrella gives us information about whether it will rain.
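The whole calculation can be reproduced with a short Python sketch that computes the marginals and entropies directly from the joint table (helper names are illustrative):

```python
import math

def entropy(probs):
    # Shannon entropy in bits, with the 0 * log2(0) = 0 convention
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution from the rain/umbrella example:
# rows = X (Rain, No Rain), columns = Y (Umbrella, No Umbrella)
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]        # marginal of X
p_y = [sum(col) for col in zip(*joint)]  # marginal of Y

h_x  = entropy(p_x)
h_y  = entropy(p_y)
h_xy = entropy([p for row in joint for p in row])

mi = h_x + h_y - h_xy
print(f"H(X) = {h_x:.3f}, H(Y) = {h_y:.3f}, H(X,Y) = {h_xy:.3f}")
print(f"I(X;Y) = {mi:.3f} bits")         # ~0.278 bits
```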
5. Channel Capacity
Definition
Channel capacity (C) is the maximum rate at which information can be transmitted over a communication channel reliably (i.e., with arbitrarily small error probability):
C = max I(X;Y)
Where the maximum is taken over all possible input distributions p(x).
Example: Binary Symmetric Channel (BSC)
A binary symmetric channel (BSC) has binary input X ∈ {0,1} and binary output Y ∈ {0,1}; each transmitted bit is flipped with error probability p. Its capacity is:
C = 1 - H(p), where H(p) = -p log₂(p) - (1-p) log₂(1-p)
When p = 0, it's a perfect channel; when p = 0.5, it's a completely random channel.
For example, with error probability p = 0.1:
C = 1 - H(0.1) = 1 - 0.469 = 0.531 bits per channel use
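A minimal Python sketch of the BSC capacity formula C = 1 - H(p), with illustrative helper names:

```python
import math

def binary_entropy(p):
    # H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    # Capacity of a binary symmetric channel with crossover probability p
    return 1.0 - binary_entropy(p)

for p in [0.0, 0.1, 0.25, 0.5]:
    print(f"p = {p:.2f}  ->  C = {bsc_capacity(p):.3f} bits/use")
```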
6. Source Coding: Data Compression
Source Coding Theorem
The Source Coding Theorem states that the optimal code length for a symbol with probability p is -log₂(p) bits. For a source with entropy H(X), the average code length L of an optimal code satisfies:
H(X) ≤ L < H(X) + 1
This means we cannot compress data to fewer bits than its entropy, on average.
Example: Huffman Coding
Consider encoding the letters A, B, C, D with probabilities:
- P(A) = 0.4
- P(B) = 0.3
- P(C) = 0.2
- P(D) = 0.1
Huffman coding would assign:
- A: 0 (1 bit)
- B: 10 (2 bits)
- C: 110 (3 bits)
- D: 111 (3 bits)
Average code length:
L = 0.4 × 1 + 0.3 × 2 + 0.2 × 3 + 0.1 × 3 = 1.9 bits/symbol
Source entropy:
H(X) = -[0.4 × log₂(0.4) + 0.3 × log₂(0.3) + 0.2 × log₂(0.2) + 0.1 × log₂(0.1)] ≈ 1.846 bits
The average code length (1.9) is indeed between H(X) and H(X) + 1.
Huffman Coding Step-by-Step
- Sort symbols by probability (descending)
- Take the two least probable symbols and create a node with them as children
- Assign the new node a probability equal to the sum of its children
- Replace the two symbols with this new node in the list
- Repeat until only one node remains (the root)
- Assign '0' to each left branch and '1' to each right branch
- Read the code for each symbol from root to leaf
Huffman coding is optimal among symbol-by-symbol prefix codes; it achieves the entropy exactly when all probabilities are powers of 1/2.
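Below is a minimal Python sketch of the procedure using a heap. The exact bit patterns may differ from the table above depending on how ties are broken, but the code lengths (and the 1.9-bit average) come out the same:

```python
import heapq

def huffman_codes(probs):
    """Build a Huffman code for a dict {symbol: probability}.

    Each heap entry is (probability, tie-breaker, {symbol: partial code}).
    The tie-breaker keeps heapq from ever comparing the code dicts.
    """
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # least probable subtree
        p2, _, codes2 = heapq.heappop(heap)   # second least probable subtree
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
codes = huffman_codes(probs)
avg_len = sum(probs[s] * len(code) for s, code in codes.items())
print(codes)
print(f"average code length = {avg_len:.1f} bits/symbol")   # 1.9
```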
Arithmetic Coding
Arithmetic coding encodes the entire message as a single number in the range [0,1).
- Begin with the interval [0,1)
- For each symbol, narrow the interval proportionally according to its probability
- After encoding all symbols, any number within the final interval represents the message
Arithmetic coding can achieve compression rates very close to the theoretical entropy limit.
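A minimal encoding sketch of the interval-narrowing idea, ignoring the finite-precision renormalization a production coder needs (the helper name and the example probabilities are illustrative):

```python
def arithmetic_encode(message, probs):
    """Narrow [0, 1) once per symbol; return a number in the final interval."""
    # cumulative probability at the start of each symbol's sub-interval
    cum, start = {}, 0.0
    for sym, p in probs.items():
        cum[sym] = start
        start += p

    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        high = low + width * (cum[sym] + probs[sym])
        low = low + width * cum[sym]
    return (low + high) / 2   # any value in [low, high) identifies the message

x = arithmetic_encode("ABAC", {"A": 0.5, "B": 0.25, "C": 0.25})
print(x)
```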
LZW (Lempel-Ziv-Welch) Compression
LZW is a dictionary-based compression method that doesn't require prior knowledge of symbol probabilities.
- Initialize dictionary with all single-character strings
- Find the longest string W in the dictionary that matches the current input
- Output the code for W and remove W from the input
- Add W+next character to the dictionary
- Repeat until input is exhausted
LZW is effective for texts with repeated patterns and is used in formats like GIF and PDF.
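A minimal Python sketch of the compression side of LZW, assuming single-byte input characters (the helper name is illustrative):

```python
def lzw_compress(text):
    """Dictionary-based LZW compression; returns a list of integer codes."""
    # start with all single-character strings (here: byte values 0-255)
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256

    w, output = "", []
    for ch in text:
        if w + ch in dictionary:
            w = w + ch                      # keep extending the current match
        else:
            output.append(dictionary[w])    # emit the longest match found
            dictionary[w + ch] = next_code  # add the new string to the dictionary
            next_code += 1
            w = ch
    if w:
        output.append(dictionary[w])
    return output

print(lzw_compress("ABABABABAB"))   # repeated patterns compress well
```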
7. Channel Coding: Error Correction
Noisy Channel Coding Theorem
Shannon's Noisy Channel Coding Theorem states that for any communication channel with capacity C, information can be transmitted at any rate R < C with arbitrarily small error probability using appropriate coding.
This theorem established the theoretical maximum data rate for a given channel.
Example: Repetition Code
A simple 3-repetition code:
- Encode 0 as 000
- Encode 1 as 111
Decoding: Use majority rule (if more 1s than 0s, decode as 1, otherwise 0)
For a BSC with error probability p = 0.1:
P(decoding error) = 3p²(1-p) + p³ = 3(0.1)²(0.9) + (0.1)³ = 0.027 + 0.001 = 0.028
This reduces the error probability from 0.1 to 0.028, but at the cost of a 3× reduction in data rate.
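A small simulation sketch of the 3-repetition code over a BSC, which should land near the theoretical 0.028 error rate (helper names are illustrative):

```python
import random

def encode(bit):
    return [bit] * 3                       # 3-repetition code

def decode(received):
    return 1 if sum(received) >= 2 else 0  # majority rule

def simulate(p, trials=100_000):
    # Empirically estimate the post-decoding error rate over a BSC(p)
    errors = 0
    for _ in range(trials):
        bit = random.randint(0, 1)
        received = [b ^ (random.random() < p) for b in encode(bit)]
        errors += decode(received) != bit
    return errors / trials

print(f"simulated error rate: {simulate(0.1):.3f}")   # theory: 0.028
```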
Hamming Codes
Hamming codes can correct any single-bit error or detect any two-bit error (but not both at once). A common Hamming code is (7,4), which uses 3 parity bits to protect 4 data bits.
The positions of parity bits are powers of 2: 1, 2, 4, 8, etc.
Each parity bit checks specific bits according to its position in binary.
Hamming distance between valid codewords is at least 3, allowing single-error correction.
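A minimal (7,4) encode-and-correct sketch following the parity positions described above (the helper names and bit ordering are illustrative):

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword.

    Codeword positions 1..7; parity bits sit at positions 1, 2, 4.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4        # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(codeword):
    """Locate and flip a single-bit error using the parity-check syndrome."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]     # check positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]     # check positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]     # check positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4    # binary position of the error (0 = none)
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]    # recovered data bits

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                           # flip one bit in transit
print(hamming74_correct(word))         # -> [1, 0, 1, 1]
```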
Reed-Solomon Codes
Reed-Solomon codes are block codes that operate on multiple bits simultaneously. They're particularly effective against burst errors.
A Reed-Solomon code RS(n,k) with symbols from GF(2ᵐ):
- Takes k data symbols of m bits each
- Adds (n-k) parity symbols
- Can correct up to (n-k)/2 symbol errors
Reed-Solomon codes are used in CDs, DVDs, QR codes, and deep-space communications.
Low-Density Parity-Check (LDPC) Codes
LDPC codes use sparse parity-check matrices, making them efficient to decode using iterative algorithms.
They can achieve performance very close to the Shannon limit.
LDPC codes are used in:
- 5G and Wi-Fi networks
- Digital television standards (DVB-S2)
- Solid-state drives (SSDs)
Decoding uses belief propagation or the sum-product algorithm.
8. Rate Distortion Theory
Definition
Rate distortion theory addresses lossy compression, where some information loss is acceptable. It quantifies the tradeoff between:
- Rate (R): Bits used per symbol
- Distortion (D): Measure of information loss
The rate-distortion function R(D) gives the minimum rate required to achieve a specified distortion.
Example: Binary Source with Hamming Distortion
For a fair binary source (P(0) = P(1) = 0.5) with Hamming distortion (d(x,y) = 0 if x = y, 1 otherwise), the rate-distortion function is:
R(D) = 1 - H(D) for 0 ≤ D ≤ 0.5, and R(D) = 0 for D > 0.5
This means if we're willing to accept errors in a fraction D of the bits, we need at least 1 - H(D) bits per source bit.
For example, with D = 0.1 (10% error):
R(0.1) = 1 - H(0.1) = 1 - 0.469 = 0.531 bits per source bit
So we can compress a binary source to about 53.1% of its original size if we accept 10% error.
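A minimal Python sketch of R(D) for the binary source (helper names are illustrative):

```python
import math

def binary_entropy(d):
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log2(d) - (1 - d) * math.log2(1 - d)

def rate_distortion(d):
    # R(D) = 1 - H(D) for 0 <= D <= 0.5, and 0 for larger distortion
    return 1.0 - binary_entropy(d) if d <= 0.5 else 0.0

for d in [0.0, 0.05, 0.1, 0.25, 0.5]:
    print(f"D = {d:.2f}  ->  R(D) = {rate_distortion(d):.3f} bits per source bit")
```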
9. Applications of Information Theory
Communications Systems
- Cellular networks: Optimize transmission power and bandwidth allocation
- Wi-Fi systems: Error correction codes ensure reliable data delivery
- Satellite communications: Handle severe channel constraints
- Fiber optic networks: Achieve data rates approaching channel capacity
Modern communication systems use sophisticated coding schemes (LDPC, Turbo codes) to operate within 0.0045 dB of the Shannon limit.
Data Compression
- Lossless compression: ZIP, PNG, FLAC
- Lossy compression: JPEG, MP3, MP4
- Text compression: Most text compression algorithms achieve 2:1 to 8:1 compression ratios depending on the content
- Image compression: JPEG typically achieves 10:1 to 20:1 compression with acceptable quality loss
Video streaming platforms use adaptive bitrate techniques based on information-theoretic principles to deliver the best possible quality for available bandwidth.
Cryptography
- Perfect secrecy: Shannon proved that perfect secrecy requires a key at least as long as the message (One-time pad)
- Randomness extraction: Converting biased randomness into uniform randomness
- Secure key exchange: Information-theoretic approaches to key distribution
- Quantum cryptography: Secure communication based on quantum information theory
Information-theoretic security provides protection even against attackers with unlimited computational power.
Machine Learning
- Feature selection: Mutual information identifies the most relevant features
- Regularization: Information-theoretic approaches to prevent overfitting
- Clustering: Information bottleneck method for finding compact representations
- Neural networks: Information-theoretic explanations for deep learning phenomena
The Minimum Description Length (MDL) principle, derived from information theory, provides a formal framework for model selection in machine learning.
10. Common Problem-Solving Approaches
Problem-Solving Strategy
- Identify the information source: Determine the probability distribution
- Calculate relevant information measures: Entropy, mutual information, etc.
- Apply appropriate theorems: Source coding, channel coding, etc.
- Design coding schemes: Based on theoretical limits
- Optimize for constraints: Consider computational complexity, latency, etc.
Solving Entropy Problems
- Identify all possible outcomes of the random variable
- Determine the probability of each outcome
- Apply the entropy formula: H(X) = -∑ P(xᵢ) log₂ P(xᵢ)
- Remember special cases:
- Uniform distribution maximizes entropy: H(X) = log₂(n)
- Deterministic variable has zero entropy: H(X) = 0
- Binary entropy function: H(p) = -p log₂(p) - (1-p) log₂(1-p)
Tip: For joint entropy, use the joint probability distribution. For conditional entropy, apply the chain rule.
Solving Coding Problems
- Calculate the entropy of the source to determine the theoretical limit
- For lossless compression:
- Huffman coding: Optimal for symbol-by-symbol coding
- Arithmetic coding: Approaches entropy limit for large sequences
- Dictionary-based: Effective for repeating patterns
- For error correction:
- Determine the error model (random vs. burst errors)
- Calculate required redundancy based on error model
- Choose appropriate code (Hamming, Reed-Solomon, etc.)
- Analyze the resulting code rate and compare to theoretical limits
Tip: Always check whether your code rate exceeds the channel capacity; if it does, no reliable communication is possible.
Solving Capacity Problems
- Identify the channel model and its parameters
- For discrete memoryless channels:
- Set up the mutual information I(X;Y) formula
- Find the input distribution p(x) that maximizes I(X;Y)
- For Gaussian channels with noise power N and signal power S:
- C = ½ log₂(1 + S/N) bits/transmission
- For bandlimited channels with bandwidth B:
- C = B log₂(1 + S/N) bits/second
Tip: For complex channels, numerical optimization methods may be needed to find capacity.
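A minimal sketch of the two Gaussian-channel formulas above, with a hypothetical 3 kHz channel at 30 dB SNR (S/N = 1000) as the example:

```python
import math

def gaussian_capacity_per_use(snr):
    # C = 1/2 * log2(1 + S/N) bits per transmission
    return 0.5 * math.log2(1 + snr)

def bandlimited_capacity(bandwidth_hz, snr):
    # Shannon-Hartley: C = B * log2(1 + S/N) bits per second
    return bandwidth_hz * math.log2(1 + snr)

print(f"{gaussian_capacity_per_use(1000):.2f} bits/transmission")
print(f"{bandlimited_capacity(3000, 1000):.0f} bits/second")
```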
Information Theory Quiz
1. What is the entropy of a fair six-sided die?
2. For a binary symmetric channel with error probability p = 0.25, what is the channel capacity?
3. If we have a source that produces symbols A, B, C, and D with probabilities 0.5, 0.25, 0.125, and 0.125 respectively, what is the optimal codeword length for symbol A using Huffman coding?
4. What is the mutual information between two independent random variables?
5. What does Shannon's Source Coding Theorem establish a limit on?
6. Which coding technique can achieve compression ratios closest to the entropy limit for large sequences?
7. For a Gaussian channel with bandwidth B and signal-to-noise ratio S/N, what is the capacity in bits per second?
8. What is the entropy of a binary source with P(0) = 0.8 and P(1) = 0.2?
9. Which class of error-correcting codes is best suited for correcting burst errors?
10. What happens to the entropy of a random variable if we know another related random variable?