Information Theory: Complete Notes
1. Introduction to Information Theory
Information theory is a mathematical framework for quantifying, storing, and communicating information. Developed by Claude Shannon in 1948, it has become fundamental to data compression, communication systems, cryptography, and many other fields.
Core Concept: Information
Information is a measure of the reduction in uncertainty. The more unlikely an event, the more information it provides when it occurs. Information theory provides mathematical tools to quantify this concept.
2. Entropy: The Measure of Information
Definition
Entropy (H) is the average amount of information produced by a stochastic source of data. For a discrete random variable X with possible values {x₁, x₂, ..., xₙ} and probability mass function P(X), the entropy H(X) is given by:
H(X) = -∑ᵢ P(xᵢ) log₂ P(xᵢ)
Where:
- The summation is over all possible values of X
- log₂ is the logarithm base 2, giving entropy in bits
- By convention, 0 × log₂(0) = 0
Example: Entropy Calculation
Consider tossing a fair coin with P(heads) = P(tails) = 0.5:
H(X) = -[0.5 × log₂(0.5) + 0.5 × log₂(0.5)]
= -[0.5 × (-1) + 0.5 × (-1)]
= 1 bit
This means a fair coin toss provides exactly 1 bit of information.
Now, consider a biased coin with P(heads) = 0.9 and P(tails) = 0.1:
H(X) = -[0.9 × log₂(0.9) + 0.1 × log₂(0.1)]
= -[0.9 × (-0.152) + 0.1 × (-3.322)]
= 0.469 bits
This biased coin provides less information because its outcome is more predictable.
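The calculations above can be reproduced with a minimal Python sketch of the entropy formula; the entropy helper name here is just illustrative, not a standard library call:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits, using the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # fair coin   -> 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin -> ~0.469 bits
print(entropy([1/6] * 6))    # fair die    -> ~2.585 bits
```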
3. Information Properties
Key Properties of Information
- Non-negativity: Entropy is always non-negative, H(X) ≥ 0
- Maximum entropy: For a discrete random variable with n possible values, entropy is maximized when all outcomes are equally likely, with H(X) = log₂(n)
- Additivity: For independent random variables X and Y, H(X,Y) = H(X) + H(Y)
- Conditioning reduces entropy: H(X|Y) ≤ H(X); on average, knowledge of another variable never increases uncertainty
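As a quick check of the maximum-entropy property, the following sketch sweeps P(Heads) for a biased coin; the binary_entropy helper name is illustrative:

```python
import math

def binary_entropy(p):
    # H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy peaks at P(Heads) = 0.5 and falls to 0 as the outcome becomes certain.
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"P(Heads) = {p:.2f}  ->  H = {binary_entropy(p):.3f} bits")
```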
4. Mutual Information
Definition
Mutual information I(X;Y) measures how much information one variable X provides about another variable Y. It quantifies the reduction in uncertainty about one variable given knowledge of the other:
I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
Where:
- H(X) and H(Y) are the individual entropies
- H(X|Y) is the conditional entropy of X given Y
- H(X,Y) is the joint entropy of X and Y
Example: Mutual Information
Consider two events:
- X: Whether it rains today (Rain or No Rain)
- Y: Whether someone carries an umbrella (Umbrella or No Umbrella)
Suppose we have the following joint probability distribution:
| P(X,Y)  | Umbrella | No Umbrella |
|---------|----------|-------------|
| Rain    | 0.4      | 0.1         |
| No Rain | 0.1      | 0.4         |
Calculate the marginal probabilities:
P(Rain) = 0.5, P(No Rain) = 0.5
P(Umbrella) = 0.5, P(No Umbrella) = 0.5
H(X) = -[0.5 × log₂(0.5) + 0.5 × log₂(0.5)] = 1 bit
H(Y) = -[0.5 × log₂(0.5) + 0.5 × log₂(0.5)] = 1 bit
H(X,Y) = -[0.4 × log₂(0.4) + 0.1 × log₂(0.1) + 0.1 × log₂(0.1) + 0.4 × log₂(0.4)]
= -[0.4 × (-1.322) + 0.1 × (-3.322) + 0.1 × (-3.322) + 0.4 × (-1.322)]
= 1.722 bits
I(X;Y) = H(X) + H(Y) - H(X,Y) = 1 + 1 - 1.722 = 0.278 bits
This positive mutual information indicates that knowing whether someone carries an umbrella gives us information about whether it will rain.
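The whole calculation can be reproduced with a short Python sketch that computes the marginals and entropies directly from the joint table (helper names are illustrative):

```python
import math

def entropy(probs):
    # Shannon entropy in bits, with the 0 * log2(0) = 0 convention
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution from the rain/umbrella example:
# rows = X (Rain, No Rain), columns = Y (Umbrella, No Umbrella)
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]        # marginal of X
p_y = [sum(col) for col in zip(*joint)]  # marginal of Y

h_x  = entropy(p_x)
h_y  = entropy(p_y)
h_xy = entropy([p for row in joint for p in row])

mi = h_x + h_y - h_xy
print(f"H(X) = {h_x:.3f}, H(Y) = {h_y:.3f}, H(X,Y) = {h_xy:.3f}")
print(f"I(X;Y) = {mi:.3f} bits")         # ~0.278 bits
```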
5. Channel Capacity
Definition
Channel capacity (C) is the maximum rate at which information can be transmitted over a communication channel reliably (i.e., with arbitrarily small error probability):
C = max I(X;Y)
Where the maximum is taken over all possible input distributions p(x).
Example: Binary Symmetric Channel (BSC)
A binary symmetric channel (BSC) has binary input X ∈ {0,1} and binary output Y ∈ {0,1}; each transmitted bit is flipped with error probability p. Its capacity is:
C = 1 - H(p), where H(p) = -p log₂(p) - (1-p) log₂(1-p)
When p = 0, it's a perfect channel; when p = 0.5, it's a completely random channel.
For example, with error probability p = 0.1:
C = 1 - H(0.1) = 1 - 0.469 = 0.531 bits per channel use
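A minimal Python sketch of the BSC capacity formula C = 1 - H(p), with illustrative helper names:

```python
import math

def binary_entropy(p):
    # H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    # Capacity of a binary symmetric channel with crossover probability p
    return 1.0 - binary_entropy(p)

for p in [0.0, 0.1, 0.25, 0.5]:
    print(f"p = {p:.2f}  ->  C = {bsc_capacity(p):.3f} bits/use")
```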
6. Source Coding: Data Compression
Source Coding Theorem
The Source Coding Theorem states that the optimal code length for a symbol with probability p is -log₂(p) bits. For a source with entropy H(X), the average code length L of an optimal code satisfies:
H(X) ≤ L < H(X) + 1
This means we cannot compress data to fewer bits than its entropy, on average.
Example: Huffman Coding
Consider encoding the letters A, B, C, D with probabilities:
- P(A) = 0.4
- P(B) = 0.3
- P(C) = 0.2
- P(D) = 0.1
Huffman coding would assign:
- A: 0 (1 bit)
- B: 10 (2 bits)
- C: 110 (3 bits)
- D: 111 (3 bits)
Average code length:
L = 0.4 × 1 + 0.3 × 2 + 0.2 × 3 + 0.1 × 3 = 1.9 bits/symbol
Source entropy:
H(X) = -[0.4 × log₂(0.4) + 0.3 × log₂(0.3) + 0.2 × log₂(0.2) + 0.1 × log₂(0.1)] ≈ 1.846 bits
The average code length (1.9) is indeed between H(X) and H(X) + 1.
Huffman Coding Step-by-Step
- Sort symbols by probability (descending)
- Take the two least probable symbols and create a node with them as children
- Assign the new node a probability equal to the sum of its children
- Replace the two symbols with this new node in the list
- Repeat until only one node remains (the root)
- Assign '0' to each left branch and '1' to each right branch
- Read the code for each symbol from root to leaf
Huffman coding is optimal among symbol-by-symbol prefix codes; it achieves the entropy exactly when all probabilities are powers of 1/2.
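Below is a minimal Python sketch of the procedure using a heap. The exact bit patterns may differ from the table above depending on how ties are broken, but the code lengths (and the 1.9-bit average) come out the same:

```python
import heapq

def huffman_codes(probs):
    """Build a Huffman code for a dict {symbol: probability}.

    Each heap entry is (probability, tie-breaker, {symbol: partial code}).
    The tie-breaker keeps heapq from ever comparing the code dicts.
    """
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # least probable subtree
        p2, _, codes2 = heapq.heappop(heap)   # second least probable subtree
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
codes = huffman_codes(probs)
avg_len = sum(probs[s] * len(code) for s, code in codes.items())
print(codes)
print(f"average code length = {avg_len:.1f} bits/symbol")   # 1.9
```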
Arithmetic Coding
Arithmetic coding encodes the entire message as a single number in the range [0,1).
- Begin with the interval [0,1)
- For each symbol, narrow the interval proportionally according to its probability
- After encoding all symbols, any number within the final interval represents the message
Arithmetic coding can achieve compression rates very close to the theoretical entropy limit.
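A minimal encoding sketch of the interval-narrowing idea, ignoring the finite-precision renormalization a production coder needs (the helper name and the example probabilities are illustrative):

```python
def arithmetic_encode(message, probs):
    """Narrow [0, 1) once per symbol; return a number in the final interval."""
    # cumulative probability at the start of each symbol's sub-interval
    cum, start = {}, 0.0
    for sym, p in probs.items():
        cum[sym] = start
        start += p

    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        high = low + width * (cum[sym] + probs[sym])
        low = low + width * cum[sym]
    return (low + high) / 2   # any value in [low, high) identifies the message

x = arithmetic_encode("ABAC", {"A": 0.5, "B": 0.25, "C": 0.25})
print(x)
```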
LZW (Lempel-Ziv-Welch) Compression
LZW is a dictionary-based compression method that doesn't require prior knowledge of symbol probabilities.
- Initialize dictionary with all single-character strings
- Find the longest string W in the dictionary that matches the current input
- Output the code for W and remove W from the input
- Add W+next character to the dictionary
- Repeat until input is exhausted
LZW is effective for texts with repeated patterns and is used in formats like GIF and PDF.
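A minimal Python sketch of the compression side of LZW, assuming single-byte input characters (the helper name is illustrative):

```python
def lzw_compress(text):
    """Dictionary-based LZW compression; returns a list of integer codes."""
    # start with all single-character strings (here: byte values 0-255)
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256

    w, output = "", []
    for ch in text:
        if w + ch in dictionary:
            w = w + ch                      # keep extending the current match
        else:
            output.append(dictionary[w])    # emit the longest match found
            dictionary[w + ch] = next_code  # add the new string to the dictionary
            next_code += 1
            w = ch
    if w:
        output.append(dictionary[w])
    return output

print(lzw_compress("ABABABABAB"))   # repeated patterns compress well
```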
7. Channel Coding: Error Correction
Noisy Channel Coding Theorem
Shannon's Noisy Channel Coding Theorem states that for any communication channel with capacity C, information can be transmitted at any rate R < C with arbitrarily small error probability using appropriate coding.
This theorem established the theoretical maximum data rate for a given channel.
Example: Repetition Code
A simple 3-repetition code:
- Encode 0 as 000
- Encode 1 as 111
Decoding: Use majority rule (if more 1s than 0s, decode as 1, otherwise 0)
For a BSC with error probability p = 0.1:
P(decoding error) = 3p²(1-p) + p³ = 3(0.1)²(0.9) + (0.1)³ = 0.027 + 0.001 = 0.028
This reduces the error probability from 0.1 to 0.028, but at the cost of a 3× reduction in data rate.
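A small simulation sketch of the 3-repetition code over a BSC, which should land near the theoretical 0.028 error rate (helper names are illustrative):

```python
import random

def encode(bit):
    return [bit] * 3                       # 3-repetition code

def decode(received):
    return 1 if sum(received) >= 2 else 0  # majority rule

def simulate(p, trials=100_000):
    # Empirically estimate the post-decoding error rate over a BSC(p)
    errors = 0
    for _ in range(trials):
        bit = random.randint(0, 1)
        received = [b ^ (random.random() < p) for b in encode(bit)]
        errors += decode(received) != bit
    return errors / trials

print(f"simulated error rate: {simulate(0.1):.3f}")   # theory: 0.028
```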
Hamming Codes
Hamming codes can correct any single-bit error or detect any two-bit error (but not both at once). A common Hamming code is (7,4), which uses 3 parity bits to protect 4 data bits.
The positions of parity bits are powers of 2: 1, 2, 4, 8, etc.
Each parity bit checks specific bits according to its position in binary.
Hamming distance between valid codewords is at least 3, allowing single-error correction.
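A minimal (7,4) encode-and-correct sketch following the parity positions described above (the helper names and bit ordering are illustrative):

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword.

    Codeword positions 1..7; parity bits sit at positions 1, 2, 4.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4        # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(codeword):
    """Locate and flip a single-bit error using the parity-check syndrome."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]     # check positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]     # check positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]     # check positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4    # binary position of the error (0 = none)
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]    # recovered data bits

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                           # flip one bit in transit
print(hamming74_correct(word))         # -> [1, 0, 1, 1]
```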
Reed-Solomon Codes
Reed-Solomon codes are block codes that operate on multiple bits simultaneously. They're particularly effective against burst errors.
A Reed-Solomon code RS(n,k) with symbols from GF(2ᵐ):
- Takes k data symbols of m bits each
- Adds (n-k) parity symbols
- Can correct up to (n-k)/2 symbol errors
Reed-Solomon codes are used in CDs, DVDs, QR codes, and deep-space communications.
Low-Density Parity-Check (LDPC) Codes
LDPC codes use sparse parity-check matrices, making them efficient to decode using iterative algorithms.
They can achieve performance very close to the Shannon limit.
LDPC codes are used in:
- 5G and Wi-Fi networks
- Digital television standards (DVB-S2)
- Solid-state drives (SSDs)
Decoding uses belief propagation or the sum-product algorithm.
8. Rate Distortion Theory
Definition
Rate distortion theory addresses lossy compression, where some information loss is acceptable. It quantifies the tradeoff between:
- Rate (R): Bits used per symbol
- Distortion (D): Measure of information loss
The rate-distortion function R(D) gives the minimum rate required to achieve a specified distortion.
Example: Binary Source with Hamming Distortion
For a fair binary source (P(0) = P(1) = 0.5) with Hamming distortion (d(x,y) = 0 if x = y, 1 otherwise), the rate-distortion function is:
R(D) = 1 - H(D) for 0 ≤ D ≤ 0.5, and R(D) = 0 for D > 0.5
This means if we're willing to accept errors in a fraction D of the bits, we need at least 1 - H(D) bits per source bit.
For example, with D = 0.1 (10% error):
R(0.1) = 1 - H(0.1) = 1 - 0.469 = 0.531 bits per source bit
So we can compress a binary source to about 53.1% of its original size if we accept 10% error.
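A minimal Python sketch of R(D) for the binary source (helper names are illustrative):

```python
import math

def binary_entropy(d):
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log2(d) - (1 - d) * math.log2(1 - d)

def rate_distortion(d):
    # R(D) = 1 - H(D) for 0 <= D <= 0.5, and 0 for larger distortion
    return 1.0 - binary_entropy(d) if d <= 0.5 else 0.0

for d in [0.0, 0.05, 0.1, 0.25, 0.5]:
    print(f"D = {d:.2f}  ->  R(D) = {rate_distortion(d):.3f} bits per source bit")
```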
9. Applications of Information Theory
Communications Systems
- Cellular networks: Optimize transmission power and bandwidth allocation
- Wi-Fi systems: Error correction codes ensure reliable data delivery
- Satellite communications: Handle severe channel constraints
- Fiber optic networks: Achieve data rates approaching channel capacity
Modern communication systems use sophisticated coding schemes (LDPC, Turbo codes) to operate within 0.0045 dB of the Shannon limit.
Data Compression
- Lossless compression: ZIP, PNG, FLAC
- Lossy compression: JPEG, MP3, MP4
- Text compression: Most text compression algorithms achieve 2:1 to 8:1 compression ratios depending on the content
- Image compression: JPEG typically achieves 10:1 to 20:1 compression with acceptable quality loss
Video streaming platforms use adaptive bitrate techniques based on information-theoretic principles to deliver the best possible quality for available bandwidth.
Cryptography
- Perfect secrecy: Shannon proved that perfect secrecy requires a key at least as long as the message (One-time pad)
- Randomness extraction: Converting biased randomness into uniform randomness
- Secure key exchange: Information-theoretic approaches to key distribution
- Quantum cryptography: Secure communication based on quantum information theory
Information-theoretic security provides protection even against attackers with unlimited computational power.
Machine Learning
- Feature selection: Mutual information identifies the most relevant features
- Regularization: Information-theoretic approaches to prevent overfitting
- Clustering: Information bottleneck method for finding compact representations
- Neural networks: Information-theoretic explanations for deep learning phenomena
The Minimum Description Length (MDL) principle, derived from information theory, provides a formal framework for model selection in machine learning.
10. Common Problem-Solving Approaches
Problem-Solving Strategy
- Identify the information source: Determine the probability distribution
- Calculate relevant information measures: Entropy, mutual information, etc.
- Apply appropriate theorems: Source coding, channel coding, etc.
- Design coding schemes: Based on theoretical limits
- Optimize for constraints: Consider computational complexity, latency, etc.
Solving Entropy Problems
- Identify all possible outcomes of the random variable
- Determine the probability of each outcome
- Apply the entropy formula: H(X) = -∑ P(xᵢ) log₂ P(xᵢ)
- Remember special cases:
- Uniform distribution maximizes entropy: H(X) = log₂(n)
- Deterministic variable has zero entropy: H(X) = 0
- Binary entropy function: H(p) = -p log₂(p) - (1-p) log₂(1-p)
Tip: For joint entropy, use the joint probability distribution. For conditional entropy, apply the chain rule.
Solving Coding Problems
- Calculate the entropy of the source to determine the theoretical limit
- For lossless compression:
- Huffman coding: Optimal for symbol-by-symbol coding
- Arithmetic coding: Approaches entropy limit for large sequences
- Dictionary-based: Effective for repeating patterns
- For error correction:
- Determine the error model (random vs. burst errors)
- Calculate required redundancy based on error model
- Choose appropriate code (Hamming, Reed-Solomon, etc.)
- Analyze the resulting code rate and compare to theoretical limits
Tip: Always check whether your code rate exceeds the channel capacity; if it does, no reliable communication is possible.
Solving Capacity Problems
- Identify the channel model and its parameters
- For discrete memoryless channels:
- Set up the mutual information I(X;Y) formula
- Find the input distribution p(x) that maximizes I(X;Y)
- For Gaussian channels with noise power N and signal power S:
- C = ½ log₂(1 + S/N) bits/transmission
- For bandlimited channels with bandwidth B:
- C = B log₂(1 + S/N) bits/second
Tip: For complex channels, numerical optimization methods may be needed to find capacity.
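A minimal sketch of the two Gaussian-channel formulas above, with a hypothetical 3 kHz channel at 30 dB SNR (S/N = 1000) as the example:

```python
import math

def gaussian_capacity_per_use(snr):
    # C = 1/2 * log2(1 + S/N) bits per transmission
    return 0.5 * math.log2(1 + snr)

def bandlimited_capacity(bandwidth_hz, snr):
    # Shannon-Hartley: C = B * log2(1 + S/N) bits per second
    return bandwidth_hz * math.log2(1 + snr)

print(f"{gaussian_capacity_per_use(1000):.2f} bits/transmission")
print(f"{bandlimited_capacity(3000, 1000):.0f} bits/second")
```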
Information Theory Quiz
1. What is the entropy of a fair six-sided die?
2. For a binary symmetric channel with error probability p = 0.25, what is the channel capacity?
3. If we have a source that produces symbols A, B, C, and D with probabilities 0.5, 0.25, 0.125, and 0.125 respectively, what is the optimal codeword length for symbol A using Huffman coding?
4. What is the mutual information between two independent random variables?
5. What does Shannon's Source Coding Theorem establish a limit on?
6. Which coding technique can achieve compression ratios closest to the entropy limit for large sequences?
7. For a Gaussian channel with bandwidth B and signal-to-noise ratio S/N, what is the capacity in bits per second?
8. What is the entropy of a binary source with P(0) = 0.8 and P(1) = 0.2?
9. Which class of error-correcting codes is best suited for correcting burst errors?
10. What happens to the entropy of a random variable if we know another related random variable?