Video Encoding – Part 2 (Video Codecs)

In the previous section, Video Encoding Part 1, we looked at how individual images are represented and compressed using formats like BMP, PNG, and JPEG. Video, however, is not just a collection of independent images: it applies compression both within each frame (spatial compression) and across frames (temporal compression), exploiting the similarities between consecutive frames.

1. Sequence of images and video encoding

A video is a sequence of such images (called frames) that are stored or transmitted one after another.
As discussed in the previous article, an image is captured (for example from a camera) as a raw frame, which can be very large in size. To make it practical for storage and transmission, it is compressed into formats like JPEG.
As a first step, each frame is compressed using image compression techniques (similar to JPEG).
However, video encoding goes a step further. Instead of treating every frame independently, it compares each frame with its adjacent frames and encodes only the differences between them.
This process is called video encoding, and it significantly reduces the amount of data, often by more than 10×, compared to sending an individually compressed image for every frame.
Video encoding is usually computationally intensive. Depending on the parameters chosen, it can consume significant processor and system resources.
Video codecs typically use a combination of:
- Intra-frame compression (within a frame, similar to JPEG)
- Inter-frame compression (between frames, using motion and differences)

2. Video encoding algorithms

Various algorithms (codec standards) implement this combination of intra-frame and inter-frame compression, for example H.264 and H.265.
Each algorithm offers different trade-offs:
- Some prioritize better visual quality
- Some prioritize faster encoding/decoding speed
- Some aim for higher compression efficiency (smaller file size at similar quality)
In addition to the algorithm itself, several encoding parameters also impact the final output:
- Bitrate: controls how much data is used per second (higher = better quality, larger size)
- Frame rate (FPS): affects smoothness and bandwidth
- Resolution: higher resolution increases data size
- Keyframe interval (I-frame frequency): impacts seeking, latency, and compression efficiency
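
The relationship between resolution, frame rate, bitrate, and size can be checked with simple arithmetic. The sketch below assumes uncompressed frames at 24 bits per pixel and an encoded bitrate of 5 Mbit/s; both numbers are illustrative assumptions, not recommendations, and the function names are hypothetical.

```python
# Back-of-the-envelope arithmetic for the parameters listed above.
# The 24 bits/pixel and 5 Mbit/s figures are assumed example values.

def raw_bitrate_bps(width, height, fps, bits_per_pixel=24):
    """Bits per second needed to send uncompressed frames."""
    return width * height * bits_per_pixel * fps

def stream_size_bytes(bitrate_bps, duration_s):
    """Approximate size of a stream encoded at a constant bitrate."""
    return bitrate_bps * duration_s // 8

raw = raw_bitrate_bps(1920, 1080, 30)
print(raw)                                   # 1492992000 (about 1.49 Gbit/s)

encoded_bps = 5_000_000                      # assumed target bitrate
print(stream_size_bytes(encoded_bps, 60))    # 37500000 bytes for one minute
```

Doubling the frame rate or the bitrate doubles the corresponding result, which is why these parameters have such a direct effect on bandwidth and storage.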

3. Frame types (I, P, B)

Video codecs do not treat all frames equally. Instead, they use different frame types to reduce redundancy and improve compression efficiency.
The three main frame types are:

I-frames (Intra-coded frames, also called keyframes): a fully self-contained video frame that is encoded independently, without referencing any other frames, and acts as a complete image in the video stream.

- Fully self-contained frames
- Similar to a standalone image (like JPEG)
- Do not depend on any other frame
- Used as reference points for decoding and seeking
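
To see why I-frames matter for seeking, consider a simplified model in which an I-frame appears at a fixed interval and frame 0 is an I-frame. To show an arbitrary frame, the decoder has to start at the nearest preceding I-frame and decode forward. The function names are hypothetical, and the sketch ignores frame reordering.

```python
# Simplified seeking model: fixed keyframe interval, frame 0 is an I-frame.
# Function names are made up for illustration.

def nearest_keyframe(target, keyframe_interval):
    """Index of the last I-frame at or before the target frame."""
    return (target // keyframe_interval) * keyframe_interval

def frames_to_decode(target, keyframe_interval):
    """Frames that must be decoded before the target can be shown."""
    return target - nearest_keyframe(target, keyframe_interval) + 1

print(nearest_keyframe(100, 30))   # 90
print(frames_to_decode(100, 30))   # 11
print(frames_to_decode(90, 30))    # 1
```

A shorter keyframe interval makes seeking cheaper, at the cost of more large I-frames in the stream.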

P-frames (Predicted frames): a video frame that stores only the differences from a previous frame. Because it is a difference frame, it cannot be reconstructed on its own and depends on an earlier I-frame or P-frame.

- Store only the difference from previous frames
- Depend on past I-frames or P-frames
- Require less data than I-frames
- Common in normal video playback

B-frames (Bidirectional frames): a video frame that is reconstructed using information from both previous and next reference frames (I-frames or P-frames), allowing for higher compression efficiency.

- Use information from both previous and next frames
- Achieve the highest compression efficiency
- More complex to encode/decode
- May increase latency slightly, because a B-frame cannot be decoded until the later frame it references has arrived

By combining these frame types, video codecs significantly reduce file size while maintaining visual quality.
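
Because a B-frame needs a later reference frame, frames are typically stored and decoded in a different order than they are displayed. The sketch below illustrates this under a simplified assumption: every B-frame references the nearest I-frame or P-frame on each side. Newer codecs allow more flexible referencing, so treat this as an illustration only; decode_order is a hypothetical helper.

```python
# Simplified reordering: each B-frame waits for the next I- or P-frame.
# decode_order is a made-up helper, not part of any codec API.

def decode_order(display_types):
    """Map a display-order string like 'IBBPBBP' to decode order.

    Returns a list of (display_index, frame_type) tuples.
    """
    order = []
    pending_b = []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append((index, frame_type))
        else:
            # Emit the reference frame first, then the B-frames that need it.
            order.append((index, frame_type))
            order.extend(pending_b)
            pending_b = []
    # Trailing B-frames with no later reference (kept as-is in this sketch).
    order.extend(pending_b)
    return order

labels = [f"{t}{i}" for i, t in decode_order("IBBPBBP")]
print(labels)   # ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

Frame P3 is decoded before B1 and B2 even though it is displayed after them, which is where the extra latency of B-frames comes from.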

4. Codecs vs Containers

Codec (a compression and decompression technique)

A video codec defines how video frames are encoded and decoded.
Examples: H.264, H.265, etc.
A codec compresses frames and generates I, P, and B frames.

Container

A container is a file format that stores the coded video frames together with audio streams, subtitles, and metadata.
Examples: MP4, MKV, and AVI (an older container).
In other words, a codec is the algorithm that compresses video frames into I, P, and B frames; a container defines how those frames and other information, such as audio, are stored in a file.
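
One way to picture the distinction is as a data structure: the container is the outer wrapper, and each stream inside it names the codec that produced it. The classes below are a hypothetical model written for this article, not the layout of any real file format.

```python
# Hypothetical model of a container holding several coded streams.
from dataclasses import dataclass, field

@dataclass
class Stream:
    kind: str     # "video", "audio" or "subtitle"
    codec: str    # codec that produced the stream, e.g. "H.264" or "AAC"

@dataclass
class Container:
    format_name: str                              # e.g. "MP4" or "MKV"
    streams: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

movie = Container(
    "MP4",
    [Stream("video", "H.264"), Stream("audio", "AAC")],
    {"title": "demo"},
)

video_codecs = [s.codec for s in movie.streams if s.kind == "video"]
print(video_codecs)   # ['H.264']
```

The same H.264 stream could be placed in a different container without re-encoding it, since only the wrapper changes.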

5. Video resolutions (CIF, QCIF, HD, FHD, etc.)

Video resolution defines the spatial size of each frame, i.e., how many pixels make up a single image in a video stream.
Earlier formats like QCIF and CIF are standardized low-resolution formats designed for video conferencing and surveillance, when bandwidth and processing power were very limited.
Modern formats like HD and Full HD (FHD) are used for high-quality streaming and display systems.

Video resolution comparison (QCIF → 8K)

Format	Resolution (W × H)	Total Pixels	Aspect Ratio	Era / Use Case
QCIF	176 × 144	~25K	11:9	Very low-bandwidth video calls, early mobile video
CIF	352 × 288	~101K	11:9	Early video conferencing, CCTV systems
4CIF	704 × 576	~405K	11:9	Improved surveillance, broadcast systems
SD (480p)	720 × 480	~345K	4:3 / 16:9	DVD, legacy TV systems
HD (720p)	1280 × 720	~921K	16:9	Basic HD streaming
FHD (1080p)	1920 × 1080	~2.07M	16:9	Standard streaming, video conferencing
QHD (1440p, often marketed as 2K)	2560 × 1440	~3.69M	16:9	High-end monitors, gaming, streaming
4K (UHD)	3840 × 2160	~8.29M	16:9	Ultra HD streaming, cinema-grade content
8K (UHD)	7680 × 4320	~33.18M	16:9	Premium displays, professional production
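
The "Total Pixels" and "Aspect Ratio" columns can be derived from the width and height alone. The short sketch below does this with the greatest common divisor; describe is a hypothetical helper name.

```python
# Derive pixel count and aspect ratio from a resolution.
from math import gcd

def describe(width, height):
    """Return (total_pixels, 'W:H' ratio reduced to lowest terms)."""
    divisor = gcd(width, height)
    return width * height, f"{width // divisor}:{height // divisor}"

print(describe(176, 144))     # (25344, '11:9')
print(describe(1920, 1080))   # (2073600, '16:9')
print(describe(720, 480))     # (345600, '3:2')
```

Note that 720 × 480 reduces to 3:2, not 4:3 or 16:9. SD video uses non-square pixels, so the stored pixel grid and the displayed picture have different aspect ratios.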

6. Audio codecs

Audio data can also be compressed, similar to video, to reduce storage size and bandwidth usage during transmission.
Unlike a video frame, which is a 2D signal (X and Y axes), audio is a 1D time-domain signal, so its compression algorithms are different and operate over time rather than spatial dimensions.
Common audio codecs include formats like AAC and MP3, which exploit redundancy in sound signals and human hearing perception.
Audio compression is outside the scope of this article and will be covered in a separate discussion.

7. Audio and video synchronization

Audio and video are kept in sync using presentation timestamps (PTS) attached to both streams.
Each video frame and audio chunk has a time reference that tells the player when to play it.
During playback, the player aligns audio and video on a common timeline.
If one stream arrives earlier (common in streaming), the player uses a buffer to wait and match timing.
The goal is simple: lips, sound, and motion should match naturally during playback.
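
As a rough illustration of how a player might act on these timestamps, the sketch below treats the audio clock as the reference and decides what to do with each video frame. The 90 kHz time base and the 40 ms tolerance are assumed example values, and the function names are hypothetical.

```python
# Sketch of PTS-based sync with audio as the reference clock.
# The 90 kHz time base and 40 ms tolerance are assumed example values.

def pts_to_seconds(pts, time_base_hz=90_000):
    """Convert a timestamp in clock ticks to seconds."""
    return pts / time_base_hz

def sync_action(video_pts_s, audio_clock_s, tolerance_s=0.040):
    """Decide what to do with a video frame relative to the audio clock."""
    diff = video_pts_s - audio_clock_s
    if diff > tolerance_s:
        return "wait"    # frame is early: keep it buffered for now
    if diff < -tolerance_s:
        return "drop"    # frame is late: skip it to catch up
    return "show"        # close enough: display it now

print(pts_to_seconds(180_000))   # 2.0
print(sync_action(2.0, 2.0))     # show
print(sync_action(2.5, 2.0))     # wait
print(sync_action(1.5, 2.0))     # drop
```

Audio is a common choice of reference because small glitches in sound tend to be more noticeable than a repeated or skipped video frame.
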
This topic is not covered in detail in this article; it is mentioned here for curious readers.