For decades, video compression has been a game of smart removal — deleting what the human eye won't miss. From MPEG-2 to H.264 to AV1, every generation has been about the same goal: keep what matters, hide what doesn't.

But we've hit the limit.

Compression Isn't Getting Smarter — It's Getting Softer

The upcoming AV2 codec is a perfect example. On paper, it claims up to 40% better compression than AV1. In reality, most of that "gain" comes from post-filters that blur fine details to please the metrics. Sharp edges, textures, and grain — all softened by denoising filters and restoration passes.

It looks cleaner to an algorithm. It looks softer to a human.

We're no longer improving compression; we're improving the illusion of quality.

From Blocks to Brains: We've Already Mined the Math

Old codecs worked in simple blocks. MPEG-2 used fixed 16×16 macroblocks built on 8×8 DCT blocks, and that was once state of the art. Then came adaptive partitioning, quarter-pixel motion vectors, and P/B-frame pyramids with recursive prediction. The math has gotten incredible, but the gains are shrinking.

We've squeezed every drop out of DCTs, transforms, and entropy coding. Every new codec claims another 10–15% improvement — but at massive computational cost and often at the expense of detail.
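To see how little is left in the transform well, here's a minimal sketch of the 8×8 DCT-plus-quantization step every block codec since MPEG-2 has relied on. NumPy only; the quantizer step of 16 is illustrative, not any codec's actual table.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis, the transform at the heart of MPEG-2/H.264."""
    k = np.arange(n)
    c = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2 / n)
    c[0, :] = np.sqrt(1 / n)
    return c

C = dct_matrix()
block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 4   # a smooth gradient block

coeffs = C @ block @ C.T                  # forward 2-D DCT
q = 16.0                                  # illustrative quantizer step
quantized = np.round(coeffs / q) * q      # most coefficients collapse to zero
recon = C.T @ quantized @ C               # inverse 2-D DCT

print(np.count_nonzero(quantized), "of 64 coefficients survive")
print("max pixel error:", round(float(np.abs(recon - block).max()), 2))
```

Smooth content concentrates its energy in a handful of low-frequency coefficients, which is exactly why this machinery worked so well, and why there's so little left to squeeze.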

Visual Evolution: Block Sizes Through the Decades

Here's how block-based encoding has evolved:

MPEG-2 (1995) - Fixed 8×8 DCT Blocks
┌────┬────┬────┬────┐
│ 8×8│ 8×8│ 8×8│ 8×8│  Simple, uniform grid
├────┼────┼────┼────┤  Every block treated equally
│ 8×8│ 8×8│ 8×8│ 8×8│  ~6 Mbps for broadcast quality
├────┼────┼────┼────┤
│ 8×8│ 8×8│ 8×8│ 8×8│
└────┴────┴────┴────┘

H.264 (2003) - Adaptive 16×16 Macroblocks with Sub-partitions
┌─────────────┬──────┬──────┐
│             │  8×8 │  8×8 │  Can split 16×16 down to 4×4
│   16×16     ├──────┼──────┤  Motion vectors per partition
│             │  8×8 │  8×8 │  ~3 Mbps for same quality
├──────┬──────┼──────┴──────┤
│  8×8 │  8×8 │    16×16    │  Adaptive based on content
├──────┴──────┤             │
│    16×16    │             │
└─────────────┴─────────────┘

H.265/HEVC (2013) - Quad-tree 64×64 CTUs
┌───────────────────────┬──────┬──────┐
│                       │      │      │  Recursive quad-tree splitting
│                       │ 16×16│ 16×16│  64×64 → 32×32 → 16×16 → 8×8
│       64×64           ├──────┴──────┤  35 different prediction modes
│                       │    32×32    │  ~1.5 Mbps for same quality
│                       │             │
├───────┬───────┬───────┼─────────────┤
│ 32×32 │ 16×16 │ 16×16 │             │
│       ├───┬───┼───┬───│    32×32    │
│       │8×8│8×8│8×8│8×8│             │
└───────┴───┴───┴───┴───┴─────────────┘

AV1 (2018) - Flexible 128×128 Superblocks with Non-Square Partitions
┌─────────────────────────────┬────────┐
│                             │  32×16 │  Non-square partitions!
│                             ├────────┤  Can be 32×8, 8×32, 64×16, etc.
│          128×64             │  32×16 │  Compound prediction modes
│                             ├───┬────┤  ~1 Mbps for same quality
│                             │8×8│16×8│
├────────┬────────┬───────────┴───┴────┤
│  64×32 │  32×32 │       64×64        │
├────────┼────────┤                    │
│  64×32 │ 32×16  │                    │
└────────┴────────┴────────────────────┘

Notice the trend: blocks keep getting larger and more flexible. But each generation requires exponentially more computation to decide how to split them.
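The split decision itself fits in a few lines. This toy partitioner splits on raw pixel variance; real encoders minimize a rate-distortion cost instead, but the recursion, and the combinatorial cost of evaluating it, is the same one behind HEVC and AV1 block trees. The threshold is made up for illustration.

```python
import numpy as np

def split_decision(block, min_size=8, var_thresh=200.0):
    """Toy quad-tree partitioner: keep splitting while variance is high."""
    size = block.shape[0]
    if size <= min_size or np.var(block) < var_thresh:
        return size                        # leaf: encode this block whole
    h = size // 2
    return [split_decision(block[:h, :h], min_size, var_thresh),
            split_decision(block[:h, h:], min_size, var_thresh),
            split_decision(block[h:, :h], min_size, var_thresh),
            split_decision(block[h:, h:], min_size, var_thresh)]

flat = np.full((64, 64), 128.0)                            # parking-lot pavement
busy = np.random.default_rng(0).normal(128, 40, (64, 64))  # dense texture

print(split_decision(flat))   # one big leaf, no split needed
print(split_decision(busy))   # nested lists of small leaves
```

Flat regions resolve instantly; textured regions recurse all the way down, and a real encoder has to price every candidate split along the way.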

Frame Types: The GOP Structure

Early codecs used simple I-frame (keyframe) and P-frame (predicted) structures. Then B-frames (bi-directional) were added for even better compression. Modern codecs now use long chains of P-frames with occasional B-frame pyramids.

MPEG-2 Era - Simple GOP (Group of Pictures)
I───B───B───P───B───B───P───B───B───P───B───B───I
│   ↑   ↑   │   ↑   ↑   │   ↑   ↑   │   ↑   ↑   │
│   └───┴───┘   └───┴───┘   └───┴───┘   └───┴───┘ GOP = 12 frames (display order)
│                                               │ 2 B-frames between each anchor (I or P)
└───────────────────────────────────────────────┘

I  = Intra-frame (keyframe) - fully encoded, no prediction
P  = Predicted frame - references previous I or P frame
B  = Bi-directional - references both past and future frames


H.264 Era - B-Frame Pyramids
I───────P───────────────P───────────────I
│       │               │               │
│       ├───B───────────┤               │ Hierarchical B-frames
│       │   ↑           │               │ B-frames can reference
│       │   │           │               │ other B-frames
│       │   B───B───B───┤               │
│       │   ↑   ↑   ↑   │               │
│       └───┴───┴───┴───┘               │
└────────────────────────────────────────┘
        More compression, more latency


Modern H.264/H.265 - Long P-Frame Chains
I───P───P───P───P───P───P───P───P───P───P───P───I
│   ↑   ↑   ↑   ↑   ↑   ↑   ↑   ↑   ↑   ↑   ↑   │
│   └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘   │ GOP = 60+ frames
│                                                │ Low latency
└────────────────────────────────────────────────┘ Long-term references

Common for surveillance and live streaming
Lower latency than B-frames
Requires good bitrate control

Surveillance systems typically use long P-frame chains (I-frame every 1-2 seconds) because low latency matters more than maximum compression. Live streaming and broadcasting still use B-frames for better compression, accepting the latency trade-off.

Bitrate Allocation Across Frame Types

Typical Bit Distribution in a GOP (3 Mbps stream, 30 fps, I-frame every 60 frames: a budget of ~6,000 kbit per 2-second GOP)

I-Frame (Keyframe):
████████████████████████████████████████ ~600 kbit (10% of the GOP budget, 1 of 60 frames)
Full image encoded, all macroblocks intra-coded

P-Frames (59 frames):
████████ ~90 kbit each × 59 ≈ 5,300 kbit (88% of the GOP budget)
Only encode differences from the previous frame
Motion vectors + residuals

Overhead (headers, metadata):
█ ~100 kbit (2% of the GOP budget)

Total: 3 Mbps on average, but highly variable frame to frame:
Frame #1  (I): ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ~600 kbit ◄── Keyframe spike
Frame #2  (P): ▓▓▓▓ ~100 kbit
Frame #3  (P): ▓▓▓ ~85 kbit
Frame #4  (P): ▓▓ ~75 kbit
Frame #5  (P): ▓▓▓ ~90 kbit
...
Frame #30 (P): ▓▓▓▓▓ ~105 kbit ◄── Accumulated drift
Frame #31 (P): ▓▓▓▓ ~95 kbit
...
Frame #60 (P): ▓▓▓▓▓▓ ~115 kbit ◄── Quality degradation before next I-frame
Frame #61 (I): ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ~600 kbit ◄── Refresh and reset

This is why I-frame intervals matter so much for forensic quality. By frame #60, you've accumulated prediction errors across 59 P-frames. That license plate might be blurry until the next I-frame resets everything.
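As a back-of-the-envelope check of that budget (the ~600 kbit keyframe is an assumed spike, not a standard value):

```python
# Back-of-the-envelope GOP budget check
bitrate_bps = 3_000_000
fps = 30
gop_len = 60                                # I-frame every 2 seconds

gop_budget = bitrate_bps * gop_len / fps    # bits available per GOP
i_frame_bits = 600_000                      # assumed keyframe spike
p_frame_bits = (gop_budget - i_frame_bits) / (gop_len - 1)

print(f"GOP budget:  {gop_budget / 1e6:.1f} Mbit")
print(f"I-frame:     {i_frame_bits / gop_budget:.0%} of the budget")
print(f"Avg P-frame: {p_frame_bits / 1e3:.0f} kbit")
```

Shorten the I-frame interval and the keyframe share climbs, but every P-frame chain gets a fresh, clean starting point sooner.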


At this point, compression isn't about math anymore. It's about psychology.

Axis and the Art of Not Sending Video

Axis Communications saw this years ago. Instead of focusing on better transforms, they focused on not sending data that doesn't matter.

Their motion-based "Zipstream" idea was simple genius: only transmit full frames when there's real movement. The rest is stillness, and stillness compresses infinitely well.

The Zipstream Insight

Traditional codecs treat every second of video as equally important. Zipstream recognizes that surveillance video is 90% static scenes with occasional motion events.

Why send full bitrate to encode a static parking lot?

The innovation isn't better compression — it's contextual awareness. Send high quality when motion is detected, drop to minimal bitrate when nothing changes.
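A minimal sketch of that gate, using a plain frame difference. The thresholds and bitrates here are invented for illustration; they are not Axis's actual tuning.

```python
import numpy as np

def target_bitrate(prev, curr, base_kbps=64, active_kbps=2000,
                   pix_thresh=25, area_thresh=0.01):
    """Spend bits only when the scene actually changes."""
    changed = np.abs(curr.astype(int) - prev.astype(int)) > pix_thresh
    if changed.mean() > area_thresh:   # enough pixels moved: a real event
        return active_kbps
    return base_kbps                   # stillness compresses almost for free

rng = np.random.default_rng(1)
still = rng.integers(0, 255, (240, 320), dtype=np.uint8)
moved = still.copy()
moved[100:140, 150:210] = 255          # something enters the frame

print(target_bitrate(still, still))    # static scene: minimal bitrate
print(target_bitrate(still, moved))    # motion event: full bitrate
```

Production systems layer noise filtering and hysteresis on top so sensor grain and flicker don't trip the gate, but the core decision is this simple.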

That concept has since spread across the entire industry — everyone from Hikvision to Dahua now calls it "smart codec," but the core idea remains: don't waste bandwidth on static pixels.

This kind of temporal intelligence is the real future. It's not about another codec; it's about deciding when to send video, not just how to compress it.

The Frame Rate Myth

Another industry obsession: frame rate. People assume 60 fps means "better." For live streaming or sports, maybe. For surveillance? Not really.

If you lock your bitrate at 3 Mbps and jump from 25 fps to 60 fps, you're cutting the bits per frame by more than half. You end up compressing motion blur itself — sharper movement, worse image.
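The arithmetic is blunt: at a locked bitrate, frames and bits trade off directly.

```python
def kbit_per_frame(bitrate_bps, fps):
    """Bits available per frame at a fixed bitrate."""
    return bitrate_bps / fps / 1000

locked = 3_000_000  # 3 Mbps, fixed
for fps in (15, 25, 30, 60):
    print(f"{fps:2d} fps -> {kbit_per_frame(locked, fps):5.0f} kbit per frame")
```

Going from 25 fps to 60 fps at the same 3 Mbps drops each frame's budget from 120 kbit to 50 kbit, less than half the data per image.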

The Frame Rate Reality Check

More frames don't mean more clarity — they mean less data per frame.

Security Cameras: The 15 FPS Advantage

For security cameras with bandwidth considerations, you're often better off using 15 fps instead of 30 fps. Here's why:

15 quality frames beats 30 low-quality frames. With an I-frame interval around once per second, you get good usable frames of data at regular intervals. When you need to go back and look for a license plate or get a clean picture of a person's face, that lower frame rate with higher per-frame quality gives you far better forensic data.

At 30 fps with limited bandwidth, you're spreading your bitrate too thin. Motion might look smoother during live playback, but the quality of each individual frame tends to go down. When it matters most — during forensic analysis — you want sharp, detailed frames, not smooth motion.

This is where these new perception-based filters will really shine. They're specifically designed to account for things that need to be kept: fine details, faces, text, edges. A perception-aware encoder at 15 fps could preserve license plate clarity and facial features far better than a traditional encoder at 30 fps with the same bandwidth budget.

The industry has been chasing higher frame rates for the wrong reasons. For surveillance and forensics, frame quality beats frame rate every single time.

The Next Chapter: Perception-Based Compression

We've mastered compression. The next frontier is perception.

Codecs will start to prioritize faces, license plates, edges, and text — regions the human eye or AI models actually care about. The rest can fade into softness.

Future encoders won't just predict pixels; they'll predict attention.


That's where true efficiency lies — not in another 5% bitrate savings, but in understanding what matters visually and discarding the rest intelligently.

Why H.264 Still Wins

With all this talk of next-generation codecs, there's an inconvenient truth: H.264 is still the best choice for most real-world deployments.

Not because it's technically superior, but because it works everywhere: hardware decoding in virtually every phone, browser, and NVR, mature encoders, and predictable behavior on bad networks.

H.265 and AV1 promise better compression, but in typical surveillance scenarios (400-800 Kbps bitrates, wireless links, variable packet loss), the gains evaporate. You end up fighting codec quirks, decoder compatibility, and unpredictable quality.

H.264 just works. And in a world where "smarter" codecs keep getting softer, that reliability is worth more than benchmark scores.

The Backport Opportunity

Here's where it gets interesting: many of these new post-processing filters and perceptual optimizations being developed for AV2 and VVC aren't fundamentally tied to those codecs. They're preprocessing and postprocessing techniques that could theoretically be backported to H.264 and H.265.

We might soon see movement in the open-source encoder libraries that have been relatively stable for years:

The Open Source Encoder Landscape

x264 (libx264) — VideoLAN's H.264 encoder, the gold standard for software encoding. While still maintained with periodic updates, the core algorithm has been largely stable since the mid-2010s as H.264 reached maturity.

x265 (libx265) — MulticoreWare's H.265/HEVC encoder. Development has slowed considerably in recent years as focus shifted to newer codecs, though critical updates still occur.

Both projects could benefit enormously from incorporating modern perception-based optimizations — saliency-aware bit allocation, AI-driven preprocessing, and context-sensitive encoding decisions — without touching the core codec spec.

The beauty of this approach? You get the perceptual improvements of next-gen codecs while maintaining the universal compatibility and reliability of H.264/H.265. No new decoders required. No compatibility headaches. Just smarter encoding of a proven format.

If the industry is smart, we'll see these techniques trickle down to the encoders everyone actually uses, rather than being locked inside codecs nobody can decode.

What Can Actually Be Backported While Maintaining Standards

Here's the crucial insight: most of the innovation in AV2 and VVC isn't in the codec specification itself — it's in the encoder intelligence layer. Pre-processing filters, content analysis, and encoding decision logic can all be added to H.264 and H.265 encoders without breaking compatibility.

A decoder doesn't know (or care) what you did before encoding. As long as the bitstream conforms to the H.264 or H.265 standard, any compliant decoder will play it perfectly.

The Three Layers of Modern Encoding

┌─────────────────────────────────────────────────────────────┐
│  Layer 1: PRE-PROCESSING (Before Encoding)                   │
│  ─────────────────────────────────────────────────────────   │
│  • Adaptive denoising (preserve detail, remove noise)        │
│  • Content-aware sharpening (edges, faces, text)             │
│  • Perceptual preprocessing (ROI detection, saliency maps)   │
│  • Temporal filtering (reduce flicker, stabilize)            │
│  • Scene detection (identify cuts, fades, motion events)     │
│                                                               │
│  ✓ Can be backported to H.264 and H.265                      │
│  ✓ No decoder changes needed                                 │
└─────────────────────────────────────────────────────────────┘
         ↓ Feed processed frames to encoder
┌─────────────────────────────────────────────────────────────┐
│  Layer 2: ENCODING INTELLIGENCE (During Encoding)            │
│  ─────────────────────────────────────────────────────────   │
│  • ROI-based bit allocation (faces get more bits)            │
│  • Motion-adaptive encoding (Zipstream-style logic)          │
│  • Semantic segmentation (prioritize important regions)      │
│  • Psychovisual optimization (tune for human perception)     │
│  • Rate control intelligence (scene-adaptive bitrate)        │
│                                                               │
│  ✓ H.265: Nearly everything can be backported                │
│  ⚠ H.264: Some limitations due to simpler block structure    │
└─────────────────────────────────────────────────────────────┘
         ↓ Standard H.264/H.265 bitstream
┌─────────────────────────────────────────────────────────────┐
│  Layer 3: POST-PROCESSING (After Decoding) - OPTIONAL        │
│  ─────────────────────────────────────────────────────────   │
│  • Deblocking filters (reduce compression artifacts)         │
│  • Deringing (smooth edge halos)                             │
│  • Detail restoration (sharpen lost textures)                │
│  • Neural enhancement (AI-based upscaling, cleanup)          │
│                                                               │
│  ⚠ Decoder-side implementation required                      │
│  ⚠ Not part of the standard (player-dependent)               │
└─────────────────────────────────────────────────────────────┘

Pre-Processing: Safe for Both H.264 and H.265

These techniques happen before the encoder ever sees the video. The codec doesn't know you did anything.

Proven Pre-Processing Techniques

Adaptive Temporal Denoising
Modern cameras produce noise in low light. Traditional denoisers blur everything. Smart denoisers (like those in AV2) preserve edges and fine detail while removing sensor noise. This gives the encoder cleaner input, resulting in better compression.
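A stripped-down version of the idea: blend only where the frame is static, pass moving pixels through untouched. The threshold and blend factor are illustrative.

```python
import numpy as np

def temporal_denoise(prev_out, curr, motion_thresh=12, alpha=0.25):
    """Average static pixels across frames; let moving pixels through
    so edges and detail survive."""
    prev_out = prev_out.astype(float)
    curr = curr.astype(float)
    static = np.abs(curr - prev_out) < motion_thresh
    blended = (1 - alpha) * prev_out + alpha * curr
    return np.where(static, blended, curr)

rng = np.random.default_rng(2)
clean = np.full((4, 4), 100.0)         # a static scene
out = clean.copy()
for _ in range(50):                    # fifty noisy frames of the same scene
    out = temporal_denoise(out, clean + rng.normal(0, 3, clean.shape))

print(round(float(np.abs(out - clean).mean()), 2))  # residual noise, well under the per-frame sigma
```

The encoder then sees a stable, low-noise signal, so fewer bits are wasted encoding sensor grain as if it were detail.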

Content-Aware Sharpening
Instead of sharpening the entire frame uniformly, modern filters detect faces, license plates, and text, then apply selective sharpening. Background textures (grass, sky) are left alone or slightly softened to save bits.

Scene Change Detection with Frame Preparation
When a scene cut is detected, force an I-frame. When motion stops (Zipstream-style), reduce bitrate dramatically. This is pure encoder logic — the bitstream remains standard-compliant.
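A simple luma-histogram comparison is enough to drive that I-frame decision (bin count and threshold here are illustrative):

```python
import numpy as np

def is_scene_cut(prev, curr, bins=32, threshold=0.5):
    """Flag a hard cut when consecutive luma histograms diverge;
    encoder logic then forces an I-frame at that spot."""
    h1 = np.histogram(prev, bins=bins, range=(0, 256))[0] / prev.size
    h2 = np.histogram(curr, bins=bins, range=(0, 256))[0] / curr.size
    return float(np.abs(h1 - h2).sum()) > threshold

dark = np.full((120, 160), 40, dtype=np.uint8)
bright = np.full((120, 160), 200, dtype=np.uint8)

print(is_scene_cut(dark, dark + 5))   # same scene, small drift: no cut
print(is_scene_cut(dark, bright))     # hard cut: force a keyframe
```

Nothing here touches the bitstream format; the decision only changes where the encoder chooses to place keyframes.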

Perceptual Color Grading
Human eyes are more sensitive to certain color ranges. Pre-processing can subtly adjust saturation and hue to allocate more bits where perception matters most. The codec never knows you did this.

Backport Status: ✓ All of these work with H.264 and H.265. No changes to the codec standard required.

Encoding Intelligence: H.265 Gets More, H.264 Gets Some

This is where H.265's more flexible structure pays off. H.264 can benefit from smarter encoding decisions, but H.265's larger CTUs (Coding Tree Units) and more prediction modes give it more room to work with.

What H.265 Can Do (That H.264 Struggles With)

Fine-Grained ROI Encoding
H.265's 64×64 CTUs with recursive quad-tree splitting allow for much more precise control over where bits go. You can allocate 3-4× more bitrate to a detected face while keeping the background at minimal quality. H.264's 16×16 macroblocks make this harder — you can do it, but with less precision.

Sophisticated Motion-Adaptive Bitrate Control
H.265 can more easily vary quality across a frame. Static regions can drop to extremely low bitrate while moving objects maintain high quality. H.264 can do this, but it's more limited by its block structure.

Better Psychovisual Tuning
H.265 encoders can take advantage of the codec's flexibility to optimize for human perception rather than PSNR metrics. This includes preserving high-frequency detail in faces while softening backgrounds. H.264 can do some of this, but not as effectively.

Backport Status:
✓ H.265: Nearly unlimited potential for encoder intelligence
⚠ H.264: Works, but with reduced precision due to smaller macroblocks

What H.264 CAN Do (And Should)

Motion-Based Bitrate Allocation (Zipstream Logic)
This works perfectly in H.264. When motion stops, drop the bitrate. When motion resumes, increase it. The standard fully supports variable bitrate encoding — this is just smarter rate control.

Macroblock-Level ROI Encoding
H.264's QP (Quantization Parameter) can be adjusted per macroblock. You can absolutely prioritize faces, license plates, and detected objects. It's just less granular than H.265 (16×16 blocks vs 64×64 CTUs).
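Sketched with NumPy below; the face box is a hypothetical detector output, and the offset values are illustrative. x264 exposes a per-macroblock offset array of roughly this shape (`quant_offsets`), though the actual plumbing is encoder-specific.

```python
import numpy as np

MB = 16  # H.264 macroblock size

def qp_offset_map(frame_h, frame_w, roi_boxes, roi_offset=-6, bg_offset=4):
    """Per-macroblock QP deltas: negative means finer quantization (more bits).

    roi_boxes are (x, y, w, h) in pixels, e.g. from a face detector.
    """
    rows, cols = frame_h // MB, frame_w // MB
    qp = np.full((rows, cols), bg_offset, dtype=int)
    for x, y, w, h in roi_boxes:
        qp[y // MB:(y + h + MB - 1) // MB, x // MB:(x + w + MB - 1) // MB] = roi_offset
    return qp

face = (320, 96, 64, 64)   # hypothetical face-detector output
qp = qp_offset_map(480, 640, [face])
print(qp.shape)            # one delta per 16×16 macroblock
```

Every macroblock covering the face gets a lower QP (finer quantization), while the background absorbs the savings at a coarser QP.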

Scene-Adaptive I-Frame Placement
Detect scene changes, motion events, or static periods and place I-frames intelligently. Modern H.264 encoders already do this, but it can be much more aggressive with better scene analysis.

Multi-Pass Encoding with Saliency Maps
For non-realtime applications (archival, post-production), H.264 can absolutely benefit from multi-pass encoding where the first pass generates a saliency map, and the second pass allocates bits accordingly.
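A sketch of that second pass, with the region scores standing in for hypothetical first-pass output:

```python
import numpy as np

def allocate_bits(saliency, total_bits, floor=0.2):
    """Second-pass allocation: every region keeps a guaranteed floor,
    the rest of the budget follows the first-pass saliency scores."""
    saliency = np.asarray(saliency, dtype=float)
    base = floor * total_bits / saliency.size
    extra = (1 - floor) * total_bits * saliency / saliency.sum()
    return base + extra

# Hypothetical first-pass scores: face, plate, road, sky
bits = allocate_bits([0.9, 0.8, 0.2, 0.05], total_bits=100_000)
print([round(b) for b in bits])
```

The floor keeps low-saliency regions from collapsing entirely, so the background stays legible even while faces and plates take the bulk of the budget.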

Backport Status: ✓ All of these work today in H.264. Just requires smarter encoder implementations.

The Real Innovation: Content Awareness

The biggest opportunity isn't new math — it's understanding what's in the video.

Traditional Encoder:
┌──────────────────┐
│  Input Frame     │──→ Encode every pixel equally ──→ Output Bitstream
└──────────────────┘

Context-Aware Encoder (What Can Be Backported):
┌──────────────────┐
│  Input Frame     │
└────────┬─────────┘
         ↓
┌────────────────────────────────────────┐
│  Content Analysis Layer                │
│  ────────────────────────────────────  │
│  • Face Detection (YOLOv8, OpenCV)     │
│  • License Plate Detection (OCR prep)  │
│  • Motion Detection (MOG2, KNN)        │
│  • Edge/Text Detection (Canny, Sobel)  │
│  • Saliency Mapping (where eyes look)  │
└────────┬───────────────────────────────┘
         ↓
┌────────────────────────────────────────┐
│  Intelligent Bit Allocation            │
│  ────────────────────────────────────  │
│  Face regions:    Low QP (fine)        │
│  License plates:  Low QP (fine)        │
│  Moving objects:  Medium QP            │
│  Static bg:       High QP (coarse)     │
│  Sky/pavement:    Highest QP           │
└────────┬───────────────────────────────┘
         ↓
┌──────────────────┐
│  H.264/H.265     │──→ Standard Compliant Bitstream
│  Encoder         │     (Any decoder can play it)
└──────────────────┘

This entire content analysis layer is completely independent of the codec. It just makes smarter decisions about what to preserve and what to sacrifice.

Why This Matters for Surveillance

Surveillance video has unique characteristics that make these backported techniques incredibly effective: scenes that are static most of the time, fixed camera angles, and a few small regions (faces, license plates, doorways) that carry nearly all of the forensic value.

The x264/x265 Opportunity

Both x264 and x265 are open-source projects with active communities. There's nothing preventing these encoders from implementing saliency-aware bit allocation, ROI-driven rate control, motion-adaptive bitrate logic, or content-aware preprocessing.

The technology exists. The codecs support it. We just need encoder implementations to catch up to what AV2 and VVC research has taught us.

And the best part? Every device on the planet can already decode it.

WINK's Perspective

At WINK, we don't chase benchmark scores. We care about perceived quality, reliability, and latency.

Our philosophy has always been simple: compression should never compromise visibility. A frame that arrives late or loses detail isn't worth sending at all.

Our Engineering Principles

  1. Compatibility over cleverness: We use codecs that work everywhere, not just in lab conditions
  2. Reliability over ratios: A stream that stays connected is better than one that's 10% smaller
  3. Transparency over magic: Operators should understand what's happening, not trust black-box algorithms
  4. Field-tested over theoretical: We trust what performs in real networks with real packet loss

The industry has reached the end of mathematical compression. What comes next will be contextual, perceptual, and situational.

We've spent 20 years getting smarter about what to remove.

Now it's time to get smarter about what to keep.

The Future of Video Encoding

The compression wars are over, and the winner isn't another codec.

The next decade won't be defined by AV2 or H.266. It'll be defined by systems that understand their content and adapt intelligently.
