High-level structure of the AV1 decoder design

ECE 5760 | AV1 Decoding

Hardware “Accelerated” AV1 Decoding using an FPGA


ECE 5760 - Hardware Acceleration via FPGA final project.

We integrated custom FPGA inverse discrete cosine transform (IDCT) blocks with the open-source dav1d AV1 decoder running on the DE1-SoC HPS.

Note: We did not manage to achieve performance gains over dav1d's optimized C and ARM assembly paths, but the project was a useful learning experience.

Overview

High-level structure of the design

High-level split between dav1d on the HPS and transform compute on the FPGA.

Project Introduction

Accelerating the entire AV1 decode process purely in hardware was not feasible on our target DE1-SoC FPGA: it would need far more logic and memory resources than the board provides, even though the FPGA is relatively powerful for a college course. Our goal was therefore not to complete the entire video decoding process in hardware, but to accelerate some of the key transforms that would otherwise be done in software.

Profiling graphs for 1D transform usage in dav1d

Profiling showed that DCT transforms dominated the 1D transform workload.

1D Transform Selection

To identify the ideal candidate for hardware acceleration, we profiled the dav1d software decoder to analyze 1D transform usage on a sample video.

We chose the inverse DCT because our profiling showed that it was the most expensive part of dav1d's decode pipeline, and because it is heavily math-based, which leaves a lot of room for parallelism.

Dependency graph of IDCT steps

Dependency graph used to combine valid optimized IDCT steps.

Inverse DCT

My main contribution was the inverse DCT implementation and helper-function package. I translated the AV1 specification's transform steps into SystemVerilog.

The first IDCT implementation was a direct state-machine translation of the spec. It was useful for establishing correctness but wildly expensive: a single naive 32-wide block used 66 DSPs. We later wrote an optimized version that wrapped the butterfly operation in reusable modules and muxed inputs through only two butterfly units, bringing usage down to four DSPs per block.
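To make the butterfly operation and the two-unit sharing concrete, here is a minimal software sketch in Python. It is not our SystemVerilog: the function names are made up for illustration, the 12-bit fixed-point constants are computed with `math.cos` and `math.sin` rather than taken from the spec's lookup table, and the rounding shift is only assumed to match the spec's rounding behaviour.

```python
import math

FRAC_BITS = 12  # assumed fixed-point precision of the trig constants


def round_shift(x: int, n: int) -> int:
    """Rounding arithmetic right shift: add half an LSB, then shift."""
    return (x + (1 << (n - 1))) >> n


def cos_fixed(angle: int) -> int:
    """cos(angle * pi / 64) as a fixed-point integer (computed, not a spec table)."""
    return round(math.cos(angle * math.pi / 64) * (1 << FRAC_BITS))


def sin_fixed(angle: int) -> int:
    """sin(angle * pi / 64) as a fixed-point integer (computed, not a spec table)."""
    return round(math.sin(angle * math.pi / 64) * (1 << FRAC_BITS))


def butterfly(a: int, b: int, angle: int) -> tuple[int, int]:
    """One butterfly rotation: four multiplies, two adds, two rounding shifts."""
    c, s = cos_fixed(angle), sin_fixed(angle)
    x = a * c - b * s
    y = a * s + b * c
    return round_shift(x, FRAC_BITS), round_shift(y, FRAC_BITS)


def schedule(ops: list, units: int = 2) -> list[list]:
    """Group one stage's independent butterfly ops into cycles of `units` ops."""
    return [ops[i:i + units] for i in range(0, len(ops), units)]
```

In this model, a stage with 16 independent butterflies takes 8 cycles on two shared units instead of needing 16 units at once, which is the area-for-latency trade the optimized version makes.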

Because the target block was 32x32, I could remove AV1 steps that only apply to 64-wide transforms. We also used a step dependency graph to combine safe Hadamard and butterfly operations while preserving the bit-exact ordering required by the decoder.
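The idea of using a dependency graph to decide which steps can be combined can be sketched in Python as a simple level-by-level schedule. The step names below are hypothetical placeholders, not the actual steps from the spec or from our design; the point is only that steps with no unmet dependencies can share a level.

```python
def level_schedule(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into levels; every step in a level depends only on earlier levels.

    `deps` maps each step name to the set of steps that must finish before it.
    """
    remaining = {step: set(before) for step, before in deps.items()}
    levels = []
    while remaining:
        # Steps whose dependencies have all been scheduled are ready now.
        ready = sorted(step for step, before in remaining.items() if not before)
        if not ready:
            raise ValueError("dependency cycle detected")
        levels.append(ready)
        for step in ready:
            del remaining[step]
        for before in remaining.values():
            before.difference_update(ready)
    return levels


# Hypothetical step names, for illustration only.
example_deps = {
    "permute": set(),
    "butterfly_a": {"permute"},
    "butterfly_b": {"permute"},
    "hadamard_a": {"butterfly_a"},
    "hadamard_b": {"butterfly_b"},
    "final": {"hadamard_a", "hadamard_b"},
}
example_levels = level_schedule(example_deps)
```

Steps that land in the same level do not depend on each other, so they are candidates for merging into one hardware step without changing the result.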

Resource usage with a single naive IDCT block

The naive IDCT used 66 DSPs for one block.

Hardware/Software Tradeoffs

The IDCT was a good candidate because hardware can apply many register-level operations in parallel, but dav1d's assembly is extremely well optimized and the HPS/FPGA transfer path is slow. In the end, moving data to and from the accelerator cost more than the compute it saved.
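The tradeoff can be written as a simple break-even check, sketched below in Python. Every number in the example is a placeholder chosen for illustration, not a measurement from our board, and the 32-bit coefficient width is an assumption.

```python
def offload_wins(cpu_time_s: float,
                 fpga_compute_s: float,
                 bytes_moved: int,
                 link_bytes_per_s: float,
                 per_call_overhead_s: float = 0.0) -> bool:
    """True if transfer + FPGA compute + fixed overhead beats the CPU-only time."""
    transfer_s = bytes_moved / link_bytes_per_s
    return fpga_compute_s + transfer_s + per_call_overhead_s < cpu_time_s


# Placeholder scenario: one 32x32 block, assumed 32-bit coefficients, sent and returned.
BLOCK_ROUND_TRIP_BYTES = 2 * 32 * 32 * 4

slow_link = offload_wins(10e-6, 1e-6, BLOCK_ROUND_TRIP_BYTES, 100e6)
fast_link = offload_wins(10e-6, 1e-6, BLOCK_ROUND_TRIP_BYTES, 10e9)
```

With these made-up values the offload loses on the slow link and wins on the fast one, even though the FPGA compute time is identical in both cases. The transfer term alone decides the outcome.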

We also hit FPGA resource limits hard. Only six transform blocks fit, mostly due to register and LUT pressure.

Results

Frame from Rick Astley's Never Gonna Give You Up decoded by our accelerated dav1d

A frame from Rick Astley's "Never Gonna Give You Up" decoded by our accelerated version of dav1d.

A (very slightly) Accelerated Rick Roll

For our test case we ran Rick Astley's "Never Gonna Give You Up" through our FPGA-assisted dav1d build. Because of the limited storage on our SD card, we had to downscale the video to 256x144; larger versions simply would not fit once decoded into raw video.
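A quick Python calculation shows why resolution matters so much for raw output. It assumes 8-bit 4:2:0 frames (one full-resolution luma plane plus two quarter-resolution chroma planes), which may not match the exact output format we wrote; the helper names are ours.

```python
def raw_frame_bytes(width: int, height: int) -> int:
    """Bytes in one 8-bit 4:2:0 frame: luma plane plus two subsampled chroma planes."""
    luma = width * height
    chroma = 2 * ((width + 1) // 2) * ((height + 1) // 2)
    return luma + chroma


def raw_video_bytes(width: int, height: int, frames: int) -> int:
    """Total raw size for a clip with the given number of frames."""
    return raw_frame_bytes(width, height) * frames


small = raw_frame_bytes(256, 144)    # 55296 bytes per frame
large = raw_frame_bytes(1920, 1080)  # 3110400 bytes per frame
```

Under these assumptions a 1080p frame is about 56 times larger than a 256x144 frame, so the raw size of the whole clip grows by the same factor.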

The decoder produced correct, watchable frames, confirming that the inverse DCT co-processor was integrated end-to-end with the software decode path.

Diff between CPU and FPGA-assisted decode showing no differences

Diff between CPU and FPGA-assisted decode. The grey shows that there are no differences.

Verification

To verify correctness, we compared our FPGA-assisted output against software-only dav1d's output frame by frame. The diff image is almost entirely uniform grey, which means the two decodes match and our hardware IDCT was bit-exact with the reference.
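A frame-by-frame diff of this kind can be sketched in a few lines of Python. This is an illustration of the idea, not the tool we used: it assumes both decodes are available as raw 8-bit sample buffers of equal length, and it maps "no difference" to mid-grey (128) so that brighter or darker pixels mark mismatches.

```python
def diff_frame(ref: bytes, test: bytes) -> bytes:
    """Per-sample difference image: 128 means identical, other values mark mismatches."""
    if len(ref) != len(test):
        raise ValueError("frames must have the same size")
    return bytes(max(0, min(255, 128 + (t - r))) for r, t in zip(ref, test))


def mismatch_count(ref: bytes, test: bytes) -> int:
    """Number of samples that differ between the two frames."""
    return sum(1 for value in diff_frame(ref, test) if value != 128)
```

A bit-exact decode gives `mismatch_count(ref, test) == 0` for every frame, which corresponds to a completely grey diff image.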

Small white speck revealing a decoding error in dav1d's ARM assembly

A small white speck in the diff revealed a bug in dav1d's ARM assembly path.

A funny dav1d bug (we think)

While testing the identity transform, we noticed a small white speck in the diff, indicating a real difference between the two outputs. After digging in, we concluded that the ARM assembly version of dav1d had a bug. It was not in the transform logic itself but somewhere higher up the pipeline, and it caused the transform logic to receive different coefficients.