SIEDD: Shared Implicit Encoder with Discrete Decoders

Vikram Rangarajan*, Shishira Maiya*, Max Ehrlich, Abhinav Shrivastava
University of Maryland

*Indicates Equal Contribution

Abstract

Implicit Neural Representations (INRs) offer exceptional fidelity for video compression by learning per-video optimized functions, but their adoption is crippled by impractically slow encoding times. Existing attempts to accelerate INR encoding often sacrifice reconstruction quality or crucial coordinate-level control essential for adaptive streaming and transcoding. We introduce SIEDD (Shared Implicit Encoder with Discrete Decoders), a novel architecture that fundamentally accelerates INR encoding without these compromises. SIEDD first rapidly trains a shared, coordinate-based encoder on sparse anchor frames to efficiently capture global, low-frequency video features. This encoder is then frozen, enabling massively parallel training of lightweight, discrete decoders for individual frame groups, further expedited by aggressive coordinate-space sampling. This synergistic design delivers a remarkable 20-30X encoding speed-up over state-of-the-art INR codecs on HD and 4K benchmarks, while maintaining competitive reconstruction quality and compression ratios. Critically, SIEDD retains full coordinate-based control, enabling continuous resolution decoding and eliminating costly transcoding. Our approach significantly advances the practicality of high-fidelity neural video compression, demonstrating a scalable and efficient path towards real-world deployment. Our codebase is available at https://github.com/VikramRangarajan/SIEDD.

Method and Architecture

Description of Figure 1

Our method involves two training phases. In the first, which trains the shared encoder, we sample key frames uniformly over time and use them to train a SIEDD model with a shared encoder MLP and a separate decoder MLP for each key frame.
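The phase-1 setup can be sketched in PyTorch as below. The class name `SIEDDPhase1`, the layer sizes, and the activation are illustrative assumptions, not the paper's exact configuration; the key structure is one coordinate encoder shared across all key frames plus one small decoder per key frame.

```python
import torch
import torch.nn as nn

class SIEDDPhase1(nn.Module):
    """Sketch of phase 1: a shared coordinate encoder with one decoder
    MLP per key frame. Widths/depths here are illustrative only."""

    def __init__(self, num_key_frames, hidden=512, enc_layers=2, dec_layers=2):
        super().__init__()
        # Shared encoder: maps 2D pixel coordinates to features.
        enc = [nn.Linear(2, hidden), nn.GELU()]
        for _ in range(enc_layers - 1):
            enc += [nn.Linear(hidden, hidden), nn.GELU()]
        self.encoder = nn.Sequential(*enc)

        # One discrete decoder per key frame, each ending in an RGB head.
        def make_decoder():
            layers = []
            for _ in range(dec_layers):
                layers += [nn.Linear(hidden, hidden), nn.GELU()]
            layers += [nn.Linear(hidden, 3)]
            return nn.Sequential(*layers)

        self.decoders = nn.ModuleList(make_decoder() for _ in range(num_key_frames))

    def forward(self, coords, frame_idx):
        # coords: (N, 2) normalized pixel coordinates in [-1, 1]
        feats = self.encoder(coords)  # global, low-frequency features
        return self.decoders[frame_idx](feats)

model = SIEDDPhase1(num_key_frames=4)
out = model(torch.rand(1024, 2) * 2 - 1, frame_idx=0)
```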

The second training phase trains decoders for the remaining video frames. Frames are trained sequentially in frame groups, whose size is an important tunable hyperparameter (in this figure, 20). The shared encoder weights are transferred and frozen, while the weights of the nearest key frame's decoder are transferred into the new group-shared decoder. The last layer is kept separate for each frame.
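A minimal sketch of this phase-2 initialization is below. The helper name `init_frame_group` and the assumption that each decoder is an `nn.Sequential` whose final layer is the output head are ours, not from the released code; the essential moves are freezing the encoder, warm-starting a group-shared decoder from the nearest key-frame decoder, and giving each frame its own copy of the last layer.

```python
import copy
import torch.nn as nn

def init_frame_group(phase1_model, nearest_key_idx, group_size):
    """Sketch of phase-2 setup (illustrative names/structure):
    - freeze the shared encoder trained in phase 1,
    - copy the nearest key frame's decoder (minus its last layer)
      into a decoder shared by the whole frame group,
    - give every frame in the group its own final layer."""
    encoder = phase1_model.encoder
    for p in encoder.parameters():
        p.requires_grad_(False)  # encoder stays frozen in phase 2

    src = phase1_model.decoders[nearest_key_idx]
    shared_decoder = copy.deepcopy(src[:-1])  # all but the last layer, shared
    last_layers = nn.ModuleList(
        copy.deepcopy(src[-1]) for _ in range(group_size)
    )
    return encoder, shared_decoder, last_layers
```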

During training, coordinates are sampled and the reconstruction loss (L1 or L2) for the associated pixels is minimized.
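One such training step might look like the sketch below. The uniform-random sampling, sample count, and function name `train_step` are our assumptions; the point is that only a subset of coordinates (and their target pixels) is touched per step, with the frozen encoder run under `no_grad`.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, shared_decoder, last_layer, frame, optimizer,
               n_samples=4096):
    """One phase-2 step with aggressive coordinate sampling (a sketch;
    the sampling strategy and loss choice here are illustrative).
    frame: (H, W, 3) tensor of target pixel values in [0, 1]."""
    H, W, _ = frame.shape
    ys = torch.randint(0, H, (n_samples,))
    xs = torch.randint(0, W, (n_samples,))
    # Normalize sampled pixel locations to [-1, 1] coordinates.
    coords = torch.stack([ys / (H - 1), xs / (W - 1)], dim=-1) * 2 - 1
    target = frame[ys, xs]                      # (n_samples, 3)

    with torch.no_grad():
        feats = encoder(coords)                 # frozen shared encoder
    pred = last_layer(shared_decoder(feats))
    loss = F.l1_loss(pred, target)              # L1 (L2/MSE also an option)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```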

Visual Results


Figures

Here, we show the rate-distortion tradeoff.

Here, we show the speed-quality tradeoff. The size of the points represents the compression rate (in BPP).

The figures above are a comparison of our method and baselines on UVG-HD.

The any-resolution decoding capabilities of SIEDD. Because of the coordinate-based approach, lower resolutions require fewer coordinate queries (a smaller batch), speeding up inference and removing the need for expensive transcoding.
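This decoding path can be sketched as follows (the function name `decode_frame` is ours): a frame at any target resolution is produced by evaluating the network on a coordinate grid of that size, so halving the resolution quarters the number of queries.

```python
import torch

@torch.no_grad()
def decode_frame(encoder, shared_decoder, last_layer, height, width):
    """Decode one frame at an arbitrary target resolution (a sketch).
    The coordinate grid is built at the requested size, so lower
    resolutions mean fewer queries and cheaper decoding, with no
    separate transcoding step."""
    ys = torch.linspace(-1, 1, height)
    xs = torch.linspace(-1, 1, width)
    grid = torch.cartesian_prod(ys, xs)          # (height * width, 2)
    rgb = last_layer(shared_decoder(encoder(grid)))
    return rgb.view(height, width, -1).clamp(0, 1)
```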

The second training phase of SIEDD is embarrassingly parallel, meaning different GPUs can simultaneously train different frame groups.
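Since the frozen encoder removes any shared trainable state between frame groups, dispatching them is a plain fan-out. The sketch below uses a process pool and placeholder worker names (`train_frame_group`, `launch_parallel`); real code would load the frame group's pixels and the frozen encoder inside each worker and pin it to its assigned GPU.

```python
import multiprocessing as mp

def train_frame_group(group_idx, gpu_id):
    """Placeholder for phase-2 training of one frame group on one GPU
    (a sketch; a real worker would move its model to `gpu_id` and run
    the coordinate-sampled training loop for its group)."""
    return group_idx, gpu_id

def launch_parallel(num_groups, num_gpus):
    # Frame groups share no trainable state once the encoder is frozen,
    # so they can be trained independently, round-robined across GPUs.
    jobs = [(g, g % num_gpus) for g in range(num_groups)]
    with mp.Pool(processes=num_gpus) as pool:
        return pool.starmap(train_frame_group, jobs)
```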

The PSNR vs. BPP curve when ablating the model layer dimension and the number of decoder layers separately. Note that SIEDD-S is the d=512 point, SIEDD-M is the d=768 point, and SIEDD-L is the d=1024 point. These are unquantized BPP values.

The PSNR curve with the same ablations as above, but in relation to the encoding time.

BibTeX