Implicit Neural Representations (INRs) offer exceptional fidelity for video compression by learning per-video optimized functions, but their adoption is crippled by impractically slow encoding times. Existing attempts to accelerate INR encoding often sacrifice reconstruction quality or crucial coordinate-level control essential for adaptive streaming and transcoding. We introduce SIEDD (Shared-Implicit Encoder with Discrete Decoders), a novel architecture that fundamentally accelerates INR encoding without these compromises. SIEDD first rapidly trains a shared, coordinate-based encoder on sparse anchor frames to efficiently capture global, low-frequency video features. This encoder is then frozen, enabling massively parallel training of lightweight, discrete decoders for individual frame groups, further expedited by aggressive coordinate-space sampling. This synergistic design delivers a remarkable 20-30X encoding speed-up over state-of-the-art INR codecs on HD and 4K benchmarks, while maintaining competitive reconstruction quality and compression ratios. Critically, SIEDD retains full coordinate-based control, enabling continuous resolution decoding and eliminating costly transcoding. Our approach significantly advances the practicality of high-fidelity neural video compression, demonstrating a scalable and efficient path towards real-world deployment. Our codebase is available at https://github.com/VikramRangarajan/SIEDD.
Our method involves two training phases. In the first, the shared encoder is trained: key frames are sampled uniformly over time and used to fit a SIEDD model with a shared encoder MLP and a separate decoder MLP for each key frame.
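To make the phase-1 setup concrete, the following is a minimal PyTorch sketch of a shared coordinate encoder paired with one lightweight decoder per key frame. All class names, widths, and layer counts here are illustrative assumptions, not the exact SIEDD implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, layers):
    # Simple MLP builder used for both the encoder and the decoders.
    blocks, d = [], in_dim
    for _ in range(layers):
        blocks += [nn.Linear(d, hidden), nn.GELU()]
        d = hidden
    blocks.append(nn.Linear(d, out_dim))
    return nn.Sequential(*blocks)

class Phase1Model(nn.Module):
    """Illustrative phase-1 model: shared encoder, one decoder per key frame."""
    def __init__(self, num_key_frames, hidden=512, enc_layers=2, dec_layers=2):
        super().__init__()
        # Shared encoder maps (x, y) coordinates to a latent feature.
        self.encoder = mlp(2, hidden, hidden, enc_layers)
        # One lightweight decoder per key frame maps the latent to RGB.
        self.decoders = nn.ModuleList(
            mlp(hidden, hidden, 3, dec_layers) for _ in range(num_key_frames)
        )

    def forward(self, coords, frame_idx):
        # coords: (N, 2) normalized pixel coordinates of key frame `frame_idx`.
        return self.decoders[frame_idx](self.encoder(coords))
```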
The second training phase fits the remaining video frames. Frames are trained sequentially in frame groups, whose size is an important tunable hyperparameter (20 in this figure). The shared encoder weights are transferred and frozen, while the weights of the closest key-frame decoder are transferred into the group's new shared decoder; the last layer is kept separate.
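A sketch of how one frame group could be initialized for phase 2 is shown below, assuming the phase-1 model from the previous sketch. The helper name, the nearest-key-frame selection rule, and the per-frame last layers are assumptions for illustration rather than the exact SIEDD code.

```python
import copy
import torch.nn as nn

def init_frame_group(phase1_model, group_frame_indices, key_frame_indices):
    """Illustrative phase-2 setup for one frame group: freeze the shared
    encoder and warm-start the group's shared decoder from the key-frame
    decoder closest in time, keeping the last layer separate per frame."""
    encoder = phase1_model.encoder
    for p in encoder.parameters():
        p.requires_grad_(False)          # transferred and frozen

    # Pick the key-frame decoder nearest (in frame index) to this group.
    center = sum(group_frame_indices) / len(group_frame_indices)
    nearest = min(range(len(key_frame_indices)),
                  key=lambda i: abs(key_frame_indices[i] - center))
    src = phase1_model.decoders[nearest]

    shared_decoder = copy.deepcopy(nn.Sequential(*list(src)[:-1]))   # shared trunk
    last_layers = nn.ModuleList(copy.deepcopy(src[-1])               # per-frame heads
                                for _ in group_frame_indices)
    return encoder, shared_decoder, last_layers
```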
During training, coordinates are sampled and the reconstruction loss (L1 or L2) over the associated pixels is minimized.
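The sampling-based objective can be sketched as the training step below, reusing the components from the previous sketch. The sample count, coordinate normalization, and per-frame heads are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, shared_decoder, last_layers, frames, optimizer,
               n_samples=16384, loss_type="l2"):
    """One illustrative phase-2 step: sample a random subset of coordinates,
    reconstruct the associated pixels for every frame in the group, and
    minimize an L1 or L2 loss."""
    T, H, W, _ = frames.shape                      # frames: (T, H, W, 3) in [0, 1]
    # Aggressive coordinate-space sampling: only n_samples pixels per step.
    ys = torch.randint(0, H, (n_samples,))
    xs = torch.randint(0, W, (n_samples,))
    coords = torch.stack([xs / (W - 1), ys / (H - 1)], dim=-1) * 2 - 1  # in [-1, 1]

    with torch.no_grad():                          # encoder is frozen in phase 2
        feats = encoder(coords)
    trunk = shared_decoder(feats)

    loss = 0.0
    for t in range(T):
        pred = last_layers[t](trunk)               # per-frame last layer
        target = frames[t, ys, xs]                 # ground-truth pixels at sampled coords
        loss = loss + (F.l1_loss(pred, target) if loss_type == "l1"
                       else F.mse_loss(pred, target))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```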
Here, we show the rate-distortion tradeoff.
Here, we show the speed-quality tradeoff. The size of each point indicates the bitrate in bits per pixel (BPP).
The figures above compare our method with baselines on UVG-HD.
The any-resolution decoding FPS of SIEDD. Because of the coordinate-based approach, lower resolutions require fewer coordinates per batch, speeding up inference and removing the need for expensive transcoding.
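A minimal sketch of this resolution-flexible decoding, assuming the components from the sketches above: the target resolution only changes the size of the coordinate grid that is evaluated, so lower resolutions are proportionally cheaper and no transcoding step is needed. The function and chunking scheme are assumptions for illustration.

```python
import torch

def decode_frame(encoder, shared_decoder, last_layer, height, width, chunk=65536):
    """Illustrative any-resolution decoding: build a coordinate grid at the
    requested resolution and evaluate the frozen encoder plus decoders on it."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, height), torch.linspace(-1, 1, width), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)    # (H*W, 2)

    out = []
    with torch.no_grad():
        for c in coords.split(chunk):                        # chunked to bound memory
            out.append(last_layer(shared_decoder(encoder(c))))
    return torch.cat(out).reshape(height, width, 3)

# e.g. decode the same frame at 1080p or 540p from one model, no re-encoding:
# full = decode_frame(encoder, shared_decoder, last_layers[0], 1080, 1920)
# half = decode_frame(encoder, shared_decoder, last_layers[0], 540, 960)
```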
The second training phase of SIEDD is embarrassingly parallel: different GPUs can simultaneously train different frame groups.
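Because the frozen shared encoder means frame groups share no trainable state, they can be dispatched independently, for example with a simple process pool as sketched below. The worker signature, group layout, and GPU assignment are hypothetical.

```python
from multiprocessing import Pool

def train_group(gpu_id, frame_indices):
    """Hypothetical worker: run phase-2 training for one frame group on one GPU.
    Workers never need to communicate, since the shared encoder is frozen."""
    # torch.cuda.set_device(gpu_id)
    # ... load the frozen encoder, initialize the group's decoder, train ...
    pass

if __name__ == "__main__":
    group_size, num_frames, num_gpus = 20, 120, 4
    groups = [list(range(i, i + group_size)) for i in range(0, num_frames, group_size)]
    # Round-robin frame groups across GPUs; each worker trains independently.
    with Pool(processes=num_gpus) as pool:
        pool.starmap(train_group, [(i % num_gpus, g) for i, g in enumerate(groups)])
```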
The PSNR vs. BPP curve when ablating the model layer dimension and the number of decoder layers separately. Note that SIEDD-S corresponds to the d=512 point, SIEDD-M to the d=768 point, and SIEDD-L to the d=1024 point. The BPP values shown are unquantized.
The PSNR curve for the same ablations as above, plotted against encoding time.