Remove Background Noise from Video Without Re-encoding: An Audio-Only Approach with DeepFilterNet3

The Problem

You record a 15-minute interview, a drone flyover, or a screen capture — and the audio has that familiar hum: HVAC, wind, fan noise, room tone. The footage itself is great. The audio ruins it.


The standard fix is to run the whole file through a video editor or an ffmpeg one-liner with a noise filter. That works, but it usually re-encodes the video stream along the way. For a 6 GB 4K HEVC file, that means:

30–90 minutes of CPU time
A generation of quality loss from re-encoding
Another 6 GB of temporary disk space

There’s a much better way.

The Insight: Only the Audio Needs to Change

Video containers (MP4, MKV) store the video and audio as separate streams. You can replace just the audio track and copy the video bytes untouched — no decoding, no re-encoding, no quality loss.

The pipeline is:

Extract audio → lossless FLAC (about a minute for a multi-gigabyte file)
Denoise the audio with an ML model
Remux: original video stream + cleaned audio → new file (takes seconds)

That 6 GB 4K file? The video remux step takes about 8 seconds. The only slow part is the ML inference on the audio, which is proportional to the audio length — not the video resolution.
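The three steps map onto two ffmpeg invocations with the denoising step in between. Here is a minimal sketch of how those command lines could be assembled. The function names and file names are illustrative rather than taken from the tool, it assumes the first audio stream is the one to clean, and encoding the cleaned audio to AAC is my assumption for MP4 output:

```python
from pathlib import Path


def extract_cmd(video: Path, flac: Path) -> list[str]:
    """Decode only the audio to lossless FLAC; -vn skips the video stream."""
    return ["ffmpeg", "-y", "-i", str(video), "-vn", "-c:a", "flac", str(flac)]


def remux_cmd(video: Path, clean_audio: Path, out: Path) -> list[str]:
    """Attach the cleaned audio to the untouched video stream.

    -c:v copy stream-copies the video packets, so they are never decoded
    or re-encoded. The AAC audio codec is an assumption for this sketch.
    """
    return [
        "ffmpeg", "-y",
        "-i", str(video),        # input 0: original file
        "-i", str(clean_audio),  # input 1: denoised audio
        "-map", "0:v:0",         # take video from the original
        "-map", "1:a:0",         # take audio from the cleaned file
        "-c:v", "copy",
        "-c:a", "aac",
        str(out),
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    print(" ".join(extract_cmd(Path("interview.mp4"), Path("audio.flac"))))
    print(" ".join(remux_cmd(Path("interview.mp4"), Path("audio_clean.flac"),
                             Path("interview_clean.mp4"))))
```

Passing these lists to subprocess.run would execute them. Keeping them as argument lists rather than shell strings avoids quoting problems with file names that contain spaces.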

The Tool

I built denoise, a Python CLI that implements this pipeline with multiple denoising backends and a clean interactive interface:

╭─────────────────────────────────────────────────╮
│ denoise  ·  video background noise removal       │
╰─────────────────────────────────────────────────╯

  file    interview.mp4
  video   hevc
  audio   aac
  length  15m 32s
  size    6.7 GB

   1  DeepFilterNet3 ★   state-of-the-art ML speech enhancement
   2  DeepFilterNet2     previous generation — faster
   3  noisereduce        profiles your actual noise — preserves voice
   4  RNNoise · bd       FFmpeg neural net, no extra deps
   ...

  select method [1]:

  parameter       default   range      description
  passes          1         1–4        runs of the model
  atten_lim_db    15.0      6–40       max attenuation in dB

  adjust (key=value · Enter to keep defaults): passes=2

  ✓  Extracted    77.5 MB  (64s)
  ✓  Denoised     [pass 1/2]  (14m 23s)
  ✓  Denoised     [pass 2/2]  (14m 19s)
  ✓  Audio saved  → outputs/interview_audio_clean.flac
  ✓  Done         → interview_clean.mp4  6.7 GB  (8s)

Denoising Backends

The tool auto-detects what’s installed and builds the menu accordingly.
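One way such detection could work is to check whether each backend's Python module is importable and whether an ffmpeg binary is on the PATH. This is a sketch under those assumptions, not the tool's actual code; the helper names and backend labels are made up for illustration:

```python
import importlib.util
import shutil


def has_module(name: str) -> bool:
    """True if a top-level module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def available_backends() -> list[str]:
    """Return illustrative backend labels for whatever is installed."""
    backends = []
    if has_module("df"):           # the deepfilternet package installs a module named "df"
        backends.append("deepfilternet")
    if has_module("noisereduce"):
        backends.append("noisereduce")
    if shutil.which("ffmpeg"):     # filter availability still depends on the ffmpeg build
        backends.append("ffmpeg")
    return backends


if __name__ == "__main__":
    print(available_backends())
```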

DeepFilterNet3 (best quality)

DeepFilterNet3 is a recurrent neural network trained specifically for speech enhancement. It operates in the frequency domain and separates speech from both stationary noise (steady hum or hiss) and non-stationary noise (sounds that change over time, such as traffic or keyboard clicks). It’s the recommended choice for any footage with a human voice.

It runs on CPU — no GPU required — though inference is slower than the other options (~1× real-time on an M-series Mac).

The key parameter is atten_lim_db: the maximum attenuation applied to any frequency band. The default of 15 dB is conservative and preserves naturalness. Crank it to 30+ for aggressive cleanup at the risk of some artefacts. Setting it below 10 gives a gentle pass that’s almost imperceptible.
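To get a feel for what those dB values mean, it helps to convert them to linear amplitude. The sketch below uses only the generic decibel relation (gain = 10^(-dB/20)); it says nothing about how the model applies the limit internally, and the function name is mine:

```python
def db_to_gain(atten_db: float) -> float:
    """Linear amplitude factor corresponding to an attenuation in decibels."""
    return 10 ** (-abs(atten_db) / 20)


if __name__ == "__main__":
    for db in (6, 15, 30, 40):
        print(f"{db:>2} dB -> noise can be scaled down to {db_to_gain(db):.3f}x amplitude")
```

At the default of 15 dB, noise can drop to roughly 0.18× of its original amplitude; at 30 dB it is about 0.03×, which is why higher settings sound so much more aggressive.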

Multi-pass is particularly effective with DeepFilterNet3. Running the model twice (passes=2) often cleans residual noise that survived the first pass without introducing new artefacts — the model is stable enough to run on its own output.
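Conceptually, multi-pass is just feeding the output of one run back in as the input of the next. A minimal sketch, where denoise stands in for any single-pass function (hypothetical here) and the 1 to 4 limit mirrors the range shown in the menu:

```python
from typing import Callable, Sequence

Audio = Sequence[float]


def run_passes(audio: Audio, denoise: Callable[[Audio], Audio], passes: int = 1) -> Audio:
    """Apply the same denoiser repeatedly, each time on the previous output."""
    if not 1 <= passes <= 4:
        raise ValueError("passes must be between 1 and 4")
    for _ in range(passes):
        audio = denoise(audio)
    return audio
```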

noisereduce

noisereduce uses spectral gating: it samples a short clip of noise-only audio (the first ~0.75s by default), builds a noise profile, then subtracts that profile from the full recording.

This approach shines when your noise is consistent — air conditioning, projector hum, camera body noise. It’s faster than DeepFilterNet and tends to leave voice texture more intact when the noise is well-profiled.

Tune noise_clip_s to point it at a section of your recording that contains only noise (no speech). If your recording starts with speech, trim a noise sample first and adjust accordingly.
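To make the idea of spectral gating concrete, here is a deliberately tiny, dependency-free toy. It is not how noisereduce is implemented: real implementations use overlapping windowed frames and smooth the gate over time and frequency, while this sketch uses non-overlapping frames and a naive DFT. All names, the threshold, and the reduction factor are illustrative:

```python
import cmath
import math


def dft(frame):
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]


def idft(spec):
    n = len(spec)
    return [(sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spec)) / n).real
            for t in range(n)]


def split_frames(samples, size):
    """Non-overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def noise_profile(noise_clip, size):
    """Average magnitude per frequency bin over the noise-only clip."""
    spectra = [dft(f) for f in split_frames(noise_clip, size)]
    return [sum(abs(s[k]) for s in spectra) / len(spectra) for k in range(size)]


def spectral_gate(samples, profile, size, threshold=1.5, reduction=0.1):
    """Attenuate every bin that does not rise clearly above the noise profile."""
    out = []
    for frame in split_frames(samples, size):
        gated = [c if abs(c) > threshold * profile[k] else c * reduction
                 for k, c in enumerate(dft(frame))]
        out.extend(idft(gated))
    return out


# Toy recording: 50 Hz hum throughout, a 200 Hz "voice" that starts after 64 samples.
SR, SIZE = 800, 32
hum = [0.1 * math.sin(2 * math.pi * 50 * t / SR) for t in range(192)]
voice = [0.0] * 64 + [math.sin(2 * math.pi * 200 * t / SR) for t in range(128)]
noisy = [h + v for h, v in zip(hum, voice)]

profile = noise_profile(noisy[:64], SIZE)   # the noise-only opening of the recording
clean = spectral_gate(noisy, profile, SIZE)

if __name__ == "__main__":
    before = max(abs(n - v) for n, v in zip(noisy, voice))
    after = max(abs(c - v) for c, v in zip(clean, voice))
    print(f"residual hum before: {before:.3f}, after: {after:.3f}")
```

The toy also shows why the noise clip matters: if the profile were taken from a section that contains speech, the speech bins would end up in the profile and get gated too.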

FFmpeg RNNoise (arnndn)

ffmpeg’s built-in arnndn filter runs Mozilla’s RNNoise — a recurrent neural net — entirely in C. No Python dependencies at all once you have a .rnnn model file.

Clone the model zoo:

git clone https://github.com/GregorR/rnnoise-models rnn-models

The tool will pick them up automatically and list each one in the menu. The models vary in training data; beguiling-drafter and conjoined-burgers tend to work well for general speech.
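Picking the models up can be as simple as searching the cloned directory for .rnnn files and passing the chosen one to arnndn through its m (model) option. A sketch under that assumption; the function names are mine, and the path in the usage line is only an example:

```python
from pathlib import Path


def find_rnnoise_models(root: Path) -> list[Path]:
    """Return every .rnnn file under root, sorted for a stable menu order."""
    return sorted(root.rglob("*.rnnn"))


def arnndn_filter(model: Path) -> str:
    """Build the value for ffmpeg's -af option.

    Paths containing characters that are special in filter graphs
    (colons, commas, backslashes) would need escaping, which this
    sketch does not attempt.
    """
    return f"arnndn=m={model.as_posix()}"


if __name__ == "__main__":
    for model in find_rnnoise_models(Path("rnn-models")):
        print(model.stem, "->", arnndn_filter(model))
```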

afftdn and anlmdn

These are pure-ffmpeg filters, no extra installs needed.

afftdn (ffmpeg’s FFT-based audio denoiser) uses spectral subtraction. The critical parameters are nr (noise reduction in dB, default 15) and nf (noise floor in dBFS, default -40). The original reason for building this tool was that the nf=-25 value found in many ffmpeg recipes is far too aggressive: it treats everything below -25 dBFS as noise, which eats soft consonants and produces the hoarse, “underwater” quality people complain about.
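For reference, nr and nf are passed to ffmpeg as a filter string on the -af option. A small sketch of building that string, using the defaults quoted above; the helper name is hypothetical:

```python
def afftdn_filter(nr: float = 15.0, nf: float = -40.0) -> str:
    """Build an afftdn filter string: nr is the reduction in dB, nf the noise floor."""
    return f"afftdn=nr={nr:g}:nf={nf:g}"


if __name__ == "__main__":
    print("gentle:    ", afftdn_filter())          # afftdn=nr=15:nf=-40
    print("aggressive:", afftdn_filter(nf=-25.0))  # afftdn=nr=15:nf=-25
```

The resulting string goes straight into a command such as ffmpeg -i in.flac -af "afftdn=nr=15:nf=-40" out.flac.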

anlmdn (audio non-local means denoising) compares short overlapping windows of audio with similar windows nearby and averages them to suppress broadband noise. It’s gentler on transients than spectral subtraction and worth trying when afftdn sounds over-processed.

Install

git clone https://github.com/lynchaos/denoise
cd denoise
python3 -m venv .venv
source .venv/bin/activate
pip install rich

For the ML backends:

# macOS — Rust is required to compile deepfilterlib
brew install rust
pip install deepfilternet noisereduce soundfile torch torchaudio torchcodec

Model weights (~50 MB) are downloaded from Hugging Face on first use and cached locally.

Usage

python denoise.py video.mp4

Select a method, optionally adjust parameters inline (passes=2 atten_lim_db=20), and wait. The cleaned audio is saved to outputs/ as a standalone FLAC alongside the remuxed video — useful for checking the result before committing.
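The inline key=value syntax is easy to parse and validate. Here is a minimal sketch of how that could look, using the defaults and ranges from the DeepFilterNet3 menu shown earlier. The table layout and function name are my own, not the tool's:

```python
# name: (type, default, minimum, maximum), taken from the parameter menu above
PARAMS = {
    "passes": (int, 1, 1, 4),
    "atten_lim_db": (float, 15.0, 6.0, 40.0),
}


def parse_overrides(line: str) -> dict:
    """Parse 'key=value key=value' into a dict, starting from the defaults."""
    values = {name: spec[1] for name, spec in PARAMS.items()}
    for token in line.split():
        key, sep, raw = token.partition("=")
        if not sep or key not in PARAMS:
            raise ValueError(f"unknown parameter: {token!r}")
        cast, _, lo, hi = PARAMS[key]
        value = cast(raw)  # raises ValueError on malformed numbers
        if not lo <= value <= hi:
            raise ValueError(f"{key} must be between {lo} and {hi}")
        values[key] = value
    return values


if __name__ == "__main__":
    print(parse_overrides("passes=2 atten_lim_db=20"))
```

An empty line (just pressing Enter) returns the defaults unchanged.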

Technical Notes

Why FLAC for the intermediate file? It’s lossless and roughly half the size of WAV, which matters when you’re working with 15+ minute recordings. Because the intermediate is lossless, the extract and denoise steps add no generational loss of their own.

Why not just use ffmpeg’s arnndn filter directly on the video? You can — but you lose the ability to use Python-based ML models (DeepFilterNet, noisereduce), tune parameters interactively, save the isolated audio, or run multi-pass. The intermediate extraction step costs ~60 seconds but unlocks the full range of options.

Container compatibility: MP4 is used when the source is MP4 with an H.264/HEVC/AV1 video stream. Everything else falls back to MKV, which accepts virtually any codec combination without complaints.
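That rule is small enough to express in a few lines. A sketch, assuming the codec name arrives in the lowercase form ffprobe reports (h264, hevc, av1); the function name is illustrative:

```python
from pathlib import Path

MP4_SAFE_VIDEO = {"h264", "hevc", "av1"}


def output_suffix(source: Path, video_codec: str) -> str:
    """Keep MP4 only for MP4 sources with a codec from the list above; else MKV."""
    if source.suffix.lower() == ".mp4" and video_codec.lower() in MP4_SAFE_VIDEO:
        return ".mp4"
    return ".mkv"


if __name__ == "__main__":
    print(output_suffix(Path("interview.mp4"), "hevc"))
    print(output_suffix(Path("capture.mov"), "prores"))
```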

Stereo handling in DeepFilterNet3: The model is trained on mono speech, so stereo channels are processed independently and then recombined. This avoids the phase artefacts that can appear when stereo audio is naively downmixed to mono and back.
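In code, processing channels independently amounts to running the mono denoiser once per channel and putting the results back in the same order. A minimal sketch with plain lists; denoise_mono is a placeholder for whatever single-channel function is in use:

```python
from typing import Callable


def denoise_per_channel(
    channels: list[list[float]],
    denoise_mono: Callable[[list[float]], list[float]],
) -> list[list[float]]:
    """Run a mono denoiser on each channel separately, keeping channel order."""
    cleaned = [denoise_mono(ch) for ch in channels]
    if len({len(ch) for ch in cleaned}) > 1:
        raise ValueError("channels must stay the same length to be recombined")
    return cleaned
```

Left and right are never summed, so whatever phase relationship the two channels had in the source is not disturbed by a downmix.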

Source

MIT licensed: github.com/lynchaos/denoise

Kemal Yaylali
orcid.org/0000-0003-1190-7807