Political speeches, candidate interviews, and public hearings contain dozens of quotable moments — but finding them means scrubbing through hours of video. I built a Claude Code skill that speeds up finding the right section of a speech.
The skill, called speech-clip-extractor, takes a video file and a set of topics you care about, then returns a table of timestamps, quotes, and reasons each moment is worth clipping. From there, it extracts the clips using ffmpeg.
Transcription with mlx-whisper
The first step is generating a subtitle file. I used mlx-whisper, Apple’s port of OpenAI’s Whisper optimized for Apple Silicon. Running on the GPU via Metal instead of the CPU, it transcribes a 60-minute video in a few minutes rather than the better part of an hour. One gotcha: mlx-whisper only installs on an arm64 Python. If you’re running Anaconda, which ships as x86_64 under Rosetta, the install will silently fail. The fix is to use Homebrew’s Python instead.
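For reference, the transcription step boils down to a single CLI call plus the architecture check. A sketch of both (the model name and flags follow the standard mlx-whisper CLI, but treat them as assumptions — check `mlx_whisper --help` on your machine):

```python
import platform
import shlex


def whisper_command(video_path: str,
                    model: str = "mlx-community/whisper-large-v3-mlx") -> list[str]:
    """Build an mlx_whisper invocation that writes a .vtt subtitle file."""
    return [
        "mlx_whisper", video_path,
        "--model", model,            # any mlx-community Whisper checkpoint
        "--output-format", "vtt",    # the skill reads the VTT transcript
    ]


def arch_ok() -> bool:
    """mlx-whisper needs an arm64 Python; x86_64 (Rosetta/Anaconda) won't work."""
    return platform.machine() == "arm64"


print(shlex.join(whisper_command("interview.mp4")))
```

Printing the joined command before running it is a cheap way to confirm you are not accidentally inside an x86_64 environment.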
Finding the highlights
Once Claude Code has the VTT transcript, it reads the full text and looks for moments that match the topics I specified. For a recent recorded interview with a Cook County Commissioner candidate, I asked it to find moments about property taxes, vacant land, housing, and the Cook County Land Bank. Claude returned a table of eight clips with timestamps, direct quotes, and a one-line note on why each moment was notable — things like “specific stat, strong soundbite” or “names names, calls out a specific corrupt dynamic.”
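Claude Code reads the VTT transcript directly, but the structure it works from is simple: each cue pairs a timestamp range with a line of text. A minimal parser sketch, useful if you want to pre-filter cues yourself (the regex and field names here are my own illustration, not the skill’s code):

```python
import re

# One WebVTT cue: "HH:MM:SS.mmm --> HH:MM:SS.mmm" followed by the caption text.
CUE = re.compile(
    r"(\d{2}:\d{2}:\d{2})\.\d{3} --> (\d{2}:\d{2}:\d{2})\.\d{3}\n(.+?)(?:\n\n|\Z)",
    re.S,
)


def parse_vtt(text: str) -> list[dict]:
    """Return start/end/text cues from a WebVTT transcript."""
    return [
        {"start": m.group(1), "end": m.group(2), "text": " ".join(m.group(3).split())}
        for m in CUE.finditer(text)
    ]


sample = """WEBVTT

00:14:05.000 --> 00:14:12.000
Everything is just too damned expensive.
"""
cues = parse_vtt(sample)
```

Note that Whisper tools sometimes emit `MM:SS.mmm` timestamps without the hour field; a robust parser would accept both forms.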
I also used this skill on Governor Pritzker’s budget address from last week, to locate the parts about housing that I had heard live as he was speaking. It let me quickly pull clips of the moment he said “everything is just too damned expensive” and of his remarks on parking reform.
Cutting the clips
The skill uses ffmpeg to extract each clip and optionally concatenate them into a single video. Even if you skip the extraction step, having the timestamps alone is immensely helpful for jumping to the right spot in a long video file.
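Under the hood this is a plain ffmpeg invocation per clip, plus a list file for the concat demuxer. A sketch of the commands involved (the skill’s exact flags may differ; `-c copy` avoids re-encoding, at the cost of cuts snapping to the nearest keyframe):

```python
import shlex


def clip_command(src: str, start: str, end: str, out: str) -> list[str]:
    """ffmpeg command to cut [start, end] out of src without re-encoding.

    Seeking after -i is frame-accurate on the timestamps; with stream copy
    the actual cut still lands on the nearest keyframe.
    """
    return ["ffmpeg", "-i", src, "-ss", start, "-to", end, "-c", "copy", out]


def concat_list(clips: list[str]) -> str:
    """Contents of the text file ffmpeg's concat demuxer expects."""
    return "\n".join(f"file '{c}'" for c in clips) + "\n"


print(shlex.join(clip_command("interview.mp4", "00:14:05", "00:14:12", "clip1.mp4")))
```

To join the clips, write `concat_list(...)` to a file and run `ffmpeg -f concat -safe 0 -i list.txt -c copy highlights.mp4`.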
For vertical social media cuts, it applies a 9:16 center crop. When there are two speakers in the frame, it switches the crop to follow whoever is talking, cutting to the left side of the frame when the interviewer speaks and the right side when the candidate answers. In my experience so far, though, the automatic crop is poorly done, so I create my own crop instead. (I use Final Cut Pro for iPad.)
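The geometry behind that crop is simple arithmetic on the source frame: keep the full height, take a 9:16-wide slice, and pick the horizontal offset by speaker side. A sketch (the function and the left/right convention are my own illustration, not the skill’s implementation):

```python
def vertical_crop(width: int, height: int, side: str = "center") -> str:
    """Build an ffmpeg crop filter string for a 9:16 vertical cut."""
    crop_w = (height * 9 // 16) & ~1  # 9:16 width at full height, rounded to even
    x = {
        "left": 0,                       # interviewer's side of the frame
        "center": (width - crop_w) // 2,
        "right": width - crop_w,         # candidate's side of the frame
    }[side]
    return f"crop={crop_w}:{height}:{x}:0"


vertical_crop(1920, 1080)  # → "crop=606:1080:657:0"
```

The returned string drops straight into ffmpeg as `-vf "crop=606:1080:657:0"`; rounding the width to an even number avoids encoder errors with yuv420p video.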
The whole workflow — transcription, analysis, extraction — takes about ten minutes for a one-hour video. What used to require scrubbing through footage manually is now a matter of describing what you’re looking for and letting the model find it.