Using Modal.com for on-demand parallel processing, this project can transcribe a one-hour audio file in about 1 minute.
Repo: https://github.com/mharrvic/fast-audio-video-transcribe-with-whisper-and-modal
"Modal’s dead-simple parallelism primitives are the key to doing the transcription so quickly. Even with a GPU, transcribing a full episode serially was taking around 10 minutes. But by pulling in ffmpeg with a simple .pip_install("ffmpeg-python") addition to our Modal Image, we could exploit the natural silences of the podcast medium to partition episodes into hundreds of short segments. Each segment is transcribed by Whisper in its own container task with 2 physical CPU cores, and when all are done we stitch the segments back together with only a minimal loss in transcription quality. This approach actually accords quite well with Whisper’s model architecture:"
“The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder.” - Introducing Whisper
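The partition-and-stitch idea described in the quote can be sketched in a few lines. This is a minimal illustration, not the code from this repo or from Modal's example. It assumes the silence intervals have already been detected (for instance with ffmpeg). The names `split_on_silences`, `stitch`, `transcribe_in_parallel`, and `min_segment` are hypothetical, and a local thread pool stands in for Modal's per-segment containers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)
Piece = Dict[str, object]      # {"text": str, "start": float, "end": float}


def split_on_silences(
    duration: float,
    silences: List[Segment],
    min_segment: float = 30.0,
) -> List[Segment]:
    """Cut [0, duration] at the midpoint of each detected silence.

    A cut is skipped when it would produce a segment shorter than
    min_segment. The final segment may still be shorter than that.
    """
    segments: List[Segment] = []
    start = 0.0
    for silence_start, silence_end in sorted(silences):
        cut = (silence_start + silence_end) / 2.0
        if cut - start >= min_segment and cut < duration:
            segments.append((start, cut))
            start = cut
    if duration > start:
        segments.append((start, duration))
    return segments


def stitch(results: List[Tuple[float, List[Piece]]]) -> List[Piece]:
    """Shift segment-relative timestamps back onto the full timeline."""
    stitched: List[Piece] = []
    for segment_start, pieces in sorted(results, key=lambda item: item[0]):
        for piece in pieces:
            stitched.append(
                {
                    "text": piece["text"],
                    "start": segment_start + piece["start"],
                    "end": segment_start + piece["end"],
                }
            )
    return stitched


def transcribe_in_parallel(
    segments: List[Segment],
    transcribe: Callable[[Segment], List[Piece]],
    max_workers: int = 8,
) -> List[Piece]:
    """Run transcribe on every segment concurrently, then stitch.

    transcribe is a placeholder for the real per-segment Whisper call;
    it must return pieces with timestamps relative to the segment start.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_segment = list(pool.map(transcribe, segments))  # keeps input order
    starts = [segment_start for segment_start, _ in segments]
    return stitch(list(zip(starts, per_segment)))
```

Because each segment is transcribed independently, wall-clock time is bounded by the slowest segment rather than the sum of all segments, which is where the speed-up comes from.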
Demo
Audio Transcription
Video Transcription
How to use
- Create a Modal account and get your API key.
- Run these commands to install the Modal client and generate a token:

      pip install modal-client
      modal token new

  - The first command installs the Modal client library on your computer, along with its dependencies.
  - The second command creates an API token by authenticating through your web browser. It opens a new tab, which you can close when you are done.
- Deploy your Modal project with the following command:

      modal deploy api.main
- Transcribe your audio file using the following curl command. Replace `your-modal-endpoint`, `your-audio-src-url`, `title_slug`, and `is_video` (`true` for video transcription) with your own values.

      curl --location --request POST 'your-modal-endpoint/api/transcribe?src_url=your-audio-src-url&title_slug=your-amazing-title-slug&is_video=false'

  Sample response:

      { "call_id": "your-call-id" }
- Check the status of your transcription using the following curl command. Replace `your-call-id` with the `call_id` returned by the previous command.

      curl --location 'your-modal-endpoint/api/status/your-call-id'

  Sample initial response:

      { "finished": false, "total_segments": 49, "tasks": 49, "done_segments": 0 }

  Sample final response (poll this endpoint until `finished` is `true`):

      { "finished": true, "total_segments": 49, "tasks": 49, "done_segments": 49 }
- Download the transcription using the following curl command. Replace `your-modal-endpoint` and `your-title-slug` with your own values (the slug is the `title_slug` you used in the transcribe request).

      curl --location 'your-modal-endpoint/api/audio/your-title-slug'

  Sample response:

      {
        "segments": [
          {
            "text": " Productivity also means that you're able to maximize the hours that you have and also rest deliberately in between. That's real productivity because if you're just constantly working without breaks and without really knowing what your goals are and what you're achieving,",
            "start": 0.0,
            "end": 19.0
          },
          {
            "text": " that's not productivity, that's just busyness. So that's the difference between productivity and busyness and it really starts from the very beginning of your day.",
            "start": 19.0,
            "end": 45.0
          }
        ]
      }
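The submit, poll, and download steps above can also be scripted. The following is a minimal sketch using only the Python standard library; it is not part of the repo. The endpoint paths and the JSON fields (`call_id`, `finished`, `segments`) are taken from the curl examples above, while the function names, the `fetch` and `sleep` injection points, and the `poll_seconds` and `max_polls` parameters are hypothetical choices for illustration.

```python
import json
import time
import urllib.parse
import urllib.request


def http_json(url, method="GET"):
    """Send a request and decode the JSON body of the response."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def transcribe_url(base_url, src_url, title_slug, is_video=False):
    """Build the /api/transcribe URL with URL-encoded query parameters."""
    query = urllib.parse.urlencode(
        {
            "src_url": src_url,
            "title_slug": title_slug,
            "is_video": str(is_video).lower(),
        }
    )
    return f"{base_url.rstrip('/')}/api/transcribe?{query}"


def run_transcription(
    base_url,
    src_url,
    title_slug,
    is_video=False,
    fetch=http_json,      # injectable so the flow can be tested offline
    poll_seconds=5.0,     # assumed polling interval, tune as needed
    max_polls=720,        # assumed upper bound to avoid polling forever
    sleep=time.sleep,
):
    """Submit a job, poll its status until finished, return the transcript."""
    base = base_url.rstrip("/")
    submit_url = transcribe_url(base, src_url, title_slug, is_video)
    call_id = fetch(submit_url, "POST")["call_id"]

    for _ in range(max_polls):
        status = fetch(f"{base}/api/status/{call_id}", "GET")
        if status["finished"]:
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"transcription {call_id} did not finish in time")

    # The download path below follows the audio example in this README.
    return fetch(f"{base}/api/audio/{title_slug}", "GET")


def transcript_text(result):
    """Join segment texts into one string (each text starts with a space)."""
    return "".join(segment["text"] for segment in result["segments"]).strip()
```

A call would look like `run_transcription("your-modal-endpoint", "your-audio-src-url", "your-title-slug")`, followed by `transcript_text(...)` on the returned dictionary to get the plain text.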
Resources:
- https://modal.com/docs/guide/whisper-transcriber (most of the code was adapted from here)
- https://openai.com/blog/whisper/