I wanted to get the transcript of a video for a project I've been working on. Naturally, I googled for some scripts, only to be left disappointed. There was always some unforeseen error, which is really irritating when the code clearly looks like it should work. So I set out to build my own little app that does this. Transcribing audio from YouTube videos can be incredibly useful, whether you're creating subtitles, making content more accessible, or simply converting speech to text for easier reference. This guide will walk you through a step-by-step process to build a YouTube audio transcription application using Python, Streamlit, and the Whisper model by OpenAI.
Setting Up the Environment
First, let's set up our Python environment. Create a new directory for your project and navigate into it:
mkdir youtube-transcription
cd youtube-transcription
Next, create a virtual environment to keep our dependencies isolated:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
Now, let's install the required libraries. Create a requirements.txt file with the following libraries and their versions:
yt-dlp==2024.7.16
python-dotenv==1.0.1
openai-whisper==20231117
streamlit==1.36.0
ffmpeg==1.4
Install the dependencies:
pip install -r requirements.txt
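One note before moving on: the ffmpeg entry in requirements.txt is only a thin Python wrapper. Both yt-dlp's audio extraction step and Whisper call the actual FFmpeg binary, so it also needs to be installed on your system, for example:
brew install ffmpeg      # macOS (Homebrew)
sudo apt install ffmpeg  # Debian/Ubuntu
On Windows, download a build from ffmpeg.org and add it to your PATH.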
Writing the Code
We'll break down the project into three main files: app.py, get_audio.py, and get_transcript.py.
app.py
This is the main application file where we use Streamlit to create a simple web interface.
import streamlit as st
import os
from get_audio import download_audio_from_youtube
from get_transcript import transcribe_audio_to_text

st.title("YouTube Audio Transcription")

youtube_url = st.text_input("Enter YouTube URL:")

if st.button("Transcribe"):
    if youtube_url:
        try:
            audio_file_path = download_audio_from_youtube(youtube_url, output_path='.')
            if os.path.exists(audio_file_path):
                transcription = transcribe_audio_to_text(audio_file_path)
                st.subheader("Transcription")
                st.write(transcription)
            else:
                st.error(f"Error: File {audio_file_path} does not exist.")
        except Exception as e:
            st.error(f"Error: {e}")
Explanation:
- Streamlit Setup: This file sets up a basic Streamlit web app. The st.title() function creates a title for the web app.
- User Input: The st.text_input() function takes a YouTube URL from the user.
- Button Interaction: When the "Transcribe" button is clicked, the code attempts to download and transcribe the audio (an optional enhancement is sketched right after this list).
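Since transcription can take a while, you might want to show the user some feedback while the model runs. Here's a purely optional sketch of app.py (not part of the repo) that wraps the work in st.spinner() and offers the result through st.download_button(); it assumes the same two helper functions described below.
import streamlit as st
from get_audio import download_audio_from_youtube
from get_transcript import transcribe_audio_to_text

st.title("YouTube Audio Transcription")
youtube_url = st.text_input("Enter YouTube URL:")

if st.button("Transcribe") and youtube_url:
    try:
        # Show a spinner while downloading and transcribing
        with st.spinner("Downloading and transcribing..."):
            audio_file_path = download_audio_from_youtube(youtube_url, output_path='.')
            transcription = transcribe_audio_to_text(audio_file_path)
        st.subheader("Transcription")
        st.write(transcription)
        # Let the user save the transcript as a plain-text file
        st.download_button("Download transcript", transcription, file_name="transcript.txt")
    except Exception as e:
        st.error(f"Error: {e}")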
get_audio.py
This file handles downloading the audio from YouTube and converting it to a suitable format for transcription.
import os
import re

import yt_dlp as youtube_dl
from dotenv import load_dotenv


def sanitize_filename(filename):
    # Replace any invalid character with an underscore
    sanitized = re.sub(r'[^a-zA-Z0-9_\-]', '_', filename)
    # Collapse repeated underscores and strip leading/trailing underscores
    sanitized = re.sub(r'_+', '_', sanitized).strip('_')
    return sanitized


def download_audio_from_youtube(youtube_url, output_path='.'):
    load_dotenv('.env')
    ydl_opts = {
        'username': os.getenv('YOUTUBE_EMAIL'),
        'password': os.getenv('YOUTUBE_PASSWORD'),
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',  # Use FFmpeg to extract audio
            'preferredcodec': 'mp3',      # Convert audio to mp3
            'preferredquality': '192',    # Use a bitrate of 192 kbps
        }],
        'nocheckcertificate': True,       # Ignore SSL certificate errors
    }

    # Extract video information without downloading
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        info_dict = ydl.extract_info(youtube_url, download=False)
        title = info_dict.get('title', 'audio')

    sanitized_title = sanitize_filename(title)
    audio_file_path = os.path.join(output_path, sanitized_title)
    ydl_opts['outtmpl'] = audio_file_path

    # Download the audio using the sanitized file name
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])

    return audio_file_path + '.mp3'
Explanation:
- Environment Variables: Loads YouTube credentials from a .env file using python-dotenv. This keeps sensitive information like your YouTube login credentials out of your scripts rather than hard-coding it.
- Sanitizing Filenames: The sanitize_filename() function ensures filenames are safe for the filesystem by replacing invalid characters with underscores, which prevents naming issues across different operating systems (a quick example follows this list).
Downloading Audio:
- yt-dlp Setup: The ydl_opts dictionary configures yt-dlp (a popular tool for downloading videos and audio from YouTube). Key options include:
- Authentication: Uses your YouTube email and password to handle private or age-restricted videos.
- Format: Specifies the best available audio format.
- Post-processing: Uses FFmpeg to extract and convert the audio to MP3 format with a bitrate of 192 kbps.
- Certificate Handling: Ignores SSL certificate errors to ensure the download process is not interrupted by certificate issues.
- Extracting Video Information: Before downloading, the script extracts information about the video, such as its title, to create a sanitized filename.
- Downloading and Converting Audio: The script then downloads the audio, converts it to MP3, and saves it to the specified output path.
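To make the filename handling concrete, here's roughly what sanitize_filename() produces for a typical video title (the title string is just a made-up example):
from get_audio import sanitize_filename

# Hypothetical title, used only to illustrate the function
title = "My Talk: Intro to Whisper (Part 1)!"
print(sanitize_filename(title))  # -> My_Talk_Intro_to_Whisper_Part_1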
get_transcript.py
This file is responsible for transcribing the downloaded audio using the Whisper model.
import whisper


def transcribe_audio_to_text(audio_path):
    # Load the Whisper model
    model = whisper.load_model("small")
    # Transcribe the audio file
    result = model.transcribe(audio_path)
    return result["text"]
Explanation:
- Loading the Model: The transcribe_audio_to_text() function loads the pre-trained Whisper model. This model is specifically designed for high-quality speech recognition.
- The Whisper model by OpenAI comes in multiple sizes, each balancing accuracy against computational cost. The tiny and base versions are fast and light on resources, making them suitable for devices with limited power, but they are also the least accurate. The small, medium, and large versions offer progressively better accuracy at the cost of more compute and memory: small is a good middle ground between speed and accuracy, while large is the most accurate and best suited to high-precision tasks on capable hardware. Choosing the right model depends on your needs and resources. If you want quick, reasonably accurate transcriptions on a personal laptop, the small or medium models are a good fit; for the highest accuracy with access to powerful hardware, the large model is ideal. (A small sketch for swapping models follows this list.)
- Transcription: The function then uses this model to transcribe the given audio file and returns the text, converting the spoken words in the audio into written form.
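If you want to experiment with different model sizes without editing the code each time, one option (purely a sketch, not part of the repo; the WHISPER_MODEL variable name is my own invention) is to read the model name from an environment variable and pass fp16=False, which silences Whisper's "FP16 is not supported on CPU" warning on machines without a GPU:
import os

import whisper


def transcribe_audio_to_text(audio_path):
    # Pick the model size from an environment variable, defaulting to "small"
    model_name = os.getenv("WHISPER_MODEL", "small")
    model = whisper.load_model(model_name)
    # fp16=False avoids the half-precision warning on CPU-only machines
    result = model.transcribe(audio_path, fp16=False)
    return result["text"]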
Running the Application
With the code in place, create a .env file in the root of your project directory and add your YouTube credentials.
YOUTUBE_EMAIL=your_youtube_email
YOUTUBE_PASSWORD=your_youtube_password
Important note: Don't forget to add the .env file to .gitignore, or you might accidentally commit it to git.
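A minimal .gitignore for this project might look something like this (the *.mp3 entry is just a suggestion so downloaded audio doesn't end up in the repo either):
venv/
.env
*.mp3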
To run the application, use the following command:
streamlit run app.py
This opens up a browser tab, and you'll see a simple interface where you can enter a YouTube URL and get the transcribed text.

Conclusion
In this guide, we walked through the process of setting up a Python project to transcribe YouTube audio using yt-dlp, Streamlit, and the Whisper model. This code can be incredibly useful for converting spoken content into text, enhancing accessibility, and enabling further analysis.
You can clone the repo here: https://github.com/naveen-malla/Youtube_Summarizer
I am soon going to add LLM capability to the project to summarize the transcript and enable chatting with it.
If you enjoyed this post, please consider
- holding the clap button for a few seconds (it goes up to 50) and
- following me for more updates.
It gives me the motivation to keep going and helps the story reach more people like you. I share stories every week about machine learning concepts and tutorials on interesting projects. See you next week. Happy learning!