I wanted to get the transcript of a video for a project I've been working on. Naturally, I googled for some scripts, only to be left disappointed. There was always some unforeseen error, which is really irritating when the code clearly looks like it should work. So I set out to build my own little app that does this. Transcribing audio from YouTube videos can be incredibly useful, whether you're creating subtitles, making content more accessible, or simply converting speech to text for easier reference. This guide will walk you through a step-by-step process to build a YouTube audio transcription application using Python, Streamlit, and the Whisper model by OpenAI.
Setting Up the Environment
First, let's set up our Python environment. Create a new directory for your project and navigate into it:
mkdir youtube-transcription
cd youtube-transcription
Next, create a virtual environment to keep our dependencies isolated:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
Now, let's install the required libraries. Create a requirements.txt file with the following libraries and their versions:
yt-dlp==2024.7.16
python-dotenv==1.0.1
openai-whisper==20231117
streamlit==1.36.0
ffmpeg==1.4
Install the dependencies:
pip install -r requirements.txt
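One note before moving on: the ffmpeg entry in requirements.txt is only a thin Python wrapper. Both yt-dlp's audio extraction step and Whisper call the actual FFmpeg binary, so it also needs to be installed on your system, for example:
brew install ffmpeg      # macOS (Homebrew)
sudo apt install ffmpeg  # Debian/Ubuntu
On Windows, download a build from ffmpeg.org and add it to your PATH.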
Writing the Code
We'll break down the project into three main files: app.py, get_audio.py, and get_transcript.py.
app.py
This is the main application file where we use Streamlit to create a simple web interface.
import streamlit as st
import os
from get_audio import download_audio_from_youtube
from get_transcript import transcribe_audio_to_text

st.title("YouTube Audio Transcription")

youtube_url = st.text_input("Enter YouTube URL:")

if st.button("Transcribe"):
    if youtube_url:
        try:
            audio_file_path = download_audio_from_youtube(youtube_url, output_path='.')
            if os.path.exists(audio_file_path):
                transcription = transcribe_audio_to_text(audio_file_path)
                st.subheader("Transcription")
                st.write(transcription)
            else:
                st.error(f"Error: File {audio_file_path} does not exist.")
        except Exception as e:
            st.error(f"Error: {e}")
Explanation:
- Streamlit Setup: This file sets up a basic Streamlit web app. The st.title() function creates a title for the web app.
- User Input: The st.text_input() function takes a YouTube URL from the user.
- Button Interaction: When the "Transcribe" button is clicked, the code attempts to download and transcribe the audio (an optional enhancement is sketched right after this list).
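Since transcription can take a while, you might want to show the user some feedback while the model runs. Here's a purely optional sketch of app.py (not part of the repo) that wraps the work in st.spinner() and offers the result through st.download_button(); it assumes the same two helper functions described below.
import streamlit as st
from get_audio import download_audio_from_youtube
from get_transcript import transcribe_audio_to_text

st.title("YouTube Audio Transcription")
youtube_url = st.text_input("Enter YouTube URL:")

if st.button("Transcribe") and youtube_url:
    try:
        # Show a spinner while downloading and transcribing
        with st.spinner("Downloading and transcribing..."):
            audio_file_path = download_audio_from_youtube(youtube_url, output_path='.')
            transcription = transcribe_audio_to_text(audio_file_path)
        st.subheader("Transcription")
        st.write(transcription)
        # Let the user save the transcript as a plain-text file
        st.download_button("Download transcript", transcription, file_name="transcript.txt")
    except Exception as e:
        st.error(f"Error: {e}")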
get_audio.py
This file handles downloading the audio from YouTube and converting it to a suitable format for transcription.
import os
import re

import yt_dlp as youtube_dl
from dotenv import load_dotenv


def sanitize_filename(filename):
    # Replace any invalid character with an underscore
    sanitized = re.sub(r'[^a-zA-Z0-9_\-]', '_', filename)
    # Collapse repeated underscores and strip leading/trailing underscores
    sanitized = re.sub(r'_+', '_', sanitized).strip('_')
    return sanitized


def download_audio_from_youtube(youtube_url, output_path='.'):
    load_dotenv('.env')
    ydl_opts = {
        'username': os.getenv('YOUTUBE_EMAIL'),
        'password': os.getenv('YOUTUBE_PASSWORD'),
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',  # Use FFmpeg to extract audio
            'preferredcodec': 'mp3',      # Convert audio to mp3
            'preferredquality': '192',    # Use a bitrate of 192 kbps
        }],
        'nocheckcertificate': True,       # Ignore SSL certificate errors
    }

    # Extract video information without downloading
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        info_dict = ydl.extract_info(youtube_url, download=False)
        title = info_dict.get('title', 'audio')

    sanitized_title = sanitize_filename(title)
    audio_file_path = os.path.join(output_path, sanitized_title)
    ydl_opts['outtmpl'] = audio_file_path

    # Download the audio using the sanitized file name
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])

    return audio_file_path + '.mp3'
Explanation:
- Environment Variables: Loads YouTube credentials from a .env file using python-dotenv. This keeps sensitive information like your YouTube login credentials out of your scripts rather than hard-coding it.
- Sanitizing Filenames: The sanitize_filename() function ensures filenames are safe for the filesystem by replacing invalid characters with underscores, which prevents naming issues across different operating systems (a quick example follows this list).
Downloading Audio:
- yt-dlp Setup: The ydl_opts dictionary configures yt-dlp (a popular tool for downloading videos and audio from YouTube). Key options include:
- Authentication: Uses your YouTube email and password to handle private or age-restricted videos.
- Format: Specifies the best available audio format.
- Post-processing: Uses FFmpeg to extract and convert the audio to MP3 format with a bitrate of 192 kbps.
- Certificate Handling: Ignores SSL certificate errors to ensure the download process is not interrupted by certificate issues.
- Extracting Video Information: Before downloading, the script extracts information about the video, such as its title, to create a sanitized filename.
- Downloading and Converting Audio: The script then downloads the audio, converts it to MP3, and saves it to the specified output path.
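To make the filename handling concrete, here's roughly what sanitize_filename() produces for a typical video title (the title string is just a made-up example):
from get_audio import sanitize_filename

# Hypothetical title, used only to illustrate the function
title = "My Talk: Intro to Whisper (Part 1)!"
print(sanitize_filename(title))  # -> My_Talk_Intro_to_Whisper_Part_1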
get_transcript.py
This file is responsible for transcribing the downloaded audio using the Whisper model.
import whisper


def transcribe_audio_to_text(audio_path):
    # Load the Whisper model
    model = whisper.load_model("small")
    # Transcribe the audio file
    result = model.transcribe(audio_path)
    return result["text"]
Explanation:
- Loading the Model: The transcribe_audio_to_text() function loads the pre-trained Whisper model. This model is specifically designed for high-quality speech recognition.
- The Whisper model by OpenAI comes in multiple sizes, each balancing accuracy against computational cost. The tiny and base versions are fast and light on resources, making them suitable for devices with limited power, but they are also the least accurate. The small, medium, and large versions offer progressively better accuracy at the cost of more compute and memory: small is a good middle ground between speed and accuracy, while large is the most accurate and best suited to high-precision tasks on capable hardware. Choosing the right model depends on your needs and resources. If you want quick, reasonably accurate transcriptions on a personal laptop, the small or medium models are a good fit; for the highest accuracy with access to powerful hardware, the large model is ideal. (A small sketch for swapping models follows this list.)
- Transcription: The function then uses this model to transcribe the given audio file and returns the text, converting the spoken words in the audio into written form.
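If you want to experiment with different model sizes without editing the code each time, one option (purely a sketch, not part of the repo; the WHISPER_MODEL variable name is my own invention) is to read the model name from an environment variable and pass fp16=False, which silences Whisper's "FP16 is not supported on CPU" warning on machines without a GPU:
import os

import whisper


def transcribe_audio_to_text(audio_path):
    # Pick the model size from an environment variable, defaulting to "small"
    model_name = os.getenv("WHISPER_MODEL", "small")
    model = whisper.load_model(model_name)
    # fp16=False avoids the half-precision warning on CPU-only machines
    result = model.transcribe(audio_path, fp16=False)
    return result["text"]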
Running the Application
With the code in place, create a .env file in the root of your project directory and add your YouTube credentials.
YOUTUBE_EMAIL=your_youtube_email
YOUTUBE_PASSWORD=your_youtube_password
Important note: Don't forget to add the .env file to .gitignore, or you might accidentally commit it to git.
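A minimal .gitignore for this project might look something like this (the *.mp3 entry is just a suggestion so downloaded audio doesn't end up in the repo either):
venv/
.env
*.mp3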
To run the application, use the following command:
streamlit run app.py
This opens up a browser tab, and you'll see a simple interface where you can enter a YouTube URL and get the transcribed text.

Conclusion
In this guide, we walked through the process of setting up a Python project to transcribe YouTube audio using yt-dlp, Streamlit, and the Whisper model. This code can be incredibly useful for converting spoken content into text, enhancing accessibility, and enabling further analysis.
You can clone the repo here: https://github.com/naveen-malla/Youtube_Summarizer
I am soon going to add LLM capability to the project to summarize the transcript and enable chatting with it.
If you enjoyed this post, please consider
- holding the clap button for a few seconds (it goes up to 50) and
- following me for more updates.
It gives me the motivation to keep going and helps the story reach more people like you. I share stories every week about machine learning concepts and tutorials on interesting projects. See you next week. Happy learning!