Recently, there has been a lot of buzz around OpenAI and its advanced AI capabilities. As someone who is interested in AI and machine learning, I was excited to explore the possibilities of working on a project using OpenAI. While browsing YouTube, I stumbled across a video by devaslife showcasing a transcription/translation tool he built using OpenAI's GPT-3. I was inspired by the tool and wanted to recreate it myself, but also improve upon it. While working on this project, I found myself struggling with certain topics in AI and machine learning. To help me learn, I turned to ChatGPT, another OpenAI tool that allowed me to ask questions and receive helpful explanations in real-time. With the help of OpenAI's tools, I was able to learn and improve my skills as I worked on my project.
Project Details
The project was built with Next.js (TypeScript, React, Radix UI, and Tailwind CSS); Python, which handles the YouTube audio download, transcription, and translation via the OpenAI API; and shell scripts that run the Python code.
```
PROJECT_ROOT
├── src                # Source
│   ├── pages          # Pages
│   │   └── api        # API routes
│   ├── components     # React components
│   └── styles         # Styling
├── public
├── transcription      # Python scripts
├── uploads            # Temporary files
└── utils              # Utility tools
```
Python
The project consists of two Python scripts: `transcribe.py` and `translate.py`. Just as the names suggest, they manage the transcription and translation of the YouTube video's text.
The `transcribe.py` script uses the OpenAI API to transcribe the YouTube video's audio to text, and the `translate.py` script then takes the resulting text and uses the OpenAI API to translate it into the desired language. Both scripts are run from shell scripts that handle downloading the YouTube audio and passing it to the transcription and translation steps.
Get the Audio File
yt-dlp is a Python package that lets you download audio and video from YouTube.
```zsh
#!/bin/zsh

VIDEO_ID=$1

[ -z "$VIDEO_ID" ] && echo "ERROR: No video ID specified" && exit 1

yt-dlp "https://www.youtube.com/watch?v=$VIDEO_ID" --format m4a -o "./tmp/%(id)s.%(ext)s" 2>&1
```
This script downloads the audio for the specified YouTube video in M4A format to a temporary directory, naming the file after the video ID. If no video ID is supplied, it exits with an error message.
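For example, `./get-audio.sh dQw4w9WgXcQ` (a placeholder video ID) would save the audio to `./tmp/dQw4w9WgXcQ.m4a`.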
Transcription
Once we have the audio file, we can use Whisper, OpenAI's automatic speech recognition (ASR) neural network, which was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Using Whisper, we can transcribe the audio and return an SRT file.
An SRT file, also known as a SubRip Subtitle file, is a plain-text file that contains the critical information about subtitles: the sequential number of each subtitle and the start and end timecodes that keep the text in sync with the audio.
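For example, a short, made-up SRT fragment looks like this:

```
1
00:00:01,000 --> 00:00:04,200
Hello, and welcome to the video.

2
00:00:04,200 --> 00:00:07,500
Today we're looking at transcription.
```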
```python
import os
import sys

import openai
from decouple import config

openai.api_key = config("OPENAI_API_KEY")
video_id = sys.argv[1]
audio_url = os.path.join(os.getcwd(), 'uploads', video_id + '.m4a')

audio_file = open(audio_url, "rb")

# Set the custom parameters
params = {
    'file': audio_file,
    'model': 'whisper-1',
    'prompt': 'Transcribe this audio file:',
    'response_format': 'srt'
}

# Call the transcribe function with the custom parameters
transcription = openai.Audio.transcribe(**params)

print(transcription)
```
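Since the script reads the video ID from `sys.argv[1]` and prints the resulting SRT to stdout, it can be invoked as `python transcribe.py VIDEO_ID` and piped straight into the translation script.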
Translate
Once the audio is transcribed, we can read the SRT data from the console, parse it with pysrt, and send each subtitle to Davinci, one of the language models offered by OpenAI, powered by the GPT-3 (Generative Pre-trained Transformer 3) architecture. The translate function takes three inputs: a prompt, max tokens, and temperature. The prompt gives the model a starting point for generating text. Max tokens caps the number of tokens generated for the completion. Temperature controls the model's creativity: a higher temperature generates more diverse and creative responses, while a lower temperature generates more conservative and predictable ones.
An important lesson I learned while working with command-line arguments is sanitizing them. Sanitizing command-line arguments is an important aspect of writing secure code: it involves validating and cleaning up the data provided by the user to prevent malicious usage or unexpected errors. The argparse setup in the script below enforces this by constraining each argument's type and range.
```python
import sys
import argparse

import openai
import pysrt
from decouple import config

openai.api_key = config("OPENAI_API_KEY")
input_string = sys.stdin.read()
subtitles = pysrt.from_string(input_string)

def bounded_temperature(value):
    # Validate temperature as a float in [0.0, 1.0]. Comparing floats against
    # a list of choices (e.g. [x * 0.1 for x in range(0, 11)]) fails for
    # inputs like 0.3 due to floating-point rounding, so use a range check.
    t = float(value)
    if not 0.0 <= t <= 1.0:
        raise argparse.ArgumentTypeError("temperature must be between 0.0 and 1.0")
    return t

parser = argparse.ArgumentParser(description="Translate SRT subtitles read from stdin")

parser.add_argument('lang', type=str, help='The language to use for translation')
parser.add_argument('max_tokens', type=int, choices=range(1, 5000), help='The maximum number of tokens to generate')
parser.add_argument('temperature', type=bounded_temperature, help='The temperature for the model (0.0-1.0)')

args = parser.parse_args()

lang = args.lang
max_tokens = args.max_tokens
temperature = args.temperature

prompt_base = (
    "You are a skilled polyglot with proficiency in over 100 languages. "
    "Below is a segment of the transcript from a video. "
    f"Please accurately translate the ensuing text into {lang}, "
    "ensuring you maintain proper grammar, stylistic nuance, and tone. "
    "Commence the translation from [START] to [END]:\n[START]\n"
)

def translate(text):
    prompt = prompt_base + text + "\n[END]"

    res = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature
    )
    return res.choices[0].text.strip()

for subtitle in subtitles:
    subtitle.text = translate(subtitle.text)
    print(subtitle, flush=True)
```
Building the UI
To build out the layout for the transcription/translation tool, I used Radix UI, a UI component library for React. Radix UI provides a variety of pre-built components that can be easily customized, which allowed me to quickly create a responsive and visually appealing user interface without spending too much time on styling.
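The project's actual components aren't shown here, but as a rough sketch of the pattern, here is a hypothetical language picker built on `@radix-ui/react-select` and styled with Tailwind classes (the component name, options, and styling are assumptions, not code from the repo):

```tsx
import * as Select from "@radix-ui/react-select";

// Hypothetical list of target languages for the translation UI.
const LANGUAGES = ["Spanish", "French", "Japanese"];

export function LanguageSelect({ onChange }: { onChange: (lang: string) => void }) {
  return (
    <Select.Root onValueChange={onChange}>
      {/* Trigger renders the closed select; Tailwind handles the styling */}
      <Select.Trigger className="rounded border px-3 py-2 text-sm">
        <Select.Value placeholder="Target language" />
      </Select.Trigger>
      <Select.Portal>
        <Select.Content className="rounded bg-white shadow-md">
          <Select.Viewport className="p-1">
            {LANGUAGES.map((lang) => (
              <Select.Item key={lang} value={lang} className="cursor-pointer px-3 py-1">
                <Select.ItemText>{lang}</Select.ItemText>
              </Select.Item>
            ))}
          </Select.Viewport>
        </Select.Content>
      </Select.Portal>
    </Select.Root>
  );
}
```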
APIs
To make the audio download, transcription, and translation functionalities accessible via API calls, I used Next.js API routes. This allowed me to easily create RESTful API endpoints that could be called from the front-end.
The API routes are located in the `pages/api` directory. There are three API routes: `download`, `transcription`, and `translation`.
Download
The `download` API route takes a YouTube video ID as a query parameter and returns the M4A audio file for that video. Here's the code for the `download` API route:
```typescript
import { spawn } from "child_process";
import path from "path";
import type { NextApiRequest, NextApiResponse } from "next";
// transferChildProcessOutput comes from the project's utils directory
// (exact import path assumed):
import { transferChildProcessOutput } from "../../utils/transferChildProcessOutput";

export default function GET(
  request: NextApiRequest,
  response: NextApiResponse
) {
  const video_id = request.query.video_id as string;
  if (typeof video_id !== "string") {
    response.status(400).json({ error: "Invalid request" });
    return;
  }

  console.log("video ID:", video_id);
  const cmd = spawn(path.join(process.cwd(), "transcription/get-audio.sh"), [
    video_id || "",
  ]);
  transferChildProcessOutput(cmd, response);
}
```
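The `transferChildProcessOutput` helper referenced above lives in the project's `utils` directory and isn't shown in the snippet. A minimal sketch, assuming it simply streams the child process's output back over the HTTP response, might look like this:

```typescript
import type { ChildProcessWithoutNullStreams } from "child_process";
import type { NextApiResponse } from "next";

// Sketch only: forward everything the script writes to stdout/stderr
// straight to the HTTP response, then close the response when it exits.
export function transferChildProcessOutput(
  cmd: ChildProcessWithoutNullStreams,
  response: NextApiResponse
) {
  cmd.stdout.on("data", (chunk: Buffer) => response.write(chunk));
  cmd.stderr.on("data", (chunk: Buffer) => response.write(chunk));
  cmd.on("close", (code) => {
    if (code !== 0) response.write(`process exited with code ${code}`);
    response.end();
  });
}
```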
Transcription
The `transcription` API route takes the audio file as a Base64-encoded string in the request body and returns the transcription as an SRT file. Here's the code for the `transcription` API route:
```typescript
import type { NextApiRequest, NextApiResponse } from "next";

export default async (req: NextApiRequest, res: NextApiResponse) => {
  const { audio } = req.body;

  if (!audio) {
    res.status(400).json({ error: 'No audio provided' });
    return;
  }

  const transcription = await transcribeAudio(audio);

  res.status(200).json({ transcription });
};
```
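`transcribeAudio` isn't shown either; here is a sketch under the assumption that it decodes the Base64 audio into `uploads/` and shells out to `transcribe.py`, collecting the SRT the script prints (the file naming is hypothetical):

```typescript
import { spawn } from "child_process";
import { writeFile } from "fs/promises";
import path from "path";

async function transcribeAudio(audioBase64: string): Promise<string> {
  // Hypothetical file id; the real helper may derive this from the video ID.
  const id = Date.now().toString();
  const file = path.join(process.cwd(), "uploads", `${id}.m4a`);
  await writeFile(file, Buffer.from(audioBase64, "base64"));

  return new Promise((resolve, reject) => {
    const py = spawn("python", ["transcription/transcribe.py", id]);
    let srt = "";
    py.stdout.on("data", (chunk) => (srt += chunk));
    py.on("close", (code) =>
      code === 0 ? resolve(srt) : reject(new Error(`transcribe.py exited with ${code}`))
    );
  });
}
```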
Translation
The `translation` API route takes the SRT file as a Base64-encoded string in the request body and returns the translated SRT file as a JSON object. Here's the code for the `translation` API route:
```typescript
import type { NextApiRequest, NextApiResponse } from "next";

export default async (req: NextApiRequest, res: NextApiResponse) => {
  const { srt, lang, maxTokens, temperature } = req.body;

  if (!srt) {
    res.status(400).json({ error: 'No SRT file provided' });
    return;
  }

  const translation = await translateSrt(srt, lang, maxTokens, temperature);

  res.status(200).json({ translation });
};
```
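`translateSrt` would work analogously, piping the decoded SRT into `translate.py` over stdin; again, this is a sketch rather than the repo's actual code:

```typescript
import { spawn } from "child_process";

async function translateSrt(
  srtBase64: string,
  lang: string,
  maxTokens: number,
  temperature: number
): Promise<string> {
  return new Promise((resolve, reject) => {
    // translate.py takes lang, max_tokens, and temperature as positional args.
    const py = spawn("python", [
      "transcription/translate.py",
      lang,
      String(maxTokens),
      String(temperature),
    ]);
    let out = "";
    py.stdout.on("data", (chunk) => (out += chunk));
    py.on("close", (code) =>
      code === 0 ? resolve(out) : reject(new Error(`translate.py exited with ${code}`))
    );
    // Feed the decoded SRT to the script's stdin and close the stream.
    py.stdin.write(Buffer.from(srtBase64, "base64"));
    py.stdin.end();
  });
}
```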
To use the audio download, transcription, and translation APIs with the React components on the site, HTTP requests are made to the appropriate endpoint from the front-end. For example, to download the audio file for a YouTube video, a GET request can be made to `/api/download?video_id=YOUR_VIDEO_ID`.
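As a sketch of how a front-end component might chain these calls (the exact payload handling is an assumption):

```typescript
// Sketch: kick off the server-side download, then request a transcription.
async function transcribeVideo(videoId: string, audioBase64: string) {
  // Trigger the audio download for the given video ID.
  await fetch(`/api/download?video_id=${videoId}`);

  // Ask the transcription route for an SRT of the Base64-encoded audio.
  const res = await fetch("/api/transcription", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ audio: audioBase64 }),
  });
  const { transcription } = await res.json();
  return transcription;
}
```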
These APIs are then used in tandem with the React components to display the data. To see more of the code, please visit the GitHub repo: https://github.com/noahguale/yt-transcribe-translate