The Power of OpenAI
2023-05-18

Recently, there has been a lot of buzz around OpenAI and its advanced AI capabilities. As someone who is interested in AI and machine learning, I was excited to explore the possibilities of working on a project using OpenAI. While browsing YouTube, I stumbled across a video by devaslife showcasing a transcription/translation tool he built using OpenAI's GPT-3. I was inspired by the tool and wanted to recreate it myself, but also improve upon it. While working on this project, I found myself struggling with certain topics in AI and machine learning. To help me learn, I turned to ChatGPT, another OpenAI tool that allowed me to ask questions and receive helpful explanations in real-time. With the help of OpenAI's tools, I was able to learn and improve my skills as I worked on my project.

Project Details

The project was built with Next.js (TypeScript, React, Radix UI, and Tailwind CSS) for the front-end, Python to handle the YouTube audio download, transcription, and translation through the OpenAI API, and shell scripts to run the Python code.


PROJECT_ROOT
├── src                # Source
│   ├── pages          # Pages
│   │   └── api        # API routes
│   ├── components     # React components
│   └── styles         # Styling
├── public
├── transcription      # Python scripts
├── uploads            # Temporary files
└── utils              # Utility tools

Python

The project consists of two Python scripts: transcribe.py and translate.py. As the names suggest, they handle the transcription and translation of the YouTube video's text.

The transcribe.py script uses the OpenAI API to transcribe the YouTube video's audio to text, while the translate.py script takes the resulting text and uses the OpenAI API to translate it into the desired language. These scripts are invoked by shell scripts that handle downloading the YouTube audio and passing it to the transcription and translation steps.

Get the Audio File

yt-dlp is a Python package that lets you download audio and video from YouTube.


#!/bin/zsh

VIDEO_ID=$1

[ -z "$VIDEO_ID" ] && echo "ERROR: No video ID specified" && exit 1

yt-dlp "https://www.youtube.com/watch?v=$VIDEO_ID" --format m4a -o "./tmp/%(id)s.%(ext)s" 2>&1

This script downloads the audio of the YouTube video with the specified video ID in the M4A format to a temporary directory, naming the file after the video ID. If no video ID is specified, it prints an error message and exits.

Transcription

Once we have the audio file, we can use Whisper, an automatic speech recognition (ASR) neural network trained on 680,000 hours of multilingual and multitask supervised data collected from the web. With Whisper we can transcribe the audio and get back an SRT file.

An SRT file, also known as a SubRip Subtitle file, is a plain-text file that contains critical information about subtitles. This includes the start and end timecodes of your text to ensure your subtitles match your audio, as well as the sequential number of subtitles.
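For example, each numbered entry in an SRT file pairs a start and end timecode with its subtitle text (the text here is placeholder content):

1
00:00:01,000 --> 00:00:04,500
Welcome to the video.

2
00:00:04,600 --> 00:00:08,000
Today we will look at transcription.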


import os
import openai
import sys
from decouple import config

openai.api_key = config("OPENAI_API_KEY")
video_id = sys.argv[1]
audio_url = os.path.join(os.getcwd(), 'uploads', video_id + '.m4a')

audio_file = open(audio_url, "rb")

# Set the custom parameters
params = {
    'file': audio_file,
    'model': 'whisper-1',
    'prompt': 'Transcribe this audio file:',
    'response_format': 'srt'
}

# Call the transcribe function with the custom parameters
transcription = openai.Audio.transcribe(**params)

print(transcription)

Translate

Once the audio is transcribed, we can read the SRT data from standard input and pass it to Davinci, one of the language models offered by OpenAI, which is powered by the GPT-3 (Generative Pre-trained Transformer 3) architecture. The translate function takes three inputs: a prompt, max tokens, and temperature. The prompt gives the model a starting point for the text it generates. Max tokens is the maximum number of tokens to generate for the completion. Temperature controls the model's creativity: a higher temperature generates more diverse and creative responses, while a lower temperature generates more conservative and predictable ones.

An important lesson I learned while working with command-line arguments is sanitization. Sanitizing command-line arguments is an important aspect of writing secure code: it involves validating and cleaning up the data provided by the user to prevent malicious input or unexpected errors. In the script below, argparse's choices are used to constrain max_tokens and temperature to sensible ranges.


import sys
import openai
from decouple import config
import pysrt
import argparse

openai.api_key = config("OPENAI_API_KEY")
input_string = sys.stdin.read()
subtitles = pysrt.from_string(input_string)

parser = argparse.ArgumentParser(description="A demo script")

parser.add_argument('lang', type=str, help='The language to use for translation')
parser.add_argument('max_tokens', type=int, choices=range(1, 5000), help='The maximum number of tokens to generate')
# round() keeps the choices exact so inputs like 0.3 are not rejected by float-precision mismatches
parser.add_argument('temperature', type=float, choices=[round(x * 0.1, 1) for x in range(0, 11)], help='The temperature for the model')

args = parser.parse_args()

lang = args.lang
max_tokens = args.max_tokens
temperature = args.temperature

prompt_base = (
    "You are a skilled polyglot with proficiency in over 100 languages. "
    "Below is a segment of the transcript from a video. "
    f"Please accurately translate the ensuing text into {lang}, "
    "ensuring you maintain proper grammar, stylistic nuance, and tone. "
    "Commence the translation from [START] to [END]:\n[START]\n"
)


def translate(text):
    prompt = prompt_base + text + "\n[END]"

    res = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature
    )
    translation = res.choices[0].text.strip()
    return translation


for index, subtitle in enumerate(subtitles):
    subtitle.text = translate(subtitle.text)
    print(subtitle, flush=True)

Building the UI

To build out the layout for the transcription/translation tool, I used Radix UI, a UI component library for React. Radix UI provides a variety of pre-built components that can be easily customized, which allowed me to quickly create a responsive and visually appealing user interface without spending too much time on styling.
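As an illustration (not the project's actual markup), here is a minimal sketch of how a Radix UI Select primitive, styled with Tailwind classes, could drive the target-language choice; the component and state names are hypothetical:

import { useState } from "react";
import * as Select from "@radix-ui/react-select";

// Hypothetical language picker built from Radix UI's Select primitive.
export function LanguageSelect() {
  const [lang, setLang] = useState("Japanese");

  return (
    <Select.Root value={lang} onValueChange={setLang}>
      <Select.Trigger className="rounded border px-3 py-1 text-sm">
        <Select.Value placeholder="Target language" />
      </Select.Trigger>
      <Select.Portal>
        <Select.Content className="rounded bg-white shadow">
          <Select.Viewport className="p-1">
            {["Japanese", "Spanish", "French"].map((language) => (
              <Select.Item key={language} value={language} className="px-2 py-1">
                <Select.ItemText>{language}</Select.ItemText>
              </Select.Item>
            ))}
          </Select.Viewport>
        </Select.Content>
      </Select.Portal>
    </Select.Root>
  );
}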

APIs

To make the audio download, transcription, and translation functionalities accessible via API calls, I used Next.js API routes. This allowed me to easily create RESTful API endpoints that could be called from the front-end.

The API routes are located in the pages/api directory. There are three API routes: download, transcription, and translation.

Download

The download API route takes a YouTube video ID as a query parameter and returns the M4A audio file for that video. Here's the code for the download API route:


import { NextApiRequest, NextApiResponse } from "next";
import { spawn } from "child_process";
import path from "path";

export default function GET(
  request: NextApiRequest,
  response: NextApiResponse
) {
  const video_id = request.query.video_id as string;
  if (typeof video_id !== "string") {
    response.status(400).json({ error: "Invalid request" });
    return;
  }

  console.log("video ID:", video_id);
  const cmd = spawn(path.join(process.cwd(), "transcription/get-audio.sh"), [
    video_id || "",
  ]);
  // transferChildProcessOutput is a project helper (kept in utils) that
  // streams the script's output back to the HTTP response.
  transferChildProcessOutput(cmd, response);
}
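The route depends on transferChildProcessOutput, a small helper kept in the project's utils directory whose source isn't shown in this post. A minimal sketch of what such a helper could look like, assuming it simply forwards the child process's output to the HTTP response:

import { ChildProcessWithoutNullStreams } from "child_process";
import { NextApiResponse } from "next";

// Hypothetical helper: stream a child process's stdout/stderr to the response
// and end the response when the process exits.
export function transferChildProcessOutput(
  cmd: ChildProcessWithoutNullStreams,
  response: NextApiResponse
) {
  response.writeHead(200, { "Content-Type": "text/plain; charset=utf-8" });

  cmd.stdout.on("data", (chunk: Buffer) => response.write(chunk));
  cmd.stderr.on("data", (chunk: Buffer) => response.write(chunk));

  cmd.on("close", () => response.end());
}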

Transcription

The transcription API route takes the audio file as a Base64-encoded string in the request body and returns the transcription as an SRT file. Here's the code for the transcription API route:


export default async (req, res) => {
  const { audio } = req.body;

  if (!audio) {
    res.status(400).json({ error: 'No audio provided' });
    return;
  }

  const transcription = await transcribeAudio(audio);

  res.status(200).json({ transcription });
};
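The transcribeAudio helper isn't shown in this post. One plausible implementation, sketched below under the assumption that it decodes the Base64 audio into the uploads directory (where transcribe.py expects it) and captures the script's SRT output:

import { execFile } from "child_process";
import { promises as fs } from "fs";
import path from "path";
import { promisify } from "util";

const execFileAsync = promisify(execFile);

// Hypothetical helper: write the decoded audio where transcribe.py expects it,
// run the script, and return the SRT text it prints to stdout.
async function transcribeAudio(audioBase64: string): Promise<string> {
  const id = Date.now().toString();
  const audioPath = path.join(process.cwd(), "uploads", `${id}.m4a`);
  await fs.writeFile(audioPath, Buffer.from(audioBase64, "base64"));

  const { stdout } = await execFileAsync("python3", [
    path.join(process.cwd(), "transcription", "transcribe.py"),
    id,
  ]);

  await fs.unlink(audioPath); // clean up the temporary audio file
  return stdout;
}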

Translation

The translation API route takes the SRT file as a Base64-encoded string in the request body and returns the translated SRT file as a JSON object. Here's the code for the translation API route:


export default async (req, res) => {
  const { srt, lang, maxTokens, temperature } = req.body;

  if (!srt) {
    res.status(400).json({ error: 'No SRT file provided' });
    return;
  }

  const translation = await translateSrt(srt, lang, maxTokens, temperature);

  res.status(200).json({ translation });
};
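Similarly, translateSrt isn't shown here. A rough sketch, assuming it decodes the SRT, pipes it to translate.py over standard input, and passes the language, token limit, and temperature as the script's positional arguments:

import { spawn } from "child_process";
import path from "path";

// Hypothetical helper: feed the decoded SRT to translate.py on stdin and
// collect the translated SRT from stdout.
function translateSrt(
  srtBase64: string,
  lang: string,
  maxTokens: number,
  temperature: number
): Promise<string> {
  return new Promise((resolve, reject) => {
    const cmd = spawn("python3", [
      path.join(process.cwd(), "transcription", "translate.py"),
      lang,
      String(maxTokens),
      String(temperature),
    ]);

    let output = "";
    cmd.stdout.on("data", (chunk) => (output += chunk.toString()));
    cmd.on("error", reject);
    cmd.on("close", (code) =>
      code === 0 ? resolve(output) : reject(new Error(`translate.py exited with code ${code}`))
    );

    cmd.stdin.write(Buffer.from(srtBase64, "base64"));
    cmd.stdin.end();
  });
}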

To use the audio download, transcription, and translation APIs from the React components on the site, HTTP requests are made to the appropriate endpoints from the front-end. For example, to download the audio file for a YouTube video, a GET request can be made to /api/download?video_id=YOUR_VIDEO_ID.
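For illustration, here is roughly how a component might call these endpoints with fetch; the function names and parameter values are placeholders:

// Illustrative front-end calls; error handling omitted for brevity.
async function downloadAudio(videoId: string): Promise<string> {
  const res = await fetch(`/api/download?video_id=${videoId}`);
  return res.text(); // streamed output of get-audio.sh
}

async function translateSubtitles(srtBase64: string, lang: string): Promise<string> {
  const res = await fetch("/api/translation", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ srt: srtBase64, lang, maxTokens: 1000, temperature: 0.3 }),
  });
  const { translation } = await res.json();
  return translation;
}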

These APIs are then used in tandem with the React components to display the data. To see more of the code, please visit the GitHub repo: https://github.com/noahguale/yt-transcribe-translate