Get Rewarded! We will reward you with up to €50 credit on your account for every tutorial that you write and we publish!

Building a Flask API to Transcribe Audio Files using Whisper AI

profile picture
Author
Daniel Charles Mwangila
Published
2025-02-17
Time to read
15 minutes reading time

About the author- Daniel Charles Mwangila is a software engineer with a passion for building scalable and efficient systems. He has experience working with various technologies and frameworks, and is always looking for new challenges to tackle.

Introduction

In this tutorial, we’ll walk through building a Flask API that leverages OpenAI’s Whisper AI to transcribe audio files. The API will handle multiple files concurrently and provide real-time progress updates for each transcription task. By the end of this tutorial, you’ll have a fully functional Flask-based service capable of processing audio files efficiently while keeping users informed about the progress.


What is Whisper AI?

Whisper AI is an open-source automatic speech recognition (ASR) model developed by OpenAI. It is designed to convert spoken language into written text with remarkable accuracy across multiple languages. Whisper supports a wide range of tasks, including speech-to-text transcription, language detection, and translation. Its versatility makes it suitable for applications such as transcription services, voice assistants, meeting summaries, podcast indexing, and more.

One of the standout features of Whisper is its ability to handle diverse audio inputs, from clear studio recordings to noisy real-world environments. This robustness stems from its training on a vast dataset of multilingual and multitask audio data, enabling it to generalize well across different accents, dialects, and recording conditions.


Whisper Models and Their Uses

Whisper comes in several pre-trained model sizes, each offering a trade-off between performance and resource requirements:

Model size Description
Tiny The smallest and fastest model, ideal for resource-constrained environments or quick prototyping. It has fewer parameters and is less accurate but highly efficient.
Base A balanced model suitable for general-purpose transcription tasks. It offers a good mix of speed and accuracy.
Small Slightly larger and more accurate than the Base model, making it a solid choice for most use cases.
Medium A high-performance model that delivers excellent accuracy at the cost of increased computational demands.
Large The most accurate model, designed for scenarios where precision is critical. However, it requires significant computational resources and is best suited for powerful hardware like GPUs.

Each model variant can be selected based on your specific needs. For example, if you’re building a lightweight transcription service for mobile devices, the Tiny model might suffice. On the other hand, if you’re working on a professional-grade transcription platform, the Large model would be more appropriate.


Benefits of Using Whisper AI

Benefit Description
Multilingual Support Whisper supports over 90 languages, making it a versatile tool for global applications.
Open Source Being open source, Whisper allows developers to customize and extend its capabilities to fit their unique requirements.
High Accuracy Its state-of-the-art architecture ensures reliable transcriptions even in challenging audio conditions.
Ease of Use With libraries available in Python, integrating Whisper into your projects is straightforward.

In this tutorial, we’ll focus on using the Base model for transcription, but you can easily switch to other models depending on your needs.


Prerequisites

Before diving into the tutorial, ensure you have the following prerequisites in place:

  • A server running Ubuntu or Debian

    • You need access to the root user or a user with sudo permissions.
    • Before you start, you should complete some basic configuration, including setting up a firewall. If you haven't done so, you can follow this guide to configure ufw (Uncomplicated Firewall).
  • Basic knowledge

    • Familiarity with Python and Flask.
    • Understanding of REST APIs and how they work.
    • Basic command-line skills for interacting with the server.
  • Software dependencies

    • Python 3.8 or higher, the Python package manager pip, venv and ffmpeg already installed
      python3 --version
      sudo apt install python3-pip
      sudo apt install python3-venv
      sudo apt install ffmpeg

Minimum hardware requirements

  • CPU: Ideally an Intel/AMD CPU that supports AVX512 or DDR5. Whisper AI performs better on modern CPUs with advanced instruction sets. To check if your CPU supports AVX512 and DDR5, run the following commands:

    lscpu | grep -o 'avx512' && sudo dmidecode -t memory | grep -i "DDR5"
  • RAM: 16GB (minimum). Transcription tasks can be memory-intensive, especially when handling multiple files concurrently.

  • Disk space: About 50GB of free disk space. This ensures you have enough room for uploaded audio files, transcription outputs, and logs.

  • GPU: Not mandatory

By ensuring these prerequisites are met, you’ll have a smooth experience setting up and running the Flask API for audio transcription using Whisper AI. Let’s get started! 🚀

Step 1 - Setting Up the Environment

Before diving into the code, let’s set up the project structure and install the necessary dependencies.

Project Structure

Create the following directory structure:

flask-whisper-api/
├── app.py                # Main Flask application
├── requirements.txt      # Python dependencies
├── uploads/              # Folder to store uploaded audio files
└── transcriptions/       # Folder to store transcription results

Installing Dependencies

Create a requirements.txt file with the following content:

Flask==3.1.0
openai-whisper==20240930

Create and activate a virtual environment:

python3 -m venv venv
source venv/bin/activate

Install the dependencies:

pip install -r requirements.txt

Step 2 - Building the Flask API

Now, let’s dive into building the Flask API step by step. We’ll break down the process into smaller subsections, explaining each part of the code and its purpose. By the end, you’ll have a fully functional API that transcribes audio files and provides a download link for the transcription.

Edit app.py and add the chunks described below.

Step 2.1 - Importing Required Libraries

We start by importing the necessary libraries. Each library serves a specific purpose in our application:

  • Flask: A lightweight web framework for building APIs.
  • os: Provides utilities for interacting with the operating system (e.g. file handling).
  • threading: Enables concurrent execution of tasks.
  • queue.Queue: Manages a queue of transcription tasks.
  • whisper: The Whisper AI library for speech-to-text transcription.
  • concurrent.futures.ThreadPoolExecutor: Handles multiple transcription tasks concurrently.
  • time: Simulates progress updates during transcription.
from flask import Flask, request, jsonify
import os
import threading
from queue import Queue
from whisper import load_model
from concurrent.futures import ThreadPoolExecutor
import time

Step 2.2 - Setting Up Flask and Configurations

Here, we initialize the Flask app and configure key settings like upload and transcription directories. We also define the maximum number of concurrent workers and load the Whisper model.

app = Flask(__name__)

# Configuration
UPLOAD_FOLDER = 'uploads'                 # Folder to store uploaded audio files
TRANSCRIPTIONS_FOLDER = 'transcriptions'  # Folder to store transcription results
MAX_WORKERS = 5                           # Number of concurrent transcription tasks

# Create directories if they don't exist
os.makedirs(UPLOAD_FOLDER, exist_ok=True)
os.makedirs(TRANSCRIPTIONS_FOLDER, exist_ok=True)

# Global state for tracking progress
progress_queue = Queue()     # Queue to manage transcription tasks
transcription_status = {}    # Tracks the status of each transcription task

# Load the Whisper model (base model for general-purpose use)
model = load_model("base")

Step 2.3 - Creating the /upload Endpoint

This endpoint allows users to upload audio files. Once uploaded, the file is saved, and a unique transcription ID is generated. This ID is used to track the progress of the transcription.

@app.route('/upload', methods=['POST'])
def upload_file():
    """
    Handles file uploads and queues the file for transcription.
    Returns a unique transcription ID for progress tracking.
    """
    if 'file' not in request.files:
        return jsonify({"error": "No file part"}), 400

    file = request.files['file']
    if file.filename == '':
        return jsonify({"error": "No selected file"}), 400

    # Save the uploaded file
    file_path = os.path.join(UPLOAD_FOLDER, file.filename)
    file.save(file_path)

    # Generate a unique transcription ID
    transcription_id = str(hash(file_path))
    transcription_status[transcription_id] = {"status": "pending", "progress": 0}

    # Add the task to the queue
    progress_queue.put((transcription_id, file_path))

    return jsonify({
        "message": "File uploaded successfully",
        "transcription_id": transcription_id
    }), 200

Step 2.4 - Creating the /progress Endpoint

This endpoint allows users to check the progress of a specific transcription task using its unique transcription ID.

@app.route('/progress/<transcription_id>', methods=['GET'])
def get_progress(transcription_id):
    """
    Retrieves the progress of a transcription task.
    Returns the status, progress percentage, and download URL (if completed).
    """
    status = transcription_status.get(transcription_id)
    if not status:
        return jsonify({"error": "Invalid transcription ID"}), 404

    # Include download URL if transcription is completed
    if status["status"] == "completed":
        status["download_url"] = f"/download/{transcription_id}.txt"

    return jsonify(status), 200

Step 2.5 - Defining the Transcription Function

This function performs the actual transcription. It updates the progress periodically and saves the result to a file. If successful, it provides a download URL for the transcription.

def transcribe_audio(transcription_id, file_path):
    """
    Transcribes an audio file using Whisper AI.
    Updates progress periodically and saves the result to a file.
    """
    try:
        transcription_status[transcription_id]["status"] = "in_progress"

        # Simulate progress updates
        for i in range(1, 11):
            time.sleep(1)  # Simulating transcription work
            transcription_status[transcription_id]["progress"] = i * 10

        # Perform actual transcription
        result = model.transcribe(file_path)
        transcription_text = result["text"]

        # Save transcription to file
        output_filename = f"{transcription_id}.txt"
        output_path = os.path.join(TRANSCRIPTIONS_FOLDER, output_filename)
        with open(output_path, 'w') as f:
            f.write(transcription_text)

        # Update status and include download URL
        transcription_status[transcription_id]["status"] = "completed"
        transcription_status[transcription_id]["output_path"] = output_path
        transcription_status[transcription_id]["download_url"] = f"/download/{output_filename}"
    except Exception as e:
        transcription_status[transcription_id]["status"] = "failed"
        transcription_status[transcription_id]["error"] = str(e)

Step 2.6 - Handling Concurrent Tasks with a Worker

We use a background worker thread with a ThreadPoolExecutor to handle multiple transcription tasks concurrently. This ensures that up to MAX_WORKERS tasks run simultaneously.

def worker():
    """
    Background worker to process transcription tasks concurrently.
    Uses a ThreadPoolExecutor to manage multiple tasks.
    """
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        while True:
            transcription_id, file_path = progress_queue.get()
            executor.submit(transcribe_audio, transcription_id, file_path)
            progress_queue.task_done()

# Start the background worker thread
threading.Thread(target=worker, daemon=True).start()

Step 2.7 - Adding a /download Endpoint

This endpoint allows users to download the transcription file once it’s ready.

@app.route('/download/<filename>', methods=['GET'])
def download_file(filename):
    """
    Serves the transcription file for download.
    """
    file_path = os.path.join(TRANSCRIPTIONS_FOLDER, filename)
    if not os.path.exists(file_path):
        return jsonify({"error": "File not found"}), 404

    with open(file_path, 'r') as f:
        content = f.read()

    return jsonify({
        "filename": filename,
        "content": content
    }), 200

Step 2.8 - Running the Flask App

Finally, we add the code to run the Flask app in debug mode. This will start the server and make the API accessible.

if __name__ == '__main__':
    app.run(debug=True)
Click here to view a full sample code

Here’s the complete code for your Flask app:

from flask import Flask, request, jsonify
import os
import threading
from queue import Queue
from whisper import load_model
from concurrent.futures import ThreadPoolExecutor
import time

app = Flask(__name__)

# Configuration
UPLOAD_FOLDER = 'uploads'
TRANSCRIPTIONS_FOLDER = 'transcriptions'
MAX_WORKERS = 5

os.makedirs(UPLOAD_FOLDER, exist_ok=True)
os.makedirs(TRANSCRIPTIONS_FOLDER, exist_ok=True)

progress_queue = Queue()
transcription_status = {}
model = load_model("base")

@app.route('/upload', methods=['POST'])
def upload_file():
    if 'file' not in request.files:
        return jsonify({"error": "No file part"}), 400

    file = request.files['file']
    if file.filename == '':
        return jsonify({"error": "No selected file"}), 400

    file_path = os.path.join(UPLOAD_FOLDER, file.filename)
    file.save(file_path)

    transcription_id = str(hash(file_path))
    transcription_status[transcription_id] = {"status": "pending", "progress": 0}

    progress_queue.put((transcription_id, file_path))

    return jsonify({
        "message": "File uploaded successfully",
        "transcription_id": transcription_id
    }), 200

@app.route('/progress/<transcription_id>', methods=['GET'])
def get_progress(transcription_id):
    status = transcription_status.get(transcription_id)
    if not status:
        return jsonify({"error": "Invalid transcription ID"}), 404

    if status["status"] == "completed":
        status["download_url"] = f"/download/{transcription_id}.txt"

    return jsonify(status), 200

@app.route('/download/<filename>', methods=['GET'])
def download_file(filename):
    file_path = os.path.join(TRANSCRIPTIONS_FOLDER, filename)
    if not os.path.exists(file_path):
        return jsonify({"error": "File not found"}), 404

    with open(file_path, 'r') as f:
        content = f.read()

    return jsonify({
        "filename": filename,
        "content": content
    }), 200

def transcribe_audio(transcription_id, file_path):
    try:
        transcription_status[transcription_id]["status"] = "in_progress"

        for i in range(1, 11):
            time.sleep(1)
            transcription_status[transcription_id]["progress"] = i * 10

        result = model.transcribe(file_path)
        transcription_text = result["text"]

        output_filename = f"{transcription_id}.txt"
        output_path = os.path.join(TRANSCRIPTIONS_FOLDER, output_filename)
        with open(output_path, 'w') as f:
            f.write(transcription_text)

        transcription_status[transcription_id]["status"] = "completed"
        transcription_status[transcription_id]["output_path"] = output_path
        transcription_status[transcription_id]["download_url"] = f"/download/{output_filename}"
    except Exception as e:
        transcription_status[transcription_id]["status"] = "failed"
        transcription_status[transcription_id]["error"] = str(e)

def worker():
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        while True:
            transcription_id, file_path = progress_queue.get()
            executor.submit(transcribe_audio, transcription_id, file_path)
            progress_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

if __name__ == '__main__':
    app.run(debug=True)

Step 3 - Testing the API

Now that we’ve built the Flask API, let’s test it to ensure everything works as expected. We’ll explore multiple tools and methods for interacting with the API, including curl, Postman, and even Python scripts. This section will guide you through uploading an audio file, checking the transcription progress, and downloading the final transcription file.

First, run the Python script:

python3 app.py

Step 3.1 - Using curl to Test the API

Uploading a File

To upload an audio file using curl, run the following command:

curl -v -X POST http://127.0.0.1:5000/upload \
     -F "file=@path/to/audio.mp3"

Example Response:

{
  "message": "File uploaded successfully",
  "transcription_id": "1234567890"
}

The transcription_id is crucial for tracking the progress of the transcription.

Checking Progress

Once the file is uploaded, you can check the progress of the transcription using the /progress/<transcription_id> endpoint:

curl http://127.0.0.1:5000/progress/1234567890

Example Response (In Progress):

{
  "status": "in_progress",
  "progress": 50
}

Example Response (Completed):

{
  "status": "completed",
  "progress": 100,
  "download_url": "/download/1234567890.txt"
}

Downloading the Transcription

Once the transcription is complete, use the download_url provided in the progress response to retrieve the transcription file:

curl http://127.0.0.1:5000/download/1234567890.txt

Example Response:

{
  "filename": "1234567890.txt",
  "content": "This is the transcribed text from the audio file."
}

Step 3.2 - Alternative Tools for Testing

If you’re looking for alternatives to curl, here are some other tools you can use to test the API:

Tool Description
Postman A popular graphical tool for testing APIs. You can create POST requests to upload files, GET requests to check progress, and download transcriptions.
Insomnia A lightweight alternative to Postman with similar functionality.
HTTPie A user-friendly command-line HTTP client that simplifies API testing compared to curl.
Swagger UI If you integrate Swagger/OpenAPI documentation into your Flask app, you can test the API directly from the browser.

Each of these tools provides a modern and intuitive way to interact with your API, making it easier to debug and validate your Flask application.

By testing the API with curl and exploring alternative tools like Postman, Insomnia, or HTTPie, you can verify that the Flask application works as intended. Whether you prefer command-line tools or graphical interfaces, you now have multiple ways to interact with your transcription service!

Step 4 - Making the Project Production-Ready

To make the Flask API production ready, you’ll need to configure it to run as a background service, set up a domain, and secure it with SSL. Below are references to guides that will help you achieve this:

  1. Set Up and Deploy a Flask Web Application as a Systemctl Service with Nginx
    Learn how to deploy your Flask application using uWSGI and Nginx, and configure it to run as a systemctl service for reliability and scalability. Read the guide here.

  2. Automating SSL Certificate Issuance with acme.sh through DNS
    Secure your Flask API with SSL certificates by automating the issuance process using acme.sh. This ensures your API is accessible over HTTPS. Read the guide here.

  3. Simple Firewall Management with UFW
    Protect your server by configuring a firewall with UFW (Uncomplicated Firewall). This guide covers the basics of setting up and managing firewall rules. Read the guide here.

By following these guides, you can ensure your Flask API is production-ready, secure, and accessible via a custom domain. These steps will also help you automate maintenance tasks like SSL certificate renewal and firewall management.

With these configurations in place, your Flask API will be robust, scalable, and ready to handle real-world workloads! 🚀

Conclusion

In this tutorial, we built a Flask API that uses Whisper AI to transcribe audio files concurrently. We also implemented a mechanism to provide real-time progress updates for each transcription task. This setup is scalable and can be extended further to include features like authentication, logging, and error handling.

Feel free to experiment with different Whisper models and adjust the concurrency settings based on your hardware capabilities. Happy coding! 🚀

License: MIT
Want to contribute?

Get Rewarded: Get up to €50 in credit! Be a part of the community and contribute. Do it for the money. Do it for the bragging rights. And do it to teach others!

Report Issue

Discover our

GPU Server

Kickstart your AI project with our cost-efficient and powerful dedicated GPU servers.

Want to contribute?

Get Rewarded: Get up to €50 credit on your account for every tutorial you write and we publish!

Find out more