WhisperLive Hybrid Server

This hybrid server extends the original WhisperLive-Server to support both WebSocket connections (for real-time audio streaming) and HTTP endpoints (for file transcription) in a single container.

Features

  • WebSocket Server: Original real-time audio transcription functionality
  • HTTP Server: New file upload and transcription endpoints
  • Single Container: Both services run in the same Docker container
  • GPU Sharing: Both services share the same GPU resources

Architecture

The hybrid server runs two services simultaneously:

  1. WebSocket Server: Handles real-time audio streaming transcription
  2. HTTP Server: Handles file uploads and transcription requests

Both services use the same WhisperLive transcriber instance, ensuring efficient resource usage.
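
The layout described above can be pictured with a minimal sketch. This README does not show the server's actual classes, so `StubTranscriber`, `make_http_server`, and `start_in_background` are hypothetical names, and the lock is an assumption about how one shared instance might be guarded. The HTTP side runs in a background thread, which leaves the main thread free for the WebSocket loop (omitted here).

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubTranscriber:
    """Hypothetical stand-in for the shared model; not the real WhisperLive class."""

    def __init__(self):
        # Assumption: a lock serializes access so both services can use one instance.
        self.lock = threading.Lock()

    def transcribe(self, audio_bytes):
        with self.lock:
            return f"<{len(audio_bytes)} bytes transcribed>"


def make_http_server(transcriber, host="127.0.0.1", port=8080):
    """Build an HTTP server that holds a reference to the shared transcriber."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/health":
                body = json.dumps({"status": "healthy"}).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    server = ThreadingHTTPServer((host, port), Handler)
    server.transcriber = transcriber  # same object the WebSocket side would use
    return server


def start_in_background(server):
    """Run the HTTP server in a daemon thread so the main thread stays free."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread
```

In this sketch, a WebSocket server started on the main thread would receive the same `StubTranscriber` object, so only one model is held in memory.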

Ports

  • WebSocket Port: Default 5050 (configurable via PORT_WHISPERLIVE)
  • HTTP Port: Default 8080 (configurable via HTTP_PORT)

HTTP Endpoints

1. Health Check

GET /health

Returns server health status.

Response:

{
  "status": "healthy",
  "service": "WhisperLive Hybrid Server"
}
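
A client can poll this endpoint before sending work. The following is a minimal sketch using only the Python standard library; the function names `is_healthy` and `check_health` are illustrative, and the base URL assumes the default HTTP port.

```python
import json
import urllib.request


def is_healthy(body):
    """Return True if a /health response body reports status "healthy"."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_health(base_url="http://localhost:8080", timeout=5.0):
    """Query GET /health; treat any network failure as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200 and is_healthy(resp.read())
    except OSError:  # URLError, HTTPError, timeouts, refused connections
        return False
```

Calling `check_health()` returns `True` only when the server answers with HTTP 200 and a body whose `status` field is `"healthy"`.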

2. OpenAI Compatible Endpoints

POST /v1/audio/transcriptions
POST /v1/audio/translations

Fully compatible drop-in replacements for the standard OpenAI Whisper API.

Parameters:

  • file (required): Audio file (WAV, MP3, FLAC, M4A, OGG, WEBM, MP4, MPEG, MPGA)
  • model (optional): Model size (default: "base")
  • language (optional): Language code (e.g., "en", "es", "fr")
  • prompt (optional): Text to guide the model's style
  • response_format (optional): "json", "text", "srt", "verbose_json", "vtt" (default: "json")
  • temperature (optional): Sampling temperature (0.0 to 1.0)

Example Request:

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-1" \
  -F "response_format=json"

Response (JSON format):

{
  "text": "Hello, this is a test."
}
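
For clients that cannot install third-party packages, the same request can be built with the Python standard library. This is a sketch rather than an official client: `encode_multipart` and `transcribe_openai_style` are hypothetical helper names, the base URL assumes the default HTTP port, and the response parsing assumes `response_format` is left at `"json"`.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body, content_type). The boundary is random per call.
    """
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: {mime}\r\n\r\n'.encode()
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe_openai_style(path, base_url="http://localhost:8080", **fields):
    """POST a file to /v1/audio/transcriptions and return the "text" field."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart(
            fields, "file", os.path.basename(path), f.read()
        )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.loads(resp.read())["text"]
```

A call such as `transcribe_openai_style("audio.wav", model="whisper-1", language="en")` mirrors the curl request above.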

3. Legacy File Transcription

POST /transcribe

Transcribes an uploaded audio file.

Parameters:

  • file (required): Audio file (WAV, MP3, FLAC, M4A, OGG, WEBM)
  • language (optional): Language code (e.g., "en", "es", "fr")
  • task (optional): "transcribe" or "translate" (default: "transcribe")
  • model (optional): Model size (default: "base")

Example Request:

curl -X POST http://localhost:8080/transcribe \
  -F "file=@audio.wav" \
  -F "language=en" \
  -F "task=transcribe" \
  -F "model=base"

Response:

{
  "success": true,
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is a test.",
      "no_speech_prob": 0.1
    }
  ],
  "info": {
    "language": "en",
    "language_probability": 0.95,
    "duration": 10.5,
    "duration_after_vad": 10.5,
    "transcription_options": {}
  },
  "filename": "audio.wav"
}
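
Most callers want a single transcript string rather than the segment list. A small sketch of how the response above might be flattened follows; `segments_to_text` is a hypothetical helper, and the `0.6` cutoff for `no_speech_prob` is an illustrative value, not something the server defines.

```python
def segments_to_text(result, max_no_speech_prob=0.6):
    """Join segment texts, skipping segments that are probably silence.

    `result` is the parsed JSON body returned by POST /transcribe.
    """
    if not result.get("success"):
        raise ValueError("transcription did not succeed")
    kept = [
        seg["text"].strip()
        for seg in result.get("segments", [])
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]
    # Drop segments that are empty after stripping whitespace.
    return " ".join(text for text in kept if text)
```

Raising `max_no_speech_prob` to `1.0` keeps every segment, which is useful when comparing against the raw output.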

4. URL Transcription (Placeholder)

POST /transcribe/url

Placeholder endpoint for transcribing audio from a URL. The route exists, but the transcription logic is not implemented yet.

Usage Examples

Python Client

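import requests

# Upload a file to the legacy /transcribe endpoint
with open('audio.wav', 'rb') as f:
    response = requests.post('http://localhost:8080/transcribe',
                             files={'file': f},
                             data={'language': 'en', 'model': 'base'},
                             timeout=300)  # long files can take a while

if response.status_code == 200:
    result = response.json()
    # Each segment carries its own text; join them for the full transcript
    text = ' '.join(seg['text'].strip() for seg in result['segments'])
    print(f"Transcription: {text}")
else:
    print(f"Request failed: {response.status_code} {response.text}")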

JavaScript/Node.js

const fs = require('fs');

// Node.js 18+ provides fetch, FormData, and Blob as globals.
// The third-party form-data package is not compatible with the built-in fetch.
const form = new FormData();
form.append('file', new Blob([fs.readFileSync('audio.wav')]), 'audio.wav');
form.append('language', 'en');
form.append('model', 'base');

fetch('http://localhost:8080/transcribe', {
    method: 'POST',
    body: form
})
.then(response => response.json())
.then(result => console.log(result))
.catch(err => console.error('Request failed:', err));

cURL

# Basic transcription
curl -X POST http://localhost:8080/transcribe \
  -F "file=@audio.wav"

# With parameters
curl -X POST http://localhost:8080/transcribe \
  -F "file=@audio.wav" \
  -F "language=es" \
  -F "task=translate" \
  -F "model=small"

Configuration

Environment Variables

  • PORT_WHISPERLIVE: WebSocket port (default: 5050)
  • HTTP_PORT: HTTP port (default: 8080)
  • FASTERWHISPER_MODEL: Custom model path
  • OMP_NUM_THREADS: OpenMP thread count

Docker Compose

services:
  whisperlive:
    ports:
      - "5050:5050"  # WebSocket
      - "8080:8080"  # HTTP
    environment:
      PORT_WHISPERLIVE: 5050
      HTTP_PORT: 8080

Testing

1. Test Script

Run the Python test script:

python3 test_http_endpoints.py

2. Web Interface

Open test_form.html in a web browser to test the HTTP endpoints with a user-friendly interface.

3. Health Check

curl http://localhost:8080/health

Backend Support

Currently, the HTTP endpoints support:

  • faster_whisper: Full support for all features
  • tensorrt: Basic support (needs adaptation)
  • openvino: Basic support (needs adaptation)

File Size Limits

  • Maximum file size: 100MB
  • Supported formats: WAV, MP3, FLAC, M4A, OGG, WEBM (the OpenAI-compatible endpoints additionally list MP4, MPEG, MPGA)
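
Checking these limits on the client avoids a wasted upload. The sketch below is one way to do that; `validate_upload` is a hypothetical helper, it assumes "100MB" means 100 × 1024 × 1024 bytes, and it judges the format by file extension only, which may differ from how the server decides.

```python
import os

MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # assumption: "100MB" is read as 100 MiB
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".ogg", ".webm"}


def validate_upload(path):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if not os.path.isfile(path):
        problems.append("file does not exist")
    elif os.path.getsize(path) > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 100MB limit")
    elif os.path.getsize(path) == 0:
        problems.append("file is empty")
    return problems
```

The function reports every problem it finds instead of stopping at the first, so a caller can show them all at once.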

Performance Considerations

  • File transcription uses the same model instance as WebSocket connections
  • Temporary files are automatically cleaned up after processing
  • Because the model is loaded once and shared, GPU memory is not duplicated between the two services
  • HTTP requests are processed in separate threads
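
The temporary-file cleanup mentioned above follows a common pattern, sketched here with the standard library. This illustrates the idea only and is not the server's code; `temporary_upload` is a hypothetical name.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_upload(data, suffix=".wav"):
    """Write uploaded bytes to a temp file and delete it afterwards, even on error."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        yield path  # the caller transcribes the file while it exists
    finally:
        # Runs on normal exit and when the caller's block raises.
        if os.path.exists(path):
            os.remove(path)
```

Because the removal sits in a `finally` clause, a failed transcription does not leave files behind in the container.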

Troubleshooting

Common Issues

  1. Port Already in Use

    • Check if ports 5050 or 8080 are available
    • Use different ports via environment variables
  2. File Upload Errors

    • Ensure file size is under 100MB
    • Check file format is supported
    • Verify file is not corrupted
  3. GPU Memory Issues

    • Monitor GPU memory usage
    • Consider using smaller model sizes
    • Restart container if needed
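
For the first issue in the list, a quick way to see whether a port is already taken is to try connecting to it. This is a small sketch using the standard `socket` module; `port_in_use` is an illustrative name, and it only detects listeners reachable on the given host.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

Running `port_in_use(5050)` and `port_in_use(8080)` before starting the container shows whether either default port needs to be changed via `PORT_WHISPERLIVE` or `HTTP_PORT`.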

Logs

Check container logs for detailed error information (replace whisperlive with your container's name if it differs):

docker logs whisperlive

Migration from Original Server

The hybrid server is fully backward compatible. Your existing WebSocket clients will continue to work without changes. The HTTP endpoints are additional functionality that doesn't interfere with the original service.

Future Enhancements

  • Support for more audio formats
  • Batch file processing
  • Progress tracking for long files
  • Authentication and rate limiting
  • WebSocket support for file transcription progress