kcar 83edfff9d3 feat: apply local modifications to WhisperLive-Server

2026-05-13 22:33:35 +00:00

WhisperLive Hybrid Server

This hybrid server extends the original WhisperLive-Server to support both WebSocket connections (for real-time audio streaming) and HTTP endpoints (for file transcription) in a single container.

Features

WebSocket Server: Original real-time audio transcription functionality
HTTP Server: New file upload and transcription endpoints
Single Container: Both services run in the same Docker container
GPU Sharing: Both services share the same GPU resources

Architecture

The hybrid server runs two services simultaneously:

WebSocket Server: Handles real-time audio streaming transcription
HTTP Server: Handles file uploads and transcription requests

Both services use the same WhisperLive transcriber instance, ensuring efficient resource usage.
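
As a rough illustration of the shared-instance idea, the sketch below has two threads, standing in for the WebSocket and HTTP services, call one transcriber object. The `SharedTranscriber` class, the lock, and `fake_model` are assumptions made for this example; the actual server may coordinate access differently.

```python
import threading


class SharedTranscriber:
    """Hypothetical wrapper that serializes access to one model instance."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()
        self.calls = 0

    def transcribe(self, audio):
        # Only one caller runs the model at a time, whichever service it came from.
        with self._lock:
            self.calls += 1
            return self._model(audio)


def fake_model(audio):
    # Stand-in for the real Whisper model so the sketch runs without a GPU.
    return f"transcript of {audio}"


transcriber = SharedTranscriber(fake_model)
results = {}


def websocket_service():
    # Stands in for the real-time streaming path.
    results["ws"] = transcriber.transcribe("stream-chunk")


def http_service():
    # Stands in for the file-upload path.
    results["http"] = transcriber.transcribe("audio.wav")


threads = [
    threading.Thread(target=websocket_service),
    threading.Thread(target=http_service),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Loading the model once and guarding it with a lock keeps memory use flat, at the cost of requests from the two services waiting on each other.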

Ports

WebSocket Port: Default 5050 (configurable via PORT_WHISPERLIVE)
HTTP Port: Default 8080 (configurable via HTTP_PORT)

HTTP Endpoints

1. Health Check

GET /health

Returns server health status.

Response:

{
  "status": "healthy",
  "service": "WhisperLive Hybrid Server"
}
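
A client can poll this endpoint before sending work, for example while the container is still loading the model. The following is a minimal sketch using only the standard library; the function names, retry counts, and the injectable `fetch` parameter are assumptions for illustration and are not part of the server.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:8080", timeout=5.0):
    """GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(fetch=fetch_health, attempts=10, delay=1.0):
    """Poll until the body reports status == "healthy"; return True on success."""
    for attempt in range(attempts):
        if attempt:
            # Pause between attempts, but not before the first one.
            time.sleep(delay)
        try:
            if fetch().get("status") == "healthy":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: try again.
            pass
    return False
```

Calling `wait_until_healthy()` with no arguments polls `http://localhost:8080/health` up to ten times, one second apart.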

2. OpenAI Compatible Endpoints

POST /v1/audio/transcriptions
POST /v1/audio/translations

Intended as drop-in replacements for the OpenAI audio transcription and translation endpoints, so existing OpenAI-compatible clients can be pointed at this server.

Parameters:

file (required): Audio file (WAV, MP3, FLAC, M4A, OGG, WEBM, MP4, MPEG, MPGA)
model (optional): Model size (default: "base")
language (optional): Language code (e.g., "en", "es", "fr")
prompt (optional): Text to guide the model's style
response_format (optional): "json", "text", "srt", "verbose_json", "vtt" (default: "json")
temperature (optional): Sampling temperature (0.0 to 1.0)

Example Request:

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav" \
  -F "model=whisper-1" \
  -F "response_format=json"

Response (JSON format):

{
  "text": "Hello, this is a test."
}
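
The same request can be made from Python. The sketch below validates the documented parameters on the client before sending them; `build_form_fields` and `transcribe_file` are hypothetical helper names, the 300-second timeout is arbitrary, and the handling of non-JSON formats assumes the server returns plain text for them as the OpenAI API does.

```python
ALLOWED_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def build_form_fields(model="base", language=None, prompt=None,
                      response_format="json", temperature=None):
    """Validate the documented parameters and return multipart form fields."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    fields = {"model": model, "response_format": response_format}
    if language is not None:
        fields["language"] = language
    if prompt is not None:
        fields["prompt"] = prompt
    if temperature is not None:
        fields["temperature"] = str(temperature)
    return fields


def transcribe_file(path, base_url="http://localhost:8080", **params):
    """POST the file to /v1/audio/transcriptions using the requests library."""
    import requests  # third-party; imported lazily so validation works without it

    with open(path, "rb") as audio:
        response = requests.post(
            f"{base_url}/v1/audio/transcriptions",
            files={"file": audio},
            data=build_form_fields(**params),
            timeout=300,
        )
    response.raise_for_status()
    # Assumption: "json" and "verbose_json" return JSON, other formats plain text.
    if params.get("response_format", "json") in ("json", "verbose_json"):
        return response.json()
    return response.text
```

For example, `transcribe_file("audio.wav", language="en", response_format="srt")` would return the subtitle text as a string under these assumptions.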

3. Legacy File Transcription

POST /transcribe

Transcribes an uploaded audio file.

Parameters:

file (required): Audio file (WAV, MP3, FLAC, M4A, OGG, WEBM)
language (optional): Language code (e.g., "en", "es", "fr")
task (optional): "transcribe" or "translate" (default: "transcribe")
model (optional): Model size (default: "base")

Example Request:

curl -X POST http://localhost:8080/transcribe \
  -F "file=@audio.wav" \
  -F "language=en" \
  -F "task=transcribe" \
  -F "model=base"

Response:

{
  "success": true,
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is a test.",
      "no_speech_prob": 0.1
    }
  ],
  "info": {
    "language": "en",
    "language_probability": 0.95,
    "duration": 10.5,
    "duration_after_vad": 10.5,
    "transcription_options": {}
  },
  "filename": "audio.wav"
}
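
Because the text is split across segments, a client usually needs to reassemble it. The helpers below are a small sketch of that step, based only on the fields shown in the response above; the function names and the 0.5 threshold for `no_speech_prob` are arbitrary choices for illustration.

```python
def segments_to_text(result):
    """Join segment texts from a /transcribe response into one string."""
    if not result.get("success"):
        raise ValueError("transcription did not succeed")
    return " ".join(seg["text"].strip() for seg in result.get("segments", []))


def drop_silent_segments(result, max_no_speech_prob=0.5):
    """Keep segments whose no_speech_prob is at or below the threshold."""
    return [
        seg for seg in result.get("segments", [])
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]


# Trimmed copy of the example response shown above.
sample = {
    "success": True,
    "segments": [
        {"start": 0.0, "end": 2.5, "text": "Hello, this is a test.", "no_speech_prob": 0.1}
    ],
    "info": {"language": "en", "duration": 10.5},
    "filename": "audio.wav",
}

print(segments_to_text(sample))  # prints: Hello, this is a test.
```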

4. URL Transcription (Placeholder)

POST /transcribe/url

Placeholder endpoint for transcribing audio from a URL. The route is reserved, but the transcription logic is not implemented yet.

Usage Examples

Python Client

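import requests

# Upload a local file to the legacy /transcribe endpoint
with open('audio.wav', 'rb') as f:
    response = requests.post(
        'http://localhost:8080/transcribe',
        files={'file': f},
        data={'language': 'en', 'model': 'base'},
    )

if response.status_code == 200:
    result = response.json()
    # Each segment carries its own text; join them for the full transcript
    text = ' '.join(segment['text'].strip() for segment in result['segments'])
    print(f"Transcription: {text}")
else:
    print(f"Request failed: {response.status_code} {response.text}")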

JavaScript/Node.js

// Requires Node.js 18+ (built-in fetch, FormData, and Blob)
const fs = require('fs');

const form = new FormData();
// Wrap the file contents in a Blob so the built-in FormData can send it
form.append('file', new Blob([fs.readFileSync('audio.wav')]), 'audio.wav');
form.append('language', 'en');
form.append('model', 'base');

fetch('http://localhost:8080/transcribe', {
    method: 'POST',
    body: form
})
.then(response => response.json())
.then(result => console.log(result))
.catch(error => console.error(error));

cURL

# Basic transcription
curl -X POST http://localhost:8080/transcribe \
  -F "file=@audio.wav"

# With parameters
curl -X POST http://localhost:8080/transcribe \
  -F "file=@audio.wav" \
  -F "language=es" \
  -F "task=translate" \
  -F "model=small"

Configuration

Environment Variables

PORT_WHISPERLIVE: WebSocket port (default: 5050)
HTTP_PORT: HTTP port (default: 8080)
FASTERWHISPER_MODEL: Custom model path
OMP_NUM_THREADS: OpenMP thread count
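
To make the defaults concrete, here is a sketch of how these variables could be read in Python. The `load_config` function and its dictionary keys are hypothetical; how the server itself parses its environment is not documented here.

```python
import os


def load_config(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    omp = env.get("OMP_NUM_THREADS")
    return {
        "ws_port": int(env.get("PORT_WHISPERLIVE", "5050")),
        "http_port": int(env.get("HTTP_PORT", "8080")),
        # No default is documented for these two, so None means "unset".
        "model_path": env.get("FASTERWHISPER_MODEL"),
        "omp_threads": int(omp) if omp else None,
    }
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check without touching the real environment.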

Docker Compose

services:
  whisperlive:
    ports:
      - "5050:5050"  # WebSocket
      - "8080:8080"  # HTTP
    environment:
      PORT_WHISPERLIVE: 5050
      HTTP_PORT: 8080

Testing

1. Test Script

Run the Python test script:

python3 test_http_endpoints.py

2. Web Interface

Open test_form.html in a web browser to test the HTTP endpoints with a user-friendly interface.

3. Health Check

curl http://localhost:8080/health

Backend Support

Currently, the HTTP endpoints support:

faster_whisper: Full support for all features
tensorrt: Basic support (needs adaptation)
openvino: Basic support (needs adaptation)

File Size Limits

Maximum file size: 100MB
Supported formats: WAV, MP3, FLAC, M4A, OGG, WEBM (the OpenAI-compatible endpoints additionally list MP4, MPEG, MPGA)
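
Checking these limits on the client avoids uploading a file the server will reject. The sketch below assumes "100MB" means 100 MiB and judges the format by file extension only, using the list for the legacy endpoint; both are assumptions, and `check_upload` is a hypothetical helper.

```python
import os

MAX_BYTES = 100 * 1024 * 1024  # assumes "100MB" means 100 MiB
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".ogg", ".webm"}


def check_upload(path):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size == 0:
        problems.append("file is empty")
    elif size > MAX_BYTES:
        problems.append(f"file is {size} bytes, limit is {MAX_BYTES}")
    return problems
```

An extension check does not prove the file decodes correctly, so the server may still reject a corrupted file that passes here.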

Performance Considerations

File transcription uses the same model instance as WebSocket connections
Temporary files are automatically cleaned up after processing
Because only one model instance is loaded, GPU memory is not duplicated between the two services
HTTP requests are processed in separate threads
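
The temporary-file cleanup mentioned above can be pictured with a context manager like the one below. This is a sketch of the general pattern, not the server's own code; `temporary_upload` is a hypothetical name.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_upload(data, suffix=".wav"):
    """Write uploaded bytes to a temp file and delete it when the block exits."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on error, so failed transcriptions do not leak files.
        if os.path.exists(path):
            os.remove(path)
```

A handler would wrap its transcription call in `with temporary_upload(body) as path:` and pass `path` to the model, so the file is removed even if transcription raises.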

Troubleshooting

Common Issues

Port Already in Use
- Check if ports 5050 or 8080 are available
- Use different ports via environment variables
File Upload Errors
- Ensure file size is under 100MB
- Check file format is supported
- Verify file is not corrupted
GPU Memory Issues
- Monitor GPU memory usage
- Consider using smaller model sizes
- Restart container if needed
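
For the port conflict case, a quick way to check availability from Python is to try binding to the port. The helper below is a minimal sketch; `port_is_free` is a hypothetical name, and the result only reflects the moment of the check.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permissions error.
            return False
    return True
```

Running `port_is_free(5050)` and `port_is_free(8080)` on the host before starting the container shows which default port, if any, needs to be remapped.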

Logs

Check container logs for detailed error information:

docker logs whisperlive

Migration from Original Server

The hybrid server is fully backward compatible. Your existing WebSocket clients will continue to work without changes. The HTTP endpoints are additional functionality that doesn't interfere with the original service.

Future Enhancements

Support for more audio formats
Batch file processing
Progress tracking for long files
Authentication and rate limiting
WebSocket support for file transcription progress
