Models
View the list of models supported by the Spectropic API.
The Spectropic API currently supports two models for transcription and diraization. The standard
model is the default model and is suitable for most use cases. The enhanced
model is more accurate but slower.
Model | Languages | Price per second | Price per hour |
---|---|---|---|
standard | all | $0.0001 / second of audio | $0.36 / hour of audio |
enhanced | all | $0.0005 / second of audio | $1.80 / hour of audio |
Standard model
The standard model is based on Whisper large-v3-turbo for transcription and pyannote.audio 3.3.0 for diraization. It is suitable for most use cases and works well for languages like English, German, French, Spanish, Italian, and Dutch.
Costs are $0.0001
per second of audio equivalent to $0.36
per hour of audio.
Example
Enhanced model
The enhanced model is based on the standard model, but does LLM-based post processing to improve the accuracy of the transcription, get more detailed diarization segments, and infer labels or names of speakers.
The enhanced model can work well for languages that are lower in the accuracy of the standard model.
Limitations compared to the standard model:
- The enhanced model is slower.
- Unlike the standard model, the enhanced model does not output word level timestamps and confidence scores.
- Max input audio input duration is currently 60 minutes (this will be increased in the near future).
Costs are $0.0005
per second of audio equivalent to $1.80
per hour of audio.
Warning: The enhanced model is experimental and may fail or produce incorrect results (you won’t be charged on transcribe failure). Feel free to try it out and let us know what you think (mail to [email protected] or on X @thomas_mol).