Audio-To-Text

Deepgram model

Deepgram is an audio-to-text model that can transcribe audio files in real time with a high degree of accuracy; a minimal request sketch follows the list of models below.

The models available are:

  • nova-2
  • nova
  • enhanced
  • base
  • whisper (Deepgram’s own hosted version of OpenAI’s Whisper model)
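
As an illustration outside the workflow editor, a transcription request to Deepgram’s hosted API could look like the Python sketch below. The API key and audio URL are placeholders, and the model is picked with the model query parameter:

```python
import requests

DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"    # placeholder: use your own key
AUDIO_URL = "https://example.com/speech.mp3"  # hypothetical audio file URL

# Pre-recorded transcription endpoint; the model is selected via the
# `model` query parameter (nova-2, nova, enhanced, base, or whisper).
response = requests.post(
    "https://api.deepgram.com/v1/listen",
    params={"model": "nova-2"},
    headers={
        "Authorization": f"Token {DEEPGRAM_API_KEY}",
        "Content-Type": "application/json",
    },
    json={"url": AUDIO_URL},
)
response.raise_for_status()

# The transcript lives inside the first alternative of the first channel.
transcript = response.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
print(transcript)
```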

OpenAI Whisper model

Whisper is an audio-to-text model developed by OpenAI. Trained on a large and varied collection of audio data, it is a versatile speech recognition solution: beyond multilingual transcription, it can also translate speech into English and identify the language being spoken.
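
To see these capabilities concretely, here is a minimal sketch using the open-source whisper Python package (pip install openai-whisper, which also requires ffmpeg); the file name is a placeholder:

```python
import whisper

# Load one of the open-source checkpoints (tiny, base, small, medium, large).
model = whisper.load_model("base")

# Transcription: Whisper detects the spoken language automatically.
result = model.transcribe("speech.mp3")  # placeholder file name
print(result["language"])                # detected language code, e.g. "es"
print(result["text"])                    # transcript in the original language

# Translation: the same call can translate the speech into English.
translated = model.transcribe("speech.mp3", task="translate")
print(translated["text"])
```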

When using audio-to-text, at least two nodes are required: the model node (Deepgram or OpenAI Whisper) and an output node.

  • The model node requires a URL pointing to the audio file (e.g., an .mp3 or .wav file).
  • The output node will display the result of the transcription performed by the model, as in the sketch after this list.
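
The same two-step flow can be reproduced outside the editor. In the sketch below, which assumes the OpenAI Python SDK and a placeholder audio URL, the transcription call plays the role of the model node and the final print plays the role of the output node:

```python
import io

import requests
from openai import OpenAI  # pip install openai

AUDIO_URL = "https://example.com/speech.mp3"  # hypothetical audio file URL

# "Model node": fetch the audio behind the URL and send it to Whisper.
audio_file = io.BytesIO(requests.get(AUDIO_URL).content)
audio_file.name = "speech.mp3"  # the SDK infers the audio format from the name

client = OpenAI()  # reads OPENAI_API_KEY from the environment
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
)

# "Output node": display the result.
print(transcription.text)
```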

Alternatively, the result of the model node can be sent to an LLM node for further processing. In the video below, the transcription is sent to an LLM that combines it with data retrieved and processed from a URL.
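
A rough sketch of that chaining step, with placeholder inputs and an assumed chat model name:

```python
from openai import OpenAI

client = OpenAI()

transcript = "..."  # output of the audio-to-text model node
page_text = "..."   # data retrieved and processed from a URL

# The LLM node combines both sources in a single prompt.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat-capable model would do
    messages=[
        {"role": "system", "content": "Combine the transcript with the page text."},
        {
            "role": "user",
            "content": f"Transcript:\n{transcript}\n\nPage text:\n{page_text}\n\n"
                       "Summarize how the two relate.",
        },
    ],
)
print(response.choices[0].message.content)
```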

Audio input options

URL as input

The most common way to use the audio-to-text node is to provide a URL that points to the audio file. This can be the URL of a file stored in a cloud storage service (e.g., Google Drive, Dropbox) or of a file hosted on your own server.

Upload file as input

You can also upload an audio file from your computer, and it will be transcribed.
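
Programmatically, the counterpart of uploading a file is sending its raw bytes instead of a URL. A sketch against Deepgram’s API, with a placeholder key and file name:

```python
import requests

DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder: use your own key

# Send the raw bytes of a local file; the Content-Type header tells
# Deepgram which audio format to expect.
with open("recording.wav", "rb") as f:      # placeholder file name
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-2"},
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/wav",
        },
        data=f,
    )
response.raise_for_status()
print(response.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```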

Record your voice as input

Finally, you can record your voice directly from the browser to test the workflow.
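
The recorder itself runs in the browser, but if you want to test the same upload flow from a script, a short clip can be captured with the sounddevice and soundfile packages (an assumption for illustration, not part of the product):

```python
import sounddevice as sd  # pip install sounddevice soundfile
import soundfile as sf

SAMPLE_RATE = 16000  # Hz
SECONDS = 5

# Record a short mono clip from the default microphone.
audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
sd.wait()  # block until the recording is finished

# Save it as a WAV file that can then be uploaded for transcription.
sf.write("recording.wav", audio, SAMPLE_RATE)
```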