Building an Automatic Speech Recognition (ASR) System with PyTorch & Hugging Face

Automatic speech recognition (ASR) is a crucial technology in many applications, from voice assistants to transcription services. In this tutorial, we aim to build an ASR pipeline capable of transcribing speech into text using pre-trained models from Hugging Face. We will use a lightweight dataset for efficiency and employ Wav2Vec2, a powerful self-supervised model for speech recognition.

Our system will:

  • Load and preprocess a speech dataset
  • Fine-tune a pre-trained Wav2Vec2 model
  • Evaluate the model's performance using word error rate (WER)
  • Deploy the model for real-time speech-to-text inference

To keep our model lightweight and efficient, we will use a small speech dataset rather than large datasets like Common Voice.

Step 1: Installing Dependencies

Before we start, we need to install the necessary libraries. These libraries will allow us to load datasets, process audio files, and fine-tune our model.

pip install torch torchaudio transformers datasets soundfile jiwer

Here is what each library is used for:

  • transformers: Provides pre-trained Wav2Vec2 models for speech recognition
  • datasets: Loads and processes speech datasets
  • torchaudio: Handles audio processing and manipulation
  • soundfile: Reads and writes .wav files
  • jiwer: Computes the WER for evaluating ASR performance
Step 2: Loading a Lightweight Speech Dataset

Instead of using large datasets like Common Voice, we use SUPERB KS, a small dataset ideal for quick experimentation. This dataset consists of short spoken commands like "yes," "no," and "stop."

from datasets import load_dataset

dataset = load_dataset("superb", "ks", split="train[:1%]")  # Load only 1% of the data for quick testing

print(dataset)

This loads a tiny subset of the dataset to reduce computational cost while still allowing us to fine-tune the model. Warning: the dataset still requires storage space, so be mindful of disk usage when working with larger splits.
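
If you want to inspect what a single record contains before preprocessing, print one example. The snippet below assumes the fields exposed by the superb "ks" configuration (an audio dictionary plus an integer keyword label):

example = dataset[0]
print(example["audio"]["path"])           # path to the underlying .wav file
print(example["audio"]["sampling_rate"])  # 16 kHz for this dataset
print(example["label"])                   # integer id of the spoken keyword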

Step 3: Preprocessing the Audio Data

To train our ASR model, we need to ensure that the audio data is in the correct format. The Wav2Vec2 model requires:

  • 16 kHz sample rate
  • No padding or truncation (handled dynamically)

We define a function to process the audio and extract relevant features.

import torchaudio

# Map integer class ids to their keyword strings (e.g. 0 -> "yes")
label_names = dataset.features["label"].names

def preprocess_audio(batch):
    # Load the waveform from the audio file path
    speech_array, sampling_rate = torchaudio.load(batch["audio"]["path"])
    # Wav2Vec2 expects 16 kHz audio, so resample if needed
    if sampling_rate != 16000:
        speech_array = torchaudio.transforms.Resample(sampling_rate, 16000)(speech_array)
        sampling_rate = 16000
    batch["speech"] = speech_array.squeeze().numpy()
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = label_names[batch["label"]]  # Use the spoken keyword as the text target
    return batch

dataset = dataset.map(preprocess_audio)

This ensures all audio files are loaded correctly and formatted properly for further processing.
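
As a quick sanity check, you can print one processed example (the field names below are the ones created by preprocess_audio):

sample = dataset[0]
print(sample["sampling_rate"])  # should now be 16000
print(len(sample["speech"]))    # number of audio samples (KS clips are about one second long)
print(sample["target_text"])    # e.g. "yes"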

Step 4: Loading a Pre-trained Wav2Vec2 Model

We use a pre-trained Wav2Vec2 model from Hugging Face's model hub. This model has already been trained on a large dataset and can be fine-tuned for our specific task.

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Here we create both the processor, which converts raw audio into model-ready features, and the model itself: a Wav2Vec2 network trained on 960 hours of LibriSpeech audio.
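
To see what these two components do, you can push a short dummy waveform through them. This is only an illustrative sketch; the time dimension of the output depends on the input length, since Wav2Vec2 downsamples the waveform by roughly a factor of 320:

import torch

dummy_waveform = torch.zeros(16000).numpy()  # one second of silence at 16 kHz
inputs = processor(dummy_waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

print(inputs.input_values.shape)  # (1, 16000) -- the normalized waveform
print(logits.shape)               # (1, time_steps, vocab_size) -- per-frame character scores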

Step 5: Preparing Data for the Model

We must tokenize and encode the audio so that the model can understand it.

def preprocess_for_model(batch):
    # Convert the raw waveform into normalized input features for Wav2Vec2
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    batch["input_values"] = inputs.input_values[0]
    return batch

dataset = dataset.map(preprocess_for_model, remove_columns=["speech", "sampling_rate", "audio"])

This step ensures that our dataset is compatible with the Wav2Vec2 model.
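
One caveat: the Trainer in Step 7 computes the CTC loss from a labels column, which the steps above do not create. The sketch below is one way to add it, assuming target_text holds the keyword string and relying on the fact that facebook/wav2vec2-base-960h uses an upper-case character vocabulary:

def add_labels(batch):
    # Encode the target keyword as character ids for the CTC loss
    # (the 960h checkpoint's vocabulary is upper-case, hence .upper())
    batch["labels"] = processor.tokenizer(batch["target_text"].upper()).input_ids
    return batch

dataset = dataset.map(add_labels)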

Step 6: Defining Training Arguments

Before training, we need to set up our training configuration. This includes batch size, learning rate, and optimization steps.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2",
    per_device_train_batch_size=4,
    evaluation_strategy="no",  # no separate eval dataset is passed to the Trainer below
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=1e-4,
    warmup_steps=500,
    max_steps=4000,
    save_total_limit=2,
    gradient_accumulation_steps=2,
    fp16=True,
    push_to_hub=False,
)
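
With per_device_train_batch_size=4 and gradient_accumulation_steps=2, gradients are accumulated over two forward passes before each optimizer step, for an effective batch size of 8. Also note that fp16=True requires a CUDA-capable GPU; set it to False if you are experimenting on CPU.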

Step 7: Training the Model

Using Hugging Face's Trainer, we fine-tune our Wav2Vec2 model.

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor,
)

trainer.train()
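
As written, the Trainer relies on default batching, but Wav2Vec2 examples have variable-length input_values (and labels, if you added them in Step 5), so fine-tuning normally uses a padding data collator. Below is a minimal sketch of such a collator, assuming each example carries input_values and labels as created earlier; pass it to the Trainer via its data_collator argument if you hit padding or shape errors:

import torch

def collate_fn(features):
    # Pad the audio inputs and the label ids to the longest example in the batch
    audio_features = [{"input_values": f["input_values"]} for f in features]
    label_features = [{"input_ids": f["labels"]} for f in features]

    batch = processor.feature_extractor.pad(audio_features, padding=True, return_tensors="pt")
    labels_batch = processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")

    # Replace padding token ids with -100 so the CTC loss ignores them
    batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].eq(0), -100)
    return batch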

Step 8: Evaluating the Model

To measure how well our model transcribes speech, we compute the WER.

import torch
from jiwer import wer

def transcribe(batch):
    # Re-batch the stored input values and run a forward pass without gradients
    inputs = processor(batch["input_values"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy decoding: take the most likely token at each time step
    predicted_ids = torch.argmax(logits, dim=-1)
    # Lower-case the decoded text so it matches the lower-case keyword targets
    batch["predicted_text"] = processor.batch_decode(predicted_ids)[0].lower()
    return batch

results = dataset.map(transcribe)

wer_score = wer(results["target_text"], results["predicted_text"])
print(f"Word Error Rate: {wer_score:.2f}")

A lower WER score indicates better performance.
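
For intuition, WER is the number of word-level substitutions, deletions, and insertions divided by the number of words in the reference. A tiny worked example with jiwer:

from jiwer import wer

# One substitution ("light" vs. "lights") out of four reference words -> WER = 0.25
print(wer("turn the lights on", "turn the light on"))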

Step 9: Running Inference on New Audio

Finally, we can use our trained model to transcribe real-world speech.

import torchaudio

# example.wav is assumed to be a 16 kHz mono recording
speech_array, sampling_rate = torchaudio.load("example.wav")
inputs = processor(speech_array.squeeze().numpy(), sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
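
The snippet above assumes example.wav is already a 16 kHz recording. If your file uses a different sample rate, resample it before calling the processor, for example with torchaudio:

if sampling_rate != 16000:
    # Resample to the 16 kHz rate that Wav2Vec2 expects
    speech_array = torchaudio.transforms.Resample(sampling_rate, 16000)(speech_array)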

Conclusion

And that's it. You've successfully built an ASR system using PyTorch & Hugging Face with a lightweight dataset.