Building an Automatic Speech Recognition (ASR) System with PyTorch & Hugging Face

Automatic speech recognition (ASR) is a crucial technology in many applications, from voice assistants to transcription services. In this tutorial, we aim to build an ASR pipeline capable of transcribing speech into text using pre-trained models from Hugging Face. We will use a lightweight dataset for efficiency and employ Wav2Vec2, a powerful self-supervised model for speech recognition.

Our system will:

  • Load and preprocess a speech dataset
  • Fine-tune a pre-trained Wav2Vec2 model
  • Evaluate the model's performance using word error rate (WER)
  • Deploy the model for real-time speech-to-text inference

To keep our model lightweight and efficient, we will use a small speech dataset rather than large datasets like Common Voice.

Step 1: Installing Dependencies

Before we start, we need to install the necessary libraries. These libraries will allow us to load datasets, process audio files, and fine-tune our model.

pip install torch torchaudio transformers datasets soundfile jiwer

Here is what each library is used for:

  • transformers: Provides pre-trained Wav2Vec2 models for speech recognition
  • datasets: Loads and processes speech datasets
  • torchaudio: Handles audio processing and manipulation
  • soundfile: Reads and writes .wav files
  • jiwer: Computes the WER for evaluating ASR performance
Step 2: Loading a Lightweight Speech Dataset

Instead of using large datasets like Common Voice, we use SUPERB KS, a small dataset ideal for quick experimentation. This dataset consists of short spoken commands like "yes," "no," and "stop."

from datasets import load_dataset

dataset = load_dataset("superb", "ks", split="train[:1%]")  # Load only 1% of the data for quick testing

print(dataset)

This loads a tiny subset of the dataset to reduce computational cost while still allowing us to fine-tune the model. Warning: the dataset still requires storage space, so be mindful of disk usage when working with larger splits.
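
If you want to inspect what a single record contains before preprocessing, print one example. The snippet below assumes the fields exposed by the superb "ks" configuration (an audio dictionary plus an integer keyword label):

example = dataset[0]
print(example["audio"]["path"])           # path to the underlying .wav file
print(example["audio"]["sampling_rate"])  # 16 kHz for this dataset
print(example["label"])                   # integer id of the spoken keyword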

Step 3: Preprocessing the Audio Data

To train our ASR model, we need to ensure that the audio data is in the correct format. The Wav2Vec2 model requires:

  • 16 kHz sample rate
  • No padding or truncation (handled dynamically)

We define a function to process the audio and extract relevant features.

import torchaudio

# Map integer class ids to their keyword strings (e.g. 0 -> "yes")
label_names = dataset.features["label"].names

def preprocess_audio(batch):
    # Load the waveform from the audio file path
    speech_array, sampling_rate = torchaudio.load(batch["audio"]["path"])
    # Wav2Vec2 expects 16 kHz audio, so resample if needed
    if sampling_rate != 16000:
        speech_array = torchaudio.transforms.Resample(sampling_rate, 16000)(speech_array)
        sampling_rate = 16000
    batch["speech"] = speech_array.squeeze().numpy()
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = label_names[batch["label"]]  # Use the spoken keyword as the text target
    return batch

dataset = dataset.map(preprocess_audio)

This ensures all audio files are loaded correctly and formatted properly for further processing.
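
As a quick sanity check, you can print one processed example (the field names below are the ones created by preprocess_audio):

sample = dataset[0]
print(sample["sampling_rate"])  # should now be 16000
print(len(sample["speech"]))    # number of audio samples (KS clips are about one second long)
print(sample["target_text"])    # e.g. "yes"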

Step 4: Loading a Pre-trained Wav2Vec2 Model

We use a pre-trained Wav2Vec2 model from Hugging Face's model hub. This model has already been trained on a large dataset and can be fine-tuned for our specific task.

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Here we create both the processor, which converts raw audio into model-ready features, and the model itself: a Wav2Vec2 network trained on 960 hours of LibriSpeech audio.
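
To see what these two components do, you can push a short dummy waveform through them. This is only an illustrative sketch; the time dimension of the output depends on the input length, since Wav2Vec2 downsamples the waveform by roughly a factor of 320:

import torch

dummy_waveform = torch.zeros(16000).numpy()  # one second of silence at 16 kHz
inputs = processor(dummy_waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

print(inputs.input_values.shape)  # (1, 16000) -- the normalized waveform
print(logits.shape)               # (1, time_steps, vocab_size) -- per-frame character scores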

Step 5: Preparing Data for the Model

We must tokenize and encode the audio so that the model can understand it.

def preprocess_for_model(batch):
    # Convert the raw waveform into normalized input features for Wav2Vec2
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    batch["input_values"] = inputs.input_values[0]
    return batch

dataset = dataset.map(preprocess_for_model, remove_columns=["speech", "sampling_rate", "audio"])

This step ensures that our dataset is compatible with the Wav2Vec2 model.
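
One caveat: the Trainer in Step 7 computes the CTC loss from a labels column, which the steps above do not create. The sketch below is one way to add it, assuming target_text holds the keyword string and relying on the fact that facebook/wav2vec2-base-960h uses an upper-case character vocabulary:

def add_labels(batch):
    # Encode the target keyword as character ids for the CTC loss
    # (the 960h checkpoint's vocabulary is upper-case, hence .upper())
    batch["labels"] = processor.tokenizer(batch["target_text"].upper()).input_ids
    return batch

dataset = dataset.map(add_labels)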

Step 6: Defining Training Arguments

Before training, we need to set up our training configuration. This includes batch size, learning rate, and optimization steps.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2",
    per_device_train_batch_size=4,
    evaluation_strategy="no",  # no separate eval dataset is passed to the Trainer below
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=1e-4,
    warmup_steps=500,
    max_steps=4000,
    save_total_limit=2,
    gradient_accumulation_steps=2,
    fp16=True,
    push_to_hub=False,
)
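
With per_device_train_batch_size=4 and gradient_accumulation_steps=2, gradients are accumulated over two forward passes before each optimizer step, for an effective batch size of 8. Also note that fp16=True requires a CUDA-capable GPU; set it to False if you are experimenting on CPU.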

Step 7: Training the Model

Using Hugging Face's Trainer, we fine-tune our Wav2Vec2 model.

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor,
)

trainer.train()
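
As written, the Trainer relies on default batching, but Wav2Vec2 examples have variable-length input_values (and labels, if you added them in Step 5), so fine-tuning normally uses a padding data collator. Below is a minimal sketch of such a collator, assuming each example carries input_values and labels as created earlier; pass it to the Trainer via its data_collator argument if you hit padding or shape errors:

import torch

def collate_fn(features):
    # Pad the audio inputs and the label ids to the longest example in the batch
    audio_features = [{"input_values": f["input_values"]} for f in features]
    label_features = [{"input_ids": f["labels"]} for f in features]

    batch = processor.feature_extractor.pad(audio_features, padding=True, return_tensors="pt")
    labels_batch = processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")

    # Replace padding token ids with -100 so the CTC loss ignores them
    batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].eq(0), -100)
    return batch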

Step 8: Evaluating the Model

To measure how well our model transcribes speech, we compute the WER.

import torch
from jiwer import wer

def transcribe(batch):
    # Re-batch the stored input values and run a forward pass without gradients
    inputs = processor(batch["input_values"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy decoding: take the most likely token at each time step
    predicted_ids = torch.argmax(logits, dim=-1)
    # Lower-case the decoded text so it matches the lower-case keyword targets
    batch["predicted_text"] = processor.batch_decode(predicted_ids)[0].lower()
    return batch

results = dataset.map(transcribe)

wer_score = wer(results["target_text"], results["predicted_text"])
print(f"Word Error Rate: {wer_score:.2f}")

A lower WER score indicates better performance.
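
For intuition, WER is the number of word-level substitutions, deletions, and insertions divided by the number of words in the reference. A tiny worked example with jiwer:

from jiwer import wer

# One substitution ("light" vs. "lights") out of four reference words -> WER = 0.25
print(wer("turn the lights on", "turn the light on"))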

Step 9: Running Inference on New Audio

Finally, we can use our trained model to transcribe real-world speech.

import torchaudio

# example.wav is assumed to be a 16 kHz mono recording
speech_array, sampling_rate = torchaudio.load("example.wav")
inputs = processor(speech_array.squeeze().numpy(), sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
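
The snippet above assumes example.wav is already a 16 kHz recording. If your file uses a different sample rate, resample it before calling the processor, for example with torchaudio:

if sampling_rate != 16000:
    # Resample to the 16 kHz rate that Wav2Vec2 expects
    speech_array = torchaudio.transforms.Resample(sampling_rate, 16000)(speech_array)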

Conclusion

And that's it. You've successfully built an ASR system using PyTorch & Hugging Face with a lightweight dataset.