Implementing Speech-to-Text Conversion with React and Web Speech API

In this blog post, we'll explore how to implement speech-to-text conversion in a React application using the Web Speech API. This functionality can be incredibly useful for accessibility features, voice commands, or any application that benefits from voice input.

Understanding the Core Functionality

Let's break down the key parts of our SpeechRecognition component that enable speech-to-text conversion:

Component Setup and State Management

First, we'll create our SpeechRecognition component and initialize its state.

src/components/speech-recognition.tsx

const SpeechRecognition: React.FC = () => {
  const [isListening, setIsListening] = useState<boolean>(false);
  const [transcript, setTranscript] = useState<string>('');
  const [recognition, setRecognition] = useState<CustomSpeechRecognition | null>(null);
  const [error, setError] = useState<string>('');

  // other code...

};

We use React hooks to manage our component's state:

  1. isListening: whether the microphone is currently active.
  2. transcript: the text recognized so far.
  3. recognition: the speech recognition instance (null until initialized, or if unsupported).
  4. error: an error message to show the user, if any.

Initializing Speech Recognition

When the component mounts, the first thing we need to do is check that the Web Speech API is supported.

src/components/speech-recognition.tsx

useEffect(() => {
  if ('webkitSpeechRecognition' in window) {
    const recognitionInstance = new window.webkitSpeechRecognition(); 
    recognitionInstance.continuous = true;
    recognitionInstance.interimResults = true;

    // Event handlers setup...

    setRecognition(recognitionInstance);
  } else {
    setError('Speech recognition is not supported in this browser.');
  }
}, []);

In the useEffect hook, we:

  1. Check if the browser supports speech recognition.
  2. If supported, create a new instance of webkitSpeechRecognition. This object provides the methods and properties for controlling speech recognition: starting and stopping it, accessing the recognized text, and handling the various recognition events. (This constructor is vendor-prefixed; see the cross-browser sketch after this list.)
  3. Set it to be continuous (so it doesn't stop listening after the first recognized phrase) and to provide interim results (for real-time feedback).
  4. Set up event handlers (which we'll discuss next).
  5. Save the instance to our state.
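Note that webkitSpeechRecognition is the vendor-prefixed constructor used by Chromium-based browsers. If you also wanted to cover browsers that expose the unprefixed constructor, a small feature-detection helper is one option. Here's a minimal sketch; the unprefixed window.SpeechRecognition property isn't part of our Window declaration, hence the cast:

const getRecognitionConstructor = (): (new () => CustomSpeechRecognition) | null => {
  // Prefer the unprefixed constructor where available.
  if ('SpeechRecognition' in window) {
    return (window as any).SpeechRecognition;
  }
  // Fall back to the webkit-prefixed constructor.
  if ('webkitSpeechRecognition' in window) {
    return window.webkitSpeechRecognition;
  }
  // Neither exists: speech recognition is unsupported.
  return null;
};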

Handling Speech Recognition Events

The Web Speech API notifies us of several key events:

src/components/speech-recognition.tsx

recognitionInstance.onresult = (event: SpeechRecognitionEvent) => {
  const current = event.resultIndex;
  const transcript = event.results[current][0].transcript;
  setTranscript(transcript);
};

recognitionInstance.onerror = (event: SpeechRecognitionError) => {
  console.error('Speech recognition error', event.error);
  setIsListening(false);
  // Error handling logic...
};

recognitionInstance.onend = () => {
  setIsListening(false);
};

We set up three crucial event handlers:

  1. onresult: fires whenever the recognizer produces results; we take the most recent result and store its transcript in state (see the sketch after this list for accumulating results instead).
  2. onerror: logs the error, resets the listening state, and surfaces a user-friendly message.
  3. onend: fires when recognition stops for any reason, so we reset the listening state.
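Because interimResults is enabled, onresult fires repeatedly while you speak, and the handler above keeps only the most recent result. If you wanted a running transcript instead, each result carries an isFinal flag that separates confirmed phrases from in-progress guesses. A minimal sketch of an accumulating handler:

recognitionInstance.onresult = (event: SpeechRecognitionEvent) => {
  let finalText = '';
  let interimText = '';
  // Walk every result produced so far; isFinal marks confirmed phrases.
  for (let i = 0; i < event.results.length; i++) {
    const result = event.results[i];
    if (result.isFinal) {
      finalText += result[0].transcript;
    } else {
      interimText += result[0].transcript;
    }
  }
  setTranscript(finalText + interimText);
};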

Toggling Speech Recognition

In a real-world app, it's useful to be able to toggle speech recognition on and off. This can be achieved as follows:

src/components/speech-recognition.tsx

const toggleListening = useCallback(() => {
  if (isListening) {
    recognition?.stop();
  } else {
    setError('');
    navigator.mediaDevices
      .getUserMedia({ audio: true })
      .then(() => {
        recognition?.start();
        setIsListening(true);
      })
      .catch((err: Error) => {
        console.error('Error accessing the microphone', err);
        setError(
          'Unable to access the microphone. Please ensure you have given permission.',
        );
      });
  }
}, [isListening, recognition]);

This function toggles speech recognition on and off. When starting:

  1. It first requests microphone access (see the permission sketch after this list).
  2. If granted, it starts the speech recognition.
  3. If denied, it sets an error message.
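A note on the design: calling getUserMedia before start() forces the permission prompt up front, which gives us a clean place to surface an error before recognition begins. If you wanted to inspect the permission state without prompting at all, the Permissions API is one option. This is a sketch; 'microphone' as a permission name is Chromium-specific and isn't in TypeScript's PermissionName union, hence the cast and the try/catch:

// Query microphone permission without triggering a prompt.
const checkMicPermission = async (): Promise<PermissionState | 'unknown'> => {
  if (!navigator.permissions) return 'unknown';
  try {
    const status = await navigator.permissions.query({
      name: 'microphone' as PermissionName,
    });
    return status.state; // 'granted' | 'denied' | 'prompt'
  } catch {
    // The browser doesn't recognize this permission name.
    return 'unknown';
  }
};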

Rendering the Component

src/components/speech-recognition.tsx

return (
  <div className={styles.container}>
    {error && <div className={styles.errorMessage}>{error}</div>}
    <Button
      onClick={toggleListening}
      variant="primary"
      userStyles={styles.button}
      disabled={!recognition}
    >
      {isListening ? 'Stop Listening' : 'Start Listening'}
    </Button>
    <div className={styles.transcript}>
      <p>{transcript}</p>
    </div>
  </div>
);

Our component renders:

  1. An error message (if any).
  2. A button to start/stop listening.
  3. The current transcript.
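Dropping the component into a page is then a one-liner. For example, in a Next.js App Router page (the file path and import alias below are illustrative):

// src/app/speech-demo/page.tsx (illustrative)
import SpeechRecognition from '@/components/speech-recognition';

export default function SpeechDemoPage() {
  return <SpeechRecognition />;
}

Because the component starts with the 'use client' directive, it can be rendered from a server component like this without any further configuration.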

Handling TypeScript Errors

When the code is first compiled, it produces some TypeScript errors, because the webkitSpeechRecognition constructor and its event types aren't part of the standard DOM typings. To fix these, we have to declare some new types:

src/components/speech-recognition.tsx

interface SpeechRecognitionEvent extends Event {
  resultIndex: number;
  results: SpeechRecognitionResultList;
}

interface SpeechRecognitionError extends Event {
  error: string;
}

interface CustomSpeechRecognition extends EventTarget {
  continuous: boolean;
  interimResults: boolean;
  start: () => void;
  stop: () => void;
  onresult: (event: SpeechRecognitionEvent) => void;
  onerror: (event: SpeechRecognitionError) => void;
  onend: () => void;
}

declare global { 
  interface Window { 
    webkitSpeechRecognition: new () => CustomSpeechRecognition; 
  } 
}

Then, using TypeScript's declaration merging feature, we extend the Window interface to include the webkitSpeechRecognition property, allowing TypeScript to recognize it as valid.

Conclusion

This React component demonstrates the core functionality of speech-to-text conversion using the Web Speech API. It handles:

  1. Detecting whether the browser supports the API.
  2. Requesting microphone access and surfacing permission errors.
  3. Starting and stopping recognition on demand.
  4. Displaying the transcript in real time.

Building this component illustrates the core concepts of using the Web Speech API to bring speech-to-text capability to the browser. It was a fun project to build, but it's clear from the results that, while the technology is cool, it needs further refinement before it could be used in a production environment. The potential of such a capability is vast, and it could seriously enhance existing and future web applications.

Complete Code

Here's a complete listing of the code:


src/components/speech-recognition.tsx

'use client';
import React, { useState, useEffect, useCallback } from 'react';
import Button from '../button/button';
import Tooltip from '../tooltip/tooltip';
import { styles } from './web-speech.styles';

const micOn = (
  <svg
    xmlns="http://www.w3.org/2000/svg"
    height="24px"
    viewBox="0 -960 960 960"
    width="24px"
    fill="#fff"
  >
    <path d="M480-400q-50 0-85-35t-35-85v-240q0-50 35-85t85-35q50 0 85 35t35 85v240q0 50-35 85t-85 35Zm0-240Zm-40 520v-123q-104-14-172-93t-68-184h80q0 83 58.5 141.5T480-320q83 0 141.5-58.5T680-520h80q0 105-68 184t-172 93v123h-80Zm40-360q17 0 28.5-11.5T520-520v-240q0-17-11.5-28.5T480-800q-17 0-28.5 11.5T440-760v240q0 17 11.5 28.5T480-480Z" />
  </svg>
);
const micOff = (
  <svg
    xmlns="http://www.w3.org/2000/svg"
    height="24px"
    viewBox="0 -960 960 960"
    width="24px"
    fill="#fff"
  >
    <path d="m710-362-58-58q14-23 21-48t7-52h80q0 44-13 83.5T710-362ZM480-594Zm112 112-72-72v-206q0-17-11.5-28.5T480-800q-17 0-28.5 11.5T440-760v126l-80-80v-46q0-50 35-85t85-35q50 0 85 35t35 85v240q0 11-2.5 20t-5.5 18ZM440-120v-123q-104-14-172-93t-68-184h80q0 83 57.5 141.5T480-320q34 0 64.5-10.5T600-360l57 57q-29 23-63.5 39T520-243v123h-80Zm352 64L56-792l56-56 736 736-56 56Z" />
  </svg>
);

interface SpeechRecognitionEvent extends Event {
  resultIndex: number;
  results: SpeechRecognitionResultList;
}

interface SpeechRecognitionError extends Event {
  error: string;
}

interface CustomSpeechRecognition extends EventTarget {
  continuous: boolean;
  interimResults: boolean;
  start: () => void;
  stop: () => void;
  onresult: (event: SpeechRecognitionEvent) => void;
  onerror: (event: SpeechRecognitionError) => void;
  onend: () => void;
}

declare global {
  interface Window {
    webkitSpeechRecognition: new () => CustomSpeechRecognition;
  }
}

const SpeechRecognition: React.FC = () => {
  const [isListening, setIsListening] = useState<boolean>(false);
  const [transcript, setTranscript] = useState<string>('');
  const [recognition, setRecognition] =
    useState<CustomSpeechRecognition | null>(null);
  const [error, setError] = useState<string>('');

  useEffect(() => {
    if ('webkitSpeechRecognition' in window) {
      const recognitionInstance = new window.webkitSpeechRecognition();
      recognitionInstance.continuous = true;
      recognitionInstance.interimResults = true;

      recognitionInstance.onresult = (event: SpeechRecognitionEvent) => {
        const current = event.resultIndex;
        const transcript = event.results[current][0].transcript;
        setTranscript(transcript);
      };

      recognitionInstance.onerror = (event: SpeechRecognitionError) => {
        console.error('Speech recognition error', event.error);
        setIsListening(false);
        if (event.error === 'not-allowed') {
          setError(
            'Microphone access was denied. Please allow microphone access and try again.',
          );
        } else {
          setError(`Speech recognition error: ${event.error}`);
        }
      };

      recognitionInstance.onend = () => {
        setIsListening(false);
      };

      setRecognition(recognitionInstance);
    } else {
      setError('Speech recognition is not supported in this browser.');
    }
  }, []);

  const toggleListening = useCallback(() => {
    if (isListening) {
      recognition?.stop();
    } else {
      setError('');
      navigator.mediaDevices
        .getUserMedia({ audio: true })
        .then(() => {
          recognition?.start();
          setIsListening(true);
        })
        .catch((err: Error) => {
          console.error('Error accessing the microphone', err);
          setError(
            'Unable to access the microphone. Please ensure you have given permission.',
          );
        });
    }
  }, [isListening, recognition]);

  return (
    <div className={styles.container}>
      {error && <div className={styles.errorMessage}>{error}</div>}
      <Tooltip
        content={isListening ? 'Listening' : 'Not Listening'}
        position="right"
      >
        <Button
          onClick={toggleListening}
          variant="primary"
          userStyles={styles.button}
          disabled={!recognition}
          icon={isListening ? micOn : micOff}
          iconPosition="left"
        ></Button>
      </Tooltip>

      <div className={styles.transcript}>
        <p>{transcript}</p>
      </div>
    </div>
  );
};

export default SpeechRecognition;