Step-by-Step Guide to Using Whisper-CPP for Meeting Summaries on Apple Silicon

Introduction

Recently, I attended a training session where the host used many filler words and repeated examples, so it ran much longer than it needed to. Fortunately, I recorded the session with the Tencent Meeting software. My idea was to extract the audio from the recording (there is barely any movement on screen during the session anyway), convert the audio to text, and finally let GPT-4 summarize it for me.

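The solution is straightforward: FFmpeg (to extract the audio), an ASR model (to transcribe the speech), and ChatGPT (preferably a long-context variant such as gpt-4-32k or gpt-3.5-turbo-16k, as a full transcript is usually too long for the standard GPT-4 model and its 8K-token context window).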

DIY

FFMPEG

There are two purposes for using ffmpeg here:

  1. ffmpeg can extract the audio track from the video clip into a standalone audio file.

  2. ffmpeg can convert an existing audio file to an uncompressed format such as WAV, which most ASR models require. Keep in mind that ffmpeg cannot improve the quality of the source audio: re-encoding at a higher bitrate or sample rate does not restore detail that was never recorded.

Installation is easy with a package manager such as Homebrew:

brew install ffmpeg

Run the command below, replacing input.mp4 and output.wav with your own file names:

ffmpeg -i input.mp4 -acodec pcm_s16le -ac 2 -ar 16000 output.wav
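If you have more than one recording, the same conversion can be wrapped in a small shell function. This is only a sketch: wav_path_for and extract_audio are hypothetical names, the audio flags mirror the command above, and -vn (which tells ffmpeg to skip the video stream) is an extra flag that is not in the original command. It assumes the input file name has an extension.

```shell
# Hypothetical helper: print the .wav path that sits next to the input file.
# Assumes the file name contains an extension (e.g. meeting.mp4).
wav_path_for() {
  local input="$1"
  printf '%s.wav\n' "${input%.*}"
}

# Hypothetical helper: extract 16 kHz, 16-bit PCM audio from a recording.
extract_audio() {
  local input="$1"
  local output
  output="$(wav_path_for "$input")"
  # -vn skips the video stream; the other flags match the command above.
  ffmpeg -i "$input" -vn -acodec pcm_s16le -ac 2 -ar 16000 "$output"
}
```

With these defined in your shell, extract_audio ~/Downloads/meeting.mp4 should write meeting.wav next to the original file.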

Whisper

Strictly speaking, we are using whisper.cpp (packaged in Homebrew as whisper-cpp), a high-performance C/C++ port of OpenAI's Whisper automatic speech recognition (ASR) model. Its most important feature here is its Apple Silicon support: it can run on the GPU through Metal, and it can use Core ML models to run the encoder on the Apple Neural Engine.

Installation

# Installing whisper is optional, but it pulls in many useful dependencies
brew install whisper
brew install whisper-cpp

# Set this path; its models subdirectory will hold all downloaded models.
export GGML_METAL_PATH_RESOURCES=/opt/homebrew/share/whisper-cpp

Finally, you can check this page and manually download the model that suits you. I prefer the small model, which works perfectly well. After that, you can use the whisper-cpp command to transcribe any audio you want.
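Before transcribing, it can help to confirm that the model file is where you expect it. Here is a minimal sketch, assuming models use the ggml-<size>.bin naming and live in the models subdirectory of GGML_METAL_PATH_RESOURCES. model_path and have_model are hypothetical helper names, not part of whisper-cpp.

```shell
# Hypothetical helper: print where a model of the given size is expected.
# Falls back to the Homebrew path used above if the variable is unset.
model_path() {
  local size="${1:-small}"
  local root="${GGML_METAL_PATH_RESOURCES:-/opt/homebrew/share/whisper-cpp}"
  printf '%s/models/ggml-%s.bin\n' "$root" "$size"
}

# Hypothetical helper: succeed only if that model file already exists.
have_model() {
  [ -f "$(model_path "$1")" ]
}
```

For example, have_model small || echo "download the small model first" prints the reminder only when the file is missing.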

The basic command is shown below.

-m specifies the model file.
-f specifies the input audio file.
-l specifies the spoken language (zh here).
-otxt writes the transcript as a .txt file; other output formats are also supported.
-of sets the output file path without the extension (hello becomes hello.txt).
whisper-cpp -m ./models/ggml-small.bin -f ~/Downloads/audio.wav -l zh -otxt -of hello
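To avoid retyping the options, the call can be wrapped in a function. This is a sketch under a few assumptions: run and transcribe are hypothetical names, DRY_RUN is a made-up environment variable that only prints the command, and the transcript is written next to the audio file. It passes the input with whisper.cpp's -f option and sets the output path (without extension) with -of.

```shell
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: transcribe WAV [LANGUAGE] [MODEL]
# Writes the transcript next to the audio, e.g. audio.wav -> audio.txt.
transcribe() {
  local wav="$1"
  local lang="${2:-zh}"
  local model="${3:-./models/ggml-small.bin}"
  run whisper-cpp -m "$model" -f "$wav" -l "$lang" -otxt -of "${wav%.*}"
}
```

Running DRY_RUN=1 transcribe ~/Downloads/audio.wav first is a cheap way to check the command line before starting a long transcription.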

GPT-4 integration

I won't cover this part in detail because there are already many similar posts, whether they use GPT-4, Copilot, or the APIs provided by Microsoft Azure or OpenAI. The main thing to get right is choosing a long-context model (for example a 16K or 32K variant). Meeting transcripts are usually very long, and the standard GPT-4 model, with its 8K-token context window, cannot take them in a single request.
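If even a long-context model cannot take the whole transcript, one workaround is to cut it into pieces and summarize them one by one. The sketch below uses the standard split utility. split_transcript is a hypothetical name, and the default of 200 lines per chunk is an arbitrary guess, since line count is only a rough proxy for token count.

```shell
# Hypothetical helper: split_transcript FILE [LINES_PER_CHUNK] [PREFIX]
# Cuts the transcript into pieces of at most LINES_PER_CHUNK lines.
split_transcript() {
  local file="$1"
  local lines="${2:-200}"
  local prefix="${3:-chunk_}"
  # split names the pieces PREFIXaa, PREFIXab, ... in order.
  split -l "$lines" "$file" "$prefix"
}
```

For example, split_transcript hello.txt 150 part_ would produce part_aa, part_ab, and so on, each small enough to paste into a separate request.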

End

Have fun! I believe this is a good way to win back the time lost to long and boring meetings.