Building a Smart Speaker from scratch, P2: MAX98357A driver design

The hardware arrived! And we 3D printed an enclosure :) Now to get started on the software...

Today we're going to get set up for embedded development: installing the embedded toolchains and build frameworks, setting up the project, writing a couple of key device drivers, and getting blob working with the smart speaker to unlock a multitude of software-driven potential!

Project setup

There are a couple of different questions we should answer before we go about building an embedded project like this.

Should we use the Arduino framework or the ESP-IDF framework?

Honestly, this was a bit of a tough one to answer without just trying it out... Arduino provides a whole bunch of simplified APIs for embedded development but, as we will discover, this often means sacrificing some of the functionality needed to build a real-time application. What we came to realise is that the Arduino APIs for the ESP32 are built on top of the ESP-IDF framework anyway. So, provided you know which version of ESP-IDF your Arduino core relies on, you can call ESP-IDF functions directly from an Arduino project!

So, Arduino it is.

Should we use the Arduino IDE or VSCode + PlatformIO?

This was a much less difficult question to answer. The Arduino IDE has traditionally been used by developers to bring up Arduino projects. However, PlatformIO - a framework to quickly bring Arduino APIs and 3rd party libraries to a given embedded platform - has great integration with VSCode and strong support for the ESP32 microcontrollers. VSCode is a killer IDE, and by using it we can leverage many of its other extensions, as well as its cross-language support. Since this particular project will span C++, JavaScript and Python, VSCode is the natural choice.

VSCode + PlatformIO Project

After installing the PlatformIO extension in VSCode, we set up a project for the ESP32-S3-DevKitC-1. This is not quite the same board as the ESP32-S3-DevKitM-1 (the mini module version), but the project was able to build and upload perfectly fine.

The platformio.ini file is configured as follows. It can be used as a starting point when creating a project from scratch through the PlatformIO setup:

; PlatformIO Project Configuration File
;
;   Build options: build flags, source filter
;   Upload options: custom upload port, speed and extra flags
;   Library options: dependencies, extra library storages
;   Advanced options: extra scripting
;
; Please visit documentation for the other options and examples
; https://docs.platformio.org/page/projectconf.html

[env:esp32-s3-devkitc-1]
platform = espressif32 ; ESP32
board = esp32-s3-devkitc-1 ; Not the DevkitM-1 but close enough
framework = arduino ; Use the arduino framework, not the ESP-IDF
upload_port = COM3 ; COM port to upload to
monitor_speed = 115200 ; Serial monitor speed
monitor_port = COM3 ; COM port to monitor

; Build flags. These enable UDP and IPv4/IPv6 packet reassembly (for large Blob packets)
build_flags =
    -D BLOB_ESP32_UDP=1
    -D CONFIG_LWIP_IP4_REASSEMBLY=1
    -D CONFIG_LWIP_IP6_REASSEMBLY=1
    -D LWIP_IPV4=1
    -D LWIP_FEATURES=1

; Depends on ESP32-UDP and Blob libraries
lib_deps =
    AsyncUDP
    Blob

To obtain the necessary library dependencies:

1. From the newly installed PlatformIO tab in VSCode, select "Libraries" and search for AsyncUDP, which is an Arduino-style wrapper around the network UDP functionality.

2. Clone the blob library from my GitHub into the /lib folder in the project repo:

https://github.com/jzmcke/blob

DMA buffers

It's time to get something playing! After scouring the internet, I found it pretty hard to find a nice wrapper for the MAX98357A which supported real-time, low-latency streaming. Most of the examples online were simply reading from a .wav file stored in SPIFFS flash storage, or generating simple tones in the code itself.

We want MORE! We want to be able to stream music from a computer, or voice from a friend who has a microphone... or have low-latency conversations with ChatGPT!

The key design criterion here is low latency. To achieve low latency, we need to be able to control the DMA buffers that interface to the speaker in such a way that, on average, they neither overflow nor underflow. This can be achieved most simply if we stream packets from the network at the same rate as the audio device reads a set number of samples from the DMA buffer.
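To make the overflow/underflow idea concrete, here is a minimal host-side sketch (plain C++, nothing ESP32-specific). The names `FillStats` and `simulate_fill` are hypothetical and not part of the project; the model simply assumes one network packet and one playback read per tick.

```cpp
#include <algorithm>

// Outcome of simulating a playback buffer for a number of ticks.
struct FillStats {
    int underflows;  // ticks where playback wanted more samples than were buffered
    int overflows;   // ticks where an incoming packet did not fit
    int final_fill;  // samples left in the buffer at the end
};

// Hypothetical model: every tick the network delivers `samples_in` samples and
// the audio device consumes `samples_out` samples from a buffer that can hold
// at most `capacity` samples.
FillStats simulate_fill(int capacity, int start_fill,
                        int samples_in, int samples_out, int n_ticks)
{
    FillStats stats{0, 0, start_fill};
    for (int tick = 0; tick < n_ticks; ++tick) {
        // Producer: a network packet arrives.
        if (stats.final_fill + samples_in > capacity) {
            ++stats.overflows;  // the excess samples are dropped
        }
        stats.final_fill = std::min(capacity, stats.final_fill + samples_in);

        // Consumer: the audio device plays one tick of audio.
        if (stats.final_fill < samples_out) {
            ++stats.underflows;  // playback would glitch here
        }
        stats.final_fill = std::max(0, stats.final_fill - samples_out);
    }
    return stats;
}
```

With matched rates (160 samples in, 160 out) the fill level stays put forever. Nudge the producer to 161 or 159 samples per tick and the same buffer eventually overflows or underflows, which is exactly why the network cadence and the audio cadence need to agree.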

DMA buffers operate in a ping-pong fashion. To begin, the user code writes into the top half of the DMA buffer (blue arrow) while the I2S audio output driver simultaneously reads content from the lower half of the DMA buffer (red arrow).

Next, when the I2S output driver has finished playing out the audio in the lower buffer (the red arrow makes it down to the bottom of the buffer at address 0xFFF), a notification is typically sent to the user code to signal that the data has been consumed. Then, the roles of each DMA half are reversed!

This enables the audio device to stream, avoiding the circumstance where the user code collides with the I2S device by overwriting the I2S playback audio in the middle of the buffer read.
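The role reversal can be modelled in a few lines. This is only a host-side sketch with a hypothetical `PingPongBuffer` class: on the real device the reader is the I2S peripheral, and the swap would be driven by its buffer-done notification rather than by a manual call.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Host-side model of a two-half (ping-pong) buffer.
template <std::size_t HalfLen>
class PingPongBuffer {
public:
    // The half that user code may safely fill right now.
    std::array<std::int16_t, HalfLen>& write_half()
    {
        return m_halves[m_write_idx];
    }

    // The half that the playback side is currently draining.
    const std::array<std::int16_t, HalfLen>& read_half() const
    {
        return m_halves[1 - m_write_idx];
    }

    // Call when the reader has finished its half: the roles reverse.
    void swap() { m_write_idx = 1 - m_write_idx; }

private:
    std::array<std::array<std::int16_t, HalfLen>, 2> m_halves{};  // zero-initialised
    std::size_t m_write_idx = 0;
};
```

Because the writer and the reader never touch the same half at the same time, a write can never clobber audio that is partway through being played.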

For this project, we are going to go for a 10 ms tick and a sample rate of 16 kHz. This is unfortunately not quite high enough for the best quality music streaming, but for the moment it is a limitation of the receivable UDP packet size in the ESP32's AsyncUDP library. This means that one half of the DMA buffer must be able to store 0.01 x 16000 = 160 samples of audio per output channel. We have two output channels and are outputting 16-bit audio, so the DMA buffers must be configured to store a total of 160 x 2 channels x 2 bytes per sample x 2 halves of the DMA buffer = 1280 bytes.

That calculation helps us reason about the memory footprint of our DMA buffers (~1.25 kB), but the ESP32 simply requires us to configure the I2S interface with the number of samples per channel, so 160. The number of DMA buffers (2) and the number of output channels (2) are set via other config parameters.
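The sizing arithmetic is easy to sanity check at compile time. The helper names below are made up for illustration and are not part of the driver.

```cpp
#include <cstddef>

// Samples per channel that one DMA buffer must hold for one tick of audio.
constexpr std::size_t samples_per_channel(std::size_t sample_rate_hz,
                                          std::size_t tick_ms)
{
    return sample_rate_hz * tick_ms / 1000;
}

// Total memory across all DMA buffers, in bytes.
constexpr std::size_t dma_total_bytes(std::size_t samples_per_ch,
                                      std::size_t n_channels,
                                      std::size_t bytes_per_sample,
                                      std::size_t n_buffers)
{
    return samples_per_ch * n_channels * bytes_per_sample * n_buffers;
}

// 16 kHz at a 10 ms tick: 160 samples per channel per buffer.
static_assert(samples_per_channel(16000, 10) == 160, "unexpected buffer length");

// Stereo, 16-bit samples, two buffers: 1280 bytes in total.
static_assert(dma_total_bytes(160, 2, 2, 2) == 1280, "unexpected footprint");
```

If you later change the sample rate or the tick, the same two functions tell you the new buffer length to configure and how much RAM it will cost.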

Driver Code

Again, I struggled to find a nice embedded wrapper for the MAX98357A and SPH0645 devices that offered a low-latency, glitchless and easy-to-use interface to the I2S bus... so I wrote them!

The full project source code can be found at https://github.com/jzmcke/smart-speaker/tree/main/src

MAX98357A design

The key usability features of the MAX98357A (output) driver are as follows:

The interface is configured with the ESP32 pins the amplifier is connected to.
The sample rate is configurable. For this project we use 16 kHz.
spkr_write_cadence_ms is the tick period, used to instantiate the DMA buffers to the correct length. This number should correspond to the cadence at which we expect to receive audio from the main loop.
b_is_output_dma_empty() is a public method, returning a boolean, which can be used to notify the main loop when the I2S device has finished reading from one half of the DMA buffer, so that half can be written to once again with the write() method. The I2S interface will ensure this is set to true whenever the DMA buffer empties, and it will be set to false when a write operation has completed.

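#include <stdlib.h>
#include "freertos/FreeRTOS.h"  // must come before any other FreeRTOS header
#include "freertos/queue.h"

#define MAX98357A_ERR (-1)
#define MAX98357A_OK  (0)


class MAX98357A
{
    public:
        // Configure the I2S port, the ESP32 pins wired to the amplifier,
        // the sample rate and the tick period used to size the DMA buffers.
        MAX98357A(int i2s_port_num
                 ,int pin_sdi   // serial data pin, wired to the amplifier's data input
                 ,int pin_bclk  // bit clock
                 ,int pin_lrck  // left/right (word select) clock
                 ,int sample_rate_hz
                 ,int spkr_write_cadence_ms);

        // Write one tick of audio for each of the two output channels.
        int write(float *p_data_ch1, float *p_data_ch2);
        bool m_b_configured = false;
        size_t m_send_size;
        int m_n_samples_per_ch;
        // True once the I2S device has emptied a DMA buffer; false again after write().
        bool b_is_output_dma_empty(void);

    private:
        int m_pin_sdo;
        int m_pin_bclk;
        int m_pin_lrck;
        int m_sample_rate_hz;
        int m_n_target_bytes_write;
        int m_n_dma_buffers;
        int m_i2s_port_num;
        bool m_b_ready_to_fill;
        unsigned char *m_p_audio_send_bytes;
        QueueHandle_t m_evt_queue;  // queue used for I2S event notifications
};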
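Since write() takes one float array per channel while the I2S bus carries 16-bit audio, some conversion has to happen on the way to the DMA buffer. The sketch below shows one plausible way to do it; it is not the driver's actual implementation. It assumes the float samples are normalised to [-1.0, 1.0] and that frames are interleaved as ch1, ch2, ch1, ch2, ... The function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert one float sample, assumed to lie in [-1.0, 1.0], to signed 16-bit PCM.
// Out-of-range input is clamped rather than allowed to wrap around.
std::int16_t float_to_pcm16(float x)
{
    const float clamped = std::max(-1.0f, std::min(1.0f, x));
    return static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
}

// Interleave two channels into a single buffer: ch1, ch2, ch1, ch2, ...
std::vector<std::int16_t> interleave_pcm16(const float* p_ch1,
                                           const float* p_ch2,
                                           std::size_t n_samples_per_ch)
{
    std::vector<std::int16_t> out(2 * n_samples_per_ch);
    for (std::size_t i = 0; i < n_samples_per_ch; ++i) {
        out[2 * i]     = float_to_pcm16(p_ch1[i]);
        out[2 * i + 1] = float_to_pcm16(p_ch2[i]);
    }
    return out;
}
```

Clamping matters here: without it, a sample slightly above 1.0 would wrap to a large negative value and produce a very audible click.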

SPH0645 design

Before reading, it is worth noting that for microphone capture, the DMA roles of the microcontroller and the I2S device are reversed! The microcontroller is the device consuming the data from the buffer, and the I2S device is writing the microphone audio to it!

So, in a mirrored fashion, here is the interface to the I2S audio capture device.

The interface is configured with the ESP32 pins the microphone is connected to.
The sample rate is configurable. For this project we use 16 kHz.
mic_read_cadence_ms is the tick period, used to instantiate the DMA buffers to the correct length. This number should correspond to the cadence at which we want to read audio from the main loop.
b_is_input_dma_full() is a public method, returning a boolean, which can be used to notify the main loop when the I2S device has finished writing to one half of the DMA buffer, so that half can be read with the read() method. The I2S interface will ensure this is set to true whenever the DMA buffer fills, and it will be set to false when a read operation has completed.
The read() method can be called, which populates the m_p_ch1 and m_p_ch2 variables with their microphone data.

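#include <stdlib.h>
#include "freertos/FreeRTOS.h"  // must come before any other FreeRTOS header
#include "freertos/queue.h"

#define SPH0645_ERR (-1)
#define SPH0645_OK  (0)


class SPH0645
{
    public:
        // Configure the I2S port, the ESP32 pins wired to the microphone,
        // the sample rate and the tick period used to size the DMA buffers.
        SPH0645(int i2s_port_num
               ,int pin_sdi   // serial data in, from the microphone
               ,int pin_bclk  // bit clock
               ,int pin_lrck  // left/right (word select) clock
               ,int sample_rate_hz
               ,int mic_read_cadence_ms);

        // Read one tick of audio; populates m_p_ch1 and m_p_ch2.
        int read();
        bool m_b_configured = false;
        float *m_p_ch1;
        float *m_p_ch2;
        size_t m_rcv_size;
        int m_n_samples_per_ch;
        // True once the I2S device has filled a DMA buffer; false again after read().
        bool b_is_input_dma_full(void);
    private:
        int m_pin_sdi;
        int m_pin_bclk;
        int m_pin_lrck;
        int m_sample_rate_hz;
        int m_n_target_bytes_read;
        int m_n_dma_buffers;
        int m_i2s_port_num;
        unsigned char *m_p_audio_rcv_bytes;
        float *m_p_audio_rcv_float;
        QueueHandle_t m_evt_queue;  // queue used for I2S event notifications
};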
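On the capture side the conversion runs in the other direction: raw I2S words come in, and per-channel floats come out. Here is a rough sketch of that step, not the driver's actual code. It assumes the raw data arrives as interleaved signed 32-bit words (ch1, ch2, ch1, ch2, ...) with the sample left-justified in each word; `deinterleave_to_float` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

// Split interleaved 32-bit I2S words into two float arrays scaled to [-1.0, 1.0).
// p_raw must hold 2 * n_samples_per_ch words; p_ch1 and p_ch2 must each have
// room for n_samples_per_ch floats.
void deinterleave_to_float(const std::int32_t* p_raw,
                           std::size_t n_samples_per_ch,
                           float* p_ch1,
                           float* p_ch2)
{
    constexpr float kScale = 1.0f / 2147483648.0f;  // 1 / 2^31
    for (std::size_t i = 0; i < n_samples_per_ch; ++i) {
        p_ch1[i] = static_cast<float>(p_raw[2 * i]) * kScale;
        p_ch2[i] = static_cast<float>(p_raw[2 * i + 1]) * kScale;
    }
}
```

Scaling by the full 32-bit range keeps the maths independent of how many of the low bits the microphone actually populates.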

Main loop

The following code demonstrates how these drivers are used:

https://github.com/jzmcke/smart-speaker/blob/main/src/main.cpp
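If you just want the shape of the polling pattern without opening the repo, here is a stripped-down host-side sketch. `FakeMic`, `FakeSpeaker` and `loop_once` are stand-ins invented for illustration; they only mimic the public method names of the two drivers, and the mic-to-speaker passthrough is a hypothetical use case rather than what main.cpp does.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the microphone driver: only the polling flag and read() are modelled.
class FakeMic {
public:
    explicit FakeMic(std::size_t n) : m_ch1(n, 0.25f), m_ch2(n, -0.25f) {}
    void hardware_tick() { m_full = true; }  // pretend a DMA buffer just filled
    bool b_is_input_dma_full() const { return m_full; }
    int read() { m_full = false; return 0; }
    std::vector<float> m_ch1;
    std::vector<float> m_ch2;
private:
    bool m_full = false;
};

// Stand-in for the speaker driver: only the polling flag and write() are modelled.
class FakeSpeaker {
public:
    void hardware_tick() { m_empty = true; }  // pretend a DMA buffer just emptied
    bool b_is_output_dma_empty() const { return m_empty; }
    int write(const float* p_ch1, const float* p_ch2)
    {
        (void)p_ch1;  // a real driver would copy the samples into the DMA buffer
        (void)p_ch2;
        m_empty = false;
        ++m_n_writes;
        return 0;
    }
    int m_n_writes = 0;
private:
    bool m_empty = false;
};

// One pass of a polling main loop: move audio only when both sides are ready.
bool loop_once(FakeMic& mic, FakeSpeaker& spkr)
{
    if (mic.b_is_input_dma_full() && spkr.b_is_output_dma_empty()) {
        mic.read();
        spkr.write(mic.m_ch1.data(), mic.m_ch2.data());
        return true;
    }
    return false;
}
```

The main loop never blocks on the audio hardware. It just checks the two flags on every pass and does the transfer when both say go.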

In main.cpp, we can stream data to the smart speaker from a networked application using blob. There is a Python application running on my computer, capturing the computer's output audio in loopback and transmitting it to the device.

https://github.com/jzmcke/blob/blob/main/script/stream_from_computer.py

And the playback result is this!