Building a Smart Speaker from scratch, P1: Goals, Architecture and Hardware

In this blog series we're going to go through some of the hurdles in building a smart speaker from scratch. So what are the personal objectives?

Sharpen our embedded design skills on one of the industry's most ubiquitous microcontrollers (ESP32)
Take the blob protocol for a test run on a real design project, to see if it speeds up, or adds pedagogical value to, the design process.
Have a neat platform to solve some deep learning problems - perhaps noise reduction, echo reduction or some spatial audio capture!

As a part of this, we'll figure out how far we can go before we push the ESP32 to its limits - how much processing can we do on the edge before we need to offload it to cloud compute?

Project Goals

To support the personal objectives of the project, let's create some technical objectives:

Personal objective	Technical Objective	Reason
Sharpen our embedded design skills on one of the industry's most ubiquitous microcontrollers (ESP32)	Real-time music streaming and/or half-duplex audio (push-to-talk) communication	Real-time audio should be a fairly intermediate-difficulty task to jump into. Negotiating low-latency playback and capture with the audio drivers, and streaming music over the network in a glitch-free manner, should present a few technical problems and, most importantly, produce a really cool end result!
	Build a capacitive touch sensor array	Cap touch sensors are a really elegant and satisfying way to interact with a product. You can also build a lot of complex functionality in software rather than in hardware, e.g. use the same array for volume, start/stop, mute, etc.
Take the blob protocol for a test run on a real design project, to see if it speeds up, or adds pedagogical value to, the design process.	Get at least 1 full audio channel streaming and plotting in real time in the browser, and the full multi-channel data saved to a file for debugging/training.	Being able to visualise a large set of data in real time will help us debug, log and solve more complex problems.
Have a neat platform to solve some deep learning problems - perhaps noise reduction, echo reduction or some spatial audio capture!	Use some device logs to train a beamforming algorithm using data collected on two microphones.	I left the spatial audio capture scene at Dolby, where I was building linear beamformers. These are subject to aliasing and many other noise gain issues; it would be interesting to compare the performance of some linear beamforming solutions with a deep neural network.
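To make the aliasing concern in the last row concrete, here is a rough sketch of two textbook quantities for a two-microphone linear array: the frequency above which the array starts to alias spatially (when the spacing exceeds half a wavelength), and the inter-microphone delay a delay-and-sum beamformer would compensate for. The 4 cm spacing in the usage lines is purely an assumed value for illustration, not a measurement from this build.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at about 20 degrees C


def spatial_aliasing_limit_hz(mic_spacing_m: float, c: float = SPEED_OF_SOUND_M_S) -> float:
    """Frequency at which the mic spacing equals half a wavelength.

    Above this frequency a two-mic linear array can no longer steer
    unambiguously (spatial aliasing).
    """
    if mic_spacing_m <= 0:
        raise ValueError("mic spacing must be positive")
    return c / (2.0 * mic_spacing_m)


def steering_delay_s(mic_spacing_m: float, angle_deg: float, c: float = SPEED_OF_SOUND_M_S) -> float:
    """Arrival-time difference between the two mics for a far-field plane wave.

    angle_deg is measured from broadside (0 = straight ahead, 90 = along the array).
    """
    return mic_spacing_m * math.sin(math.radians(angle_deg)) / c


# Assumed spacing of 4 cm, for illustration only.
limit_hz = spatial_aliasing_limit_hz(0.04)
endfire_delay_s = steering_delay_s(0.04, 90.0)
```

With the assumed 4 cm spacing the limit lands a little above 4 kHz, which is well inside the band a voice product cares about, so this is the kind of constraint a learned beamformer would be compared against.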

Architecture

The project will be somewhat complex in that there will be more processes than just the device on the edge. This is to enable rich debugging, logging, and compute power that does not exist on the device itself.

One element whose value we might immediately question is the packet forwarding server - is it really necessary to have a "middle-man" between the smart speaker and the other devices? Could we simply build a packet forwarding server on the ESP32 itself?

The answer is yes, we could. But as we add more devices to the system (e.g. a voice-controlled robot or a wearable with on-board gesture recognition), a mesh network could become a little more difficult to manage. So, to start with, we are building this distributed network up from scratch such that the only device that needs to worry about message forwarding is the server itself.
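As a rough illustration of why a central forwarder keeps the devices simple, here is a minimal in-memory sketch of a hub that fans messages out to whoever has subscribed. Everything in it (the ForwardingHub class, the topic string, the handler signature) is hypothetical and chosen for the example; it says nothing about how the blob protocol or the real server is implemented, and it leaves out the network transport entirely.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A handler receives the topic and the raw payload bytes.
Handler = Callable[[str, bytes], None]


class ForwardingHub:
    """Hypothetical hub: devices only talk to the hub, never to each other."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to receive every payload published on a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Forward a payload to all subscribers; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


# Usage: a browser plotter and a file logger both listen to the speaker's audio.
hub = ForwardingHub()
plotted: List[bytes] = []
logged: List[bytes] = []
hub.subscribe("speaker/audio", lambda topic, data: plotted.append(data))
hub.subscribe("speaker/audio", lambda topic, data: logged.append(data))
delivered = hub.publish("speaker/audio", b"\x00\x01\x02\x03")
```

The speaker publishes once and does not need to know how many consumers exist, which is the property that makes adding a robot or a wearable later a server-side change only.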

Hardware selection

All of the device hardware has been purchased through Digi-Key.

For the microphones, we have chosen 2x Knowles SPH0645 digital MEMS microphones, because they boast a decent signal-to-noise ratio (65 dBA), are small, and Adafruit sells them pre-soldered on a breakout board.
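Since part of the goal is to find out where the ESP32 runs out of headroom, it helps to know how much raw data two digital microphones produce. The sketch below is a back-of-envelope calculation only; the 16 kHz sample rate and 32-bit I2S slot width are assumptions made for the example, not settings taken from this build.

```python
def i2s_payload_rate_bytes_per_s(sample_rate_hz: int, channels: int, slot_bits: int = 32) -> int:
    """Raw audio payload rate for an I2S capture stream, in bytes per second."""
    if sample_rate_hz <= 0 or channels <= 0 or slot_bits <= 0:
        raise ValueError("all arguments must be positive")
    return sample_rate_hz * channels * slot_bits // 8


# Assumed configuration: 2 microphones, 16 kHz, 32-bit slots.
mic_bytes_per_s = i2s_payload_rate_bytes_per_s(16_000, 2)
mic_bits_per_s = mic_bytes_per_s * 8
```

Under those assumptions the two microphones generate 128,000 bytes per second (about 1 Mbit/s) before any packet framing, which gives a first estimate of what streaming the full multi-channel data off the device would cost.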

For the speaker amplifiers, we are going to use 2x Analog Devices MAX98357A I2S Class D amplifiers, since each can deliver more than enough power (3.2 W), which over the speakers we are selecting could produce up to 89 dB SPL at 0.5 m without significant distortion (quite loud for a small speaker!). Again, these are provided on an Adafruit breakout board.
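For readers who want to sanity-check a loudness figure like this, the usual free-field approximation adds 10·log10(power) to the driver's sensitivity (quoted at 1 W / 1 m) and subtracts 20·log10(distance). The sketch below applies that rule; it treats the speaker as a point source in free field and assumes the full 3.2 W is delivered, which may not be how the 89 dB figure was derived. The sensitivity it prints is therefore a back-calculated value, not a datasheet number.

```python
import math


def spl_db(sensitivity_db_1w_1m: float, power_w: float, distance_m: float) -> float:
    """Estimated SPL for a point source in free field."""
    return sensitivity_db_1w_1m + 10.0 * math.log10(power_w) - 20.0 * math.log10(distance_m)


def implied_sensitivity_db(target_spl_db: float, power_w: float, distance_m: float) -> float:
    """Sensitivity (dB SPL at 1 W / 1 m) that a given SPL claim would imply."""
    return target_spl_db - 10.0 * math.log10(power_w) + 20.0 * math.log10(distance_m)


# Working backwards from the figures quoted in the text (assumes full 3.2 W drive).
sensitivity = implied_sensitivity_db(89.0, 3.2, 0.5)
round_trip = spl_db(sensitivity, 3.2, 0.5)
```

Under these assumptions the claim corresponds to a driver sensitivity of roughly 78 dB SPL at 1 W / 1 m; if the real driver is more sensitive than that, the 89 dB figure would be reached with less than the amplifier's full output.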

For the speakers, we are using 2x PUI Audio 668-1125-ND speaker drivers. This is a reasonably full-band driver (at least for voice communications), with a bandwidth of 100 Hz to 20 kHz. For high-fidelity music you typically want to reach down to 20 Hz, but for the size of the speaker we are trying to build this is a reasonable starting point!

Last but not least, we will use an ESP32 S3-mini! These are relatively new devices (released within the last 12 months) but are packed full of peripherals and are sold at a great price point. I can definitely see these being used for other projects in the future. They come on breakout evaluation boards supplied by Espressif!
