I'm not sure what to make of this, but the ESP32 Hardware Design Guidelines, available at
http://espressif.com/en/products/hardware/esp32/resources, has this on the last page:
5.2 ESP32-Lyra Smart Audio Platform
ESP32-Lyra is a cost-effective smart audio platform, which is specifically designed by Espressif for the IoT industry. With its ESP32 dual-core processor and Wi-Fi + BT capability, ESP32-Lyra features voice recognition, audio playing, and access to cloud services. The ESP32-Lyra platform supports systems of artificial intelligence, voice and image recognition, wireless audio systems, as well as smart home networks.
The ESP32-Lyra Smart Audio Platform has the following features:
• Support for multiple audio interfaces with high extensibility
• Support for touch buttons
• Support for multiple audio formats including WMA, ALAC, AAC, FLAC, OPUS, MP3, WAV, and OGG
• Support for multiple wireless audio standards including DLNA, AirPlay and QPlay
• Support for multiple cloud platforms including Ximalaya FM, YunOS and Amazon
• Support for multiple distribution network protocols including ESP-TOUCH, ALINK, JoyLink3.0 and AirKiss
I've been unable to find any additional documentation, but it appears that the $7 ESP32 chip has built-in speech capabilities. It is unclear at this time what other hardware, if any, is needed.