A MorphCast secret revealed
The R&D team at MorphCast, together with two university research institutes, has created an extremely compact yet powerful AI model for facial expression recognition that runs in the browser on devices such as smartphones, tablets, and PCs. This was made possible by a proprietary architecture of “deep” convolutional neural networks (CNNs).
What does DNN stand for?
In machine learning and artificial intelligence, a Deep Neural Network (DNN) is an artificial neural network with multiple layers between the input and output layers; these intermediate (hidden) layers enable it to learn high-level abstractions from data.
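To make the idea of stacked layers concrete, here is a minimal sketch of a forward pass through a small fully connected network with two hidden layers, in plain Python. The layer sizes, the ReLU activation, and the weight values are arbitrary toy choices for illustration; they are not taken from any MorphCast model, and a real network would learn its weights from data.

```python
def relu(xs):
    # Rectified linear activation, applied element-wise.
    return [max(0.0, x) for x in xs]


def dense(xs, weights, biases):
    # Fully connected layer: each output is a weighted sum of all inputs plus a bias.
    return [sum(w * x for w, x in zip(row, xs)) + b for row, b in zip(weights, biases)]


def forward(xs, layers):
    # Pass the input through every layer in order; ReLU after each hidden layer.
    for i, (weights, biases) in enumerate(layers):
        xs = dense(xs, weights, biases)
        if i < len(layers) - 1:
            xs = relu(xs)
    return xs


# Toy weights (not trained): 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, 0.0]),
    ([[0.7, -0.5, 0.2], [0.3, 0.6, -0.1]], [0.05, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]

print(forward([1.0, 2.0], layers))
```

Each hidden layer transforms the previous layer's output, which is what lets deeper networks build up more abstract features step by step.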
What does CNN stand for?
CNN stands for Convolutional Neural Network, a class of deep neural networks most commonly applied to analyzing visual imagery. A CNN uses a mathematical operation called convolution to process small, local regions of the input at a time, which makes it especially effective at capturing spatial relationships such as the arrangement of pixels in an image. This is why CNNs are widely used in computer vision.
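The convolution operation can be sketched in a few lines of Python. This is a minimal "valid" version (no padding, stride 1) written as cross-correlation, which is what most deep learning frameworks actually compute under the name convolution. The image and the vertical-edge filter are hand-picked for illustration; in a CNN the filter weights are learned during training.

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution (stride 1, no padding), computed as cross-correlation,
    which is what most deep learning frameworks implement under the name convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the kh x kw patch whose top-left corner is (i, j).
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        output.append(row)
    return output


# 4x6 grayscale image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

# Hand-picked vertical-edge filter; in a CNN these weights are learned.
kernel = [[-1, 0, 1] for _ in range(3)]

for row in conv2d(image, kernel):
    print(row)  # [0, 27, 27, 0]
```

The filter responds with zero over the flat regions and strongly only where its window straddles the dark-to-bright boundary, which is the kind of local spatial pattern a CNN learns to detect.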
What does “convolutional” mean?
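In a Convolutional Deep Neural Network (CNN or ConvNet), the term “convolutional” refers to the network's convolutional layers: layers that slide a set of small learned filters (kernels) across their input and compute a weighted sum at each position, producing one feature map per filter.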
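Two quantities follow directly from how a convolutional layer works: the spatial size of each output feature map and the number of learnable parameters. The sketch below uses the standard formulas, output = floor((input + 2*padding - kernel) / stride) + 1 and parameters = kernel_h * kernel_w * in_channels * out_channels, plus one bias per output channel. The 64x64 RGB input and the layer shape are hypothetical examples, not MorphCast's actual input resolution or layer configuration.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    # Length of one spatial dimension of the output feature map.
    return (in_size + 2 * padding - kernel) // stride + 1


def conv_param_count(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    # One kernel_h x kernel_w x in_channels filter per output feature map,
    # plus an optional bias per feature map.
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Hypothetical layer: 64x64 RGB input, 16 filters of size 3x3, padding 1.
print(conv_output_size(64, 3, stride=1, padding=1))  # 64: spatial size preserved
print(conv_output_size(64, 3, stride=2, padding=1))  # 32: stride 2 halves it
print(conv_param_count(3, 3, 3, 16))                 # 448 = 3*3*3*16 + 16
```

Because each filter is small and reused at every position, the parameter count does not depend on the image size, which is one reason CNNs can stay compact.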
MorphCast Neural Networks
Positioned downstream of the MorphCast face detector is a “deep” neural network, specifically a Convolutional Neural Network (CNN). This network processes the cropped face produced by the face detector and returns the results relevant to the module in question (e.g., emotions, affects, attention).
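The detector-then-classifier flow described above can be sketched as follows. Everything here is hypothetical: `detect_face`, `model`, the (x, y, w, h) box format, and the 48x48 input size are stand-ins chosen for illustration, since the text does not specify MorphCast's interfaces or input resolution. Frames are plain 2D lists of grayscale values so the example runs without dependencies.

```python
def crop_and_resize(frame, box, out_size):
    """Crop box = (x, y, w, h) from a 2D grayscale frame and resize the crop to
    out_size x out_size using nearest-neighbour sampling. Assumes the box lies
    inside the frame."""
    x, y, w, h = box
    crop = [row[x:x + w] for row in frame[y:y + h]]
    return [
        [crop[(r * h) // out_size][(c * w) // out_size] for c in range(out_size)]
        for r in range(out_size)
    ]


def analyze_frame(frame, detect_face, model, input_size=48):
    # Hypothetical pipeline: detector -> crop/resize -> CNN module.
    box = detect_face(frame)
    if box is None:  # no face found in this frame
        return None
    face = crop_and_resize(frame, box, input_size)
    return model(face)


# Stand-ins for illustration only.
frame = [[r * 1000 + c for c in range(100)] for r in range(100)]


def fake_detector(f):
    return (20, 30, 40, 40)


def fake_model(face):
    return {"input_shape": (len(face), len(face[0]))}


print(analyze_frame(frame, fake_detector, fake_model))
```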
This neural network is custom-designed to meet stringent size (<1 MB) and execution-time requirements. Although it operates at a low input resolution, it delivers fast predictions (approximately 30 ms on a PC) with adequate accuracy for the designated task. Unlike other products, this setup balances accuracy and processing time well, preserving accuracy even at such reduced resolutions.
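For a sense of scale, the arithmetic below converts the size budget into a maximum parameter count at common numeric precisions, and a per-prediction latency into a throughput ceiling. It assumes 1 MB = 10^6 bytes and ignores file-format overhead; the text does not say which precision MorphCast uses, so the precisions listed are generic options rather than a description of its model.

```python
BUDGET_BYTES = 1_000_000  # "<1 MB", read here as 10**6 bytes (assumption)


def max_params(budget_bytes, bytes_per_param):
    # Upper bound on stored weights, ignoring any file-format overhead.
    return budget_bytes // bytes_per_param


for name, size in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: up to {max_params(BUDGET_BYTES, size):,} parameters")

# A ~30 ms prediction allows at most ~33 sequential predictions per second.
latency_ms = 30
print(f"max throughput: {1000 / latency_ms:.1f} predictions/s")
```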
The bespoke MorphCast architecture is multi-task oriented: a fundamental segment (the deepest levels) is shared across all face analysis modules. This shared foundation reduces both model size and processing time. For instance, running four modules simultaneously instead of one increases CPU or GPU load by only about 30%, rather than the roughly 300% increase one would expect if each of the three additional modules were a separate, full network.
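The 30% versus 300% comparison can be reproduced with a simple cost model. Suppose the shared backbone costs B and each module-specific head costs h. With separate networks, four modules cost 4(B + h), a 300% increase over one module. With a shared backbone, four modules cost B + 4h, an increase of 3h / (B + h); setting this to 0.30 gives h = B / 9, so each head would account for about 10% of a single-module run. This split is inferred from the quoted figures, not published by MorphCast, and real CPU or GPU load rarely scales this linearly.

```python
def shared_cost(n_modules, backbone, head):
    # Shared backbone runs once per frame; each active module adds only its head.
    return backbone + n_modules * head


def separate_cost(n_modules, backbone, head):
    # Baseline: every module is a full, independent network.
    return n_modules * (backbone + head)


def load_increase(cost_fn, n_modules, backbone, head):
    # Relative extra load compared with running a single module.
    one = cost_fn(1, backbone, head)
    return (cost_fn(n_modules, backbone, head) - one) / one


# Hypothetical split inferred above: head = backbone / 9 (abstract cost units).
backbone, head = 9.0, 1.0
print(f"separate networks: +{load_increase(separate_cost, 4, backbone, head):.0%}")  # +300%
print(f"shared backbone:   +{load_increase(shared_cost, 4, backbone, head):.0%}")    # +30%
```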
The output of the neural network then passes through a post-processing stage that removes noise and applies temporal filtering across successive frames. This stage can be tuned to extract the most valuable information from the data, significantly increasing its usefulness to the user. Comparable products on the market lack this level of post-processing, which sets our offering apart and adds value for our customers.
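The text does not describe which temporal filters MorphCast applies. As one common example of temporal post-processing, the sketch below uses an exponential moving average (EMA), which damps single-frame noise in per-frame scores; the `alpha` parameter and the sample data are illustrative assumptions.

```python
def ema_smooth(scores, alpha=0.3):
    """Exponential moving average over a sequence of per-frame scores.
    alpha in (0, 1]: higher values follow new frames faster, lower values smooth more."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    state = None
    for s in scores:
        # First frame initializes the state; later frames blend into it.
        state = s if state is None else alpha * s + (1 - alpha) * state
        smoothed.append(state)
    return smoothed


# Hypothetical per-frame probabilities for one emotion, with a one-frame glitch at index 3.
raw = [0.1, 0.1, 0.1, 0.9, 0.1, 0.1]
print([round(v, 3) for v in ema_smooth(raw, alpha=0.3)])
```

A smaller `alpha` suppresses glitches more strongly but reacts more slowly to genuine changes in expression, which is the kind of trade-off a tunable post-processing stage exposes.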
In summary, the MorphCast technology:
- Offers accuracy comparable to competitors that use server-side algorithms, as corroborated by an independent study, “Emotion Recognition in Humans and Machines Using Posed and Spontaneous Facial Expressions.”
- Harnesses AI algorithms optimized for minimal computation and size overhead, ensuring efficient performance.
- Offers a high degree of flexibility through an extensive array of configurable parameters, catering to diverse user requirements.
- Works out of the box: the default configuration ships with filters pre-tuned for common use cases, so it can be used immediately.
- Features an intuitive, user-friendly interface that simplifies interaction and enhances user experience.
- Ensures broad compatibility across nearly all modern browsers, promoting accessibility.
- Is built on a modular architecture, which simplifies integration and supports scalability.
- Achieves up to 20-30 analyses per second on PCs and typically about 7 analyses per second on recent-generation mobile devices.
- Incorporates a dynamic power-saving optimizer to efficiently manage CPU and GPU (when available) load, promoting energy conservation.
- Is limited when detecting expressions on multiple faces simultaneously, which restricts its use in scenarios that require multi-face analysis.
- Is more complex to use outside the browser than libraries written in native languages, which may create integration hurdles.
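The power-saving optimizer listed above is not described in detail. One generic way to cap CPU or GPU load is a duty-cycle controller that spaces analyses so inference occupies at most a target fraction of wall-clock time. The class below is a hypothetical sketch of that idea; its name, parameters, and defaults are assumptions, not MorphCast's implementation.

```python
class AdaptiveScheduler:
    """Hypothetical duty-cycle controller: chooses the interval between analyses so
    that inference occupies roughly target_load of wall-clock time."""

    def __init__(self, target_load=0.3, min_interval_ms=33.0, max_interval_ms=1000.0):
        if not 0 < target_load <= 1:
            raise ValueError("target_load must be in (0, 1]")
        self.target_load = target_load
        self.min_interval_ms = min_interval_ms
        self.max_interval_ms = max_interval_ms
        self.interval_ms = min_interval_ms

    def update(self, inference_ms):
        # Interval at which inference_ms of work per analysis equals target_load.
        ideal = inference_ms / self.target_load
        # Clamp: never analyze faster than min_interval_ms or slower than max_interval_ms.
        self.interval_ms = min(self.max_interval_ms, max(self.min_interval_ms, ideal))
        return self.interval_ms

    def analyses_per_second(self):
        return 1000.0 / self.interval_ms


scheduler = AdaptiveScheduler(target_load=0.5)
for measured_ms in (30.0, 10.0, 140.0):
    interval = scheduler.update(measured_ms)
    print(f"inference {measured_ms} ms -> analyze every {interval:.0f} ms")
```

With `target_load=0.5` and a measured inference time of 30 ms, `update(30.0)` returns 60.0, i.e., one analysis every 60 ms; faster inference is clamped to the minimum interval rather than letting the analysis rate grow without bound.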