Firmware Security and Edge Machine Learning: How Exein Is Building the Future of IoT Security

Exein
May 18, 2020

Understanding the Machine Learning Engine that powers the Exein Core firmware security solution for IoT devices.

Introduction

Traditional firmware security solutions look for known attack patterns inside devices in order to secure them from external threats. While this approach can be effective at detecting and blocking known exploits, it fails entirely at protecting a device from yet-to-be-discovered vulnerabilities.

By contrast, the purpose of Exein Core is to protect embedded devices from both known and unknown vulnerabilities. Doing this with handwritten rules, as in traditional solutions, would simply be impossible: there are too many ways an attacker could exploit vulnerabilities in existing IoT devices and gain access to them. For this reason, Exein Core makes extensive use of machine learning to autonomously learn how a device is meant to operate, constantly monitor its behavior, and protect it from every form of unexpected variation in its functioning. What is more, it does so entirely on the edge, so Exein Core works smoothly even when no internet connection is available, and all user data stays inside the device, where it is meant to be.

In this blog post, we will take a high-level look at the “brains” inside Exein Core: its Machine Learning Engine (MLE). This is the part of Exein Core responsible for deciding whether a device is behaving “normally” or, instead, appears to be under cyber-attack.

The Machine Learning Engine (MLE)

In order to understand how the Exein Core MLE is structured and how it helps secure firmware, three main components need to be addressed:

  • The flow of data that exposes the machine learning model to the firmware-level activity of the device.
  • The machine learning model that learns the device’s expected behavior.
  • The machinery that uses the trained ML model to identify threats or other anomalies in the device’s activity.

Each of these is addressed separately in the following sections, and at the end we will see how they combine to create the Exein Core MLE.

Gathering Data

As with all machine learning models, the MLE really starts with data. Another part of Exein Core, the LSM, is responsible for collecting in real time a large number of features (more than one thousand) that describe the internal “state” of the firmware at any given time. These features are essentially the traces left behind by the Linux kernel as it executes the operations the user requested.

A look inside a working u-http server through the eyes of Exein Core LSM.

When observed over a sufficiently long period of time, the features collected by the LSM give a complete description of the temporal evolution of the operations carried out by the device. For example, if we observe a web server serving three requests, we might observe the same pattern of features repeated three times in the data produced by the LSM, with each pattern representing what happens at a very deep level inside the firmware when a web server serves a web page.
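
To make the idea concrete, below is a toy sketch of how such repeating traces could be encoded as a categorical time series for a model to learn from. The trace names are invented for illustration; the real LSM emits over a thousand distinct features.

```python
# Hypothetical illustration: encode a stream of kernel-level traces as
# integer state IDs, turning device activity into a categorical time series.
# The trace names below are invented for the example.
trace_stream = [
    "socket_accept", "file_read", "socket_send",   # request 1
    "socket_accept", "file_read", "socket_send",   # request 2
    "socket_accept", "file_read", "socket_send",   # request 3
]

# Build a vocabulary mapping each distinct trace to an integer ID.
vocab = {name: idx for idx, name in enumerate(sorted(set(trace_stream)))}
encoded = [vocab[name] for name in trace_stream]

print(vocab)    # {'file_read': 0, 'socket_accept': 1, 'socket_send': 2}
print(encoded)  # [1, 0, 2, 1, 0, 2, 1, 0, 2] -- the repeated pattern is visible
```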

The whole rationale behind Exein Core is that embedded devices are programmed to accomplish very repetitive tasks: a web server is born to serve web pages, and that is all it should do for its entire life, much as an industrial sensor should only record some variable’s state and communicate it to a server. Starting from this fundamental assumption, the goal of the MLE is to autonomously learn to identify patterns in the data produced by the LSM for a particular device, observed for a sufficiently long time under its normal operating conditions. These patterns are nothing but the actual signature, at the firmware level, of the tasks the device has been programmed to accomplish. Once the MLE has learned these patterns during the training phase, it can use this understanding in the production phase to monitor in real time the degree of similarity between the current observed behavior and the ideal behavior learned during training.

CNNs and Temporal Convolutions

The kernel sliding across the input window during the convolution operation of a CNN layer.

Convolutional Neural Networks (CNNs) are a particular class of neural network that uses convolution operations to take advantage of locality and translation invariance in the input data. This makes them particularly effective when the data being modeled has structure in space or time. CNNs are extremely popular in the field of machine vision because images lend themselves naturally to the properties CNNs model best. Pixels that are close together in an image are indeed related to each other (for example, they might represent an object or part of an object inside a larger context). Moreover, the meaning of objects is invariant under translation within an image (a bicycle is still a bicycle whether it appears in the upper left or the lower right).

It turns out that CNNs are also well suited to understanding the firmware-level behavior of embedded devices, because the traces produced by the LSM share the locality and translation invariance properties that CNNs are best suited to exploit. Traces that are close to each other are indeed related, appearing one after the other during the execution of programs (locality in time). Moreover, the traces produced by the LSM are also invariant under translation, because patterns that occur at the beginning or at the end of a given time window share the same meaning, representing the same part of an operation carried out by the kernel (translation invariance in time).

At the heart of the MLE is a CNN-based model trained to predict the next time step following a given sequence of traces generated by the LSM. If some patterns naturally appear at the firmware level (representing the normal operations of embedded devices), then given a sufficiently long input sequence the ML model should be able to predict the next part of the sequence with reasonable confidence and accuracy. If the model can do so accurately in the training phase (where the data is assumed to represent the ideal, expected behavior of the device) but fails to predict the next time step correctly at some point in the production phase, one can deduce that the current behavior of the device differs significantly from the learned one, violating the initial assumption that embedded devices always accomplish the same tasks. For this reason, the “accuracy” of the CNN model’s predictions can be used as a proxy for the similarity between the current behavior and the learned “normal” one.
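
As a rough illustration of this idea, here is a minimal PyTorch sketch of what a next-step temporal CNN over categorical traces could look like. The architecture, layer sizes, and the 1024-state vocabulary are illustrative assumptions, not Exein’s production model.

```python
import torch
import torch.nn as nn

class NextTracePredictor(nn.Module):
    """Illustrative temporal CNN: given a window of past trace IDs,
    predict a probability distribution over the next trace ID."""

    def __init__(self, num_states: int, embed_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(num_states, embed_dim)
        # 1-D convolutions slide along the time axis, exploiting
        # locality and translation invariance in time.
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, num_states)

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (batch, time) integer trace IDs
        x = self.embed(window).transpose(1, 2)  # (batch, embed_dim, time)
        x = self.conv(x)                        # (batch, hidden, time)
        x = x[:, :, -1]                         # features at the last time step
        return self.head(x)                     # logits over the next trace ID

model = NextTracePredictor(num_states=1024)
window = torch.randint(0, 1024, (1, 16))        # one window of 16 past traces
probs = torch.softmax(model(window), dim=-1)    # P(next trace | window)
```

Training such a model with a standard cross-entropy loss against the actually observed next traces is what allows its prediction accuracy to double as a behavioral similarity measure.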

This is really the heart of how the MLE understands the behavior of a device: it makes predictions about future traces and compares these with the actually observed ones. When predictions and observations diverge significantly, an anomaly is occurring or, in other words, the device is under attack.

Detecting Anomalies

Successfully developing and training the CNN model described in the previous section is only one part of the job. It leaves us with an ML model that is capable of making predictions about the future traces generated by the device, but that knows nothing about the difference between normal behavior and a cyber-attack.

In particular, the output of the CNN model at each time step is a vector whose entries represent the probabilities that each “state” will be the next to be observed in the time series. In order to define what “normal behavior” means from the point of view of the ML model, the MLE computes the cross entropy between the model predictions and the actual observations. Intuitively, the cross entropy is a measure of distance between two probability distributions and tends to zero when the distributions match (i.e. the model is very confident about its prediction, and the prediction is right), while on the contrary it tends to infinity when the distributions are very different from each other (that is, the model is very much wrong at predicting the next trace).
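
Concretely, this per-prediction measure is just the negative log of the probability the model assigned to the state that actually occurred. A minimal numpy sketch (the probability values are illustrative):

```python
import numpy as np

def cross_entropy(predicted_probs: np.ndarray, observed_state: int) -> float:
    """Cross entropy between the model's predicted distribution and the
    one-hot distribution of the observed next state. Near zero when the
    model confidently predicted the right state; large when it was wrong."""
    eps = 1e-12  # avoid log(0)
    return float(-np.log(predicted_probs[observed_state] + eps))

probs = np.array([0.96, 0.02, 0.02])
print(cross_entropy(probs, 0))  # ~0.04: confident and correct
print(cross_entropy(probs, 1))  # ~3.9:  confident and wrong
```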

It is perfectly normal and expected for any ML model not to be perfectly accurate all the time. When a model makes a sporadic error, the cross entropy for that single prediction immediately shoots up to a very high value, which could trick the MLE into concluding that an attack is occurring. To avoid this kind of false positive, the MLE averages the cross entropy of individual predictions over a contiguous window of neighboring ones: this way, it takes a substantial and, most importantly, sustained shift in behavior for the average cross entropy to rise significantly and indicate that a real attack is under way.
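
Here is a minimal sketch of that smoothing step; the window size of 50 is an arbitrary illustrative choice, not a documented Exein parameter:

```python
from collections import deque

class ErrorSmoother:
    """Rolling mean over the last `window` per-prediction cross entropies.
    A single spike barely moves the mean; only a sustained run of high
    errors pushes it up."""

    def __init__(self, window: int = 50):
        self.errors = deque(maxlen=window)

    def update(self, cross_entropy: float) -> float:
        self.errors.append(cross_entropy)
        return sum(self.errors) / len(self.errors)

smoother = ErrorSmoother(window=50)
for ce in [0.05] * 49 + [3.9]:   # 49 normal predictions, then one spike
    score = smoother.update(ce)
print(round(score, 3))           # ~0.127 -- the lone spike barely registers
```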

Using the moving-average smoothing described above, the MLE produces a very stable measure of how “anomalous” adjacent predictions are, that is, how anomalous the time series is when considered in small adjacent blocks. By looking at this measure, referred to as the “anomaly score”, one can determine when the device is behaving as expected and when, instead, it is behaving unexpectedly. This requires setting a threshold for the anomaly score that marks the boundary between normality and anomaly. To set this threshold, the MLE computes the anomaly score over all the data used for training (which we know a priori contains no anomalies) and simply sets the threshold at a multiple of the maximum anomaly score encountered over this dataset, in our case 2x.
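
Since the training data is assumed clean, calibrating the threshold reduces to a single pass over it. A minimal sketch (the score values are illustrative):

```python
import numpy as np

def calibrate_threshold(training_scores: np.ndarray, margin: float = 2.0) -> float:
    """Set the anomaly threshold at a multiple of the highest smoothed
    anomaly score observed on the clean training data (2x per the text)."""
    return margin * float(np.max(training_scores))

# Smoothed anomaly scores computed over the training set.
train_scores = np.array([0.08, 0.11, 0.09, 0.15, 0.10])
threshold = calibrate_threshold(train_scores)
print(threshold)  # 0.3 -- scores above this in production signal an anomaly
```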

Real Time Execution

A real buffer overflow attack executed on a u-http server. As soon as the attack starts (red input data) the MLE anomaly score rapidly increases and exceeds the threshold, signalling the LSM to take immediate action to stop the malicious process.

When combined, the three components described in the previous sections create a model that is capable of detecting anomalies in the firmware-level behavior of embedded devices, and so effectively protecting them from external threats: this is the Exein Core MLE. After being trained, the MLE effectively acts as a machine learning “engine” for the whole Exein Core firmware security suite.

In particular, the trained MLE can be injected inside the final layer of Exein Core: the MLE Player. The MLE Player works by constantly monitoring all the traces produced by the active processes running inside a device. Every time a new trace is generated for a given process, the Player uses the MLE to make a prediction about it and computes the corresponding cross entropy, which is added to the rolling window of errors for smoothing. It then computes the mean value of the error window and compares it with the threshold: if the average cross entropy is below the threshold, the MLE Player tells the LSM that the process is behaving correctly; otherwise, the Player reports the anomaly to the LSM, which takes immediate action to block the malicious process.
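
Putting the pieces together, the Player’s per-trace decision step might look roughly like the sketch below, reusing the cross_entropy helper and ErrorSmoother from the earlier sketches. The predict_next, notify_ok, and block_process names are hypothetical hooks, not Exein APIs.

```python
def on_new_trace(model, smoother, threshold, history, observed_state,
                 notify_ok, block_process):
    """Hypothetical per-trace step of the MLE Player (illustrative only)."""
    probs = model.predict_next(history)        # P(next trace | recent history)
    ce = cross_entropy(probs, observed_state)  # per-prediction error
    anomaly_score = smoother.update(ce)        # rolling mean over the window
    history.append(observed_state)             # advance the history window

    if anomaly_score <= threshold:
        notify_ok()        # LSM: process is behaving as learned
    else:
        block_process()    # LSM: sustained anomaly, stop the malicious process
```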

Learn More

Head over to exein.io for more information about Exein Core, including how to get started securing your own firmware.

GitHub: https://github.com/Exein-io/exein

Posted by Giovanni Alberto Falcione, Head of Machine Learning at Exein
