Why Is It So Hard to Apply AI in Industry? Part I: The Technical Challenges
TL;DR: The main technical reasons why applying data science in industry is challenging are:
1. The journey data takes before it is available for analysis
2. The design and conception of machine learning algorithms
3. The scarcity and robustness of data
Let’s delve into these reasons.
1. The Long Journey Data Takes Before Being Available for Analysis
Anyone familiar with the reality of a typical modern industry knows that there is a long and arduous journey before data reaches the hands of an analyst, engineer, or even a data scientist. The diagram below illustrates a typical example of the journey data takes to get to an analyst.

The data journey begins at the physical instrument, i.e., the sensor (temperature, pressure, etc.) that needs to be physically installed in the production process (the well-known factory floor).
The signal measured by the sensor is sent to a data and logic “hub” (in industry, known as the famous PLC or Programmable Logic Controller) via a 4 to 20 mA current signal. It’s important to note that this signal can suffer interference or even some “compression” when it needs to be converted from analog to digital. In this conversion process, quantization of the signal occurs at the interface (analog card) that receives it and sends it to the “hub.” The figure below shows an example of the quantization phenomenon.

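To make the quantization step more concrete, here is a minimal sketch in Python; the bit resolution and signal values are illustrative and not tied to any specific analog card. It shows how a continuous 4–20 mA reading gets snapped onto the discrete levels of an analog-to-digital converter before reaching the PLC.

```python
# Minimal sketch (illustrative values): quantization of a 4-20 mA sensor
# signal by an analog input card with limited bit resolution.
import numpy as np

def quantize_current(i_ma, bits=12, i_min=4.0, i_max=20.0):
    """Map a 4-20 mA current reading onto the discrete levels of an ADC."""
    levels = 2 ** bits - 1                       # number of quantization steps
    code = np.round((i_ma - i_min) / (i_max - i_min) * levels)
    code = np.clip(code, 0, levels)              # saturate out-of-range readings
    return i_min + code / levels * (i_max - i_min)

# A smooth "true" signal vs. what the PLC actually receives after quantization
t = np.linspace(0, 10, 500)
true_signal = 12 + 2 * np.sin(t)                 # mA, illustrative only
plc_signal = quantize_current(true_signal, bits=8)
print(f"max quantization error: {np.max(np.abs(true_signal - plc_signal)):.4f} mA")
```

The lower the bit resolution of the card, the larger the step between levels and, consequently, the larger the error embedded in the “digital” value the analyst will later see.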
Once the data is virtually available in the PLC, it is ready to be used by the PLC itself (which is basically an industrial “computer”) and also to be sent over the network to other interfaces/systems. That is, the data becomes available to another interface, the network card, which handles transmitting the data over TCP/IP using the famous and well-established industrial open communication protocol, OPC. This protocol is widely recognized and used, ensuring interoperability between different systems.
From here, the data can pass through multiple interfaces, switches, and firewalls before actually reaching a server. Note that so far this data has not been stored in any database! We will talk about this shortly.
Once the data reaches a server, it “enters” this machine through an application we call an OPC server. In other words, this application receives the data stream coming from the PLC and makes it available to be sent to more servers/applications via network communication using Microsoft’s well-known DCOM (Distributed Component Object Model), despite its information security issues. DCOM is still widely used for communication in Operational Technology (OT) environments.
Nowadays, it is more common to find OPC UA (the standard protocol for Industry 4.0 applications) with security certificates, which mitigate cybersecurity risks.
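As an illustration, reading a single tag over OPC UA with the open-source python-opcua library might look like the sketch below; the endpoint URL, certificate files, and node id are hypothetical placeholders and will differ in any real plant.

```python
# Minimal sketch: reading one tag from an OPC UA server with python-opcua
# (pip install opcua). Endpoint, certificates, and node id are placeholders.
from opcua import Client

client = Client("opc.tcp://plc-gateway.example.local:4840")
# Security policy with certificates, as commonly required in OT environments
client.set_security_string(
    "Basic256Sha256,SignAndEncrypt,my_cert.pem,my_private_key.pem"
)

client.connect()
try:
    temperature_node = client.get_node("ns=2;s=Reactor01.Temperature")
    print("Current value:", temperature_node.get_value())
finally:
    client.disconnect()
```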
After the data reaches a computer and is ready to be distributed, it is finally stored in a database. However, storing data in an industrial environment is not done simply using a relational database with SQL. Due to the need to manage the traffic, storage, and access to information in industrial environments, a specific system for this purpose was created, known as PIMS (Plant Information Management System).
PIMS systems were developed to be a kind of “Hadoop” for industry, facilitating the storage of large volumes of data (mainly time series) in on-premise environments. Emerging in the 1980s, these systems revolutionized industry, and even today, there are industrial plants that do not have a PIMS system. It is important to note that, to enable the storage of large time series, PIMS systems also incorporated data compression algorithms.
The best-known compression algorithms are boxcar and backslope, which allowed vast amounts of data to be stored on 1 TB hard drives (considered significant storage space until about five years ago). This ability to store large volumes of data, combined with efficient information management, makes PIMS systems a fundamental solution for the industrial sector.
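To give a feel for how this exception-based compression works, here is a minimal sketch of the boxcar (deadband) part of the idea; real PIMS implementations also apply the backslope (slope) test and per-tag tuned thresholds, and they archive timestamps along with the values.

```python
# Minimal sketch of boxcar (deadband) compression: a sample is archived only
# when it deviates from the last archived value by more than a threshold.
# The deadband and the data below are illustrative.
def boxcar_compress(samples, deadband):
    archived = [samples[0]]                     # always keep the first sample
    for value in samples[1:]:
        if abs(value - archived[-1]) > deadband:
            archived.append(value)              # exception: value moved enough
    return archived

raw = [50.0, 50.1, 50.05, 50.2, 51.5, 51.6, 53.0, 53.1, 53.05]
print(boxcar_compress(raw, deadband=1.0))       # -> [50.0, 51.5, 53.0]
```

The gain in storage is obvious, but so is the loss: small fluctuations below the deadband simply disappear from the historian.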
However, as commonly known, solving one problem often creates another. Data compression also has its drawbacks, as well explained in this article.
Finally, the data reaches its final destination, the PIMS. However, with current cloud technologies, data continues its long journey to a Data Warehouse or Data Mart.
Regardless, it is from the PIMS that we perform data extraction of variables for exploratory analyses and predictive model development. It is in the PIMS that we query and export data to the well-known *.csv format.
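In practice, the analysis often starts from exactly such an export. A minimal sketch with pandas (the file name and column names are hypothetical) could look like this:

```python
# Minimal sketch: loading a CSV exported from the PIMS and resampling it for
# analysis. File name and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv(
    "pims_export.csv",
    parse_dates=["timestamp"],
    index_col="timestamp",
)

# Re-grid the irregular, compressed time series onto a uniform 1-minute base,
# interpolating the gaps left by exception/compression reporting.
uniform = df.resample("1min").mean().interpolate(method="time")
print(uniform.describe())
```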
Note that, in industry, for data to be minimally available to the end user, it goes through a long journey where the probability of interference is significant. This is quite different from collecting and evaluating user click data on a web page to interpret engagement, for example, where the digitization and availability journey of the data for analysis is much shorter.
Obviously, the goal here is not to compare journeys A and B but to highlight the importance of considering the long and complex path that data travels in an industrial environment before being available for analysis.
Now, let’s address the next major challenge of data science in industry.
2. Conception of Machine Learning Algorithms
The second challenge of data science in industry is related to the conception of most machine learning algorithms.
As mentioned by Andrew Ng in this live session, most ML (machine learning) algorithms were conceived and validated by academia and internet-based companies using reference data from digital-world datasets and experiments. A notable example is the MNIST database, probably the largest and most reliable database of handwritten digits in the world, used in the design and validation of many deep learning algorithms.
However, finding something similar for industry and manufacturing is practically impossible. Despite some small initiatives, nothing compares to the internet world. This is mainly because industries do not want to share their production process data due to intellectual property and brand protection issues. While curating and anonymizing this data would allow the scientific and professional community to improve research and developments in the industrial context, this practice is still in its infancy, as exemplified by the initiatives of Ocean Protocol.
While I was working at ihm stefanini (back in 2018), I managed to do something similar: one of our clients donated part of their data so we could make it available on Kaggle, after anonymizing the data from the production process.

Link to industrial datasets on Kaggle
I receive messages almost daily from people around the world thanking us for the available data. It was then that I clearly realized the scarcity of this type of data for the global community.
If there is little access to data, how do we design and validate algorithms? It becomes practically unfeasible, or it is done only inside companies and private research groups that develop their own intellectual property.
In practice, we need to take these algorithms and try to adapt them to the industrial context, which often faces the third challenge I mention next.
3. Scarcity and Robustness of Data
Since the algorithms were not conceived considering the challenges of an industrial environment, as mentioned by Andrew Ng in the same live session cited earlier, we often observe a significant difference in the performance of the model obtained during development and the model in production. This occurs even in cases where the scientific process was rigorously followed and respected (correct data splitting, cross-validation, absence of data leakage, etc.). More details can be found in this important article How to avoid machine learning pitfalls: a guide for academic researchers.
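For industrial time series, “correct data splitting” usually means respecting chronological order so that no future process data leaks into the training folds. The sketch below illustrates that discipline with scikit-learn’s TimeSeriesSplit; the features and target are synthetic placeholders, not real plant data.

```python
# Minimal sketch: chronological cross-validation to avoid leaking future
# process data into training folds. X and y stand in for time-ordered
# historian features and a target variable.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                   # placeholder process variables
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=500)

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(X)):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    print(f"fold {fold}: MAE = {mae:.3f}")
```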
The main reasons for this are:
1. Amount of data available (a classic problem in industry as well)
2. Constant occurrence of drift (data drift and concept drift)
3. Variations in production campaigns or changes in the operational point of the production process
4. Sensor degradation and/or interference in the data journey
We will not go into detail about each of these reasons in this post, but we can comment that, especially when creating anomaly detection models, it is common to have imbalanced classes, particularly regarding the positive class label of failure or anomalous behavior.
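One common mitigation is to re-weight the rare failure class during training. Here is a minimal sketch with scikit-learn, using synthetic data and illustrative parameters only, to show the mechanics rather than any real result:

```python
# Minimal sketch: handling the rare "failure" class in an anomaly /
# failure-prediction dataset by weighting classes. Data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))
# Roughly 2-3% positive (failure) labels, loosely tied to the first feature
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 2.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes errors on the rare failure class so the
# model is not rewarded for simply predicting "normal" everywhere.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```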
Drift is also a problem that affects models in production. It is common to need to retrain models every 3-4 months.
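A simple way to decide when retraining is due is to monitor the distribution of key variables against the training reference. Below is a minimal sketch using a two-sample Kolmogorov-Smirnov test from scipy; the data and the significance threshold are illustrative.

```python
# Minimal sketch: flagging data drift in a single process variable by
# comparing its recent distribution against the training reference.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
reference = rng.normal(loc=80.0, scale=2.0, size=5000)   # training-time data
recent = rng.normal(loc=83.0, scale=2.5, size=500)        # last weeks in prod

statistic, p_value = ks_2samp(reference, recent)
if p_value < 0.01:
    print(f"drift suspected (KS={statistic:.3f}, p={p_value:.1e}) - consider retraining")
else:
    print("no significant drift detected")
```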
The most challenging aspect to deal with, because it is so hard to control, is the third item, related to operational variations or production campaigns. Often, this scenario is not represented in the training dataset, and sometimes not even in the test dataset, which leaves little visibility into the model’s actual generalization capability.
It is also worth mentioning that, besides all the influences data undergoes, there is a need for instrument or sensor maintenance and recalibration. This can be expensive and is not always performed with the proper frequency. To make matters worse, it is difficult to determine how much error a damaged sensor is introducing into the real signal.
In Summary
The challenges mentioned in this post should always be remembered and considered when estimating the effort and risks of a data project in industry.
In the next post of the series, we will address the business challenges that also impact the outcome of a data project in industries.