Lenzing’s Transition to Machine-Learning-Based Data-Driven Predictive Quality Control
Lenzing Group is one of the world’s largest producers of cellulose fibers.
Furthermore, Lenzing is an exemplary European company leading the way in sustainable manufacturing and eco-friendly products. Their innovative Lyocell fiber, a sustainable alternative to cotton, is created entirely from natural materials, underlining their commitment to environmentally friendly production.
To stay competitive in the global market, Lenzing introduced the vision to be the undisputed quality leader in wood-based cellulosic fibers. A key component of this strategy is implementing data-driven quality control (DDQC) throughout the production process, shifting from traditional retrospective testing to predictive control to ensure consistent quality and efficiency.
Project Highlights
Fine-Grained Quality Control
In the traditional fiber production process, we conduct three laboratory quality checks per day while producing two fiber bales every ten minutes, so each lab result has to stand in for roughly a hundred bales. These measurements guide precise adjustments to our operating settings, ensuring consistently high-quality output and minimizing the risk of lower-grade products.
In contrast, machine learning models, powered by process expertise and live sensor data, provide timely predictions for expected product quality and process disturbance indicators.
The DDQC system helps plant operators in two major ways. First, it reduces the time needed to take action, as operators no longer have to wait hours for the next lab measurement. Second, it provides a trend based on detailed quality estimates, helping operators spot shifts from the desired state or signs of equipment wear. This allows them to improve product quality and yield while also planning maintenance activities more effectively.

Figure 1: A control center of Lenzing Group, with the prototype interface for data-driven quality control: “We are experiencing an all-time low in low-grade product for the production lines operated with ML support. Overall, we expect 50% fewer low-grade cases compared to traditionally monitored production lines. With the support of the models and the operators’ increased awareness, periods of more than six months without low-grade product for the monitored quality parameters are within reach.”
Selecting the Tech Stack: Focus on Your Value Contribution
Software maintenance can account for 60% to 80% of total lifecycle costs [1]. Beyond financial and resource constraints, the manufacturing industry faces increasing security requirements. With the recent adoption of the EU AI Act and the EU Data Act, predictive control systems may be subject to additional monitoring requirements, depending on the specific use case. Choosing a professional solution that provides a secure, scalable foundation for data-driven solutions, ensures compliance with current and future regulations, and maintains the quality of predictive systems is crucial for the success of future data science projects. After all, the core expertise of a manufacturer’s data science or process engineering team lies in understanding production processes and translating them into machine learning systems, not in developing and maintaining IT systems.
At Lenzing, we decided to build on a hyperscaler and selected AWS as our cloud provider for implementing the data-driven quality control project. At a high level, we split the overall architecture into four service domains: data ingestion & data lake, data processing, visualization, and data science. Data is ingested from an on-premises Aveva PI data source, then stored and distributed. The data science domain uses this data to train machine learning models. The processing domain processes the raw data, enriches it with ML predictions, and stores the results in the data lake. The visualization domain gives plant operators access to dashboards that provide decision support based on the generated insights.

Figure 2: Overview of service domains for the data-driven quality control project and the data flow between domains.
The following sections provide a detailed description of each domain and its purpose.
Extending the On-Premises System:
Two fundamental on-premises services are extended into the AWS cloud. Firstly, Lenzing collects most industrial operational data in Aveva PI. To make this data dynamically available and processable for the data science team, we stream it from Aveva PI to the data lake, from where it is made available to attached services.
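The exact streaming mechanism is beyond the scope of this article; as a minimal sketch, an ingestion step that lands batched PI readings in a date-partitioned S3 layout could look like the following Lambda handler (the bucket name, event shape, and partitioning scheme are assumptions):

```python
import json
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
BUCKET = "lenzing-ddqc-data-lake"  # hypothetical bucket name

def handler(event, context):
    """Land a batch of Aveva PI tag readings in the data lake,
    partitioned by ingestion date."""
    # Assumed event shape: [{"tag": "...", "ts": "...", "value": ...}, ...]
    records = event["records"]
    now = datetime.now(timezone.utc)
    key = (
        f"raw/pi/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"batch-{now:%H%M%S}.json"
    )
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
    )
    return {"written": len(records), "key": key}
```

Date-based partitioning keeps downstream Athena queries cheap, since they only need to scan the relevant prefixes.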
Secondly, the Lenzing AWS organization is set up such that the existing identity management system and security governance processes extend seamlessly into the AWS cloud environment. Central IT can manage and revoke access to AWS services centrally at any time, and all accounts and access roles are directly connected to the internal identity management system.
The Data Lake:
The data lake is the central location where all data is stored, governed, and made accessible. The implementation leverages fundamental services like Amazon Simple Storage Service (S3) and AWS Lambda, which allow us to fully utilize the elasticity of the AWS cloud.
The volume of ingested data is dynamic. While most data arrives at a regular 1-minute frequency, there are additional imports from the quality management system every two hours, as well as intermittent historical data imports for backfilling. These result in load spikes 50% and 2,500% above the baseline, respectively. A serverless implementation automatically scales resources such as compute to meet demand, resulting in a cost-efficient solution.

Figure 3: Utilization of the total number of allowed concurrent Lambda executions, currently capped at 5,000. The spikes correspond to historical data imports.
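To keep backfill spikes from starving the regular 1-minute ingestion path, individual functions can reserve a slice of that concurrency budget. A minimal sketch, assuming a hypothetical backfill function name and an illustrative split of the 5,000-execution limit:

```python
import boto3

lam = boto3.client("lambda")

# Cap the backfill function so historic data imports cannot consume
# the entire account-level concurrency budget (function name and
# reserved share are assumptions).
lam.put_function_concurrency(
    FunctionName="ddqc-historic-backfill",
    ReservedConcurrentExecutions=1000,
)
```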
For building and training the ML models, we use Amazon Athena to query the necessary data using SQL. Previously, data scientists had to load all data from CSV files to train models. With the data lake in place, the overall process time dropped from roughly 4 hours to 2 minutes, which reduced training costs from roughly €35 per model per training run to less than 30 cents. In addition, the waiting period between experiments is greatly reduced, which also drastically shortens the time to production for new ML models.
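In practice, this means a training job starts with a single SQL query instead of a CSV-loading step. A minimal sketch using the AWS SDK for pandas (awswrangler); the database, table, and column names are illustrative:

```python
import awswrangler as wr

# Pull aligned sensor features and lab measurements straight from the
# data lake via Athena into a pandas DataFrame.
df = wr.athena.read_sql_query(
    sql="""
        SELECT ts, line_id, sensor_values, lab_quality
        FROM ddqc_curated.training_view
        WHERE ts BETWEEN timestamp '2024-01-01' AND timestamp '2024-06-30'
    """,
    database="ddqc_curated",
)
print(df.shape)
```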
Retention Time Correction Data Processing:
In manufacturing, we regularly face the problem of correlating sensor data from different process steps. In the given production process, the raw material takes approximately 8 hours from entering the process until it is converted to viscose. The sensor values at the beginning of the process therefore cannot be directly compared with the sensor values at the end of the process: a retention time correction needs to be performed to properly align the sensor time series. To create a consistent dataset for DDQC as well as for other process analytics initiatives, the retention time correction was set up as a dedicated data processing service deployed on Amazon Elastic Container Service (ECS). It buffers raw data for 24 hours, then processes batches with AWS Lambda whenever they are ready. Like the data lake, this setup scales effortlessly during load spikes (for example, during historical data processing), which comes in handy when additional production lines are onboarded.
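Conceptually, the correction shifts the upstream sensor timestamps forward by the retention time before joining them with downstream values. The following pandas sketch assumes a fixed 8-hour shift for simplicity; the production service additionally handles variable retention times and the 24-hour buffering:

```python
import pandas as pd

RETENTION = pd.Timedelta(hours=8)  # approximate raw-material retention time

def align_retention_time(early: pd.DataFrame, late: pd.DataFrame) -> pd.DataFrame:
    """Shift sensors from the start of the process forward by the
    retention time so they line up with end-of-process measurements.
    Both frames are expected to carry a DatetimeIndex."""
    shifted = early.copy()
    shifted.index = shifted.index + RETENTION
    # merge_asof with a tolerance keeps the join robust against the
    # slightly irregular sampling of real plant data.
    return pd.merge_asof(
        late.sort_index(),
        shifted.sort_index(),
        left_index=True,
        right_index=True,
        direction="nearest",
        tolerance=pd.Timedelta(minutes=1),
    )
```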

Figure 4: A visualization of the main data flow and the interaction between data lake, data processing and machine learning. By abstracting the data processing from machine learning, both components can be developed and extended independently.
As this processed data is the foundation for predictive modeling, the results are streamed back to the data lake and made available through Amazon Athena.
After executing the retention time correction and computing ML predictions, the results are published through a message broker. This feed is currently consumed by two subscribers: the data lake, for long-term storage, model training, and data analysis, and the visualization service, which makes data and predictions available to production plant operators.
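The broker technology is not named here; an SNS-style fan-out, where both the data lake archiver and the visualization service subscribe to the same topic, might look like this (the topic ARN and message shape are placeholders):

```python
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:eu-central-1:123456789012:ddqc-predictions"  # placeholder

def publish_prediction(line_id: str, prediction: dict) -> None:
    """Fan out a retention-corrected record with its ML prediction to
    all subscribers (data lake archiver and visualization service)."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"line_id": line_id, **prediction}),
        MessageAttributes={
            "line_id": {"DataType": "String", "StringValue": line_id}
        },
    )
```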
Visualization
To give access to insights from data-driven quality control, a sub-project was set up to take care of UX and UI topics. All data is made available to the team through AWS IoT SiteWise, a managed service that allows the team to organize and transform all incoming data without in-depth technical know-how. For the visualization itself, the team decided to use Grafana, which AWS also offers as a managed service via Amazon Managed Grafana. It provides a wide variety of visualization and annotation options, making it well suited for operational dashboards. Without writing a single line of code, the team created an intuitive interface offering quick overviews of the overall production status across multiple production lines, as well as drill-down capabilities for in-depth analysis of production issues.
After running the prototype shown in Figure 1, the production manager stated that they “[…] want another oversized monitor to give all operators access to the excellent DDQC quality reports.”
Data Science / ML Platform
The core piece, and the key to the success of such a platform, is the data science and machine learning environment. We fully rely on Amazon SageMaker for this component. Data scientists run their own environments in SageMaker, from where they can access the relevant data directly from the data lake and run their experiments. Since June 2024, SageMaker has offered a managed MLflow capability; MLflow is a widely used open-source tool for ML lifecycle management, and it has greatly increased the traceability of experiments and model performance tracking for the data science team.
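With the managed capability, pointing an experiment at the tracking server is a one-liner. A minimal sketch, assuming the sagemaker-mlflow plugin is installed; the tracking server ARN, experiment name, and metric values are placeholders:

```python
import mlflow

# Point the client at the SageMaker-managed MLflow tracking server
# (placeholder ARN for the team's tracking server).
mlflow.set_tracking_uri(
    "arn:aws:sagemaker:eu-central-1:123456789012:mlflow-tracking-server/ddqc"
)
mlflow.set_experiment("ddqc-quality-prediction")

# Log parameters and metrics for a single experiment run.
with mlflow.start_run():
    mlflow.log_param("model", "gradient_boosting")
    mlflow.log_metric("validation_rmse", 0.42)
```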
Once training processes are ready for production, they are pushed to the production environment utilizing CI/CD pipelines. For regular (re-)training of ML-models, we use SageMaker Pipelines, ensuring performance tracking is available in the same monitoring environment as the development experiments. The scheduled pipelines deploy the models directly, based on pre-defined quality measures.
Figure 5: Visualization of the abstracted ML workflow. Both the experimentation phase and the production phase have access to the same centralized data source. All model performance metrics are stored in a centralized MLFlow service.
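Scheduled retraining runs can also be triggered on demand through the SageMaker API; the quality gate itself lives inside the pipeline (for example, as a condition step on evaluation metrics) and decides whether the retrained model is deployed. A minimal sketch with a hypothetical pipeline name and threshold parameter:

```python
import boto3

sm = boto3.client("sagemaker")

# Kick off a retraining run of the (hypothetical) DDQC pipeline; the
# pipeline's own quality gate determines whether the retrained model
# replaces the currently deployed one.
sm.start_pipeline_execution(
    PipelineName="ddqc-retraining",
    PipelineParameters=[
        {"Name": "QualityThresholdRmse", "Value": "0.5"},
    ],
)
```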
Beyond these technical aspects, SageMaker allows us to dynamically adjust the resources of both the testing and production environments to actual demand, with the added benefit of increased cost efficiency. Compared to an early prototype of the solution that ran directly on Amazon EC2 instances, costs went down by over 80%, while service quality increased through better model tracking and pipeline monitoring. On top of that, access is centrally managed by the standard identity management processes, which improves governance and security.
Key Takeaways and Future Outlook:
When building an ML platform, the most important decision is the technology stack: you must feel comfortable with it for years to come. There is always a tradeoff between the level of customization and the required maintenance effort. Should you go for fully managed services or build from scratch on bare infrastructure? Our conclusion is that you need to be willing to question existing workflows. With small adjustments and few compromises, you can greatly reduce overhead by utilizing managed services, which speeds up development and, most of the time, also increases product quality.
The second big finding was the value of following the AWS multi-account strategy: splitting domains at the AWS account level enabled sub-teams to work more independently. The setup empowered teams to take ownership of their topics and increased transparency, while reducing friction when multiple teams steer the development of the same service.
Lastly, vendor lock-in has been a constant concern for us. With a proper design, however, the fear of losing control over data and being unable to migrate is often much greater than the actual risk. The platform was built in a way that even a migration to an on-premises system could be executed at short notice. AWS offers a wide range of managed services that provide the best of both worlds: open-source solutions with minimal maintenance effort and a strong security posture.
At Lenzing, we continue to develop and implement ML models in an agile way. Leveraging our new ML platform, we feel confident scaling existing models out to additional production lines while developing models for new quality parameters in parallel. While it took almost a year to get the first model into production, we were able to add seven additional models within three months, with three more in the testing stage. We can now add up to three additional production lines every second sprint, pushing towards the vision of covering all Lenzing production sites worldwide with data-driven quality control.
With our industrial cloud solutions, we help manufacturers optimize production efficiency. Find out how our AWS expertise can enhance your manufacturing operations.
[1] H. Washizaki, ed., Guide to the Software Engineering Body of Knowledge (SWEBOK Guide), Version 4.0, IEEE Computer Society, 2024; www.swebok.org.