Data extraction and pre-processing for a predictive maintenance project

Data are an increasingly important source of value, and companies are seeking to turn data analysis into a competitive advantage that improves production performance. This thesis describes the extraction, pre-processing and validation of data collected from the sensors and PLC of a machine that produces a dispensing pump. The work is part of a predictive maintenance project whose ultimate goal is to improve the OEE (Overall Equipment Effectiveness) and other production performance indicators of that machine. The machine is operated by a multinational company with several sites in Italy, where I carried out my curricular internship as a Data Scientist and which I cannot name because of a non-disclosure agreement. The thesis is divided into 9 chapters. It opens with a brief description of the company's business, then presents the data sources that were analyzed, describing each in detail and explaining how they interface with one another and what kinds of data they store. The next step is data ingestion: Apache NiFi extracts the data from these sources and stores them in HDFS in the form of dataframes. At this stage the data are still raw and need pre-processing before they can be used and actually provide value. The main pre-processing operations were the analysis of redundancies and the removal of unnecessary data from the dataframes, the analysis and handling of timestamps, and the handling of missing values. Because pre-processing can itself introduce errors, the last step is a validation performed on the attributes common to the various dataframes, which ensures the correctness of the data. Once validated, the dataframes could be merged into a single file to be used as input for the Machine Learning algorithms, with the aim of improving the performance of the machine.

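
The pre-processing, validation and merge steps summarized above can be illustrated with a minimal sketch. It is not the code used in the project: it works on plain Python records rather than on the HDFS dataframes, and every name in it (the columns `ts`, `machine_id`, `cycle_count`, `temperature`, the helper functions, the sample values) is a hypothetical chosen for illustration. Forward filling is only one possible way of handling missing values and is assumed here for simplicity.

```python
from datetime import datetime, timezone

def normalize_timestamp(text):
    """Parse an ISO 8601 string and express it in UTC (naive values are assumed UTC)."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

def drop_duplicates(rows):
    """Keep the first occurrence of each identical record (redundancy removal)."""
    seen, unique = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique

def forward_fill(rows, column):
    """Replace None in `column` with the last observed value (rows sorted by time)."""
    last, filled = None, []
    for row in rows:
        value = row[column] if row[column] is not None else last
        last = value
        filled.append({**row, column: value})
    return filled

def validate_and_merge(left, right, key="ts"):
    """Join rows sharing `key`, refusing rows whose common attributes disagree."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        other = right_by_key.get(row[key])
        if other is None:
            continue  # inner join: keep only timestamps present in both sources
        common = (set(row) & set(other)) - {key}
        conflicts = {c for c in common if row[c] != other[c]}
        if conflicts:
            raise ValueError(f"mismatch at {row[key]} on {sorted(conflicts)}")
        merged.append({**row, **other})
    return merged

def prepare(rows):
    """Normalize timestamps, drop duplicate records and sort by time."""
    rows = [{**r, "ts": normalize_timestamp(r["ts"])} for r in rows]
    return sorted(drop_duplicates(rows), key=lambda r: r["ts"])

# Hypothetical sample records from two sources using different time offsets.
plc = [
    {"ts": "2021-03-01T10:00:00+01:00", "machine_id": "M1", "cycle_count": 120},
    {"ts": "2021-03-01T10:00:00+01:00", "machine_id": "M1", "cycle_count": 120},
    {"ts": "2021-03-01T10:01:00+01:00", "machine_id": "M1", "cycle_count": None},
]
sensors = [
    {"ts": "2021-03-01T09:00:00+00:00", "machine_id": "M1", "temperature": 41.5},
    {"ts": "2021-03-01T09:01:00+00:00", "machine_id": "M1", "temperature": 42.0},
]

plc_clean = forward_fill(prepare(plc), "cycle_count")
dataset = validate_and_merge(plc_clean, prepare(sensors))
```

In this sketch the validation is deliberately strict: any disagreement on a shared attribute stops the merge, so that inconsistencies introduced during pre-processing surface before the data reach a model.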