Multimodal Classification for Pattern and Anomaly Detection in Patients using Biomonitoring Sensors


Università degli Studi di Napoli Federico II
Dipartimento di Ingegneria Elettrica e delle Tecnologie dell'Informazione
Classe delle Lauree Magistrali in Ingegneria dell'Informazione, Classe n. LM-21
Corso di Laurea Magistrale in Ingegneria Biomedica

Thesis
Multimodal Classification for Pattern and Anomaly Detection in Patients using Biomonitoring Sensors

Supervisor: Prof. De Benedetto Egidio
Co-Supervisors: Prof. Portugal David B. S.; Ing. Famá Fernanda
Candidate: De Felice Pierluigi, Matr. M54/951

Academic Year 2020/2021


Acknowledgements

It is commonly said that great goals are not achieved alone. We sometimes think that our knowledge comes from all the information we accumulate and retain from research, articles, books and much more. And often all this knowledge, study and accumulated effort leads to frustration, headaches and disappointment. When nothing seems to be enough, when all our brilliant ideas seem to be just ideas, when all our effort seems to go unrewarded, that is when we start to give up. Then, in a simple chat among friends, in a carefree moment with the people we are truly at ease with, we manage to find ourselves again and we try once more with a different spirit. We begin to see problems in other ways, we see solutions where we had never thought to look, and we carry our goals forward. We realize that all we needed was to socialize, to discuss, to laugh with someone, to exchange views, to share thoughts and experiences, to realize once again that we were not alone, and that knowledge does not come only from what we study and learn, but also from the people around us and the adventures we face. I therefore feel I must thank those who have always supported me and listened to me, and the people with whom I have shared the last two and a half years of this journey. First of all, I would like to thank my supervisor, Prof. De Benedetto, for his help especially in this last period, but above all for granting me the opportunity to carry out the project abroad, guaranteeing me his support upon my return and allowing me to face this final challenge with greater serenity.

I would like to thank my supervisors, Prof. David Portugal and Ing. Fernanda Famà, for giving me the opportunity to join their project in such an inspirational environment, to confront a new reality and new challenges, for their guidance, and for continuously making me rethink my work through their interventions. Thank you for all the patience and your availability during my stay in Coimbra. It was truly an amazing experience and I learned a lot from it.

The greatest thanks go to my parents. This thesis is for them, and to them I dedicate the joy of crossing the finish line of this degree. To my mother, a unique, tireless, immense woman, yet also sweet and understanding, who has always had to deal with my apathy. I can assure you that I will always keep calling you, wherever I may be, to ask your advice and to learn, because I know there will always be something you can and must teach me. And to my father: thank you is an enormous word and yet still far too little to express my gratitude for the support you have always given me and the esteem you have always shown me. If one day I know I have become even just a part of the person you are, then I will be satisfied. To my sister Sara: you have been my greatest accomplice in recent years, throughout the time spent at home, when, whenever I felt lost, it was enough to look for you in the next room to find myself again, with a coffee at sunset or an episode of Friends. Everyone should absolutely have a sister like you... not least because putting up with you is hard work for me alone! Finally, I would like to thank all my friends. To all of you, so different yet so important, each for unique reasons, I want to express my deepest gratitude. To Michele and Mattia, for all the bottles of gin, for feeding me for two years while I lived at their place, and for telling me off whenever I wasted my time feeling sorry for myself. To Alessandro, my role model as an engineer, the best motivational coach I could ever have, always ready to spur me on even in the worst moments. To Federico, because even though we have spent very little time together in the last two years, you are part of my journey, you are part of this adventure, and I know that if you had stayed here with us everything would have been even more intense.

To Noemi, my very first little friend, by my side from the very first day, when we were still children, up to this last one, on which I have become a "doctor". With the five of you I have shared hundreds of hours at the faculty, in lectures, at the bar, waiting for exam calls, pre-exam anxieties and post-exam drinking. Without you, reaching graduation would have made no sense, and it would certainly have been much more boring and far less fun. To Gigi, much more than a flatmate: a brother and a great friend. Thank you for convincing me to leave for what has been one of the most beautiful experiences of my life; I owe it to you to the very end. Without you it would not have been the same, and I am sure that on my own I would never have managed to find a house as wonderful as the one we shared over the last few months. To Steppi, my lifelong friend, my anchor in life, physically and metaphorically. You are the person I can always count on; thank you for loving me for who I am, for always being by my side, and for always being ready to share a story and a beer with me. And to those beasts, my friends, who, rather than encouraging me, are probably the cause of my extra year at university. To them I owe the funniest moments of my life. There are so many memories going through my head that it is impossible to find the right words to describe them. My emotions, my smiles and my feelings do it for me, blending into a bundle of sincere affection and gratitude for all of you. Thank you for making everything so much more special.

Finally, I need to thank the people with whom I have shared the last months, the friends with whom I have lived my fantastic Erasmus. It was one of the most intense experiences of my life. Being myself, living. Learning a new language and culture. In an incredible city. I arrived hoping to meet good friends and I found the most wonderful people ever. It is amazing how in such a short time you can create a family that you never want to part with. Thank you, really: you will always be a part of me. Uma vez Coimbra, para sempre saudade.

I have not failed. I've just found 10,000 ways that won't work.
Thomas Edison


Contents

Acknowledgements
List of Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 The importance of sensor fusion in digital healthcare
  1.2 Context and motivation
  1.3 Main Goal
2 Background and Related work
  2.1 Fundamentals of sensor fusion for healthcare
  2.2 Traditional sensor fusion techniques
  2.3 Artificial Intelligence and Machine Learning approaches for sensor fusion
    Support Vector Machine
    Neural Network
    K-Nearest Neighbor
    Decision Trees and Random Forest
  2.4 Identified problems and main challenges
3 Methodology
  3.1 Database choice and motivation
  3.2 MIMIC III database
    3.2.1 CITI course
    Database structure
    Database filtering
  3.3 Python
    What is Python?
    Python libraries and packages
    Jupyter Notebook
4 Implementation
  Overview of the implementation process
  Data pre-processing
    Data Reduction
    Data Cleaning
  Features extraction
    Features from Numerical data
    Features from Waveform data
  Features selection and fusion
  Machine Learning algorithms
  Data partitioning
5 Results and Discussion
  Comparison of Implemented Algorithms
  Evaluation of a classifier
  Results from the implemented methods
  Discussion
6 Conclusions
  Future Work
Bibliography

List of Acronyms

ABP - Arterial Blood Pressure
AI - Artificial Intelligence
AVF - Arteriovenous Fistula
CITI - Collaborative Institutional Training Initiative
CPT - Current Procedural Terminology
DRG - Diagnosis Related Group
ECG - Electrocardiogram
EDA - Electrodermal Activity
EEG - Electroencephalogram
EHR - Electronic Health Records
EMG - Electromyography
EKF - Extended Kalman Filter
GSR - Galvanic Skin Response
HIPAA - Health Insurance Portability and Accountability Act
HIS - Hospital Information System
HRV - Heart Rate Variability
ICD-9 - International Classification of Disease, 9th Edition
ICU - Intensive Care Unit
IDE - Integrated Development Environment
IMU - Inertial Measurement Unit
IoT - Internet of Things
K-NN - K-Nearest Neighbor
MIMIC III - Medical Information Mart for Intensive Care III
ML - Machine Learning
NN - Neural Network
PAP - Pulmonary Artery Pressure
PPG - Photoplethysmogram
RNN - Recurrent Neural Network
ROC - Receiver Operator Characteristics
RR - Respiration Rate
SVM - Support Vector Machine
TNR - True Negative Rate
TPR - True Positive Rate
UKF - Unscented Kalman Filter
WFDB - WaveForm DataBase
WoW - Wireless biomonitoring stickers and smart bed architecture: towards Untethered Patients

List of Figures

2.1 Data-level fusion. Source: [39]
2.2 Feature-level fusion. Source: [39]
2.3 Decision-level fusion. Source: [39]
2.4 Classification of data by Support Vector Machine (SVM), taken from [15]
2.5 An example of NN architecture [41]
3.1 Overview of the MIMIC-III critical care database. Source: [26]
3.2 Distribution of primary International Classification of Diseases, 9th Edition (ICD-9) codes by care unit for patients aged 16 years and above. Source: [26]
3.3 Distribution of circulatory system diseases codes; the selected rows are the codes considered for the recognition of patients
4.1 Overview of the implementing process, starting from the ICU data to the model prediction
4.2 Waveform Record from Physionet, Subject_ID=
4.3 Numeric Record from Physionet, Subject_ID=
4.4 Block diagram of the pre-processing phase of the Pan Tompkins algorithm
4.5 Recap of the considered signals for the feature extraction
4.6 R-peak detection using the processing subpackage of the WFDB library
4.7 DataFrame of the extracted ECG features for patients with heart diseases
4.8 Final Dataframe including all the features for all the patients
4.9 Data visualization using the boxplot function
4.10 Distribution of dataset after partitioning
5.1 Comparative graph of the results
5.2 Results for K-NN
5.3 Results for SVM
5.4 Results for Decision Trees (Training set)
5.5 Results for Decision Trees (Test set)
5.6 Results for Random Forest (Training set)
5.7 Results for Random Forest (Test set)

List of Tables

2.1 Common features in time domain and spectral domain, for different physiological signals
2.2 Overview of AI sensor fusion methods surveyed
3.1 Analyzed datasets
3.2 Overview of some of the data tables comprising the MIMIC-III (v1.3) critical care database


1 Introduction

Vast amounts of raw data surround us, data that is difficult to manage for digital applications, so its analysis has become a necessity. Sensor data fusion has been a fast-developing area of research in recent years, thanks to the increasing availability and variety of sensors, and because software for data fusion applications is becoming available in the commercial marketplace [22]. With recent progress in digital data acquisition, machine learning and computing infrastructure, new applications are expanding into areas that were previously thought to be the sole province of human experts, gradually changing the landscape of healthcare and biomedical research. Numerous algorithms and techniques have been introduced or applied, ranging from classic estimation, statistical and pattern recognition methods to artificial intelligence, i.e. machine learning and deep learning techniques.

In this chapter, an initial introduction to the topics of the project is provided, presenting the reader with the context of the work and its main objectives. Moreover, the last section gives the document overview, explaining the contents and main objectives addressed in the development of the project.

1.1 The importance of sensor fusion in digital healthcare

In digital healthcare applications, the increasing quantity of information arising from multiple sensors, i.e. the collection and classification of diverse multimodal data, including heart, muscle and brain activity (Electrocardiogram (ECG), Electromyography (EMG) and Electroencephalogram (EEG), respectively), respiration rate (RR), body temperature, IR response, blood oxygen, sweat analysis, body motion through Inertial Measurement Unit (IMU) devices, and Galvanic Skin Response (GSR), has led to the need to fuse such data, combining and integrating the information to allow a better study of physiological conditions in health monitoring, prediction and anomaly detection. The data generated by sensors can allow health practitioners to identify sensitive circumstances, such as heart or respiratory diseases, obstructive sleep apnea syndrome and distress situations, more rapidly and more accurately, and encourage patients to be better aware of their symptoms and changes [39]. Due to their complex nature, the identification of many diseases requires multimodal signals; therefore, fusion of data from multiple, potentially heterogeneous, sensor sources is becoming a fundamental task that directly impacts identification performance. In general, multi-sensor data fusion provides significant advantages compared to using only a single data source; for example, information obtained throughout the fusion process has a higher level of abstraction than the original individual input data set. Moreover, it provides a gain in certainty and accuracy [12], due to the parallel processing of different information from multiple sensors.

1.2 Context and motivation

In the context of sensor fusion and multimodal classification, the WoW R&D project proposes an innovative architecture for wearable devices, based on electronic skin (e-skin) patches that adhere non-intrusively to the human epidermis and collect physiological and behavioral data. These patches, designated as Biostickers, include sensors for EMG, ECG, EEG, GSR, temperature, motion and location monitoring, and are based on state-of-the-art methods for the fabrication of stretchable electronic circuits, using a patented ink formulation and fabrication method [33]. The stickers are wirelessly connected to a smart IoT unit embedded in the hospital beds, enabling data acquisition and transmission. Each smart bed is associated with a patient and collects all data from his/her sticker, which are then processed in a local or cloud server, allowing communication to a proprietary Hospital Information System (HIS).

The multimodal data obtained by the Biostickers can be used to identify physiological and emotional responses through their combination and classification, using sensor fusion and AI techniques. Indeed, when a diverse range of data is fed into a multi-sensor classification algorithm, by combining the physiological and emotional computing data with AI algorithms, one may discover new digital biomarkers, i.e. correlations between the physiological data and various health conditions.

1.3 Main Goal

Abnormal patterns in physiological data are significantly valuable in the medical domain. Classifying unusual patterns in health parameters, especially for digital healthcare systems, enables clinicians to make accurate decisions in a short time [5]. With this in mind, the main goal of this dissertation work is to study, design and develop multimodal classification methods, having as input distinct vital signs such as ECG, temperature, heart rate, respiratory rate and pulse oximetry data, in order to identify patterns, thus classifying and detecting relevant anomalies in the patient's state, such as unusual patterns which do not conform to the expected behavior of the data, in particular the presence or absence of heart diseases. We will explore well-known sensor fusion techniques and machine learning classification algorithms from the current state of the art, developing a system that, starting from raw electronic health records, can lead us to a new decision-making model, using as input features extracted from heterogeneous vital signs of different nature.

2 Background and Related work

In this chapter, a study of key concepts, techniques and main methods from the literature is presented. Firstly, it introduces basic knowledge about sensor fusion, focusing on healthcare applications and physiological signals, while referring to relevant related research work. Then, a review of the main techniques is reported, highlighting their advantages and disadvantages. Finally, the last section sums up the possible problems and the main challenges of sensor fusion.

2.1 Fundamentals of sensor fusion for healthcare

Humans and animals have evolved the capability to use multiple senses to improve their ability to survive, recognizing their environment through the evaluation of signals from multiple and multifaceted sensors. Nature has found a way to integrate information from multiple sources into a reliable and feature-rich recognition [10]. Humans, for example, combine signals from the five body senses (sight, hearing, smell, taste, and touch) with knowledge of the environment to create a dynamic model of the world in which they live. Based on this information, the individual interacts with the environment and makes decisions about present and future actions. Sensor fusion is a method of integrating signals from multiple sources: it allows information extracted from several different sources to be integrated into a single signal or piece of information [52]. This chapter aims to provide a comprehensive survey of the multimodal digital signal fusion schemes and techniques that have been proposed, focusing on digital healthcare applications.

In general, multi-sensor data fusion provides significant advantages compared to using only a single data source [12], [18]; for instance, the improvement in performance can be summarized in three general areas:

Representation: information obtained throughout the fusion process has a higher level of abstraction than the original individual input data set, since all the data coming from different sources contribute to a better estimation of the working environment. For this reason, fusion of multiple sensor data provides a more complete view of the process under study. Moreover, when multiple independent measurements of the same feature/attribute are fused, the resolution of the resulting value is higher than what can be achieved using a single sensor.

Certainty: if more than one sensor is used, the redundant information about the same environment resulting from the combination of data allows more accurate information to be obtained. Increasing the dimensionality of the measurement space significantly enhances robustness against environmental interferences.

Accuracy: thanks to the parallel processing of different information from multiple sensors, if the data is initially noisy or contains errors, the fusion process should try to reduce or eliminate these. In addition, when information is redundant and concordant, improved accuracy can be obtained.

Usually, the gain in certainty and the gain in accuracy are correlated [21].

According to each specific problem, different sensor fusion approaches can be adopted. Gravina et al. [18] provide a comprehensive review of the state-of-the-art techniques on multi-sensor fusion. Their survey discusses clear motivations and advantages of multi-sensor data fusion, aiming at providing a systematic categorization and a common comparison framework of the literature, by identifying distinctive properties and parameters affecting data fusion design choices at different levels (data, feature, and decision). The survey also covers data fusion in the domains of physical activity recognition, emotion recognition and general health. In terms of the level of abstraction of data processing, multi-sensor fusion is typically divided into three main categories: data-level fusion, feature-level fusion, and decision-level fusion.

In particular, if the system involves multiple homogeneous sensors measuring the same physical phenomena, then sensor data can be directly fused [28]. On the contrary, data generated from heterogeneous sources cannot be combined directly, and feature- or decision-level fusion techniques must be adopted.

Data-level fusion relates to multiple homogeneous data sources that are collected and fused at data level (or sensor level). Specifically, data signals can come from different channels of the same sensor (e.g. a three-axis accelerometer), from different nodes with the same sensor type, or from a combination of the previous options. Figure 2.1 shows the representative scheme of data-level fusion.

Figure 2.1: Data-level fusion. Source: [39].

In [40], Nathan and Jafari presented a fusion of multiple signal modalities, including ECG, PPG, and accelerometer data, to improve heart rate estimates. The signal processing methods described by the authors show promising average error rates (below 2 beats/min) in the presence of motion artifacts, which are a key source of noise that usually affects traditional heart rate estimation algorithms. However, the technique has not been tested on patients with heart rate variability or other cardiac conditions, which would likely require some tuning of the parameters.

Feature-level fusion is applied after extracting features independently from each data source. Feature-level fusion includes feature normalization, i.e. when feature values vary both in range and distribution, it is useful to normalize their baseline and amplitude to ensure that the contribution of each feature is comparable; and feature selection, i.e. obtaining the most significant feature vector. Then, the extracted feature sets can be fused to create a new high-dimensional feature vector that represents the input for the classification/pattern recognition step [18].

In this type of fusion, neural networks and probability statistics can be utilized, as they adapt well to this type of strategy. Figure 2.2 illustrates this architecture.

Figure 2.2: Feature-level fusion. Source: [39].

Given the magnitude and complexity of the raw data, feature extraction provides a meaningful representation of the sensor data, which can formulate the relation of raw data with the expected knowledge for decision making, pattern recognition and/or anomaly detection [5]. According to specific application requirements, features can be extracted in the time domain, in the frequency domain, or in a combination of both. Since sensor data which provide monitoring of vital sign parameters tend to be continuous time series readings, most of the considered features are related to the properties of time series signals. In the time domain, the extracted features usually include basic waveform characteristics and statistical parameters such as mean, variance, peak counts, etc. However, for extra knowledge about the periodic behavior of time series, features acquired from the frequency domain, such as power spectral density, low-pass/high-pass filtered components, spectral energy, and wavelet coefficients of the signal, should be considered too [5]. For example, the online classification of sleep/awake states described in [29] uses a feedforward Neural Network on ECG and RR features in the frequency domain. The method was conceived in view of its applicability to a wearable sleepiness monitoring device, and it demonstrates that the combination of ECG and respiratory signals can discriminate with high accuracy between sleep and wake states. However, the proposed method requires a preliminary and time-consuming stage of labeling recorded data into sleep and wake states, and for this reason only data from a relatively small sample of subjects are available for the training of the NN.
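As an illustration of this kind of feature extraction, the following minimal Python sketch (not taken from the referenced works) computes a few of the time-domain and spectral descriptors listed above for a generic physiological time series; the sampling rate, the window handling and the 0.5 Hz low/high-frequency cutoff are arbitrary assumptions.

import numpy as np
from scipy.signal import welch

def extract_features(signal, fs=125.0):
    """Return a few time-domain and frequency-domain features of one window."""
    signal = np.asarray(signal, dtype=float)
    feats = {
        "mean": float(np.mean(signal)),
        "std": float(np.std(signal)),
        "peak": float(np.max(np.abs(signal))),
        # number of zero crossings (sign changes) in the window
        "zero_crossings": int(np.sum(np.diff(np.signbit(signal).astype(int)) != 0)),
    }
    # Welch power spectral density; spectral energy and a low/high band ratio
    f, pxx = welch(signal, fs=fs, nperseg=min(256, len(signal)))
    feats["spectral_energy"] = float(np.sum(pxx))
    low_band, high_band = pxx[f < 0.5].sum(), pxx[f >= 0.5].sum()
    feats["low_high_ratio"] = float(low_band / high_band) if high_band > 0 else float("nan")
    return feats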

Table 2.1 summarizes the most common features extracted from physiological sensor data in the literature.

Table 2.1: Common features in time domain and spectral domain, for different physiological signals.

ECG. Time domain: Mean R-R, Standard deviation R-R, Mean HR, Std HR, Number of R-R intervals, Mean R-R interval, Standard deviation R-R interval. Spectral domain: Spectral energy, Power spectral density, Low-pass filter, Low/high frequency.

SpO2. Time domain: Mean, Zero crossing counts, Entropy, Slope, Self-similarity. Spectral domain: Energy, Low frequency.

HR. Time domain: Mean, Slope, Self-similarity, Standard deviation. Spectral domain: Energy, Low/high frequency, Wavelet coefficients of data segments, Power spectral density.

RR. Time domain: Mean, Min, Max.

Other. Time domain: Zero crossings count, Peak value, Rise time (EMG). Spectral domain: Spectral energy (EEG), Median and mean frequency, Spectral energy (EMG).

Decision-level fusion is applied after the classification step, and it utilizes information that has already been abstracted to a certain level through preliminary sensor data or feature-level processing to generate high-level decisions. A representative scheme is shown in Figure 2.3. An important aspect of decision fusion is that it allows the combination of heterogeneous sensors whose measurement domains have been processed with different algorithms. Common decision-level sensor fusion methods include Bayesian inference, fuzzy logic, and classical inference.

Figure 2.3: Decision-level fusion. Source: [39].

For instance, in [64], an emotion recognition framework focused on a decision-level weighted fusion approach of multichannel physiological signals was implemented. The authors chose four signals: EEG, ECG, Respiration Amplitude (RA), and Galvanic Skin Response (GSR). This method reduces the influence of weakly correlated features and enhances the influence of strongly correlated features through feature weighting, thus improving the robustness of the classification algorithm. However, various physiological signals have different abilities to classify specific emotions and, for this reason, each physiological signal should be combined in a different way that can benefit the results of the classifier.

The three fusion strategies at different levels can also be applied in a multi-level knowledge fusion approach, as proposed in [55] to learn a blood pressure model using electrocardiogram (ECG) sensor data. The advantage of the methodology lies in fusing the models built for different configurations in order to obtain the best results without the need for calibration. However, further effort is still needed to bring these results closer to being acceptable for medical purposes.
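The following short Python sketch illustrates the general idea of decision-level weighted fusion (it is an illustration only, not the method of [64]): per-modality classifiers output class probabilities, which are combined with weights reflecting an assumed reliability of each modality. The weights and probability values below are invented for the example.

import numpy as np

def weighted_decision_fusion(probas, weights):
    """probas: list of per-modality class-probability vectors; returns the fused class index."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                               # normalize the modality weights
    fused = sum(w * np.asarray(p) for w, p in zip(weights, probas))
    return int(np.argmax(fused))                           # fused class decision

# Example: ECG-, GSR- and respiration-based classifiers voting on two classes.
p_ecg, p_gsr, p_resp = [0.7, 0.3], [0.4, 0.6], [0.55, 0.45]
print(weighted_decision_fusion([p_ecg, p_gsr, p_resp], weights=[0.5, 0.3, 0.2]))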

2.2 Traditional sensor fusion techniques

When multiple sensors are used, the data is multivariate with possible dependencies. This means that, in order to make sense of the data, appropriate data fusion techniques are essential [56]. Different algorithms are considered for different levels of fusion. Traditional data fusion techniques include probabilistic and statistical fusion (e.g., Kalman Filtering and Bayesian fusion), evidential belief reasoning fusion (e.g., Dempster-Shafer theory), and artificial intelligence techniques (e.g., Support Vector Machine and Neural Networks) [12]. A survey of the main techniques based on statistical approaches is reported in this section.

Kalman Filtering
The Kalman Filter is a statistical recursive data processing algorithm which allows several models of sensors to be easily incorporated. The algorithm works as a two-phase process: "Predict" and "Update". In the prediction phase, the Kalman Filter produces estimates of the current state variables, along with their uncertainties, from the previous timestep. Once the outcome of the next measurement (necessarily corrupted with some error, including random noise) is observed, these estimates are updated using a weighted average, with more weight being given to estimates with greater certainty. Typically, the two phases alternate, so the algorithm is recursive. As every iteration requires almost the same effort, the Kalman Filter is well suited to real-time usage, using only the present input measurements, the previously calculated state and its uncertainty matrix. The main advantage of the Kalman Filter is its high computational efficiency, since the entire sequence of old observations is not reprocessed with every new observation. However, it is restricted to linear system dynamics; extensions and generalizations of the method have also been developed [27], such as the Extended Kalman Filter (EKF) and the Unscented Kalman Filter (UKF), which work on nonlinear systems [63]. In [57], a distributed filter is implemented in which each node dynamically tracks the instantaneous least-squares fusion of the current input measurements. This allows the nodes to run independent local Kalman filters using the globally fused input, and to obtain the performance of a centralized Kalman filter.
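To make the predict/update cycle concrete, the following minimal one-dimensional Kalman filter sketch tracks a constant quantity (e.g. a slowly varying vital sign); the process noise q, measurement noise r and initial values are arbitrary assumptions, not parameters used in this work.

def kalman_1d(measurements, q=1e-3, r=0.5, x0=0.0, p0=1.0):
    x, p = x0, p0                     # state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: constant-state model, so only the uncertainty grows.
        p = p + q
        # Update: weight the new measurement by the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates

print(kalman_1d([5.1, 4.9, 5.3, 5.0, 4.8]))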

Bayesian Inference Technique
Bayesian inference is a statistical data fusion algorithm based on Bayes' theorem, with a recursive predict-update process [1]. Bayes' rule has to be applied recursively because in sensor fusion the system state is usually time-dependent, and so it changes over time. However, when Bayesian inference is used for sensor fusion, certain drawbacks can emerge; for instance, the required knowledge of the a priori probabilities may not always be available [6].

The Naive Bayes classifier is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assigns the most likely class to a given example described by its feature vector, assuming that the presence of a particular feature in a class is unrelated to the presence of any other feature [49]. The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods. A system for activity recognition using multi-sensor fusion based on a Naive Bayes classifier is proposed in [13]. Using a fusion approach with four sensors, the recognition accuracy can reach very high values; meanwhile, the system is also robust to the failure of one sensor. However, in order to eliminate the impact of changes in sensor orientation, which can deliver different reports about a movement, an estimation of the constant gravity vector is required. L. Gao et al. [14] also propose a hierarchical classifier that combines the Decision Tree classifier with the Naive Bayes classifier. The proposed hierarchical classifier consists of two layers: the preliminary classifier and the final classifier. The Naive Bayes classifier is adopted as the preliminary classifier, while the Decision Tree classifier is used as the final one. The results showed that this system reduces the recognition accuracy by only 2.8%, while the savings in energy consumption are much higher.
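As a minimal sketch of how such a classifier can be trained in Python, the example below fits a Gaussian Naive Bayes model on synthetic feature vectors (which could stand for per-window sensor statistics); the data and labels are invented for illustration.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two invented classes of 4-dimensional feature vectors.
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
print(clf.predict(X[:3]), clf.predict_proba(X[:3]).round(2))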

Dempster-Shafer (D-S) Theory of Evidence
D-S theory allows the representation and combination of different measures of evidence. It can be considered as a generalization of the Bayesian framework and permits the characterization of uncertainty and ignorance. D-S theory allows one to combine evidence from different sources and arrive at a degree of belief (represented by a mathematical object called a belief function) that takes into account all the available evidence [60]. Even though Dempster-Shafer methods can handle a general level of uncertainty, the difficulty of estimating belief functions and their restricted domain of application are among the main challenges.

Fuzzy Logic involves the extension of Boolean logic (i.e., two-valued logic) to a continuous-valued logic via the concept of membership functions, which are continuous functions defined on the interval (0,1) that may be used to quantify vagueness and imprecise concepts [16]. In [36], a telemonitoring system based on fuzzy logic is proposed, ensuring pervasive in-home health monitoring for elderly people. This multimodal fusion increases the reliability and the robustness of the whole system, taking into account temporary sensor malfunctions and environmental disturbances. Moreover, the Fuzzy Logic approach allows data to be combined easily and other sensors to be added. However, the system includes different sensors which use different physical principles, cover different information spaces, and generate data in different formats at different sampling rates. Therefore, a complex pre-processing strategy is required.
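A membership function is the basic building block of such fuzzy approaches. The sketch below defines a simple triangular membership function and evaluates the degree to which a skin temperature reading belongs to a hypothetical "fever" set; the breakpoints are invented for illustration and are not taken from [36].

def triangular_membership(x, a, b, c):
    """Degree in [0, 1] to which x belongs to a fuzzy set that peaks at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Degree to which a temperature of 37.2 degrees is considered "fever".
print(triangular_membership(37.2, a=36.8, b=38.0, c=40.0))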

2.3 Artificial Intelligence and Machine Learning approaches for sensor fusion

Artificial Intelligence is a growing field with a variety of practical daily-life applications and active research topics. AI techniques developed for data association make use of expert systems and neural networks, which are computer systems designed to emulate the decision-making ability of the human brain [19]. Thus, as in human learning, the computer becomes capable of improving its performance from acquired knowledge. For this reason, the decisions made by an AI system are based on the information acquired during its development. However, the efficiency of the system is a function of the amount of knowledge pre-programmed into it, so large datasets are required. In this context, Machine Learning (ML) refers to the AI system's ability to acquire and integrate knowledge through large-scale observations, and to improve and extend itself by learning new knowledge rather than by being programmed with that knowledge [65]. Furthermore, it is a technique that lets the computer learn from the provided data without thoroughly and explicitly programming it for every problem. It aims at modeling profound relationships in data inputs and reconstructing a knowledge scheme [37]. The result of learning can be used for estimation, prediction, and classification. Compared with a range of classical probabilistic data fusion techniques, machine learning methods remarkably renovate fusion techniques by offering strong computing and predictive abilities. In this section, a survey of the main data fusion methods based on Machine Learning is provided.

Support Vector Machine
The Support Vector Machine (SVM) is an efficient machine learning technique that analyses data and extracts patterns for classification and regression analysis. It is one of the main statistical learning methods, able to classify unseen information by deriving selected features and constructing a high-dimensional hyperplane to separate the data points into two classes in order to make a decision model [8]. In fact, the main objective of the SVM is to find the optimal separating hyperplane that correctly classifies and separates data points as far as possible, by minimizing the risk of misclassifying the training samples and unseen test samples. This means that the optimized hyperplane should maximize the margin between the hyperplane and the closest points taken from the training set examples. The idea of SVM classifiers can be described as follows [15]: suppose there are m observation samples (the training set), and each of them is assigned a coded class label, +1 for the positive class and -1 for the negative one. This training set can be separated by the hyperplane $w^T x_i + b = 0$, where $w$ is the weight vector and $b$ is the bias. The equations of the marginal hyperplanes, $H_1$ and $H_2$, are:

$H_1: w^T x_i + b = +1$,  (2.1)

$H_2: w^T x_i + b = -1$.  (2.2)

The distance between the marginal hyperplanes (i.e., the margin) is equal to $2/\|w\|$. Any training samples that fall on the marginal hyperplanes are support vectors, as shown in Figure 2.4.

Figure 2.4: Classification of data by Support Vector Machine (SVM), taken from [15].

Then, the optimization criterion to obtain the optimum separating hyperplane is to find the weight vector and the bias by maximizing the margin and minimizing the training error. As the SVM has the ability to handle high-dimensional data using a minimal training set of features, it has recently become very popular for digital healthcare applications, such as anomaly detection and decision-making tasks.
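A minimal scikit-learn sketch of the two-class linear SVM formulation above is shown below; the feature matrix is synthetic (not MIMIC data), and the printed hyperplane parameters correspond to the w, b and margin of Equations (2.1)-(2.2).

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two invented, roughly separable classes in a 2-D feature space.
X = np.vstack([rng.normal(-1, 0.5, (40, 2)), rng.normal(1, 0.5, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("w =", svm.coef_[0], "b =", svm.intercept_[0])     # separating hyperplane
print("margin width =", 2 / np.linalg.norm(svm.coef_[0]))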

In [64], a decision-level weighted fusion strategy for emotion recognition based on a Support Vector Machine classifier is proposed. The experiment was conducted using the MAHNOB-HCI database.

Neural Network
A Neural Network (NN) is an AI approach which is widely used for classification and prediction [44]. A neural network consists of layers of processing elements that may be interconnected in a variety of ways. In particular, it is composed of nonlinear computational elements (neurons), operating in parallel and connected as a graph topology characterized by differently weighted links. It comprises input variables, output variables and weights, and the network behavior depends on the relationship between input and output variables. In general, there are three types of layers, as shown in Figure 2.5. The first layer is the input layer, which receives the raw data fed into the network. The second layer is the hidden layer; there may be several hidden layers, depending on the structure of the neural network. The last layer is the output layer, and its performance depends on the activity of the hidden layers and the weights of the hidden and output units. The number of layers and of neurons in each layer is specified by the designer through a process of trial and error [2].

Figure 2.5: An example of NN architecture [41].

The NN method models the training data by learning the known classification of the records and comparing it with the predicted classes of the records, in order to modify the network weights for the next iterations of learning. NNs have proven to be more powerful and more adaptable methods compared to traditional linear or non-linear analyses and, for this reason, they are presently among the most popular data modelling methods used in the medical domain. In fact, Neural Networks are capable of learning complicated nonlinear relationships from sets of training examples. This property makes them well suited to pattern recognition problems involving the detection of complicated trends in high-dimensional data sets. One such problem domain is the detection of medical abnormalities from physiological measures: neural networks have been applied to problems such as the detection of cardiac abnormalities from electrocardiograms and of breast cancer from mammograms [20]. However, one major problem currently is determining the best topology for any given task. Vu et al. [62] proposed a framework based on NNs to recognize Heart Rate Variability (HRV) patterns using ECG and accelerometer sensors. Experimental results show that the framework is able to deal with input data never seen during learning in a non-stationary environment, whereby an unexpected state of the subject can be detected in real time and, in case of emergency, the subject can be promptly attended to. The major challenge of the approach has been the setting of the number of nodes in the hidden layer, in order to avoid overlapping decision areas and overfitting.
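The sketch below shows a small feedforward network of the kind described above, implemented with scikit-learn's MLPClassifier and a single hidden layer; the data, the nonlinear labelling rule and the hidden-layer size are invented assumptions for illustration only.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))                      # e.g. 6 fused features per sample
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)       # invented nonlinear labelling rule

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print("training accuracy:", round(mlp.score(X, y), 3))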

In [59], a wearable sensor-based system is proposed for activity prediction using a Recurrent Neural Network (RNN) on an edge device. RNNs are a class of neural networks that are naturally suited to processing time-series data and other sequential data [66]. The reason that RNNs can handle time series is their recurrent hidden state, whose activation at each time step is dependent on that of the previous time step. In this case, the input data of the system is obtained from multiple wearable healthcare sensors such as ECG, magnetometer, accelerometer and gyroscope sensors. Then, an RNN is trained based on the features, and the trained network is used for predicting the activities. The system has been compared against conventional approaches on the MHEALTH public dataset. The experimental results show that the proposed approach outperforms other traditional methods, making it very suitable for real-time analysis.

K-Nearest Neighbor
K-Nearest Neighbor (k-NN) is one of the most fundamental classification methods, and it is commonly used when there is little or no prior knowledge about the distribution of the data [46]. The classification typically involves partitioning samples into training and testing categories. During the training process, the true class of each training sample is used to train the classifier, while during testing the class of each test sample is predicted. It is useful to notice that k-NN is a "supervised" classification method, in that it uses the class labels of the training data. K-NN is based on a distance function that measures the difference or similarity between two samples [24]. The standard Euclidean distance d(x, y) between two samples x and y is often used as the distance function, defined as follows:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (a_i(x) - a_i(y))^2}$.  (2.3)

This function can also be generalized for a greater number of features and input samples; moreover, different distance functions can be adopted, depending on the application. Based on the calculated distances, the k-Nearest Neighbor classification rule assigns to a test sample the majority category label of its k nearest training samples. In practice, k is usually chosen to be odd, so as to avoid ties. The k = 1 rule is generally called the nearest-neighbor classification rule.
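A minimal k-NN sketch using the Euclidean distance of Equation (2.3) is given below; the choice k = 3 and the synthetic samples are illustrative assumptions only.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(3, 1, (30, 3))])
y = np.array([0] * 30 + [1] * 30)

# k = 3 nearest neighbors, majority vote with Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
print(knn.predict([[0.2, -0.1, 0.4], [2.8, 3.1, 2.9]]))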

This technique may present several problems if the features do not meet certain conditions regarding the measurement scale and the uncertainty. However, increased performance of the classifier can sometimes be achieved when the feature values are transformed prior to classification analysis. Two commonly used feature transformations are standardization and fuzzification [46]. The first removes scale effects caused by the use of features with different measurement scales, by transforming raw feature values into z-scores using the mean and standard deviation of the feature values over all input samples. Thus, the range and scale of the z-scores should be similar, and the raw features will have the same influence on the distance between samples. Fuzzification, instead, is a transformation which exploits uncertainty in feature values in order to increase classification performance. The technique replaces the original features by mapping the original values of an input feature into 3 fuzzy membership values in the range (0,1). In this way, the feature values are more heterogeneous across the classes considered and can discriminate classes well.
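The standardization step described above can be sketched in a few lines with scikit-learn; the small feature matrix below (heart rate, temperature, respiration rate per row) is a made-up placeholder for the extracted feature vectors.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[72.0, 36.6, 18.0],      # e.g. heart rate, temperature, RR
              [95.0, 38.1, 24.0],
              [60.0, 36.2, 12.0]])
X_std = StandardScaler().fit_transform(X)
print(X_std.round(2))                   # each column now has zero mean and unit std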

Decision Trees and Random Forest
The Decision Tree is an important classification method used in data mining classification. It is a flow-chart-like tree structure, where each internal node is denoted by a rectangle and the leaf nodes are denoted by ovals. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data, but the resulting classification tree can be an input for decision making. It is one of the most commonly used algorithms because of its ease of implementation and because it is easier to understand than other classification algorithms [47]. Decision trees can be used to discover features and extract patterns in large databases for discrimination and predictive modeling. This technique is also suitable for handling multivariate sensors, due to the construction of independent levels in the decision tree. In this method, the most robust features are used to initially split the input data by creating a tree-like model. The decision tree is a reliable technique in different areas of the healthcare domain for making the right decision. Even though it is simple and easy to implement, since the number of features can impact the efficiency of the method, decision tree models are not usually applied to large and complex physiological datasets. Frantzidis et al. [11] provide a new technique for motion recognition. It uses a decision tree algorithm based on the Mahalanobis distance. Thanks to the selected approach, the extracted features are easy to compute and can be derived relatively fast, even though not in real time. However, it can reliably provide the result of the user's emotional state during a short temporal window.

To take advantage of the sheer size of modern physiological data sets, we need learning algorithms that scale with the volume of information, while maintaining sufficient statistical efficiency. In this case, Random Forest can represent an adequate solution, as it is a computationally efficient technique that can operate quickly over large datasets. The Random Forest classifier consists of a combination of tree classifiers, where each classifier is generated using a random vector sampled independently from the input vector, and each tree casts a unit vote for the most popular class to classify an input vector [43]. In general, the user sets the number of trees on a trial-and-error basis. However, when the number of trees is increased above a certain threshold, more computational power is required for almost no performance gain [42]. Therefore, the choice of the number of trees to be adopted is an important factor to take into account in order to obtain appropriate results.
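A minimal Random Forest sketch is given below; the number of trees is the main hyperparameter discussed above and is set arbitrarily here, and the data and labelling rule are synthetic.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 8))
y = (X[:, 0] * X[:, 1] > 0).astype(int)            # invented decision rule

# 100 trees; each tree votes and the majority class is returned.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", round(rf.score(X, y), 3))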

In Table 2.2, a comparative overview of the most relevant papers referred to previously is reported.

Table 2.2: Overview of AI sensor fusion methods surveyed.

[55] Blood pressure prediction. Signals/sensors: multiple ECG sensors. Classifier: SVM, K-means, Random Forest, Naive Bayes. Database: four commercially accessible ECG sensors and one from the Physionet web site. Advantages: best results with no calibration method applied. Disadvantages: results still not acceptable for medical purposes.

[40] Heart rate estimation. Signals/sensors: ECG, PPG, and accelerometer (ACC) sensors. Classifier: a particle filter formulation. Database: 2015 IEEE Signal Processing Cup (SP Cup). Advantages: average error rates less than 2 beats/min in the presence of motion artifacts. Disadvantages: not tested on patients with heart rate variability or other cardiac conditions.

[62] Heart Rate Variability (HRV) recognition. Signals/sensors: ECG and accelerometer sensors. Classifier: NN. Database: simulation program presented by McSharry et al. (2003). Advantages: ability to deal with input data never seen during learning in a non-stationary environment. Disadvantages: challenge in setting the number of nodes in the hidden layer.

[36] Distress situation detection. Signals/sensors: physiological and behavioral data of the elderly, and the acoustical environment of the elderly. Classifier: fuzzy logic. Database: two databases, one recorded by the authors themselves and another one recorded in an experimental house by elderly people. Advantages: increases the reliability and the robustness, allows the combination of data and the addition of other sensors. Disadvantages: a complex pre-processing strategy is required.

[11] Motion recognition. Signals/sensors: Electrodermal Activity (EDA), ECG, and EEG. Classifier: Decision Trees. Database: private recordings. Advantages: the extracted features were easy to compute and could be derived relatively fast. Disadvantages: not suitable for real-time applications.

[14] Activity recognition. Signals/sensors: tri-axial accelerometer. Classifier: Naive Bayes classifier. Advantages: savings in energy consumption. Disadvantages: reduced gain in accuracy.

[64] Emotion recognition. Signals/sensors: EEG, ECG, RA and GSR. Classifier: SVM. Database: 30 subjects and the MAHNOB-HCI database. Advantages: reduces the influence of weakly correlated features and enhances the influence of strongly correlated features. Disadvantages: each physiological signal should be combined in a different way.

[13] Activity recognition. Signals/sensors: accelerometer sensors. Classifier: Decision Tree with Naive Bayes. Database: 8 subjects, who performed 8 scenario activities. Advantages: high values of accuracy, working well also when one sensor is not working. Disadvantages: a signal transformation is required in order to eliminate the impacts due to calibration drift and sensor orientation.

[29] Sleep detection system. Signals/sensors: ECG and RR. Classifier: NN. Database: ECG and respiratory signals of different subjects recorded over day and night periods. Advantages: high accuracy between sleep and wake states. Disadvantages: time-consuming preliminary stage of labeling recorded data into sleep and wake states.

[51] Health-monitoring system. Signals/sensors: ECG sensor, temperature sensor, accelerometer, vibration motor, LED. Database: MIT-BIH Arrhythmia Database. Advantages: possibility to use the collected biometric information in real time for monitoring the health state of the user or to facilitate a medical diagnosis. Disadvantages: lack of experimental data to certify system performance in a real application scenario.

[59] Activity prediction. Signals/sensors: ECG, magnetometer, accelerometer and gyroscope sensors. Classifier: RNN. Database: MHEALTH public dataset. Advantages: optimal results compared with traditional approaches, suitable for real-time analysis.

2.4 Identified problems and main challenges

Despite the many advantages offered, sensor fusion comes with inherent problems, and several issues make data fusion a challenging task. The majority of these issues arise from the different types of data to be fused, the imperfection and diversity of the sensor technologies, and the nature of the application environment. In particular, there are four challenging problems of input data that are mainly tackled: data imperfection, data correlation, data inconsistency, and data registration [30]. Diverse formats of data from different environments may create noise and ambiguity in the fusion process, and competitive or conflicting data may result from such errors [12]. Moreover, data provided by sensors is always affected by some level of imprecision as well as uncertainty in the measurements. Data fusion algorithms should be able to express such imperfections and ambiguities effectively, and to exploit the data redundancy from multiple sensors to reduce their effects. Another aspect derives from data registration, because each individual sensor has its own local reference frame in which it provides data. Sensor data must be transformed from each sensor's local frame into a common frame and aligned before fusion occurs [30]. Data registration is of critical importance to the deployment of fusion techniques, determining whether the process of fusion is successful or not. One aspect of sensor fusion is establishing whether two tracks from different sensors represent the same object. This is required to know how the data features from different sensors match each other, and whether there are data features that are outliers. So, data correlation and association are important aspects when machine learning techniques are involved in the process. Finally, the timing of the signals arriving from different sensor sources plays an important role during processing. The phenomenon under observation may be time-invariant or time-varying, or sensors may be measuring the same environment at different rates. Another case is two identical sensors measuring at different frequencies due to manufacturing defects [30]. For these reasons, sensor fusion has to be based on a precise time-scale setting to ensure all data is synchronized properly.
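As a small illustration of the synchronization issue, the sketch below aligns two sensor streams sampled at different rates before fusion using pandas; the timestamps, sampling rates, column names and tolerance are invented for the example and do not correspond to any dataset used in this work.

import pandas as pd

hr = pd.DataFrame({"time": pd.date_range("2021-01-01", periods=5, freq="1s"),
                   "heart_rate": [72, 74, 73, 75, 76]})
temp = pd.DataFrame({"time": pd.date_range("2021-01-01", periods=3, freq="2s"),
                     "temperature": [36.6, 36.7, 36.8]})

# Nearest-timestamp join within a tolerance keeps the two streams synchronized.
fused = pd.merge_asof(hr, temp, on="time", direction="nearest",
                      tolerance=pd.Timedelta("1s"))
print(fused)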

It can be concluded from the existing knowledge on sensor fusion performance that, despite the great potential of sensor fusion, no single data fusion algorithm is capable of addressing all the aforementioned challenges. The various methods in the literature each focus on a subset of these issues, determined by the considered application. All the described techniques have advantages and disadvantages that must be considered in the sensor fusion process, depending on the application, the required features, and the objectives to be achieved. The choice of the method to be adopted is, therefore, key to obtaining successful results.

3 Methodology

In the last decades, techniques for data recording and storage have allowed the development of numerous databases in digital healthcare, providing information arising from multiple sensors and the collection and classification of diverse multimodal data. Clinical databases have accumulated large quantities of information about patients and their clinical history, and this data could provide useful knowledge for effective decision-making. To fulfill the objectives of the project, we proceeded to the selection of an appropriate healthcare dataset to serve as the basis for the proposed work. Together with the supervising team, we selected and studied the suitability of state-of-the-art techniques and databases for the chosen application, aiming to use a public dataset that includes different physiological data from multiple sensors, such as ECG, temperature, heart rate, respiratory rate and pulse oximetry data. With this in mind, the MIMIC III database has been chosen. In this chapter, we will explore the motivations behind the database choice, presenting a brief summary of the analyzed datasets and introducing the MIMIC III database, a large, freely available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. Then, the basics of the Python language and the Jupyter Notebook environment will be covered, presenting and describing the programming language, the integrated development environment (IDE) and the tools that enabled the development of the work.
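As a preview of the Python tooling presented later in this chapter, the following minimal sketch shows how a MIMIC-III waveform record can be read with the WFDB Python package; the record name and PhysioNet directory used here are placeholders for illustration, not an actual subject analyzed in this work.

import wfdb

# Read one waveform record directly from PhysioNet (placeholder record name).
record = wfdb.rdrecord("3000003_0001", pn_dir="mimic3wdb/30/3000003")
print(record.sig_name, record.fs)        # available signals and sampling frequency
signals = record.p_signal                # (n_samples, n_signals) NumPy array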

3.1 Database choice and motivation

Data availability is one of the traditional obstacles confronting researchers carrying out empirical studies in the healthcare domain. In recent years, several databases have claimed to offer comprehensive coverage of electronic health records (EHR), i.e. real-time, patient-centered records that make information available instantly and securely to authorized users, and of physiological data. Studies of treatments from electronic healthcare databases are critical for producing the evidence necessary for making informed treatment decisions [54]. Notably, the selection of a suitable database depends on the actual context that is investigated. Considering the main aim of the project, the keys to the database choice are the vital signs included in the dataset, its availability, the works related to the analyzed dataset and their results, and the presence of the Ground Truth for ML purposes. In particular, the presence of physiological data such as:

- ECG
- temperature
- heart rate
- respiratory rate
- pulse oximetry data

is of fundamental importance for the choice of the dataset. Moreover, the presence of the Ground Truth, based on the patients' diagnoses, is necessary to carry out the classification regarding the presence or absence of heart diseases. Based on these preliminary assumptions, a study of the state of the art of the available healthcare databases led to the selection of several datasets. Table 3.1 summarizes the characteristics considered for each database, reporting its name and the analyzed features (vital signs, sample, related works, ground truth, results, and availability). It is useful to note that for some of them some information is missing, because it was not reported or mentioned in the analyzed related work.

The MIMIC II Waveform Database
  Vital signs: Contains thousands of recordings of multiple physiologic signals ("waveforms") and time series of vital signs ("numerics") collected from bedside patient monitors in adult and neonatal intensive care units (ICUs). Waveforms almost always include one or more ECG signals, and often include continuous arterial blood pressure (ABP) waveforms, fingertip photoplethysmogram (PPG) signals, and respiration, with additional waveforms (up to 8 simultaneously) as available. Numerics typically include heart and respiration rates, SpO2, and systolic, mean, and diastolic blood pressure, together with others as available.
  Sample: Digitized at 125 Hz with 8-, 10-, or (occasionally) 12-bit resolution.
  Related works: M. Saeed, M. Villarroel, A.T. Reisner, G. Clifford, L. Lehman, G.B. Moody, T. Heldt, T.H. Kyaw, B.E. Moody, R.G. Mark. Multiparameter intelligent monitoring in intensive care II (MIMIC-II): A public-access ICU database. Critical Care Medicine 39(5), May 2011.
  Ground truth: not mentioned
  Results: not mentioned
  Availability: Yes

MIMIC-III Waveform Database
  Vital signs: Recorded waveforms and numerics vary depending on choices made by the ICU staff. Waveforms almost always include one or more ECG signals, and often include continuous arterial blood pressure (ABP) waveforms, fingertip photoplethysmogram (PPG) signals, and respiration, with additional waveforms (up to 8 simultaneously) as available. Numerics typically include heart and respiration rates, SpO2, and systolic, mean, and diastolic blood pressure, together with others as available. Recording lengths also vary.
  Sample: Digitized at 125 Hz with 8-, 10-, or (occasionally) 12-bit resolution.
  Related works: Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3.
  Ground truth: No
  Results: not mentioned
  Availability: Restricted access

Non-EEG Dataset for Assessment of Neurological Status
  Vital signs: The data files are provided in WFDB format with two records per subject: one that contains the accelerometer, temperature, and EDA signals, and one that contains the SpO2 and heart rate signals. Header files also contain information about the subject. There is one annotation file per subject that indicates the time locations and labels of the transition states. The subjectinfo.csv file also contains information about each subject.
  Sample: not mentioned
  Related works: Birjandtalab, Javad, Diana Cogan, Maziyar Baran Pouyan, and Mehrdad Nourani. A Non-EEG Biosignals Dataset for Assessment and Visualization of Neurological Status, 2016 IEEE International Workshop on Signal Processing Systems (SiPS), Dallas, TX, 2016.
  Ground truth: not mentioned
  Results: Reported in Table II of the related work: confusion matrix and statistical metrics for different neurological statuses averaged over all 20 subjects.
  Availability: Yes

Heart Disease Data Set
  Vital signs: This heart disease dataset is curated by combining 4 popular heart disease datasets: Cleveland, Hungary, Switzerland, and the VA Long Beach. The dataset consists of 303 individuals' data. There are 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers.
  Sample: not mentioned
  Related works: a list of all related works can be found at: datasets/heart+disease
  Ground truth: Yes
  Results: not mentioned
  Availability: Yes

Robust Detection of Heart Beats in Multimodal Data - The PhysioNet Computing in Cardiology Challenge 2014
  Vital signs: A set of 100 records; 10-minute (or occasionally shorter) excerpts ("records") of longer multiparameter recordings of human adults, including patients with a wide range of problems as well as healthy volunteers. Each record contains four to eight signals; the first is an ECG signal in each case, but the others are a variety of simultaneously recorded physiologic signals.
  Sample: Signals have been digitized at rates between 120 and 1000 samples per second.
  Related works: a list of all related works can be found at: challenge-2014/1.0.0/papers/index.html
  Ground truth: No
  Results: The top results of the follow-up entries as well as all entries from phases I-III (at the end of February 2015) were achieved by Urska Pangerc (93.64), Alistair Johnson (91.50), Sachin Vernekar (90.97), Christoph Hoog Antink (90.70), and Abid Rahman (90.16).
  Availability: Yes

Mind the Gap - The PhysioNet Computing in Cardiology Challenge 2010
  Vital signs: Three data sets of 100 records each. Each ten-minute record contains 6, 7, or 8 signals acquired from bedside ICU patient monitors. The recorded signals vary across records, and they include ECG, continuous invasive blood pressure, respiration, fingertip plethysmograms, and occasional other signals.
  Sample: 125 samples/second, for 30 seconds.
  Related works: a list of all related works can be found at: challenge-2010/1.0.0/papers/index.html
  Ground truth: Yes
  Results: The two most successful approaches, based on neural networks, performed almost equally well, achieving C2 scores near 90. The three next most successful entries relied on Kalman filtering, adaptive filtering, or both; these also had similar levels of performance, with mean correlations of about 0.81.
  Availability: Yes

China Physiological Signal Challenge in 2018
  Vital signs: The signals of this database (9,831 records from 9,458 patients with a time length of 7–60 min) came from 11 hospitals, containing nine types: one normal ECG type and eight abnormal types. The database was divided into training and test sets with a random training-test split. The training set contains 6,877 (female: 3,178; male: 3,699) 12-lead ECG recordings lasting from 6 s to 60 s, and the test set contains 2,954 (female: 1,416; male: 1,538) ECG recordings with similar lengths.
  Sample: ECG recordings were sampled at 500 Hz. All data are provided in MATLAB format (each recording is a .mat file containing the ECG data).
  Related works: F. F. Liu, C. Y. Liu, L. N. Zhao, X. Y. Zhang, X. L. Wu, X. Y. Xu, Y. L. Liu, C. Y. Ma, S. S. Wei, Z. Q. He, J. Q. Li and N. Y. Kwee. An open access database for evaluating the algorithms of ECG rhythm and morphology abnormal detection. Journal of Medical Imaging and Health Informatics, 2018, 8(7).
  Ground truth: Yes
  Results: not mentioned
  Availability: Yes

PTB Diagnostic ECG Database
  Vital signs: The database contains 549 records from 290 subjects. Each subject is represented by one to five records. Each record includes 15 simultaneously measured signals: the conventional 12 leads (i, ii, iii, avr, avl, avf, v1, v2, v3, v4, v5, v6) together with the 3 Frank lead ECGs (vx, vy, vz). Most of these ECG records are accompanied by a detailed clinical summary.
  Sample: Each signal is digitized at 1000 samples per second, with 16-bit resolution. Recordings may be available at sampling rates up to 10 kHz.
  Related works: Bousseljot, R.; Kreiseler, D.; Schnabel, A. Nutzung der EKG-Signaldatenbank CARDIODAT der PTB über das Internet. Biomedizinische Technik, Band 40, Ergänzungsband 1 (1995), S. 317.
  Ground truth: not mentioned
  Results: not mentioned
  Availability: Yes

MIMIC-III Clinical Database
  Vital signs: The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge).
  Sample: not mentioned
  Related works: Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3.
  Ground truth: Yes
  Results: not mentioned
  Availability: Restricted access

Human Action Recognition Dataset - UTD Multimodal Human Action Dataset (UTD-MHAD)
  Vital signs: This dataset was collected as part of research on human action recognition using fusion of depth and inertial sensor data. The objective of this research has been to develop algorithms for more robust human action recognition using fusion of data from differing modality sensors. The UTD-MHAD dataset consists of 27 different actions.
  Sample: The sampling rate of the wearable inertial sensor is 50 Hz.
  Related works: kehtar/utd-mhad.html
  Ground truth: Yes
  Results: not mentioned
  Availability: Yes

Biometrics for stress monitoring
  Vital signs: This dataset comprises heart rate variability (HRV) and electrodermal activity (EDA) features.
  Sample: not mentioned
  Related works: Nkurikiyeyezu, K., Yokokubo, A., & Lopez, G. (2020). The Effect of Person-Specific Biometrics in Improving Generic Stress Predictive Models. Journal of Sensors & Material.
  Ground truth: Yes
  Results: The results show that the technique performs much better than a generic model. For instance, a generic model achieved only a 42.5% accuracy. However, with only 100 calibration samples, its accuracy was raised to 95.2%.
  Availability: Yes

Table 3.1: Analyzed datasets.

The study process finally led to the decision to adopt the MIMIC III database for our purposes, due to the fact that it provides all the required physiological data and health records, as well as the information necessary to derive the ground truth from the clinical data through further elaborations.

3.2 MIMIC III database

MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center in Boston, Massachusetts, between 2001 and 2012; it is widely accessible to researchers internationally under a data use agreement. Data include vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework [26]. Figure 3.1 gives an overview of the MIMIC-III critical care database, summarizing the included data and the processes that led to its creation. In the representation, CCU is Coronary Care Unit; CSRU is Cardiac Surgery Recovery Unit; MICU is Medical Intensive Care Unit; SICU is Surgical Intensive Care Unit; TSICU is Trauma Surgical Intensive Care Unit. MIMIC-III contains data associated with 53,423 distinct hospital admissions for adult patients (aged 16 years or above) admitted to critical care units between 2001 and 2012. The data covers 38,597 distinct adult patients and 49,785 hospital admissions. The median age of adult patients is 65.8 years, and 55.9% of patients are male. The median length of an intensive care unit (ICU) stay is 2.1 days and the median length of a hospital stay is 6.9 days. A mean of 4579 charted observations and 380 laboratory measurements are available for each hospital admission. Data available in the MIMIC-III database ranges from time-stamped,

nurse-verified physiological measurements made at the bedside and hospital electronic health records, to free-text interpretations of imaging studies provided by the radiology department.

Figure 3.1: Overview of the MIMIC-III critical care database. Source: [26].

Physiological waveforms obtained from bedside monitors (such as electrocardiograms, blood pressure waveforms, photoplethysmograms, impedance pneumograms) were obtained for a subset of patients [26]. Data was downloaded from several sources, including:

- archives from critical care information systems;
- the hospital electronic health record;
- time-stamped nurse-verified physiological measurements (for example, hourly documentation of heart rate, arterial blood pressure, or respiratory rate);
- documented progress notes by care providers;
- continuous intravenous drip medications and fluid balances.

Additional information was collected from hospital and laboratory health record systems, including:

- patient demographics and in-hospital mortality;
- laboratory test results (for example, hematology, chemistry, and microbiology results);
- discharge summaries and reports of electrocardiogram and imaging studies;
- billing-related information such as International Classification of Disease, 9th Edition (ICD-9) codes, Diagnosis Related Group (DRG) codes, and Current Procedural Terminology (CPT) codes.

Before data was incorporated into the MIMIC-III database, it was first deidentified in accordance with Health Insurance Portability and Accountability Act (HIPAA) standards, using structured data cleansing and date shifting. The deidentification process for structured data required the removal of all eighteen of the identifying data elements listed in HIPAA, including fields such as patient name, telephone number, address, and dates [26]. As the database contains detailed information regarding the clinical care of patients, it must be treated with appropriate care and respect. In order to access the database, it is necessary to become a credentialed PhysioNet user and access the restricted-access clinical databases; there are two key steps that must be completed:

1. the researcher must complete a recognized course in protecting human research participants, namely the online CITI course Human Research - Data or Specimens Only Research from the Massachusetts Institute of Technology Affiliates, which includes the HIPAA requirements;
2. the researcher must sign a data use agreement, which outlines appropriate data usage and security standards, and forbids efforts to identify individual patients.

3.2.1 CITI course

The Collaborative Institutional Training Initiative (CITI) program was developed by experts in the "IRB community" and consists of Basic courses in the

Protection of Human Research Subjects for Biomedical as well as for Social/Behavioral Research. In order to access the MIMIC III database from PhysioNet, it is necessary to complete the online course Human Research - Data or Specimens Only Research from the Massachusetts Institute of Technology Affiliates. This course satisfies the IRB training requirement for project personnel on a protocol, and it is appropriate for researchers who will not have direct contact with human subjects but will be working with secondary samples only. It consists of 9 modules, and learners must take a short quiz at the end of each module. An average score of 80% is needed to pass the training. The list of modules to complete is the following:

1. Belmont Report and Its Principles.
2. History and Ethics of Human Subjects Research.
3. Basic Institutional Review Board (IRB) Regulations and Review Process.
4. Records-Based Research.
5. Genetic Research in Human Populations.
6. Populations in Research Requiring Additional Considerations and/or Protections.
7. Research and HIPAA Privacy Protections.
8. Conflicts of Interest in Human Subjects Research.
9. Massachusetts Institute of Technology.

Once the course is completed, a PDF of the completion report is generated, which is needed to apply for PhysioNet credentialing. The CITI completion report lists all completed modules with dates and scores.

3.2.2 Database structure

The MIMIC data structure involves balancing simplicity of interpretation against closeness to ground truth. It is possible to identify three different sections of the database:

1. Clinical Database
2. Waveform Database
3. Waveform Database Matched Subset

The Clinical Database contains detailed clinical information about most of the patients represented in the Waveform Database [25]. It is a relational database consisting of 26 tables. Five tables are used to define and track patient stays: ADMISSIONS, PATIENTS, ICUSTAYS, SERVICES, and TRANSFERS. Another five tables are dictionaries for cross-referencing codes against their respective definitions: D_CPT, D_ICD_DIAGNOSES, D_ICD_PROCEDURES, D_ITEMS, and D_LABITEMS. The remaining tables contain data associated with patient care, such as physiological measurements, caregiver observations, and billing information. Tables are linked by identifiers which usually have the suffix ID. For example, SUBJECT_ID refers to a unique patient, HADM_ID refers to a unique admission to the hospital, and ICUSTAY_ID refers to a unique admission to an intensive care unit. Charted events such as notes, laboratory tests, and fluid balance are stored in a series of events tables. Tables prefixed with D_ are dictionary tables and provide definitions for identifiers [17]. For example, every row of CHARTEVENTS is associated with a single ITEMID which represents the concept measured, but it does not contain the actual name of the measurement. By joining CHARTEVENTS and D_ITEMS on ITEMID, it is possible to identify the concept represented by a given ITEMID. Table 3.2 provides an overview of some of the data tables comprising the MIMIC-III clinical database.
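As an illustration of this join, the snippet below resolves ITEMIDs into human-readable labels with pandas. It is a minimal sketch, assuming the tables have been exported as the CSV files distributed with MIMIC-III (CHARTEVENTS.csv and D_ITEMS.csv) and that only a small portion of CHARTEVENTS is read; it is not the exact code used in the project.

import pandas as pd

# Load a manageable portion of the charted events and the item dictionary
chartevents = pd.read_csv('CHARTEVENTS.csv', usecols=['SUBJECT_ID', 'ITEMID', 'VALUENUM'], nrows=100000)
d_items = pd.read_csv('D_ITEMS.csv', usecols=['ITEMID', 'LABEL'])

# Join on ITEMID so that every charted value carries the name of the concept it measures
labelled = chartevents.merge(d_items, on='ITEMID', how='left')
print(labelled[['SUBJECT_ID', 'LABEL', 'VALUENUM']].head())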

Table 3.2: Overview of some of the data tables comprising the MIMIC-III (v1.3) critical care database.

ADMISSIONS: Every unique hospitalization for each patient in the database (defines HADM_ID).
CALLOUT: Information regarding when a patient was cleared for ICU discharge and when the patient was actually discharged.
CHARTEVENTS: All charted observations for patients.
CPTEVENTS: Procedures recorded as Current Procedural Terminology (CPT) codes.
D_CPT: High-level dictionary of Current Procedural Terminology (CPT) codes.
D_ICD_DIAGNOSES: Dictionary of International Statistical Classification of Diseases and Related Health Problems (ICD-9) codes relating to diagnoses.
D_ICD_PROCEDURES: Dictionary of International Statistical Classification of Diseases and Related Health Problems (ICD-9) codes relating to procedures.
D_ITEMS: Dictionary of local codes (ITEMIDs) appearing in the MIMIC database, except those that relate to laboratory tests.
D_LABITEMS: Dictionary of local codes (ITEMIDs) appearing in the MIMIC database that relate to laboratory tests.
DATETIMEEVENTS: All recorded observations which are dates, for example time of dialysis or insertion of lines.
DIAGNOSES_ICD: Hospital-assigned diagnoses, coded using the International Statistical Classification of Diseases and Related Health Problems (ICD) system.
DRGCODES: Diagnosis Related Groups (DRG), which are used by the hospital for billing purposes.
ICUSTAYS: Every unique ICU stay in the database (defines ICUSTAY_ID).
OUTPUTEVENTS: Output information for patients while in the ICU.
LABEVENTS: Laboratory measurements for patients both within the hospital and in outpatient clinics.
NOTEEVENTS: Deidentified notes, including nursing and physician notes, ECG reports, radiology reports, and discharge summaries.
PATIENTS: Every unique patient in the database (defines SUBJECT_ID).
PROCEDURES_ICD: Patient procedures, coded using the International Statistical Classification of Diseases and Related Health Problems (ICD) system.

The Waveform Database contains thousands of recordings of multiple physiologic signals ("waveforms") and time series of vital signs ("numerics") collected from bedside patient monitors in adult intensive care units (ICUs) [38]. Specifically, it contains 67,830 record sets for approximately 30,000 ICU patients. Almost all record sets include a waveform record containing digitized signals and a numerics record containing time series of periodic measurements, each presenting a quasi-continuous recording of vital signs of a single patient throughout an ICU stay. Waveforms almost always include one or more ECG signals, and often include continuous arterial blood pressure (ABP) waveforms, fingertip photoplethysmogram (PPG) signals, and respiration, with additional waveforms (up to 8 simultaneously) as available. Numerics typically include heart and respiration rates, SpO2, and systolic, mean, and diastolic blood pressure, together with others as available. Recording lengths also vary; most are a few days in duration, but some are shorter and others are several weeks long.

Each recording comprises two records (a waveform record and a matching numerics record) in a single record directory with the name of the record. To reduce access time, the record directories have been distributed among ten intermediate-level directories. In almost all cases, the waveform records comprise multiple segments, each of which can be read as a separate record. Each segment contains an uninterrupted recording of a set of simultaneously observed signals, and the signal gains do not change at any time during the segment. Each composite waveform record includes a list of the segments that comprise it in its master header file. The list begins with a layout header file that specifies all of the signals that are observed in any segment belonging to the record. Each segment has its own header file and a matching signal (.dat) file. The numerics records (designated by the letter n appended to the record name) are not divided into segments, since the storage savings that would be achieved by doing so would be relatively small. Physiologic waveform records in this database contain up to eight simultaneously recorded signals digitized at 125 Hz with 8-, 10-, or (occasionally) 12-bit resolution. Numerics records typically contain 10 or more time series of vital signs sampled once per second or once per minute. Occasionally, technical limitations of the data acquisition system make it possible to create a physiologic waveform record but not a numerics record, or vice versa. Since the raw data files do not usually contain patient identifiers, it is not trivial to determine with certainty whether the data before and after a gap were collected from the same patient. An ongoing project is to examine the sets of records created and to match them with MIMIC-III Clinical Database records.
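As a hedged illustration of this layout, the short sketch below reads a composite waveform record and its numerics counterpart with the wfdb Python package introduced later in this chapter; the record name is a hypothetical example in the naming style of the matched subset, not a record verified to exist, and the files are assumed to be available locally.

import wfdb

waveform_name = 'p000020-2183-04-28-17-47'   # hypothetical composite waveform record
numerics_name = waveform_name + 'n'          # matching numerics record ('n' suffix)

# Read only the first minute of waveforms (125 Hz) to keep memory usage low
waveforms = wfdb.rdrecord(waveform_name, sampto=125 * 60)
numerics = wfdb.rdrecord(numerics_name)

print(waveforms.sig_name, waveforms.fs)   # waveform channels and their sampling frequency
print(numerics.sig_name, numerics.fs)     # vital-sign time series sampled once per second or minute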

The Waveform Database Matched Subset is a subset of the MIMIC-III Waveform Database, representing those records for which the patient has been identified and whose corresponding clinical records are available in the MIMIC-III Clinical Database [38]. It is the database subset adopted for the development of the project, since it is the only one that allows us to associate the raw data with the patients, and thus the only subset for which the Ground Truth is available. Only a subset of the waveform recordings actually contained enough information to reliably identify the patient, so the MIMIC-III Waveform Database Matched Subset contains 22,317 waveform records (34%) and 22,247 numerics records (35%), for 10,282 distinct ICU patients. As in the Waveform Database, these recordings typically include digitized signals such as ECG, ABP, respiration, and PPG, as well as periodic measurements such as heart rate, oxygen saturation, and systolic, mean, and diastolic blood pressure. The records have the same structure and sampling as those in the Waveform Database. For each of them, a new WFDB header file was created, incorporating the subject ID as well as the surrogate date and time of the recording. Specifically, all data associated with a particular patient have been placed into a single subdirectory, named according to the patient's MIMIC-III SUBJECT_ID.

3.2.3 Database filtering

Conceptually, dataset filtering transforms a given data mining task into an equivalent one operating on a smaller dataset. Thus, it can be integrated with any pattern discovery algorithm, possibly exploiting other constraint-based pattern discovery techniques. The key issue in dataset filtering is the derivation of the filtering predicates to be applied to the source dataset from the pattern constraints specified by a user. The MIMIC III database has been filtered in order to reduce the huge amount of available data to process and analyze, and to obtain a subset from which to extract the features and the Ground Truth for the ML algorithms. Based on the principal aim of the project, we initially considered patients with heart diseases, using the ICD-9 diagnosis codes to recognize them in the whole database. The filtering process consists of several steps: first of all, the division of the dataset

into smaller batches, due to the large dimensions of the files. Once we divided the dataset, we filtered the batches to find patients with cardiac diseases, based on the DIAGNOSES_ICD csv file and using the ICD-9 code as the key to recognize them. A description of the ICD-9 code and its structure is provided in the next section.

The ICD-9 code

The International Classification of Diseases (ICD) is designed to promote international comparability in the collection, processing, classification, and presentation of mortality statistics. The reported conditions are translated into medical codes through use of the classification structure and the selection and modification rules contained in the applicable revision of the ICD, published by the World Health Organization. These coding rules improve the usefulness of mortality statistics by giving preference to certain categories. The ICD has been revised periodically to incorporate changes in the medical field. To date, there have been 10 revisions of the ICD. In particular, the ICD-9 code is based on the World Health Organization's Ninth Revision [7]. It consists of:

- a tabular list containing a numerical list of the disease code numbers in tabular form;
- an alphabetical index to the disease entries;
- a classification system for surgical, diagnostic, and therapeutic procedures (alphabetic index and tabular list).

The format of ICD-9 diagnosis codes is a decimal placed after the first three characters, with two possible add-on characters following: xxx.xx. The numerical list of the disease code numbers consists of almost 20 sections, ranging from codes used to state infections, metabolic and mental disorders to others referring to diseases of a specific system of the human body; other sections are also used for symptoms, injuries and supplementary factors influencing health status. Since not all sections are represented in the MIMIC III database, figure 3.2 shows the distribution of primary International Classification of Diseases codes by care unit for patients aged 16 years and above.
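To make the xxx.xx format described above concrete, the short sketch below reduces an ICD-9 diagnosis code to its three-character category; note that in the DIAGNOSES_ICD table the codes are stored without the decimal point (e.g. 414.01 appears as 41401), which is why the filtering code later in this section compares them as plain numbers. The helper function is illustrative and not part of the project code.

def icd9_category(code: str) -> str:
    # Return the three-character ICD-9 category of a diagnosis code.
    # Works both for dotted codes ('414.01') and for the undotted form
    # used in the MIMIC-III DIAGNOSES_ICD table ('41401').
    code = code.replace('.', '')
    return code[:3]

# Both spellings of the same code map to category 414 (chronic ischemic heart disease)
print(icd9_category('414.01'), icd9_category('41401'))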

Figure 3.2: Distribution of primary International Classification of Diseases, 9th Edition (ICD-9) codes by care unit for patients aged 16 years and above. Source: [26].

Figure 3.3: Distribution of circulatory system disease codes; the selected rows are the codes considered for the recognition of patients.

Being interested in the classification of heart disease, we focused on the subset of patients classified with ICD-9 codes from the circulatory system chapter as their first diagnosis, since every patient can be described by more than one diagnosis, the most severe being reported first. This subset includes diseases of the circulatory system, i.e. ischemic heart diseases, diseases of pulmonary circulation, dysrhythmias, heart failure, cerebrovascular diseases, etc. Moreover, we also considered only the first ICU stay for each patient (in case of patients with multiple stays). From this subset, we extracted a further subset focusing only on the patients characterized by cardiovascular diseases related to the heart. Figure 3.3 displays the selected codes finally considered for the recognition of patients.
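The first-ICU-stay restriction mentioned above can be applied with a small amount of pandas code. The sketch below is one possible way to do it, assuming the ICUSTAYS.csv table with its SUBJECT_ID and INTIME columns; it is not the exact code used in the project.

import pandas as pd

icustays = pd.read_csv('ICUSTAYS.csv', parse_dates=['INTIME'])

# Sort stays chronologically and keep only the earliest ICU admission of each patient
first_stays = (icustays.sort_values(['SUBJECT_ID', 'INTIME'])
                       .drop_duplicates(subset='SUBJECT_ID', keep='first'))

print(len(icustays), 'stays reduced to', len(first_stays), 'first stays')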

The filtering has been developed with the following Python code, using the pandas library:

import pandas as pd

# Command to read the DIAGNOSES_ICD.csv
data1 = pd.read_csv("DIAGNOSES_ICD.csv")
filtered = data1[(data1['SEQ_NUM'] == 1)]
# Conversion of the codes from string to numeric format
filtered["ICD9_CODE"] = pd.to_numeric(filtered["ICD9_CODE"], errors='coerce')

filtered1 = filtered[(filtered['ICD9_CODE'] > 3930)]
filtered2 = filtered1[(filtered1['ICD9_CODE'] < 3989)]
df = pd.DataFrame(filtered2)

filtered3 = filtered[(filtered['ICD9_CODE'] > 4100)]
filtered4 = filtered3[(filtered3['ICD9_CODE'] < 4149)]
df1 = pd.DataFrame(filtered4)

filtered5 = filtered[(filtered['ICD9_CODE'] > 4200)]
filtered6 = filtered5[(filtered5['ICD9_CODE'] < 4299)]
df2 = pd.DataFrame(filtered6)

filtered7 = filtered[(filtered['ICD9_CODE'] > 39300)]
filtered8 = filtered7[(filtered7['ICD9_CODE'] < 39899)]
df3 = pd.DataFrame(filtered8)

filtered9 = filtered[(filtered['ICD9_CODE'] > 41000)]
filtered10 = filtered9[(filtered9['ICD9_CODE'] < 41499)]
df4 = pd.DataFrame(filtered10)

filtered11 = filtered[(filtered['ICD9_CODE'] > 42000)]
filtered12 = filtered11[(filtered11['ICD9_CODE'] < 42999)]
df5 = pd.DataFrame(filtered12)

frames = [df, df1, df2, df3, df4, df5]

# Command to concatenate the obtained frames
result = pd.concat(frames)
result.to_csv('C:/Users/FilteredFinal.csv')

The filtering is repeated for all the batches; then the filtered batches are merged to create a final csv file with all the patients with cardiac diseases, using the SUBJECT_ID as the key list. Following the same process, we created a csv file with patients without cardiac diseases, considering all remaining codes except those for cardiovascular diseases, in order to have the counterpart needed to implement the classification with the ML techniques. Finally, the last step of the filtering process was to identify the patients whose records are available in the MIMIC III Matched Subset database, using the SUBJECT_ID to carry out the selection. In the end, we obtained the data of 3450 patients with heart diseases and of 3216 patients without heart diseases, for a total of 6666 patients from the initial set of patients of the whole MIMIC III database. From those raw data, we implemented the further steps of the project, which are described in the following chapters.
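These last two steps (merging the filtered batches and keeping only the patients present in the Matched Subset) can be sketched as follows; the file names and the list of matched SUBJECT_IDs are illustrative assumptions, not the exact artifacts produced by the project.

import glob
import pandas as pd

# Merge the per-batch filtering results into a single DataFrame
batches = [pd.read_csv(path) for path in glob.glob('FilteredBatch_*.csv')]
cardiac = pd.concat(batches, ignore_index=True).drop_duplicates(subset='SUBJECT_ID')

# Keep only patients whose waveforms exist in the Matched Subset
matched_ids = pd.read_csv('MatchedSubsetPatients.csv')['SUBJECT_ID']
cardiac_matched = cardiac[cardiac['SUBJECT_ID'].isin(matched_ids)]

cardiac_matched.to_csv('CardiacMatched.csv', index=False)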

3.3 Python

3.3.1 What is Python?

Outside the realm of theoretical speculation and analysis, techniques must exist as compilable code so that they can be experimentally validated. In order for an approach to be effortlessly and effectively testable by the research community, a common programming language must be established, with the goal of reducing to a minimum the time it takes to prepare a solution for testing. Python achieves just that. It is a high-level, general-purpose programming language whose design philosophy emphasizes code readability through the use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small- and large-scale projects. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming [61]. There are two main Python version series, the 2.x and 3.x versions, and they are not completely compatible even though they are similar in most parts. The 2.x version is a legacy version, whose support and maintenance were scheduled to end around 2020. The 3.x version is a redesign based on the 2.x version and is considered to be the future of Python. Python 3.0, released in 2008, is a major revision that is not completely backward-compatible with earlier versions. Since 2003, Python has consistently ranked among the top ten most popular programming languages, and an empirical study found that scripting languages, such as Python, are more productive than conventional languages, such as C and Java, for programming problems involving string manipulation and search in a dictionary. Moreover, it is meant to be an easily readable language. Its formatting is visually uncluttered, and it often uses English keywords where other languages use punctuation. Unlike many other languages, it does not use curly brackets to delimit blocks but whitespace indentation, and semicolons after statements are allowed but rarely used [23].

3.3.2 Python libraries and packages

Python's large standard library, commonly cited as one of its greatest strengths, provides tools suited to many tasks. Some parts of the standard library are covered by specifications, but most are specified by their code, internal documentation, and test suites. Libraries such as NumPy, SciPy and Matplotlib allow the effective use of Python in scientific computing, and specialized libraries provide domain-specific functionality [61]. Other libraries used in the project are os, shutil and pandas. The os library provides functions for creating and removing a directory, fetching its contents, changing and getting the current directory, and more; it basically provides a convenient way to use operating system functions. The shutil library, instead, provides many high-level operations on files and collections of files; in particular, it offers functions that support file copying and removal. Finally, pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. The library provides integrated, intuitive routines for performing common data manipulations and analysis on data sets. It can easily handle DataFrames, a

2-dimensional labeled data structure with columns of potentially different types, which is generally the most commonly used pandas object. The reason for adopting this tool is that structured data sets commonly arrive in tabular format, i.e. as a two-dimensional list of observations and names for the fields of each observation. Pandas provides various facilities for easily combining Series or DataFrames with various kinds of set logic for the indexes and relational algebra functionality in the case of join/merge-type operations. In addition, pandas also provides utilities to compare two Series or DataFrames and summarize their differences [35]. Other packages dedicated to the development of the project are the WFDB Software Package and the Scikit-learn package.

The WFDB Software Package

Effective use of PhysioBank data requires specialized software. For this reason, the WFDB (WaveForm DataBase) Software Package is needed to work with the MIMIC III database, which is hosted on PhysioNet's archives. The major components of the WFDB Software Package are the WFDB library, about 75 WFDB applications for signal processing and automated analysis, and the WAVE software for viewing, annotation, and interactive analysis of waveform data. A comprehensive collection of documentation, including tutorials and reference manuals, is also included in the package. Moreover, it includes a set of functions for reading and writing files in the formats used by PhysioBank databases. Some typical uses of the WFDB library are the following [9]:

- a waveform editor reads the digitized signals of a database record and displays them with annotations superimposed on the waveforms. Such a program allows the user to select any portion of the signals for display at various scales, and to add, delete, or correct annotations;
- signal processing programs apply digital filters to the signals of a database record and then record the filtered signals as a new record. Similar programs perform sampling frequency conversion;

- analysis programs read the digitized signals, analyze them, and then record their own annotations;
- an annotation comparator reads two or more sets of annotations corresponding to a given record and tabulates the discrepancies between them.

The package is also available for Python; the WFDB Python package contains a library of native Python scripts for reading and writing WFDB signals and annotations without any dependencies on the original WFDB software package. Core components of this package are based on the original WFDB specifications. The Python package does not contain exactly the same functionality as the original WFDB package, but it aims to implement as many of its core features as possible.

The Scikit-learn Package

A healthy development in machine learning over the last couple of decades is that more computing packages implementing machine learning methods have become publicly available. To a large extent, these widely accessible implementations improved the understanding of the methods' utility and effectiveness in different contexts and helped expose their limitations through observations by different teams in various research or application domains [45]. Scikit-learn is the most comprehensive open-source machine learning package in Python. It includes many features that make it stand out among machine learning software, the first being its comprehensive coverage of machine learning methods. In fact, a community review procedure is in place to identify and decide which machine learning methods should be included in the package. Furthermore, the implementation of the machine learning methods in Scikit-learn is optimized for computational efficiency. The Scikit-learn package covers four main topics related to machine learning, i.e. data transformation, supervised learning, unsupervised learning, and model evaluation and selection. For the project's purposes, we mostly used supervised learning; it refers to the subset of machine learning algorithms that establish a mapping between the feature variables and their corresponding target

variables. The precondition for using supervised learning methods is that both the features and their corresponding labels are known. Supervised learning can be cast into two categories based on the nature of the labels: regression for continuous labels, and classification for discrete labels [45]. The workflow implemented in Scikit-learn for using a classifier includes three steps. First, create the model by specifying its hyperparameters. Second, fit the model with the training data and learn the parameters. Finally, apply the fitted model to the test data to get the predicted labels. In the end, Scikit-learn exposes a wide variety of machine learning algorithms, both supervised and unsupervised, through a consistent, task-oriented interface, thus enabling easy comparison of methods for a given application.

3.3.3 Jupyter Notebook

Notebooks are designed to support the workflow of scientific computing, from interactive exploration to publishing a detailed record of computation. The code in a notebook is organised into cells, chunks which can be individually modified and run. The output from each cell appears directly below it and is stored as part of the document [32]. Among all of them, the Jupyter Notebook is an open-source, browser-based tool functioning as a virtual lab notebook to support workflows, code, data, and visualizations detailing the research process, and it can work with code in many different programming languages. Different language backends, called kernels, communicate with Jupyter using a common, documented protocol; over 50 such backends have already been written, but the project grew out of the IPython project, which initially provided this interface only for the Python language. IPython continues to provide the canonical Python kernel for Jupyter. The Jupyter Notebook is accessed through a modern web browser. This makes it practical to use the same interface running locally like a desktop application, or running on a remote server. The notebook files it creates are in a simple, documented JSON format, with the extension .ipynb. Moreover, the Jupyter project includes tools to convert notebook files into a variety of file formats, including HTML, LaTeX and PDF, so that they are accessible without needing any Jupyter

software installed. Finally, authors can publish notebooks on GitHub along with an environment specification in one of a few common formats [32]. So, sharing and reproducibility are the most important features of this Integrated Development Environment (IDE), making it a promising tool for open science applications and the most suitable environment for the development of this work.

4 Implementation

Several elaborations are involved in the implementation of the project. This chapter aims to describe the stages through which the work was developed. Before delving into ML, we present all the steps required to transform raw data from electronic health records into structured data that can be used as input to the learning algorithms. Starting from the raw data downloaded for the patient subset extracted from the MIMIC III database, the first stage consisted of a pre-processing phase, in order to include in our analysis only those patients who have complete information about the vital signs we are interested in. Moreover, we need to delete all the unnecessary data in both numerical and waveform records. This phase aims to prepare the data for the second stage, that is, the extraction of the features for the prediction task, which will be carried out through different algorithms depending on the nature of the considered physiological signal. Finally, the engineered features will be used as input for the ML techniques presented earlier, which will be implemented in Python, using the Scikit-learn package on Jupyter Notebook. The final desired output is a binary patient classification through supervised learning.

4.1 Overview of the implementation process

Classification analysis is one of the most widely adopted data mining techniques for healthcare applications, supporting medical diagnosis, improving the quality of patient care, etc. Usually medical databases are high-dimensional in nature, as is the MIMIC III one. In fact, the data gathered in the database is collected as a result of patient-care activity to benefit the individual patient and, as a result, it contains data that is redundant, incomplete, imprecise or inconsistent, which

can affect the use of the results of the data mining techniques. So, mining medical data may require more data reduction and data preparation than data used for other applications. If a training dataset contains irrelevant features, classification analysis may produce less accurate results. Data pre-processing is required to prepare the data for the extraction of the features, which are used for data mining and machine learning, in order to increase the predictive accuracy [3]. After preparing the raw data for elaboration, the next step is the extraction of relevant features which can be used as input for the prediction task. Feature construction addresses the problem of finding the transformation of variables containing the greatest amount of useful information. Simple operations will be used to construct/extract important features from the numeric and the waveform data, summarizing the worst, best, variation and average patient condition. Then, a feature selection process is used to reduce dimensionality and to remove irrelevant and redundant features, since we are interested in increasing the interpretability and simplicity of our model. Finally, the obtained data are divided into two sets, one for training and another for testing. The last step focuses on building prediction models/classifiers using common ML algorithms and the Scikit-learn library, in particular K-Nearest Neighbors, Support Vector Machine, Decision Trees and Random Forest. A representative scheme of the implementation process is shown in Figure 4.1, highlighting the main steps, from the raw data up to the predictive model.

Figure 4.1: Overview of the implementation process, starting from the ICU data to the model prediction.
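The last two steps of this pipeline (train/test split and model building with the create/fit/predict workflow described for Scikit-learn in the previous chapter) can be sketched as follows; the feature table name, the column names and the hyperparameters are illustrative placeholders rather than the project's final configuration.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical fused feature table: one row per patient, last column is the Ground Truth
data = pd.read_csv('FusedFeatures.csv')
X = data.drop(columns=['SUBJECT_ID', 'Ground_truth'])
y = data['Ground_truth']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(kernel='rbf'),
    'Decision Tree': DecisionTreeClassifier(max_depth=5),
    'Random Forest': RandomForestClassifier(n_estimators=100),
}

for name, model in models.items():
    model.fit(X_train, y_train)        # fit on the training set
    y_pred = model.predict(X_test)     # predict on the held-out set
    print(name, accuracy_score(y_test, y_pred))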

4.2 Data pre-processing

Input data must be provided in the amount, structure and format that suit each task perfectly. Unfortunately, real-world databases are highly influenced by negative factors such as the presence of noise, inconsistent and superfluous data, and huge sizes in both dimensions, examples and features. Thus, low-quality data will lead to low-quality performance [48]. In this section, we refer to data preparation and pre-processing as the set of techniques that properly initialize the data so that it can serve as input to the algorithms that extract the features used for ML purposes.

4.2.1 Data Reduction

The first step in preparing the data for the next stage is deleting unnecessary data from the MIMIC III electronic health records, for both numerical and waveform records. In order to process the data, we used the WFDB Software Package, which provides clean and uniform access to digitized, annotated signals stored in a variety of formats, including signals such as blood pressure, respiration, oxygen saturation, EEG, as well as ECGs. The database consists of a small number of multi-segment records for each patient, and each record contains a continuous recording from a single subject. Signals are defined as a finite sequence of integer samples, obtained by digitizing a continuous observed function of time at a fixed sampling frequency expressed in Hz (samples per second). As we explained in section 3.2.2, the ECGs in our database are digitized at 125 Hz. The time interval between any pair of adjacent samples in a given signal is a sample interval, and all sample intervals for a given signal are equal. The integer value of each sample is usually interpreted as a voltage (in millivolts). Signal files usually have names of the form record.dat; to read them, we used the following functions provided by the WFDB library:

import wfdb
# Read the multi-segment record and plot waveforms from the MIMIC matched waveform database
record = wfdb.rdrecord('p.../..._0004')

wfdb.plot_wfdb(record=record, title='Waveform Record from Physionet, Subject_ID=20')
display(record.__dict__)

The functions that read signal files perform appropriate transformations so that the samples visible to the application program are always amplitudes of type int regardless of the signal file format. The output of the functions is the list of the available signals for the considered subject, accompanied by information relating to such signals, such as the record name, the number of signals, the sampling frequency, the signal length, etc. The signals are provided as a multi-column numpy array, where each column corresponds to one of the reported signals. By way of illustration, the output of the read functions used for the waveform signal of the patient identified by SUBJECT_ID = 20 is shown in Figure 4.2.

Figure 4.2: Waveform Record from Physionet, Subject_ID=20.

As we can see from the representation, for this patient, besides the ECG signal there are other signals such as Arteriovenous Fistula (AVF), Arterial Blood Pressure (ABP) and Pulmonary Artery Pressure (PAP) that are not of interest for the development of the process. It is therefore necessary to delete such unnecessary data.

In order to do this, we created a new record file considering only the column related to the ECG signal, using the sig_name information to identify the column of interest. The same elaboration is also conducted for the numerical data, since the numeric files also contain numerous signals, as shown in Figure 4.3, where the useful signals (heart rate, respiration rate and oxygen saturation) are highlighted.

Figure 4.3: Numeric Record from Physionet, Subject_ID=20.

The other records, such as pressure signals and photoplethysmography signals, are deleted by keeping only the array columns marked with the name of the signals of interest, as we did for the waveform files. The data reduction elaboration is carried out for every patient in the subset derived from the filtering process illustrated in the previous chapter, through an iterative cycle in Python, using the patient SUBJECT_ID code to retrieve the records from the respective folders.
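A minimal sketch of this per-record reduction step is shown below; it assumes the record object read earlier, uses the lead label 'II' purely for illustration (the available ECG lead varies from record to record), and is one possible way to write the reduced record rather than the exact procedure used in the project.

import wfdb

# Position of the ECG channel among the available signals of the record
ecg_index = record.sig_name.index('II')
ecg_only = record.p_signal[:, [ecg_index]]   # N x 1 array with the ECG samples

# Write a new record containing only the ECG column, in WFDB format
wfdb.wrsamp('reduced_record',
            fs=record.fs,
            units=['mV'],
            sig_name=['II'],
            p_signal=ecg_only,
            fmt=['16'])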

4.2.2 Data Cleaning

The next step of the pre-processing phase includes operations that correct bad data, filter incorrect data out of the dataset and reduce unnecessary detail. Specifically, the treatment of missing and noisy data is included in this step. This operation was especially carried out for the ECG signals using the NeuroKit2 Software Package, a Python toolbox for physiological signal processing. It is an open-source, community-driven, and user-centered Python package for advanced biosignal processing. It provides a comprehensive suite of processing routines for a variety of bodily signals (e.g., ECG, PPG, EDA, EMG, RSP). These processing routines include high-level functions that enable data processing in a few lines of code using validated pipelines [34]. The ecg_clean(ecg_signal, sampling_rate, method) function prepares the raw ECG signal for R-peak detection with the specified method. It requires the following parameters as input:

- ecg_signal: the raw ECG channel, which can be a list, a numpy array or a pandas Series;
- sampling_rate (int): the sampling frequency of ecg_signal (in Hz, i.e., samples/second); defaults to 1000;
- method (str): the processing pipeline to apply; can be one of neurokit (default), biosppy, pantompkins1985, hamilton2002, elgendi2010, engzeemod2012.

The function then returns the array containing the cleaned ECG signal. For our purposes, we passed the ECG signal extracted as a numpy array in the previous step and set the sampling rate to 125 Hz, while the Pan-Tompkins method was chosen as the processing pipeline to apply. Moreover, if there are missing data points in the signal, the function fills the missing values using the forward filling method. The code lines developed for this step are the following:

import neurokit2 as nk

ecg = record.p_signal
cleaned = nk.ecg_clean(ecg, sampling_rate=125, method="pantompkins1985")

The Pan-Tompkins algorithm applies a series of filters to highlight the frequency content of the rapid heart depolarization and removes the background

noise. Then, it squares the signal to amplify the QRS contribution, which makes identifying the QRS complex more straightforward [53]. As a first step, a band-pass filter is applied to increase the signal-to-noise ratio. The filter bandwidth is chosen to maximize the QRS contribution and reduce muscle noise, baseline wander, powerline interference and the P wave/T wave frequency content. The band-pass filter can be obtained with a low-pass filter and a high-pass filter in cascade, to reduce the computational cost and allow real-time detection. As a third step, a derivative filter is applied to provide information about the slope of the QRS; then the filtered signal is squared to enhance the dominant peaks and reduce the possibility of erroneously recognizing an R peak. Finally, a moving average filter is applied to provide information about the duration of the QRS complex. The number of samples to average is chosen in order to average over windows of 150 ms. The signal so obtained is called the integrated signal [53]. The block diagram of the pre-processing phase of the Pan-Tompkins algorithm is schematized in Figure 4.4.

Figure 4.4: Block diagram of the pre-processing phase of the Pan-Tompkins algorithm.
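For illustration, the filtering chain just described can be reproduced with a few lines of SciPy. This is a minimal sketch of the classic pre-processing stages at the 125 Hz sampling rate of our records, not the NeuroKit2 implementation actually used in the project; the 5-15 Hz pass band is an assumption taken from common Pan-Tompkins descriptions.

import numpy as np
from scipy.signal import butter, filtfilt

def pan_tompkins_preprocess(ecg, fs=125):
    # 1-2. Band-pass filter (low-pass + high-pass) to maximize the QRS content
    b, a = butter(3, [5 / (fs / 2), 15 / (fs / 2)], btype='band')
    filtered = filtfilt(b, a, ecg)
    # 3. Derivative filter to emphasize the QRS slope
    derivative = np.diff(filtered, prepend=filtered[0])
    # 4. Squaring to enhance the dominant peaks
    squared = derivative ** 2
    # 5. Moving average over a 150 ms window (integrated signal)
    window = int(0.150 * fs)
    integrated = np.convolve(squared, np.ones(window) / window, mode='same')
    return integrated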

4.3 Features extraction

Feature extraction is a more general method in which the developer tries to build a transformation of the input space onto a low-dimensional subspace that preserves most of the relevant information. Feature extraction methods are used with the aim of improving performance, such as estimated accuracy, visualization and comprehensibility of the learned knowledge. Feature extraction can be used in this context to reduce complexity and give a simple representation of the data, representing each variable in feature space as a linear combination of the original input variables [31]. From a survey of the state of the art related to the signals we are analyzing (i.e. ECG, heart rate, respiratory rate, temperature and oxygen saturation), we identified the main parameters and relevant features to detect heart diseases and anomalous health conditions, with the corresponding metrics.

Figure 4.5: Recap of the considered signals for the feature extraction.

Specifically, we opted for the following features for the numerical records:

- Median;
- Mean;
- Standard deviation;
- Maximum value;
- Minimum value.

Regarding the waveform data, i.e. the ECG, the choice required some evaluation, due to the various waves that characterize the signal. The different features of the ECG, such as the PR interval, QRS interval, QT interval, ST interval, PR segment, and ST segment, are used to infer the cardiac condition. We decided to focus on the features related to the RR intervals, constructed by measuring the time interval between successive R waves and expressed in milliseconds (ms). Detection of the R-peaks and RR intervals provides the fundamentals for almost

all automated ECG analysis algorithms. The distance between two R peaks reflects the electrical activity of the heart, and the time of its occurrence and its shape provide much information about the current state of the heart [50]. The detected R-peaks are not always accurate and can include false or missed peaks; for this reason, the algorithms to clean the signals and increase the detection sensitivity by processing the RR intervals were presented in the previous section. After calculating the intervals, we moved on to the extraction of the following features:

- Mean of the RR intervals;
- Standard deviation of the RR intervals;
- Maximum value of the RR intervals;
- Minimum value of the RR intervals.

In the end, two different algorithms have been implemented for the extraction of the features from the numerical data and the waveform data, respectively.

4.3.1 Features from Numerical data

The extraction algorithm for the numerical data takes as input the cleaned record from which we deleted the unnecessary data. Firstly, we read the signal of each patient as a pandas DataFrame to provide a suitable representation of the numeric record for the extraction:

df = pd.DataFrame(record.p_signal, columns=record.sig_name)

Each column of the DataFrame represents one of the numerical signals, reported with its specific name, i.e. Heart Rate, Respiration Rate and Oxygen Saturation. Subsequently, we extracted the mentioned features from the arrays using the specific functions for each signal. By initializing a list, we sequentially appended every feature for the three columns of the array to the list. Then we added the SUBJECT_ID of the patient, provided by the csv file created in the previous filtering process of the MIMIC III database, as the first element of the list, and the identification number of the Ground Truth as the last element.

import numpy as np

a = df.to_numpy()
r, c = np.shape(a)
# Initializing the list
list = []

for i in range(c):
    median = np.median(a[:, i])
    list.append(median)
    maxx = np.amax(a[:, i])
    list.append(maxx)
    minn = np.amin(a[:, i])
    list.append(minn)
    mean = np.mean(a[:, i])
    list.append(mean)
    std = np.std(a[:, i])
    list.append(std)

# Inserting the SUBJECT_ID
list.insert(0, SUBJECT_ID)
# Adding the Ground Truth
list.append(1)

The SUBJECT_ID is useful as the key for the fusion process of the features after extracting them for each signal and for each patient, while the Ground Truth is necessary for the classification algorithms. We arbitrarily assigned the number "0" to patients with heart disease and the number "1" to patients without heart disease. Finally, we reshaped the list to obtain a DataFrame where each column represents one of the extracted features.

list = np.array(list)
list = np.reshape(list, [1, len(list)])

feat = pd.DataFrame(list,
                    columns=['SUBJECT_ID', 'medianhr', 'maxhr', 'minhr', 'meanhr', 'stdhr',
                             'medianrr', 'maxrr', 'minrr', 'meanrr', 'stdrr',
                             'medianspo2', 'maxspo2', 'minspo2', 'meanspo2', 'stdspo2',
                             'Ground Truth'])

This operation needs to be repeated for each patient using an iterative cycle. However, since the temperature signal is not available as a numerical record, we

developed an alternative procedure to extract its features. In fact, the temperature values are provided in the MIMIC III database as charted observations for each patient in the CHARTEVENTS table from the Clinical section of the database. For this reason, a further extraction process was considered for the numerical data, using the charted observations of the four signals of interest. Each one of them is specified in the table through an identification number provided by the D_ITEMS file. The codes for each signal are the following:

some_values = np.array([
    # -- HEART RATE
    211,     # "Heart Rate"
    220045,  # "Heart Rate"

    # -- RESPIRATORY RATE
    618,     # Respiratory Rate
    615,     # Resp Rate (Total)
    220210,  # Respiratory Rate
    224690,  # Respiratory Rate (Total)

    # -- SPO2
    646, 220277,

    # -- TEMPERATURE
    223762,  # "Temperature Celsius"
    676,     # "Temperature C"
    223761,  # "Temperature Fahrenheit"
    678      # "Temperature F"
])

Through a combined access to the table using both the SUBJECT_ID of the patients of interest and the item codes, we were able to isolate only the charted observations of the signals of interest. Since the table counts more than two million observations, a division of the CHARTEVENTS dataframe into smaller batches was required for a more computationally efficient processing.

IDarray = ID["SUBJECT_ID"].to_numpy()

chunksize = 10 ** 5
df = pd.read_csv(r'F:\Datasets\CHARTEVENTS.csv', usecols=

                 data_columns, chunksize=chunksize)
for chunk in df:
    chunk = chunk.loc[chunk['SUBJECT_ID'].isin(IDarray) & chunk['ITEMID'].isin(some_values)]
    chunk.to_csv('ChartEventsFilt.csv', mode='a')

Afterwards, we split the observations to separate the values of the different physiological signals, relying on the ITEMID codes.

valueshr = np.array([211, 220045])
dfhr = df.loc[df['ITEMID'].isin(valueshr)]

valuesrr = np.array([618, 615, 220210, 224690])
dfrr = df.loc[df['ITEMID'].isin(valuesrr)]

valuesspo2 = np.array([646, 220277])
dfspo2 = df.loc[df['ITEMID'].isin(valuesspo2)]

valuestempc = np.array([676, 223762])
dftempc = df.loc[df['ITEMID'].isin(valuestempc)]

valuestempf = np.array([678, 223761])
dftempf = df.loc[df['ITEMID'].isin(valuestempf)]
dftempf['VALUENUM'] = (dftempf['VALUENUM'] - 32) / 1.8

Finally, using the groupby function to aggregate the data by SUBJECT_ID, together with the median, max, min, std and mean operators, the features can be easily extracted. It is useful to observe that the temperature is reported in the dataset both in Celsius and in Fahrenheit degrees. The Celsius scale being easier to interpret, all values in Fahrenheit degrees have been converted to Celsius degrees. For each signal we created a different DataFrame to aggregate the features, and then we merged all the DataFrames into a bigger one. Also in this case it was necessary to add a final column for the Ground Truth.

feathr = dfhr.groupby(['SUBJECT_ID'])['VALUENUM'].agg(['median', 'mean', 'std', 'max', 'min'])
featrr = dfrr.groupby(['SUBJECT_ID'])['VALUENUM'].agg(['median', 'mean', 'std', 'max', 'min'])

featspo2 = dfspo2.groupby(['SUBJECT_ID'])['VALUENUM'].agg(['median', 'mean', 'std', 'max', 'min'])
dftemp = pd.concat([dftempc, dftempf], ignore_index=True)
feattemp = dftemp.groupby(['SUBJECT_ID'])['VALUENUM'].agg(['median', 'mean', 'std', 'max', 'min'])

dffeat = pd.merge(left=feathr, right=featrr, on='SUBJECT_ID')
dffeat = pd.merge(left=dffeat, right=featspo2, on='SUBJECT_ID')
dffeat = pd.merge(left=dffeat, right=feattemp, on='SUBJECT_ID')

dffeat = dffeat.assign(Ground_truth=0)

The obtained result is a DataFrame of 22 columns, where the first column is the SUBJECT_ID, the following 20 columns are the extracted features, and the last one is the Ground Truth, while each row represents a single patient. After repeating the operation for both categories of patients, with and without heart diseases, we obtained two different DataFrames of 3416 rows and 3186 rows, respectively for patients with heart diseases and for patients without heart diseases.

4.3.2 Features from Waveform data

In order to extract the mentioned features from the ECG signals, the computation of the R-peaks was necessary. Automatic detection of the R-peaks in an electrocardiogram (ECG) signal is the most important step preceding any kind of ECG processing and analysis, and the performance of the subsequent steps heavily relies on its accuracy. In the pre-processing phase, we applied the Pan-Tompkins algorithm to the records to clean them and to enhance the dominant peaks, in order to reduce the possibility of erroneously recognizing an R-peak. The R-peak detection was carried out using the processing subpackage of the WFDB Software Package, which contains signal-processing tools. Specifically, the function wfdb.processing.find_local_peaks(sig, radius) allows us to find all local peaks in a signal. A sample is a local peak if it is the largest value within the radius samples on its left and right. In cases where it shares the max value with nearby samples, the middle sample is classified as the local peak. The function

The R-peak detection itself was carried out using the processing subpackage of the WFDB software package, which contains signal-processing tools. Specifically, the function wfdb.processing.find_local_peaks(sig, radius) allows us to find all local peaks in a signal: a sample is a local peak if it is the largest value within radius samples on its left and right, and in cases where it shares the maximum value with nearby samples, the middle sample is classified as the local peak.

Figure 4.6: R-peak detection using the processing subpackage of the WFDB library.

The function takes as input the cleaned signal and a search radius, here set equal to the sampling frequency of the signal (125 samples):

from wfdb import processing

local_peaks = processing.find_local_peaks(cleaned_signal, 125)

As output, the function provides the locations of all local peaks of the input signal as a one-dimensional array. In Figure 4.6 it is possible to observe the correct detection of the peaks provided by the function for a single record. Since the feature extraction needs to be carried out for all patients in the filtered subset, we created a loop which extracts every record of each patient and performs the R-peak detection using the mentioned WFDB function. The RR intervals are then determined by measuring the distance between two successive peaks and converting it from samples to milliseconds (ms). Once the intervals have been computed, we can proceed to the estimation of the features through the mean, standard deviation, maximum and minimum functions. As we did for the numeric records, the obtained features are then stored in a DataFrame with the SUBJECT_ID as first column, to allow the fusion of all the features extracted for the project.

for i in range(len(patients)):
    index = patients[i]
    # Function coded to get the record names from each patient folder
    names = get_name(index)
    rr_list = []
    rr_intervals = []
    feat = []
    j = 0
    while j < len(names):
        record = wfdb.rdrecord(names[j])
        if record.sig_len > 26:
            peak_list = wfdb.processing.find_local_peaks(cleaned_signal, 125)

            cnt = 0
            fs = record.fs

            while cnt < (len(peak_list) - 1):
                rr_interval = peak_list[cnt + 1] - peak_list[cnt]
                rr_intervals.append(rr_interval)
                ms_dist = (rr_interval / fs) * 1000
                rr_list.append(ms_dist)
                cnt += 1
            j += 1
        else:
            j += 1

    meanrr = np.mean(rr_list)
    feat.append(meanrr)
    stdrr = np.std(rr_list)
    feat.append(stdrr)
    try:
        maxrr = np.amax(rr_list)
    except ValueError:
        pass
    feat.append(maxrr)
    try:

        minrr = np.amin(rr_list)
    except ValueError:
        pass
    feat.append(minrr)

    feat.insert(0, _id[i])
    feat = np.array(feat)
    feat = np.reshape(feat, [1, len(feat)])
    dfeat = pd.DataFrame(feat,
                         columns=['SUBJECT_ID', 'mean RR interval', 'std RR interval',
                                  'max RR interval', 'min RR interval'])
    df = df.append(dfeat, ignore_index=True)

The obtained result is thus a DataFrame of 5 columns, where the first column is the SUBJECT_ID and the following 4 columns are the extracted features, while each row represents a single patient; there is no final column for the Ground Truth, because it is already present in the numeric-features DataFrame. Also in this case, after repeating the operation for both categories of patients, with and without heart diseases, we obtained two different DataFrames of 3416 rows and 3186 rows, respectively for patients with heart diseases and for patients without heart diseases. As an example, the DataFrame of the extracted ECG features for patients with heart diseases is shown in Figure 4.7.

Figure 4.7: DataFrame of the extracted ECG features for patients with heart diseases.

4.4 Feature selection and fusion

Feature selection is a technique commonly used on high-dimensional data; its purposes include reducing dimensionality, removing irrelevant and redundant features, reducing the amount of data needed for learning, improving the predictive accuracy of the algorithms, and increasing the comprehensibility of the constructed models [4]. To assess the impact of data quality, we removed outliers, defined as data points that appear to be inconsistent with the rest of the data set. Such inconsistencies can have several causes, most commonly measurement errors. Several considerations should be taken into account in this step in order to improve the accuracy of the classification methods. First of all, we should ideally keep extreme values related to the patients' poor health condition, while excluding physiologically impossible values (such as negative values, especially for temperature, and oxygen saturation above 100%) and probable outliers (such as heart rate above 250 beats/min or respiration rate above 200 insp/min). To do so, values that fall outside boundaries defined by expert knowledge are excluded, which avoids discarding extreme (but correct/possible) values.
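A minimal sketch of this rule-based exclusion, applied to the per-signal DataFrames defined earlier, could look as follows; the upper bounds for heart rate, respiration rate and oxygen saturation are the ones mentioned above, while the temperature bounds are illustrative assumptions rather than the exact thresholds used in this work.

# Keep only physiologically plausible values (temperature bounds are assumed)
dfhr = dfhr.loc[(dfhr['VALUENUM'] > 0) & (dfhr['VALUENUM'] <= 250)]
dfrr = dfrr.loc[(dfrr['VALUENUM'] > 0) & (dfrr['VALUENUM'] <= 200)]
dfspo2 = dfspo2.loc[(dfspo2['VALUENUM'] > 0) & (dfspo2['VALUENUM'] <= 100)]
dftemp = dftemp.loc[(dftemp['VALUENUM'] > 25) & (dftemp['VALUENUM'] <= 45)]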

By analyzing the two feature DataFrames, for patients with and without heart diseases, it is possible to notice that a lot of information is missing (coded as NaN, i.e., Not a Number). In order to train ML algorithms, it is important to decide how to handle this missing information. Two options are to replace the missing values with some estimate or to exclude them. In this work, we avoid introducing the bias that would result from replacing missing values with estimated ones (which is not to say that imputation is never a good option). Instead, we focus on a complete-case analysis: we include in the analysis only those patients with complete information, excluding those for whom the extraction could not be completed for all features, due to measurement errors in the previous steps. The last step of feature selection is then selecting the same number of patients from the two chosen categories, in order to improve the accuracy of the ML algorithms by using balanced datasets. When the data is highly imbalanced, it can be a good option to oversample the minority class or undersample the majority class, so that the model is not biased towards the majority class. For this reason, we removed some patients from the DataFrame with the larger number of rows, obtaining 3050 patients for each category.

After feature selection, we can finally proceed to the fusion process. The fusion of different features is a step that generates a new feature set from the selected features. Different feature vectors extracted from the same pattern reflect different characteristics of that pattern; by optimizing and combining them, the effective discriminant information of the multiple features is preserved while redundant information is, to a certain degree, eliminated. This is especially important for classification and recognition [58]. The final set of features is thus fused to obtain a better feature set, which is given to a classifier to obtain the final result.
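The selection and fusion steps described above can be sketched as follows. The names df_num_card, df_num_noncard (numeric features) and df_wave_card, df_wave_noncard (waveform features) are hypothetical placeholders for the per-category DataFrames built in the previous sections, and the random_state value is an arbitrary choice for reproducibility.

import pandas as pd

def select_and_fuse(df_num, df_wave):
    # Fuse numeric and waveform features of one category on the patient identifier
    fused = pd.merge(left=df_num, right=df_wave, on='SUBJECT_ID')
    # Complete-case analysis: keep only patients with every feature available
    return fused.dropna()

card = select_and_fuse(df_num_card, df_wave_card)            # patients with heart diseases
noncard = select_and_fuse(df_num_noncard, df_wave_noncard)   # patients without heart diseases

# Undersample the larger category so that both classes have the same size (3050 here)
n = min(len(card), len(noncard))
card = card.sample(n=n, random_state=1)
noncard = noncard.sample(n=n, random_state=1)

# Single dataset used as input for the ML algorithms
data = pd.concat([card, noncard], ignore_index=True)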

Figure 4.8: Final DataFrame including all the features for all the patients.

From all the previous operations we obtained different DataFrames for the features extracted from the numeric and waveform records, each of them with the same number of rows, i.e., the same number of patients. Using the SUBJECT_ID as key, we merged the features and then concatenated everything into a single DataFrame that can be used as input for the ML algorithms. As we can observe in Figure 4.8, in the end we obtain a final dataset of 6100 patients and 26 columns, where the first column is again the identification number of each patient, the following 24 columns are the extracted features, and the final column is the Ground Truth, set to 0 or 1 to separate the two classes of patients.

We can visualize the data using the seaborn library and the boxplot function, which allows us to easily create one boxplot per variable in order to verify the data distribution after the exclusion of outliers. The stripplot function allows us to visualize the underlying distribution and the number of observations. Setting x = 'Ground_truth' shows the boxplots partitioned by outcome.

variables = ['median_hr', 'mean_hr', 'std_hr', 'max_hr', 'min_hr',
             'median_rr', 'mean_rr', 'std_rr', 'max_rr', 'min_rr',
             'median_spo2', 'mean_spo2', 'std_spo2', 'max_spo2', 'min_spo2',
             'median_temp', 'mean_temp', 'std_temp', 'max_temp', 'min_temp',
             'mean RR interval', 'std RR interval', 'max RR interval', 'min RR interval']

import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(30, 25))
count = 0

for variable in variables:
    count += 1
    plt.subplot(6, 4, count)

    ax = sns.boxplot(y=variable, data=data)

    # partitioning by outcome
    ax = sns.boxplot(x='Ground_truth', y=variable, data=data)
    ax = sns.stripplot(x='Ground_truth', y=variable, data=data,
                       color="orange", jitter=0.2, size=0.5)

plt.show()

In Figure 4.9 we can visualize the data distribution obtained with the boxplot function. It is possible to notice that some features show more significant differences between the two classes, such as the maximum and minimum values of the heart rate. In fact, patients with heart diseases (Ground Truth set to 0) show lower minimum values and higher maximum values, a symptom of greater heartbeat variability and therefore of an irregularity, as we expected. A similar difference is also present in the minimum and maximum values of the respiration rate and in the minimum values of the oxygen saturation, where the values are lower than average compared to the other class. The differences in the ECG-related features are harder to highlight, due to the wide range of their values.

Figure 4.9: Data visualization using the boxplot function.

4.5 Machine Learning algorithms

Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and the use of data. ML algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal" or "feedback" available to the learning system:

Supervised learning: the computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.

Unsupervised learning: no labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

Reinforcement learning: a computer program interacts with a dynamic environment in which it must achieve a certain goal (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback analogous to rewards, which it tries to maximize.

Supervised learning algorithms include classification processes such as the one performed in this work. They build a mathematical model of a set of data that contains both the inputs and the desired outputs. The data is known as training data and consists of a set of training examples. Each training example has one or more inputs and the desired output, also known as a supervisory signal. In the mathematical model, each training example is represented by an array or vector, sometimes called a feature vector, and the training data is represented by a matrix. Through iterative optimization of an objective function, supervised learning algorithms learn a function that can be used to predict the output associated with new inputs. An optimal function will allow the algorithm to correctly determine the output for inputs that were not part of the training data.

Figure 4.10: Distribution of the dataset after partitioning.

An algorithm that improves the accuracy of its outputs or predictions over time is said to have learned to perform that task. In the following sections, the implementation of the Machine Learning algorithms introduced in Chapter 2 and based on supervised learning will be described, exploring the different models with the aim of a binary classification of the dataset.

Data partitioning

To apply the machine learning models to the data, we first need to split it. In order to assess the performance of the models, the data can be divided into two sets, one for training and another for testing:

Training set: used to train/build the learning algorithm.

Test set: used to evaluate the performance of the algorithm, but not to make any decision regarding which learning algorithm or parameters to use.

We use the train_test_split function from the sklearn library, which randomly assigns observations to each set. The size of the sets can be controlled through the test_size parameter, which defines the size of the test set and is set here to 20%. Figure 4.10 schematizes the distribution of the dataset after the partitioning process.

When using the train_test_split function, it is important to set the random_state parameter so that the same results can be reproduced later. Moreover, it is useful to create a function that prints the size of the data in each set:

from sklearn.model_selection import train_test_split

test_size = 0.2  # 20% of the data reserved for the test set
X_train, X_test, y_train, y_test = train_test_split(data, data[['Ground_truth']],
                                                    test_size=test_size, random_state=10)

def print_size(y_train, y_test):
    print(str(len(y_train[y_train['Ground_truth'] == 1])) + ' (' + str(round(len(y_train[y_train['Ground_truth'] == 1]) / len(y_train) * 100, 1)) + '%) ' + 'non-cardiac in training set')
    print(str(len(y_train[y_train['Ground_truth'] == 0])) + ' (' + str(round(len(y_train[y_train['Ground_truth'] == 0]) / len(y_train) * 100, 1)) + '%) ' + 'cardiac in training set')
    print(str(len(y_test[y_test['Ground_truth'] == 1])) + ' (' + str(round(len(y_test[y_test['Ground_truth'] == 1]) / len(y_test) * 100, 1)) + '%) ' + 'non-cardiac in test set')
    print(str(len(y_test[y_test['Ground_truth'] == 0])) + ' (' + str(round(len(y_test[y_test['Ground_truth'] == 0]) / len(y_test) * 100, 1)) + '%) ' + 'cardiac in test set')

print_size(y_train, y_test)

The function shows a balanced division of the training and test sets:

2426 (49.7%) non-cardiac subjects in the training set;
2454 (50.3%) cardiac subjects in the training set;
624 (51.1%) non-cardiac subjects in the test set;
596 (48.9%) cardiac subjects in the test set.

The workflow implemented in Scikit-learn for using a classifier includes three steps. First, create the model by specifying its hyperparameters. Second, fit the model to the training data and learn the parameters. Finally, apply the fitted model to the test data to get the predicted labels. There are no general rules for choosing the best set of hyperparameters for a given method on a particular data set. A grid search in the hyperparameter space is widely used in machine learning applications, and Scikit-learn provides a convenient function to perform an easy grid search of hyperparameters and return the best set based on performance.
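As an illustration of such a grid search, a sketch using scikit-learn's GridSearchCV is shown below; the parameter grid for the Random Forest is an assumption chosen only for the example, not the grid actually explored in this work.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid: number of trees and maximum depth
param_grid = {'n_estimators': [50, 100, 200],
              'max_depth': [5, 10, 20]}

search = GridSearchCV(RandomForestClassifier(random_state=2),
                      param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train.values.ravel())

print(search.best_params_)   # best hyperparameter combination found
print(search.best_score_)    # mean cross-validated accuracy of the best model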

The first algorithm investigated in this work is k-Nearest Neighbors (k-NN). It is known as a "lazy" algorithm, since it does not do anything during the learning phase: the model is essentially the entire training dataset. When a prediction is required for an unseen observation, k-NN searches through the entire training set for the k most similar observations, and the prediction is given by the majority vote of those k nearest neighbors. The similarity measure depends on the type of data: for real-valued data the Euclidean distance can be used, while for categorical or binary data, as in our situation, the Hamming distance is recommended. Moreover, each point can be given a weight proportional to its distance; for example, with inverse distance weighting, each point has a weight equal to the inverse of its distance to the point to be classified, which means that closer points have a higher vote than farther points. As we can see from the following code, we set the KNeighborsClassifier function from sklearn with 3 neighbors, the weights parameter set to 'distance' in order to have weighted votes, and the metric set to 'hamming':

from sklearn.neighbors import KNeighborsClassifier

# instantiate the learning model
knn = KNeighborsClassifier(n_neighbors=3, weights='distance', metric='hamming')

# fit the model
knn.fit(X_train, y_train.values.ravel())

# predict the response
y_pred = knn.predict(X_test)

The second ML algorithm evaluated is the Support Vector Machine (SVM), an efficient machine learning method able to classify unseen information by deriving selected features and constructing a high-dimensional hyperplane that separates the data points into two classes, thus forming a decision model. The workflow implemented in Scikit-learn for using this classifier is the same as the previous one. Different kernel functions can be specified for the decision function.

Common kernels are provided, but it is also possible to specify custom kernels. Due to the heterogeneous nature of the features, we opted for the Radial Basis Function (RBF) kernel, which performs a nonlinear mapping based on the exponential of the Euclidean (L2-norm) distance between two points X1 and X2.

from sklearn import svm

# fit the model
clf = svm.SVC(kernel="rbf")
clf.fit(X_train, y_train.values.ravel())

# predict the response
y_pred = clf.predict(X_test)

Given the complexity of the processes underlying signal recording in ICU patients, we then tried to improve the prediction by using a non-parametric algorithm such as the Decision Tree. Since this type of algorithm does not make strong assumptions about the form of the mapping function, it is a good candidate when a lot of data is available, with no prior knowledge and heterogeneous complexity. The selection of the variables and of the specific split is made by an algorithm that minimizes a cost function, and tree construction ends with a predefined stopping criterion, such as a minimum number of training instances assigned to each leaf node. For classification, the Gini index (G), also known as Gini impurity, is used; it indicates how "pure" the leaf nodes are, i.e., how good a split is in terms of how mixed the classes are in the two groups created by the split. The Decision Tree algorithm can be implemented in sklearn using the DecisionTreeClassifier function. The following is a list of important parameters to consider when training the model:

criterion: function to measure the quality of a split.

splitter: strategy used to choose the split at each node. Supported strategies are 'best' to choose the best split and 'random' to choose the best random split.

max_features: maximum number of features in each tree.

max_depth: maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_split: minimum number of samples required to split an internal node.

min_samples_leaf: minimum number of samples required to be at a leaf node.

max_leaf_nodes: grow a tree with max_leaf_nodes in best-first fashion, where the best nodes are defined by the relative reduction in impurity. If None, the number of leaf nodes is unlimited.

random_state: if int, seed used by the random number generator.

from sklearn.tree import DecisionTreeClassifier

# fit the model
dt = DecisionTreeClassifier(criterion='gini', max_depth=6, min_samples_leaf=20,
                            min_samples_split=20, random_state=2, splitter='best')
dt.fit(X_train, y_train)

# predict the response
y_pred = dt.predict(X_test)

Finally, the last ML algorithm explored is the Random Forest classifier, a computationally efficient technique that can operate quickly over large datasets such as the one used for our purposes. The Random Forest classifier consists of a combination of tree classifiers, and the predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest, where the class probability of a single tree is the fraction of samples of the same class in a leaf. The algorithm can be implemented in sklearn using the RandomForestClassifier function. Similarly to the Decision Tree, there are parameters to define, which are the same except for n_estimators, which represents the number of trees in the forest.

from sklearn.ensemble import RandomForestClassifier

# fit the model
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, criterion='gini',
                            max_depth=10, random_state=2)
rf.fit(X_train, y_train.values.ravel())

# predict the response
y_pred = rf.predict(X_test)

All the evaluations, processing, model fitting and prediction operations were performed on an HP laptop with 8 GB of RAM and an Intel(R) Core(TM) i5-5200U CPU @ 2.20 GHz.

5 Results and Discussion

After developing and testing the ML algorithms, it is time to evaluate their performance. In this chapter, various simulation results are presented and their interpretation is discussed.

5.1 Comparison of Implemented Algorithms

A natural question many data analysis practitioners may ask is which of these methods should be used in practice. This is a simple question but, unfortunately, there is no simple answer. Numerous extensive empirical studies have compared supervised learning methods on empirical data sets, and their conclusion is that the performance of the methods essentially depends on the specific data set.

Evaluation of a classifier

Before starting with the comparison, it is important to define which parameters should be used to evaluate the performance of the classifiers. The goodness of a classifier is assessed by calculating coefficients including Accuracy, Sensitivity and Specificity. The calculation of these measures requires the following quantities:

True positive (TP): a test result that correctly indicates the presence of a condition or characteristic.

True negative (TN): a test result that correctly indicates the absence of a condition or characteristic.

False positive (FP): a test result which wrongly indicates that a particular condition or attribute is present.

False negative (FN): a test result which wrongly indicates that a particular condition or attribute is absent.

Accuracy is used as a statistical measure of how well a binary classification test correctly identifies or excludes a condition. It is the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined. The formula for quantifying binary accuracy is:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (5.1)

Sensitivity, also called recall or true positive rate (TPR), is a useful measure that indicates the percentage of patients with heart diseases who are correctly identified as such. It is derived as:

Sensitivity = \frac{TP}{TP + FN}    (5.2)

In the context of our problem, having a high sensitivity is very important, since it tells us that the algorithm is able to correctly identify the most critical cases. However, optimizing for sensitivity alone may lead to many false alarms (false positives). Therefore, it is important to also keep in mind the specificity, or true negative rate (TNR), which tells us the percentage of patients without heart disease who are correctly identified as such. It is derived as:

Specificity = \frac{TN}{TN + FP}    (5.3)

Sensitivity and specificity thus mathematically describe the accuracy of a test which reports the presence or absence of a condition, as in our case of classifying the presence or absence of cardiac diseases. One way of combining sensitivity and specificity in a single representation is the receiver operating characteristic (ROC) curve, a graphical plot that illustrates the performance of a binary classifier. The ROC curve is created by plotting the true positive rate against the false positive rate (FPR), also known as the probability of false alarm, which can be calculated as 1 - specificity.
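These quantities can be obtained directly from the predictions with scikit-learn, as in the following sketch; y_test and y_pred are the test labels and model predictions used throughout this work, while the variable names for the derived metrics are chosen here only for illustration.

from sklearn.metrics import confusion_matrix, roc_curve, auc

# 2x2 confusion matrix: rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_test.values.ravel(), y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate

# ROC curve and area under the curve, from the predicted scores of a classifier
# that exposes predict_proba (e.g. the Random Forest model rf)
fpr, tpr, _ = roc_curve(y_test.values.ravel(), rf.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)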

Alternatively, we can also compute the confusion matrix between the real labels and those predicted by the models. Having 2 classes, present or absent, the confusion matrix is a 2x2 matrix: the values on the diagonal indicate the correct cases (True Positives and True Negatives), while the off-diagonal values indicate the errors between the true and predicted classes (False Positives and False Negatives). The sum of the matrix elements is equal to the number of elements in the test set.

Results from the implemented methods

To show the results of the different implemented algorithms, we present the values of the parameters described in the previous section for each of them, accompanied by a graphical representation of the ROC curve and of the confusion matrix, in order to provide a clearer visual interpretation of the results. This information is given for the test set for the K-NN and SVM methods, while it is plotted for both training and test sets for the Decision Tree and Random Forest algorithms, to better assess the classifiers' ability to generalize and to highlight the best results. They are illustrated in Figures 5.2 to 5.7. In the next section, critical aspects of the study are discussed.

Discussion

Now that the four methods have been tested, we can analyse the results for each of them. Figure 5.2 shows the results of the K-NN algorithm; as we can see from the representation, the parameter values are:

TP: 364, TN: 363, FP: 233, FN: 260
Accuracy: 0.60
Sensitivity: 0.58
Specificity: 0.61 (computed as TN / (TN + FP) = 363/596)

Unfortunately, these values are barely above 0.5, the performance of a random binary classifier, and are therefore not acceptable. From this we can deduce that the K-NN classifier is not suitable for the type of data supplied to the system. As for the previously analyzed classifier, the results from the Support Vector Machine are not good either; in fact, as we can see from Figure 5.3, the obtained values for the performance parameters are even worse than those of K-NN.

Specifically, we have:

TP: 560, TN: 62, FP: 36, FN: 562
Accuracy: 0.51
Sensitivity: 0.50
Specificity: 0.63 (computed as TN / (TN + FP) = 62/98)

Nevertheless, the specificity value is noticeably higher than the sensitivity value, indicating that the model identifies true negatives better than true positives. A high specificity is usually required when the goal of the test is to accurately identify people who do not have the condition, because the number of false positives should be very low; this is especially important when people identified as having the condition may be subjected to further testing, expense, anxiety, etc.

More reasonable values are obtained with the Decision Tree classifier. Figures 5.4 and 5.5 show the training set and test set results, respectively. In particular, for the training set we have:

TP: 1771, TN: 1611, FP: 843, FN: 655
Accuracy: 0.69
Sensitivity: 0.73
Specificity: 0.66

while for the test set the performance is slightly lower:

TP: 439, TN: 348, FP: 248, FN: 185
Accuracy: 0.65
Sensitivity: 0.70
Specificity: 0.58

The performance on the test set is moderate. Using a non-parametric algorithm gave better results than the previously evaluated classifiers, which can be attributed to the type and dimensionality of the dataset, characterized by no prior knowledge and heterogeneous complexity. Moreover, in this case we observe higher values for the sensitivity parameter. Higher sensitivity is required if the goal of the test is to identify everyone who has the condition, and it is especially important when the consequences of failing to treat the condition are serious and/or the treatment is very effective and has minimal side effects.

Finally, the results of the last classification algorithm analyzed are presented. The values obtained for the Random Forest model are shown in Figures 5.6 and 5.7, which refer to the training set and test set results, respectively. If a single decision tree was not quite the way to go in terms of performance, the Random Forest algorithm performs considerably better on both training and test sets, as represented by the ROC curves and the confusion matrices. The values obtained for the training set are the following:

TP: 2175, TN: 2199, FP: 255, FN: 251
Accuracy: 0.90
Sensitivity: 0.90
Specificity: 0.90

while the performance parameters obtained on the test set are reported below:

TP: 507, TN: 441, FP: 155, FN: 117
Accuracy: 0.77
Sensitivity: 0.80
Specificity: 0.78

Even if the performance on the test set is lower than on the training set, we found it to be acceptable. The technique is able to successfully classify the presence or absence of heart diseases in a large group of patients, reaching almost 80% accuracy and a good trade-off between sensitivity and specificity.

In the end, we compared the four supervised learning methods using a comparative histogram of the results, shown in Figure 5.1. The Random Forest approach outperformed K-NN and SVM in accuracy, sensitivity and specificity, while also providing better results than the Decision Tree algorithm. A single Decision Tree does not reach high performance values but, if sensitivity is very important and interpretability is not a priority, it could be considered for an initial evaluation of the presence or absence of heart diseases.
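The comparative graph of Figure 5.1 can be reproduced with a simple matplotlib sketch along the following lines, using the test-set values reported in this chapter; the specificity values for K-NN and SVM are derived here from their confusion matrices, and the styling differs from the original figure.

import numpy as np
import matplotlib.pyplot as plt

models = ['K-NN', 'SVM', 'Decision Tree', 'Random Forest']
accuracy = [0.60, 0.51, 0.65, 0.77]
sensitivity = [0.58, 0.50, 0.70, 0.80]
specificity = [0.61, 0.63, 0.58, 0.78]

x = np.arange(len(models))  # one group of bars per model
width = 0.25

plt.bar(x - width, accuracy, width, label='Accuracy')
plt.bar(x, sensitivity, width, label='Sensitivity')
plt.bar(x + width, specificity, width, label='Specificity')
plt.xticks(x, models)
plt.ylim(0, 1)
plt.legend()
plt.show()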

Figure 5.1: Comparative graph of the results.

Ideally, however, the evaluation should not be limited to a single random data division: different data partitions can be considered, and cross-validation can be used to investigate the variability in performance. Moreover, the complexity of the pre-processing operations, of the feature extraction algorithms and of the feature fusion process plays a decisive role in the overall quality of the entire system. All the steps through which the project was developed contributed to the final result of the classification algorithms, influencing their performance significantly.
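Such a cross-validated evaluation could be sketched as follows, assuming the same Random Forest configuration used above and the full balanced dataset; the choice of 5 folds and the dropped columns are assumptions made only for the example.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100, bootstrap=True, criterion='gini',
                            max_depth=10, random_state=2)

# 5-fold cross-validated accuracy over the whole balanced dataset
X = data.drop(columns=['SUBJECT_ID', 'Ground_truth'])
y = data['Ground_truth']
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')

print(scores.mean(), scores.std())  # average performance and its variability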

Figure 5.2: Results for K-NN.

Figure 5.3: Results for SVM.

Figure 5.4: Results for Decision Trees (Training set).

Figure 5.5: Results for Decision Trees (Test set).

Figure 5.6: Results for Random Forest (Training set).

Figure 5.7: Results for Random Forest (Test set).