
Research and development of innovative mathematical algorithms using cluster-based interactions of metagenomic data in biomedicine

Abstract : The development of new biotechnologies provides a wide variety of biological datasets, extending the scope of biomedical research. These include genomic datasets, extensively studied in the statistical literature, and metagenomic datasets, still relatively little studied, which require specific developments because of their particular characteristics. Representing the biological systems under study as networks makes it possible to model the functional relationships between their constituent elements and to understand the underlying biological processes. In this context, this thesis provides mathematical studies of clustering algorithms and suitable statistical tools to analyze these interactions.

The first part of this thesis is dedicated to the development of a graph clustering algorithm, called CORE-clustering, which robustly detects representative variables, the centers of specific variable clusters, within a high-dimensional complex system. Specifically, we aim to highlight these densely connected clusters, called CORE-clusters, which form the major structures of the graph, by imposing only the minimal dimension and the minimal level of similarity within each group. We then show, through various applications, the relevance of the detected CORE-clusters in the specific setting of high-dimensional genetic and road networks.

The second part of the thesis develops an extension of the spectral clustering algorithm that addresses the problem of identifying densely connected structures within a noisy graph, a situation characteristic of real biological networks. Using the properties of spectral clustering, this new variant, called l1-spectral clustering, robustly recovers the natural hidden structure of the graph by estimating community indicators under a lasso regularization. From a practical point of view, we show the stability of these estimators through various simulations, comparisons and biomedical applications.

The third part of the thesis concerns the use of statistical tools specifically adapted to the analysis of metagenomic datasets (intestinal microbiota genes). In the context of a clinical study conducted on patients suffering from early-stage liver pathologies, we propose different strategies to identify the patients’ clinical phenotypic profile and the microbial species involved in the development of the disease. To this end, we present a variety of exploratory, predictive and clustering methods used to identify groups of interacting bacteria and to understand the underlying mechanisms in the context of the clinical trial. This information is key to discovering biomarkers, biological signatures that categorize patients according to the disease.

Because this clinical trial involved biomedical data from two different cohorts, we developed fair learning approaches based on standard dimension reduction techniques to explain the total variability in the dataset while limiting the bias induced by the population’s diversity. These approaches are explored in the last part of the thesis.
Contributor : Abes Star
Submitted on : Friday, October 22, 2021 - 12:53:10 PM
Last modification on : Saturday, October 23, 2021 - 4:09:48 AM


Version validated by the jury (STAR)


  • HAL Id : tel-03395321, version 1


Camille Champion. Research and development of innovative mathematical algorithms using cluster-based interactions of metagenomic data in biomedicine. Statistics [math.ST]. INSA de Toulouse, 2021. English. ⟨NNT : 2021ISAT0005⟩. ⟨tel-03395321⟩


