gEM/GANN: a multivariate computational strategy for auto-characterizing relationships between cellular and clinical phenotypes and predicting disease progression time using high-dimensional flow cytometry data

Tong, D.L., Ball, G.R. ORCID: 0000-0001-5828-7129 and Pockley, A.G. ORCID: 0000-0001-9593-6431, 2015. gEM/GANN: a multivariate computational strategy for auto-characterizing relationships between cellular and clinical phenotypes and predicting disease progression time using high-dimensional flow cytometry data. Cytometry Part A, 87 (7), pp. 616-623. ISSN 1552-4922

PubSub1657_Ball.pdf - Post-print

Abstract

The dramatic increase in the complexity of flow cytometric datasets requires the development of new computation-based approaches that can maximize the amount of information derived and overcome the limitations of traditional gating strategies. Herein, we present a multivariate computational analysis of the HIV-infected flow cytometry datasets that were provided as part of the FlowCAP-IV Challenge using unsupervised and supervised learning techniques. Out of 383 samples (stimulated and unstimulated), 191 samples were used as a training set (34 individuals whose disease did not progress, and 157 individuals whose disease did progress). Using the results from the training set, the participants in the Challenge were then asked to predict the condition and progression time of the remaining individuals (45 ‘non-progressors’ and 147 ‘progressors’). To achieve this, we first scaled down data resolution. We then excluded doublet cells from the analysis using Expectation Maximization approaches. We then standardized all samples into histograms and used Genetic Algorithm-Neural Network to extract feature sets from the datasets, the reliability of which was examined using WEKA-implemented classifiers. The selected feature set resulted in a high sensitivity and specificity for the discrimination of progressors and non-progressors in the training set (average True Positive Rate = 1.00 and average False Positive Rate = 0.033). The capacity of the feature set to predict real-time survival time was better when using data from the ‘unstimulated’ training set (r = 0.825). The p-values and log-rank ratios (with 95% confidence intervals) between actual and predicted survival times in the test set were 0.682 and 0.9542 ± 0.24 for the unstimulated dataset, and 0.4451 and 0.9173 ± 0.23 for the stimulated dataset. Our analytic strategy has demonstrated a promising capacity to extract useful information from complex flow cytometry datasets, despite a significant imbalance and variation between the training and test sets.

Item Type: Journal article
Description: Special Issue: Computational Analysis of Flow Cytometry Data.
Publication Title: Cytometry Part A
Creators: Tong, D.L., Ball, G.R. and Pockley, A.G.
Publisher: Wiley for the International Society for Advancement of Cytometry
Date: 2015
Volume: 87
Number: 7
ISSN: 1552-4922
Identifiers: DOI 10.1002/cyto.a.22622
Divisions: Schools > School of Science and Technology
Depositing User: EPrints Services
Date Added: 09 Oct 2015 09:51
Last Modified: 09 Jun 2017 13:12
URI: http://irep.ntu.ac.uk/id/eprint/3876
