## Surajit Ray
University Gardens, University of Glasgow, Glasgow G12 8QQ. Phone: 0141 330 6238. Fax: 0141 330 4814. E-mail: Surajit.Ray at glasgow.ac.uk

**Research Interests:** My research interests are in model selection, the theory and geometry of mixture models, and functional data analysis. I am especially interested in the challenges presented by "large magnitude", both in the dimension of data vectors and in the number of vectors. Core areas of methodological research include multivariate mixtures, structural equation models, high-dimensional clustering and functional clustering. Key collaborative activities involve projects in immunology, modeling of climate ecosystem dynamics and medical image segmentation.

To appear in the

**Journal of the American Statistical Association** (2012)

**PLoS ONE** (May, 2012)

Journal Page | PDF | BibTex

**Background:** In recent years, intense research efforts have focused on developing methods for automated flow cytometric data analysis. However, while designing such applications, little or no attention has been paid to the human perspective that is absolutely central to the manual gating process of identifying and characterizing cell populations. In particular, the assumption of many common techniques that cell populations could be modeled reliably with pre-specified distributions may not hold true in real-life samples, which can have populations of arbitrary shapes and considerable inter-sample variation.

**Results:** To address this, we developed a new framework, flowScape, for emulating certain key aspects of the human perspective in analyzing flow data, which we implemented in multiple steps. First, flowScape creates a mathematically rigorous map of the high-dimensional flow data landscape based on dense and sparse regions defined by relative concentrations of events around modes. In the second step, these modal clusters are connected with a global hierarchical structure. This representation allows flowScape to perform ridgeline analysis for both traversing the landscape and isolating cell populations at different levels of resolution. Finally, we extended manual gating with a new capacity for constructing templates that can identify target populations in terms of their relative parameters, as opposed to the more commonly used absolute or physical parameters. This allows flowScape to apply such templates in batch mode for detecting the corresponding populations in a flexible, sample-specific manner. We also demonstrated different applications of our framework to flow data analysis and showed its superiority over other analytical methods.

**Conclusions:** The human perspective, built on top of intuition and experience, is a very important component of flow cytometric data analysis. By emulating some of its approaches and extending these with automation and rigor, flowScape provides a flexible and robust framework for computational cytomics.

**Annals of Applied Statistics** 2012, Vol 6, No 2

PDF | Supplement

To appear in

**Sociological Methods & Research** (2012)

Preprint

To appear in the

**Journal of Multivariate Analysis** (2012)

Preprint | Journal Page

Preprint | Software available from CRAN

**BMC Bioinformatics** 2011, 12:375

Journal Page | pdf

**Background:** The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.

**Results:** We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. Compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher) or recursive feature elimination embedded in SVM (RFE), TSP becomes increasingly more effective as the informative genes become progressively more correlated, as demonstrated both by classification performance and by the ability to recover the true informative genes. We also applied this hybrid scheme to three cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets and achieves performance comparable or superior to that of SVM alone. Consistent with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in two of the three cancer datasets.

**Conclusions:** The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that, as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which makes it potentially interesting as an alternative feature ranking method in pathway analysis.
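The pair-ranking idea behind this abstract can be illustrated with a minimal sketch. It assumes the standard top scoring pair score, the absolute difference between the two classes in the frequency with which gene i is expressed below gene j, and it simplifies in two ways that are not taken from the paper: pairs are not forced to be disjoint, and the function names (`tsp_score`, `top_k_pairs`, `reduce_to_pair_genes`) are hypothetical. The reduced matrix it returns is what a downstream classifier such as an SVM would then be trained on; that step is not shown.

```python
from itertools import combinations


def tsp_score(X, y, i, j):
    """|P(X_i < X_j | class 0) - P(X_i < X_j | class 1)| for one gene pair."""
    freq = []
    for label in (0, 1):
        rows = [x for x, c in zip(X, y) if c == label]
        freq.append(sum(x[i] < x[j] for x in rows) / len(rows))
    return abs(freq[0] - freq[1])


def top_k_pairs(X, y, k):
    """Return the k highest-scoring gene pairs (ties broken by pair index).

    Simplification: unlike some k-TSP formulations, pairs may share genes.
    """
    n_genes = len(X[0])
    scored = [
        (tsp_score(X, y, i, j), (i, j))
        for i, j in combinations(range(n_genes), 2)
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [pair for _, pair in scored[:k]]


def reduce_to_pair_genes(X, pairs):
    """Keep only the genes that occur in the selected pairs."""
    genes = sorted({g for pair in pairs for g in pair})
    return genes, [[x[g] for g in genes] for x in X]
```

Because every pair of genes is scored, the cost grows quadratically with the number of genes, so a real implementation would vectorise the comparison rather than loop in pure Python.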

Submitted to

**Sociological Methodology** (2011)

**Sankhya, Series A. (2011)** Vol 72

Journal Page | pdf

**Methods Mol Biol. (2011)** 723:337-47. PMID 21370075

pdf | PubMed | Journal Page

**BMC Immunology** 2008, 9:8

pdf | PubMed | Journal Page

**Background:**

Tumor-specific antigens and their specific epitopes are formulation targets for patient-specific cancer vaccines. A selection of prediction servers is available for identification of peptides that bind major histocompatibility complex class I (MHC-I) molecules. However, the lack of standardized methodology and the large number of human MHC-I molecules make the selection of appropriate prediction servers difficult. This study reports a comparative evaluation of thirty prediction servers for seven human MHC-I molecules.

**Results**

Of 147 individual predictors, 39 have shown excellent, 47 good, 33 marginal, and 28 poor ability to classify binders from non-binders. The classifiers for HLA-A*0201, A*0301, A*1101, B*0702, B*0801, and B*1501 have excellent classification accuracy, and those for A*2402 moderate accuracy. In addition, 16 prediction servers predict peptide binding affinity to MHC-I molecules with high accuracy, with correlation coefficients ranging from r=0.55 (B*0801) to r=0.87 (A*0201).

**Conclusions**

Non-linear predictors outperform matrix-based predictors, and the majority of predictors can be improved by non-linear transformations of their raw prediction scores. The best predictors of peptide binding (both classification and binding affinity) show the best performance in prediction of T-cell epitopes. We propose a new standard for prediction of MHC-I binding: a common scale for normalization of prediction scores that is applicable to both experimental and predicted scores.

**Journal of the Royal Statistical Society - Series B:** 70(1), pages 95-118, February 2008

pdf | ps | arxiv | Journal Page

Keywords: Global comparison of models, high dimensional data, model selection, mixture models, quadratic distance, quadratic risk, spectral degrees of freedom.

**Annals of Statistics** 2008, Vol. 36, No. 2, pages 983-1006

pdf | ps | Journal Page

**Immunome Research** 2007, Oct 29;3(1):9

pdf | PubMed | Journal Page

**Background:** A key step in the development of an adaptive immune response to pathogens or vaccines is the binding of short peptides to molecules of the Major Histocompatibility Complex (MHC) for presentation to T lymphocytes, which are thereby activated and differentiate into effector and memory cells. The rational design of vaccines consists in part in the identification of appropriate peptides to effect this process. There are several algorithms currently in use for making such predictions, but these are limited to a small number of MHC molecules and have good but imperfect prediction power.

**Results:** We have undertaken an exploration of the power gained by taking advantage of a natural representation of the amino acids in terms of their biophysical properties. We used several well-known statistical classifiers using either a naive encoding of amino acids by name or an encoding by biophysical properties. In all cases, the encoding by biophysical properties leads to substantially lower misclassification error.

**Conclusion:** Representation of amino acids using a few important bio-physio-chemical properties provides a natural basis for representing peptides and greatly improves peptide-MHC class I binding prediction.
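The contrast between the two encodings can be sketched as follows. This is only an illustration: the abstract does not list the properties used, so the Kyte-Doolittle hydropathy index is assumed here as a single example property, and the function names `encode_by_name` and `encode_by_property` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used here as one example of a
# biophysical property (an assumption, not the paper's property set).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_by_name(peptide):
    """Naive encoding: 20 indicator values per residue (one-hot)."""
    vec = []
    for aa in peptide:
        vec.extend(1 if aa == ref else 0 for ref in AMINO_ACIDS)
    return vec


def encode_by_property(peptide, tables=(HYDROPATHY,)):
    """Property encoding: one numeric value per residue per property table."""
    return [table[aa] for aa in peptide for table in tables]
```

For an 8-residue peptide the name encoding yields 160 sparse indicators, while the property encoding yields 8 values per property, and residues with similar properties receive similar values instead of unrelated indicator columns.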

**Journal of Machine Learning Research 8(Aug):1687-1723, 2007**

pdf | Journal Page | Software

**Annals of Statistics** 2005, Vol. 33, No. 5, pages 2042-2065

pdf | ps | Journal Page

**Proceedings of International Workshop on Mathematical Foundations of Computational Anatomy** pp. 136-145, 2006

pdf | Poster

**Proceedings of the SPIE, Vol. 6512, 2007**

pdf | Journal Page