Research Interests: My research interests are in the areas
of model selection, the theory and geometry of mixture
models, and functional data analysis. I am especially interested in challenges presented by
"large magnitude", both in the dimension of data vectors and in
the number of vectors. Core areas of methodological research
include multivariate mixtures, structural equation models,
high-dimensional clustering and functional clustering.
Key collaborative activities involve projects in immunology,
modeling of climate ecosystem dynamics and medical image segmentation.
Background: In recent years, intense research efforts have focused on developing methods for automated flow cytometric data analysis. However, while designing such applications, little or no attention has been paid to the human perspective that is absolutely central to the manual gating process of identifying and characterizing cell populations. In particular, the assumption of many common techniques that cell populations could be modeled reliably with pre-specified distributions may not hold true in real-life samples, which can have populations of arbitrary shapes and considerable inter-sample variation.
Results: To address this, we developed a new framework, flowScape, for emulating certain key aspects of the human perspective in analyzing flow data, which we implemented in multiple steps. First, flowScape creates a mathematically rigorous map of the high-dimensional flow data landscape based on dense and sparse regions defined by relative concentrations of events around modes. Second, these modal clusters are connected with a global hierarchical structure. This representation allows flowScape to perform ridgeline analysis for both traversing the landscape and isolating cell populations at different levels of resolution. Finally, we extended manual gating with a new capacity for constructing templates that can identify target populations in terms of their relative parameters, as opposed to the more commonly used absolute or physical parameters. This allows flowScape to apply such templates in batch mode for detecting the corresponding populations in a flexible, sample-specific manner. We also demonstrated different applications of our framework to flow data analysis and showed its superiority over other analytical methods.
Conclusions: The human perspective, built on top of intuition and experience, is a very important component of flow cytometric data analysis. By emulating some of its approaches and extending these with automation and rigor, flowScape provides a flexible and robust framework for computational cytomics.
Bayes Factors play an important role in comparing the fit of models ranging from multiple regression
to mixture models. Full Bayesian analysis calculates a Bayes Factor from an explicit
prior distribution. However, computational limitations or lack of an appropriate prior sometimes
prevent researchers from using an exact Bayes Factor. Instead, it is approximated, often
using Schwarz’s (1978) Bayesian Information Criterion (BIC), or a variant of the BIC. In this
paper we provide a comparison of several Bayes Factor approximations, including two new
approximations, the SPBIC and IBIC. The SPBIC is justified by using a scaled unit information
prior distribution that is more general than the BIC’s unit information prior, and the IBIC
approximation retains more terms of the approximation than the BIC does. In a simulation study we
show that several measures perform well in large samples, that performance declines in smaller
samples, and that SPBIC and IBIC can provide improvement to existing measures under some
conditions, including small sample sizes. We then illustrate the use of the fit measures in an
empirical example from the crime data of Ehrlich (1973). We conclude with recommendations
for researchers.
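As an illustration of the standard BIC route to an approximate Bayes Factor, the following minimal sketch compares an intercept-only model with a simple linear regression. It uses only the well-known Schwarz relation, 2 log BF approximately equal to the difference in BIC; it does not implement the SPBIC or IBIC, whose formulas are not given here. The data values and the function names gaussian_bic and fit_rss are made up for the example.

```python
import math

def gaussian_bic(rss, n, k):
    """BIC = -2 * (maximized log-likelihood) + k * log(n) for a Gaussian linear model."""
    # Maximized Gaussian log-likelihood with the error variance estimated as rss / n.
    loglik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    return -2 * loglik + k * math.log(n)

def fit_rss(x, y):
    """Residual sums of squares for the intercept-only and simple linear fits."""
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    syy = sum((yi - ybar) ** 2 for yi in y)
    return syy, syy - sxy ** 2 / sxx

# Illustrative data, roughly y = 2x plus small noise (not from the paper).
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9, 12.2, 13.8, 16.1, 17.9]
n = len(y)

rss_null, rss_line = fit_rss(x, y)
bic_null = gaussian_bic(rss_null, n, k=2)  # intercept + variance
bic_line = gaussian_bic(rss_line, n, k=3)  # intercept + slope + variance

# Schwarz approximation: BF(line vs. null) is roughly exp((BIC_null - BIC_line) / 2).
approx_bf_line_vs_null = math.exp((bic_null - bic_line) / 2)
print(bic_null, bic_line, approx_bf_line_vs_null)
```

A value of approx_bf_line_vs_null well above 1 favors the regression model; the approximation is expected to be least reliable in the small-sample settings the abstract highlights.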
Often researchers must choose among two or more structural equation models for a given set
of data. Typically, selection is based on having the highest chi-square p-value or the best
value of a fit index such as the CFI or RMSEA. Though this situation is common, there is little evidence on the
performance of these fit indices in choosing between models. In other statistical applications,
Bayes Factor approximations such as the BIC are commonly used to select between models,
but these are rarely used in SEMs. This paper examines several new and old Bayes Factor
approximations along with some commonly used fit indices to assess their accuracy in choosing
the true model among a broad set of false models. The results show that the Bayes Factor
approximations outperform the other fit indices. Among these approximations, one of the new
ones, SPBIC, is particularly promising. The commonly used chi-square p-value and the CFI,
IFI, and RMSEA do much worse.
Protein microarrays are a high-throughput technology capable of generating large quantities of proteomics data. They can be used for general research or for clinical diagnostics. Bioinformatics and statistical analysis techniques are required for interpretation and reaching biologically relevant conclusions from raw data. We describe essential algorithms for processing protein microarray data, including spot-finding on slide images, Z score, and significance analysis of microarrays (SAM) calculations, as well as the concentration dependent analysis (CDA). We also describe available tools for protein microarray analysis, and provide a template for a step-by-step approach to performing an analysis centered on the CDA method. We conclude with a discussion of fundamental and practical issues and considerations.
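The Z score step mentioned above can be sketched as follows. This is a minimal illustration that standardizes spot intensities against the mean and sample standard deviation of all spots on one array; the intensities, the cut-off of 1.5, and the function name z_scores are assumptions made for the example, not values or names from the chapter.

```python
import statistics

def z_scores(intensities):
    """Standardize each spot intensity against all spots on the array."""
    mean = statistics.mean(intensities)
    sd = statistics.stdev(intensities)  # sample standard deviation
    return [(v - mean) / sd for v in intensities]

# Made-up background-corrected spot intensities; the last spot is a strong signal.
signals = [120.0, 135.0, 128.0, 131.0, 126.0, 540.0]
scores = z_scores(signals)

# Illustrative cut-off only; a real analysis would choose it from controls.
hits = [i for i, z in enumerate(scores) if z > 1.5]
print(scores, hits)
```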
Background: Tumor-specific antigens and their specific epitopes are formulation
targets for patient-specific cancer vaccines. A selection of prediction servers
is available for identification of peptides that bind major histocompatibility
complex class I (MHC-I) molecules. However, the lack of standardized methodology
and the large number of human MHC-I molecules make the selection of appropriate
prediction servers difficult. This study reports a comparative evaluation of
thirty prediction servers for seven human MHC-I molecules.
Results: Of 147 individual predictors, 39 have shown excellent, 47 good, 33
marginal, and 28 poor ability to classify binders from non-binders. The
classifiers for HLA-A*0201, A*0301, A*1101, B*0702, B*0801, and B*1501 have
excellent, and for A*2402 moderate, classification accuracy. In addition, 16
prediction servers predict peptide binding affinity to MHC-I molecules with high
accuracy, with correlation coefficients ranging from r=0.55 (B*0801) to r=0.87
(A*0201).
Conclusions: Non-linear predictors outperform matrix-based predictors, and the
majority of predictors can be improved by non-linear transformations of their
raw prediction scores. The best predictors of peptide binding (both
classification and binding affinity) show the best performance in prediction of
T-cell epitopes. We propose a new standard for prediction of MHC-I binding: a
common scale for normalization of prediction scores that is applicable to both
experimental and predicted scores.
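The two kinds of evaluation reported above, classification of binders versus non-binders and agreement with measured binding affinity, can be illustrated with a short sketch. It assumes that classification accuracy is summarized by the area under the ROC curve, which the abstract does not state explicitly, and that agreement is measured by Pearson correlation. All scores and labels below are invented, and the function names are hypothetical.

```python
import math

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 = binder, 0 = non-binder; a higher score means 'more likely binder'.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

def pearson_r(a, b):
    """Pearson correlation between predicted and measured values."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Invented predictor scores for six peptides and their binder labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

Because the AUC depends only on the ranking of scores, it is unchanged by monotone rescaling of a predictor's raw output, whereas the Pearson correlation is not; this is one reason a non-linear transformation can improve affinity prediction without altering classification.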
This work builds a unified framework for the study of quadratic form distance
measures as
they are used in assessing the goodness of fit of models. Many important
procedures have this structure, but the theory for these methods is
dispersed and incomplete. Central to the statistical analysis of these
distances is the spectral decomposition of the kernel that generates the
distance. We show how this determines the limiting distribution of natural
goodness of fit tests. Additionally, we develop a new notion, the spectral
degrees of freedom of the test, based on this decomposition. The degrees of
freedom are easy to compute and estimate, and can be used as a guide in the
construction of useful procedures in this class.
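A rough numerical illustration of a spectral degrees-of-freedom calculation follows. It assumes a Satterthwaite-type definition, the squared sum of the kernel eigenvalues divided by the sum of their squares, which is an assumption here since the abstract does not state the formula. It also centers the kernel with respect to the empirical distribution rather than a fitted model, and uses a Gaussian kernel with bandwidth h; all function names are hypothetical. For a symmetric matrix the two sums equal the trace and the squared Frobenius norm, so no eigendecomposition is needed.

```python
import math

def gaussian_gram(xs, h):
    """Gram matrix of a Gaussian kernel with bandwidth h on scalar data."""
    return [[math.exp(-((a - b) ** 2) / (2 * h * h)) for b in xs] for a in xs]

def double_center(gram):
    """Center a symmetric kernel matrix with respect to the empirical distribution."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    grand = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + grand
             for j in range(n)] for i in range(n)]

def satterthwaite_dof(sym):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues) for a symmetric matrix.

    Computed as trace^2 / squared Frobenius norm, which avoids eigendecomposition.
    """
    trace = sum(sym[i][i] for i in range(len(sym)))
    frob_sq = sum(v * v for row in sym for v in row)
    return trace ** 2 / frob_sq

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]  # illustrative sample
dof_narrow = satterthwaite_dof(double_center(gaussian_gram(xs, h=0.1)))
dof_wide = satterthwaite_dof(double_center(gaussian_gram(xs, h=5.0)))
print(dof_narrow, dof_wide)
```

In this sketch a narrow kernel spreads the spectrum over many directions and yields more degrees of freedom, while a wide kernel concentrates it on a few, which is the sense in which such a quantity could guide the choice of kernel.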
In this paper, we develop a mode-based clustering approach applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is computed efficiently by an EM-style algorithm, namely, the Modal EM (MEM). This clustering method shares the major advantages of mixture-model-based clustering. Moreover, it requires no model fitting and ensures that every cluster corresponds to a bump of the density. A hierarchical clustering algorithm is also developed by applying MEM recursively to kernel density estimators with increasing bandwidths. The issue of diagnosing clustering results is investigated. Specifically, a pairwise cluster separability measure is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is developed to guarantee strong separation between clusters. Experiments demonstrate that our clustering approach tends to combine the strengths of mixture-model-based and linkage-based clustering. Tests on both simulated and real data show that the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling.
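The mode-ascent idea can be sketched in one dimension. The code below assumes a Gaussian kernel density estimate with a single bandwidth h, in which case each EM-style iteration reduces to a weighted mean of the data; it is a simplified illustration, not the paper's implementation, and the function names, tolerances, and data are invented.

```python
import math

def modal_em(x0, data, h, tol=1e-8, max_iter=500):
    """Ascend from x0 to a local maximum of a 1-D Gaussian kernel density estimate."""
    x = x0
    for _ in range(max_iter):
        # E-step: weight of each kernel component at the current point.
        w = [math.exp(-((x - xi) ** 2) / (2 * h * h)) for xi in data]
        total = sum(w)
        # M-step: with equal-bandwidth Gaussian kernels the update is the weighted mean.
        x_new = sum(wi * xi for wi, xi in zip(w, data)) / total
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def mode_clusters(data, h, merge_tol=1e-3):
    """Group points that ascend to (numerically) the same mode."""
    modes, labels = [], []
    for xi in data:
        m = modal_em(xi, data, h)
        for k, known in enumerate(modes):
            if abs(m - known) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels

# Invented data with two well-separated groups.
data = [0.0, 0.2, 0.4, 0.1, 0.3, 5.0, 5.2, 5.4, 5.1, 5.3]
modes, labels = mode_clusters(data, h=0.5)
coarse_modes, coarse_labels = mode_clusters(data, h=5.0)
print(modes, labels, coarse_modes)
```

Rerunning with a larger bandwidth (here h=5.0) merges the two groups into one mode, which mirrors the hierarchical scheme of applying the ascent to density estimates with increasing bandwidths.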
The advancing technology for automatic segmentation of medical images should
be accompanied by techniques to inform the user of the credibility of results.
To the extent that this technology produces clinically acceptable segmentations
for a significant fraction of cases there is a risk that the clinician will assume
every result is acceptable. In the less frequent case where segmentation fails
we are concerned that unless the user is alerted by the computer, she would
still put the result to clinical use. We propose an automated method to signal
suspected noncredible regions of the segmentation, triggered by outlier values
of the local image match function. The user can focus her validation resources
on the noncredible regions.
When the local image match function is computed via a Mahalanobis distance,
as is the case for PCA-based matches, its value follows the chi-squared
distribution. Our method signals a noncredible region wherever the probability
of a chi-squared random variable being greater than the match observed is above
a threshold level.
ROC analysis validates our noncredibility test on m-rep segmentations of
the bladder in CT images, using an image match computed by PCA on regional
intensity quantile functions. We approximate ground truth by treating regions
whose surface distance to a reference segmentation exceeds 5 mm as truly
noncredible. We swept
out ROC curves by varying the threshold level. The area under the ROC curve
was 0.91. Based on this preliminary result, our method shows potential for
validation in an automatic segmentation pipeline.
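The chi-squared test on a PCA-based Mahalanobis match can be sketched as follows. The sketch assumes that outliers are large distances, so a region is flagged when the chi-squared distribution function evaluated at the observed squared distance exceeds a chosen level; this reading of the threshold direction, the level of 0.95, the function names, and the example scores are all assumptions for illustration. The distribution function is computed with a standard series for the regularized incomplete gamma function so that the example needs only the standard library.

```python
import math

def chi2_cdf(x, k):
    """P(X <= x) for X ~ chi-squared with k degrees of freedom (series expansion)."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total and n < 100000:
        term *= z / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(-z + a * math.log(z) - math.lgamma(a)))

def mahalanobis_sq(pca_scores, pca_variances):
    """Squared Mahalanobis distance from PCA coefficients and their variances."""
    return sum(s * s / v for s, v in zip(pca_scores, pca_variances))

def is_noncredible(pca_scores, pca_variances, level=0.95):
    """Flag a region whose match is an upper-tail outlier of chi-squared(k)."""
    d2 = mahalanobis_sq(pca_scores, pca_variances)
    return chi2_cdf(d2, len(pca_scores)) > level

# Invented PCA coefficients of a local intensity descriptor for two regions.
typical_region = [0.1, -0.2, 0.3]
unusual_region = [3.0, 3.0, 3.0]
unit_variances = [1.0, 1.0, 1.0]
print(is_noncredible(typical_region, unit_variances),
      is_noncredible(unusual_region, unit_variances))
```

Sweeping the level parameter over (0, 1) and comparing the flags with a ground-truth labeling is what traces out an ROC curve of the kind reported above.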
This research was initiated by the analysis of the NCI60 cancer dataset. The dataset contains gene expression values (from cDNA arrays) corresponding to 3509 genes collected from 60 different patients diagnosed with 8 different cancer types (assumed unknown in the following discussion). The goal is to provide a model-based approach for simultaneously clustering cancer types (columns) and the genes (rows) involved in differentiating these cancer types. We formulate a novel two-way mixture framework and adapt our distance-based model selection tool to determine the unknown number of row and column clusters. This methodology avoids two major pitfalls of using model-based clustering in high dimensions. First, the two-way mixture has a considerably smaller parameter set, compared to the full multivariate analysis, making all parameters estimable. Second, unlike the complex distribution of likelihood-ratio-based tools under the composite null hypothesis of fixed row and column clusters, the distribution of our distance-based model selection tool is well defined, even for composite hypotheses. Finally, based on the geometry of pure Gaussian HDLSS data, we provide an effective visual diagnostic tool to uncover any remaining structure in the data. Through our analysis, we uncovered some interesting sets of gene clusters. However, some of our cancer-type clusters did not match the initial cancer labels. On later verification we found that such discordance was due to the close similarity in symptoms and pathological test results of the two types of cancer in question.
Multivariate mixtures provide flexible methods for both fitting and
partitioning high-dimensional data. Ray and Lindsay (2005) show that the
topography of multivariate mixtures, in the sense of their key features as a
density, can be analyzed rigorously in lower dimensions by use of a ridgeline
manifold that contains all critical points as well as the ridges of the
density. In addition, we have developed a new computing procedure similar to
the EM algorithm that can quickly find the modes of a mixture density.
This tool can be extended to examine the degree of separation
between the modes based on the ridgeline separating them.
These tools can be used in various ways. For one, we can take a conventional
mixture analysis and cluster together those components
whose contribution is actually unimodal. This cluster could then
represent a single true component with a more complex distribution.
We can also turn kernel density estimation into a hierarchical clustering tool
in which the data points become identified with each other by their association
with a common mode of the density estimator. Separate clusters must then
correspond to gaps in the estimated density. The analysis
is multi-scale, as different levels of smoothing provide different
aggregations.
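The ridgeline idea for a two-component Gaussian mixture can be illustrated with a short sketch. It uses the ridgeline curve x(alpha) = [(1 - alpha) S1^-1 + alpha S2^-1]^-1 [(1 - alpha) S1^-1 m1 + alpha S2^-1 m2] for alpha in [0, 1], restricted here to diagonal covariances so that the formula applies coordinate by coordinate. Counting local maxima of the density along this one-dimensional curve then indicates whether the two components form one mode or two. The grid search, function names, and parameter values are assumptions for the example and stand in for the EM-like mode finder described above.

```python
import math

def ridgeline_point(alpha, mu1, var1, mu2, var2):
    """Point on the ridgeline for two Gaussians with diagonal covariances."""
    return [((1 - alpha) * m1 / v1 + alpha * m2 / v2) /
            ((1 - alpha) / v1 + alpha / v2)
            for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2)]

def diag_normal_pdf(x, mu, var):
    """Density of a Gaussian with diagonal covariance."""
    log_p = sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                for xi, m, v in zip(x, mu, var))
    return math.exp(log_p)

def elevation(alpha, pi1, mu1, var1, mu2, var2):
    """Mixture density evaluated along the ridgeline."""
    x = ridgeline_point(alpha, mu1, var1, mu2, var2)
    return (pi1 * diag_normal_pdf(x, mu1, var1) +
            (1 - pi1) * diag_normal_pdf(x, mu2, var2))

def count_modes(pi1, mu1, var1, mu2, var2, grid=201):
    """Count local maxima of the elevation on a grid of alpha values."""
    h = [elevation(i / (grid - 1), pi1, mu1, var1, mu2, var2) for i in range(grid)]
    modes = 0
    for i in range(grid):
        left = h[i - 1] if i > 0 else -math.inf
        right = h[i + 1] if i < grid - 1 else -math.inf
        if h[i] > left and h[i] > right:
            modes += 1
    return modes

unit = [1.0, 1.0]
close = count_modes(0.5, [0.0, 0.0], unit, [1.0, 1.0], unit)  # overlapping components
far = count_modes(0.5, [0.0, 0.0], unit, [4.0, 4.0], unit)    # well-separated components
print(close, far)
```

When the count is one, the two components could be merged into a single cluster, in line with the merging of components whose joint contribution is unimodal.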