Surajit Ray |
University Gardens University of Glasgow GLASGOW G12 8QQ Phone: 0141 330 6238 Fax: 0141 330 4814 E-mail: Surajit.Ray at glasgow.ac.uk |
Postgraduate Opportunities
Recently Published and Working Papers
2018
Spatial models with covariates improve estimates of peat depth in blanket peatlands
with Young, D. M., Parry, L. E., and Lee, D.
PLoS ONE, (2018)
Journal Page | PDF | BibTex
Peatlands are spatially heterogeneous ecosystems that develop due to a complex set of autogenic physical and biogeochemical processes and allogenic factors such as the climate and topography. They are significant stocks of global soil carbon, and therefore predicting the depth of peatlands is an important part of establishing an accurate assessment of their magnitude. Yet there have been few attempts to account for both internal and external processes when predicting the depth of peatlands. Using blanket peatlands in Great Britain as a case study, we compare a linear and geostatistical (spatial) model and several sets of covariates applicable for peatlands around the world that have developed over hilly or undulating terrain. We hypothesized that the spatial model would act as a proxy for the autogenic processes in peatlands that can mediate the accumulation of peat on plateaus or shallow slopes. Our findings show that the spatial model performs better than the linear model in all cases: root mean square errors (RMSE) are lower, and 95% prediction intervals are narrower. In support of our hypothesis, the spatial model also better predicts the deeper areas of peat, and we show that its predictive performance in areas of deep peat is dependent on depth observations being spatially autocorrelated. Where they are not, the spatial model performs only slightly better than the linear model. As a result, we recommend that practitioners carrying out depth surveys fully account for the variation of topographic features in prediction locations, and that the sampling approach adopted enables observations to be spatially autocorrelated.
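As an illustration of why a spatial model can outperform a purely linear one when depth observations are spatially autocorrelated, here is a minimal Python sketch. It is not the paper's model: the data are synthetic 1-D "depths", the covariance is an assumed exponential model, and the spatial model is simple kriging of the linear model's residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D transect: depth = linear trend in a covariate + smooth spatial signal
n = 60
x = np.sort(rng.uniform(0, 10, n))             # locations along a transect
slope = np.sin(x / 3.0)                        # a stand-in topographic covariate
spatial = np.cumsum(rng.normal(0, 0.15, n))    # spatially autocorrelated residual
depth = 2.0 - 0.8 * slope + spatial

train = np.arange(n) % 3 != 0                  # hold out every third point
test = ~train

# Linear model: depth ~ slope (ordinary least squares)
A = np.column_stack([np.ones(train.sum()), slope[train]])
beta, *_ = np.linalg.lstsq(A, depth[train], rcond=None)
lin_pred = beta[0] + beta[1] * slope[test]

# Spatial model: same trend + simple kriging of residuals (exponential covariance)
def expcov(d, sill=1.0, rng_par=2.0):
    return sill * np.exp(-d / rng_par)

res = depth[train] - (beta[0] + beta[1] * slope[train])
D_tt = np.abs(x[train][:, None] - x[train][None, :])
D_pt = np.abs(x[test][:, None] - x[train][None, :])
K = expcov(D_tt) + 1e-6 * np.eye(train.sum())  # small nugget for stability
w = np.linalg.solve(K, res)
sp_pred = lin_pred + expcov(D_pt) @ w

def rmse(pred):
    return float(np.sqrt(np.mean((pred - depth[test]) ** 2)))

print(rmse(lin_pred), rmse(sp_pred))           # compare held-out RMSEs
```

Because the held-out residuals are strongly autocorrelated, borrowing strength from neighbouring observations reduces the RMSE relative to the trend-only fit, which is the qualitative pattern the paper reports.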
2017
Functional principal component analysis of spatially correlated data
with Liu, C., and Hooker, G.
Statistics and Computing, (2017)
Journal Page | PDF | BibTex
This paper focuses on the analysis of spatially correlated functional data. We propose a parametric model for spatial correlation, in which the between-curve correlation is modeled by correlating functional principal component scores of the functional data. Additionally, in the sparse observation framework, we propose a novel approach of spatial principal analysis by conditional expectation to explicitly estimate spatial correlations and reconstruct individual curves. Assuming spatial stationarity, empirical spatial correlations are calculated as the ratio of eigenvalues of the smoothed covariance surface Cov(Xi(s), Xi(t)) and cross-covariance surface Cov(Xi(s), Xj(t)) at locations indexed by i and j. An anisotropic Matérn spatial correlation model is then fitted to the empirical correlations. Finally, principal component scores are estimated to reconstruct the sparsely observed curves. This framework can naturally accommodate arbitrary covariance structures, but there is an enormous reduction in computation if one can assume the separability of temporal and spatial components. We demonstrate the consistency of our estimates and propose hypothesis tests to examine the separability as well as the isotropy effect of spatial correlation. Using simulation studies, we show that these methods have some clear advantages over existing methods of curve reconstruction and estimation of model parameters.
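The paper fits an anisotropic Matérn model to the empirical correlations; as a much-reduced sketch of that step, the following Python snippet fits only an isotropic Matérn range parameter (smoothness fixed at 3/2, and synthetic "empirical" correlations standing in for the estimated ones) by least squares:

```python
import numpy as np

# Matérn correlation with smoothness nu = 3/2 (closed form)
def matern32(d, ell):
    a = np.sqrt(3.0) * d / ell
    return (1.0 + a) * np.exp(-a)

rng = np.random.default_rng(1)
dists = np.linspace(0.1, 5.0, 25)          # pairwise site distances
# Noisy stand-in for empirical spatial correlations (true range 1.7)
emp_corr = matern32(dists, 1.7) + rng.normal(0, 0.02, dists.size)

# Least-squares fit of the range parameter over a grid
grid = np.linspace(0.2, 5.0, 500)
sse = [np.sum((emp_corr - matern32(dists, ell)) ** 2) for ell in grid]
ell_hat = float(grid[int(np.argmin(sse))])
print(ell_hat)   # recovers a range close to 1.7
```

An anisotropic fit, as in the paper, would additionally estimate a direction-dependent scaling of distance before applying the same correlation function.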
2014
Parallel and hierarchical mode association clustering with an R package Modalclust
with Y. Cheng.
Vol 4(10), pp. 826-836. Open Journal of Statistics (2014)
Journal Page | PDF | BibTex
Modalclust is an R package which performs Hierarchical Mode Association Clustering (HMAC) along with its parallel implementation over several processors. Modal clustering techniques are especially designed to efficiently extract clusters in high dimensions with arbitrary density shapes. Further, clustering is performed over several resolutions and the results are summarized as a hierarchical tree, thus providing a model-based multi-resolution cluster analysis. Finally, we provide a novel parallel implementation of HMAC which performs the clustering job over several processors, thereby dramatically increasing the speed of the clustering procedure, especially for large data sets. This package also provides a number of functions for visualizing clusters in high dimensions, which can also be used with other clustering software.
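The package itself is in R; as a language-neutral sketch of the core idea of mode association clustering, the Python snippet below ascends a Gaussian kernel density estimate by mean-shift iteration and groups points by the mode they reach. The bandwidth and grouping tolerance are illustrative choices, not the HMAC defaults, and the hierarchical, multi-resolution part of HMAC is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two well-separated 2-D groups of points
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)), rng.normal(3.0, 0.3, (40, 2))])

def ascend_to_mode(x, data, h=0.5, iters=500, tol=1e-7):
    """Gaussian mean-shift ascent of the kernel density estimate from x."""
    for _ in range(iters):
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2.0 * h * h))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

modes = [ascend_to_mode(x, X) for x in X]

# Associate points whose ascents end at (numerically) the same mode
mode_list, labels = [], []
for m in modes:
    for k, u in enumerate(mode_list):
        if np.linalg.norm(m - u) < 0.1:
            labels.append(k)
            break
    else:
        mode_list.append(m)
        labels.append(len(mode_list) - 1)

print(len(mode_list))   # number of modal clusters found
```

HMAC repeats this association over an increasing sequence of bandwidths, merging modes as the density smooths out, which yields the hierarchical tree described above.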
Multivariate modality inference using Gaussian kernel.
with Y. Cheng.
Vol 4(5), pp. 419-434. Open Journal of Statistics (2014)
Journal Page | PDF | BibTex
The number of modes (also known as the modality) of a kernel density estimator (KDE) has drawn considerable interest and is important in practice. In this paper, we develop an inference framework for the modality of a KDE in the multivariate setting using the Gaussian kernel. We apply the modal clustering method proposed by [1] for mode hunting. A test statistic and its asymptotic distribution are derived to assess the significance of each mode. The inference procedure is applied to both simulated and real data sets.
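A minimal sketch of the starting point of such an analysis: counting the modes of a Gaussian KDE on a grid, in one dimension, with an arbitrary bandwidth. The paper's test of the significance of each mode is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
# Clearly bimodal sample: two components at -2 and +2
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(2.0, 0.5, 200)])

def gauss_kde(grid, data, h):
    """Gaussian kernel density estimate evaluated on a grid."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z * z).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

grid = np.linspace(-5.0, 5.0, 2001)
f = gauss_kde(grid, x, h=0.5)

# Interior grid points that are strict local maxima of the estimated density
n_modes = int(((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])).sum())
print(n_modes)
```

The modality of a Gaussian KDE depends on the bandwidth h: as h grows the estimate smooths toward unimodality, which is what makes a formal test of each mode's significance necessary.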
Kernels, degrees of freedom, and power properties of quadratic goodness of fit tests.
with B.G. Lindsay and M. Markatou.
Vol 109, No 505, Journal of American Statistical Association (2014)
Journal Page | PDF | BibTex
In this paper we study the power properties of quadratic distance-based goodness-of-fit tests. First, we introduce the concept of a root kernel and discuss the considerations that enter the selection of this kernel. We derive an easy-to-use normal approximation to the power of quadratic distance goodness-of-fit tests and base on it the construction of a noncentrality index, an analogue of the traditional noncentrality parameter. This leads to a method akin to the Neyman-Pearson lemma for constructing optimal kernels for specific alternatives. We then introduce a midpower analysis as a device for choosing optimal degrees of freedom for a family of alternatives of interest. Finally, we introduce a new diffusion kernel, called the Pearson-normal kernel, and study the extent to which the normal approximation to the power of tests based on this kernel is valid.
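A toy example of a quadratic distance statistic of this general type: testing N(0, 1) with a Gaussian kernel centred under the null via standard Gaussian convolution identities. The bandwidth and sample sizes are arbitrary, and this is not the Pearson-normal kernel of the paper.

```python
import numpy as np

H = 0.5  # kernel bandwidth (illustrative choice)

def k(x, y):
    # Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 H^2))
    return np.exp(-(x - y) ** 2 / (2.0 * H * H))

def k_null(x):
    # E_{Z~N(0,1)} k(x, Z), by the Gaussian convolution identity
    return H / np.sqrt(H * H + 1.0) * np.exp(-x ** 2 / (2.0 * (H * H + 1.0)))

K_NULL2 = H / np.sqrt(H * H + 2.0)   # E k(Z, Z') for independent Z, Z' ~ N(0,1)

def quad_dist(x):
    """V-statistic form of the quadratic distance to N(0, 1) with centred kernel."""
    n = x.size
    Kc = (k(x[:, None], x[None, :]) - k_null(x)[:, None]
          - k_null(x)[None, :] + K_NULL2)
    return float(Kc.sum() / (n * n))

rng = np.random.default_rng(4)
d_null = quad_dist(rng.normal(0.0, 1.0, 400))   # data from the null
d_alt = quad_dist(rng.normal(1.0, 1.0, 400))    # mean-shifted alternative
print(d_null, d_alt)
```

Centring the kernel with respect to the null makes the statistic nearly zero under N(0, 1) and bounded away from zero under the shifted alternative, which is the behaviour a power analysis of such tests quantifies.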
BIC and Alternative Bayesian Information Criteria in the Selection of Structural Equation Models
with K. Bollen, J. Zavisca and Jeffrey J. Harden.
31 Jan 2014, Structural Equation Modeling: A Multidisciplinary Journal (2014)
Journal Page | PDF
Selecting between competing structural equation models is a common problem. Often selection is based on the chi-square test statistic or other fit indices. In other areas of statistical research Bayesian information criteria are commonly used, but they are less frequently used with structural equation models compared to other fit indices. This article examines several new and old information criteria (IC) that approximate Bayes factors. We compare these IC measures to common fit indices in a simulation that includes the true and false models. In moderate to large samples, the IC measures outperform the fit indices. In a second simulation we only consider the IC measures and do not include the true model. In moderate to large samples the IC measures favor approximate models that only differ from the true model by having extra parameters. Overall, SPBIC, a new IC measure, performs well relative to the other IC measures.
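For intuition on how BIC-type criteria behave in such model comparisons, here is a minimal regression sketch; it uses ordinary least squares rather than a structural equation model, and SPBIC itself is not implemented.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 + rng.normal(0.0, 1.0, n)    # x2 is irrelevant

def bic(X, y):
    """Gaussian-likelihood BIC of an OLS fit: -2 log L + k log n."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)
    loglik = -0.5 * len(y) * (np.log(2.0 * np.pi * sigma2) + 1.0)
    k = X.shape[1] + 1                           # coefficients + error variance
    return float(-2.0 * loglik + k * np.log(len(y)))

ones = np.ones(n)
bic_true = bic(np.column_stack([ones, x1]), y)
bic_over = bic(np.column_stack([ones, x1, x2]), y)
print(bic_true, bic_over)
```

With a moderate sample, the log n penalty typically outweighs the negligible fit improvement from the extra parameter, so the true model attains the lower BIC, mirroring the large-sample behaviour of the IC measures in the simulations.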
Statistical monitoring of clinical trials with multiple co-primary endpoints using multivariate B-value
with Yansong Cheng, Mark Chang and Sandeep Menon
To appear in Statistics in Biopharmaceutical Research (2014)
Journal Page | PDF
This paper develops methods of statistical monitoring of clinical trials with multiple co-primary endpoints, where success is defined as meeting both endpoints simultaneously. In practice, a group sequential design method is used to stop trials early for promising efficacy, and conditional power is used for futility stopping rules. In this paper, we show that stopping boundaries for the group sequential design with multiple co-primary endpoints should be the same as those for studies with single endpoints. Lan and Wittes (1988) proposed the B-value tool to calculate the conditional power of single-endpoint trials, and we extend this tool to calculate the conditional power for studies with multiple co-primary endpoints. We consider the case of two-arm studies with normally distributed co-primary endpoints and provide an example of implementation with a simulated trial. A fixed-weight sample size re-estimation approach based on conditional power is introduced.
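A sketch of the single-endpoint B-value and conditional-power calculation of Lan and Wittes (1988) that the paper extends. The product of the marginal conditional powers shown at the end is only an illustrative bound under an assumed independence of the two endpoints; it is not the paper's method, which handles the joint distribution.

```python
import math

def Phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def conditional_power(z_t, t, z_alpha=1.959963984540054):
    """Lan-Wittes conditional power under the current trend, single endpoint.

    z_t: interim Z-statistic; t: information fraction in (0, 1)."""
    b = z_t * math.sqrt(t)     # B-value at information fraction t
    theta = b / t              # estimated drift under the current trend
    return Phi((b + theta * (1.0 - t) - z_alpha) / math.sqrt(1.0 - t))

# Two co-primary endpoints at the halfway interim look
cp1 = conditional_power(2.2, 0.5)
cp2 = conditional_power(1.8, 0.5)
print(cp1, cp2, cp1 * cp2)     # joint power under assumed independence
```

Because success requires meeting both endpoints, the joint conditional power is never higher than the weaker marginal one, which is why monitoring co-primary trials on marginal B-values alone would be anti-conservative about futility.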
2013
On the Number of Modes of Finite Mixtures of Elliptical Distributions
with Grigory Alexandrovich and Hajo Holzmann
49-57 Conference Proceedings: Algorithms from and for Nature and Life 2013 (2013)
Journal Page | PDF | BibTex
We extend the concept of the ridgeline from Ray and Lindsay (Ann Stat 33:2042–2065, 2005) to finite mixtures of general elliptical densities with possibly distinct density generators in each component. This can be used to obtain bounds for the number of modes of two-component mixtures of t distributions in any dimension. In case of proportional dispersion matrices, these have at most three modes, while for equal degrees of freedom and equal dispersion matrices, the number of modes is at most two. We also give numerical illustrations and indicate applications to clustering and hypothesis testing.
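The ridgeline construction can be evaluated numerically. The sketch below does so for a two-component normal mixture (rather than a t mixture) in two dimensions, counting modes along the ridgeline, on which all critical points of such a mixture lie (Ray and Lindsay, 2005); the specific means and dispersions are illustrative.

```python
import numpy as np

# Ridgeline of a two-component normal mixture:
# x(a) = [(1-a) S1^{-1} + a S2^{-1}]^{-1} [(1-a) S1^{-1} mu1 + a S2^{-1} mu2]
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 0.0])
S1 = np.eye(2)
S2 = np.diag([1.0, 4.0])
S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)

def dens(x, pi=0.5):
    def phi(x, mu, S):
        d = x - mu
        return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / (2.0 * np.pi * np.sqrt(np.linalg.det(S)))
    return pi * phi(x, mu1, S1) + (1.0 - pi) * phi(x, mu2, S2)

alphas = np.linspace(0.0, 1.0, 2001)
vals = []
for a in alphas:
    M = (1.0 - a) * S1i + a * S2i
    x = np.linalg.solve(M, (1.0 - a) * S1i @ mu1 + a * S2i @ mu2)
    vals.append(dens(x))
vals = np.array(vals)

# Every critical point of the mixture lies on the ridgeline, so local maxima
# of the density along the ridgeline are exactly the modes of the mixture
interior = (vals[1:-1] > vals[:-2]) & (vals[1:-1] > vals[2:])
n_modes = int(interior.sum()) + int(vals[0] > vals[1]) + int(vals[-1] > vals[-2])
print(n_modes)
```

For these well-separated components the curve has two local maxima; pushing the means together or varying the dispersion ratio is how the bounds on mode counts quoted above can be explored numerically.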
2012
A Computational Framework to Emulate the Human Perspective in Flow Cytometric Data Analysis
with Saumyadipta Pyne
PLoS one (May, 2012)
Journal Page | PDF | BibTex
Background:
In recent years, intense research efforts have focused on developing methods for automated flow cytometric data analysis. However, while designing such applications, little or no attention has been paid to the human perspective that is absolutely central to the manual gating process of identifying and characterizing cell populations. In particular, the assumption of many common techniques that cell populations could be modeled reliably with pre-specified distributions may not hold true in real-life samples, which can have populations of arbitrary shapes and considerable inter-sample variation.
Results: To address this, we developed a new framework, flowScape, for emulating certain key aspects of the human perspective in analyzing flow data, which we implemented in multiple steps. First, flowScape begins with creating a mathematically rigorous map of the high-dimensional flow data landscape based on dense and sparse regions defined by relative concentrations of events around modes. In the second step, these modal clusters are connected with a global hierarchical structure. This representation allows flowScape to perform ridgeline analysis for both traversing the landscape and isolating cell populations at different levels of resolution. Finally, we extended manual gating with a new capacity for constructing templates that can identify target populations in terms of their relative parameters, as opposed to the more commonly used absolute or physical parameters. This allows flowScape to apply such templates in batch mode for detecting the corresponding populations in a flexible, sample-specific manner. We also demonstrate different applications of our framework to flow data analysis and show its superiority over other analytical methods.
Conclusions: The human perspective, built on top of intuition and experience, is a very important component of flow cytometric data analysis. By emulating some of its approaches and extending these with automation and rigor, flowScape provides a flexible and robust framework for computational cytomics.
Functional Factor Analysis For Periodic Remote Sensing Data
with Chong Liu, Giles Hooker, and Mark Friedl.
Annals of Applied Statistics 2012, Vol 6, No 2
PDF | Supplement
We present a new approach to factor rotation for functional data. This rotation is achieved by rotating the functional principal components towards a pre-defined space of periodic functions designed to decompose the total variation into components that are nearly-periodic and nearly-aperiodic with a pre-defined period. We show that the factor rotation can be obtained by calculation of canonical correlations between appropriate spaces, which makes the methodology computationally efficient. Moreover, we demonstrate that our proposed rotations provide stable and interpretable results in the presence of highly complex covariance. This work is motivated by the goal of finding interpretable sources of variability in vegetation indices obtained from remote sensing instruments, and we demonstrate our methodology through an application of factor rotation to these data.
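The computational core, canonical correlations between two function spaces, can be sketched numerically as the singular values of the cross-product of orthonormalized bases. The bases below are hypothetical stand-ins (noisy sinusoids and a linear trend), not estimated functional principal components.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 200)

# A stand-in "PC" basis: noisy periodic shapes plus an aperiodic trend
pcs = np.column_stack([
    np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=t.size),
    t - 0.5,
    np.cos(2 * np.pi * t) + 0.1 * rng.normal(size=t.size),
])
# Pre-defined periodic space with period 1
periodic = np.column_stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])

# Canonical correlations between span(pcs) and span(periodic):
# orthonormalize each basis, then take singular values of the cross-product
Q1, _ = np.linalg.qr(pcs)
Q2, _ = np.linalg.qr(periodic)
cancor = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
print(cancor)   # values near 1 flag directions shared with the periodic space
```

The corresponding singular vectors give the rotation of the original components toward the periodic space; directions with canonical correlation near 1 are the nearly-periodic components, the rest nearly-aperiodic.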
A Comparison of Bayes Factor Approximation Methods Including Two New Methods
with K. Bollen, J. Zavisca and Jeffrey J. Harden.
vol. 41 no. 2 294-324, Sociological Methods & Research (2012)
Preprint | Journal Page
Bayes Factors play an important role in comparing the fit of models ranging from multiple regression to mixture models. Full Bayesian analysis calculates a Bayes Factor from an explicit prior distribution. However, computational limitations or lack of an appropriate prior sometimes prevent researchers from using an exact Bayes Factor. Instead, it is approximated, often using Schwarz's (1978) Bayesian Information Criterion (BIC), or a variant of the BIC. In this paper we provide a comparison of several Bayes Factor approximations, including two new approximations, the SPBIC and IBIC. The SPBIC is justified by using a scaled unit information prior distribution that is more general than the BIC's unit information prior, and the IBIC approximation utilizes more terms of approximation than the BIC. In a simulation study we show that several measures perform well in large samples, that performance declines in smaller samples, and that SPBIC and IBIC can provide improvement to existing measures under some conditions, including small sample sizes. We then illustrate the use of the fit measures in an empirical example from the crime data of Ehrlich (1973). We conclude with recommendations for researchers.
On the upper bound of the number of modes of a multivariate normal mixture
with Dan Ren
Volume 108, Pages 41–52, Journal of Multivariate Analysis (2012)
Preprint | Journal Page
The main result of this article states that one can get as many as D + 1 modes from a two-component normal mixture in D dimensions. Multivariate mixture models are widely used for modeling homogeneous populations and for cluster analysis. Either the components directly or the modes arising from these components are often used to extract individual clusters. Though in lower dimensions these strategies work well, our results show that high dimensional mixtures are often very complex, and researchers should take extra precaution while using mixtures for cluster analysis. Even in the simplest case of mixing only two normal components in D dimensions, we can show that the mixture can have a maximum of D + 1 modes. When we mix more components, or if the components are non-normal, the number of modes might be even higher, which might lead us to wrong inference on the number of clusters. Further analyses show that the number of modes depends on the component means and the eigenvalues of the ratio of the two component covariance matrices, which in turn provides a clear guideline as to when one can use mixture analysis for clustering high dimensional data.
Parallel and Hierarchical Mode Association Clustering with an R Package Modalclust
with Yansong Cheng
Preprint | Software available from CRAN
Modalclust is an R package which performs Hierarchical Mode Association Clustering (HMAC) along with its parallel implementation (PHMAC) over several processors. Modal clustering techniques are especially designed to efficiently extract clusters in high dimensions with arbitrary density shapes. Further, clustering is performed over several resolutions and the results are summarized as a hierarchical tree, thus providing a model-based multi-resolution cluster analysis. Finally, we provide a novel parallel implementation of HMAC which performs the clustering job over several processors, thereby dramatically increasing the speed of the clustering procedure, especially for large data sets. This package also provides a number of functions for visualizing clusters in high dimensions, which can also be used with other clustering software.