Simple Introduction to Statistical Aspects of Microarrays. (For those who want to get an idea of what the field is about...)
The Microarray Itself
Microarrays are glass slides on which many thousands of genes (short pieces of DNA) are preprinted. In a laboratory, one receives such a preprinted glass slide and prepares two samples of mRNA: one sample is the control and the other is the sample of interest (say, healthy vs. cancerous). One labels one sample green (Cy3) and the other red (Cy5) and mixes the two samples. The mixture is then applied to the slide, so that it comes into contact with each of the many thousands of genes. After a while one washes the slide and scans it to find out how much green and how much red has stuck to the slide at each gene. So for each gene one observes two intensities, which indicate how much the gene was "expressed" in each of the two samples, healthy vs. cancerous. The general idea is that if the red and green intensities are approximately the same for a particular gene, then this gene is probably not involved in the disease, whereas if the two intensities are wildly different, this could be an indication that the gene is involved.
Statistical Issues
- A. Scanning is imprecise, and it is not really clear what the "intensity" is that we are looking for. Nowadays, one generally takes the median green or red intensity over the pixels of each spot and subtracts the background intensity from it. Is this sensible? Are there better ways?
- B. For statistical purposes one frequently transforms the two intensities immediately to a ratio (or rather a log ratio) and uses only this number in the analysis for each gene. This is to cancel out some of the spot-specific scanning problems (if a speck of dust obscures the intensity of green, it will also obscure the intensity of red at that place). However, doesn't one lose information in this way? Are there better ways to correct for scanning and other spot-specific problems? One problem is that the ratio becomes very unstable when the intensities become close to zero.
- C. After the log transform, other kinds of normalization frequently take place. Which normalizations are reasonable?
- D. The issue of the reference or control sample is understudied. Frequently, statisticians as well as biologists pretend that the log ratio data are absolute expression levels. It is always important to remember that the expressions are relative to the reference. This is a typical issue of statistical experimental design, a field that can be very productively brought to bear on the design of microarray experiments.
- E. Short article about problems of statistical microarray normalisation
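Points A to C above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the intensities are made up, the function names `log_ratios` and `median_center` are hypothetical, the `floor` value is an arbitrary safeguard, and subtracting the slide-wide median is just one of many possible normalizations, not a recommendation.

```python
import numpy as np

def log_ratios(red_fg, red_bg, green_fg, green_bg, floor=1.0):
    """Background-subtract each channel and return log2(red / green) per spot.

    `floor` is an assumed lower bound that keeps the logarithm defined when
    background subtraction leaves a value at or below zero.
    """
    red = np.maximum(np.asarray(red_fg, dtype=float) - red_bg, floor)
    green = np.maximum(np.asarray(green_fg, dtype=float) - green_bg, floor)
    return np.log2(red / green)

def median_center(m):
    """Subtract the slide-wide median log ratio (one simple normalization)."""
    m = np.asarray(m, dtype=float)
    return m - np.median(m)

# Made-up median pixel intensities for four spots and a common background.
red_fg = np.array([2100.0, 540.0, 130.0, 104.0])
green_fg = np.array([1100.0, 520.0, 110.0, 102.0])
background = 100.0

m = log_ratios(red_fg, background, green_fg, background)
m_centered = median_center(m)

# Instability near zero: add 5 counts to the red channel of a bright spot
# and of a dim spot, and compare how far each log ratio moves.
bright_shift = abs(log_ratios(2105.0, background, 1100.0, background) - m[0])
dim_shift = abs(log_ratios(109.0, background, 102.0, background) - m[3])

print("log ratios:     ", np.round(m, 3))
print("median-centered:", np.round(m_centered, 3))
print("shift from +5 counts (bright, dim):", bright_shift, dim_shift)
```

In this toy data the first and last spots have the same log ratio, although the last one rests on only a few counts above background. The same 5-count disturbance barely moves the bright spot but shifts the dim one by more than a factor of two.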
Dynamic modelling of expression data
Currently, many of the microarray experiments that are carried out are repeated experiments in which the experimental mRNA samples put on the slides are taken from micro-organisms 5, 10, 15, 30, etc. minutes after they are given some kind of treatment (e.g., a heat shock applied to yeast). For each glass slide, the reference/control sample is generally the mRNA of the same organism just before the treatment. In this way, a time series of differential expressions is obtained for each gene at 5, 10, 15, etc. minutes after the treatment. It is hoped that these data can tell us something about how the genes interact with each other. A general problem is that at most 15-20 time points are observed, and frequently fewer than 10.
Statistical Issues
- A. Many clustering techniques on the market these days simply ignore the time structure and treat the data as a 6000 by 10 matrix (genes by time points) whose rows are to be clustered into separate groups of genes. However, clustering should take the time structure into account. Moreover, what is the biological use of finding clusters of genes that express themselves in the same way? Such genes are called "co-regulated" genes, and biologists suspect that co-regulation is an essential part of the redundancy built into the genetic network. However, clustering tells us nothing about real interactions or causal relationships between genes.
- B. The time series are so short that standard time series techniques are not going to work. Therefore, one has to come up with new methods to describe the genetic framework. Notice that even if one makes a simplistic linear model in which the previous time point determines the next one completely (except for random noise) and allows for interaction between genes, one has a 6000 * 6000 dimensional parameter space, whereas there are only 10 * 6000 observations.
- C. What is the biological structure underlying gene interactions, e.g., on what scale can we expect genes to interact with each other? Also, it is important to use the information that the genetic network is only loosely connected. This means that many of the 6000 * 6000 parameters are zero.
- D. Remember that the issue of the reference/control sample is important here as well. If one chooses a different reference, one can get quite different expression values. Again, it is good to keep in mind that we are dealing with differential expressions.
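The dimension problem in points B and C can be illustrated with a scaled-down simulation. This is a minimal sketch under assumptions that are not part of the text above: 50 genes instead of 6000, a randomly generated sparse interaction matrix `A`, Gaussian noise, and made-up values for the sparsity level and noise scale. It shows only that the simplistic linear model cannot be identified by ordinary least squares from so few time points.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes, n_times = 50, 10  # scaled down from about 6000 genes for illustration

# Sparse interaction matrix: most gene-to-gene effects are exactly zero,
# mimicking a loosely connected network (the 4% density is an assumption).
A = np.zeros((n_genes, n_genes))
mask = rng.random((n_genes, n_genes)) < 0.04
A[mask] = rng.normal(scale=0.3, size=mask.sum())

# Simplistic linear model: x[t + 1] = A @ x[t] + noise
X = np.empty((n_times, n_genes))
X[0] = rng.normal(size=n_genes)
for t in range(n_times - 1):
    X[t + 1] = A @ X[t] + 0.05 * rng.normal(size=n_genes)

# Estimating A by least squares regresses X[1:] on X[:-1]. The predictor
# matrix has only n_times - 1 rows, so its rank cannot reach n_genes and
# infinitely many matrices A fit the data equally well.
predictors = X[:-1]
rank = np.linalg.matrix_rank(predictors)

n_parameters = n_genes * n_genes
n_observations = n_times * n_genes

print("parameters:", n_parameters, "observations:", n_observations)
print("rank of predictor matrix:", rank, "needed for a unique fit:", n_genes)
print("full-size problem:", 6000 * 6000, "parameters vs", 10 * 6000, "observations")
```

Any workable method therefore has to bring in extra structure, such as the sparsity mentioned in point C, rather than rely on the data alone.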
We have tried to gather all information ourselves to secure accuracy. However, if you notice inaccuracies or want to add something, please email us.