DNA Microarray Gene Expression Analysis
Date added: 19-02-06
Each of the experiments mentioned was based on the abundance of gene expression found in a given sample. Gene expression can be thought of as gene activation, i.e., the higher the expression value, the more often that particular gene is doing work. Quantification of these abundances was made possible by the advent of microarray technologies. A full discussion of microarray technology is beyond the scope of this paper, but a basic explanation is as follows. A simple microarray consists of a solid surface, such as glass, covered in thousands of microscopic divots. Each divot contains many synthetically produced, single-stranded DNA molecules, which represent a particular gene. Two groups of sample DNA are prepared: a control group, which provides a standard of comparison, and the experimental group. Both groups are labeled with a fluorescent dye, typically green for the control group and red for the experimental group, and then introduced to the microarray platform. Upon contact with the platform, the sample strands of DNA bind to the complementary synthetic DNA found in the divots, in a process called hybridization. Each divot then emits a level of fluorescence that depends on the abundance of each group (control, experimental, both, or neither) that adhered to it. By measuring the intensity of the fluorescence at each dye’s wavelength, it is possible to measure the abundance of experimental vs. control DNA that bound to a particular gene’s site.
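To make the comparison of experimental vs. control abundance concrete, the following is a minimal sketch in Python. It assumes the comparison is summarized as a base-2 log ratio of the red (experimental) to green (control) intensity, which is a common convention but is not stated in the text above. The function name log2_ratio, the floor parameter, and the intensity values are all hypothetical and chosen for illustration only.

```python
import math


def log2_ratio(red, green, floor=1.0):
    """Return log2(red / green) for a single divot (spot).

    red   -- fluorescence intensity of the experimental (red) channel
    green -- fluorescence intensity of the control (green) channel
    floor -- hypothetical lower bound that avoids taking the log of zero
    """
    red = max(red, floor)
    green = max(green, floor)
    return math.log2(red / green)


# Made-up intensities for three spots; not taken from any data set.
spots = {
    "gene_a": (800.0, 200.0),  # more experimental than control DNA bound
    "gene_b": (150.0, 150.0),  # equal abundance in both groups
    "gene_c": (100.0, 400.0),  # less experimental than control DNA bound
}

ratios = {name: log2_ratio(r, g) for name, (r, g) in spots.items()}
for name, value in ratios.items():
    print(f"{name}: {value:+.2f}")
```

Under this convention, a positive value indicates that the gene is more abundant in the experimental group, a negative value that it is more abundant in the control group, and a value near zero that the two groups are roughly equal.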
As mentioned in the introduction, one of the major hurdles impeding the analysis of gene expression data is the lack of samples. Thankfully, recent years have seen an increase in both the ability and the motivation of researchers to store their data sets in publicly accessible repositories. For our experiments, we access data from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) [all our datasets with accession numbers]. All accession numbers are provided in the references section. These files can be accessed directly in several formats, as well as via the GEOquery package (a component of Bioconductor) for R [geo, bioconductor].
Finding a method to represent data collected by different research groups, for various reasons, and on ever-changing technology is by no means trivial. One of the few constants among the data sets is the expression levels, but these are only one component of a data set. The designers of GEO have incorporated several methods for encapsulating all of the data relevant to each study. We chose to use the Simple Omnibus Format in Text (SOFT) for our work. Depending upon the level of granularity sought, these files can contain up to three main sections of data: phenotype, feature, and expression. They may also contain metadata about the experiment, such as who produced it, the researchers' location, and submission dates.
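As an illustration of how a SOFT file is laid out, the following is a minimal sketch of a line-based reader in Python. It relies on the SOFT conventions that lines beginning with "^" introduce an entity, lines beginning with "!" carry attributes, lines beginning with "#" describe table columns, and a tab-delimited table sits between "..._table_begin" and "..._table_end" markers. The function name parse_soft, the sample identifier GSM_EXAMPLE, and all values are hypothetical. This is not the reader used in our work, and real files have more cases than this sketch handles.

```python
def parse_soft(text):
    """Split SOFT-style text into entities with attributes and table rows."""
    entities = []
    current = None
    in_table = False
    header = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("^"):
            # Entity indicator line, e.g. "^SAMPLE = GSM..."
            kind, _, name = line[1:].partition("=")
            current = {"type": kind.strip(), "name": name.strip(),
                       "attributes": {}, "columns": {}, "table": []}
            entities.append(current)
            in_table, header = False, None
        elif current is None:
            continue  # ignore anything before the first entity
        elif line.startswith("!") and line.lower().endswith("_table_begin"):
            in_table, header = True, None
        elif line.startswith("!") and line.lower().endswith("_table_end"):
            in_table = False
        elif in_table:
            fields = line.split("\t")
            if header is None:
                header = fields  # first table row holds the column names
            else:
                current["table"].append(dict(zip(header, fields)))
        elif line.startswith("!"):
            # Attribute line; a key may repeat, so values are kept in a list.
            key, _, value = line[1:].partition("=")
            current["attributes"].setdefault(key.strip(), []).append(value.strip())
        elif line.startswith("#"):
            # Column description for the data table.
            key, _, value = line[1:].partition("=")
            current["columns"][key.strip()] = value.strip()
    return entities


# Hypothetical SOFT fragment, written inline so the sketch is self-contained.
SAMPLE_TEXT = "\n".join([
    "^SAMPLE = GSM_EXAMPLE",
    "!Sample_title = hypothetical tumor sample",
    "!Sample_characteristics_ch1 = age: 54",
    "#ID_REF = probe identifier",
    "#VALUE = normalized expression value",
    "!sample_table_begin",
    "ID_REF\tVALUE",
    "probe_1\t7.25",
    "probe_2\t3.10",
    "!sample_table_end",
])

sample = parse_soft(SAMPLE_TEXT)[0]
print(sample["name"], sample["attributes"]["Sample_title"], sample["table"])
```

In this fragment, the attribute lines carry sample-level information and the table carries one expression value per probe.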
Phenotype data contains information about the origin of the sample DNA being analyzed. For our studies, this included age, survival rate, tumor size, and several other pieces of clinical information useful in classifying the tumor. Typically, this data also includes details on the research group, how the sample was collected, and any protocols it was subjected to. Feature data describes the microarray platform used in the experiment and the data subsequently gathered. This includes the raw fluorescence values that are used to calculate the relative abundance of each gene, along with identifiers such as gene name and gene symbol. It may also contain other platform-related information, such as the actual gene sequence and the row and column on the plate. Expression data is the abundance value assigned to each gene in a particular sample by the microarray analysis. It is stored as a matrix of real numbers.
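The relationship between the three sections can be sketched as a small Python container. It assumes that the expression matrix has one row per gene (matching the feature data) and one column per sample (matching the phenotype data). That orientation is a common convention rather than something stated above. The class name ExpressionDataset, its methods, and all identifiers and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExpressionDataset:
    phenotype: dict   # sample id -> clinical attributes of that sample
    feature: dict     # gene id -> platform annotation for that gene
    expression: dict  # gene id -> {sample id -> abundance value}

    def validate(self):
        """Check that every expression row and column has an annotation."""
        samples = set(self.phenotype)
        if set(self.expression) != set(self.feature):
            raise ValueError("expression rows do not match feature ids")
        for gene, row in self.expression.items():
            if set(row) != samples:
                raise ValueError(f"row {gene} does not match sample ids")
        return True

    def as_matrix(self):
        """Return (gene ids, sample ids, genes x samples matrix of floats)."""
        genes = sorted(self.feature)
        samples = sorted(self.phenotype)
        matrix = [[float(self.expression[g][s]) for s in samples] for g in genes]
        return genes, samples, matrix


# Made-up data: two samples and two genes.
dataset = ExpressionDataset(
    phenotype={"s1": {"age": 54, "tumor_size_mm": 21},
               "s2": {"age": 61, "tumor_size_mm": 35}},
    feature={"g1": {"symbol": "GENE1", "row": 1, "column": 4},
             "g2": {"symbol": "GENE2", "row": 1, "column": 5}},
    expression={"g1": {"s1": 7.25, "s2": 6.80},
                "g2": {"s1": 3.10, "s2": 4.45}},
)

dataset.validate()
genes, samples, matrix = dataset.as_matrix()
print(genes, samples, matrix)
```

Keeping the three sections keyed by shared gene and sample identifiers makes it possible to detect a mismatch, such as an expression value with no corresponding clinical record, before any analysis is run.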