Butte Lab at Stanford Medical Informatics




GENOTEXT (GENOmics conTEXT) is an automated system that determines the experimental context of samples and data sets in the NCBI Gene Expression Omnibus (GEO). It extracts the contextual identifiers and annotations of samples and models these annotations using the largest available compendium of biomedical vocabularies, the Unified Medical Language System (UMLS).

The program and its results are pending publication.

The following software is available from this project. These programs are not meant to be downloaded and run as-is; each will need modifications for one's own computer system and database structure.

  • A program in Perl that iterates through GSE .soft files in a directory, extracts seven free-text annotations of each sample and series, and stores these in a MySQL database.
  • A program in Java that iterates through the annotations in the MySQL database and uses MetaMap to map the text into UMLS concepts. This program requires the MetaMap Transfer libraries.
  • A program in Java that iterates through the mappings and removes many (but not all!) that were manually determined to be incorrect.
  • A program in R and its data file that together can create a dendrogram figure, clustering GEO data sets by the concepts mapped from their annotations. See the gallery for an example of this dendrogram.
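
As a rough illustration of the first step only, the sketch below reads the lines of a GEO SOFT file and collects free-text annotations for each series and sample. It is written in Python rather than Perl, keeps the results in memory instead of a MySQL database, and is not the project's own code. The function name parse_soft and the set of fields in FIELDS are assumptions made for this example; the seven annotations the project actually extracts are not listed on this page.

```python
from collections import defaultdict

# Hypothetical selection of free-text SOFT attributes (an assumption for
# illustration; the project's actual seven annotations are not listed here).
FIELDS = {
    "Series_title",
    "Series_summary",
    "Sample_title",
    "Sample_source_name_ch1",
    "Sample_characteristics_ch1",
    "Sample_description",
}


def parse_soft(lines):
    """Return {(entity_type, accession): {field: [values]}} for FIELDS."""
    records = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("^"):
            # Entity header line, e.g. "^SAMPLE = GSM1".
            kind, _, accession = line[1:].partition("=")
            kind = kind.strip()
            if kind in ("SERIES", "SAMPLE"):
                key = (kind, accession.strip())
                current = records.setdefault(key, defaultdict(list))
            else:
                # Skip attributes of other entities (platforms, databases).
                current = None
        elif line.startswith("!") and current is not None:
            # Attribute line, e.g. "!Sample_title = liver, control".
            name, sep, value = line[1:].partition("=")
            name = name.strip()
            if sep and name in FIELDS:
                # An attribute may repeat, so keep every value in order.
                current[name].append(value.strip())
    return records


if __name__ == "__main__":
    demo = [
        "^SERIES = GSE1",
        "!Series_title = Example series",
        "^SAMPLE = GSM1",
        "!Sample_title = liver, control",
        "!Sample_characteristics_ch1 = tissue: liver",
    ]
    for key, fields in parse_soft(demo).items():
        print(key, dict(fields))
```

In a real pipeline the returned records would be written to database tables, and the lines would come from each .soft file in a directory rather than from an in-memory list.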

Funding for this work was provided in part by:


Updated January 10, 2006