Butte Lab at Stanford Medical Informatics

Downloading GENOTEXT programs

The following software is available from this project. These programs are not meant to be run as downloaded; each will need modification for one's own computer system and database structure.

  • A program in Perl that iterates through GSE .soft files in a directory, extracts seven free-text annotations of each sample and series, and stores these in a MySQL database.
  • A program in Java that iterates through the annotations in the MySQL database and uses MetaMap to map the text into UMLS concepts. This program requires the MetaMap Transfer libraries.
  • A program in Java that iterates through the mappings and removes many (but not all!) that were manually determined to be incorrect.
  • A program in R and its data file that together can create a dendrogram figure, clustering GEO data sets by the concepts mapped from their annotations. See the gallery for an example of this dendrogram.
  • A data file containing the mappings between GEO annotations and UMLS String Unique Identifiers (SUI). To make effective use of this file, you will need the UMLS MRCON or MRCONSO table to convert SUI to Concept Unique Identifiers (CUI) or readable strings. This file is tab-delimited and is approximately 16 MB in size.
    • Column 1 contains the GEO object (GDS = GEO Data Set, GSE = GEO Series, GSM = GEO Sample)
    • Column 2 contains the annotation (title, description, source, keyword)
    • Column 3 contains a phrase of the annotation
    • Column 4 contains the score of mapping (from the MetaMap programming libraries)
    • Column 5 contains the UMLS String Unique Identifier (SUI)
    • Column 6 indicates whether the mapping was considered an erroneous mapping and removed by the Java program (above); 1 indicates removed, 0 indicates not removed
  • A data file containing relations between genes (indicated by NCBI Gene ID, formerly LocusLink) and UMLS concepts. A relation between a gene and a concept indicates that a statistically significant difference in expression level is seen for that gene in GEO data sets annotated with that concept, compared with those data sets not annotated with the concept. To make effective use of this file, you will need the UMLS MRCON or MRCONSO table to convert CUI to readable strings, as well as access to the NCBI Gene site to translate IDs into gene names and symbols. To reproduce the analysis in the manuscript, you will also require the Homologene table. This file is tab-delimited and is approximately 22 MB in size.
    • Column 1 contains the LocusLink (now NCBI Gene) identifier for this relation
    • Column 2 contains the UMLS Concept Unique Identifier (CUI) for this relation, obtained from the String Unique Identifier
    • Column 3 indicates whether this gene's expression was statistically significantly higher (1) or lower (0) in those GEO data sets annotated with the concept, compared to those GEO data sets measuring this gene but not annotated with the concept
    • Column 4 contains the p-value from the t-test performed for this gene and concept across the GEO data sets with and without the annotative concept
    • Column 5 contains the q-value from the t-test performed, computed using 100 permutations
    • Column 6 contains the mean rank-normalized expression level for this gene in the GEO data sets measuring this gene and annotated with the concept
    • Column 7 contains the mean rank-normalized expression level for this gene in the GEO data sets measuring this gene and not annotated with the concept
    • Column 8 contains the number of GEO data sets measuring this gene and annotated with the concept
    • Column 9 contains the number of GEO data sets measuring this gene and not annotated with the concept
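
The six-column layout of the annotation-to-SUI mapping file described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the file has no header row and that the score column is numeric, neither of which is stated on this page. The names Mapping, read_mappings and kept_mappings are hypothetical, and the sample rows are invented purely for illustration.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class Mapping(NamedTuple):
    geo_object: str   # column 1: GDS, GSE or GSM identifier
    annotation: str   # column 2: title, description, source or keyword
    phrase: str       # column 3: phrase taken from the annotation
    score: float      # column 4: MetaMap mapping score (assumed numeric)
    sui: str          # column 5: UMLS String Unique Identifier
    removed: bool     # column 6: True if flagged as erroneous (value 1)


def read_mappings(handle: Iterable[str]) -> Iterator[Mapping]:
    """Yield one Mapping per well-formed tab-delimited line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 6:
            continue  # skip blank or malformed lines
        geo, annotation, phrase, score, sui, removed = row
        yield Mapping(geo, annotation, phrase, float(score), sui,
                      removed.strip() == "1")


def kept_mappings(handle: Iterable[str]) -> Iterator[Mapping]:
    """Yield only mappings that were not removed as erroneous."""
    return (m for m in read_mappings(handle) if not m.removed)


if __name__ == "__main__":
    # Invented sample rows, not taken from the real data file.
    sample = io.StringIO(
        "GSM1\ttitle\tliver tissue\t1000\tS0000001\t0\n"
        "GSE2\tdescription\tcell line\t861\tS0000002\t1\n"
    )
    for mapping in kept_mappings(sample):
        print(mapping.geo_object, mapping.sui, mapping.score)
```

For the real file, an open file handle can be passed in place of the in-memory sample. Each SUI would still need to be resolved through the UMLS MRCON or MRCONSO table as noted above.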

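The nine-column gene-concept relation file described above could be loaded in a similar way. Again this is a sketch under assumptions: no header row, numeric p-values, q-values and means, and integer counts. GeneConceptRelation, read_relations and filter_by_q are hypothetical names, the 0.05 cutoff is an arbitrary illustration rather than a threshold taken from the manuscript, and the sample rows are invented.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class GeneConceptRelation(NamedTuple):
    gene_id: str              # column 1: LocusLink (NCBI Gene) identifier
    cui: str                  # column 2: UMLS Concept Unique Identifier
    higher: bool              # column 3: True if expression is higher (1)
    p_value: float            # column 4: t-test p-value
    q_value: float            # column 5: q-value from 100 permutations
    mean_annotated: float     # column 6: mean rank-normalized level, annotated
    mean_unannotated: float   # column 7: mean rank-normalized level, not annotated
    n_annotated: int          # column 8: data sets annotated with the concept
    n_unannotated: int        # column 9: data sets not annotated


def read_relations(handle: Iterable[str]) -> Iterator[GeneConceptRelation]:
    """Yield one relation per well-formed tab-delimited line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 9:
            continue  # skip blank or malformed lines
        yield GeneConceptRelation(
            row[0], row[1], row[2].strip() == "1",
            float(row[3]), float(row[4]),
            float(row[5]), float(row[6]),
            int(row[7]), int(row[8]),
        )


def filter_by_q(relations: Iterable[GeneConceptRelation],
                max_q: float = 0.05) -> Iterator[GeneConceptRelation]:
    """Keep relations whose q-value is at or below max_q (illustrative cutoff)."""
    return (r for r in relations if r.q_value <= max_q)


if __name__ == "__main__":
    # Invented sample rows, not taken from the real data file.
    sample = io.StringIO(
        "1234\tC0000001\t1\t0.001\t0.01\t0.8\t0.5\t12\t300\n"
        "5678\tC0000002\t0\t0.04\t0.20\t0.3\t0.6\t7\t250\n"
    )
    for rel in filter_by_q(read_relations(sample)):
        direction = "higher" if rel.higher else "lower"
        print(rel.gene_id, rel.cui, direction, rel.q_value)
```

Gene identifiers and CUIs are kept as strings here so that they can be joined directly against NCBI Gene and UMLS tables without reformatting.
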

Updated October 1, 2004