Tissue samples and the SAGE method. RNA for normal tissues was obtained from the following sources: colon epithelial cells isolated from sections of normal colon mucosa from two patients [6]; HaCaT keratinocyte cells; normal mammary epithelial cells from two individuals (Clonetics); normal bronchial epithelial cells from two individuals [10]; normal melanocytes from two individuals (Cascade Biologics); normal cultured monocytes, dendritic cells and TNF-activated dendritic cells; two normal kidney epithelial cell lines; cultured chondrocyte cells from two normal individuals and one patient with osteoarthritic disease; normal fetal cardiomyocytes in normoxic and hypoxic conditions; and normal brain white matter from two patients and normal cultured astrocyte cells. RNA for diseased tissues was obtained from the following sources: primary colon adenocarcinomas from two patients, and the HCT116, DLD1, HT29, Caco2, SW837, SW480, and RKO colon cancer cell lines cultured in vitro in a variety of different cellular conditions including log phase growth, G1/G2 phase growth arrest, and apoptosis [5,6,7,8]; primary pancreatic adenocarcinomas from two patients and ASPC-1 and PL-45 pancreatic cancer cell lines [6]; breast cancer cell lines 21-PT, 21-MT, MDA-468, SK-BR3, and BT-474; primary lung squamous cell cancers from two patients [10], primary lung adenocarcinoma from one patient, and the A549 lung cancer cell line [10]; primary melanomas from 3 patients; kidney epithelial cell lines from two patients with polycystic kidney disease; hemangiopericytomas from 5 patients; and primary glioblastoma tumors from two patients and the H392 glioblastoma cell line. Isolation of polyadenylate RNA and the SAGE method for all tissues were performed as previously described [1,2]. Detailed protocols for the SAGE method are available from the authors upon request.

Data analysis. The SAGE software [1] was used to analyze raw sequence data and to identify a total of 3,668,175 SAGE tags. Of these, 171,346 tags (4.7%) corresponded to linker sequences and were removed from further analysis. The remaining 3,496,829 tags were derived from transcript sequences, but a small fraction of these contained sequencing errors. SAGE analysis of yeast [2], for which the entire genome sequence is known, demonstrated a sequencing error rate of ~0.7% per bp, translating to a tag error rate of 6.8% for a 10-bp tag (1 - 0.993^10), in accord with sequence errors measured in the current data set. Therefore, to provide as accurate an estimate of unique genes as possible, we accounted for sequencing errors in two ways. First, we only considered tags that occurred at least twice in the data set. Although this requirement might have removed legitimate transcript tags expressed at very low levels (less than approximately 0.2 copies per cell, or 2 copies in 3,496,829 transcript tags), it eliminated the majority of sequencing errors (172,276 tags). Second, because of the size of the data set utilized, it was possible that the same sequencing error in a given tag might be observed multiple times. To account for these, tags with expression levels high enough to give multiple redundant errors were analyzed for single base substitutions, insertions and deletions. If the observed expression level of a tag did not exceed its expected incidence due to redundant errors by a factor of five, it was assumed to be the result of a repeated sequencing error. This identified and removed an additional 27,051 unique tags (156,174 total tags), a number very similar to estimates of multiple sequencing errors obtained by Monte Carlo simulations. In total, these corrections amount to a sequencing error rate of approximately 9.4%, suggesting that our analyses more than fully accounted for sequencing errors and that the remaining 134,135 unique transcript tags represented a conservative accounting of legitimate transcripts.
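
The error-correction arithmetic and the two filters described above can be sketched in Python. This is only an illustration under stated assumptions, not the SAGE software itself: the function names and toy tag counts are hypothetical, only single-base substitutions are modeled (the analysis described above also covered insertions and deletions), and a miscalled base is assumed to be equally likely to become any of the three other bases.

```python
from collections import Counter

BASES = "ACGT"

def tag_error_rate(per_base_error, tag_length=10):
    """Probability that a tag of tag_length bases has at least one miscalled base."""
    return 1.0 - (1.0 - per_base_error) ** tag_length

def drop_singletons(counts):
    """First filter: keep only tags observed at least twice."""
    return Counter({tag: n for tag, n in counts.items() if n >= 2})

def substitution_neighbors(tag):
    """Yield every tag differing from `tag` by exactly one base substitution."""
    for i, base in enumerate(tag):
        for alt in BASES:
            if alt != base:
                yield tag[:i] + alt + tag[i + 1:]

def drop_redundant_errors(counts, per_base_error, factor=5.0):
    """Second filter (substitutions only, an assumption of this sketch).

    A tag is kept only if its observed count exceeds `factor` times the count
    expected from repeated single-base errors in its observed neighbors.
    """
    expected = {}
    for parent, n in counts.items():
        # Assumed error model: each specific substitution gets 1/3 of the
        # per-base error rate.
        share = n * per_base_error / 3.0
        for neighbor in substitution_neighbors(parent):
            if neighbor in counts:
                expected[neighbor] = expected.get(neighbor, 0.0) + share
    return Counter({tag: n for tag, n in counts.items()
                    if n > factor * expected.get(tag, 0.0)})

# Figures quoted in the text.
TRANSCRIPT_TAGS = 3_496_829
removed_fraction = (172_276 + 156_174) / TRANSCRIPT_TAGS

# Hypothetical toy counts: one abundant tag, one likely repeated error of it,
# one unrelated tag, and one singleton.
toy = Counter({"AAAAAAAAAA": 10000, "AAAAAAAAAC": 20,
               "CCCCCCCCCC": 20, "GGGGGGGGGG": 1})
kept = drop_redundant_errors(drop_singletons(toy), per_base_error=0.007)
```

In the toy data, "AAAAAAAAAC" is expected about 23 times as a repeated error of the abundant tag, so its 20 observations do not exceed five times that expectation and it is discarded, while the unrelated tag with the same count is kept.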

Transcript tags were matched to known genes and ESTs by use of tables containing matching 10-bp transcript sequences, UniGene clusters, GenBank accession numbers and functional descriptions downloaded from the SAGEmap web site on Feb 23, 1999 (UniGene build 70), and the Microsoft Access software. As UniGene cluster numbers may change over time, the most recent tag-to-cluster mapping can be obtained online for each transcript tag individually. A total of 37,534 distinct transcripts from the UniGene database contained polyadenylation signals or polyadenylated tails and matched the collection of SAGE transcript tags; these corresponded to 23,534 unique UniGene clusters. This indicated that each unique gene would be represented by 1.6 SAGE tags on average (37,534 / 23,534). We therefore accounted for this factor in our estimates of unique genes. For example, we estimate that the 134,135 transcripts observed in our complete data set correspond to 84,103 unique genes.
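
The gene-count correction above is a simple ratio, and a short Python sketch reproduces the quoted figures. The constant and function names are hypothetical, chosen only for this illustration; the numbers are those given in the text.

```python
# Figures quoted in the text.
MATCHED_TRANSCRIPTS = 37_534   # UniGene transcripts matching SAGE tags
UNIGENE_CLUSTERS = 23_534      # unique clusters those transcripts fall into
OBSERVED_UNIQUE_TAGS = 134_135 # unique transcript tags after error correction

def tags_per_gene(transcripts, clusters):
    """Average number of distinct SAGE tags representing one gene."""
    return transcripts / clusters

def estimate_unique_genes(unique_tags, ratio):
    """Scale the unique tag count down by the tags-per-gene ratio."""
    return round(unique_tags / ratio)

ratio = tags_per_gene(MATCHED_TRANSCRIPTS, UNIGENE_CLUSTERS)
genes = estimate_unique_genes(OBSERVED_UNIQUE_TAGS, ratio)
```

Note that the estimate of 84,103 follows from the unrounded ratio (about 1.595); dividing by the rounded value of 1.6 would give a slightly smaller number.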

Transcript abundance per cell was determined simply by dividing the observed number of tags for a given transcript by the total number of transcripts obtained. An estimate of about 300,000 transcripts per cell was used to convert the abundances to copies per cell. For tissue-specific transcripts, only transcript tags expressed at nominally 10 transcript copies per cell were considered in order to normalize for tissues with fewer total tags analyzed.
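
The conversion to copies per cell can be sketched as follows. The function names and the 50,000-tag library size are hypothetical, and reading the tissue-specific cutoff as "at least 10 copies per cell" is an assumption of this sketch rather than something stated above.

```python
TRANSCRIPTS_PER_CELL = 300_000  # estimate used in the text

def copies_per_cell(tag_count, total_tags,
                    transcripts_per_cell=TRANSCRIPTS_PER_CELL):
    """Convert a tag count within a library to estimated copies per cell."""
    return tag_count * transcripts_per_cell / total_tags

def meets_threshold(tag_count, total_tags, min_copies=10):
    """Assumed reading of the cutoff: at least min_copies copies per cell."""
    return copies_per_cell(tag_count, total_tags) >= min_copies

# Two tags in the full data set of 3,496,829 transcript tags.
lowest_retained = copies_per_cell(2, 3_496_829)

# Hypothetical small library of 50,000 tags: 2 tags pass, 1 tag does not.
small_library_two = meets_threshold(2, 50_000)
small_library_one = meets_threshold(1, 50_000)
```

Two tags in the full data set correspond to roughly 0.17 copies per cell, consistent with the figure of approximately 0.2 quoted in the data analysis above.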


Copyright © 2003 Sagenet. All Rights Reserved.