Methods
Tissue samples and the SAGE method. RNA for normal tissues was obtained
from the following sources: colon epithelial cells isolated
from sections of normal colon mucosa from two patients⁶;
HaCaT keratinocyte cells; normal mammary epithelial cells
from two individuals (Clonetics); normal bronchial epithelial
cells from two individuals¹⁰;
normal melanocytes from two individuals (Cascade Biologics);
normal cultured monocytes, dendritic cells and TNF activated
dendritic cells; two normal kidney epithelial cell lines;
cultured chondrocyte cells from two normal individuals and
one patient with osteoarthritic disease; normal fetal cardiomyocytes
in normoxic and hypoxic conditions; and normal brain white
matter from two patients and normal cultured astrocyte cells.
RNA for diseased tissues was obtained from the following sources:
primary colon adenocarcinomas from two patients, HCT116, DLD1,
HT29, Caco2, SW837, SW480, and RKO colon cancer cell lines
cultured in vitro in a variety of different cellular conditions
including log phase growth, G1/G2 phase growth arrest, and
apoptosis⁵,⁶,⁷,⁸; primary
pancreatic adenocarcinomas from two patients and ASPC-1 and
PL-45 pancreatic cancer cell lines⁶;
breast cancer cell lines 21-PT, 21-MT, MDA-468, SK-BR3, and
BT-474; primary lung squamous cell cancers from two patients¹⁰,
primary lung adenocarcinoma from one patient, and the A549
lung cancer cell line¹⁰;
primary melanomas from three patients; kidney epithelial cell
lines from two patients with polycystic kidney disease; hemangiopericytomas
from five patients; primary glioblastoma tumors from two patients
and the H392 glioblastoma cell line. Isolation of polyadenylate
RNA and the SAGE method for all tissues were performed as previously
described¹,². Detailed protocols
for the SAGE method are available from the authors upon request.
Data analysis. The SAGE software¹
was used to analyze raw sequence data and to identify a total
of 3,668,175 SAGE tags. Of these, 171,346 tags (4.7%) corresponded
to linker sequences and were removed from further analysis.
The remaining 3,496,829 tags were derived from transcript
sequences, but a small fraction of these contained sequencing
errors. SAGE analysis of yeast², for which the
entire genome sequence is known, demonstrated a sequencing
error rate of ~0.7% per bp, translating to a tag error rate
of 6.8% for a 10-bp tag (1 - 0.993^10), in accord with sequence errors measured
in the current data set. Therefore, to provide as accurate
an estimate of unique genes as possible, we accounted for
sequencing errors in two ways. First, we only considered tags
that occurred at least twice in the data set. Although this requirement
might have removed legitimate transcript tags expressed at
very low levels (less than approximately 0.2 copies per cell,
or 2 copies in 3,496,829 transcript tags), it eliminated the
majority of sequencing errors (172,276 tags). Second, because
of the size of the data set utilized, it was possible that
the same sequencing error in a given tag could be observed multiple
times. To account for such repeated errors, tags with expression levels high
enough to give multiple redundant errors were analyzed for
single base substitutions, insertions and deletions. If the
observed expression level of a tag did not exceed its expected
incidence due to redundant errors by a factor of five, it
was assumed to be the result of a repeated sequencing error.
This identified and removed an additional 27,051 unique tags
(156,174 total tags), a number very similar to estimates of
multiple sequencing errors obtained by Monte Carlo simulations.
In total, these corrections amount to a sequencing error rate
of approximately 9.4%, suggesting that our analyses more than
accounted for sequencing errors and that the remaining 134,135
unique transcript tags represented a conservative accounting
of legitimate transcripts.
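
The tag counts and error rates quoted above can be re-derived with a short calculation. The Python sketch below is illustrative only and is not the SAGE software used for the analysis. The constant names and the function is_probable_repeated_error are hypothetical. The expected incidence due to errors is passed in as an input because the text does not state how it was computed, and reading "did not exceed by a factor of five" as observed <= 5 * expected is one interpretation of the rule.

```python
# Figures quoted in the text.
TOTAL_TAGS = 3_668_175
LINKER_TAGS = 171_346
FIRST_FILTER_TAGS = 172_276       # removed by the minimum-occurrence rule
REPEATED_ERROR_TAGS = 156_174     # removed as repeated sequencing errors
PER_BASE_ERROR = 0.007            # ~0.7% per bp, from the yeast analysis
TAG_LENGTH = 10                   # bp per SAGE tag

# Tags left after discarding linker sequences.
transcript_tags = TOTAL_TAGS - LINKER_TAGS
linker_fraction = LINKER_TAGS / TOTAL_TAGS

# Probability that a tag contains at least one erroneous base.
tag_error_rate = 1 - (1 - PER_BASE_ERROR) ** TAG_LENGTH

# Share of transcript tags removed by both corrections combined.
removed_fraction = (FIRST_FILTER_TAGS + REPEATED_ERROR_TAGS) / transcript_tags


def is_probable_repeated_error(observed, expected_from_errors, factor=5):
    """Flag a tag whose count does not exceed `factor` times the count
    expected from sequencing errors alone (hypothetical helper)."""
    return observed <= factor * expected_from_errors


print(f"transcript tags:  {transcript_tags:,}")
print(f"linker fraction:  {linker_fraction:.1%}")
print(f"tag error rate:   {tag_error_rate:.1%}")
print(f"removed fraction: {removed_fraction:.1%}")
```

Under these assumptions the calculation reproduces the 3,496,829 transcript tags and the 4.7%, 6.8% and 9.4% figures given in the text.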
Transcript tags were matched to known genes
and ESTs by use of tables containing matching 10-bp transcript
sequences, UniGene clusters, GenBank accession numbers and
functional descriptions downloaded from the SAGEmap web site
(http://www.ncbi.nlm.nih.gov/SAGE)⁹
on Feb 23, 1999 (UniGene build 70, http://www.ncbi.nlm.nih.gov/UniGene),
and the Microsoft Access software. As UniGene cluster numbers
may change over time, the most recent tag-to-cluster mapping
can be obtained for each transcript tag individually at http://www.ncbi.nlm.nih.gov/SAGE/SAGEtag.cgi.
A total of 37,534 distinct transcripts from the UniGene database
contained polyadenylation signals or polyadenylated tails
and matched the collection of SAGE transcript tags; these
corresponded to 23,534 unique UniGene clusters. This indicated
that each unique gene would be represented by
1.6 SAGE tags on average (37,534/23,534). We therefore accounted for this
factor in our estimates of unique genes. For example, we estimate
that the 134,135 transcripts observed in our complete data
set correspond to 84,103 unique genes.
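
The conversion from unique transcript tags to unique genes can be sketched as follows. This is a minimal Python illustration with hypothetical variable names, not the Microsoft Access workflow described above. It assumes the correction is a simple division by the average number of tags per UniGene cluster.

```python
# Figures quoted in the text.
MATCHED_TRANSCRIPTS = 37_534     # UniGene transcripts matching SAGE tags
UNIGENE_CLUSTERS = 23_534        # unique clusters those transcripts fall into
OBSERVED_UNIQUE_TAGS = 134_135   # unique transcript tags after error correction

# Average number of distinct SAGE tags per gene (about 1.6).
tags_per_gene = MATCHED_TRANSCRIPTS / UNIGENE_CLUSTERS

# Estimated unique genes, using the unrounded ratio.
estimated_genes = round(OBSERVED_UNIQUE_TAGS / tags_per_gene)

print(f"tags per gene:   {tags_per_gene:.2f}")
print(f"estimated genes: {estimated_genes:,}")
```

The quoted figure of 84,103 is recovered only with the unrounded ratio; dividing by the rounded value 1.6 would give roughly 83,800 instead.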
Transcript abundance per cell was determined
simply by dividing the observed number of tags for a given
transcript by the total number of transcripts obtained. An
estimate of about 300,000 transcripts per cell was used to
convert the abundances to copies per cell. For tissue-specific
transcripts, only transcript tags expressed at nominally ≥10
transcript copies per cell were considered in order to
normalize for tissues with fewer total tags analyzed.
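
The abundance conversion described above can be written as a one-line formula. The Python sketch below is a minimal illustration: the function names are hypothetical, the 30,000-tag library in the usage lines is an invented example rather than one of the libraries in this study, and the 300,000 transcripts per cell is the estimate stated in the text.

```python
TRANSCRIPTS_PER_CELL = 300_000   # estimate stated in the text
TISSUE_SPECIFIC_MIN_COPIES = 10  # nominal threshold for tissue-specific tags


def copies_per_cell(tag_count, total_tags,
                    transcripts_per_cell=TRANSCRIPTS_PER_CELL):
    """Scale a tag's share of all transcript tags to copies per cell."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return tag_count * transcripts_per_cell / total_tags


def passes_tissue_threshold(tag_count, total_tags):
    """True if the tag reaches the nominal 10 copies per cell."""
    return copies_per_cell(tag_count, total_tags) >= TISSUE_SPECIFIC_MIN_COPIES


# Two tags in the full data set: the low-abundance limit mentioned earlier.
print(f"{copies_per_cell(2, 3_496_829):.2f} copies per cell")

# Hypothetical small library of 30,000 tags: one tag is nominally 10 copies.
print(passes_tissue_threshold(1, 30_000))
```

The first usage line corresponds to the "approximately 0.2 copies per cell" limit given for tags seen twice in 3,496,829 transcript tags.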