Vol. 11, Issue 1, 12-27, January 2001
Biased Distribution of Inverted and Direct Alus in the Human Genome: Implications for Insertion, Exclusion, and Genome Stability
1 Laboratory of Structural Biology, 2 Laboratory of Molecular Genetics, National Institute for Environmental Health Sciences, NIH, Research Triangle Park, North Carolina 27709, USA; 3 Genetic Information Research Institute, Sunnyvale, California 94089, USA
Alu sequences, the most abundant class of large dispersed DNA repeats in human chromosomes, contribute to human genome dynamics. Recently we reported that long inverted repeats, including human Alus, can be strong initiators of genetic change in yeast. We proposed that the potential for interactions between adjacent, closely related Alus would influence their stability and this would be reflected in their distribution. We have undertaken an extensive computational analysis of all Alus (the database is at http://dir.niehs.nih.gov/ALU) to better understand their distribution and circumstances under which Alu sequences might affect genome stability. Alus separated by <650 bp were categorized according to orientation, length of regions sharing high sequence identity, distance between highly identical regions, and extent of sequence identity. Nearly 50% of all Alu pairs have long alignable regions (>275 bp), corresponding to nearly full-length Alus, regardless of orientation. There are dramatic differences in the distributions and character of Alu pairs with closely spaced, nearly identical regions. For Alu pairs that are directly repetitive, ~30% have highly identical regions separated by <20 bp, but only when the alignments correspond to near full-size or half-size Alus. The opposite is found for the distribution of inverted repeats: Alu pairs with aligned regions separated by <20 bp are rare. Furthermore, closely spaced direct and inverted Alus differ in their truncation patterns, suggesting differences in the mechanisms of insertion. At larger distances, the direct and inverted Alu pairs have similar distributions. We propose that sequence identity, orientation, and distance are important factors determining insertion of adjacent Alus, the frequency and spectrum of Alu-associated changes in the genome, and the contribution of Alu pairs to genome instability. 
Based on results in model systems and the present analysis, closely spaced inverted Alu pairs with long regions of alignment are likely at-risk motifs (ARMs) for genome instability.
The genomes of many complex organisms contain short, interspersed,
intermediate repetitive elements that are nonviral,
nonautonomous transposons. The Alu sequence
elements, which are derived from 7sRNAs, are the most numerous in
primates (for review, see Novick et al. 1996). The dissemination of diverged Alu repeats over the last 65 million years has contributed to the structure, function, evolution, and diversity of the human genome. Alus have been regarded as "junk" DNA because of their high frequency and inert nature.
Retroposed Alu insertions, however, can coevolve in the
context of their target DNA and hence take on diverse functions
(Szmulewicz et al. 1998). Alus also may have negative consequences and impact on human
health. Besides their ability to retropose to inappropriate regions or
to facilitate unequal homologous recombination events (Deininger and
Batzer 1999 Based on their abundance and an average of approximately 85% identity
(Shen et al. 1991 What determines whether Alu elements will be benign or pose a
threat to human health? Aside from their appearance within important regions of critical genes, we have pursued features of Alu
elements and their arrangement in the human genome that would identify potential destabilizing effects as well as suggest their mode(s) of
integration and/or stability. One approach is to examine the pairwise
distribution of Alus with the idea that adjacent Alus might interact with higher frequency than Alus that are
farther apart. A priori, the appearance and characteristics of
an Alu could be independent of other Alus or,
alternatively, might be influenced by adjacent Alus. For
example, a nonrandom, high frequency of closely spaced Alus
might indicate preferential insertion. Therefore, a study of
Alu pairs may reveal mechanisms of insertion and/or subsequent
preferred changes. Findings from previous computational analysis of
Alu distributions suggest a strong bias towards pairs in which
the Alus are in a direct orientation and closely spaced (Jurka
1995). Driven in part by questions of homologous interactions, we have
developed approaches to analyzing Alu pairs based on regions of alignments and degree of sequence identity. Observations with yeast
and mouse cells suggest that long closely related inverted repeats are
unstable and can cause deletions in eukaryotes (Gordenin et al. 1993). Recently, we demonstrated with a yeast-based model system that inverted
Alu pairs are hotspots for recombination, even if diverged
(Lobachev et al. 2000).
Approaches to Analyzing Alu Pair Distribution
Our analysis of Alu pairs was guided by observations in
model systems that investigated stability of large repeats and their potential for interaction. In particular, results with human
Alus and other long repeats in yeast (Nag and Kurst 1997 First, we identified Alu sequences by comparing them to a
well-defined consensus sequence (see Methods). Alu pairs were
identified in which the separation was <650 bp and the orientations,
direct and inverted, were identified (see Methods). Briefly,
Alus were assigned an orientation D or C according to their
direction in the human genome sequence database: direct
with the poly(A) tail at the 3' end or complementary
with the complement of the tail at the 5' end. There are four
possible orientations for the Alu pairs, corresponding to two
combinations of direct repeats (DD and CC) and two combinations of
inverted repeats (CD with the double-strand AT-rich tails pointed
outward and DC with AT tails pointed inward). In addition to
orientation, the pairs were analyzed according to degree of homology
(defined by the identity score; see Methods) and length of identical
regions across the aligned Alu region ("a" in Fig.
1). Both the distance between Alu
sequences (c) and more importantly the distance between aligned regions
in Alu pairs were determined (see Fig. 1 and Methods). Based
on the results of Lobachev et al. (2000)
The information about Alu distributions is available at the Web site, and the extensive Alu pair tables (CC.html, CD.html, DC.html, DD.html) are interfaced to our database. The analysis utilized human sequence information available up to September 9, 1999. (Our system provides for the efficient incorporation of new data as they are added to the human genome database.) Included in the Web site is a description of the associated chromosome region or gene to aid in identifying regions that may be "at-risk" for possible Alu-associated changes (an example is presented in Fig. 2).
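The orientation scheme described above (D or C for each Alu, giving the DD, CC, CD, and DC pair classes) can be sketched as a small lookup. This is only an illustration of the naming convention in the text, not the programs used in the study; the function name `classify_pair` and the label strings are assumptions introduced here.

```python
# Orientation codes follow the text: D = direct (poly(A) tail at the 3' end),
# C = complementary (complement of the tail at the 5' end).
# The label strings below are illustrative names, not taken from the study's software.
PAIR_CLASSES = {
    "DD": ("direct", None),
    "CC": ("direct", None),
    "CD": ("inverted", "tails-out"),  # AT-rich tails point outward, heads are internal
    "DC": ("inverted", "tails-in"),   # AT tails point inward, heads are external
}


def classify_pair(first, second):
    """Return (pair code, repeat type, tail arrangement) for two adjacent Alus."""
    code = first.upper() + second.upper()
    if code not in PAIR_CLASSES:
        raise ValueError(f"orientations must be 'D' or 'C', got {first!r} and {second!r}")
    repeat_type, tails = PAIR_CLASSES[code]
    return code, repeat_type, tails


print(classify_pair("C", "D"))
```

Keeping the four codes in one table makes the heads-in/heads-out and tails-in/tails-out vocabulary used later in the text easy to cross-check against the CD and DC classes.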
Preference in the Alignment Length Distribution of Paired Alus
The adjacent Alus first were classified according to relative orientation. Roughly two-thirds of the Alu pairs (46,087 / 70,324) had both Alus pointing in the same direction (Fig. 3). As expected, the direct repeats were nearly equally distributed between the DD and CC orientations (23,318 and 22,769, respectively).
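The counts quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are introduced here for illustration.

```python
# Counts as quoted in the text.
total_pairs = 70_324
direct_pairs = 46_087
dd_pairs, cc_pairs = 23_318, 22_769

# DD and CC together should account for all direct pairs.
assert dd_pairs + cc_pairs == direct_pairs

# The remainder are inverted pairs (CD plus DC).
inverted_pairs = total_pairs - direct_pairs
direct_fraction = direct_pairs / total_pairs
dd_share_of_direct = dd_pairs / direct_pairs

print(inverted_pairs)                 # 24237
print(round(direct_fraction, 3))      # 0.655
print(round(dd_share_of_direct, 3))   # 0.506
```

The direct fraction comes out at about 65.5%, which is the "two-thirds" of the text, and the DD share of about 50.6% matches the near-equal split between DD and CC.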
The pairs of Alus then were analyzed according to their alignment lengths, which would correspond to the length of adjacent Alus over which interactions might be possible. These distributions were expected to reflect rules about insertion and stability of Alus. Pairs of Alus were divided into five categories according to the alignment length `a' (in bp): >276, 201-275, 126-200, 51-125, and <50. As shown in Figure 3, nearly 75% of the Alu pairs contained aligned regions corresponding to nearly full-length (>276 bp) or half-length Alus regardless of orientation. Among all the Alu pairs, 41% of direct and 50% of inverted
Alu pairs had alignment lengths >276 bp. The second
most-frequent category of Alu pairs, containing 24% (17,007)
of the total, was that for which the alignment lengths were
approximately one half the length of an Alu consensus sequence
(126-200 bp). This category may reflect the dimeric nature of the
ancestral full-length Alu that is 282 bp long (excluding the
poly(A) tails). The upstream half of an Alu sequence, which is
derived from the free left Alu monomer (FLAM for review,
see Jurka 1995
Bias and Exclusion of Adjacent Alus in Relation to Separation of Repeats
Previously, it was shown that distance was a major component in the
distribution of Alu pairs, where the distance between Alus was simply identified as non-Alu sequence
between the pair of Alus (Jurka 1995). As shown in Figure 4A, nearly 30% of all
the direct Alu repeat pairs are separated by <20 bp,
similar to the previous report where much less of the genome had been
analyzed (Jurka 1995).
Analysis of Alu pair distribution on the basis of distance between aligned regions (b + c), rather than distance between Alu sequences, revealed a much greater difference in frequencies between direct and inverted Alu repeats. Unlike for direct repeats, only a relatively small number of all the inverted Alu repeats were identified that had aligned regions separated by distances <20 bp (Fig. 4B). This may suggest that inverted Alu repeats are excluded during insertion and/or they are an unstable configuration. The total number of direct repeats separated by <20 bp was ~12- and 25-fold more than the DC or CD classes of inverted repeats, respectively. For separations >20 bp, the total number of inverted and direct repeats was nearly constant for different size classes and within a factor of two of each other, suggesting more random insertion. It is interesting that while the most frequent group of inverted Alus are those separated by <20 bp (Fig. 4A), the actual distance between aligned regions often is 20 to 40 bp (compare Fig. 4A to 4B), suggesting that one member of the Alu pair is truncated (discussed below).
Given the importance of alignment in Alu pair distribution, the Alu pairs were analyzed according to the length of aligned region and either distance between Alu sequences (Fig. 5A-E) or distance between aligned regions (Fig. 6A-E). As shown in Figure 5A-E, large differences in the relative number of direct versus inverted repeats were found when the size of aligned regions was considered. For Alu pairs with approximately full-length alignments (>275 bp, the upper limit was arbitrarily chosen to be 500 bp, although few Alus exceeded 300 bp) and half-length alignments (125-200 bp), there is a vast excess of direct versus inverted Alu pairs for short spacer distances (<20 bp). Surprisingly, there are many fewer direct repeats in the 200-275 bp and the <125 bp alignment categories. Possibly this pattern is a reflection of the dimeric nature of Alus.
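The grouping by alignment length used in Figures 5 and 6 can be sketched as follows. The rule a = max(a1, a2) is the one stated in Methods; the exact class boundaries are an assumption, because the text quotes them slightly differently in different places (for example ">276" and ">275", or "<50" next to "51-125"). Here the boundaries 50, 125, 200, and 275 bp are treated as inclusive upper edges. The function names are hypothetical.

```python
from bisect import bisect_left

# ASSUMPTION: inclusive upper edges (bp) of the first four alignment-length classes;
# the fifth class is open-ended. The text is not fully consistent on these edges.
EDGES = [50, 125, 200, 275]
LABELS = ["<=50", "51-125", "126-200", "201-275", ">275"]


def alignment_length(a1, a2):
    """Alignment length 'a': the longer of the two aligned segments (see Methods)."""
    return max(a1, a2)


def length_class(a):
    """Map an alignment length in bp to one of the five classes."""
    if a < 0:
        raise ValueError("alignment length cannot be negative")
    return LABELS[bisect_left(EDGES, a)]


# A near full-length Alu paired with a half-length one falls in the top class,
# because 'a' takes the longer of the two aligned segments.
print(length_class(alignment_length(282, 150)))
```

Using `bisect_left` on the sorted edges keeps the boundary handling in one place, so a different convention for the edges only requires changing `EDGES` and `LABELS`.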
Regardless of the size of the aligned regions, there is clearly a
reduction in inverted Alu pairs with short spacer distances (<20 bp) between the aligned regions (Fig. 6A-E), although the exclusion seemed somewhat less for the DC category of the approximately full-length Alus. This is consistent with our previous
results, where we found that for full-length Alus, there was a
strong bias against inverted repeats that are closely-spaced (Lobachev
et al. 2000).
Sequence Identity between Paired Alu Elements
An analysis of Alu pair distribution based on identity between the Alus may reflect a role for homologous interactions in their appearance and stability. Most Alu pairs share between 65% and 85% sequence identity for both direct (Fig. 7A) and inverted (Fig. 7B) repeats. There were no apparent differences between distributions for the CC vs DD orientations or the CD vs DC inverted repeat orientations (data not shown). The distributions were similar for alignment lengths up to 275 bp. The exception was for Alu pairs that had a short alignment length (<50 bp), where identities greater than 90% were observed frequently. For the other categories (>50 bp), there were few Alu pairs that shared >90% identity. The distribution of the degree of identity for the approximately full-length direct and inverted Alu pairs was somewhat narrower and shifted towards greater identity, possibly suggesting interactions or conservation of some features of the Alus. It will be interesting to determine if there are regions within the full-length Alus that are more conserved.
Because the distributions of direct and inverted full-length
Alu pairs differed dramatically for short separation distances between aligned regions (Fig. 6A), we examined further their
distribution in order to evaluate the relationship between separation
of aligned regions and level of homology. Presented in Table
1 are the frequency distributions for
Alu pairs with long alignment regions (>275 bp) classified
by levels of identity and separation distance between the aligned
regions. Ninety percent of these Alu pairs exhibit 70%-90%
sequence identity, with the number of pairs in the 70%-80% identity
group about 1.7-fold greater than in the 80%-90% group, regardless
of orientation within the pair (see Table 1). The aligned regions in
one third of the direct Alu pairs are separated by 40 bp or
less (Fig. 6 and Table 1).
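A cross-tabulation of the kind summarized in Table 1 (orientation by identity band by separation of aligned regions) might be built as in the sketch below. The separation edges of 20, 40, 60, and 80 bp and the ten-point identity bands are assumptions chosen to match the ranges mentioned in the text, and the records are hypothetical values for illustration only, not data from the study.

```python
from collections import Counter

# ASSUMPTION: inclusive upper edges (bp) for the separation classes.
SEPARATION_EDGES = [20, 40, 60, 80]
SEPARATION_LABELS = ["<=20", "21-40", "41-60", "61-80", ">80"]


def separation_class(distance_bp):
    """Map a distance between aligned regions to a separation class."""
    for edge, label in zip(SEPARATION_EDGES, SEPARATION_LABELS):
        if distance_bp <= edge:
            return label
    return SEPARATION_LABELS[-1]


def identity_class(percent_identity):
    """Ten-point identity bands, e.g. 84.2 -> '80-90'; 100 falls in '90-100'."""
    lower = min(int(percent_identity // 10) * 10, 90)
    return f"{lower}-{lower + 10}"


def cross_tabulate(pairs):
    """Count pairs given as (orientation, percent identity, separation in bp)."""
    table = Counter()
    for orientation, identity, separation in pairs:
        key = (orientation, identity_class(identity), separation_class(separation))
        table[key] += 1
    return table


# Hypothetical records, for illustration only (not data from the study).
toy_pairs = [("DD", 84.2, 5), ("DD", 76.0, 12), ("CD", 72.5, 35), ("DC", 88.0, 90)]
table = cross_tabulate(toy_pairs)
for key in sorted(table):
    print(key, table[key])
```

A `Counter` keyed by the three attributes keeps the tabulation independent of how many classes are used, so the same function works if the band or separation edges are changed.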
While closely spaced, inverted Alu pairs generally are less frequent than direct pairs, our analysis revealed features in the distribution of full-length inverted Alu pairs that correlate with the degree of identity and orientation (i.e., tails-out or tails-in). As shown in Table 1, the proportion of inverted Alu pairs separated by <20 bp with 80%-90% identity was considerably lower than for direct repeats, regardless of orientation of the inverted Alus: ~250- and ~20-fold lower for the Alu pairs with tails external (CD) and tails internal (DC), respectively. Unlike the direct Alu pairs, the frequency appeared to increase somewhat with decreasing degree of homology. (Although the total number of events is small, these results are consistent with the observation of only one pair of inverted Alus in the 90%-100% identity category up to 40 bp separation as compared to 14 among 34 direct Alu pairs.) The frequencies of CD and DC pairs were comparable at longer separation distances (41 to >80 bp) and were independent of homology; the frequencies also were more comparable to those of direct repeats. There was a difference between the frequencies of CD and DC repeats among the Alus separated by 21-40 bp, with the CD pairs being about twice the frequency of the DC pairs and exhibiting frequencies in the range of the direct repeats. As presented in the Discussion, the rarity of inverted Alu pairs that have closely spaced, highly related regions (e.g., <20 bp), regardless of heads-in or heads-out orientation, is likely because of a destabilizing effect resulting from an interaction between the homologous regions.
Truncations of Large Alus
The similar frequencies for inverted pairs (especially CD) and
direct repeat pairs at distances >20 bp may indicate that targeting mechanisms exist for inverted repeats (or at least for the CD category)
as well as direct repeats (Jurka 1997).
For the heads-out repeats, approximately one half (44%-55%) of the Alu pairs have a truncated Alu, regardless of the distance between the Alus. This contrasts with the heads-in inverted repeats and the direct repeat categories, which also were markedly different from each other. For the direct repeats, 45% of the pairs have a truncation when the Alus are closely spaced (<20 bp). However, the truncations are not evenly distributed: there is a strong bias (10:1) for truncations of the 5' ends of Alus that are internal to the closely spaced pairs. At greater distances, the frequency of truncations increases and the bias disappears, with the ratio of internal truncations becoming somewhat less than external. For the heads-in category (CD), there is a strong bias towards equal alignment lengths (no truncation) when the spacer is short (20-60 bp). At longer distances, the frequency of truncations is comparable to that for the direct and the heads-out inverted repeat categories. The few heads-in Alu repeats that are found at very short distances (<20 bp) are nearly equally distributed between truncated and equal sizes. We suggest that these differences in truncation patterns reflect differences in mechanisms of insertion and/or stability (see Discussion).
Age of Alus in Pairs Differs with Distance
Recently retroposed Alu sequences usually are not fixed in
the population (Batzer et al. 1996).
General Approach to Investigating Alu Distribution
Understanding the organization of Alus in the human genome is expected to shed light on Alu integration, Alu changes, and the potential for Alus to affect genome stability. Our approach incorporated newly developed computational tools along with previously developed programs to analyze Alu pairs in terms of the potential for homologous interactions. The pairwise approach to analyzing Alus was motivated in part
by observations from several model systems. Pairs of large inverted DNA
repeats can be unstable, lead to deletions, and stimulate recombination
between DNAs surrounding the inverted repeats in yeast (Gordenin et al.
1993). Our study focused on four attributes of Alu pairs that might
be important in the integration and stability of Alu sequence pairs in genomes of humans: (1) orientation of each member of the pair,
(2) size of the inverted repeat, (3) distance between the aligned
regions of the pair, and (4) sequence identity. In addition, we
examined the age of Alus. The present study considerably extends our recent report demonstrating a reduction in closely spaced
inverted Alus as compared to direct Alu repeats
(Lobachev et al. 2000).
Direct Alu Repeats and Preferences
It is clear that orientation and distance between aligned regions are important factors in Alu distribution. For long separations (i.e., >80 bp) between aligned regions, there do not appear to be any preferences or exclusions of direct or inverted Alu pairs. In this study, we confirm that there is a vast excess of closely spaced
Alus in the direct as compared to the inverted orientation (Jurka 1995 It should be noted that complementary AATTTT-like signals within preexisting Alus would determine integration of the incoming Alus in the opposite orientation. However, in directly oriented Alus, the complementary AATTTT-like target signals are about 10 times less frequent than the TTAAAA-like signals (Jurka, unpubl.). Therefore, closely spaced Alu integrations in direct orientation (i.e., CC or DD) would be expected to be around 10 times more frequent than inverted sequences (CD or DC). (Furthermore, the AATTTT-like signals are scattered randomly in Alus so that no systematic pattern of CD or DC pairs would be expected.) Regardless of targeting preferences, this would not explain the rarity of closely spaced inverted Alus (<20 bp) as compared to Alus that are more distant. The present study differs from previous approaches in that we have classified Alu pairs according to the distance between aligned regions rather than just the distance between Alu sequences. Because the actual distance between aligned regions (b + c) is greater than or equal to the distance between Alu sequences (c), it appears that many of the closely spaced pairs may contain truncated Alus. Our extensive analysis of approximately full-length Alus (Table 2) has demonstrated that most of the truncations are in the internal head of the direct repeats. It is interesting that the strong preference for closely spaced aligned regions of Alus only applied to the largest categories of direct repeat pairs, >275 bp and 125-200 bp. Possibly this is a result of an integration preference for full-length and half-length Alus (i.e., one member of the dimer within an Alu). Both the direct and the inverted Alu pairs that have
full-length alignments tend to be more closely related than when the alignments are shorter. Among the reasons are that some regions diverge
less readily than others, the full-length Alus have long polyA
tails that would contribute to overall sequence identity, and possibly
there are greater opportunities for homologous interactions. Also, long
Alus tend to be younger and less diverged (Arcot et al. 1995).
Inverted Alu Pairs and Exclusion
We found that, unlike for direct repeats, inverted Alu pairs with closely spaced (<20 bp) aligned regions were uncommon regardless of size of alignment or orientation (Fig. 4B) and especially rare among the Alu pairs with nearly full-length alignments. As the distance between aligned regions increased beyond 20 bp, the frequency of direct and inverted Alu pairs became more uniform, suggesting random integration. Interestingly, for Alus whose alignment regions are separated by <20 bp, the CD pairs (tails out) are even more excluded than the DC pairs. Possibly, this is a result of more heterogeneity of tails versus the unique sequence of Alu heads. The exclusion of closely spaced inverted repeats is consistent with the
observation that inverted repeats at close distances are unstable in
yeast and, for the case of Alus, the instability is highly
dependent on distance (Lobachev et al. 1998). Thus, we conclude that sequence identity and distance are important factors that contribute to the distribution of Alu pairs and the potential for Alu pairs to cause genome instability. These observations with inverted as well as direct repeats may be useful in understanding the frequent clustering of Alus.
Truncations of Closely Spaced Alu Pairs and Mechanisms of Integration and/or Instability
We found that for heads-out repeats (DC) and distantly spaced heads-in (CD) and direct repeats, about one half of the pairs had a truncation in one of the Alus. Departures from this frequency may indicate factors that affect Alu insertion and/or stability of Alu pairs. For closely spaced direct Alus, the internal Alu head frequently is truncated relative to the external head. The results with the inverted Alus clearly are different. The closely spaced (20-60 bp) heads-in inverted Alu pairs have a strong bias towards Alus with no 5' truncations, while there is no such bias for the heads-out category. (The higher frequency of truncated pairs among the very few with aligned regions that are <20 bp apart may simply reflect the instability of inverted Alus in close proximity.) The differences between the two categories of inverted repeats and the markedly different observations with the direct repeats have interesting implications both for the origin of Alu repeats and their potential instability. Models based on a simple retroposition at the site of integration do not account for the various truncation patterns we observe: 50% truncation of one of the Alus in distantly separated pairs of Alus, preferred truncation of the internal head of direct repeats, and no truncation of heads-in inverted repeats. As discussed above, there is a strong preference for insertion of an Alu in a direct orientation immediately next to the tail of an existing Alu.
The present results suggest that along with targeting, there is an associated removal of some of the 5' end of the incoming Alu and, as targeting next to an Alu becomes less likely (i.e., for more distantly spaced repeats), the truncations are less frequent. The reasons for the lack of truncation in the heads of the closely
spaced heads-in Alus but not heads-out inverted Alu
pairs are not clear. However, they may indicate a directional
complementary pairing mechanism prior to integration that starts from
the apex of the inverted pair (i.e., the closest sequences in an
Alu pair). This would result in a preference for full-length
pairing at the apex that could be detected in the heads-in category but
not the tails-in category (variability in size of poly AT tails would preclude such analysis of the tails-in pairs). Opportunities for complementary interactions might arise if Alu RNA had been
reverse transcribed to cDNA prior to integration. Although there are
few direct examples, genetic evidence from yeast is consistent with cDNAs being an intermediate in recombination (Derr and Strathern 1993).
Another possible explanation has to do with the transcription by RNA polymerase III. In the DC class of inverted repeats, 5' to 3' transcription from both ends converges toward the center of the repeat. The transcription diverges outward from the center in the CD class. DC repeats therefore can be opened by transcription for recombination because transcription initiating in D's promoter could proceed through the center and into the internal middle run of Ts between the FLAM and the FRAM in the second Alu in the pair. This would be precluded if the C in the DC pair presents a "poly T" tail so that transcription is halted before the second Alu, creating the possibility of ATA triplexes from the interaction between single-stranded transcribed poly A and the AT duplex in the DC pair.
This study was motivated in part by the association of numerous
human disorders with Alu-mediated sequence rearrangements. However, given the high frequency of Alu sequences in the
human genome, the number of Alu-associated diseases would
appear low. The majority of the Alu sequences are stable,
having remained in their present position in the human genome for
millions of years and may pose relatively little threat to human
health. However, we reasoned that a study of the characteristics of
Alu pairs might reveal situations that potentially are
unstable. These could include closely spaced, inverted Alu
pairs of pairs that might be unstable in some backgrounds [as found
for yeast (Lobachev et al. 2000 Because we found that both distance and sequence identity were
important factors in defining the distribution of Alu pair repeats and that they influenced the stability of inverted repeats in
model systems, these factors may be useful in predicting regions or
genes in the human genome that may be at-risk for instability as a
result of inverted Alu pairs, based on information available from yeast and the similarity between systems in yeast and humans that
deal with genetic stability (Resnick and Cox 2000). Loss of heterozygosity (LOH) of several genes on the list has been
linked to various genetic illnesses. These include TNX, encoding tenascin-x (associated with an undesignated Ehlers-Danlos syndrome type, clinically similar to type II, but with distinct ultrastructural characteristics) (Burch et al. 1997 A deletion of the LIMK1 gene coding for the
neuregulin-interacting serine, threonine, tyrosine kinase, is
associated with the less-severe phenotype of Williams-Beuren
syndrome (Robinson et al. 1996 It is interesting that
We also have identified many genetically uncharacterized regions that contain Alu pairs with closely spaced, aligned regions (data not shown but available in web site database). Even if these regions are not associated with genes, they might be at-risk if the Alu pairs could initiate genomic changes such as LOH, chromosome loss [as found for yeast (Lobachev, unpubl.)] and translocations. They may correspond to highly polymorphic sites between individuals. It would be interesting to follow the stability of the various closely spaced inverted Alu pairs under different growth and exposure conditions. Based on results with yeast, mutants defective in DNA metabolism or expressing altered DNA metabolic proteins also may reveal situations under which closely spaced, inverted Alus would be unstable.
Computational Approaches and Alu Identification
The computational methods used in this study are listed in Table
5. They are available at
the website http://dir.niehs.nih.gov/ALU/methods.html, where they
are described at length. Computations were performed on either a SUN
Sparc station running SunOS 5.5 (SUN Microsystems) or a Silicon
Graphics O2 workstation running IRIX 6.3.
Sequences that were related to a consensus Alu sequence were
identified. A map file of all human Alu sequences as of
September 1999 was developed by comparing human genomic sequences in
the GenBank database (release 112.0, National Center for Biotechnology Information, National Library of Medicine, National Institutes of
Health, Bethesda, Maryland) with a well-defined Alu consensus sequence (Jurka 1993). The annotated sequence files were extracted from the GenBank database to create a GenBank sublibrary of Alu sequences. This sublibrary was needed to generate alignments between the Alu pairs.
Categorization of Pairs According to Relative Orientation of Alus
The revised map file was used as input to derive a list of loci and
their corresponding coordinates for each pair of adjacent Alu
sequences. The program PFOLLOWS3 (Klonowski and Jurka 1997 Four simple programs, written in Perl, were run by an executable script
to extract the coordinates for each of the four types of pairs (CC, CD,
DC, DD) from the PFOLLOWS3 output file. The four subfiles
were reformatted to place the pair coordinates side-by-side in list
format for subsequent reference and analysis. These coordinate list
files were then used to extract the sequences from the GenBank-derived
sublibrary by the program VEXT (Klonowski and Jurka 1997).
Extraction and Alignment of Paired Alu Sequence Fragments
Sequences from the sublibrary were extracted for both Alu
elements in each pair. For the inverted repeats (the CD and DC
Alu pairs), the reverse complementary sequence of the second
Alu in each pair was generated for alignment. The Alu
pairs that were direct repeats (CC and DD) were aligned with each
other; for the inverted repeats, the first Alu in each of the
pairs was aligned with the complement of the second Alu
element. These pairwise alignments yielded various alignment
characteristics [such as mismatches and matches (Waterman 1984
Retrieval of Coordinates
Once the sequences were aligned, the actual aligned coordinate
numbers (the alignment program renumbered the coordinates to start with
one) were regenerated using the short program PRENUM02 (Klonowski 1998).
Alignment Length and Percent Identity within Pairs of Alus
The data were reformatted and files containing chromosome location,
Alu sequence, and pairwise alignment characteristics were merged (the loci names were matched to establish correct merging of
files). This enabled the alignment length "a", percentage identity, and other parameters to be determined for pairs of Alus. An
example of the parameters associated with each Alu pair in the
website is provided in Figure 2. The alignment length "a" ("align
len." in Fig. 2) is obtained as follows: if the aligned sequence of Alu1 (a1) > the aligned sequence of Alu2 (a2),
then a = a1, otherwise a = a2. (The actual length
a1 = ("Alu1 aligned finish") The "spacer len(gth) (c)" = (Alu2 fragment start)
We next recovered the sequence description headings and the original unaligned sequence fragment coordinates from the sequence files that were used in generating the data. Once the files containing the coordinates for first and second Alus in a pair and the description were generated, they were pasted side-by-side with the "grepped" alignment statistics. A Perl script called Bins 3 was used to subdivide the summary output from the alignment into different groupings dependent on the percentage identity between the two Alus in the pairs. In the process, the Bins program verified that the loci names in the pasted coordinate data matched the alignment data. The Bins program also recalculated the "c" size by subtracting the end of the first original Alu fragment coordinate from the end of the second original Alu fragment coordinate as an internal check. Another Perl program, Bins, was written to subdivide the files according to the range in which the "a" length fell to group the data based on length. The data also were grouped according to the percentage identity between the two Alu sequences in a pair. Finally, two more Perl programs were used to subdivide the data according to distances between the Alus (Bins C) and distances between aligned regions (Bins 2).
The data were summarized in four hypertext mark-up language (HTML) tables (cc.html, dd.html, etc.). These tables include the raw data supporting the present results. Links from each of the cells in the tables contain all Alu sequence information available as of September 1999. Each cell has a hypertext link to the characteristics describing each pair. An example of the linked information is provided in Figure 2.
Determination of Internal Truncations
It was necessary to use four different algorithms, one for each of
the four possible orientations, to determine the length of internal
truncations. In the case of CC, p is equal to the fragment
finish coordinate minus the alignment finish coordinate of the first
Alu in the pair, while q equals the fragment finish coordinate minus the alignment finish coordinate of the second Alu in the pair. If q For DD, the algorithm is as follows: p is equal to the aligned
start coordinate minus the fragment start coordinate of the first
Alu in the pair, while q equals the aligned start
coordinate minus the fragment start coordinate of the second
Alu in the pair. If q Inverted repeats were more complicated. For CD, where both heads are
internal, p is equal to the fragment finish coordinate minus
the aligned finish coordinate of the first Alu in the pair, while q equals the aligned start coordinate minus the fragment start coordinate of the second Alu in the pair. If
q Finally, for DC, where both heads are external, p is equal to
the aligned start coordinate minus the fragment start coordinate of the
first Alu in the pair, while q equals the fragment
finish coordinate minus the alignment finish coordinate of the second Alu in the pair. If q
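The four orientation-specific definitions of p and q given above reduce to one rule: for a D-oriented Alu the offset is the aligned start minus the fragment start, and for a C-oriented Alu it is the fragment finish minus the aligned finish. The sketch below computes p and q under that reading. It assumes the aligned coordinates have already been converted back to the same coordinate system as the fragment coordinates, and it stops at p and q because the comparison step ("If q ...") is incomplete in the text. The names `AluFragment`, `head_offset`, and `truncation_offsets` are hypothetical.

```python
from typing import NamedTuple


class AluFragment(NamedTuple):
    """One Alu of a pair; all coordinates are assumed to share one coordinate system."""
    orientation: str  # "D" or "C"
    frag_start: int
    frag_finish: int
    aln_start: int
    aln_finish: int


def head_offset(alu):
    """Unaligned length at the end used in the text's definitions of p and q."""
    if alu.orientation == "D":
        return alu.aln_start - alu.frag_start
    if alu.orientation == "C":
        return alu.frag_finish - alu.aln_finish
    raise ValueError(f"orientation must be 'D' or 'C', got {alu.orientation!r}")


def truncation_offsets(first, second):
    """Return (p, q) for the first and second Alu of a CC, DD, CD, or DC pair."""
    return head_offset(first), head_offset(second)


# Hypothetical CD pair, for illustration only.
first = AluFragment("C", frag_start=100, frag_finish=400, aln_start=110, aln_finish=380)
second = AluFragment("D", frag_start=420, frag_finish=700, aln_start=425, aln_finish=700)
print(truncation_offsets(first, second))  # (20, 5)
```

Writing the rule per Alu rather than per pair class reproduces all four cases (CC, DD, CD, DC) as described, without needing four separate code paths.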
We are grateful to Paul Klonowski for writing PRENUM02 and for automating the tabulation of the data in HTML. We greatly appreciate the comments of Rob Slebbos and Jim Mason on the manuscript. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
4 Present address: Duke Center for Human Genetics, Duke University Medical Center, Box 3445, Durham, NC 27710, USA.
5 Corresponding author.
E-MAIL resnick{at}niehs.nih.gov; FAX (919) 541-7593.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.158801.