SGS terms

From Zbio



Fragment sequencing

Also called "shotgun sequencing". DNA is broken up randomly into numerous small overlapping segments. Adaptors are ligated to both ends. Fragments are pre-amplified, clonally amplified and sequenced from one end. The reads obtained are aligned to a reference genome or used for de novo assembly.

Randomness of fragments

To obtain "random sequences" it is necessary to have "random fragments", "equal efficiency of adaptor ligation", "equal pre-amplification" and "equal clonal amplification".

  • current digestion procedures: hydrodynamic and ultrasonic digestion are generally considered random. DNase digestion and restriction enzymes with 2-bp recognition sites are non-random, with a preference for some regions;
  • if the starting material is a collection of short fragments (such as PCR products), then the "end parts" will be overrepresented;
  • adaptor ligation definitely has some sequence preference, especially for ligation of A-tailed fragments;
  • both preamplification and clonal amplification are sensitive to GC-content and to secondary structure.

Length of library fragments

  • neither bridge-amplification (Illumina) nor beads-ePCR (SOLiD) amplifies fragments that are too long. The limits are ~700bp for bridge-amplification and ~300bp for beads-ePCR;
  • DNA fragments should not be shorter than the read length: ~35bp for both Illumina and SOLiD;
  • shearing a DNA sample into shorter fragments increases the complexity of the library. Suppose that two DNA samples of equal mass were digested to mean sizes of ~500bp and ~50bp and libraries were prepared without any losses. Both libraries are suitable for Illumina sequencing, but the complexity of the first library is 10x lower than that of the second;
  • hydrodynamic digestion is not efficient for ds DNA <1kbp;
  • ultrasonic digestion is not efficient for ds DNA <300bp;
  • ultrasonic digestion produces smaller fragments than hydrodynamic digestion (Hydroshear, nebulizer).
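
The length limits and the complexity argument above can be checked with a few lines of arithmetic. The following is only a sketch: the names `LIMITS_BP`, `suitable_platforms` and `relative_complexity` are hypothetical, and the limits are the approximate values quoted in this section, not vendor specifications.

```python
# Approximate fragment-length windows quoted in the text (bp); not vendor specs.
LIMITS_BP = {
    "Illumina": (35, 700),  # read length .. bridge-amplification limit
    "SOLiD": (35, 300),     # read length .. beads-ePCR limit
}


def suitable_platforms(mean_size_bp):
    """Return the platforms whose length window contains the fragment size."""
    return [name for name, (low, high) in LIMITS_BP.items()
            if low <= mean_size_bp <= high]


def relative_complexity(size_a_bp, size_b_bp):
    """Ratio of molecule counts (library A / library B) for equal DNA mass.

    For a fixed mass the number of molecules is inversely proportional to
    the mean fragment size, assuming no losses during library preparation.
    """
    return size_b_bp / size_a_bp


if __name__ == "__main__":
    for size in (500, 50):
        print(size, "bp:", suitable_platforms(size))
    print("complexity of the 500 bp library relative to the 50 bp one:",
          relative_complexity(500, 50))
```

With these assumed windows a 500bp library fits only the Illumina limit, while a 50bp library fits both.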

Sequence analysis

  • it is impossible to align a sequence unambiguously if it is repeated several times in the reference genome, so any repeats longer than the read length are excluded from analysis for fragment libraries;
  • to reconstruct structural variations (insertions, deletions, inversions, duplications) it is necessary to recognize the sequence on both borders of the variation. A 35bp read is too short for this task: it is difficult to recognize and unambiguously align two fragments within it;
  • de novo assembly results in ~500bp contigs.
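
The ambiguity caused by repeats can be illustrated on a toy reference. This is a minimal sketch, assuming exact matches on the forward strand only (real aligners also handle mismatches and the reverse strand); the function names and the toy sequence are hypothetical.

```python
def count_occurrences(reference, read):
    """Count exact, possibly overlapping, occurrences of read in reference."""
    count = 0
    start = reference.find(read)
    while start != -1:
        count += 1
        start = reference.find(read, start + 1)
    return count


def is_uniquely_mappable(reference, read):
    """A read can be placed unambiguously only if it occurs exactly once."""
    return count_occurrences(reference, read) == 1


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"  # toy sequence containing the repeat ACGT
    for read in ("ACGT", "TTGACC"):
        print(read, count_occurrences(reference, read),
              is_uniquely_mappable(reference, read))
```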

Mate-Paired sequencing

Also called "pairwise end", "paired end" or "double-barrel shotgun" sequencing. Normally, the fragment length should lie within a defined interval (for example 1.5±0.1kb).

Mate-Paired sequencing helps to solve two tasks:

  • mapping of repetitive sequences. Suppose that for a particular MP-read one of the end sequences is unique, while the other can be mapped to a number of positions in the genome. Taken alone, the repetitive sequence cannot be mapped to the genome unambiguously. Taken as part of an MP-read, it is mapped unambiguously if only one copy of the repeat lies within the fragment length interval from the unique sequence. A similar algorithm is used for mapping two repetitive end sequences: the defined fragment length significantly restricts the possible map positions;
  • studying structural variants. An inversion may be recognized as a disturbed orientation of the end sequences, an insertion or deletion as a significant deviation of the mapped length of the MP-read from the mean value, and a translocation as end reads located in unrelated positions in the genome.
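
The signatures of structural variants listed above can be written down as a small classifier. This is a hedged sketch, not a real variant caller: `MatePair`, `classify_pair` and the default values (1500bp mean length, 100bp allowed deviation, taken from the 1.5±0.1kb example) are illustrative assumptions. It also assumes that the two ends are given leftmost first, that the expected strand pattern is supplied by the user (it depends on the library protocol), and that "unrelated positions" simply means different chromosomes.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    """Mapped positions of the two end reads, leftmost end first."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair, mean_len=1500, tolerance=100,
                  expected_strands=("+", "-")):
    """Label a mapped mate pair with the rearrangement it hints at."""
    if pair.chrom1 != pair.chrom2:
        # Simplification: only ends on different chromosomes are "unrelated".
        return "translocation"
    if (pair.strand1, pair.strand2) != expected_strands:
        return "inversion"
    mapped_len = abs(pair.pos2 - pair.pos1)
    if mapped_len > mean_len + tolerance:
        # The ends lie farther apart on the reference than in the sample:
        # sequence between them is missing from the sample.
        return "deletion"
    if mapped_len < mean_len - tolerance:
        # The ends lie closer on the reference than in the sample:
        # the sample carries extra sequence between them.
        return "insertion"
    return "concordant"


if __name__ == "__main__":
    print(classify_pair(MatePair("chr1", 1000, "+", "chr1", 2500, "-")))
    print(classify_pair(MatePair("chr1", 1000, "+", "chr1", 4000, "-")))
```

A single discordant pair is only a hint; in practice several independent pairs supporting the same rearrangement would be required.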

Fragment length variation (FLV) should be as small as possible, because:

  • the length of the region in which a repetitive sequence can be located is equivalent to the FLV. The smaller the FLV, the more accurate the positioning of the repetitive sequence;
  • an insertion/deletion can be recognized only if the length of the rearrangement is larger than the FLV. The smaller the FLV, the more indels will be detected;
  • but the FLV cannot be zero, because gel electrophoresis has limited resolution, DNA fragments with different sequences have slightly different mobility, and the smaller the FLV, the less DNA is used for library preparation and the lower the complexity of the library.
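
The effect of the FLV on the positioning of a repetitive end can be sketched as a simple filter over candidate positions. The function name `resolve_repeat` and all coordinates below are hypothetical; the sketch assumes that both ends map to the same chromosome and ignores strand.

```python
def resolve_repeat(unique_pos, candidate_positions, mean_len, flv):
    """Keep the candidate positions of the repetitive end that are compatible
    with the fragment length interval mean_len +/- flv."""
    low, high = mean_len - flv, mean_len + flv
    return [pos for pos in candidate_positions
            if low <= abs(pos - unique_pos) <= high]


if __name__ == "__main__":
    unique_end = 10_000
    repeat_copies = [2_000, 11_480, 11_900, 50_000]
    # A narrow interval leaves a single candidate, a wide one leaves two.
    print(resolve_repeat(unique_end, repeat_copies, mean_len=1500, flv=100))
    print(resolve_repeat(unique_end, repeat_copies, mean_len=1500, flv=500))
```

The repetitive end is placed unambiguously only when exactly one candidate survives the filter.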

Different sequencing projects may have different optimal MP-fragment length.

MP-fragment length
  • less initial material for a library of the same complexity;
  • less fragment length variation (FLV);
  • fewer fragments need to be sequenced to characterize a whole genome (virtual redundancy is higher);

Library complexity

Complexity of the library is the number of independent DNA molecules in it. In both Illumina and SOLiD protocols "preamplified" libraries are used for preparation of flowcells. As a result, the same fragment may be sequenced several times.

Ideally, complexity should be significantly higher than the number of sequenced reads:

  • it is impractical to sequence low-complexity libraries (ChIP) too deeply;
  • a high-complexity library should be prepared for a high-coverage sequencing project.
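
Why complexity should exceed the number of reads can be seen from a simple sampling model. The sketch below assumes that every read is drawn uniformly and independently from the library (sampling with replacement), which ignores amplification bias; the function names are hypothetical.

```python
import math


def expected_distinct(complexity, n_reads):
    """Expected number of distinct molecules among n_reads reads.

    Uniform sampling with replacement from `complexity` (> 1) molecules gives
    complexity * (1 - (1 - 1/complexity) ** n_reads); log1p/expm1 keep the
    expression accurate for very large libraries.
    """
    return -complexity * math.expm1(n_reads * math.log1p(-1.0 / complexity))


def duplicate_fraction(complexity, n_reads):
    """Expected fraction of reads that repeat an already sequenced molecule."""
    return 1.0 - expected_distinct(complexity, n_reads) / n_reads


if __name__ == "__main__":
    reads = 1e6
    for complexity in (1e6, 1e7, 1e9):
        print(f"complexity {complexity:.0e}: "
              f"duplicate fraction {duplicate_fraction(complexity, reads):.4f}")
```

Under this model a library whose complexity equals the number of reads yields roughly one third duplicated reads, while a library a thousand times more complex yields almost none.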

It is possible to estimate complexity after preamplification.

Suppose that K cycles of prePCR result in m[µg] of DNA with mean size L[kb].

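* starting amount of DNA was: $m_0 = m / 2^K$
* number of independent molecules:
    $$N_0 = \frac{m_0}{M_w} \times N_{Avogadro} = \frac{m\,[\mu g] \times 10^{-6}\,[g/\mu g]}{2^K \times 2 \times 330\,[g/mol] \times 1000 \times L\,[kb]} \times 6 \times 10^{23}\,[mol^{-1}] \approx \frac{m\,[\mu g] \times 2^{-K} \times 10^{12}}{L\,[kb]}$$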
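
The estimate above is easy to evaluate numerically. The sketch below follows the same assumptions as the formula (the DNA amount doubles in every cycle, ~330 g/mol per nucleotide, two nucleotides per base pair); the function names and the example input values are illustrative only.

```python
AVOGADRO = 6.022e23        # molecules per mole
BP_MOLAR_MASS = 2 * 330.0  # g/mol per base pair (two nucleotides of ~330 g/mol)


def library_complexity(mass_ug, cycles, mean_size_kb):
    """Number of independent molecules before preamplification.

    Assumes perfect doubling of the DNA amount in every PCR cycle.
    """
    start_mass_g = mass_ug * 1e-6 / 2 ** cycles
    fragment_molar_mass = BP_MOLAR_MASS * 1000 * mean_size_kb  # g/mol
    return start_mass_g / fragment_molar_mass * AVOGADRO


def library_complexity_approx(mass_ug, cycles, mean_size_kb):
    """Rounded form of the same estimate: m[ug] * 2^-K * 10^12 / L[kb]."""
    return mass_ug * 2.0 ** -cycles * 1e12 / mean_size_kb


if __name__ == "__main__":
    # Hypothetical example: 1 ug after 10 cycles, mean fragment size 0.2 kb.
    print(f"{library_complexity(1.0, 10, 0.2):.2e}")
    print(f"{library_complexity_approx(1.0, 10, 0.2):.2e}")
```

The rounded form overestimates the full expression by about 10%, which is negligible compared with the uncertainty of the doubling assumption.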

Read length (RL)

The number of sequenced nucleotides per read. For both SOLiD and Illumina the read length is the same for all clones.

In both systems RL may be selected by the user. In most cases it is better to have an RL larger than 30bp, because this is roughly the length above which most of the "good" sequences map uniquely to the genome. A further increase of the RL practically does not change throughput, slightly decreases the price per nucleotide, increases the error rate, and somewhat simplifies the analysis of repeats and structural variants.

Different sequencing projects may have different optimal RL. 50bp is currently a good choice.

Long read length
  • useful for analysis of structural variations;
  • useful for analysis of repeats;
  • sequencing time per nucleotide increases with length;
  • sequencing quality significantly decreases with length;
  • an increase of RL may result in a lower number of "readable" clones;
  • for applications such as ChIP-seq or expression profiling, longer reads do not provide additional information;

RL / price

Sequencing price is a sum of prices for:

  • flowcell preparation,
  • sequencing reagents,
  • machine amortization.

An increase of read length does not change the first component, but proportionally increases the second and third.
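
The cost structure above explains why the price per nucleotide falls as RL grows. The following sketch models it with one fixed and two per-cycle components; all prices are made-up placeholders, the function names are hypothetical, and the number of reads is assumed not to depend on RL.

```python
def run_price(read_length, flowcell_price,
              reagent_price_per_cycle, machine_price_per_cycle):
    """Total run price: a fixed flowcell part plus parts proportional to RL."""
    per_cycle = reagent_price_per_cycle + machine_price_per_cycle
    return flowcell_price + read_length * per_cycle


def price_per_nucleotide(read_length, n_reads, flowcell_price,
                         reagent_price_per_cycle, machine_price_per_cycle):
    """Run price divided by the number of sequenced nucleotides."""
    total = run_price(read_length, flowcell_price,
                      reagent_price_per_cycle, machine_price_per_cycle)
    return total / (n_reads * read_length)


if __name__ == "__main__":
    # Placeholder prices in arbitrary units, not real quotes.
    for rl in (35, 50):
        print(rl, run_price(rl, 1000, 50, 20),
              price_per_nucleotide(rl, 1e8, 1000, 50, 20))
```

The fixed flowcell price is spread over more nucleotides at larger RL, so the decrease per nucleotide is real but slight when the per-cycle components dominate.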

RL / throughput

The time of a run is a sum of times for:

  • run installation,
  • sequencing.

The first component does not depend on RL. Sequencing time per nucleotide increases with length (because a longer time is required to capture weak fluorescent signals).

Redundancy and coverage

Coverage is the percentage of the genome covered by reads.

Redundancy (sometimes erroneously referred to as coverage) is the number of reads representing a given nucleotide in the reconstructed sequence. Mean redundancy can be calculated from the length of the genome (G), coverage (C), the number of reads (N), and the read length (L) as:

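$$\text{Mean redundancy} = \frac{N \times L\,[bp] \times 100\%}{G\,[bp] \times C\,[\%]}$$

Here $N \times L$ is the total number of aligned nucleotides, and $G \times C / 100\%$ is the length of the covered part of the genome over which they are spread.

For example, sequencing of a genome of $3 \times 10^9$ bp gives $5 \times 10^8$ reads of 35bp.
70% of these reads were aligned to 90% of the genome.

In this case:
 * length of the genome $G = 3 \times 10^9$ [bp]
 * coverage $C = 90\%$
 * number of aligned reads $N = 0.7 \times 5 \times 10^8$
 * read length $L = 35$ [bp]
 * mean redundancy $= 0.7 \times 5 \times 10^8 \times 35 \times 100\% / (3 \times 10^9 \times 90\%) \approx 4.5$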

Both terms (coverage & redundancy) may be applied to the whole genome or any fragment of it.

Redundancy is not uniform along the genome, for both combinatorial and systematic reasons. Uniform redundancy is highly desirable, because it helps in the analysis of structural variations.
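
Coverage and redundancy can also be computed directly from read positions on a toy genome. This is a minimal sketch with hypothetical function names; it assumes reads of equal length given by their 0-based start positions and takes the mean redundancy over covered positions only, which is one reading of "in the reconstructed sequence".

```python
def depth_profile(genome_length, read_starts, read_length):
    """Redundancy at every position: the number of reads spanning it."""
    depth = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    return depth


def coverage_percent(depth):
    """Percentage of positions covered by at least one read."""
    covered = sum(1 for d in depth if d > 0)
    return 100.0 * covered / len(depth)


def mean_redundancy(depth):
    """Mean redundancy over the covered positions only."""
    covered = [d for d in depth if d > 0]
    return sum(covered) / len(covered) if covered else 0.0


if __name__ == "__main__":
    depth = depth_profile(genome_length=10, read_starts=[0, 1, 2], read_length=4)
    print(depth)
    print(coverage_percent(depth), mean_redundancy(depth))
```

The per-position profile makes the non-uniformity visible even in this tiny example: positions in the middle of the covered region are spanned by more reads than those at its edges.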

SNPs

A Single Nucleotide Polymorphism (SNP) is a DNA sequence variant of a single base pair, with the minor allele occurring in more than 1% of a given population. SNPs with a minor allele frequency ≥20% are called "common SNPs". Frequently, the term "SNP" is used in a looser sense for short allelic variants (substitutions or small insertions-deletions, indels) without any assumptions about minimum allele frequencies for the polymorphisms. For example, the NCBI dbSNP database uses the term SNP regardless of allele frequencies.
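
The frequency thresholds of the strict definition can be expressed as a short function. This sketch assumes a biallelic variant; the function names and the label "rare variant" for frequencies of 1% or less are illustrative choices, not terms from the definition above.

```python
def minor_allele_frequency(count_a, count_b):
    """Frequency of the less common of two alleles in a sample of chromosomes."""
    return min(count_a, count_b) / (count_a + count_b)


def classify_by_maf(maf):
    """Apply the thresholds of the strict definition: >1% SNP, >=20% common SNP."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie between 0 and 0.5")
    if maf >= 0.20:
        return "common SNP"
    if maf > 0.01:
        return "SNP"
    return "rare variant"  # hypothetical label for variants below the 1% cutoff


if __name__ == "__main__":
    maf = minor_allele_frequency(950, 50)
    print(maf, classify_by_maf(maf))
```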

Structural variations

Insertion / deletion / inversion / duplication / translocation
