A BioCrunch Series Talk:

"Identification of Sources of Error Affecting Base Calling in Next Generation Illumina/Solexa Sequencing"

Dr. René Boekhorst, Imrana Sabir, Sandeep Brar and Sylvia Beka
(School of Computer Science, University of Hertfordshire, UK)

joint work with Dr. Irina Abnizova
(Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK)

21 April 2010 (Wednesday)
Lecture Theatre E351
Hatfield, College Lane Campus
3 - 4 pm

Everyone is Welcome to Attend
Refreshments will be available


The Genome Analyzer (Illumina/Solexa) is a pioneering high-throughput sequencing platform that produces millions of short "reads" (up to about 100 bases) of sequenced DNA fragments. In Illumina/Solexa sequencing, single-stranded DNA fragments are attached to glass plates termed flow cells, in "lanes" that are subdivided into "tiles". The fragments are amplified into clusters containing about 1000 clones. The clusters are "sequenced by synthesis": fluorescently labelled nucleotides are attached, position by position, to the complementary bases in the template DNA strands in a series of chemistry steps or cycles (each cycle corresponds to one position in the read). Following laser excitation, the fluorescence of the clusters is captured at each cycle in approximately 100 images ("tiles") per lane. This is done four times, using a different wavelength for each of the four nucleotides. Ideally, at each cycle each cluster displays a single fluorescence signal of maximal intensity, leading to the unambiguous identification of a nucleotide (a "base call") at the corresponding position. In practice this is not the case: the accuracy of base calling is degraded by "signal noise" in the images, artefacts in the chemistry and the spatial locations of the clusters on the flow cell.

To assess the importance of these sources of error we carried out three quality-assessment investigations. The data were from the genome of the phage ΦX174, obtained at the Wellcome Trust Sanger Institute and sequenced on Illumina's Genome Analyzer GA2 (release 1.4, run 3259). For the first study (I. Sabir), a program was written to read Illumina data files, preprocess the data and compute a four-way ANOVA with replication, testing for differences in purity between tiles within lanes and between lanes while also accounting for nucleotide type and cycle number. A second ANOVA (S. Brar) was applied to unravel the effects of the neighbouring bases on the purity of the middle nucleotide of a trimer, given its identity and cycle number. This analysis was performed on triplets of consecutive base calls, randomly sampled from a large number of reads.
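
As a rough illustration of the kind of test involved, the sketch below computes the F statistic of a one-way ANOVA comparing purity between tiles. It is a deliberately simplified analogue, not the four-way design with replication used in the study; the function name and the purity values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations.

    Simplified one-way layout: each group is one tile, each observation
    is the purity of one cluster. Assumes at least two groups and some
    within-group variation (otherwise the F ratio is undefined).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Variation of the tile means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual clusters around their own tile mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical purity values for clusters on three tiles of one lane.
purity_by_tile = {
    "tile_1": [0.90, 0.92, 0.88, 0.91],
    "tile_2": [0.85, 0.83, 0.86, 0.84],
    "tile_3": [0.95, 0.94, 0.96, 0.93],
}
f_stat, df_b, df_w = one_way_anova_f(list(purity_by_tile.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value relative to the F distribution with the stated degrees of freedom would suggest that mean purity differs between tiles. A multi-factor analysis with lane, nucleotide type and cycle number as additional factors would normally be run in a statistics package rather than by hand.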

A check on the validity of the method is to verify whether a called base is recovered when the sequenced fragment is aligned to a reference genome. The method is reliable if purity "predicts" the proportion of correctly aligned nucleotides well. In that case the alignment step can be side-stepped, since a table can be constructed from which error rates can be looked up from purity values alone. However, there are other well-defined diversity measures besides purity that may give a finer distinction between signals and (therefore) correlate better with the proportion of "true" base calls. The topic of the third investigation (S. Beka) was therefore to find out which of a series of indices (purity, chastity, Shannon entropy, and the indices of Simpson, Hill and Margalef) showed the strongest logistic-regression relationship with the probability of a correct base call.
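
To make the idea of competing indices concrete, the sketch below computes several of them from the four channel intensities of one cluster at one cycle. The definitions are assumptions made for illustration and may differ from those used in the talk: purity is taken as the brightest intensity divided by the sum of all four, chastity as the brightest divided by the sum of the two brightest, and the Hill index as the Hill number of order 1 (the exponential of the Shannon entropy). The Margalef index and the logistic regression step are not shown. The function name `signal_indices` is hypothetical.

```python
import math


def signal_indices(intensities):
    """Compute signal-concentration indices for one cluster at one cycle.

    `intensities` holds four channel intensities (A, C, G, T).
    Assumption: intensities are non-negative and not all zero; real
    data may need background correction before this step.
    """
    total = sum(intensities)
    if total <= 0 or min(intensities) < 0:
        raise ValueError("intensities must be non-negative and not all zero")

    proportions = [x / total for x in intensities]
    ranked = sorted(intensities, reverse=True)

    # Shannon entropy in nats; zero-proportion channels contribute nothing.
    shannon = -sum(p * math.log(p) for p in proportions if p > 0)
    # Simpson concentration: 1.0 for a pure signal, 0.25 for a uniform one.
    simpson = sum(p * p for p in proportions)

    return {
        "purity": ranked[0] / total,                      # assumed definition
        "chastity": ranked[0] / (ranked[0] + ranked[1]),  # assumed definition
        "shannon": shannon,
        "simpson": simpson,
        "hill_1": math.exp(shannon),  # effective number of channels
    }


# Hypothetical intensities: a fairly clean call versus an ambiguous one.
clean = signal_indices([950.0, 20.0, 15.0, 15.0])
ambiguous = signal_indices([400.0, 380.0, 110.0, 110.0])
print(clean)
print(ambiguous)
```

Under these definitions a clean signal gives purity, chastity and Simpson values close to 1 and a Shannon entropy close to 0, while an ambiguous signal moves each index toward its uniform-signal value. Each index could then serve as the predictor in a logistic regression on whether the called base matched the reference.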

Keywords: next-generation sequencing, statistics, error sources, Illumina/Solexa, base calling, purity.

Hertfordshire Computer Science Research Colloquium