Software for Analysis of “Next Generation” Sequence data

Condensation Assembly Tool

  • Identifies identical anchor sequences
  • Statistically polishes short reads
  • Increases reading length and accuracy 

Critical to assembly and to most other applications, the Condensation Assembly Tool statistically polishes and lengthens short sequence reads into fragments of a more manageable size. Short reads, such as those from the Illumina Genome Analyzer System, are often not unique within the genome being analyzed. By clustering similar reads that contain a unique anchor sequence, data of adequate coverage is condensed and the short reads are lengthened. The unique anchor sequence, or index, can be a 12-base fragment that is found in several of the reads. All reads containing this exact sequence are clustered together. Often, many of the reads within a cluster contain homologous nucleotides both upstream and downstream of the index sequence. The cluster of reads can be sorted by these flanking shoulder regions into groups of similar reads. The consensus of each group is considerably longer than the original reads, and these 50 to 65 base pair fragments are often unique within the genome, with exceptions such as homopolymeric regions, repeats and duplications.

The Condensation Assembly Tool clustered similar reads containing the same anchor sequence, CTGGGGTTACAG. The right shoulder of 8 nucleotides divides the cluster into two groups, with the sequences GTGTGAGC and GTGCCTGC. A consensus sequence is generated for each group, almost doubling the read lengths.
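The clustering step described above can be illustrated with a minimal Python sketch. It is not NextGENe's implementation: the function name, the exact-match search and the example reads are assumptions made for illustration. Only the anchor and the two 8-nucleotide right shoulders are taken from the example above.

```python
from collections import defaultdict

ANCHOR = "CTGGGGTTACAG"  # anchor sequence from the example above
SHOULDER_LEN = 8         # right-shoulder length from the example above


def group_by_right_shoulder(reads, anchor=ANCHOR, shoulder_len=SHOULDER_LEN):
    """Cluster reads that contain the anchor, then split the cluster
    by the bases immediately downstream of it (the right shoulder)."""
    groups = defaultdict(list)
    for read in reads:
        pos = read.find(anchor)
        if pos == -1:
            continue  # read does not belong to this anchor's cluster
        start = pos + len(anchor)
        shoulder = read[start:start + shoulder_len]
        if len(shoulder) < shoulder_len:
            continue  # read ends before a full shoulder is available
        groups[shoulder].append(read)
    return dict(groups)


# Made-up reads for illustration only.
reads = [
    "AAC" + ANCHOR + "GTGTGAGC" + "TT",
    "C" + ANCHOR + "GTGTGAGC" + "TTAA",
    "TT" + ANCHOR + "GTGCCTGC" + "AAG",
    "GGG" + ANCHOR + "GTGCCTGC" + "A",
    "ACGTACGTACGTACGTACGTACGT",  # no anchor, so it is skipped
]
groups = group_by_right_shoulder(reads)
```

Each key of `groups` is one shoulder sequence, and each value holds the reads from which a consensus for that group would be built.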

de novo Assembly

  • Automatically forms anchor sequences
  • Forms contigs from 500 bp to 3 kb
  • Assembles Illumina, SOLiD data
  • Completely automated, no script writing necessary
  • Provides critical review documentation on assembly results 

de novo sequence assembly with the short reads from the genome analyzers presents many challenges. With many of the current techniques, it is difficult to assemble the short reads into a large contig of 1 to 5 kb. These techniques often create many false alignments because of two major issues: short reads with high base-calling error rates, and ambiguity within the genome. Short reads with SNPs and Indels are often discarded, which is problematic when determining copy number variations in applications such as chromatin immunoprecipitation (ChIP), gene expression and transcriptome studies. The NextGENe sequence assembler was developed to solve these problems. The software is able to assemble the short reads into contigs of 0.5 kb to 5 kb, where contigs end at repeat sequences. It uniquely aligns these contigs to a reference genome. The short reads used in the assembly of a contig are recorded to show the copy number and Indel positions. NextGENe is capable of detecting Indels of 1 to 30 bp.

de novo Assembly Methodology

NextGENe statistically polishes high coverage (20-100x) datasets to remove random sequencing errors and roughly double the read lengths with the use of the Condensation Assembly Tool (Patent Pending). Repeating the Condensation removes systematic errors and further lengthens the sequence reads. The polished and elongated reads can then be assembled into large contigs while redundant reads are removed. The first step is to use the Condensation Assembly Tool to generate the first assembly. All of the reads with the same 12 bp anchor sequence are collected into a cluster. The two 10 bp shoulder sequences are used to sort the short reads into multiple groups. The consensus sequence of each group is obtained from its short reads. Bases at the ends are left out of the consensus when they are covered by only one sequence read or when the reads covering them disagree. The 5’ end of a read is given a higher weight than the 3’ end because of its quality. With 50x coverage, confidence of the condensed sequence is about 99.8%. The calculation is then carried out for all possible 12 bp anchor sequences, about 16.7 million possibilities (4^12 = 16,777,216).
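The consensus step for a single group can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented method: it uses a plain majority vote per column, omits the 5’ versus 3’ weighting, and the function name and example reads are hypothetical.

```python
from collections import Counter

ANCHOR_SPACE = 4 ** 12  # 16,777,216 possible 12-base anchor sequences


def group_consensus(reads, anchor):
    """Line reads up on the anchor, call each column by majority vote,
    and trim end columns that are covered by a single read or that
    contain disagreeing bases."""
    placed = [(r.find(anchor), r) for r in reads if anchor in r]
    if not placed:
        return ""
    left = max(pos for pos, _ in placed)  # longest left flank sets column 0
    columns = {}
    for pos, read in placed:
        shift = left - pos
        for i, base in enumerate(read):
            columns.setdefault(shift + i, Counter())[base] += 1
    calls = []
    for col in sorted(columns):
        counts = columns[col]
        depth = sum(counts.values())
        base, top = counts.most_common(1)[0]
        # An end base is trusted only if at least two reads agree on it.
        calls.append((base, depth >= 2 and top == depth))
    start, end = 0, len(calls)
    while start < end and not calls[start][1]:
        start += 1
    while end > start and not calls[end - 1][1]:
        end -= 1
    return "".join(base for base, _ in calls[start:end])


# Made-up reads for illustration; the third has an error left of the anchor.
anchor = "CTGGGGTTACAG"
group = [
    "AAC" + anchor + "GTGTGAGC",
    "AC" + anchor + "GTGTGAGCTT",
    "G" + anchor + "GTGTGAGCT",
]
polished = group_consensus(group, anchor)
```

In this example the single-read bases at both ends are dropped, and the mismatched base in the third read is outvoted by the other two reads.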

Condensation Assembly Tool elongated the 35 bp reads to approximately 60 bp while removing many of the random errors produced by the instrument. 

Click here to download de novo assembly application note (PDF file) 

SNP/INDEL Detection

SNPs and micro Indels of up to 30 bp can be detected in targeted sequencing data, both from longer sequence reads and from short reads produced by the Solexa sequencing technology. The Condensation Tool elongates short reads, increasing the probability that they are unique, while polishing the data to remove chemistry and instrument errors.

  • SNP Detection
  • Indel Detection (up to 30 bp demonstrated)
  • Low False Positive Rate
  • Low False Negative Rate
  • Biologist friendly reporting
  • Export results to a database or LIMS

 

In the region of aligned sequence reads, mutation calls are highlighted in blue. An insertion was found before position 898773 and a substitution is present at position 898796. The Whole Genome Pane is located at the top of the display: gray lines indicate coverage, red tick marks indicate the breakpoints between the transcripts within the reference, and blue tick marks identify the locations of SNPs.

 

The Mutation Output shows a listing of all mutation calls. On the left is a graphical representation of the selected and adjacent positions. The top chart shows the reference nucleotide and expected percentage, the middle chart shows the percentage of coverage for all nucleotides at each position, and the bottom chart shows the gain/loss of each allele.
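The per-position percentages shown in the middle chart can be thought of along the lines of the following Python sketch. It is an illustration only: the function names and the 20% reporting threshold are hypothetical and are not taken from NextGENe.

```python
from collections import Counter


def allele_percentages(column_bases):
    """Percentage of coverage for each nucleotide observed at one position."""
    counts = Counter(column_bases)
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: 100.0 * n / depth for base, n in counts.items()}


def candidate_variants(ref_base, column_bases, min_percent=20.0):
    """Non-reference bases at or above a hypothetical reporting threshold."""
    return {
        base: pct
        for base, pct in allele_percentages(column_bases).items()
        if base != ref_base and pct >= min_percent
    }


# Made-up bases from the reads covering one reference position.
column = "AAAG"
percentages = allele_percentages(column)
variants = candidate_variants("A", column)
```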

Click here to download SNP/INDEL Application Note (PDF file) 

Transcriptome/ChIPSeq Analysis

  • Reduced errors and better matching due to use of polished data.
  • Condensation of sample reads
  • Use any mRNA sequence database as reference
  • Provides accurate copy number even in presence of several variations
  • Expression ratio of multiple alleles differing by SNPs/Indels  

Analyzing an organism’s transcriptome with Next Generation Sequencing technology presents several challenges: a high level of sequence variation relative to the reference genome due to SNPs/Indels, a single analysis that often includes multiple transcripts for each gene, and high variability in expression rates. Short reads (25 to 35 bases) are not always unique, causing ambiguities between the various isoforms. In addition, high expression of some genes can mask genes with low expression levels.

With the Condensation Tool, short reads are statistically polished and nearly doubled in length, allowing noise and errors to be filtered out more reliably. With the Alignment Tool, the highly expressed sequences are matched to the reference. The low level reads, often mistaken for sequencing errors, are rescanned and matched to the reference, allowing for more accurate detection of genes expressed at lower rates.

The results of the analysis can be saved as a reference file, allowing for direct comparison to the results from another analysis. This is a useful feature for comparison studies such as Chromatin Immunoprecipitation (ChIPSeq). 

High frequency variations between the transcriptome and the sample reads, automatically highlighted in blue, can easily be aligned to the reference. Reports are available for viewing, editing and exporting this information. 

Click here to download Transcriptome/ChIPSeq Analysis Application Note (PDF file) 

Digital Gene Expression Studies

  • Expression Report (Gene count, ambiguities)
  • New Genes are listed separately
  • Coverage Plot
  • Search Tool
  • Display Biological Information for each tag 

Gene expression studies are currently often carried out with microarrays or with DNA sequencing methods such as Serial Analysis of Gene Expression (SAGE). In a microarray experiment, cDNA probes are hybridized to the sequence targets of the genes of interest on the microarray, where many probes of interest are located in different spots. The cDNA is labeled with a chromophore, and fluorescence intensity is proportional to the cDNA concentration of the probes. SAGE measures the counts of the sequence tags corresponding to the genes of interest. The SAGE tags are produced by cutting the cDNA with restriction enzymes while its poly-A end is bound to a biotin-labeled dT primer; the portion bound to the solid surface is kept. SAGE with the NlaIII restriction enzyme, which targets CATG, together with techniques such as MicroSAGE, LongSAGE, RL-SAGE and SuperSAGE, offers powerful ways to read absolute expression numbers by counting the tags.

The next generation DNA sequencing technologies generate millions to hundreds of millions of short sequence reads. The Illumina® Genome Analyzer, which is based on the Solexa sequencing technology, uses PCR on a surface, and the Applied Biosystems SOLiD™ System uses emulsion PCR and sequencing by ligation. Both of these systems can produce short reads that are ideal for analyzing gene expression.

The NextGENe software package takes full advantage of the short sequencing reads and has tools for analyzing SAGE tags. SAGE libraries are available that contain lists of sequence tags associated with particular genes. NextGENe can load these libraries as a reference and align the sequence reads to the appropriate sequence tags. The alignment to the tag library is performed only in the forward orientation of the sequences; no reverse complementation is implemented. Digital gene expression reports are created to show the sequence of each tag, the coverage, gene names, and the location in the genome. New gene tags that are not in the library are also reported.
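Forward-only tag counting of this kind can be sketched in Python as follows. This is a simplified illustration, not NextGENe's alignment: it assumes reads already trimmed to tag length and matched exactly, and the function name, tag sequences, gene names and the minimum count for reporting a new tag are all hypothetical.

```python
from collections import Counter


def count_tags(reads, tag_library, new_tag_min_count=2):
    """Count reads per library tag in the forward orientation only.

    tag_library maps a tag sequence to a gene name. Reads that match no
    tag are tallied separately and reported as new tags once they reach
    new_tag_min_count (a hypothetical threshold)."""
    known, unknown = Counter(), Counter()
    for read in reads:
        if read in tag_library:  # exact forward match, no reverse complement
            known[read] += 1
        else:
            unknown[read] += 1
    report = {tag: (tag_library[tag], n) for tag, n in known.items()}
    new_tags = {tag: n for tag, n in unknown.items() if n >= new_tag_min_count}
    return report, new_tags


# Made-up library and reads for illustration only.
library = {
    "CATGAAACCCGGGTTTA": "geneA",
    "CATGTTTGGGCCCAAAC": "geneB",
}
gene_a_tag = "CATGAAACCCGGGTTTA"
reverse_complement = gene_a_tag[::-1].translate(str.maketrans("ACGT", "TGCA"))
reads = (
    [gene_a_tag] * 3
    + ["CATGTTTGGGCCCAAAC"]
    + ["CATGACGTACGTACGTA"] * 2  # not in the library
    + [reverse_complement]       # not counted toward geneA
)
report, new_tags = count_tags(reads, library)
```

Because matching is forward only, the reverse-complemented read does not add to the geneA count, while the unlisted tag seen twice is reported as new.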

The Sequence Alignment Tool has a Whole Genome View at the top of the screen, which shows each sequence of the library. Mousing over the library activates a yellow box containing the biological information for the tag that is currently at the cursor. The bottom of the screen contains all reads as they have been aligned to the library. 

 

NextGENe produces a chart with the sequence tag number on the x-axis and coverage of each tag on the y-axis. Most tags are expressed less than 500 times, but several genes show very high expression levels. Positions on this chart after 23K are new genes that have been added to the reference file because the sequence was found many times. Several of these new sequences were found in the project with expression levels above 4000. 

Click here to download Digital Gene Expression Application Note (PDF file)

Request Trial Data Analysis

Request a 30 day software trial

Trademarks are property of their respective owners.


 

Copyright © 2008
SoftGenetics, LLC
State College, PA 16803
Phone: 814-237-9340
Fax: 814-237-9343
info@softgenetics.com