What Are Contigs and Scaffolds in Genome Assemblies
Overview
Next-generation sequencing (NGS) technology has transformed genomics research over the past decade, making it possible to sequence the genome of virtually any organism. To date, most sequencing projects have used short-read technology, and assembling the large number of reads generated by NGS platforms into complete genomes remains challenging. In large part, this is because repetitive sequences are usually longer than the reads, so most assemblies are only draft genomes consisting of hundreds or even thousands of contigs (contiguous sequences). Long-read sequencing technologies, such as PacBio and Nanopore sequencing, generate reads long enough to span most repetitive sequences, and these reads can be used to close gaps in fragmented assemblies. Several algorithms have been developed to use long-read data for genome assembly.
Complete genomes are important for downstream sequence analysis and interpretation in many biological applications. During assembly, software merges small fragments into larger fragments, which are in turn assembled into contigs. The contigs are then assembled into scaffolds and finally into chromosomes. Thus, a contig is a continuous sequence of nucleotides, while a scaffold is a portion of the genome made up of contigs separated by gaps. Both contigs and scaffolds are reconstructed genomic sequences.
Figure: Overview of methods for long-range scaffolding (Tseng et al., 2015).
What Are Contigs?
The term "contig" is derived from "contiguous"; a contig is a continuous stretch of DNA sequence. It consists only of the four nucleotide bases, adenine (A), cytosine (C), guanine (G), and thymine (T), with no intervening gaps. Contigs are the building blocks of a scaffold: they are linked together when the scaffold is created, which requires additional information about their relative positions and orientations in the genome. Within the scaffold, gaps separate the contigs.
Creating contigs is a complex process that involves recognizing overlapping DNA fragments and aligning them to produce longer contiguous sequences. Advances in sequencing technology, particularly PacBio's HiFi sequencing, have greatly improved contig assembly. For example, HiFi sequencing can generate contiguous genome assemblies with contigs spanning millions of base pairs, encompassing entire genes, and even distinguishing chromosomes in polyploid organisms.
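The idea of recognizing overlapping fragments and merging them can be illustrated with a toy greedy merger. This is only a minimal sketch under strong assumptions: reads are error-free, all on the same strand, and matched by exact suffix-to-prefix overlap. Real assemblers use overlap or de Bruijn graphs and tolerate sequencing errors. The function names and the `min_overlap` parameter are hypothetical and chosen for illustration.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest exact overlap.

    Toy illustration only: assumes error-free, same-strand reads.
    """
    contigs = list(reads)
    while True:
        best = (0, -1, -1)  # (overlap size, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            # No remaining pair overlaps: what is left are the contigs.
            return contigs
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTA"]))
```

Reads that share no sufficiently long overlap stay as separate contigs, which is how a draft assembly ends up fragmented when reads cannot bridge a region.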
What Are Scaffolds?
While contigs provide continuity, scaffolding introduces a higher level of genome structure by linking contigs together. This linkage utilizes additional data about the relative position and orientation of contigs within the genome.
In scaffolds, contigs are interspersed with gaps. These gaps are usually represented by runs of the letter "N", which stand for missing genomic information. Although scaffolds help put fragmented genome assemblies derived from short-read sequencing into context, they have inherent limitations. For example, the gaps may coincide with key genomic regions or represent them inaccurately, leading to misinterpretation of spatial relationships or even underestimation of the amount of missing genetic data.
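As a rough illustration of how a scaffold combines contigs, orientation information, and "N" gaps, the sketch below joins contigs in a given order. The `Placement` class, its fields, and `build_scaffold` are hypothetical names, and the order, orientation, and gap sizes are assumed to be supplied by some external linking data; this is not the method of any particular scaffolding tool.

```python
from dataclasses import dataclass

# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


@dataclass
class Placement:
    """One contig in a scaffold (hypothetical structure for illustration)."""
    contig: str
    forward: bool = True   # False means the contig lies on the reverse strand
    gap_after: int = 0     # estimated number of unknown bases after this contig


def build_scaffold(placements: list[Placement]) -> str:
    """Join contigs in order, orienting each one and filling gaps with N."""
    parts = []
    for index, placement in enumerate(placements):
        seq = placement.contig
        if not placement.forward:
            seq = reverse_complement(seq)
        parts.append(seq)
        # No gap is written after the final contig.
        if index < len(placements) - 1:
            parts.append("N" * placement.gap_after)
    return "".join(parts)


if __name__ == "__main__":
    scaffold = build_scaffold([
        Placement("ACGT", forward=True, gap_after=3),
        Placement("AAAC", forward=False),
    ])
    print(scaffold)
```

The gap length here is only an estimate taken as input; if that estimate is wrong, the scaffold misstates how much sequence is actually missing, which is the limitation described above.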
Similarities Between Contigs and Scaffolds
Essentially, both contigs and scaffolds are reconstructed genomic sequences important in the genome assembly process.
Genomic composition: Both contigs and scaffolds contain nucleotide base sequences, making them important components of genome assembly.
Structural role: Both act as intermediate steps to bridge the gap between the raw sequencing data and the fully sequenced genome. While contigs represent the original gapless assembly, scaffolds provide a more comprehensive structure that integrates contigs and resolves their relative orientations.
Differences Between Contigs and Scaffolds
Understanding the subtle differences between contigs and scaffolds is crucial for any genomic researcher.
Presence of gaps: The most significant difference is structural. Contigs are continuous sequences without gaps, whereas scaffolds contain contigs interspersed with gaps that represent unknown or unresolved genomic regions.
Assembly complexity: Contig assembly focuses on local overlap and sequence comparison, and is a fundamental step in genome sequencing. Scaffold construction, on the other hand, requires additional and more complex information about the orientation and position of contigs.
Genome representation: Contigs are continuous and can capture specific genomic regions in their entirety. Scaffolds provide a broader view, but their gaps may coincide with key genomic regions, so critical genetic information may be missing.
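The structural difference can be checked mechanically: splitting a scaffold at its runs of "N" recovers the contigs it was built from, and counting the "N" characters shows how much of the scaffold is unresolved. The following is a minimal sketch assuming gaps are written as "N" or "n"; the function names and the `min_gap` parameter are hypothetical.

```python
import re


def split_scaffold(scaffold: str, min_gap: int = 1) -> list[str]:
    """Split a scaffold into contigs at runs of at least `min_gap` N characters."""
    pattern = "[Nn]{%d,}" % min_gap
    return [piece for piece in re.split(pattern, scaffold) if piece]


def gap_fraction(scaffold: str) -> float:
    """Fraction of scaffold positions that are gap characters (N)."""
    if not scaffold:
        return 0.0
    return scaffold.upper().count("N") / len(scaffold)


if __name__ == "__main__":
    scaffold = "ACGTNNNGTTT"
    print(split_scaffold(scaffold))
    print(gap_fraction(scaffold))
```

A sequence without any "N" comes back as a single piece, which matches the definition of a contig as a gapless sequence.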
Reference
Rice, Edward S., and Richard E. Green. "New approaches for genome assembly and scaffolding." Annual Review of Animal Biosciences 7 (2019): 17-40.