Abstract

Accurate detection of somatic variants in cancer genomes remains significantly more challenging than germline variant detection, with typical error rates an order of magnitude higher. Multiple factors contribute to this disparity, including tumor heterogeneity, aneuploidy, widespread structural variation, and cross-sample contamination. However, additional key factors impeding progress include insufficient benchmark data for training and testing methods, limited adoption of long-read sequencing technologies, and reliance on linear reference genomes that introduce reference bias. We propose to address these challenges through three complementary aims.

First, we will expand our existing Cancer Standards Long-read Evaluation (CASTLE) collection to twelve tumor-normal cell line pairs, sequencing each with multiple technologies including Illumina, Oxford Nanopore, and PacBio HiFi. We will generate complete telomere-to-telomere germline genome assemblies for each line and create comprehensive benchmark variant sets validated across technologies. All data will be openly released without access restrictions.

Second, we will create new versions of our DeepSomatic variant caller that incorporate pangenome information by: (1) using pangenome-based read mapping to reduce reference bias, (2) incorporating complete haplotype information from the Human Pangenome Reference Consortium into variant inference, and (3) utilizing personalized pangenome references imputed from sequencing data.

Third, we will extend our Severus structural variant caller to work with both complete germline assemblies and pangenome references, exploring multiple approaches including direct mapping to diploid assemblies, mapping to merged diploid pangenome graphs, and using personalized pangenome references with imputed haplotypes.
The successful completion of these aims will provide essential benchmark data enabling further method development, improved methods for detecting both small variants and structural variants in cancer genomes, and standardized variant call sets for major cancer genomics projects. Our team brings together leading expertise in pangenomics, machine learning, and cancer genomics, positioning us to successfully execute this ambitious program.

Project Number: 1U01CA309342-01
Fiscal Year: 2026
NIH Institute/Center: National Cancer Institute (NCI)
Principal Investigator: Benedict Paten
Institution: UNIVERSITY OF CALIFORNIA SANTA CRUZ, SANTA CRUZ, CA
Award Amount: $593,867
Activity Code: U01
Study Section: Special Emphasis Panel [ZRG1 MGG-W (50)]
View on NIH RePORTER: https://reporter.nih.gov/project-details/11294518

Pangenome-Aware Methods for Accurate Somatic Variant Discovery in Cancer Genomics
