High-Performance Computing and Storage
Compute
The Bioinformatics Shared Resource leverages both local and cloud computing environments to meet the needs of large-scale omics research.
We have exclusive access to three local compute servers:
- A 40-core (80-thread) Intel Xeon E5-2698 v4 server with 1TB of RAM and a 146TB RAID 10 storage array
- A 64-core AMD Opteron 6386 SE server with 512GB of RAM and a 44TB RAID 10 storage array
- A 48-core AMD Opteron 6180 SE server with 256GB of RAM and a 34TB RAID 10 storage array
We regularly utilize Duke’s cluster computing resources:
The Duke Compute Cluster (DCC) consists of over 30,000 vCPU-cores and 730 GPUs, with underlying hardware comprising Cisco Systems UCS blades in Cisco chassis. GPU-accelerated nodes are Silicon Mechanics systems with a range of Nvidia GPUs, including high-end “computational” GPUs (V100, P100) and “graphics” GPUs (TitanXP, RTX2080TI). Interconnects are 10 Gbps. General storage partitions reside on Isilon network-attached storage arrays connected at 40Gbps or 10Gbps. The cluster provides 1TB of group storage and 10GB for each personal home directory, along with 450TB of scratch storage and archival storage at a cost of $0.08/GB/year. This system may not be used for storage or analysis of sensitive data. See https://dcc.duke.edu/ for additional information.
The HARDAC cluster consists of 1512 physical CPU cores and 15TB of RAM distributed over 60 compute nodes. For computing with high-volume genomics data, HARDAC is equipped with high-performance network interconnects (InfiniBand) and an attached high-performance parallel file system providing roughly 1.2 petabytes of mass storage. All nodes are interconnected with 56Gbps FDR InfiniBand, and the cluster’s data transfer node is linked to the Duke Health Technology Services (DHTS) network through pair-bonded 10Gb Ethernet switches. The attached mass storage runs IBM’s General Parallel File System (GPFS), which is managed through two redundant GPFS NSD server nodes and designed to sustain an average input/output read rate of ~5GB per second. See https://genome.duke.edu/cores-and-services/computational-solutions/compute-environments-genomics for additional details.
Finally, the DHTS Azure School of Medicine HPC (DASH) cloud-based cluster can scale up to 13 nodes: up to 10 “Execute” partition nodes, each with 32 vCPUs, 256 GB of RAM, 1200 GB of attached SSD temporary storage, and 16,000 Mbps network bandwidth, and up to 3 “highmem” partition nodes, each with 96 vCPUs, 672 GB of RAM, 3600 GB of attached SSD temporary storage, and 35,000 Mbps network bandwidth. Scratch space is provided by a 2 TiB Lustre Marketplace filesystem backed by Azure Blob container storage, with up to 5 PB of storage available. See https://wiki.duke.edu/display/DAH/DHTS+Azure+HPC+Home for additional details.
Storage
To meet the high-capacity storage demands of high-throughput sequencing data, the Bioinformatics Shared Resource integrates its workflows with and promotes the adoption of DHTS-supported cloud storage, including Azure storage containers and Amazon Web Services (AWS) Simple Storage Service (S3). Both services offer secure data storage that scales automatically to match our needs and those of our collaborators; both can be accessed directly from all active compute environments; and both utilize intelligent storage tiering, allowing them to also serve as study data archives.
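As an illustration, the following minimal R sketch shows how sequencing output could be copied to and retrieved from an S3 bucket from within any of our compute environments using the CRAN aws.s3 package; the bucket and object names are hypothetical, and AWS credentials are assumed to be configured in the environment.

    # Illustrative only: bucket and object names are hypothetical, and AWS
    # credentials are assumed to be supplied via the standard environment
    # variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION).
    library(aws.s3)

    # Archive a run's gene-level count matrix to the project bucket.
    put_object(
      file   = "results/gene_counts.tsv.gz",
      object = "projectX/run01/gene_counts.tsv.gz",
      bucket = "bsr-study-archive"
    )

    # Later, restore the same object to local scratch space for reanalysis.
    save_object(
      object = "projectX/run01/gene_counts.tsv.gz",
      bucket = "bsr-study-archive",
      file   = "scratch/gene_counts.tsv.gz"
    )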
Software
The Bioinformatics Shared Resource adheres to the principles of sound data provenance, literate programming, and reproducible analysis. To this end, we utilize the open-source software model to the fullest extent possible. The resource uses GNU/Linux as the operating system for its servers and individual workstations.
The R statistical environment, along with the Python and C/C++ programming languages, constitutes our main programming toolkit. R extension packages from the Comprehensive R Archive Network (CRAN) and the Bioconductor project are actively maintained on our servers, and the resource also maintains several R extension packages developed by its faculty members. Duke University site-license agreements provide access to commercial software packages, including SAS, MATLAB, Maple, and Mathematica; Linux ports of these products are available and currently installed.
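For example, CRAN and Bioconductor packages are kept current on our servers with standard R tooling along the lines of the following sketch (the package names shown are examples only):

    # Illustrative maintenance commands; package names are examples only.
    install.packages("data.table")                       # a CRAN package

    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    BiocManager::install(c("DESeq2", "limma", "edgeR"))  # Bioconductor packages

    update.packages(ask = FALSE)                         # keep installed CRAN packages current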
For the production of reproducible reports, the shared resource uses the knitr, RMarkdown, and Jupyter notebook systems. Analysis code and software pipelines are maintained under strict source code management in Duke’s internal GitLab repository, managed by OIT. The GitLab infrastructure includes functionality for automated container building, which further supports the development of the shared resource’s preprocessing and analysis pipelines. The associated repositories can then be made public to accompany manuscript publication.
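As a sketch of this reporting workflow, an analysis written as an RMarkdown document (the file and directory names below are hypothetical) is rendered to a self-contained report with a single call, so that narrative, code, and results are always regenerated together:

    # Render an RMarkdown analysis to HTML (file and directory names are
    # hypothetical); knitr re-executes every code chunk, so the report
    # always reflects the exact code and data it was built from.
    rmarkdown::render(
      input         = "analysis/differential_expression.Rmd",
      output_format = "html_document",
      output_dir    = "reports"
    )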
We are actively working to translate our suite of pipelines to the Nextflow DSL, utilizing Singularity software containers. Containerization standardizes workflows and makes them portable while still allowing them to be optimized for the available computational resources, as we work to both improve workflow efficiency and reduce computing costs. This transition will ensure full portability of our pipelines across computing environments. Utilization of these tools, together with the deposition of source data into public repositories (e.g., GEO, SRA, or dbGaP), ensures end-to-end reproducibility of all of our statistical analyses.
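Because the source data are deposited publicly, any reader can re-obtain them programmatically and re-run the containerized pipelines. As a minimal sketch, expression data deposited in GEO can be retrieved with the Bioconductor GEOquery package (the accession shown is a placeholder, not one of our studies):

    # Retrieve a deposited series from GEO (placeholder accession) and
    # extract the expression matrix and sample annotations for reanalysis.
    library(GEOquery)
    gse   <- getGEO("GSE00000", GSEMatrix = TRUE)[[1]]
    expr  <- Biobase::exprs(gse)   # expression matrix
    pheno <- Biobase::pData(gse)   # sample-level metadata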