High-Performance Computing and Storage
Compute
The Bioinformatics Shared Resource leverages both local and cloud computing environments to meet the needs of large-scale omics research.
We have exclusive access to three local compute servers:
- A 40-core (80-thread) Intel Xeon E5-2698 v4 server with 1TB of RAM and a 146TB RAID 10 storage array
- A 64-core AMD Opteron 6386 SE server with 512GB of RAM and a 44TB RAID 10 storage array
- A 48-core AMD Opteron 6180 SE server with 256GB of RAM and a 34TB RAID 10 storage array
We regularly utilize Duke’s cluster computing resources:
The Duke Compute Cluster (DCC) consists of over 30,000 vCPU-cores and 730 GPUs, with underlying hardware comprising Cisco Systems UCS blades in Cisco chassis. GPU-accelerated nodes are Silicon Mechanics systems with a range of Nvidia GPUs, including high-end “computational” GPUs (V100, P100) and “graphics” GPUs (TitanXP, RTX2080TI). Interconnects are 10 Gbps. General storage partitions reside on Isilon network-attached storage arrays connected at 40Gbps or 10Gbps. The cluster provides 1TB of group storage and 10GB for each personal home directory, along with 450TB of scratch storage and archival storage at a cost of $0.08/GB/year. This system may not be used for storage or analysis of sensitive data. See https://dcc.duke.edu/ for additional information.
The HARDAC cluster consists of 1512 physical CPU cores and 15TB of RAM distributed over 60 compute nodes. For computing with high-volume genomics data, HARDAC is equipped with high-performance network interconnects (InfiniBand) and an attached high-performance parallel file system providing roughly 1.2 petabytes of mass storage. All nodes are interconnected with 56Gbps FDR InfiniBand, and the cluster’s data transfer node is linked to the Duke Health Technology Services (DHTS) network through pair-bonded 10Gb Ethernet switches. The attached mass storage runs IBM’s General Parallel File System (GPFS), which is managed through two redundant GPFS NSD server nodes and designed to sustain an average input/output read rate of ~5GB per second. See https://genome.duke.edu/cores-and-services/computational-solutions/compute-environments-genomics for additional details.
Finally, the DHTS Azure School of Medicine HPC (DASH) cloud-based cluster can scale up to 13 nodes: up to 10 “Execute” partition nodes, each with 32 vCPUs, 256 GB of RAM, 1200 GB of attached SSD temporary storage, and 16,000 Mbps network bandwidth, and up to 3 “highmem” partition nodes, each with 96 vCPUs, 672 GB of RAM, 3600 GB of attached SSD temporary storage, and 35,000 Mbps network bandwidth. Scratch space is provided by a 2 TiB Lustre Marketplace filesystem backed by Azure Blob container storage, with up to 5 PB of storage available. See https://wiki.duke.edu/display/DAH/DHTS+Azure+HPC+Home for additional details.
Storage
To meet the high-capacity storage demands of high-throughput sequencing data, the Bioinformatics Shared Resource integrates its workflows with and promotes the adoption of DHTS-supported cloud storage, including Azure storage containers and Amazon Web Services (AWS) Simple Storage Service (S3). Both services offer secure data storage that scales automatically to match our needs and those of our collaborators; both can be accessed directly from all active compute environments; and both utilize intelligent storage tiering, allowing them to also serve as study data archives.
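As an illustration, the following minimal R sketch shows how sequencing output could be copied to and retrieved from an S3 bucket from within any of our compute environments using the CRAN aws.s3 package; the bucket and object names are hypothetical, and AWS credentials are assumed to be configured in the environment.

    # Illustrative only: bucket and object names are hypothetical, and AWS
    # credentials are assumed to be supplied via the standard environment
    # variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION).
    library(aws.s3)

    # Archive a run's gene-level count matrix to the project bucket.
    put_object(
      file   = "results/gene_counts.tsv.gz",
      object = "projectX/run01/gene_counts.tsv.gz",
      bucket = "bsr-study-archive"
    )

    # Later, restore the same object to local scratch space for reanalysis.
    save_object(
      object = "projectX/run01/gene_counts.tsv.gz",
      bucket = "bsr-study-archive",
      file   = "scratch/gene_counts.tsv.gz"
    )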
Software
The Bioinformatics Shared Resource adheres to the principles of sound data provenance, literate programming, and reproducible analysis. To this end, we utilize the open-source software model to the fullest extent possible. The resource uses GNU/Linux as the operating system for its servers and individual workstations.
The R statistical environment, along with the Python and C/C++ programming languages, constitutes our main programming toolkit. R extension packages from the Comprehensive R Archive Network (CRAN) and the Bioconductor project are actively maintained on our servers, and the resource also maintains several R extension packages developed by its faculty members. Duke University site-license agreements provide access to commercial software packages, including SAS, MATLAB, Maple, and Mathematica; Linux ports of these products are available and currently installed.
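For example, CRAN and Bioconductor packages are kept current on our servers with standard R tooling along the lines of the following sketch (the package names shown are examples only):

    # Illustrative maintenance commands; package names are examples only.
    install.packages("data.table")                       # a CRAN package

    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    BiocManager::install(c("DESeq2", "limma", "edgeR"))  # Bioconductor packages

    update.packages(ask = FALSE)                         # keep installed CRAN packages current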
For the production of reproducible reports, the shared resource uses the knitr, RMarkdown, and Jupyter notebook systems. Analysis code and software pipelines are maintained under strict source code management in Duke’s internal GitLab repository, managed by OIT. The GitLab infrastructure includes functionality for automated container building, which further supports the development of the shared resource’s preprocessing and analysis pipelines. The associated repositories can then be made public to accompany manuscript publication.
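As a sketch of this reporting workflow, an analysis written as an RMarkdown document (the file and directory names below are hypothetical) is rendered to a self-contained report with a single call, so that narrative, code, and results are always regenerated together:

    # Render an RMarkdown analysis to HTML (file and directory names are
    # hypothetical); knitr re-executes every code chunk, so the report
    # always reflects the exact code and data it was built from.
    rmarkdown::render(
      input         = "analysis/differential_expression.Rmd",
      output_format = "html_document",
      output_dir    = "reports"
    )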
We are actively working to translate our suite of pipelines to the Nextflow DSL, utilizing Singularity software containers. Containerization standardizes workflows and makes them portable while still allowing them to be optimized for the available computational resources, as we work to both improve workflow efficiency and reduce computing costs. This transition will ensure full portability of our pipelines across computing environments. Utilization of these tools, together with the deposition of source data into public repositories (e.g., GEO, SRA, or dbGaP), ensures end-to-end reproducibility of all of our statistical analyses.
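Because the source data are deposited publicly, any reader can re-obtain them programmatically and re-run the containerized pipelines. As a minimal sketch, expression data deposited in GEO can be retrieved with the Bioconductor GEOquery package (the accession shown is a placeholder, not one of our studies):

    # Retrieve a deposited series from GEO (placeholder accession) and
    # extract the expression matrix and sample annotations for reanalysis.
    library(GEOquery)
    gse   <- getGEO("GSE00000", GSEMatrix = TRUE)[[1]]
    expr  <- Biobase::exprs(gse)   # expression matrix
    pheno <- Biobase::pData(gse)   # sample-level metadata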