Kouros Owzar, PhD
Director of DCI Bioinformatics
Duke Box 2721
2424 Erwin Road, Suite 1102
11074 Hock Plaza
Durham, NC 27705
The Bioinformatics Shared Resource, a core function of Duke Cancer Institute (DCI), supports the bioinformatics research needs of DCI investigators, including their needs for complex genomic data management, data integration, computing and statistical analysis. Its mission is to provide a high-quality, service-oriented, coordinated and cost-efficient bioinformatics infrastructure for DCI researchers, one that increases collaborations across DCI programs and among the DCI, other Duke programs and external investigators. This mission is accomplished within a framework of adherence to sound data provenance and statistical principles, literate programming, and reproducible analysis.
The Bioinformatics Shared Resource covers every facet of analysis of high-dimensional genomic data, from the design stage through pre-processing (background correction, normalization and summarization of RNA microarrays; genotype and copy number calling from GWAS platforms; and alignment, normalization (RNA-seq) and SNV calling (DNA-seq) for next-generation sequencing [NGS] platforms), high-level association analyses with complex phenotypes, and genomic annotation of results. This core provides resources to DCI programs and individual laboratories, coordinates institutional efforts in bioinformatics, helps drive the development of the biotechnology and pharmaceutical sectors within Duke, and creates synergy between scientific and clinical groups. The importance of the Bioinformatics Shared Resource within DCI and at Duke has grown with the explosion of genomic, proteomic, metabolomic, and other high-throughput data types; these data carry vast potential utility for clinical and translational research, especially when combined with clinical, imaging, and other scientific data.
Investigators spanning scientific disciplines that use high-dimensional bioinformatics data (e.g., genomics, metabolomics, proteomics) can leverage our expertise to increase the quality and efficiency of complex, integrative, collaborative cancer research.
Services include consultation and bioinformatics programming to assist with study design and analysis, high-performance computing (HPC) leveraging CPUs and GPUs, data storage, and training and education of clinical, translational, and basic science investigators. Our overall goal is to apply expertise in bioinformatic technologies, statistics and information systems to the creation of systems for conducting reproducible research on high-dimensional data types. These systems will allow all raw data, analytical processes, and results to be stored and made publicly available under common standards so that they can be independently verified. The resource also provides state-of-the-art hardware and software to support a full range of research involving "-omics", with a particular emphasis on open development and open-source solutions.
Integration with other Duke Resources. Under the leadership of Dr. Owzar, the DCI Bioinformatics Shared Resource formally and actively collaborates with the Department of Biostatistics and Bioinformatics in the School of Medicine, the Duke Translational Medicine Institute (DTMI), the Duke Office of Information Technology (OIT), and the Duke Office of Clinical Research (DOCR) to provide expertise and resources specific to data storage, management and analysis of high-dimensional data types in an efficient manner. Dr. Owzar's overarching goal for these collaborations is to ensure that the DCI Bioinformatics Shared Resource takes full advantage of resources within Duke and avoids duplication of effort and expenditures as it meets the needs of DCI researchers.
The DCI Bioinformatics Shared Resource provides data management and analysis support for traditional microarray platforms (mRNA microarrays and genome-wide DNA arrays) and for next-generation high-throughput assays (DNA-seq and RNA-seq). It also provides support for candidate biomarker studies and cell-based assays, including flow cytometry data. The Resource covers every facet of analysis and management of project data, from the design stage through pre-processing and downstream analyses to annotation.
Software. The Bioinformatics Shared Resource adheres to an open-source software model to the fullest extent possible. To this end, the resource uses GNU/Linux as the operating system for its servers and several of its individual workstations. The R statistical environment, together with the Python and C/C++ programming languages, constitutes the main programming toolkit. R extension packages from the Comprehensive R Archive Network (CRAN) and the Bioconductor project are actively maintained on its servers. The resource also maintains a number of other R extension packages developed by its faculty members, along with developmental packages from R-Forge. Commercial software packages, available through Duke University site-license agreements, include SAS, Matlab, Maple and Mathematica. Linux ports of these software products are available and currently installed. For the production of reproducible reports, the shared resource uses the Sweave, knitr, Python Sphinx and IPython notebook systems.
The Bioinformatics Shared Resource has begun using the Mercurial source code management (SCM) software for its projects. Staff members work on local repositories and push their changes to a common server.
Assays and Platforms Supported by the Resource. A representative listing of cellular assays and platforms that are supported by the Bioinformatics Shared Resource is provided here.
Hardware. The computing hardware infrastructure of the Bioinformatics Shared Resource consists of dedicated hardware owned and managed by the resource and is further extended by hardware resources maintained by Duke OIT and by the DTMI. Additionally, the Bioinformatics Shared Resource takes advantage of commercial cloud computing resources including Amazon Web Services (AWS).
The personal and server computing resources owned by the Bioinformatics Shared Resource are managed by the DCI Information Systems (DCI IS) Shared Resource. The servers are housed on the 7th floor of Hock Plaza in a secure, temperature- and humidity-controlled computer room with FM200 fire suppression, UPS and emergency generator power protection. DCI IS provides both protected and DMZ network connections.
The HPC clusters managed by the Bioinformatics Shared Resource operate within Duke Medicine's protected networks, and security, access, and authorization measures have been taken that allow for the analysis of protected health information (PHI). Despite this level of protection, the phenotypic data will be anonymized or de-identified whenever possible. Duke computing resources that are not authorized for storing PHI are used exclusively for simulation studies.
Data Storage Options. A crucial and expensive aspect of genomic analysis is access to storage. We provide access to various data storage options to meet project requirements related to storage speed and space. The Bioinformatics Shared Resource offers storage to DCI investigators on its local storage servers for small to moderately sized projects. For large projects, the Bioinformatics Shared Resource uses storage solutions managed by Duke University and Amazon Web Services.
Duke University offers a wide variety of storage options that can be chosen based on the specific storage needs. These include the following three options:
Currently, the annual price for leasing 1TB of space from these Duke-owned storage resources is $1600, $500 and $300, respectively. For long-term storage of large data files that are not sensitive and do not need to be accessed frequently, the Amazon Glacier system from Amazon Web Services is used. Currently the monthly charge for storage is $0.01 per GB. For 1TB of space this amounts to an annual charge of $120. The Bioinformatics Shared Resource will facilitate the transition of DCI investigators to this resource when feasible.
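The arithmetic behind these figures can be sketched in a few lines of Python. This is an illustration only, not a pricing tool: it assumes 1TB = 1000GB (which reproduces the $120 figure), it covers storage charges only, and the option labels (`duke_option_1` and so on) are placeholders, since the three Duke options are not named in this paragraph.

```python
# Annual storage cost estimates based on the rates quoted in the text.
# Assumption: decimal terabyte (1 TB = 1000 GB), consistent with the $120 figure.
GB_PER_TB = 1000

# Annual lease price per TB for the three Duke options, in the order quoted.
# The labels are hypothetical placeholders, not official names.
DUKE_ANNUAL_USD_PER_TB = {
    "duke_option_1": 1600.0,
    "duke_option_2": 500.0,
    "duke_option_3": 300.0,
}

# Amazon Glacier storage charge quoted in the text (per GB, per month).
GLACIER_MONTHLY_USD_PER_GB = 0.01


def glacier_annual_cost(terabytes: float) -> float:
    """Annual Glacier storage charge in USD for the given number of TB."""
    return terabytes * GB_PER_TB * GLACIER_MONTHLY_USD_PER_GB * 12


def annual_costs(terabytes: float) -> dict[str, float]:
    """Annual cost in USD of each storage option for the given number of TB."""
    costs = {
        name: price * terabytes
        for name, price in DUKE_ANNUAL_USD_PER_TB.items()
    }
    costs["amazon_glacier"] = glacier_annual_cost(terabytes)
    return costs


if __name__ == "__main__":
    for name, cost in annual_costs(1).items():
        print(f"{name}: ${cost:,.2f} per year")
```

Calling `annual_costs(n)` with a project's expected size in TB gives a rough side-by-side comparison of the quoted rates; the rates themselves are the "current" prices stated above and would need updating as they change.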
The services and resources provided by the Bioinformatics Shared Resource are available to all Duke faculty who have been designated as DCI members. At present, the shared resource is exclusively focused on the research needs of DCI members (other bioinformatics resources at Duke are available for non-DCI members). Currently, all requests for resource services are communicated directly to Dr. Owzar by phone or email. Dr. Owzar evaluates each request and delegates tasks to an appropriate staff or faculty member. Upon receipt of the initial request, Dr. Owzar schedules an in-person or phone meeting with the requesting DCI research team and appropriate shared resource personnel. Presently, the initial response time is less than one business day. Dr. Owzar is responsible for prioritizing the resources, staff and hardware.
Currently, DCI members are not charged for Bioinformatics Shared Resource services, in order to lay the foundation for long-term scientific collaborations between DCI investigators and shared resource personnel. These long-term collaborations are expected to lead to grant applications and federal and industry contract applications in which shared resource staff and faculty are included as co-investigators. The Bioinformatics Shared Resource also does not charge DCI members for using its in-house computational hardware, including CPU cycles, GPUs, and local storage; these resources are available to DCI investigators on a 24-7-365 basis through secure access mechanisms (VPN and ssh). As described under Equipment, the resource also heavily leverages other computing resources, including storage, available at Duke and through Amazon Web Services. Any cost incurred for use of the latter resources is passed to the investigator; the Bioinformatics Shared Resource staff assists and trains DCI members interested in those resources at no cost.