(A downloadable PDF is here: Research_Computing_Resources_and_Services_for_Researchers_20170307)
UNC-Chapel Hill recognizes that computational resources are an important part of research endeavors, and that research varies with respect to its data and processing demands, and also with respect to the need to compute, modify theories/codes, re-compute, etc. UNC-Chapel Hill is also committed to providing a base computational resource to help build research programs, to extend the value of extramural contracts/grants/awards, and to help sustain programs. The university acknowledges, too, that some projects may take weeks to realize and some may take decades. Research problems are not one-sized; therefore, computational demands are not one-sized.
SINGLE PARAGRAPH SUMMARY (IN THE BOX)
The Research Computing division of UNC-Chapel Hill provides expert scientific and information technology consultants and cyberinfrastructure to the scholarly and research community of the university. The consultation staff include eight scientists and scholars who have experience across a wide range of disciplinary communities from the physical sciences to the life sciences, from the computational sciences to clinical research, from social/behavioral sciences to the humanities. Cyberinfrastructure includes two large computational clusters. One cluster is designed specifically for high-performance computing needs with more than 8000 cores, parallel scratch filesystem, and low-latency interconnect fabric. The second cluster is designed specifically for high-throughput and data-intensive processing needs, including “Big Data” workloads: it contains more than 3000 cores, five 3-TB memory nodes, and 102,400 CUDA cores, with 2-PB of high performance storage. For permanent storage, Research Computing offers a 4-PB storage cluster mounted via NFS and 6-PB of active archive. For smaller scale needs, Research Computing provides a self-service private cloud for virtual scientific workstations and a secure enclave for computing on sensitive and regulated data (there are also secure file transfer solutions). Cyberinfrastructure administration (i.e., eight systems administrators) and consultation are available at no cost to researchers. With respect to cyberinfrastructure, Research Computing provides an institutional allocation for each element, and incremental charges for resources above that allocation. The division’s aim is to ensure that research efforts have a stable, consistent, available, and expert resource for all phases of the research lifecycle.
I. Consultation and Engagement
Research Computing includes an “Engagement Team” of experienced scientists who are also adept with various computational, information-processing, and data management techniques.
The Engagement Team is loosely organized by disciplinary families:
- Physical, Information, Mathematical, Computer Science (3 FTE)
- Life and Environmental Science (2 FTE)
- Health Outcomes and Clinical Research (1 FTE)
- Economics, Social and Behavioral Science, Business (1 FTE)
- Humanities (1 FTE)
If a project does not fit one of the above families easily, we assign an engagement member as appropriate. Engagement team members perform three general functions: (i) user/group onboarding, (ii) disciplinary/project outreach, (iii) advanced consultations. The Engagement Team also conducts select short course training.
Contributions by engagement team members range from co-investigation and article co-authorship to assisting lab teams with job submission scripts, to collaborating on scientific workshops.
II. Institutional Research CyberInfrastructure
A. “Cluster-scale” Computation and Information-Processing
High Performance Computation
Killdevil is a 772 node (9152 core) Dell Linux cluster with QDR Infiniband interconnect and a minimum of 4-GB memory per core, and two 32-core hosts with one terabyte of memory each to accommodate codes that require extremely large amounts of RAM. Killdevil also includes 64 NVidia Tesla GPUs (M2070). A 125-TB Lustre parallel filesystem is presented to Killdevil over Infiniband. Killdevil uses the IBM LSF batch scheduling system. A high performance NFS scratch filesystem of 225-TB is presented to Killdevil over Ethernet. Also, a permanent 4-PB high performance scale-out NFS storage cluster on Dell/EMC Isilon X-series was installed in 2016 as a lifecycle replacement of a prior system; Killdevil nodes may access this space by request if required.
Kure is a 220 node (1760 core) HP Linux cluster, with QDR Infiniband interconnect and at least 6-GB of memory per core. Kure uses the IBM LSF batch scheduling system. A high performance NFS scratch filesystem of 225-TB is presented to Kure over Ethernet. Kure has access to the 4-PB scale-out NFS service from the Dell/EMC X-series storage cluster via high-bandwidth network paths.
Prior to July 1, 2017, Research Computing will implement a new cluster explicitly designed for MPI and/or OpenMP+MPI hybrid (or relevantly similar) workloads typical of disciplines and programs that have significant calculation and/or simulation workloads. Whether in the initial implementation or as subsequent additions, the new cluster is to have high-end GPU and Xeon Phi compute capability as well.
Research groups, programs, investigators, and users in general whose typical workloads are MPI and/or OpenMP+MPI hybrid (or relevantly similar) will be provided access to and resource allocations on the new cluster, Dogwood, post-implementation.
High-throughput, data-intensive, regulated-data, and big-data computation
Longleaf is a new cluster explicitly designed to address the computational, data-intensive, memory-intensive, and big data needs of researchers and research programs that require scalable information-processing capabilities that are not of the MPI and/or OpenMP+MPI hybrid variety. Longleaf includes 117 “General-Purpose” nodes (24-cores each; 256-GB RAM; 2x10Gbps NIC), 24 “Big-Data” nodes (12-cores each; 256-GB RAM; 2x10Gbps; 2x40Gbps), 5 large memory nodes (3-TB RAM each), and 5 “GPU” nodes each with GeForce GTX1080 cards (102,400 CUDA cores in total). Longleaf also has zero-hop connections to a high-performance and high-throughput parallel filesystem (GPFS; a.k.a., “IBM SpectrumScale”) and storage subsystem with 14 controllers, over 225-TB of high-performance SSD disk storage, and approximately 2-PB of high-performance SAS disk. The nodes include local SSD disks for a GPFS Local Read-Only Cache (“LRoC”) that serves the most frequent metadata and data/file requests on the node itself, thus eliminating traversals of the network fabric and disk subsystem. Both General-Purpose and Big-Data nodes have 68-GigaBytes/second of memory bandwidth. General-Purpose nodes have 10.67GB of memory per core and 53.34-Megabytes/second of network bandwidth per core. Big-Data nodes have 21.34GB of memory per core and 213.34-Megabytes/second of network bandwidth per core. Longleaf uses the SLURM resource management and batch scheduling system. Longleaf’s total conventional compute core count is 6,496 cores (note: this count reflects that hyperthreading is enabled).
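The per-core memory and CUDA core figures above can be checked with simple arithmetic. The sketch below is illustrative only: the node counts and RAM sizes come from the description above, while the figure of 2,560 CUDA cores per GeForce GTX 1080 card is taken from the card's published specification rather than from this document, and the function names are hypothetical.

```python
# Node specifications as quoted in the Longleaf description.
NODE_TYPES = {
    "General-Purpose": {"cores": 24, "ram_gb": 256},
    "Big-Data": {"cores": 12, "ram_gb": 256},
}

# Assumption (not stated in this document): one GTX 1080 card has 2,560 CUDA cores.
CUDA_CORES_PER_GTX1080 = 2560
TOTAL_CUDA_CORES = 102_400  # quoted total across all GPU nodes
GPU_NODES = 5               # quoted number of GPU nodes


def memory_per_core(cores: int, ram_gb: float) -> float:
    """Return RAM per core in GB for one node."""
    return ram_gb / cores


def gpu_cards_per_node(total_cuda_cores: int, cores_per_card: int, nodes: int) -> int:
    """Return the number of cards per node implied by a total CUDA core count."""
    cards, remainder = divmod(total_cuda_cores, cores_per_card)
    if remainder or cards % nodes:
        raise ValueError("total does not divide evenly into cards and nodes")
    return cards // nodes


per_core = {
    name: round(memory_per_core(**spec), 2) for name, spec in NODE_TYPES.items()
}
cards_per_node = gpu_cards_per_node(TOTAL_CUDA_CORES, CUDA_CORES_PER_GTX1080, GPU_NODES)

for name, gb in per_core.items():
    print(f"{name}: {gb} GB of memory per core")
print(f"Implied GTX 1080 cards per GPU node: {cards_per_node}")
```

Under these assumptions the memory-per-core values agree with the quoted figures to within rounding, and the quoted CUDA core total corresponds to eight cards in each GPU node.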
Also, a permanent 4PB high performance scale-out NFS storage cluster on Dell/EMC Isilon X-series was recently installed in 2016 as a lifecycle replacement of a prior system; this storage is presented to all Longleaf nodes.
Research groups, programs, investigators, and users in general whose typical workloads are best satisfied by Longleaf are provided access to and resource allocations there.
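Because Longleaf uses SLURM, work is normally submitted as a batch script whose `#SBATCH` lines request resources. The following is a minimal sketch, not Research Computing's recommended template: the helper `render_batch_script`, the partition name `general`, and the example job name and command are all hypothetical, and actual partition names and limits should be confirmed with Research Computing. Only standard `sbatch` options are used.

```python
def render_batch_script(job_name, command, cpus=1, mem_gb=4,
                        time_limit="01:00:00", partition="general"):
    """Build the text of a single-task SLURM batch script.

    The partition default is a placeholder, not a real Longleaf partition.
    """
    if cpus < 1 or mem_gb < 1:
        raise ValueError("cpus and mem_gb must be positive")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",        # hypothetical partition name
        "#SBATCH --ntasks=1",                      # one task (non-MPI workload)
        f"#SBATCH --cpus-per-task={cpus}",         # cores for that task
        f"#SBATCH --mem={mem_gb}G",                # memory per node
        f"#SBATCH --time={time_limit}",            # wall-clock limit, HH:MM:SS
        f"#SBATCH --output={job_name}-%j.out",     # %j expands to the job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: a 4-core, 16-GB task with a two-hour limit.
script = render_batch_script(
    "align_reads",
    "python align.py sample01.fastq",
    cpus=4,
    mem_gb=16,
    time_limit="02:00:00",
)
print(script)
```

The printed text could be saved to a file and submitted with `sbatch`. Requesting only the cores, memory, and time a job needs generally helps the scheduler place it sooner.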
Per Research Computing’s cluster lifecycle strategy, Killdevil and Kure will be retired or repurposed once all researchers, research programs, etc., have been provided appropriate access to and allocations on Longleaf and/or the new cluster system. These resources, and the lifecycle of them, are institutionally supported/provided.
B. Permanent storage systems and data management
For comparatively large capacity permanent storage, Research Computing presents a 4PB high performance scale-out NFS storage cluster on Dell/EMC Isilon X-series (installed in 2016 as a lifecycle replacement of a prior system). Researchers whose research requires it may receive a 5-TB institutional allocation upon request. On a project-by-project basis, researchers may request additional storage space (usually not to exceed 25-TB of added space) for the duration of a time-delimited project (usually not to exceed 3-years), pending available capacity. For space in excess of 100-TB, Research Computing passes on the cost of the incremental infrastructure required for a term of 4-years (at present, the cost for 100-TB is approximately $90,000).
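For budgeting purposes, the quoted figure can be turned into rough unit rates. The sketch below is only an illustration: it assumes the approximately $90,000 for 100-TB over a 4-year term scales linearly with capacity, which the text does not state, and the function name is hypothetical. Actual quotes come from Research Computing.

```python
# Figures quoted in the text: about $90,000 for 100 TB over a 4-year term.
QUOTED_COST_USD = 90_000
QUOTED_CAPACITY_TB = 100
TERM_YEARS = 4

# Implied unit rates, assuming cost is proportional to capacity (an assumption).
cost_per_tb_term = QUOTED_COST_USD / QUOTED_CAPACITY_TB
cost_per_tb_year = cost_per_tb_term / TERM_YEARS


def estimate_term_cost(capacity_tb: float) -> float:
    """Rough 4-year cost for a given capacity under the linear assumption."""
    if capacity_tb < 0:
        raise ValueError("capacity_tb must be non-negative")
    return capacity_tb * cost_per_tb_term


print(f"Implied rate: ${cost_per_tb_term:,.0f} per TB per 4-year term")
print(f"Implied rate: ${cost_per_tb_year:,.0f} per TB per year")
print(f"Estimate for 150 TB: ${estimate_term_cost(150):,.0f}")
```

A proposal budget would substitute the capacity actually required and the current rate confirmed by Research Computing.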
Network Attached Storage (NAS)
Researchers have access to Netapp filer storage providing predominantly NFS (and also CIFS for specific use cases). High-performance storage is delivered via SATA disks; extreme-performance storage is delivered via SAS disks. All storage is configured with large controller caches and redundant hardware components to protect against single points of failure. This storage space is “snapshotted” in order to support file recovery in the event of accidental deletions. Faculty receive an institutional allocation of 10-GB per person; additional storage is available at incremental cost.
For active archive, Research Computing offers Quantum StorNext active archive with 600TB disk cache, and in excess of 4PB tape storage. Data are protected against media failure via two copies, and encrypted on tape. Faculty receive an institutional allocation of 2TB per person; laboratories and project teams receive an institutional allocation of 10-TB per person. Additional capacity is available at incremental cost.
To facilitate the deposition of files/data from external organizations into UNC-Chapel Hill, Research Computing offers a secure file-transfer-protocol service that allows files/data to be uploaded but prohibits downloading. This file transfer service meets additional IT-Security requirements for sensitive data.
Research Computing supports Globus (http://www.globus.org) for secure data/file transfer amongst participating institutions.
Research Computing offers schemas on managed Oracle databases sufficient for many small to medium sized research projects. These include patching, general database administration, and transparent database/datafile encryption.
MySQL and PostgreSQL are available, on a case-by-case basis, where there is an ongoing engagement project and the request fits within available resources and projects.
C. Secure Research Workspace
Redesigned and re-architected in 2013 by Research Computing, the Secure Research Workspace (SRW) contains computational and storage resources specifically designed for management and interaction with high-risk data. The SRW is used for storage and access to Electronic Health Records (EHR) and other highly sensitive or regulated data; it includes technical and administrative controls that satisfy applicable institutional policies. SRW is specifically designed to be an enclave that minimizes the risk of storing and computing on regulated or sensitive data.
Technically, the SRW is an advanced implementation of a Virtual Desktop Infrastructure (VDI) system based on VMware Horizon View, Cisco Unified Computing System, and Netapp Clustered Data ONTAP (comprising standard disk and flash arrays), with network segmentation and protection guaranteed by design, by adaptive Palo Alto enterprise firewalls, and by enterprise TippingPoint Intrusion Prevention System appliances. Access controls and permissions are managed via centrally administered systems and technologies appropriate to ensure security practices and procedures are correctly and consistently applied.
ITS-Research Computing consults with the investigator or research group to arrive at a reasonable initial configuration suitable for their respective project(s).
The default software installed is:
• Adobe Reader
• ArcGIS Workflow Manager
• ERD Concepts 6
• Google Chrome
• Internet Explorer
• Java Runtime
• Java Development Kit
• Microsoft Accessories Bundle
• Microsoft Sharepoint Workspace
• Microsoft Silverlight
• Microsoft SQL Server 2008
• Oracle Client
• Stata 13
In addition, Data Leakage Prevention software is available for install on systems that enable data ingress and egress but require detailed access and transfer logging, or that require additional server-level controls. Two-step (or “two-factor”) authentication is also available as required or requested.
D. Virtual Computing Lab
Research Computing provides a self-service private cloud virtualization service called “Virtual Computing Lab” (VCL) to UNC-Chapel Hill researchers at http://vcl.unc.edu. Originally developed by NC State University in collaboration with IBM, VCL (see http://vcl.apache.org) provides researchers with anytime, anywhere access to custom application environments created specifically for their use.
With only a web-browser, users can make a reservation for an application, either in advance or immediately, and the VCL will provision that application on a centrally maintained server, and provide the user with remote access to that server.
VCL provides users remote access to hardware and software that they would otherwise have to install themselves on their own systems, or visit a computer lab to use. It also reduces the burden on computer labs to maintain large numbers of applications on individual lab computers, where in many cases it’s difficult for some applications to coexist on the same machine. In the VCL, operating system images with the desired applications and custom configurations are stored in an image library, and deployed to a server on-demand when a user requests it.
E. Select Commercial Scientific Software
Research Computing licenses commercial software to support the research community at UNC-Chapel Hill. Notable software includes:
- Biovia (DiscoveryStudio, MaterialsStudio); formerly “Accelrys”
- Cambridge Crystallographic
- Globus Connect
- Harris Geospatial Solutions (ENVI+IDL); formerly “Exelis”
- Intel Compilers
- KEGG Database
- nQuery (Statistical Solutions)
- Portland Group (Fortran/C/C++)
- RogueWave (TotalView and IMSL)
- Scientific Computing Modeling (ADF and BAND Modeling Suite)
- StataCorp (Stata/SE)
- Certara (SYBYL)
- Wolfram (Mathematica)
The above list is not exhaustive.
III. Training
Research Computing offers short courses during the Summer, Fall and Spring terms. Courses are:
- Linux: Intermediate
- Linux: Introduction
- Matlab: Intermediate
- Matlab: Introduction
- Python for Scientific Computing
- Python Workshop
- Scientific Computing: Gaussian and GaussView
- Scientific Computing: Introduction to Computational Chemistry
- Shell Scripting
- TarHeel Linux
- Using Research Computing Clusters
- Web Scraping
IV. Costs to researchers
Research Computing’s suite of services and support has broadened significantly since the most recent charging structure was established (in 2013). The cost structure is due for review and refactoring.
The fact of the matter is that we are actively re-orienting our cyberinfrastructure to the needs of the research programs at UNC-Chapel Hill. Coarse and piecemeal approaches like “core hours” charges and/or “storage” charges assume a mostly homogeneous workload, but the resource demands of different disciplines are sufficiently varied that these approaches unintentionally disadvantage some domains of inquiry. Worse still, they make us least likely to observe the capacity demands that mixed workloads and emerging workloads present; thus, they reduce our ability to respond to the needs of the research community. Additional kinds of cyberinfrastructure exhibit similar complexity.
Given the newness of our approach, and the fact that we are re-orienting overtly and intentionally to respond to a broader suite of demands from a broader array of research pursuits, we do not yet know which dimension (or dimensions) of resource will drive cost. Nor do we know precisely how many materially different “workload profiles” we will observe. In short, we need to see what happens, measure what happens, and perform some analysis, in order to know enough to frame a costing structure (or: to know enough to justify it).
Our approach is to apply technical resource limits that will (i) help us to measure the relevant dimensions of resource demands, (ii) facilitate incremental adjustment as our observations of the actual job streams suggest, and of course (iii) protect against over-consuming users or runaway tasks/workload/resource-use.
With this in view, we encourage investigators to consult with us on project proposals so we can bring to bear the full suite of Research Computing capabilities, services, and cyberinfrastructure initiatives/projects.