High-level Infrastructure Architecture

 

Ethernet

The Ethernet network is used for cluster management:

1. Console access (iDRAC/BMC)

2. Compute node provisioning using xCAT

3. Grid internal communications (scheduling, remote login, naming services, etc.)

InfiniBand

InfiniBand is used for storage access and MPI traffic.

Access from HaifaU LAN

All grid components, including compute nodes, are accessible from the University of Haifa LAN.

Storage

Lustre is a high-performance parallel distributed file system designed for HPC and supercomputing environments. It provides scalable, high-speed storage, enabling efficient data access for large-scale computational workloads.

Servers:

Lustre Storage Servers - Redundant servers support storage failover, while metadata and data are stored on separate servers, allowing each to be optimized for its workload. Lustre can deliver fast I/O to applications across high-speed network fabrics such as Ethernet, InfiniBand (IB), and Omni-Path (OPA).

Storage server - The cluster's storage system is an HPE Cray ClusterStor E1000. Note that this storage is intended for ongoing analyses; it is not an archive system. It is a distributed file system built for high performance: fast reading and writing of many files, both large and small. It is composed of many disks but presented as a single storage volume with a total capacity of 919 TB. This volume is built from a hybrid set of disks, both HDD and SSD, which ensures high performance across different use cases.
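As a sketch of day-to-day interaction with the Lustre file system: on a node with the Lustre client mounted, the standard `lfs` tool can inspect capacity and tune file striping. The mount point and directory paths below are hypothetical examples; these commands only work inside the cluster environment.

```shell
# Show per-target usage of the Lustre file system
# (requires a Lustre client mount; "/hive2" is a hypothetical mount point).
lfs df -h /hive2

# Stripe a directory's new files across 4 OSTs for faster parallel I/O,
# then verify the layout (directory path is a hypothetical example).
lfs setstripe -c 4 /hive2/projects/mylab/large-files
lfs getstripe /hive2/projects/mylab/large-files
```

Striping wider helps large sequential I/O; small files are usually best left at the default stripe count.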

Please note that your files on the Hive2 storage system are NOT BACKED UP by default. It is your responsibility to back up your vital files and irreplaceable data. While the E1000 is a highly resilient system, any system carries a risk of failure and data loss, so please make sure to back up important files.
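Since home directories are not backed up by default, a minimal self-service approach is to mirror important directories with rsync. The sketch below wraps this in a small helper; both paths in the example invocation are hypothetical placeholders.

```shell
# backup_dir SRC DEST -- mirror SRC into DEST with rsync.
# -a preserves permissions and timestamps; --delete removes files from
# DEST that no longer exist in SRC, keeping DEST an exact mirror.
backup_dir() {
    rsync -a --delete "$1/" "$2/"
}

# Example invocation (both paths are hypothetical):
# backup_dir "$HOME/important-results" "/some/external/backup/$USER"
```

Note that `--delete` makes the destination an exact mirror, so a file deleted at the source disappears from the mirror on the next run; drop that flag if you want an accumulating copy instead.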

Backup server - The Hive2 backup system can back up files automatically, but only for users who purchase space on the backup server. The backup system replicates users' data with rsync; rsnapshot then creates daily snapshots, keeping up to 14 of them.
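Conceptually, the retention policy described above corresponds to an rsnapshot configuration along these lines. The paths are hypothetical placeholders (the actual configuration lives on the backup server), and rsnapshot requires tab separators between fields:

```
# Hypothetical rsnapshot.conf fragment matching the policy above.
snapshot_root	/backup/snapshots/
retain	daily	14
backup	/hive2/home/someuser/	someuser/
```

With `retain daily 14`, a daily cron run keeps the most recent 14 snapshots and rotates the oldest one out.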

 

Name                    Quantity  Model                          CPUs  RAM     Notes
Compute (old bee's)     38        HP XL170r                      24    128 GB  bee033-071
Compute (bee's)         73        HP ProLiant XL220n Gen10 Plus  64    250 GB  bee073-145
Fat node (old queen's)  1         HP DL560                       56    760 GB  queen02-03
Fat node (queen's)      2         HP DL560-G10                   80    1.5 TB  queen4 (1.5 TB), queen5 (360 GB)
GPU (vespa's)           -         -                              -     -       -

 

Operating Systems

The xCAT management server runs RHEL 8.5, since that is a release supported by xCAT. Compute nodes and other xCAT-managed hosts run RHEL 9.1. The operating system can be upgraded easily as needed using xCAT.
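As a sketch of what an OS deployment looks like with xCAT: the administrator assigns an OS image to a node and power-cycles it. The osimage and node names below are hypothetical; the actual names depend on the images defined on the management server.

```shell
# Assign a netboot OS image to a compute node
# ("rhels9.1-x86_64-netboot-compute" and "bee073" are hypothetical examples).
nodeset bee073 osimage=rhels9.1-x86_64-netboot-compute

# Power-cycle the node so it PXE-boots into the new image.
rpower bee073 boot
```

These commands run on the xCAT management server; `rpower` uses the iDRAC/BMC console network described above to control the node.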