DOGWOOD PARTITIONS AND USER LIMITS

Since these limits are continually being evaluated and revised, our goal here is two-fold: first, to give a snapshot of the limits at this point in time, and second, to provide the commands to query the current limits on the system.

  • To see all the partitions, the maximum run time limit, and the nodes available in each partition, use sinfo -s
  • To see the quality of service (QOS) associated with each partition, the default time limit, and whether the partition is exclusive-user, use scontrol show partition
  • To see the minimum job size allowed, as well as the maximum number of resources (cores) a user can use at any one time, use sacctmgr show qos. Example invocations are shown below.
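
For reference, the queries might look like the following; the partition name 528_queue is just an illustration, and the sacctmgr format fields are one reasonable selection (available field names can vary by Slurm version):

    sinfo -s                            # partitions, node counts, and time limits
    scontrol show partition 528_queue   # QOS, default time, exclusive-user setting
    sacctmgr show qos format=Name,MinTRES,MaxTRESPU,MaxJobsPU,MaxWall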

Table of limits:

Partition         Nodes  Min Job Size  Max Job Size  Max Cores/User  Default Time  Max Run Time  Max Jobs/User (Pend+Run)
528_queue            87  45 tasks      528 tasks     2112            1 hour        3 days        30
2112_queue           96  529 tasks     2112 tasks    2112            1 hour        2 days        20
by_request_queue    183  2113 tasks    none          none            1 hour        2 days        --
debug_queue         183  none          88 tasks      352             1 hour        4 hours       6
cleanup_queue        50  none          40 tasks      220             1 hour        1 day         --
skylake              50  2 nodes       640 tasks     640             1 hour        7 days        40
knl                  20  none          none          none            1 hour        2 days        --

(Max Cores/User = total cores a user may use at one time in that partition; -- = no limit listed.)

Rationale for each partition:

528_queue: Runs jobs that span two or more nodes, up to 12 nodes. The nodes in this partition all have 44 cores per node, so jobs here use between 45 and 528 processes. The user is always allocated the entire node, meaning they have access to all the cores, memory, memory bandwidth, and interconnect bandwidth on that node.
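
As a sketch, a batch script for a two-node (88-process) MPI job in this partition might look like the following; the job name, binary (./my_mpi_app), and requested time are placeholders:

    #!/bin/bash
    #SBATCH --partition=528_queue
    #SBATCH --nodes=2
    #SBATCH --ntasks=88          # 2 nodes x 44 cores/node
    #SBATCH --time=1-00:00:00    # 1 day, within the 3-day maximum
    #SBATCH --job-name=example_528

    srun ./my_mpi_app            # or mpirun, depending on your MPI setup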

2112_queue: Runs large jobs that span more than twelve nodes, up to 48 nodes. The nodes in this partition all have 44 cores per node, so jobs here use between 529 and 2112 processes. Nodes are assigned to this partition to increase the probability that a job runs within a single rack (racks hold 24 nodes = 1056 cores), and the switching topology is non-blocking within the rack. The user is always allocated the entire node, meaning they have access to all the cores, memory, memory bandwidth, and interconnect bandwidth on that node.
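
For illustration, a job sized to a single rack (24 nodes x 44 cores = 1056 tasks) could be requested as sketched below; the binary name and time request are placeholders:

    #!/bin/bash
    #SBATCH --partition=2112_queue
    #SBATCH --nodes=24
    #SBATCH --ntasks=1056        # one full rack at 44 cores/node
    #SBATCH --time=12:00:00      # within the 2-day maximum

    srun ./my_large_mpi_app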

by_request_queue: A special queue used at the discretion of Research Computing for particularly large runs, for example large-way scaling runs, large capability computing, or “heroic” computations deemed important to the mission of the university.

debug_queue: Used to debug applications; it is intended for smaller jobs with fast turnaround.
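
For example, a short interactive debugging session could be requested with srun, as in the sketch below; the core count and time are illustrative and must stay within the 88-task / 4-hour limits:

    srun -p debug_queue -N 1 -n 44 -t 00:30:00 --pty /bin/bash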

cleanup_queue: Used for pre- or post-processing of data associated with larger simulations.
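
A post-processing step might be submitted as in the sketch below; the script name post_process.sh is a placeholder:

    sbatch -p cleanup_queue -n 1 -t 02:00:00 post_process.sh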

skylake: Note that the nodes in this partition have different hardware from those in the other partitions: they use the Intel “Skylake” microarchitecture and have 40 cores per node. Jobs running here should span nodes (i.e. a 2-node minimum), and the user can request up to 16 nodes (640 cores) and up to seven days of runtime. The user is always allocated the entire node, meaning they have access to all the cores, memory, memory bandwidth, and interconnect bandwidth on that node.
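
A minimal batch header for this partition might look like the following sketch (2 nodes x 40 cores; the binary name and time are placeholders):

    #!/bin/bash
    #SBATCH --partition=skylake
    #SBATCH --nodes=2
    #SBATCH --ntasks=80          # 2 nodes x 40 cores/node
    #SBATCH --time=7-00:00:00    # up to the 7-day maximum

    srun ./my_mpi_app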

knl: These nodes all use Intel Xeon Phi processors, code-named Knights Landing (KNL). This is a small partition for development purposes, for users interested in code modernization and/or in developing highly threaded code.
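
As a sketch, a highly threaded (OpenMP) job on a single KNL node might be requested as follows; the thread count and binary are placeholders and should be matched to the node’s actual core/thread layout:

    #!/bin/bash
    #SBATCH --partition=knl
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=64   # placeholder; set to the KNL core count you intend to use
    #SBATCH --time=04:00:00

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_threaded_app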

ms: This partition handles requests to move data to and from mass storage (see ~/ms). Because the mass storage file system is not mounted on the compute nodes, any step in your job workflow that moves data to or from mass storage should be submitted to the ms partition, where it will run on the login nodes.
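
For example, a data-staging step in a workflow could be submitted to the ms partition as sketched below; the paths are placeholders:

    sbatch -p ms -t 01:00:00 --wrap="cp /path/to/results.tar.gz ~/ms/project/"

In a larger workflow, such a staging job can be chained after a compute job with --dependency=afterok:<jobid>.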


Last Update 4/19/2024 2:37:35 PM