Rendimiento y monitorización (Performance and Monitoring)
RED ESPAÑOLA DE SUPERCOMPUTACIÓN - Operations Department - Barcelona Supercomputing Center

Foreword
All information contained in this document refers to BSC's and RES's internal procedures, scripts and developments. This information is confidential and should not be published or distributed.

Index
● Introduction
● RES node architecture
● RES node policies
● Monitoring

Introduction
● Resource Manager
  ● Handles any allocatable resource (check, start application, stop application, ...)
● Scheduler
  ● Decides which job to run at every moment, based on the defined priorities and policies
● IBM's LoadLeveler was our de facto solution (Resource Manager + Scheduler)
● Since June 2007, the MareNostrum production tools are:
  ● Slurm as Resource Manager (open source)
  ● Moab as Scheduler (from Cluster Resources)

RES Node Architecture
SYSTEM ARCHITECTURE
● Head node: cluster management, users' job control commands
● Login nodes
● Blade center servers
● GPFS shared storage

RES Node Architecture
COMPONENTS DEPLOYED
● Head node: Moab, slurmctld, users' job control commands
● Login nodes and blade center servers: a slurmd daemon on every node
● GPFS shared by all nodes

RES Node Policies
INTRODUCTION
● MareNostrum's CPU time is divided and prioritized to guarantee access for:
  ● Access Committee assigned projects (70%)
  ● The site's own projects (20%)
  ● Other (10%)
● Scheduling policies should guarantee this consumption at the end of every period and year

RES Node Policies
ACCESS COMMITTEE
● For every project, the Scientific Committee provides:
  ● The number of hours (in thousands)
  ● The class of hours:
    ● A - maximum priority, should be executed before the rest
    ● B - run when there are no A jobs, or to fill the gaps
● To accomplish this, BSC:
  ● Defines an internal 'Class C':
    ● for those users that have used up all their A
and/or B time
    ● these jobs run only if there are no suitable A or B jobs in the queue
  ● Establishes manual priority management rules:
    ● "An 'A+B' project that uses up its A time is moved to B"
    ● "An 'A'-only or 'B'-only project that uses up all its time is moved to C"

RES Node Policies
JOB PRIORITY MODEL
● Job priority is evaluated as a weighted sum of three components: CREDENTIAL + FAIR-SHARE + SERVICE

RES Node Policies
CREDENTIALS - JOB PRIORITY MODEL
● Credential component (Moab configuration):
  CREDWEIGHT   1
  QOSWEIGHT    1000
  GROUPWEIGHT  10
  USERWEIGHT   1
● This sets priority depending on the group, the user and the Quality of Service

RES Node Policies
FAIR-SHARE - JOB PRIORITY MODEL
● Fair-share component (Moab configuration):
  FSWEIGHT              100
  FSUSERWEIGHT          1
  FSGROUPWEIGHT         10
  FSINTERVAL            07:00:00:00
  FSDEPTH               16
  FSDECAY               0.95
  FSPOLICY              DEDICATEDPES
  FSTREEISPROPORTIONAL  TRUE

RES Node Policies
FAIR-SHARE TREE - COMMITTEE BRANCH
  Root
  ├─ projects (70)
  │    ├─ class_a (1000)
  │    ├─ class_b (100)
  │    └─ class_c (2)
  ├─ bsc (20)
  └─ other (10)
● Initial group share == number of thousands of hours granted by the Access Committee

RES Node Policies
SERVICE - JOB PRIORITY MODEL
● Service component (Moab configuration):
  SERVICEWEIGHT    1
  QUEUETIMEWEIGHT  100
● This sets priority depending on the time the job has spent in the queue

Basic Needs - Monitoring
● System monitoring
  ● Diagnostics (anomaly detection)
● Application monitoring
  ● Status of the executions (performance)
● Accounting
● Sources:
  ● Specific software (Ganglia)
  ● The queueing system
  ● In-house software
● Frequency:
  ● High, but without excess
  ● Minimize interference with running jobs
  ● At the start and end of executions

Tools - System Monitoring
● Ganglia
  ● System monitoring:
    ● CPU load
    ● Memory/swap usage
    ● Network usage
    ● ...
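Metrics beyond the built-in ones can be pushed into Ganglia from a script. A minimal sketch, assuming a Linux node: the metric name `custom_load_1min` is made up for illustration, while `--name`, `--value`, `--type` and `--units` are gmetric's standard options.

```shell
#!/bin/sh
# Sketch: publishing a custom metric to Ganglia from a monitoring script.
# The metric name is hypothetical; the gmetric flags are Ganglia's standard ones.

# Sample a value -- here, the 1-minute load average from /proc.
load=$(cut -d' ' -f1 /proc/loadavg 2>/dev/null || echo 0.0)

# Hand it to the local gmond (skipped when gmetric is not installed).
if command -v gmetric >/dev/null 2>&1; then
    gmetric --name custom_load_1min --value "$load" --type float --units load
fi

echo "sampled load: $load"
```

gmond picks such a metric up like any built-in one, so it appears in the web interface alongside the CPU, memory and network graphs.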
  ● Additional information can be sent
    ● from scripts
  ● Components:
    ● gmond - local daemon
    ● gmetad - remote collector
    ● web interface

Tools - System Monitoring
● Ganglia
  ● Strengths:
    ● lightweight local daemon
    ● easily modifiable (open source)
  ● Weaknesses:
    ● information is broadcast
    ● the collector is not easily scalable
  ● BSC-CNS modifications:
    ● modified gmond (additional metrics)
    ● automatic configuration generation
    ● broadcast limited to the blade center
    ● development of a scalable collector
    ● development of query tools

Tools - Execution Environment
● BSC-CNS developments
  ● Prolog:
    ● verifies the node's state (drivers, network, file systems, hardware, ...)
    ● automatically cancels the job in case of failure
    ● removes the node from the queueing system in case of failure
    ● propagates information to the user's initial script through environment variables (master node, node list)
    ● generates accounting information
  ● Epilog:
    ● locates and kills leftover user processes
    ● verifies the node's state and reconfigures it if necessary

Thank you!
www.bsc.es
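The prolog node-health checks described above can be sketched as a short shell script. This is a sketch under stated assumptions, not BSC's actual prolog: the mount point, the interface name and the use of Slurm's scontrol to drain the node are all illustrative, and the defaults below are deliberately harmless so the script can run anywhere.

```shell
#!/bin/sh
# Sketch of a prolog health check: verify the node, cancel the job and
# drain the node on failure, and export job info to the user's script.
# Mount point, interface and scontrol usage are illustrative assumptions.

NODE=$(hostname)
FS_MOUNT=${FS_MOUNT:-/}    # in production this would be e.g. the GPFS mount
IFACE=${IFACE:-lo}         # in production, the compute interconnect interface

fail() {
    echo "prolog: $1 check failed on $NODE" >&2
    # Drain the node so the scheduler stops placing jobs on it
    # (skipped when scontrol is not available).
    command -v scontrol >/dev/null 2>&1 && \
        scontrol update NodeName="$NODE" State=DRAIN Reason="prolog: $1"
    exit 1    # a non-zero exit tells the queueing system to cancel the job
}

# 1. The parallel file system must be mounted.
grep -q " $FS_MOUNT " /proc/mounts || fail "filesystem"

# 2. The network interface must exist.
[ -e "/sys/class/net/$IFACE" ] || fail "network"

# 3. Propagate job information to the user's script via the environment.
PROLOG_MASTER_NODE=$NODE
export PROLOG_MASTER_NODE

echo "prolog: $NODE healthy"
```

In production such a script would be wired in through Slurm's Prolog= option in slurm.conf, with FS_MOUNT and IFACE pointing at GPFS and the real interconnect.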