Distributed Operating Systems
Process Scheduling and Process Management

1. Concepts and taxonomies: jobs and parallel/distributed systems
2. Static scheduling: scheduling dependent tasks; scheduling parallel tasks; scheduling tasks from multiple jobs
3. Dynamic scheduling: load balancing; process migration; data migration; connection balancing

Víctor Robles, Francisco Rosales, Fernando Pérez, José María Peña

Starting Scenario: Concepts
• Jobs: sets of tasks (processes or threads) that require (resources × time)
  – Resource: data, devices, CPU or any other (finite) element required to carry out a job.
  – Time: the period during which resources are assigned (shared or dedicated) to a given job.
  – Task relationships: tasks must be performed in an order that respects the restrictions derived from required inputs or resources.
• Scheduling: assigning jobs and their tasks to computational resources (especially CPUs). The schedule can be monitored, reviewed and changed over time.

Starting Scenario
Users submit jobs, which consist of tasks with required resources; these are assigned to the nodes (processors).
GOAL: to assign users' jobs to the nodes with the objective of improving performance over sequential execution.

Scheduling Characteristics
• Shared-memory systems
  – Any processor can access the resources used by a task:
    • Memory space
    • Internal OS resources (files, connections, etc.)
  – Automatic load sharing/balancing:
    • Free processors execute any process (task) in the ready state
  – Improvements derived from efficient process management:
    • Better resource usage and performance
    • A parallel application uses the available processors
• Distributed systems
  – Tasks are assigned to a processor for their whole running time
  – The resources used by a task can only be accessed from the local processor.
  – Load balancing requires process migration

Starting Scenario: Jobs
What do we execute? Jobs are divided into tasks:
• Independent tasks
  – Independent processes
  – They may belong to different users
• Cooperating tasks
  – They interact in some way
  – They belong to the same application
  – There may be dependencies among them
  – Or they may require parallel execution

Cooperating Tasks
Task dependencies:
• Model based on a directed acyclic graph (DAG): nodes are tasks, edges are the data transferred between them.
• Example: a workflow
Parallel execution:
• Requires a number of tasks executing in parallel at the same moment:
  – Synchronous or asynchronous interactions.
  – Based on a connection topology.
  – Either a master/slave or a fully distributed model.
  – Particular communication ratios and message exchanges.
• Example: MPI code

Starting Scenario: Objectives
What kind of "better performance" are we expecting? System taxonomy:
• High-availability systems (HAS)
  – The service should always be working
  – Fault tolerance
• High-performance systems (HPC: High Performance Computing)
  – Reaching a higher computational power
  – Executing one heavy job in less time
• High-throughput systems (HTS)
  – The number of executed jobs should be maximized
  – Optimizing resource usage or the number of served clients (these can represent different objectives)

Scheduling
• Scheduling deals with the distribution of tasks over a distributed computing platform:
  – Attending to resource requirements
  – Attending to task inter-dependencies
• Final performance depends on:
  – Concurrency: maximum number of processors running in parallel.
  – Parallelism degree: the finest granularity into which a parallel job can be divided into tasks.
  – Communication costs: may differ between processors in the same node and processors in different nodes.
  – Shared resources: common resources (such as memory) shared among all the tasks running in the same node.

Scheduling
• Processor usage:
  – Exclusive: one task per processor.
  – Shared: if tasks perform few I/O phases, performance is limited. This is not the usual strategy.
• Task scheduling can be planned as:
  – Static scheduling: the system decides in advance where and when tasks will be executed. These decisions are taken before the actual execution of any of the tasks.
  – Dynamic scheduling: once tasks are assigned, and depending on the behavior of the system, the initial decisions are reviewed and modified. Some tasks of the job may already be executing.

Process Scheduling: Static Scheduling

Static Scheduling
• Generally performed before allowing the job to enter the system.
• The scheduler (sometimes called resource manager) selects a job from a waiting queue (depending on the policy); if there are enough resources the job enters the system, otherwise it waits.
[Figure: jobs wait in a job queue; the scheduler checks whether resources are available; if so, the job enters the system, if not, it waits.]

Job Descriptions
• In order to decide the order of job execution, the scheduler needs some information about each job:
  – Number of tasks
  – Priority
  – Task dependencies (DAG)
  – Estimation of the required resources (processors, memory, disk)
  – Estimation of the execution time (per task)
  – Other execution parameters
  – Any applicable restriction
• These definitions are included in a job description file.
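As a minimal sketch, not tied to any particular resource manager, the admission loop described above can be expressed as follows. The job fields and the `free_cpus` counter are illustrative assumptions, and job completion is simulated rather than observed:

```python
from collections import deque

def schedule(jobs, total_cpus):
    """FCFS admission sketch: a job enters the system only when
    enough processors are free; otherwise the queue head waits."""
    queue = deque(jobs)          # each job: (name, cpus_needed)
    free_cpus = total_cpus
    running, admitted = [], []
    while queue:
        name, cpus = queue[0]
        if cpus <= free_cpus:    # resources available -> admit the job
            queue.popleft()
            free_cpus -= cpus
            running.append((name, cpus))
            admitted.append(name)
        else:                    # not enough resources -> wait for a release
            if not running:
                raise ValueError(f"job {name} needs more CPUs than exist")
            done = running.pop(0)        # simulate: the oldest job finishes
            free_cpus += done[1]
    return admitted
```

Note that the queue head blocks everything behind it when it does not fit; the backfilling variant discussed later relaxes exactly this point.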
The format of this file depends on the specific scheduler.

Scheduling Interdependent Tasks
• The following aspects are considered:
  – (Estimated) task duration
  – Size of the data transmitted after task execution (e.g., a file)
  – Task precedence (which tasks must finish before another task starts)
  – Restrictions based on specific resource needs
• One option is to convert everything into the same unit (time): execution time for tasks, transmission time for data, both annotated on a directed acyclic graph (DAG) description.
[Figure: a DAG whose nodes are labeled with execution times and whose edges are labeled with transmission times.]
• Heterogeneous systems make this estimation more difficult:
  – Execution time depends on the processor
  – Communication time depends on the connection

Scheduling Interdependent Tasks
• Scheduling becomes the assignment of tasks to processors at given timestamps:
  – There are approximate heuristics for the tasks of a single job, such as critical-path assignment.
  – Polynomial-time algorithms exist for 2-processor systems.
  – The problem is NP-complete for N > 2 processors.
  – The theoretical model is known as multiprocessor scheduling.

Example of Interdependent Tasks
[Figure: the scheduler maps a 5-task DAG onto two nodes N1 and N2; the resulting schedule finishes at time 36. The communication time between two tasks depends on the nodes they run on: approximately 0 if they are on the same node, n if they are on different nodes.]

Cluster-based Algorithm
• For general cases a cluster-based algorithm is used:
  – Group tasks into clusters.
  – Assign one cluster to each processor.
  – The optimal assignment is NP-complete.
  – This model is valid for one or multiple jobs.
  – Clusters can be determined using:
    • Linear methods
    • Non-linear methods
    • Heuristic/stochastic search

Cluster-based Algorithm
[Figure: a 9-task DAG (A–I) with execution and communication costs, and the resulting schedule of the task clusters on the processors over time slots 1–22.]

Replication
• Some tasks are executed on more than one node to avoid extra communication.
[Figure: the same DAG scheduled with task C replicated as C1 and C2, shortening the schedule.]

Interdependent Task Migration
• Better resource usage can be achieved using migration:
  – It can also be planned by static scheduling.
  – It is more common in dynamic strategies.
[Figure: two schedules of the same tasks on nodes N1 and N2, with and without migration; migrating a task lets its successors start earlier.]

Parallel Task Scheduling
• The following aspects should be considered:
  – Tasks must be executed in parallel.
  – Tasks exchange messages during their execution.
  – Local resources (memory, I/O) are required.
• Topologies: centralized model (master/slave), or distributed models such as a hypercube or a ring.
• Different communication parameters:
  – Communication ratio: frequency and amount of data.
  – Connection topology: where are messages sent to and received from?
  – Communication model: synchronous (tasks wait for data) or asynchronous.
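Returning to the interdependent-task model of the previous slides, a toy list-scheduling sketch makes the trade-off concrete: tasks are assigned greedily to the processor where they can start earliest, and a communication cost is paid only when predecessor and successor run on different processors (the slides' "~0 on the same node" simplification). This is an illustrative heuristic, not the critical-path algorithm itself:

```python
def list_schedule(tasks, edges, n_procs=2):
    """Greedy list scheduling of a DAG.

    tasks: {name: execution_time}
    edges: {(u, v): communication_time}  (u must finish before v starts)
    Returns ({task: (proc, start, finish)}, makespan)."""
    preds = {t: [] for t in tasks}
    for (u, v) in edges:
        preds[v].append(u)
    proc_free = [0.0] * n_procs          # time at which each processor frees up
    placed = {}
    while len(placed) < len(tasks):
        for t in tasks:
            if t not in placed and all(p in placed for p in preds[t]):
                # earliest feasible start on each processor
                best = None
                for p in range(n_procs):
                    ready = proc_free[p]
                    for u in preds[t]:
                        up, _, uf = placed[u]
                        comm = 0 if up == p else edges[(u, t)]
                        ready = max(ready, uf + comm)
                    if best is None or ready < best[1]:
                        best = (p, ready)
                p, start = best
                placed[t] = (p, start, start + tasks[t])
                proc_free[p] = start + tasks[t]
    makespan = max(f for (_, _, f) in placed.values())
    return placed, makespan
```

With expensive edges the heuristic keeps dependent tasks on one processor (serializing them); with cheap edges it spreads them out, which is exactly the clustering intuition above.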
• Restrictions:
  – The existing physical topology of the network
  – Network performance

Performance of Parallel Tasks
• The performance of parallel tasks depends on:
  – Blocking conditions (internal load balancing)
  – System availability
  – Communication efficiency: latency and bandwidth
[Figure: a timeline of tasks showing running, blocked and idle periods around non-blocking/blocking sends and receives and a synchronization barrier.]

Parallel Tasks: Heterogeneity
• In some cases connection topologies are not regular:
  – Model the job as a directed or undirected graph:
  – Each node represents a task with its own memory/disk/CPU requirements.
  – Edges represent the amount of information exchanged between a pair of tasks (communication ratio).
• The problem is NP-complete.
• Some heuristics can be used, e.g., minimum cut:
  – For P nodes, P−1 cuts are selected, minimizing the information crossing each boundary.
  – Result: each partition (node) holds a tightly-coupled group of tasks.
  – Asynchronous problems are more complicated.
  – Balancing the load of each node remains a problem.

Parallel Tasks: Heterogeneity
[Figure: two partitions of the same task graph over nodes N1, N2 and N3; the first has 13+17=30 units of inter-node traffic, the second 13+15=28.]
Tanenbaum. "Distributed Operating Systems" © Prentice Hall 1996

Scheduling Multiple Jobs
• When multiple jobs are to be executed, the scheduler:
  – Selects the next job from the queue and sends it to the system.
  – Checks whether there are available resources (e.g., processors) to execute it.
  – Otherwise, it waits until some resources are released.
[Figure: the scheduler takes jobs from the job queue; if resources are available the job enters the system, otherwise it waits.]
Scheduling Multiple Jobs
• How is the job selected from the queue?
  – FCFS (first-come, first-served): submission order is preserved.
  – SJF (shortest-job-first): the smallest job is selected. The size of a job can be measured by:
    • Resources or number of processors, or
    • Requested execution time (estimated by the user).
  – LJF (longest-job-first): the opposite case.
  – Priority-based: the administrator can define priority criteria, such as:
    • Resource cost expenses.
    • Number of submitted jobs.
    • Job deadline (EDF: earliest-deadline-first).

Backfilling
• Backfilling is a variant of any of the previous policies:
  – If the selected job cannot execute because there are not enough available resources, then
  – Search the queue for another job that requires fewer resources (and thus can be executed).
  – This increases system usage.
[Figure: when resources are insufficient for the selected job, backfilling searches the queue for jobs that require fewer processors.]

Backfilling with Reservations
• Plain backfilling may cause large jobs to be postponed indefinitely. With reservations:
  – Calculate when the blocked job could start, based on the estimated execution times (deadlines) of the running jobs.
  – Jobs that require fewer resources are backfilled, but only if they finish before that reserved start time.
  – System usage is not as high, but starvation of large jobs is avoided.
[Figure: the same scheduler diagram, with backfilling constrained by the reservation.]
Process Scheduling: Dynamic Scheduling

Dynamic Scheduling
• Static scheduling decides whether a job is admitted into the system or not, but afterwards there is no monitoring of the executing job.
• Dynamic scheduling:
  – Evaluates the system status and takes corrective actions.
  – Resolves problems derived from task parallelization (load balancing).
  – Reacts to partial system failures (node crashes).
  – Allows the system to be shared with other processes.
  – Requires a mechanism to monitor the system (task management policies):
    • Considering the same resources evaluated by static scheduling.

Load Balancing vs. Load Sharing
• Load sharing:
  – Aim: processor states should be the same.
  – Triggered when some processors are idle while a task is waiting to run on a different processor.
• Load balancing:
  – Aim: processor loads should be the same.
  – Processor load changes during task execution.
  – How is load calculated?
• They are similar concepts, so LS and LB use similar strategies, but they are activated under different circumstances; LB has its own characteristics.

State/Load Measuring
• What is an idle node?
  – Workstation: "several minutes with no keyboard/mouse input and no interactive process running".
  – Computation node: "no user process has run within a given time frame".
• What happens when it is no longer idle?
  – Nothing → the new processes experience bad performance.
  – Process migration (complex).
  – Keep running at a lower priority.
• If, instead of the state (LS), it is necessary to know the load (LB), new measures are required.
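Both measures can be sketched in a few lines: an idle test based on the last-input timestamp, and a composite load index combining several of the parameters the slides mention. The weights and thresholds are illustrative assumptions, not values from the slides:

```python
def is_idle(last_input_s_ago, interactive_procs, threshold_s=300):
    """Workstation-style idle test: no input for threshold_s seconds
    (5 minutes by default) and no interactive process running."""
    return last_input_s_ago >= threshold_s and interactive_procs == 0

def load_index(cpu_pct, runnable, page_faults_per_s, w=(0.5, 0.4, 0.1)):
    """Composite load metric: weighted sum of %CPU, run-queue length
    and page-fault rate.  The weights w are arbitrary examples and
    would need scaling per node in a heterogeneous system."""
    return (w[0] * cpu_pct / 100.0
            + w[1] * runnable
            + w[2] * page_faults_per_s)
```

Load sharing only needs the boolean state; load balancing needs the numeric index, which is why it is the harder policy to tune.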
Task Management Policies
All task management decisions are performed using several policies, to be defined for each problem or scenario:
• Information policy: how information is distributed in order to take the other decisions.
• Transfer policy: when a transfer is performed.
• Selection policy: which process is selected to be transferred.
• Location policy: to which node the process is transferred.

Information Policy
When is node information distributed?
  – On demand: only when a transfer has to be done.
  – Periodically: information is retrieved every sampling window. The information is always available at transfer time, but it could be out of date.
  – On state change: whenever the node state changes.
Distribution scope:
  – Complete: all nodes know the complete system information.
  – Partial: each node only knows partial system information.

Information Policy
What information is distributed? The node load → but what does "load" mean? Different parameters:
  – %CPU at a given instant.
  – Number of processes ready to run (waiting).
  – Number of page faults / swapping activity.
  – A combination of several factors.
• In a heterogeneous system processors have different capabilities (the parameters may need scaling).

Transfer Policy
• Usually based on a threshold T:
  – If node S load > T units, S is a process-sending node.
  – If node S load < T units, S is a process-receiving node.
• Transfer decisions:
  – Pre-emptive: partially executed tasks can be transferred.
    • The process state is also transferred (migration).
    • Process execution then resumes on the target node.
  – Non-pre-emptive: running processes cannot be transferred.
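The threshold rule above translates almost literally into code. The pairing of senders with receivers at the end is one simple illustrative location choice, not a policy prescribed by the slides:

```python
def classify(loads, threshold):
    """Threshold-based transfer policy: nodes above the threshold are
    senders, nodes below it are receivers; nodes exactly at the
    threshold are neither."""
    senders = [n for n, load in loads.items() if load > threshold]
    receivers = [n for n, load in loads.items() if load < threshold]
    return senders, receivers

def plan_transfers(loads, threshold):
    """Pair the most loaded senders with the least loaded receivers,
    one process each -- an illustrative sender-driven location policy."""
    senders, receivers = classify(loads, threshold)
    senders.sort(key=loads.get, reverse=True)   # heaviest senders first
    receivers.sort(key=loads.get)               # lightest receivers first
    return list(zip(senders, receivers))
```

A real implementation would combine this with the information policy (where the `loads` map comes from, and how stale it is) and the selection policy (which process on each sender actually moves).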
Selection Policy
• Choose new processes, not yet under execution (non-pre-emptive approach).
• Select processes with minimal transfer costs (small state, minimal usage of local resources).
• Select processes only when their completion time will be lower on the destination node than on the original one (taking transfer time into account):
  – The remote execution time should include the migration time.
  – The execution time can be included in the job description (estimated by the user).
  – Otherwise, the system should estimate it.

Location Policy
• Sampling: ask other nodes in order to find the most appropriate one.
• Alternatives:
  – No sampling (randomly selected node, "hot potato").
  – Sequential/parallel sampling.
  – Random sampling.
  – Closest nodes.
  – Broadcast sampling.
  – Based on previously gathered information (information policy).
• Three possible policies:
  – Sender-driven (push) → the sender looks for nodes.
  – Receiver-driven (pull) → the receiver searches for processes.
  – Combined → sender/receiver-driven.

Remote Process Execution
• How is remote execution performed?
  – Create the same execution environment:
    • Environment variables, working directory, etc.
  – Redirect certain OS calls to the original node:
    • E.g., console interaction.
• Migration (pre-emptive) is more complex:
  – "Freeze" the process state.
  – Transfer it to the new host node.
  – "Resume" the process state and execution.
• Several complex issues:
  – Message and signal forwarding.
  – Copy the swap space, or provide a remote page-fault service?

Process Migration
Different migration models:
• Weak migration:
  – Restricted to certain applications (executed over virtual machines) or to specific checkpoints.
• Hard migration:
  – Native code is migrated after the task has started, at any point of its execution.
  – General purpose: more flexible but more complex.
• Data migration:
  – No process is migrated; only working data are transferred.

Migration: Task Data
• The data used by the task must also be migrated:
  – Data on disk: use a common filesystem.
  – Data in memory: requires "freezing" all data belonging to the task (memory pages and processor registers), using checkpointing:
    • Memory pages are stored on disk.
    • It can be more selective if only (some) data pages are stored, but this requires special libraries/languages.
    • Messages sent but potentially not yet received must also be stored.
    • Checkpointing also helps if the system fails (crash recovery).

Weak Migration
• Weak migration can be performed by several methods:
  – Remote execution only when new processes are created:
    • In UNIX this can be done at fork or exec; otherwise new processes would execute on the same node, which provides no load balancing.
  – Some state information must be sent even though the task has not started yet:
    • Arguments, environment, open files used by the task, etc.
  – Certain libraries allow the programmer to define points at which the state is stored/restored. These points can be used to migrate the process.
  – In any case, the executable file should be accessible on the other node:
    • Common filesystem.

Weak Migration
• In languages like Java:
  – There is a serialization mechanism that allows the system to transfer object state in a "byte stream" format.
  – It provides a low-overhead mechanism for dynamic on-demand class loading from remote nodes.
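Python's pickle module offers an analogous mechanism and makes the idea easy to sketch: the task's state is serialized into a byte stream at a checkpoint on one node and restored on another. Here both "nodes" are the same process for simplicity; in a real system the bytes would travel over a socket, and the class definition (the analogue of Process.class) would have to be loadable on the destination:

```python
import pickle

class Counter:
    """A trivial 'migratable task': its entire state is one integer."""
    def __init__(self):
        self.value = 0

    def step(self):
        self.value += 1

# "Node 1": run the task, then freeze its state at a checkpoint.
task = Counter()
task.step()
task.step()
frozen = pickle.dumps(task)      # byte-stream form of the object state

# "Node 2": restore the state and resume execution where it left off.
resumed = pickle.loads(frozen)
resumed.step()
```

This is weak migration in miniature: only explicitly serializable state moves, and only at points the programmer chose, which is exactly why it is simpler than hard migration.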
[Figure: an instance with A=3 is serialized on node 1 and sent to node 2, which requests Process.class and loads it dynamically.]

Hard Migration
• Naïve solution:
  – Copy the memory map: text, data, stack, ...
  – Create a new process control block (PCB) with all the information stored when the process changes context.
• There are other data, stored by the kernel, that are also required, called the external process state:
  – Open files
  – Pending signals
  – Sockets
  – Semaphores
  – Shared memory regions
  – ...

Hard Migration
There are different approaches:
• Kernel-based:
  – A modified version of the kernel.
  – All process information is available.
• User-level:
  – Checkpointing libraries.
  – Socket mobility protocols.
  – System call interception.
Other aspects:
  – System-wide unique PIDs.
  – Credentials and security issues.

Hard Migration
• One objective is to resume process execution as soon as possible. Alternatives:
  – Copy the whole memory space to the new node.
  – Copy only the modified pages; the rest will be provided from the swap area of the original node.
  – No previous copy; pages are provided by the original node as page faults happen:
    • Served from memory if they are modified.
    • Served from swap if they are not modified.
  – Swap out all memory pages on the original node and copy none: pages are then served from the original swap space.
  – Prefetching: start copying pages while the process is already executing.
• Code (read-only) pages do not require migration:
  – They are obtained by the remote node via a common filesystem.

Benefits of Process Migration
• Better performance due to load balancing.
• Profit from resource proximity:
  – A task that frequently uses a remote resource is migrated to the node where that resource resides.
• Better performance in some client/server applications:
  – Minimize data transfer for large volumes of information:
    • The server sends code instead of data (e.g., applets).
    • The client sends request code (e.g., database access queries).
• Fault tolerance when a partial failure happens.
• Development of "network applications":
  – Applications created to be executed on a network.
  – Applications that explicitly request their own migration.
  – Example: mobile agents.

Data Migration
• Used in master/slave applications:
  – Master: distributes the work among the slaves.
  – Slave: worker (same code with different data).
• Represents a work-distribution algorithm (using data to define tasks):
  – Avoid slaves being idle because the master is not providing data.
  – Do not schedule too much work to the same slave (the final execution time is defined by the slowest).
  – Solution: dispatch work in blocks (of different sizes).

Connection Sharing
• Some systems (e.g., web servers) consider the workload to be the number of incoming requests:
  – In this case the requests should be divided among several servers.
  – Problem: the server address should be unique.
  – Solution: connection sharing:
    • DNS forwarding
    • IP forwarding (NAT rewriting or encapsulation)
    • MAC forwarding

Dynamic vs. Static Scheduling
• Systems might use either of them or even both.
• Static and dynamic (adaptive scheduling): all job submissions are centralized and scheduled, but continuous system monitoring is performed to react to inaccurate estimations and other unexpected circumstances.
• Static and non-dynamic (resource manager, batch scheduling): processors are assigned to only one task at a time. The resource manager controls job submissions and keeps a log of the assigned resources.
• Non-static and dynamic (load-balancing strategies): jobs are executed without restrictions on any node of the system. The system performs, in parallel, load balancing to redistribute tasks among the different nodes.
• Non-static and non-dynamic (no scheduling service in a cluster of computers): God will provide...

Computing Platforms
• Depending on the preferred use of the platform:
  – Autonomous computers belonging to independent users:
    • Users share a computer, but only while it is idle.
    • What happens when it is no longer idle?
      – Migrate the tasks to other nodes.
      – Keep executing the tasks at a lower priority.
  – Dedicated systems for parallel execution:
    • A-priori scheduling techniques are possible.
    • Alternatively, the behavior of the system can be adapted dynamically.
    • Optimizing either application execution time or resource usage.
  – General distributed systems (multiple users and multiple applications):
    • The goal is to achieve a well-balanced load distribution.
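The master/slave data-migration scheme described earlier (dispatch work in blocks so that no single slave accumulates too much work and slow workers do not dominate the finish time) can be sketched as follows. The fixed block size and the round-robin hand-out are illustrative simplifications; real masters hand blocks to whichever slave asks next:

```python
def dispatch_blocks(items, workers, block=3):
    """Master/slave work distribution sketch.

    items:   the data to process (data defines the tasks)
    workers: {name: function}, same code applied to different data
    block:   number of items handed out per request
    Returns {worker: [results]}."""
    results = {name: [] for name in workers}
    names = list(workers)
    i = turn = 0
    while i < len(items):
        name = names[turn % len(names)]       # next worker requesting work
        chunk = items[i:i + block]            # one block of data
        results[name].extend(workers[name](x) for x in chunk)
        i += block
        turn += 1
    return results
```

Shrinking `block` keeps slaves busier and the load more even at the cost of more master/slave exchanges, which is the trade-off the slides point at.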
Cluster Taxonomy
• High-performance clusters:
  – Beowulf; parallel programs; MPI; dedicated facilities.
• High-availability clusters:
  – ServiceGuard, Lifekeeper, Failsafe, heartbeat.
• High-throughput clusters:
  – Workload/resource managers; load balancing; supercomputing services.
• Based on the application domain:
  – Web-service clusters:
    • LVS/Piranha; balancing TCP connections; replicated data.
  – Storage clusters:
    • GFS; parallel filesystems; a common data view from all the nodes.
  – Database clusters:
    • Oracle Parallel Server.