Distributed Operating Systems
Process
Scheduling
Process Management
1. Concepts and taxonomies: Jobs and parallel/distributed systems
2. Static scheduling:
   – Scheduling dependent tasks
   – Scheduling parallel tasks
   – Scheduling tasks from multiple jobs
3. Dynamic scheduling:
   – Load balancing
   – Process migration
   – Data migration
   – Connection balancing
Víctor Robles
Francisco Rosales
Fernando Pérez
José María Peña
Starting Scenario: Concepts
• Jobs: Set of tasks (processes or threads) that require
(resources x time)
– Resource: Data, devices, CPU or other required (finite) elements to
carry out a job.
– Time: Period when resources are assigned (shared or dedicated) to a
certain job.
– Task relationship: Tasks must be performed in an order that respects the restrictions imposed by required inputs or resources.
• Scheduling: Assigning jobs and their tasks to computational resources (especially CPU). Scheduling can be monitored, reviewed and changed over time.
Starting Scenario
[Figure: jobs, composed of tasks with required resources, are assigned to nodes (processors)]
GOAL: To assign users’ jobs to the nodes, with the objective of improving performance over sequential execution.
Scheduling Characteristics
• Shared memory systems
– Any of the processors can access the resources used by a task:
• Memory space
• Internal OS resources (files, connections, etc.)
– Automatic load sharing/balancing:
• Free processors execute any process (task) in ready status
– Improvements derived from efficient process management:
• Better resource usage and performance
• Parallel application uses available processors
• Distributed systems
– Tasks are assigned to a processor for their whole running time
– The resources used by a task can only be accessed from the local processor.
– Load balancing requires process migration
Starting Scenario: Jobs
What do we execute?
Jobs are divided into tasks:
• Independent tasks
– Independent processes
– They could belong to different users
• Cooperating tasks
– They interact somehow
– Belonging to the same application
– There can be some dependencies
– Or they can require parallel execution
Cooperating Tasks

Task dependencies
• Model based on a directed acyclic graph (DAG).
  [Figure: example workflow DAG; nodes are tasks, edges are the data transferred]
• Example: Workflow

Parallel execution
• It requires a number of tasks executing in parallel at the same moment:
– Synchronous or asynchronous interactions.
– Based on a connection topology.
– Either master/slave or fully distributed model.
– Particular communication ratios and message exchanges.
• Example: MPI code
Starting Scenario: Objectives
What kind of “better performance” are we expecting?
System taxonomy.
• High-availability systems
– HAS: High Availability Systems
– Service should be always working
– Fault tolerance
• High-performance systems
– HPC: High Performance Computing
– Reaching a higher computational power
– Executing one heavy job in less time.
• High-throughput systems
– HTS: High Throughput Systems
– Number of executed jobs should be maximized
– Optimizing resource usage or the number of served clients (these may represent different objectives).
Scheduling
• Scheduling deals with the distribution of tasks on a distributed
computing platform:
– Attending to resource requirements
– Attending to task inter-dependencies
• Final performance depends on:
– Concurrency: Maximum number of processors running in parallel.
– Parallelism degree: The finest granularity into which a parallel job can be divided into tasks.
– Communication costs: May differ between processors in the same node and processors in different nodes.
– Shared resources: Common resources (like memory) shared among
all the tasks running in the same node.
Scheduling
• Processor usage:
– Exclusive: One task to one processor.
– Shared: If the tasks perform few I/O phases, performance is limited. It is not the usual strategy.
• Task scheduling can be planned as:
– Static Scheduling: The system decides previously where and when
tasks will be executed. These decisions are taken before the actual
execution of any of the tasks.
– Dynamic scheduling: Once tasks are assigned, and depending on the behavior of the system, initial decisions are reviewed and modified. Some of the job’s tasks may already be executing.
Distributed Operating Systems
Process
Scheduling
Static Scheduling
Static Scheduling
• Generally, it is performed before allowing the job to enter the system.
• The scheduler (sometimes called resource manager) selects a
job from a waiting queue (depending on the policy) if there are
enough resources, otherwise it waits.
[Figure: jobs enter a job queue; the scheduler checks whether resources are available and either submits the job to the system or makes it wait]
Job Descriptions
• In order to decide the order of job executions, the scheduler
should have some information about the jobs:
– Number of tasks
– Priority
– Task dependencies (DAG)
– Estimation of the required resources (processors, memory, disk)
– Estimation of the execution time (per task)
– Other execution parameters
– Any applicable restriction
• These definitions are included in a job description file. The
format of this file depends on the specific scheduler.
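As an illustration, a job description for a SLURM-like batch scheduler might look as follows (the directive names follow SLURM; the job name, resource amounts, time and job id are hypothetical values):

```shell
#!/bin/bash
# Hypothetical job description (SLURM-style directives).
#SBATCH --job-name=simulation       # job identifier
#SBATCH --ntasks=16                 # number of tasks
#SBATCH --mem=4G                    # estimated memory requirement
#SBATCH --time=02:00:00             # estimated execution time
#SBATCH --dependency=afterok:1234   # dependency on another (hypothetical) job
srun ./my_parallel_job              # the actual tasks to run
```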
Scheduling Interdependent Tasks
• Considering the following aspects:
– Task (estimated) duration
– Size of data transmitted after task execution (e.g., a file)
– Task precedence (which tasks must finish before another task starts)
– Restrictions based on specific resource needs

One option is to transform all quantities into the same measure (time):
– Execution time (tasks)
– Transmission time (data)

[Figure: example DAG with node weights (execution times) and edge weights (transmission times)]

Heterogeneous systems make this estimation more difficult:
– Execution time depends on the processor
– Communication time depends on the connection

Directed acyclic graph (DAG) description
Scheduling Interdependent Tasks
• Scheduling becomes the assignment of tasks to processors at given time instants:
– There are some approximate heuristics for tasks belonging to a single job: critical-path assignment.
– Polynomial-time algorithms exist for 2-processor systems.
– It is an NP-complete problem for N > 2.
– The theoretical model is referred to as multiprocessor scheduling.
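The critical-path idea mentioned above can be sketched as follows: the critical path is the longest chain of execution plus transmission times through the DAG, a lower bound on total completion time. A minimal sketch with hypothetical task and edge weights:

```python
# Hypothetical example: node weights are execution times,
# edge weights are transmission times between dependent tasks.
exec_time = {"A": 2, "B": 3, "C": 1, "D": 4}
edges = {("A", "B"): 1, ("A", "C"): 2, ("B", "D"): 1, ("C", "D"): 3}

def critical_path_length(exec_time, edges):
    # Earliest-finish computation in topological order: a task can start
    # only after all predecessors have finished and sent their data.
    preds = {n: set() for n in exec_time}
    for (u, v) in edges:
        preds[v].add(u)
    finish = {}  # earliest finish time of each task
    while len(finish) < len(exec_time):
        for n in exec_time:
            if n in finish or not preds[n] <= finish.keys():
                continue  # not ready yet: some predecessor unfinished
            start = max((finish[p] + edges[(p, n)] for p in preds[n]), default=0)
            finish[n] = start + exec_time[n]
    return max(finish.values())
```

On the data above, task D cannot start before max(6+1, 5+3) = 8, so the critical-path length is 12.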
Example of Interdependent Tasks

[Figure: a DAG of tasks is scheduled by the scheduler onto two nodes (N1, N2), producing a timeline of task assignments]

Communication time between two tasks depends on the nodes they are running on:
– Time ≈ 0 if they are on the same node.
– Time n if they are on different nodes.
Cluster-based Algorithm
• For general cases the cluster-based algorithm is used:
– Group tasks into clusters.
– Assign one cluster to each processor.
– Optimal assignment is NP-complete.
– This model is valid for one or multiple jobs.
– Clusters can be determined using:
  • Linear methods
  • Non-linear methods
  • Heuristic/stochastic search
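A greedy sketch of the clustering idea (an illustrative heuristic, not one of the specific methods above): repeatedly merge the pair of clusters joined by the heaviest communication edge, until there is one cluster per processor, so that heavily communicating tasks end up on the same node.

```python
# Sketch: greedy clustering of tasks by heaviest communication edges.
# tasks: list of task ids; comm: frozenset({a, b}) -> communication cost.
def cluster_tasks(tasks, comm, n_processors):
    clusters = {t: {t} for t in tasks}   # each task starts in its own cluster
    owner = {t: t for t in tasks}        # task -> id of its current cluster
    # Merge across the heaviest edges first.
    for (pair, _cost) in sorted(comm.items(), key=lambda kv: -kv[1]):
        if len(clusters) <= n_processors:
            break  # one cluster per processor: done
        a, b = pair
        ca, cb = owner[a], owner[b]
        if ca == cb:
            continue  # already in the same cluster
        clusters[ca] |= clusters.pop(cb)
        for t in clusters[ca]:
            owner[t] = ca
    return list(clusters.values())
```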
Cluster-based Algorithm

[Figure: tasks A–I grouped into clusters and the resulting schedule shown as a Gantt chart over time]
Replication

Some tasks are executed in more than one node to avoid extra communication.

[Figure: the same task graph with task C replicated on two nodes (C1, C2), shortening the resulting schedule]
Interdependent Task Migration
• Better resource usage can be achieved using migration:
– Can also be planned by static scheduling.
– It is more common in dynamic strategies.
[Figure: a task is migrated from node N1 to node N2 part-way through its execution, improving resource usage]
Parallel Task Scheduling
• The following aspects should be considered:
– Tasks must be executed in parallel.
– Tasks exchange messages during their execution.
– Local resources (memory, I/O) are required.

[Figure: centralized (master/slave) model M–S1…S6 versus distributed models such as ring and hypercube topologies]

Different communication parameters:
• Communication ratio: Frequency and amount of data.
• Connection topology: Where are messages sent to / received from?
• Communication model: Synchronous (tasks wait for data) or asynchronous.

Restrictions:
• The existing physical topology of the network
• Network performance
Performance of Parallel Tasks
• Parallel tasks performance depends on:
– Blocking conditions (internal load balancing)
– System availability
– Communication efficiency: latency and bandwidth
[Figure: timeline of tasks in Running/Blocked/Idle states, showing blocking and non-blocking sends/receives and a synchronization barrier]
Parallel Tasks: Heterogeneity
• In some cases connection topologies are not regular:
– Modeled by a directed/undirected graph:
– Each node represents a task with its own requirements of memory/disk/CPU.
– Edges represent the amount of information exchanged between a pair of nodes (communication ratio).
• The problem is NP-complete.
• Some heuristics can be used:
– E.g., minimum cut: In the case of P nodes, P-1 cut points are selected (minimizing the information crossing each boundary).
– Result: Each partition (node) holds a tightly-coupled group of tasks.
– Asynchronous problems are more complicated.
– There are problems balancing the load of each node.
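The quantity these partition heuristics try to minimize, the total traffic crossing node boundaries, is easy to evaluate for a given assignment. A minimal sketch with hypothetical tasks, traffic values, and node names:

```python
# Sketch: inter-node communication cost of a task-to-node assignment.
def cut_cost(assignment, edges):
    # assignment: task -> node; edges: (task, task) -> communication ratio.
    # Only edges whose endpoints sit on different nodes contribute.
    return sum(w for (a, b), w in edges.items() if assignment[a] != assignment[b])

# Hypothetical graph: three tasks, two nodes.
edges = {("t1", "t2"): 3, ("t2", "t3"): 2, ("t1", "t3"): 1}
cost = cut_cost({"t1": "N1", "t2": "N1", "t3": "N2"}, edges)  # 2 + 1 = 3
```

Comparing `cut_cost` across candidate partitions reproduces the kind of comparison shown on the next slide (30 vs. 28 inter-node connections).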
Parallel Tasks: Heterogeneity

[Figure: the same task graph partitioned across three nodes (N1, N2, N3) in two different ways; the first partition yields 13+17=30 inter-node connections, the second 13+15=28]
Tanenbaum. “Distributed Operating Systems” © Prentice Hall 1996
Scheduling Multiple Jobs
• When multiple jobs must be executed, the scheduler:
– Selects the next job from the queue and sends it to the system.
– Considers whether there are available resources (e.g., processors) to execute it.
– Otherwise, it waits until some resources are released.
[Figure: jobs enter the job queue; the scheduler checks for available resources and either submits the job to the system or makes it wait]
Scheduling Multiple Jobs
• How is the job selected from the queue?
– FCFS (first-come-first-served): Submission order is preserved.
– SJF (shortest-job-first): The smallest job is selected. The size of the job is measured by:
  • Resources, number of processors, or
  • Requested execution time (estimated by the user).
– LJF (longest-job-first): The opposite case.
– Priority-based: The administrator can define priority criteria, such as:
  • Resource cost expenses.
  • Number of submitted jobs.
  • Job deadline (EDF: earliest-deadline-first).
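The policies above can be sketched as simple selection functions over the queue. The job records below are hypothetical; here "size" is the requested processor count:

```python
# Hypothetical job queue: submission time, processors, deadline.
jobs = [
    {"name": "j1", "submit": 0, "procs": 8,  "deadline": 50},
    {"name": "j2", "submit": 1, "procs": 2,  "deadline": 20},
    {"name": "j3", "submit": 2, "procs": 16, "deadline": 90},
]

fcfs = min(jobs, key=lambda j: j["submit"])    # first-come-first-served
sjf  = min(jobs, key=lambda j: j["procs"])     # shortest job first
ljf  = max(jobs, key=lambda j: j["procs"])     # longest job first
edf  = min(jobs, key=lambda j: j["deadline"])  # earliest deadline first
```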
Backfilling
• Backfilling is a variant of any of the previous policies:
– If the selected job cannot execute because there are not enough available resources, then
– search for another job in the queue that requires fewer resources (so it can be executed).
– Increases system usage.
[Figure: when resources are insufficient for the selected job, the scheduler backfills by searching for jobs that require fewer processors]
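A minimal sketch of the backfilling decision over a FCFS queue (hypothetical job records; a real scheduler would also respect priorities and, with reservations, deadlines):

```python
# Sketch: simple backfilling over an ordered job queue.
def pick_job(queue, free_procs):
    # Try jobs in queue order; the first one that fits the free
    # processors runs now (later, smaller jobs backfill ahead of
    # earlier jobs that do not fit).
    for job in queue:
        if job["procs"] <= free_procs:
            return job
    return None  # nothing fits: wait for resources to be released

queue = [{"name": "big", "procs": 16}, {"name": "small", "procs": 2}]
job = pick_job(queue, free_procs=4)  # "big" does not fit, "small" does
```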
Backfilling with Reservations
• With reservations, the scheduler:
– Calculates when the waiting job could be executed, based on the estimated execution times of the running jobs (deadlines).
– Backfills jobs that require fewer resources, but only if they finish before the estimated deadline.
– System usage is not as efficient, but starvation of large jobs is avoided.

Plain backfilling may cause large jobs to never be scheduled.
[Figure: the same scheduling loop, with backfilling restricted by the reservation deadline]
Distributed Operating Systems
Process
Scheduling
Dynamic Scheduling
Dynamic Scheduling
• Static scheduling decides whether a job is executed in the
system or not, but afterwards there is no monitoring of the
executing job.
• Dynamic scheduling:
– Evaluates system status and takes corrective actions.
– Resolves problems derived from task parallelization (load balancing).
– Reacts to partial system failures (node crashes).
– Allows the system to be shared with other processes.
– Requires a mechanism to monitor the system (task management policies):
  • Considering the same resources evaluated by static scheduling.
Load Balancing vs. Load Sharing
• Load Sharing:
– Aim: Processor states should be the same.
– Avoid idle processors while a task assigned to a different processor is waiting to run.
• Load Balancing:
– Aim: Processor load should be the same.
– Processor load changes during task execution.
– How is load calculated?

They are similar concepts, so LS and LB use similar strategies, but they are activated under different circumstances. LB has its own characteristics.
State/Load Measuring
• What’s an idle node?
– Workstation: “several minutes with no keyboard/mouse input and no interactive process running”
– Calculation node: “no user process has run within a time frame.”
• What happens when it is no longer idle?
– Nothing → new processes experience bad performance.
– Process migration (complex).
– Keep running under a lower priority.
• If, instead of the state (LS), it is necessary to know the load (LB), new measures are required.
Task Management Policies
All the task management decisions are performed using several
policies, to be defined for each problem or scenario:
• Information Policy: How information is distributed to support the other decisions.
• Transfer Policy: When a transfer is performed.
• Selection Policy: Which process is selected to be transferred.
• Location Policy: Which node the process is transferred to.
Information Policy
When node information is distributed:
– On demand: Only when a transfer has to be done.
– Periodically: Information is retrieved every sampling window. The information is always available when a transfer is needed, but it could be outdated.
– On state change: When the node state has changed.
Distribution scope:
– Complete: All nodes know complete system information.
– Partial: Each node only knows partial system information.
Information Policy
What information is distributed?
– Node load → What does “load” mean?
Different parameters:
– %CPU at a given instant.
– Number of processes ready to run (waiting).
– Number of page faults / amount of swapping.
– A combination of several factors.
• In a heterogeneous system processors have different capabilities (parameters might need scaling).
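A load measure combining several of the factors above might be sketched as a weighted sum; the weights and the normalization are purely illustrative assumptions:

```python
# Sketch: a combined load index from several indicators.
def load_index(cpu_pct, ready_procs, page_faults_per_s,
               w_cpu=1.0, w_ready=0.5, w_pf=0.1):
    # Weighted sum of normalized indicators; the weights are illustrative
    # and would need tuning (and per-node scaling on heterogeneous systems).
    return (w_cpu * cpu_pct / 100
            + w_ready * ready_procs
            + w_pf * page_faults_per_s)
```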
Transfer Policy
• Usually, transfer policies are based on a threshold:
– If node S load > T units, S is a process-sending node.
– If node S load < T units, S is a process-receiving node.
• Transfer decisions:
– Pre-emptive: Partially executed tasks can be transferred.
  • Process state is also transferred (migration).
  • Process execution resumes on the destination node.
– Non-pre-emptive: Running processes cannot be transferred.
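The threshold rule reduces to a direct comparison (T is a tunable parameter; the role names are illustrative):

```python
# Sketch: threshold-based transfer policy.
def transfer_role(load, threshold):
    if load > threshold:
        return "sender"    # overloaded: candidate to send processes away
    if load < threshold:
        return "receiver"  # underloaded: candidate to receive processes
    return "neutral"       # exactly at the threshold: no action
```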
Selection Policy
• Choosing new processes, not yet under execution (non-pre-emptive approach).
• Selecting processes with minimum transfer costs (small state, minimum usage of local resources).
• Selecting processes only when their completion time will be less on the destination node than on the original one (taking transfer time into account):
– Remote execution time should include migration time.
– Execution time can be included in the job description (estimated by the user).
– Otherwise, the system should estimate it.
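The completion-time criterion reduces to a simple comparison; all three arguments are estimates (from the job description or from the system):

```python
# Sketch: does migrating a process pay off?
def worth_migrating(remaining_local, remaining_remote, migration_time):
    # Migrate only if remote completion, including the migration
    # itself, beats staying on the original node.
    return remaining_remote + migration_time < remaining_local
```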
Location Policy
• Sampling: Ask other nodes in order to find the most appropriate one.
• Alternatives:
– No sampling (randomly selected, hot-potato).
– Sequential/parallel sampling.
– Random sampling.
– Closest nodes.
– Broadcast sampling.
– Based on previously gathered information (information policy).
• Three possible policies:
– Sender-driven (push) → The process sender looks for nodes.
– Receiver-driven (pull) → The receiver searches for processes.
– Combined → Sender/receiver-driven.
Remote Process Execution
• How is remote execution performed?
– Create same execution environment:
• Environment variables, working directory, etc.
– Redirect different OS calls to the original node:
• E.g. Console interaction
• Migration (pre-emptive) more complex:
– “Freeze” process state
– Transfer to the new host node
– “Resume” process state and execution
• Several complex issues:
– Message and signal forwarding
– Copy swap space or remote page-fault service?
Process Migration
Different migration models:
• Weak migration:
– Restricted to some applications (executed over virtual machines) or
different checkpoints.
• Hard migration:
– Native code migration and once the task has been started and
anytime during the execution.
– General purpose: More flexible but more complex
• Data migration:
– No process is migrated only working data are transferred.
Migration: Task Data
• Data used by the task should also be migrated:
– Data on disk: Common filesystem.
– Data in memory: Requires “freezing” all data belonging to the task (memory pages and processor registers). Using checkpointing:
  • Memory pages are stored on disk.
  • It can be more selective if only (some) data pages are stored. This requires the use of special libraries/languages.
  • It is also necessary to store messages sent but potentially not yet received.
  • Checkpointing also helps if the system fails (crash recovery).
Weak Migration
• Weak migration can be performed by several methods:
– Remote execution only when new processes are created:
  • In UNIX this can be done at fork or exec.
  • New processes can thus start on another node, but this does not balance processes that are already running.
– Some state information should be sent even if the task has not started yet:
  • Arguments, environment, open files used by the task, etc.
– Certain libraries allow the programmer to define points at which state is stored/restored. These points can be used to migrate the process.
– In any case, the executable file should be accessible on the other node:
  • Common filesystem.
Weak Migration
In languages like Java:
– There is a serialization mechanism that allows the system to transfer object state in a “byte stream format”.
– It provides a low-overhead mechanism for dynamic on-demand class loading from remote nodes.

[Figure: Node 1 serializes an instance (A=3) of Process.class; Node 2 deserializes it and dynamically loads Process.class on request]
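Python’s pickle module offers an analogous serialization mechanism and can illustrate the idea (in Java this role is played by object serialization plus remote class loading; here both “nodes” are the same process for simplicity, and the class must be available on the receiving side):

```python
import pickle

class Process:
    """Hypothetical task object carrying some state."""
    def __init__(self, a):
        self.a = a

payload = pickle.dumps(Process(3))  # "Node 1": serialize instance state
restored = pickle.loads(payload)    # "Node 2": reconstruct the instance
```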
Hard Migration
• Naïve solution:
– Copy the memory map: text, data, stack, ...
– Create a new process control block (PCB) (with all the information stored when the process changes its context).
• There are other data (stored by the kernel) that are required, called the external process state:
– Open files
– Pending signals
– Sockets
– Semaphores
– Shared memory regions
– ...
Hard Migration
There are different approaches:
• Kernel-based:
– Modified version of the kernel.
– All process information is available.
• User-level:
– Checkpointing libraries.
– Socket mobility protocols.
– System call interception.
Other aspects:
– Unique PID in the system.
– Credentials and security issues.
Hard Migration
• One of the objectives is to resume process execution as soon
as possible:
– Copy all memory space to the new node
– Copy only modified pages; the rest of the pages will be provided by
the swap area of the original node.
– No previous copy; pages will be provided by the original node as
page faults happen:
• served from memory if they are modified.
• served from swap if they are not modified
– Swap out all memory pages in the original node and copy no page:
pages will be served from original swap space.
– Prefetching: start copying pages while the process is already executing.
• Code (read-only) pages do not require migration:
– They are obtained by the remote node via a common filesystem
Benefits of Process Migration
• Better performance due to load-balancing
• Profit from resource proximity:
– A task that frequently uses a remote resource is migrated to the node where that resource resides.
• Better performance in some client/server applications:
– Minimize data transfer for large volumes of information:
  • The server sends code instead of data (e.g., applets).
  • The client sends request code (e.g., database access queries).
• Fault tolerance when a partial failure happens.
• Development of “network applications”:
– Applications that are created to be executed on a network.
– Applications explicitly request their own migration.
– Example: Mobile agents.
Data Migration
Used in master-slave applications:
– Master: Distributes work among the slaves.
– Slave: Worker (same code with different data).
• Represents a work distribution algorithm (using data to define tasks):
– Avoid slaves being idle because the master is not providing data.
– Do not schedule too much work to the same slave (final execution time is defined by the slowest).
– Solution: Dispatch work in blocks (of different sizes).
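The block-dispatch idea can be sketched with decreasing block sizes: early, large blocks keep slaves busy with little dispatch overhead, and late, small blocks even out the finish times. The sizing rule below (a fraction of the remaining work) is an illustrative assumption:

```python
# Sketch: master splitting work into blocks of decreasing size.
def dispatch_blocks(total_items, n_slaves, min_block=1):
    blocks = []
    remaining = total_items
    while remaining > 0:
        # Each block is a fraction of what remains, never below min_block.
        size = max(min_block, remaining // (2 * n_slaves))
        size = min(size, remaining)  # do not overshoot the last block
        blocks.append(size)
        remaining -= size
    return blocks
```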
Connection Sharing
• Some systems (e.g., web servers) consider workload as the number of incoming requests:
– In this case the requests should be distributed among several servers.
– Problem: The server address should be unique.
– Solution: Connection sharing:
  • DNS forwarding
  • IP forwarding (NAT rewriting or encapsulation)
  • MAC forwarding
Dynamic vs. Static Scheduling
• Systems might use any of them or even both.
• Static + dynamic (adaptive scheduling): All job submissions are centralized and scheduled, but continuous system monitoring is performed to react to misleading estimations and other unexpected circumstances.
• Non-static + dynamic (load-balancing strategies): Jobs are executed without restrictions on any node of the system. The system performs, in parallel, load balancing to redistribute tasks among the different nodes.
• Static + non-dynamic (resource manager, batch scheduling): Processors are assigned to only one task at a time. The resource manager controls job submissions and keeps a log of the assigned resources.
• Non-static + non-dynamic (no scheduling service in a cluster of computers): God will provide....
Distributed Operating Systems
Process
Scheduling
Computing Platforms
• Depending on the preferred use of the platform:
– Autonomous computers of independent users:
  • Users share a computer, but only when it is idle.
  • What happens when it is not idle anymore?
    – Task migration to other nodes.
    – Keep executing the task with a lower priority.
– Dedicated system for parallel executions:
  • A priori scheduling techniques are possible.
  • Alternatively, the behavior of the system can be adapted dynamically.
  • Optimizing either application execution time or resource usage.
– General distributed systems (multiple users and multiple applications):
  • The goal is to achieve a well-balanced load distribution.
Cluster Taxonomy
• High Performance Clusters
  • Beowulf; parallel programs; MPI; dedicated facilities
• High Availability Clusters
  • ServiceGuard, Lifekeeper, Failsafe, heartbeat
• High Throughput Clusters
  • Workload/resource managers; load balancing; supercomputing services
• Based on application domain:
– Web-Service Clusters
  • LVS/Piranha; balancing TCP connections; replicated data
– Storage Clusters
  • GFS; parallel filesystems; common data view from all the nodes
– Database Clusters
  • Oracle Parallel Server