Distributed Operating Systems
Distributed Systems: Services
• Distributed Shared Memory
• Distributed Transactions

Distributed Services
Distributed Shared Memory

Operating System Support
Víctor Robles, Francisco Rosales, Fernando Pérez, José María Peña

DSM Motivation
• SMPs with shared memory vs. distributed systems:
  – Scalability.
  – Performance/cost ratio.
• Programming support:
  – Shared memory programming is more intuitive than message passing.
  – Legacy code.

Distributed Shared Memory Architecture
[Figure: Distributed shared memory. DSM appears as memory in the address space of each process; each process accessing DSM runs on a node with its own physical memory. Source: Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 3, © Addison-Wesley Publishers 2000]

DSM Implementation
• Hardware DSM:
  – NUMA multiprocessors (e.g. Dash).
  – Hardware lets a processor access memory attached to another processor.
  – Requires specific hardware.
• Page-based DSM:
  – The virtual address space is common to the whole system.
  – Page faults can be served remotely.
  – DSM is accessed with ordinary memory instructions (LOAD/STORE).
  – The strategies used by SMPs can be adapted.
  – First system: IVY (Li, 1986).

Page-based DSM
[Figure: A process accesses a paged DSM segment; the kernel redirects page faults to a user-level handler; pages are transferred over the network. Source: Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 3, © Addison-Wesley Publishers 2000]

DSM Implementation
• Shared variables DSM:
  – Only specific program variables are shared.
  – The compiler and the execution environment manage variable access.
  – DSM operations are special primitives of the execution environment.
  – E.g.: Munin and Midway.
• Object-based DSM:
  – Shared objects.
  – DSM = a collection of shared objects.
  – Shared data is accessed by method invocation.
    • Provides a better access control mechanism.
  – E.g.: Linda, Orca.

DSM Design
• Shared data granularity:
  – When a process accesses data on another node:
    • Fetching only the requested data → poor performance.
    • Fetching a bigger data unit → false sharing.
• Thrashing:
  – Several processes access the same data (actual or false sharing).
  – Network transfer overload.
• Writing policy:
  – Write-update: changes are transmitted to all the copies.
  – Write-invalidate: changes invalidate the other copies.
• Coherence models.

False Sharing
• Pages are shared.
• The actual data is not shared.
[Figure: False sharing. Variables A and B, accessed by different processes, laid out over pages n and n + 1. Source: Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 3, © Addison-Wesley Publishers 2000]

Distributed Services
Distributed Transactions

Transactions
A transaction is a group of operations in a single block that are executed together.
ACID properties:
– Atomicity: The transaction is either executed completely or not executed at all.
– Consistency: The system is in a consistent state both before and after the transaction.
– Isolation: The intermediate states of the transaction are visible only inside the transaction.
– Durability: The modifications made by a committed transaction are durable.

Transactions
Transaction management has three main operations:
– beginTransaction(): Starts a new block of operations belonging to the same transaction.
– endTransaction(): Ends the block of operations of the transaction.
  All the operations of the transaction are committed.
– abortTransaction(): The transaction can be aborted at any time; the system returns to the state it had before the transaction started.
– Any operation error inside the transaction can also abort its execution.

Concurrent Transactions
There are three bank accounts A, B and C with $100, $200 and $300 respectively.
The possible account operations are:
– balance=A.getBalance(): Get the account balance.
– A.setBalance(balance): Set the account balance.
– A.withdraw(amount): Withdraw an amount of money from the account.
– A.deposit(amount): Deposit an amount of money in the account.

Concurrent Transactions
Lost update:

Transaction T                          Transaction U
bal=B.getBalance()                     bal=B.getBalance()
B.setBalance(bal*1.1)                  B.setBalance(bal*1.1)
A.withdraw(bal*0.1)                    C.withdraw(bal*0.1)

Interleaved execution:

Transaction T                          Transaction U
bal=B.getBalance() → $200
                                       bal=B.getBalance() → $200
B.setBalance(bal*1.1) → $220
                                       B.setBalance(bal*1.1) → $220
A.withdraw(bal*0.1) → $80
                                       C.withdraw(bal*0.1) → $280

Both transactions read $200, so B ends at $220 instead of $242: one update is lost.

Concurrent Transactions
Inconsistent queries:

Transaction T                          Transaction U
A.withdraw(100)                        <balance addition>
B.deposit(100)

Interleaved execution:

Transaction T                          Transaction U
A.withdraw(100) → $0
                                       tot=A.getBalance() → $0
                                       tot+=B.getBalance() → $200
                                       tot+=C.getBalance() → $500
B.deposit(100) → $300

U reports a total of $500, although the three accounts hold $600 both before and after the transfer.

Concurrent Transactions
These two problems are due to:
– Simultaneous reading and writing operations.
– Several writing operations at the same time.
The solution is to perform the operations in an appropriate order that avoids these problems. The mechanisms are:
– Locks: Assigned to the shared objects.
– Optimistic concurrency control: All the operations are performed without checks until a commit call is reached.
– Timestamp ordering: Operations are ordered by their timestamps.
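The lost update on account B, and the way a per-object lock removes it, can be sketched in Python as follows. This is a minimal illustration, not code from the course: the `Account` class, the helper `add_interest` and both driver functions are hypothetical names, and the 10% increase is computed in integer arithmetic to avoid floating-point rounding.

```python
import threading


class Account:
    """Hypothetical account object offering getBalance/setBalance-style operations."""

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()  # one lock per shared object

    def get_balance(self):
        return self.balance

    def set_balance(self, balance):
        self.balance = balance


def add_interest(balance):
    # Equivalent of bal*1.1, kept in whole dollars (assumption for this sketch).
    return balance * 11 // 10


def lost_update():
    """Replay the problematic interleaving by hand: both reads precede both writes."""
    b = Account(200)
    bal_t = b.get_balance()              # T reads 200
    bal_u = b.get_balance()              # U reads 200
    b.set_balance(add_interest(bal_t))   # T writes 220
    b.set_balance(add_interest(bal_u))   # U overwrites with 220: T's update is lost
    return b.get_balance()


def locked_update():
    """Run both transactions as threads; the lock makes read-modify-write indivisible."""
    b = Account(200)

    def transaction():
        with b.lock:  # acquired before the read, released after the write
            b.set_balance(add_interest(b.get_balance()))

    threads = [threading.Thread(target=transaction) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b.get_balance()


if __name__ == "__main__":
    print("without lock:", lost_update())
    print("with lock:   ", locked_update())
```

Whichever thread takes the lock first, the second one reads the value the first one wrote, so the two increases compound instead of overwriting each other.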
Locks
Each object shared by two or more processes has its own lock:
– The lock is acquired when a process starts using the object.
– The lock is released when the operation ends.
The granularity level must be considered when using locks.
Lock modes:
– Read
– Write
Lock usage is prone to deadlocks.

Locks
Lost update, solved with a lock Lb on account B:

Transaction T                          Transaction U
bal=B.getBalance()                     bal=B.getBalance()
B.setBalance(bal*1.1)                  B.setBalance(bal*1.1)

Execution with locks:

Transaction T                          Transaction U
lock Lb
bal=B.getBalance() → $200
                                       lock Lb (wait)
B.setBalance(bal*1.1) → $220           • • •
unlock Lb
                                       bal=B.getBalance() → $220
                                       B.setBalance(bal*1.1) → $242

Deadlocks
A deadlock occurs when several transactions wait for one another in a cycle:
– Deadlock detection: Wait-for graphs.
  [Figure: wait-for graph with a cycle involving transactions T and U and objects A and B]
– Deadlock prevention: Acquire all the locks at the beginning of the transaction (poor performance).
– Deadlock resolution: The most common solution is to use timeouts and abort the current transaction.

Optimistic Concurrency Control
Assumption: few concurrent operations conflict.
An operation block is divided into:
– Working phase: The objects used by the transaction are copied as "tentative values". A read takes the tentative value if it exists, otherwise the last validated value. All writes are applied to the tentative values.
– Validation phase: At the end of the transaction, interactions and conflicts with other transactions are checked.
– Update phase: All the tentative values are copied as validated values.

Optimistic Concurrency Control
[Figure: timeline of transactions T1 to T4 going through the working, validation and update phases]
Validation:
– Backward validation: The transaction is aborted if another overlapping transaction has written a value that this one read.
– Forward validation: Each write of the validating transaction invalidates earlier reads of the same value by transactions that are still active.
Problems:
– If the validation phase fails, the transaction is aborted and restarted. This approach is starvation-prone.

Distributed Transactions
Atomic transactions that use objects from different nodes.
– When the transaction ends:
  • If all the participating processes agree, the transaction is committed.
  • If any process wants to abort, the entire transaction is aborted.
Traditional protocol: two-phase commit (2PC)
– The process that requests the transaction acts as the coordinator.
– 2PC requires stable storage (no information is lost):
  • Using two media: writing to two disks.

Two-Phase Commit
Messages exchanged by the two-phase commit protocol:
– canCommit?(): The coordinator asks the other servers whether they can commit.
– doCommit(): The coordinator tells the other servers to carry out the operation.
– doAbort(): The coordinator aborts the transaction and notifies the servers.
– haveCommitted(): A server reports that the transaction has been committed successfully.
– getDecision(): Each server replies whether the operation can be performed or not.

Two-Phase Commit
Coordinator:
• Write canCommit?() to stable storage.
• Send canCommit?() to the other servers.
• Collect the servers' getDecision() responses:
  – If all are ok ⇒ doCommit()
  – If any server aborts or does not respond ⇒ doAbort()
• Write the decision to stable storage.
• Return the result.
[Figure: message sequence between coordinator P0 and servers P1 and P2: canCommit?, getDecision(ok) from each server, doCommit, the servers perform the operations, haveCommitted from each server]

Two-Phase Commit
Servers:
• Receive canCommit?()
• Evaluate the operation and write the decision to stable storage.
• Send the getDecision() message.
• Receive the global decision.
• Write the decision to stable storage.
• Execute the decision:
  – doCommit() ⇒ perform the operation.
  – doAbort() ⇒ undo the changes.
[Figure: message sequence between coordinator P0 and servers P1 and P2: canCommit?, getDecision(ok) from each server, doCommit, the servers perform the operations, haveCommitted from each server]

2PC Failures
• Good fault tolerance:
  – Recovery after a failure uses the copies in stable storage.
• Recovery after a server crash:
  – If stable storage holds the server's response but not the global decision:
    • Ask the coordinator for the decision again.
  – If the decision is also stored:
    • Perform it.
• Recovery after a coordinator crash:
  – If stable storage holds canCommit?() but not the decision:
    • Send the canCommit?() message again to all the servers.
  – If the decision is also stored:
    • Send the decision again to all the servers.
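The coordinator and server steps of two-phase commit can be sketched as a single-process Python simulation. This is only an illustration under stated assumptions: `Participant` and `Coordinator` are hypothetical classes, direct method calls stand in for network messages, a Python list stands in for stable storage, and an exception raised while voting is treated as "no response".

```python
class Participant:
    """Hypothetical server taking part in the transaction."""

    def __init__(self, name, vote=True):
        self.name = name
        self.vote = vote
        self.log = []            # stands in for stable storage
        self.state = "working"

    def can_commit(self):
        # Phase 1: record the local decision before replying to the coordinator.
        self.log.append(("vote", self.vote))
        return self.vote

    def do_commit(self):
        # Phase 2: record the global decision, then apply the changes.
        self.log.append(("decision", "commit"))
        self.state = "committed"
        return True              # plays the role of haveCommitted

    def do_abort(self):
        # Phase 2: record the global decision, then undo the changes.
        self.log.append(("decision", "abort"))
        self.state = "aborted"


class Coordinator:
    """Hypothetical coordinator: the process that requested the transaction."""

    def __init__(self, participants):
        self.participants = participants
        self.log = []            # stands in for stable storage

    def run(self):
        # Phase 1: log the request, then collect one vote per server.
        self.log.append("canCommit?")
        votes = []
        for p in self.participants:
            try:
                votes.append(p.can_commit())
            except Exception:
                votes.append(False)   # no response counts as a vote to abort

        # Commit only if every server voted ok; log the decision before sending it.
        decision = "commit" if all(votes) else "abort"
        self.log.append(decision)

        # Phase 2: send the global decision to every server.
        for p in self.participants:
            if decision == "commit":
                p.do_commit()
            else:
                p.do_abort()
        return decision


if __name__ == "__main__":
    servers = [Participant("P1"), Participant("P2")]
    print(Coordinator(servers).run(), [s.state for s in servers])
```

Because the decision is logged before it is sent, a restarted coordinator could resend it from its log, which is the recovery rule described for a coordinator crash; the sketch does not implement that restart.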