This document describes the COMP 6231 final project design of the DSMS, which either tolerates a single non-malicious (software) Byzantine failure or achieves high availability under a single crash failure, using active replication. It provides an overview of the system architecture, the roles and interactions of each module, and an outline of what needs to be implemented or modified.
In this project, we extend the existing DSMS from Assignment 3 to incorporate active replication. The design allows for tolerating either a single software (non-malicious Byzantine) failure or a single crash failure.
Below is a diagram illustrating the system components and data flows:
Below is a simplified sequence diagram showing how a typical request is processed, how failures are detected, and how RMs handle those failures.
Below is how I designed the DSMS with the Front End, Sequencer, Replicas, and Replica Managers using only three physical machines on a local network. It also demonstrates how to test crash failures.
We need a Front End (FE), a Sequencer, three Replicas, and three Replica Managers (RMs). Because I have only 3 physical machines, one of them must host the FE and Sequencer alongside one Replica and its RM. The other two machines each host a Replica and its RM. That gives us 3 replicas in total and a single FE/Sequencer instance. Below is how the distribution works:
Each block (Replica + RM) can run as separate processes or as combined logic, but on the same machine they should communicate over localhost ports.
Machine 1 hosts the Front End, the Sequencer, Replica A, and RM A.
Key Points:
Machines 2 and 3 do not host the FE or Sequencer. They only run a Replica and its RM (Replica B + RM B on Machine 2, Replica C + RM C on Machine 3).
That way, each city code has a replica on Machines 1, 2, and 3. The Sequencer on Machine 1 sends requests to all nine processes (three per machine).
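The three-machine layout above can be captured as a small configuration map. This is a minimal Python sketch; the city codes `NYK`, `LON`, and `TOK` are assumed placeholders (use the city codes from the assignment), and the IP addresses are the example addresses used in the startup configuration:

```python
# Illustrative deployment map for the three machines. The IPs match the
# example configuration below; the city codes are assumptions.
CITY_CODES = ["NYK", "LON", "TOK"]

DEPLOYMENT = {
    "Machine1": {"ip": "192.168.0.10", "extras": ["FrontEnd", "Sequencer"]},
    "Machine2": {"ip": "192.168.0.11", "extras": []},
    "Machine3": {"ip": "192.168.0.12", "extras": []},
}

def city_server_processes(machine):
    """Each machine runs one DSMS city-server process per city code,
    plus that replica's RM (not counted here)."""
    return [f"{machine}/{city}" for city in CITY_CODES]

total = sum(len(city_server_processes(m)) for m in DEPLOYMENT)
print(total)  # 9 city-server processes in total, three per machine
```

Only Machine 1 carries the extra FE and Sequencer roles; the replica-side process count is the same on every machine.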
Key Points:
In this deployment, once the system is up, point any Admin or Buyer Client at the Front End on Machine 1. The FE passes requests to the Sequencer, which multicasts them to all three Replicas in total order.
The startup pseudo-code is the same as in Assignment 3; the only difference is how we assign host IP addresses and ports to each component. For example:
```
// Shared configuration
Machine1_IP = "192.168.0.10"
Machine2_IP = "192.168.0.11"
Machine3_IP = "192.168.0.12"

// On Machine 1
startReplica("A", Machine1_IP, port=9000)
startRM("A", Machine1_IP, port=9100)
startSequencer(Machine1_IP, port=7000)
startFrontEnd(Machine1_IP, port=7100, sequencerIP=Machine1_IP, seqPort=7000)

// On Machine 2
startReplica("B", Machine2_IP, port=9000)
startRM("B", Machine2_IP, port=9100)

// On Machine 3
startReplica("C", Machine3_IP, port=9000)
startRM("C", Machine3_IP, port=9100)
```
Each replica knows the IP addresses and ports of the other replicas and the Sequencer. The Front End is the only component the client ever contacts (192.168.0.10:7100 in this example).
Pseudo-code for how a replica handles ordered requests:
```
ReplicaServer {
    sortedRequests = new PriorityQueue(... compare by seqNo ...)
    currentSeqExpected = 1

    onReceiveRequest(seqNo, clientRequest):
        insert (seqNo, clientRequest) into sortedRequests
        while sortedRequests is not empty
              and sortedRequests.peek().seqNo == currentSeqExpected:
            nextReq = sortedRequests.poll()
            result = DSMS_logic(nextReq.clientRequest)
            sendResponseToFE(nextReq.seqNo, result)
            currentSeqExpected++
}

// DSMS_logic(request):
//   parse the operation (addShare, removeShare, ...)
//   run the existing DSMS server code
//   return a string result
```
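The ordered-delivery loop can also be expressed in runnable form. This is a minimal Python sketch, not the real DSMS code: `dsms_logic` is a stub standing in for the Assignment 3 server logic, and results are returned rather than sent back to the FE:

```python
import heapq

class ReplicaServer:
    """Sketch of total-order delivery: buffer out-of-order requests in a
    hold-back queue and execute them strictly by sequence number."""

    def __init__(self):
        self.holdback = []       # min-heap of (seq_no, request)
        self.next_expected = 1
        self.executed = []       # execution log, for inspection

    def dsms_logic(self, request):
        # Stub: the real replica parses the operation
        # (addShare, removeShare, ...) and runs the existing DSMS code.
        return f"done:{request}"

    def on_receive(self, seq_no, request):
        heapq.heappush(self.holdback, (seq_no, request))
        results = []
        # Deliver every consecutive request we are now able to execute.
        while self.holdback and self.holdback[0][0] == self.next_expected:
            seq, req = heapq.heappop(self.holdback)
            results.append((seq, self.dsms_logic(req)))
            self.executed.append(seq)
            self.next_expected += 1
        return results

r = ReplicaServer()
r.on_receive(2, "addShare")   # buffered: seq 1 has not arrived yet
r.on_receive(1, "addShare")   # releases both 1 and 2, in order
print(r.executed)  # [1, 2]
```

Note that a request arriving out of order is simply held back; no request is ever executed before all of its predecessors.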
Pseudo-code for the FE logic:
```
FrontEnd {
    handleClientRequest(clientRequest):
        seqNum = sendToSequencer(clientRequest)
        responses = []
        startTime = now()
        while not enoughResponses(responses):
            if responseArrivesFromReplica(rID, result):
                responses.add((rID, result))
                if checkMajorityOrTwoMatches(responses):
                    finalRes = getMajorityOrMatch(responses)
                    return finalRes
            if now() - startTime > TIMEOUT:
                crashedReplicaID = identifyWhichReplicaDidNotRespond(responses)
                notifyAllRMs(crashedReplicaID)
                finalRes = getMajorityOrMatch(responses)
                return finalRes
}
```
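The FE's response-selection rules can be isolated and tested on their own. A minimal sketch, assuming responses arrive as `(replicaID, result)` pairs: `pick_result` implements the "two matching answers" rule, and `missing_replicas` identifies crash suspects once the timeout fires:

```python
from collections import Counter

TIMEOUT = 2.0  # seconds; illustrative value

def pick_result(responses):
    """Return the result reported by at least two replicas,
    or None if no two responses match yet."""
    counts = Counter(result for _, result in responses)
    if not counts:
        return None
    result, n = counts.most_common(1)[0]
    return result if n >= 2 else None

def missing_replicas(responses, all_replicas):
    """Replicas that have not answered: crash suspects on timeout."""
    answered = {rid for rid, _ in responses}
    return sorted(set(all_replicas) - answered)

responses = [("A", "OK: share added"), ("C", "OK: share added")]
print(pick_result(responses))                        # OK: share added
print(missing_replicas(responses, ["A", "B", "C"]))  # ['B']
```

With three replicas, two matching answers are enough both in crash mode (the third replica may be dead) and in software-fault mode (the odd one out is the faulty result).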
Each Replica Manager is bound to exactly one replica. The RM's duties are to monitor its replica for suspected faults, restart it after three consecutive incorrect results, and confirm suspected crashes with its peer RMs before replacing the replica.
Pseudo-code for the RM:
```
ReplicaManager(replicaID) {
    consecutiveFaults = 0

    onFaultSuspected(faultType):
        if faultType == "IncorrectResult":
            consecutiveFaults++
            if consecutiveFaults >= 3:
                stopReplica(replicaID)
                startNewReplica(replicaID)
                consecutiveFaults = 0
        else if faultType == "CrashSuspected":
            if confirmCrashWithPeers(replicaID):
                stopReplica(replicaID)
                startNewReplica(replicaID)
                consecutiveFaults = 0

    stopReplica(replicaID):
        // forcibly kill the process or call a cleanup method

    startNewReplica(replicaID):
        // spawn a fresh DSMS replica process
        // rejoin the group for receiving requests
}
```
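The RM's failure-handling policy is a small state machine and can be sketched directly. In this sketch the actual process stop/start is collapsed into a stub `_restart_replica`, and `confirm_crash_with_peers` is injected as a callable (here a hypothetical always-true check), so only the counting logic is exercised:

```python
class ReplicaManager:
    """Sketch of the RM policy: restart the replica after 3 consecutive
    incorrect results, or immediately on a peer-confirmed crash."""

    RESTART_THRESHOLD = 3

    def __init__(self, replica_id, confirm_crash_with_peers=lambda rid: True):
        self.replica_id = replica_id
        self.consecutive_faults = 0
        self.restarts = 0  # counts stub restarts, for inspection
        self.confirm_crash_with_peers = confirm_crash_with_peers

    def _restart_replica(self):
        # Stub: the real RM forcibly kills the replica process, spawns a
        # fresh one, and rejoins it to the group for ordered requests.
        self.restarts += 1
        self.consecutive_faults = 0

    def on_fault_suspected(self, fault_type):
        if fault_type == "IncorrectResult":
            self.consecutive_faults += 1
            if self.consecutive_faults >= self.RESTART_THRESHOLD:
                self._restart_replica()
        elif fault_type == "CrashSuspected":
            if self.confirm_crash_with_peers(self.replica_id):
                self._restart_replica()

rm = ReplicaManager("B")
for _ in range(3):
    rm.on_fault_suspected("IncorrectResult")
print(rm.restarts)  # 1 -- restarted after three consecutive bad results
```

Resetting `consecutive_faults` after every restart ensures a recovered replica starts with a clean slate.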
The Sequencer enforces total order on all client requests. It assigns each incoming request the next sequence number (nextSeqNo++) and reliably multicasts (seqNo, request) to all replicas. Pseudo-code:
```
Sequencer {
    nextSeqNo = 1

    onReceiveRequestFromFE(clientRequest):
        seqNo = nextSeqNo
        nextSeqNo++
        for each replica in replicaList:
            sendUDPWithAck(replica.address, (seqNo, clientRequest))

    sendUDPWithAck(address, message):
        retryCount = 0
        do {
            sendUDP(address, message)
            wait for ack or timeout
            retryCount++
        } while (no ack received && retryCount < MAX_RETRIES)
}
```
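The numbering and fan-out logic can be tested without sockets by injecting the transport. A minimal sketch: `send` stands in for `sendUDPWithAck` (the real transport retransmits until acked or `MAX_RETRIES` is exhausted), and the replica names and request strings are made up for illustration:

```python
class Sequencer:
    """Sketch of the Sequencer: stamp each client request with a globally
    increasing sequence number and fan it out to every replica."""

    def __init__(self, replica_addresses, send):
        self.next_seq_no = 1
        self.replicas = list(replica_addresses)
        self.send = send  # stand-in for sendUDPWithAck(address, message)

    def on_request_from_fe(self, client_request):
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        for addr in self.replicas:
            # Every replica receives the same (seqNo, request) pair,
            # so all replicas execute requests in the same order.
            self.send(addr, (seq_no, client_request))
        return seq_no

sent = []
seq = Sequencer(["repA", "repB", "repC"],
                lambda addr, msg: sent.append((addr, msg)))
seq.on_request_from_fe("addShare")
seq.on_request_from_fe("purchaseShare")
print(sent[0])    # ('repA', (1, 'addShare'))
print(len(sent))  # 6: two requests x three replicas
```

Because the Sequencer is a single process, incrementing `next_seq_no` in one place is sufficient; no distributed agreement on ordering is needed.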
Here is a comprehensive list of test cases to verify the Distributed Share Market System (DSMS) functionality, covering both normal operations and crash scenarios (process crashes). The DSMS under test includes the Front End, the Sequencer, the three replicas, and their Replica Managers.
Below is a listing of core admin and buyer operations, covering typical edge cases. These assume all replicas are running and no process crashes occur.
| Test Case ID | Operation | Scenario | Expected Result |
|---|---|---|---|
| TA01 | addShare | Admin adds a new share with valid inputs | Share is successfully created with correct capacity; success message returned. |
| TA02 | addShare | Admin attempts to add an existing shareID | Operation fails with a message “Share already exists.” No new share is created. |
| TA03 | removeShare | Admin removes an existing share | Share is removed from the server. Future list or purchase references should fail for that share. |
| TA04 | removeShare | Admin attempts to remove a non-existent shareID | Operation returns “Share not found” or a similar message. |
| TA05 | listShareAvailability | Admin requests availability for a valid shareType (EQUITY, BONUS, or DIVIDEND) | All known shares of that type are returned with capacities from each server city. No error. |
| TB01 | purchaseShare | Buyer purchases a share with sufficient capacity available | Buyer's purchase is recorded, capacity is reduced accordingly. Successful response. |
| TB02 | purchaseShare | Buyer tries to purchase more than the available capacity, or the shareID doesn’t exist | Purchase fails, or a partial amount is allocated; an error or partial-success message is returned. |
| TB03 | getShares | Buyer requests their existing holdings | All owned shares and quantities from all servers are returned. |
| TB04 | sellShare | Buyer tries to sell shares they own, less than or equal to the purchased quantity | Operation succeeds, capacity goes up on that share. Buyer’s holdings are decreased accordingly. |
| TB05 | sellShare | Buyer tries to sell a share they do not own | Operation fails with error message ( “You do not own this share.”). |
| TB06 | swapShares | Buyer attempts to swap oldShare for newShare with enough capacity in newShare | Old share is removed from the buyer's holdings, new share is allocated. Successful swap message returned. |
| TB07 | swapShares | Buyer attempts to swap but newShare lacks capacity | Swap fails; old share is returned to the buyer’s holdings. Clear error message. |
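Cases like TA01/TA02 can be automated against a stub before wiring them to the real FE. A minimal sketch: `ShareStoreStub` is a hypothetical in-memory stand-in (not the real server API), and the share ID `NYKE1100` is made up for illustration:

```python
class ShareStoreStub:
    """In-memory stand-in for one DSMS server, just enough to exercise
    the addShare edge cases from the table above."""

    def __init__(self):
        self.shares = {}  # shareID -> remaining capacity

    def add_share(self, share_id, share_type, capacity):
        if share_id in self.shares:
            return "Share already exists."   # TA02 path
        self.shares[share_id] = capacity
        return "Success"                     # TA01 path

store = ShareStoreStub()
print(store.add_share("NYKE1100", "EQUITY", 50))  # Success
print(store.add_share("NYKE1100", "EQUITY", 50))  # Share already exists.
```

Once these assertions pass against the stub, the same calls can be replayed through the FE to check that all three replicas agree on the outcome.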
In the following scenarios, at least one replica is forcibly terminated during or right after a request is broadcast by the Sequencer. We assume the DSMS is in crash-failure-tolerance mode (not software-fault mode). The system should continue to operate with the remaining replicas.
| Test Case ID | Crash Point | Operation | Expected System Behavior |
|---|---|---|---|
| C01 | After the Sequencer sends an “addShare” request to all replicas, Replica B is killed before responding | addShare | FE receives 2 responses (Replica A and Replica C). FE times out waiting for B and suspects a crash. FE returns success if the A and C results match. RM(B) restarts or replaces B. The system remains consistent with the new share added. |
| C02 | During the removeShare broadcast, Replica A’s process is terminated mid-execution | removeShare | FE gets responses from B and C. A never responds => crash suspicion => the RMs confirm. The share is removed on B and C. Replica A is later restarted or replaced. Once A is back, it can catch up on missed requests. |
| C03 | Replica C is killed after sending an incorrect partial result for “purchaseShare” | purchaseShare | FE compares responses from A, B, and C. If C is killed mid-flight, the FE gets no final ack from C. A and B match => FE proceeds with that result. FE flags C as crashed. RM(C) restarts C. The final system state is consistent per A and B. |
| C04 | Replica B is killed before any response for “swapShares” | swapShares | FE collects responses from A and C. If they match, the swap is successful. B is restarted later by RM(B). The buyer’s holdings reflect the swapped share on A and C. B syncs after restart. |
| C05 | Replica A is killed mid “sellShare” operation; FE times out on A’s response | sellShare | FE obtains matching results from B and C and concludes the sale. A is flagged as crashed; RM(A) restarts it. Overall capacity is incremented, as B and C performed the sale. |
In each crash case, once the Replica Manager detects or is notified of the crash, it spins up a new instance or restarts the replica so the system returns to three replicas. The FE can respond to requests as soon as it has at least 2 matching responses from the healthy replicas.
Below are two sample sequence diagrams illustrating crash scenarios and the subsequent recovery. The first focuses on a crash during an admin operation (addShare); the second focuses on a buyer operation (purchaseShare).
The test cases above (normal operations + crash scenarios) verify correctness, robustness, and recovery in the actively replicated DSMS.
Together, these tests ensure the DSMS achieves the high availability goal under a single crash failure.
After coding all of the above functionality, we need to verify the following operations:

- addShare
- removeShare
- listShareAvailability
- purchaseShare
- getShares
- sellShare
- swapShares

Confirm that each replica’s data (capacities and buyer records) remains consistent. If we forcibly stop any single replica process on Machine 2 or Machine 3, the FE should continue returning correct results from the other two healthy replicas.
This document outlines the architectural structure and the key components that must be implemented or extended in the existing DSMS code to achieve software-failure tolerance or high availability using active replication. This design ensures correctness, fault tolerance, and ease of debugging.