Rebirth: I Want to Learn Distributed Systems Series Part One.
Course Homepage:
MIT's famous course 6.5840: Distributed Systems, formerly numbered 6.824, has five labs in its 2024 offering. The labs build on one another step by step: they start with a classic distributed MapReduce, move on to implementing Raft from scratch, and finish with a distributed database similar to TiKV. The whole process is as satisfying as starting with a sewing needle and hand-crafting Windows.
How to Prevent and Treat Cervical Spondylosis
Just as the first challenge of a certain domestic AAA game is decompressing the game, the first challenge of this course is its documentation webpage, whose layout is enough to give readers cervical spondylosis.
Because the page content hugs the left edge of the window, you can center it horizontally by running the following snippet in the browser's developer console:
// Constrain the page to a readable width and center it.
const body = document.body;
body.style.maxWidth = "800px";
body.style.marginLeft = "auto";
body.style.marginRight = "auto";
// Keep every element (images, code blocks, tables) inside that width.
const elements = body.getElementsByTagName("*");
for (let i = 0; i < elements.length; i++) {
  elements[i].style.maxWidth = "100%";
  elements[i].style.boxSizing = "border-box";
}
Brief Introduction to Lab Content
Lab 1 is a relatively self-contained appetizer: implement a simplified version of the MapReduce framework described in Google's paper.
Lab 2 implements a simple single-machine KV server that must keep serving requests correctly even when the network is unreliable.
Lab 3 implements the Raft protocol.
Lab 4 builds a KV server cluster on top of the Raft implementation from Lab 3, adding fault tolerance to the Lab 2 design.
Lab 5 splits the database into several shards to reduce the load on any single server, and implements migration of shard data between servers.
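To make the sharding idea in Lab 5 more concrete, here is a minimal sketch of mapping keys to shards and working out which shards must migrate when the assignment changes. The names `NShards`, `Config`, `keyToShard`, and `movedShards` are assumptions made up for this illustration, not the course's actual code.

```go
package main

import "fmt"

// NShards is a hypothetical fixed shard count chosen for illustration.
const NShards = 10

// Config is a hypothetical assignment of each shard to a replica group ID.
type Config struct {
	Num    int          // configuration version
	Shards [NShards]int // shard index -> group ID
}

// keyToShard maps a key to a shard using its first byte.
func keyToShard(key string) int {
	shard := 0
	if len(key) > 0 {
		shard = int(key[0])
	}
	return shard % NShards
}

// movedShards lists the shards whose owning group differs between two
// configs; these are the shards whose data would have to be migrated.
func movedShards(old, next Config) []int {
	var moved []int
	for s := 0; s < NShards; s++ {
		if old.Shards[s] != next.Shards[s] {
			moved = append(moved, s)
		}
	}
	return moved
}

func main() {
	old := Config{Num: 1}
	next := Config{Num: 2}
	for s := 0; s < NShards; s++ {
		old.Shards[s] = 100 // group 100 owns every shard
		next.Shards[s] = 100
	}
	next.Shards[3] = 101 // shard 3 moves to group 101
	fmt.Println(keyToShard("apple"), movedShards(old, next))
}
```

Hashing on the first byte alone is deliberately crude; the point is only that every server can compute the owner of a key locally from the current configuration.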
In my opinion, the two most challenging labs are clearly Lab 3 and Lab 5. Lab 3 requires a careful reading of the original Raft paper, while Lab 5 requires designing the whole architecture yourself from the start, with online searches for implementation ideas helping you get going.
Debugging Tips
In my opinion, the hardest part of these labs is not design or coding but debugging. Traditional breakpoint debugging is meant for a single application, whereas these labs deal with a cluster, where breakpoints are hard to use, so debugging mostly comes down to printing logs. A well-designed print function is therefore essential, for example one that wraps the raw print call. For Lab 5, I wrapped log.Printf like this:
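// SSPrintf logs a message prefixed with this server's group ID when SDebug is on.
// The return values are never set; they only mirror the shape of fmt.Printf.
func (kv *ShardKV) SSPrintf(format string, a ...interface{}) (n int, err error) {
	if SDebug {
		// Prepend kv.gid so it fills the %d in the prefix.
		log.Printf("[server %d]"+format, append([]interface{}{kv.gid}, a...)...)
	}
	return
}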
kv.gid is one of the identifiers of a ShardKV server. In practice the prefix can carry more basic information, but you then need to consider whether reading that information for logging introduces a read-write data race.
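One way to avoid such a race is to read mutable fields only while holding the lock that guards them. The sketch below assumes a hypothetical `server` struct with a mutex and a mutable `configNum` field; none of these names come from the lab code.

```go
package main

import (
	"fmt"
	"sync"
)

// server is a hypothetical stand-in for a ShardKV-like struct.
type server struct {
	mu        sync.Mutex
	gid       int // set once at construction, never written afterwards
	configNum int // mutable: guarded by mu
}

// prefixLocked builds the log prefix; the caller must already hold s.mu.
func (s *server) prefixLocked() string {
	return fmt.Sprintf("[gid %d cfg %d] ", s.gid, s.configNum)
}

// prefix takes the lock itself. Go mutexes are not reentrant, so calling
// this while holding s.mu would deadlock; use prefixLocked in that case.
func (s *server) prefix() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.prefixLocked()
}

func main() {
	s := &server{gid: 100}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // concurrent writer, as a config-update goroutine might be
		defer wg.Done()
		s.mu.Lock()
		s.configNum++
		s.mu.Unlock()
	}()
	fmt.Println(s.prefix() + "handling request")
	wg.Wait()
}
```

Splitting the helper into a locked and an unlocked variant matters because most log calls in such a server happen inside code that already holds the mutex.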
In this lab I only used the SDebug flag to decide whether to print logs at all, but a level-based scheme could be designed to separate important logs from less important ones.
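A level-based scheme could look roughly like the following sketch. The `Level` type, the level names, `minLevel`, and `logf` are all hypothetical and were not part of my lab code.

```go
package main

import "log"

// Level is a hypothetical log severity; higher values are more important.
type Level int

const (
	LevelTrace Level = iota
	LevelInfo
	LevelWarn
)

// minLevel is the threshold; messages below it are dropped.
var minLevel = LevelInfo

// enabled reports whether a message at level l would be printed.
func enabled(l Level) bool { return l >= minLevel }

// logf prints the message only if its level passes the threshold.
func logf(l Level, gid int, format string, a ...interface{}) {
	if !enabled(l) {
		return
	}
	args := append([]interface{}{gid}, a...)
	log.Printf("[server %d] "+format, args...)
}

func main() {
	logf(LevelTrace, 100, "heartbeat received")             // dropped
	logf(LevelWarn, 100, "shard %d migration timed out", 3) // printed
}
```

Raising `minLevel` while hunting a specific bug silences the chatter without deleting any log statements.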
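When analyzing the logs of a failing run, the Linux tee command can write the output to both stdout and a file at the same time. go test prints all test output on its own standard output, even what the test wrote to standard error (as log.Printf does by default), so a plain pipe is enough: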
go test -race -run TestConcurrent3_5B | tee bug.txt
Where a log statement is needed, you can also guard it with an extra condition that matches the circumstances under which the bug might occur, which cuts out a large volume of meaningless logs.
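As a small illustration of such a guard, the sketch below logs only events that touch one suspect shard and only when two config numbers disagree. The variable `watchedShard`, the predicate `shouldLog`, and the sample events are all invented for the example.

```go
package main

import "log"

// watchedShard is a hypothetical shard under suspicion; -1 disables the filter.
var watchedShard = 3

// shouldLog reports whether an event is worth logging: only events touching
// the watched shard, and only when the config numbers disagree.
func shouldLog(shard, localCfg, requestCfg int) bool {
	if watchedShard >= 0 && shard != watchedShard {
		return false
	}
	return localCfg != requestCfg
}

func main() {
	events := []struct{ shard, localCfg, requestCfg int }{
		{1, 5, 4}, // other shard: skipped
		{3, 5, 5}, // configs agree: skipped
		{3, 5, 4}, // suspicious: logged
	}
	for _, e := range events {
		if shouldLog(e.shard, e.localCfg, e.requestCfg) {
			log.Printf("[shard %d] config mismatch: local=%d request=%d",
				e.shard, e.localCfg, e.requestCfg)
		}
	}
}
```

Keeping the condition in one predicate makes it easy to tighten or loosen the filter as the hypothesis about the bug changes.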