I’m talking on Wednesday at Strata about Tips and Tricks for Debugging Distributed Systems. You should come check it out.
As a preview, let’s talk about two pretty pictures.
I’m running some typical distributed systems (HDFS, MapReduce, Impala, HBase, ZooKeeper) on a small, seven-node cluster. The diagram above shows individual processes and the TCP connections they’ve established to each other. Some processes are “masters,” and they end up talking to many other processes.
This diagram gets to the bottom of what makes this stuff hard (and interesting). Everything is interconnected and dependent on other pieces. So if one of the pieces breaks away, the others have to adjust. For the most part, software like HDFS and MapReduce is designed to deal with that gracefully, but when the corner cases come to visit, understanding how the processes communicate, and having a picture of it to refer to, is key.
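If you want to build a similar picture for your own cluster, here is a minimal sketch in Python of one way to do it. It is not how the diagram above was produced. It assumes you have already collected one record per established TCP socket (for example from a tool like lsof or netstat on each node) as a hypothetical (process, local endpoint, remote endpoint) tuple. Two sockets are the two ends of one connection when their endpoints are swapped, and that pairing gives you the edges of the graph. The process names and addresses below are made up for illustration.

```python
from collections import Counter


def build_edges(connections):
    """Pair up sockets into process-to-process edges.

    connections: iterable of (process, local_endpoint, remote_endpoint),
    a hypothetical record format assumed for this sketch.
    """
    # Index every socket by its (local, remote) endpoint pair.
    by_pair = {(local, remote): proc for proc, local, remote in connections}
    edges = set()
    for proc, local, remote in connections:
        # The peer's socket has the same two endpoints, swapped.
        peer = by_pair.get((remote, local))
        if peer is not None and peer != proc:
            edges.add(frozenset((proc, peer)))
    return edges


def degree(edges):
    """Count how many distinct processes each process talks to."""
    counts = Counter()
    for edge in edges:
        for proc in edge:
            counts[proc] += 1
    return counts


# Made-up sample data: one namenode, three datanodes.
sample = [
    ("namenode@n1", "10.0.0.1:8020", "10.0.0.2:40001"),
    ("datanode@n2", "10.0.0.2:40001", "10.0.0.1:8020"),
    ("namenode@n1", "10.0.0.1:8020", "10.0.0.3:40002"),
    ("datanode@n3", "10.0.0.3:40002", "10.0.0.1:8020"),
    ("namenode@n1", "10.0.0.1:8020", "10.0.0.4:40004"),
    ("datanode@n4", "10.0.0.4:40004", "10.0.0.1:8020"),
    ("datanode@n2", "10.0.0.2:50010", "10.0.0.3:40003"),
    ("datanode@n3", "10.0.0.3:40003", "10.0.0.2:50010"),
]

edges = build_edges(sample)
busiest = degree(edges).most_common(1)[0]
print(busiest)  # the process with the most peers, here the "master"
```

Sorting processes by degree is a quick way to spot the masters: in the sample, the namenode talks to all three datanodes, while each datanode talks to at most two peers.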
Looking for Outliers in Log Rates
In systems with many actors playing similar roles (e.g., DataNodes in HDFS or TaskTrackers in MapReduce), you can smell out an issue just by looking at how fast the log files are growing. In the picture above, there’s one DataNode that’s logging at a faster rate than the others. That’s something worth following up on.
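Here is a rough sketch of the same idea in Python, assuming you have sampled each node's log file size (in bytes) at two points in time. The node names, sizes, and the "three times the median" threshold are all assumptions chosen for illustration, not values from my cluster.

```python
from statistics import median


def growth_rates(before, after, seconds):
    """Bytes per second of log growth for each node seen in both samples."""
    return {
        node: (after[node] - before[node]) / seconds
        for node in before
        if node in after
    }


def outliers(rates, factor=3.0):
    """Nodes logging more than `factor` times the median rate."""
    typical = median(rates.values())
    return sorted(node for node, rate in rates.items() if rate > factor * typical)


# Made-up log sizes in bytes, sampled 60 seconds apart.
before = {"dn1": 1_000_000, "dn2": 2_000_000, "dn3": 1_500_000,
          "dn4": 1_200_000, "dn5": 900_000}
after = {"dn1": 1_060_000, "dn2": 2_066_000, "dn3": 2_100_000,
         "dn4": 1_254_000, "dn5": 960_000}

rates = growth_rates(before, after, seconds=60)
noisy = outliers(rates)
print(noisy)  # nodes worth a closer look
```

Comparing against the median rather than the mean keeps a single very chatty node from dragging the baseline up and hiding itself.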
See you at Strata!
Come to the talk. We’ll cover several more tricks for diagnosing distributed systems. Swing by Ballroom CD at 4:50 on Wednesday, 2/27. I’ll be having office hours in the Expo Hall, Table B, at 10:10am Thursday.