I’m speaking at Strata about this stuff in February, so this and a few other posts are by way of preparation for that talk.
I learned this trick at Google, and I’ve used it in every system I’ve helped build since. Expose, on an HTTP server inside your process, as much useful information as possible. Do this as you go, to make development and debugging easier, and reap the rewards in production. In distributed systems, HTTP is doubly important: you’re constantly jumping between machines, and opening a URL is far easier than SSH’ing over to read a file.
It hardly needs reiterating, but I’ll reiterate why HTTP is your friend here: everyone’s got an HTTP client. Everyone understands how HTTP flows through their networks and firewalls. And you’re only requiring your users to memorize one URL, something they’re already pretty good at.
The good news is that you have to do very little: you can instrument much of your system essentially for free.
I’m going to go through a bunch of things that you ought to expose, pointing out examples from open source systems that I know well through my work at Cloudera. Though this post is Java-biased, this trick works everywhere. (Hue, which is python-based, does this too.)
What to expose?
Hadoop daemons expose the configuration that they’re using at /conf. Hadoop’s configuration is tricky, because there are defaults in code, defaults in XML files inside of jars, and multiple XML files that are loaded in a specific order. So /conf not only tells you what the runtime values are, but where they came from. Believe you me, I’ve won several bets with this.
The following snippet will pick out your log4j file and return it as a string. (Embedding it into your HTTP server is left as an exercise.) Hadoop does this by exposing its logs directory with a static servlet, which works too. You can get at logs manually, of course, but they’re easier to get to in the browser.
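Here’s a minimal sketch of such a snippet. It assumes the configuration is either pointed at by -Dlog4j.configuration (treated here as a plain file path, though log4j also accepts URLs) or sitting on the classpath as log4j.properties; the class and method names are mine.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Sketch: locate the active log4j configuration and return it as a
 * string, ready to be written into an HTTP response.
 */
public class Log4jConfigDump {

  public static String dumpConfig() throws IOException {
    // 1. An explicit -Dlog4j.configuration=... wins; treat it as a
    //    plain file path here for simplicity.
    String explicit = System.getProperty("log4j.configuration");
    if (explicit != null) {
      Path p = Paths.get(explicit);
      if (Files.isReadable(p)) {
        return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
      }
    }
    // 2. Otherwise fall back to log4j.properties on the classpath.
    try (InputStream in = Log4jConfigDump.class.getClassLoader()
        .getResourceAsStream("log4j.properties")) {
      if (in == null) {
        return "No log4j configuration found.";
      }
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
      return buf.toString("UTF-8");
    }
  }
}
```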
Changing log levels dynamically in log4j turns out to be possible. Embed LogLevelServlet in your code. (I’ve seen slightly better versions, with drop-downs for the available loggers, but Hadoop’s doesn’t have those.) You can also do this via JMX, but the easiest way to do that is visualvm, and then you need to set up X11 forwarding, and at that point it’s easier to just restart the server.
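The guts of such a servlet are only a few lines. Here’s the same idea sketched with java.util.logging from the standard library (the log4j version is analogous: look the logger up by name, set its level); the class and method names are mine.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Sketch of the core of a log-level servlet: a request handler would
 * pull loggerName and levelName out of request parameters and call this.
 */
public class LogLevelTweaker {

  /** Sets the named logger to the named level, e.g. ("com.example.Foo", "FINE"). */
  public static Logger setLevel(String loggerName, String levelName) {
    Logger logger = Logger.getLogger(loggerName);
    // Level.parse accepts the standard names (SEVERE, WARNING, INFO, FINE, ...).
    logger.setLevel(Level.parse(levelName));
    return logger;
  }
}
```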
The JVM exposes tons of information, including GC metadata, classpaths, operating system process data, and more. If you visit /jmx on a Hadoop daemon, it’ll dump out the JMX data as JSON, via JmxJsonServlet.
The traditional way of exposing JMX is to set some JVM flags (the com.sun.management.jmxremote family) and then connect remotely with jvisualvm (with the MBeans plugin), or jmxsh, jmxterm, cmdline-jmxclient, twiddle, etc. Most of these tools require you to configure JMX up front.
Full-fledged JMX lets you not just read instrumentation information about the JVM, but also change it. That’s really poor form: it’s one thing to expose counters, and another thing entirely to allow arbitrary write operations. JmxJsonServlet exposes just the good bits.
Look for a future post on how to get at the JMX metrics from the command line, without having set up either JmxJsonServlet or the JMX JVM parameters ahead of time.
If you want to expose some JMX metrics yourself, here’s a small example. In typically verbose Java fashion, you need both the interface and the class. Be careful not to take locks in your JMX code: it’s no fun when your metrics stop working because of a deadlock.
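A sketch of that interface-plus-class pair, registering a single counter with the platform MBean server (the RequestStats names and com.example domain are placeholders):

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Sketch of exposing a custom metric over JMX. */
public class JmxExample {

  // The management interface: by JMX convention its name must be the
  // implementation's class name plus "MBean".
  public interface RequestStatsMBean {
    long getRequestCount();
  }

  // The implementation. No locks: an AtomicLong keeps the metric
  // readable even when the rest of the system is wedged.
  public static class RequestStats implements RequestStatsMBean {
    private final AtomicLong requests = new AtomicLong();

    public void recordRequest() {
      requests.incrementAndGet();
    }

    @Override
    public long getRequestCount() {
      return requests.get();
    }
  }

  public static RequestStats register() throws Exception {
    RequestStats stats = new RequestStats();
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    // Pick an ObjectName domain that matches your application.
    server.registerMBean(stats, new ObjectName("com.example:type=RequestStats"));
    return stats;
  }
}
```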
Don’t be tempted to use JMX for actual management. If you want RPCs to be able to remotely manage your system, do that however you’re doing RPCs. I’m glad that the JVM is well-instrumented, but the whole thing is otherwise clunky.
Versions. Expose the version and build number/hash of your system. While you’re at it, expose server time too, since timezones are tricky.
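A bare-bones version endpoint takes only a few lines with the JDK’s built-in HttpServer; the version string here is a placeholder for whatever your build system stamps in (in real life it would come from a build-generated properties file):

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Sketch of a /version endpoint on the JDK's built-in HTTP server. */
public class VersionServer {

  public static HttpServer start(int port, final String version) throws IOException {
    HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
    server.createContext("/version", new HttpHandler() {
      public void handle(HttpExchange exchange) throws IOException {
        // Include server time too, since timezones are tricky.
        byte[] body = (version + "\nserver time: " + new java.util.Date() + "\n")
            .getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, body.length);
        OutputStream out = exchange.getResponseBody();
        out.write(body);
        out.close();
      }
    });
    server.start();
    return server;
  }
}
```

Passing port 0 lets the OS pick a free port, which is handy in tests.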
Your application has its own metrics. Perhaps that’s database latency. Perhaps that’s the number of records cleaned up in the last phase. Perhaps it’s the number of peers connected to it. Perhaps it’s your birthday. Make it easy for your system’s developers to add these as they go.
Check out Yammer’s Metrics library. Hadoop’s Metrics2 and Metrics work fine too, and you can use the JMX snippet above to expose metrics directly over JMX.
The important thing is to start exposing metrics.
Especially for systems that have background tasks or are prone to deadlocks, being able to get a stack dump easily is key; /stacks is Hadoop’s endpoint for this.
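The core of such an endpoint is a few lines with ThreadMXBean (names here are mine). Note that ThreadInfo.toString() truncates deep stacks, so a production version would format the StackTraceElements itself:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Sketch of the guts of a /stacks endpoint: render every thread's
 *  stack as text, ready to be written into an HTTP response. */
public class StackDumper {

  public static String dumpStacks() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    StringBuilder out = new StringBuilder();
    // dumpAllThreads(lockedMonitors, lockedSynchronizers): asking for
    // lock info also shows you who is holding what in a deadlock.
    for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
      out.append(info.toString());
    }
    return out.toString();
  }
}
```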
Poor Man’s Profiler, Thread Time, ehCache
I haven’t open-sourced these, but if you’re too lazy to hook up a real profiler, you can do “Poor Man’s Profiling”: use a bit of ridiculous shell around jstack to sample stacks repeatedly and aggregate them. Or you can hook up a servlet and do the aggregation inside of the process.
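The in-process version boils down to sampling Thread.getAllStackTraces() in a loop and counting how often each topmost frame shows up; a sketch (class and method names are mine):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of in-process "Poor Man's Profiling": repeatedly sample all
 *  threads' stacks and count the hottest topmost frames. */
public class PoorMansProfiler {

  public static Map<String, Integer> sample(int rounds, long sleepMillis)
      throws InterruptedException {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int i = 0; i < rounds; i++) {
      for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
        if (stack.length == 0) continue;  // some VM threads have no Java frames
        String top = stack[0].getClassName() + "." + stack[0].getMethodName();
        Integer prev = counts.get(top);
        counts.put(top, prev == null ? 1 : prev + 1);
      }
      Thread.sleep(sleepMillis);
    }
    return counts;
  }
}
```

Frames that dominate the counts are where your threads spend their time; a servlet version would just render the map sorted by count.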
Similarly, you can ask Java for the total CPU time used by various threads, to get a sense of where your CPU is going.
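ThreadMXBean is the mechanism for that, too: it reports per-thread CPU time in nanoseconds where the VM supports it. A sketch (names are mine):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

/** Sketch: per-thread CPU time in nanoseconds, keyed by thread name. */
public class ThreadCpuTimes {

  public static Map<String, Long> cpuTimes() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    Map<String, Long> times = new HashMap<String, Long>();
    if (!threads.isThreadCpuTimeSupported()) {
      return times;  // not every VM/platform supports this
    }
    for (long id : threads.getAllThreadIds()) {
      long nanos = threads.getThreadCpuTime(id);
      ThreadInfo info = threads.getThreadInfo(id);
      // -1 means the thread has died or measurement is disabled.
      if (nanos >= 0 && info != null) {
        times.put(info.getThreadName(), nanos);
      }
    }
    return times;
  }
}
```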
If you use ehCache, you can hook up a servlet to dump its contents, which is overwhelming, but useful for tracking down cache coherence issues.
Pretty much every daemon in Hadoop exposes some daemon-specific data. HBase and HDFS are great examples here, showing, for example, information about how various regions are doing, and what transaction the journal is at. Solr does the same thing.
JavaMelody is the best pre-built solution I know. In addition to showing much of the content we’ve already talked about, it keeps history using JRobin and presents graphs. One of JavaMelody’s strengths is that it can hook into your JDBC connections and web servlets and show you statistics for those, without your having to lift a finger. It’s LGPL, if that makes a difference for you.
JAMon, not to be confused with the templating engine, works along the same lines, but I’ve found JavaMelody much more useful.
NewRelic is hosted. If that works for you, folks rave about ‘em.
Jolokia is a newish library bridging JMX and HTTP. It can be installed as a JVM agent (and as a plugin to several other things), and they’ve got some cool integration with Cubism.js.
A Quick Note on Security
None of the systems I’ve discussed is meant to be connected to the Internet at large. There’s a trade-off between exposing information and appropriately paranoid security; assess your risks accordingly.
A plea to distributed systems developers: as you’re building your systems, expose runtime information over HTTP.
Let me know if you’ve got other stand-by debugging tools in this vein.