How did I build a custom dynamic ZooKeeper management and migration framework and an asynchronous job framework for Tableau Server 11

 

First of all, Tableau Server 11 is not out yet. Tableau Server 10 is just released yesterday.

Product Strategy Overview

Tableau Server is an enterpise data analytics and visualization platform, and our customers are business people. In order to make the project usage experience as smooth as possible, we believe we should enable business people to manage Tableau Server on their own without any help from enterprise IT.

Therefore, one of the most critical goals for developing Tableau Server is that Tableau will provide GUIs and built-in workflows to automate as many managing processes as possible.

Project and Feature Requirements

Tableau core server team is building a new Java-based Tableau Server architecutre for Tableau Server 11 (target release, might change in the future) to replace the old jruby one.

Notably, old jruby Tabadmin persisting cluster-wide configuration and topology information in a local file (workgroup.yml). The role of ZooKeeper (referenced as ZK) is only used a distributed lock and a temp data store – it keeps some vilatile data that can be lost or wiped out and no one cares about it. This characteristic also makes managing and migrating ZooKeeper very very easy – you only need to clean up all data in existing ZooKeeper nodes, deploy the ZooKeeper bits to new nodes, and bring the new cluster up.

Unlike the old world, new Tabadmin will persist all essential configuration and topology data in ZK as a single source of truth, which makes managing, expanding, and migrating ZK much harder than before.

The Problem

The problem we are facing is how to maintain ZK’s data and expand/shrink ZK cluster, as well as making this process super reliable and easy for users.

Let’s elaborate more about the problem.

Say we start with a single Tableau Server node A, which has a single node ZK cluster runs on it. Then we want to expand the Tableau Server cluster from 1 node (A) to 3 nodes (A, B, C), as well as the ZK cluster runs on top of it. How to achieve that?

(A) -> (A, B, C)

Non-Goal

For customers who already have a ZK cluster in house and want to reuse that cluster, Tableau Server provides configurations that enable them to do so. That is not part of the project, and will not be discussed here. This project is focusing on smoothing the experience of managing Tableau Server’s built-in ZooKeeper cluster.

Naive Approach

The naive approach is to shut down (A), deploy ZK bits on B and C, change their configuration files, and bring up (A, B, C). Well, this can lead to disasters. The reason being that ZK runs on a quorum mode, means that if more than half nodes in a ZK cluster form a quorum, their decision will be the final decision that all nodes should follow. In this case, if B and C come up first and form a quorum, their decision will be “there’s no data in our cluster”. When A joins them, B and C will force A to delete its data, causing data loss.

ZooKeeper’s built-in Approach

Actually all developers using ZooKeeper faces the same problem. Think about running a ZooKeeper in your data center. If a ZooKeeper node fails, the common practice is that admin will try to bring it back manually/through scripts. If the machine dies, admins have to bring up a new machine, change its IP to match the failed one, and join it to the cluster. And you must know that machine failure is more than normal.

So ZooKeeper 3.5 starts to build a native rolling based approach to solve this common problem. The principle is to bring up a new node each time, join it to the existing cluster, wait them to sync up and become stable, bring up another new node.

(A) -> (A, B) -> (A, B, C)

The problem is that ZK is not originally developed to support this use case. It involves lots of refactoring, produces lots of bugs, and is not stable for production.

Take a look at ZK’s release page. 3.5.0 was released around 3 years ago. Apache then release 3.5.1 and 3.5.2, both are alpha version. They themselves even don’t recommend using it for production.

 

Continue reading “How did I build a custom dynamic ZooKeeper management and migration framework and an asynchronous job framework for Tableau Server 11”