After you have played a while with the put()s and get()s of your HBase client, you’ll probably start to think about how to organize the mess.
When a classic DAO…
This classic approach with one backend database ensures that the client code needs no knowledge of the database, nor of the actual persistence strategies in the persistence layer (which was called a DataAccessObject some years ago). Modern software can even drive several database models just through different configurations. Proper interfaces allow independent testing of all layers.
The database itself ensures data consistency and offers a strong API (SQL).
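The layering described above can be sketched in a few lines. This is a minimal, hypothetical example (the names `UserDao`, `InMemoryUserDao`, and the methods are invented for illustration): the business logic codes only against the interface, and an in-memory map stands in for an RDBMS- or HBase-backed implementation that a real application would choose by configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// The interface the business logic sees; it knows nothing about the backend.
interface UserDao {
    void save(String id, String name);
    Optional<String> findName(String id);
}

// Stand-in implementation; a JDBC- or HBase-backed class would implement
// the same interface and be selected via configuration.
class InMemoryUserDao implements UserDao {
    private final Map<String, String> store = new HashMap<>();

    @Override
    public void save(String id, String name) {
        store.put(id, name);
    }

    @Override
    public Optional<String> findName(String id) {
        return Optional.ofNullable(store.get(id));
    }
}

public class DaoDemo {
    public static void main(String[] args) {
        UserDao dao = new InMemoryUserDao(); // swap implementation via config
        dao.save("42", "Alice");
        System.out.println(dao.findName("42").orElse("not found"));
    }
}
```

Because each layer only depends on the interface above it, you can test the business logic against the in-memory implementation without any database running.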
All that’s nice, but now you want to manage a lot of data, and you may choose HBase to do it.
When using a distributed, shared-nothing database, there is, by definition, no single point that manages the persistence strategies, such as:
- Ensuring consistency (was the job of the RDBMS)
- Transactions (RDBMS)
- Maintaining indexes (RDBMS and your admin)
- CRUD across several tables (classically a job of the persistence layer)
- Caches and buffers to smooth over short database hiccups
- Handling security (code or database)
For a first attempt, just let a new persistence layer for the distributed database do all of this:
(In a shared-nothing database, each shard holds a fragment of your data rows. A master typically knows about everything, but holds no data.)
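To make the “fat client” approach concrete, here is a minimal sketch of what such a client-side persistence layer has to do on every write. The class and the two maps are hypothetical stand-ins for a data table and an index table in the distributed store; the point is that maintaining the secondary index, a job the RDBMS used to do, is now the client’s responsibility.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a "fat client" persistence layer: on every put it must also
// maintain the secondary index itself. The HashMaps stand in for a data
// table and an index table in the distributed database.
class FatClientDao {
    private final Map<String, String> userTable = new HashMap<>();   // rowKey -> email
    private final Map<String, String> emailIndex = new HashMap<>();  // email -> rowKey

    public void putUser(String rowKey, String email) {
        userTable.put(rowKey, email);   // write to the data table
        emailIndex.put(email, rowKey);  // every single client must remember this step!
    }

    public String findByEmail(String email) {
        return emailIndex.get(email);
    }
}
```

If even one client in the system forgets (or runs an old version of) the second line of `putUser`, the index silently diverges from the data, which is exactly the consistency problem discussed next.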
But how about consistency? We would have to update all clients at the same time to make sure everybody properly maintains the index, adheres to the new security rules, and so on. Also, if we have more than one application, this gets complicated. Imagine 24×7.
So changes to the persistence layer require immediate distribution to all clients that use it. This is new and quite different from a classical RDBMS: “Let’s add an index” is not so simple anymore. There are several options for getting the persistence code distributed:
| Option | Approach | Assessment |
|---|---|---|
| A | Shut down all clients, redeploy, restart | Normally not possible; there is also the risk of forgetting a client |
| B | Build schema version checking: clients check the DAO version on every access and reload the DAO code dynamically | While loading code dynamically is really cool, your QA department will probably not like it. You need good security measures as well… |
| C | Add an additional layer of servers that act as the DAO layer | This seems to promise a solution to many problems, at the cost of those additional servers |
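Option B can be illustrated with a small, hypothetical sketch (all names invented): the client compares the version its DAO code was built for against a version number published in the database, and refuses to proceed, or triggers a reload, on a mismatch.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of option B: a schema version check before every access.
// publishedVersion stands in for a version cell that would be read
// from the database; in a real system a mismatch would trigger a
// dynamic reload of the DAO code instead of an exception.
class VersionedDao {
    static final int CODE_VERSION = 3; // version this DAO code was built for
    static final AtomicInteger publishedVersion = new AtomicInteger(3);

    public String get(String rowKey) {
        if (publishedVersion.get() != CODE_VERSION) {
            throw new IllegalStateException("DAO code outdated, reload required");
        }
        return "value-for-" + rowKey; // placeholder for the real database read
    }
}
```

The check itself is cheap; the hard (and QA-unfriendly) part is the dynamic reload that has to follow it.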
A separate set of DAO servers
So if you have the money for some additional servers, this is the way to go. It offers solutions to all the problems mentioned above.
Performance might be a problem. The beauty of shared-nothing comes from the independent life that each thread in your business logic can live. If you query Google, you might, in that moment, well have 20 computers available just for your single request. This additional layer should scale at the same rate as the other I/O streams of your application, possibly in a one-to-one relation as shown in the picture above. If you have some reasonable work to do in your DAO layer (such as encrypting some fields, or calculating hashes for indexes), this computing power is not only additional cost: it frees your business layer from that work.
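As an example of work a DAO node can take off the business layer, here is a small sketch of computing a SHA-256 hash of a field value to use as an index key (the class and method names are invented for illustration; `HexFormat` assumes Java 17+).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hashing a field value into a fixed-length index key: cheap per call,
// but worth offloading from the business layer when done on every write.
class IndexHasher {
    static String hashForIndex(String fieldValue) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(fieldValue.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest); // requires Java 17+
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```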
So you have your separate DAO servers. How to update them now? Restarting them all at the same time still means a short downtime. So here you are challenged to write code that allows you to do the things mentioned above at runtime, such as changing permission settings or adding index rules. After all, these servers can hold database maintenance code as well.
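One way to make rules changeable at runtime is to keep them behind an atomic reference that an admin call can swap without restarting the node. This is a minimal sketch under that assumption; `DaoNode` and its methods are hypothetical, not a real HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Sketch of a DAO node whose index rules can be replaced at runtime:
// the rule list lives behind an AtomicReference, so setIndexRules()
// takes effect for the next write without a restart.
class DaoNode {
    private final AtomicReference<List<Function<String, String>>> indexRules =
            new AtomicReference<>(new ArrayList<>());

    public void setIndexRules(List<Function<String, String>> rules) {
        indexRules.set(rules); // an admin endpoint would call this
    }

    public List<String> indexEntriesFor(String row) {
        List<String> entries = new ArrayList<>();
        for (Function<String, String> rule : indexRules.get()) {
            entries.add(rule.apply(row)); // apply every current index rule
        }
        return entries;
    }
}
```

The same pattern works for permission settings: readers always see either the complete old rule set or the complete new one, never a half-updated mix.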
You might want to use load balancing between the clients and the DAO nodes, which gives you the additional benefit of scaling and replacing nodes at runtime. The DAO nodes may well buffer calls or run them in a multithreaded fashion, to give better response times from the database to the clients. A firewall can offer additional safety in your datacenter.
With all this freedom, don’t forget that such a design still does not offer many things you may have gotten used to from traditional RDBMSs, unless you put them into your DAO layer code yourself (if you are a hard-core database expert). But you may not need many of these things, and if the scaling benefits outweigh the potential loss of precision and accuracy, the data storage will never limit your business anymore.