Like many, you may have come to the realization that running M/R jobs is not really an ad-hoc adventure. I am not sure how that illusion came about; after all, the way we process data has changed little, even though there is much more data now. I want to place some hyped terms into the big picture here, following up with some tips on doing real-time data processing with HBase in later posts.
Mostly you will do these things with your big data:
- keep it for reference (perhaps your bills),
- collect it first, then dive into it for analysis (“Where do my customers actually come from?”, aka OLAP)
- or apply some algorithm to it and derive business decision values, possibly in real time (“customers that bought A also bought B…” or “Current Top Tweets by Region”, call it OLTP or CEP)
If you already know your data and also have the question you need answers for, you can go straight to data processing. Otherwise you may need to make the collected data somehow readable first. We used to call that ETL; these days M/R also does well here, with the help of Hive, Pig, or Cascading. Finally, there are many great tools out there for investigating such data; some place it into analysis cubes to make that work a bit handier for non-programming analysts.
Once you know what you are looking for, you can decide how current your answers should be. You can stick with the ETL-to-cube approach if it is enough to look at these answers once a week or so. Or you automate and improve your ETL process further (here an M/R approach simplifies scaling). Or you look at your incoming data as a stream of events and rebuild the ETL logic to operate in real time. As the “Load” part of ETL becomes obsolete, I replace it here with a “P” for “Processing”:
- turning aggregations into incremental aggregations (“select sum() over a week” may become “increment X of thisWeek”)
- keeping a calculation context over a longer period (“if X happened 5 hours before Y, then…”)
- handling unique-value aggregations (“how many unique visitors do I have over a week…”)
- You may need more CPU cycles and overall I/O, as you cannot benefit from the batch processing of classic ETL tools.
- Synchronization: if your data arrives through different channels and you want to manipulate shared data (such as an index, perhaps).
- Your business analyst may still need some kind of good old analysis – these tools want to be loaded, so you may keep some kind of ETL alive. In other words, you probably have to add code and computers; you cannot simply reuse what you have.
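The first three transformations above can be sketched with plain data structures. Here is a minimal Python sketch (Python standing in for whatever your stream processor speaks; all names are illustrative, and a real implementation would persist these counters in the data store rather than in memory):

```python
from collections import defaultdict


class RealtimeAggregator:
    """Toy event-driven aggregator: incremental sums and
    unique-value counting, updated as events arrive instead
    of batch-scanned afterwards."""

    def __init__(self):
        self.weekly_sums = defaultdict(int)      # "increment X of thisWeek"
        self.weekly_visitors = defaultdict(set)  # unique visitors per week

    def on_event(self, week, amount, visitor_id):
        # Instead of "select sum() over a week", increment on arrival.
        self.weekly_sums[week] += amount
        self.weekly_visitors[week].add(visitor_id)

    def unique_visitors(self, week):
        return len(self.weekly_visitors[week])


agg = RealtimeAggregator()
agg.on_event("2010-W15", 10, "alice")
agg.on_event("2010-W15", 5, "bob")
agg.on_event("2010-W15", 7, "alice")
print(agg.weekly_sums["2010-W15"])    # 22
print(agg.unique_visitors("2010-W15"))  # 2
```

Note how the unique-visitor question forces you to keep the set of seen IDs around, not just a counter – exactly the kind of calculation context that makes real-time processing more expensive than batch ETL.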
The good thing is that the new, highly scalable key-value stores help to implement this with rather simple patterns (which indeed often look quite similar to M/R algorithms, so it may justify the term “real-time”). My preferred toy is HBase, but most of them can also be implemented with Cassandra, Hypertable, Redis, or Tokyo Cabinet.
Read about some simple patterns that seem to occur repeatedly in the ETP stage in the next post.
It’s not complicated to build your own code to Extract your data, do whatever Transformation you need, and Process it, possibly even using data that is already in your data store. For HBase, check out Avro, Thrift, or Protocol Buffers to conveniently talk to HBase using complex domain objects. To be scalable, you generally want to avoid complex synchronization, and some tools are out there to help you with that task. I guess WibiData offers something more ready-to-use.
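The core idea behind all of those serialization frameworks is the same: turn a complex domain object into a row key plus a byte value that a put() can store and a get() can restore. A hypothetical Python sketch, using JSON as a stand-in for Avro/Thrift/Protocol Buffers (the key scheme `customer:<id>` is an assumption, not anything these libraries prescribe):

```python
import json


def to_row(customer):
    """Serialize a domain object into a (row_key, value_bytes) pair,
    the shape you would hand to a key-value store's put()."""
    row_key = ("customer:%d" % customer["id"]).encode("utf-8")
    value = json.dumps(customer, sort_keys=True).encode("utf-8")
    return row_key, value


def from_row(value_bytes):
    """Restore the domain object from the stored bytes, as after a get()."""
    return json.loads(value_bytes.decode("utf-8"))


key, val = to_row({"id": 42, "name": "Ada", "region": "EMEA"})
restored = from_row(val)
```

A real Avro or Protocol Buffers setup buys you a schema and compact binary encoding on top of this; the put/get shape stays the same.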
Christoph Rupp started with an introduction to his embeddable HamsterDB, a nice and light alternative to BerkeleyDB. Besides a comprehensive feature set, he impressed with the strong quality rules that are applied to each release. A fast embedded DB is certainly a different world from the massive scaling we do with Hadoop & Co, but why not imagine HBase shards using a really fast embedded piece of C code?
Afterwards we had some good discussions in different directions, spanning from Hadoop* up to all kinds of cloud challenges.
This was our second meeting, and I still assume there are more data fighters out there in the Munich area. Spread the word!
Looking forward to the next Open Hadoop Users Group Meeting, probably mid-May 2010.
After you play a while with the put()s and get()s of your HBase client, you’ll probably start to think about how to organize the mess.
When a classic DAO…
This classic approach with one backend database ensures that the client code needs no knowledge of the database, nor of the actual persistence strategies in the persistence layer (which was called the DataAccessObject some years ago). Modern software can even drive several database models just by different configurations. Proper interfaces allow independent testing of all layers.
The database itself ensures data consistency and has a strong API (SQL).
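That layering might look like this in code – a Python sketch with illustrative names, where a dict-backed stub backend shows how proper interfaces let you test each layer independently:

```python
class CustomerDAO:
    """The interface the client code programs against.
    It exposes domain operations only, never SQL or connections."""

    def save(self, customer):
        raise NotImplementedError

    def find_by_id(self, customer_id):
        raise NotImplementedError


class InMemoryCustomerDAO(CustomerDAO):
    """A stub backend: swap it in for unit tests, swap a real
    database-backed implementation in for production."""

    def __init__(self):
        self._rows = {}

    def save(self, customer):
        self._rows[customer["id"]] = customer

    def find_by_id(self, customer_id):
        return self._rows.get(customer_id)


dao = InMemoryCustomerDAO()
dao.save({"id": 1, "name": "Ada"})
```

The client code only ever sees `CustomerDAO`, so it does not care which implementation is configured underneath.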
All that’s nice, but now you want to manage a lot of data, and you may choose HBase to do it.
When using a distributed, shared-nothing database, there is, by definition, no single point that manages the persistence strategies, such as:
- Ensuring consistency (was the job of the RDBMS)
- Transactions (RDBMS)
- Maintaining indexes (RDBMS and your admin)
- CRUD into/from several tables (classically a job of the persistence layer)
- Caches and buffers to heal short database hiccups
- Handling security (code or database)
For a first attempt, just let a new persistence layer for the distributed database do all this:
(In a shared-nothing database, each shard holds a fragment of your data row set. A master typically knows about everything, but holds no data.)
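What “the persistence layer does it all” means in practice: on every write, the DAO itself must maintain things an RDBMS used to handle, such as secondary indexes. A Python sketch under assumed names, with a plain dict standing in for the key-value store and `idx:region:<name>` as a made-up index-row scheme:

```python
class DistributedCustomerDAO:
    """Sketch: in a shared-nothing store there is no RDBMS
    maintaining secondary indexes, so the persistence layer
    writes the index row alongside every data row itself."""

    def __init__(self, store):
        self.store = store  # dict stands in for the KV store

    def save(self, customer):
        # Write the data row.
        self.store["customer:%d" % customer["id"]] = customer
        # Maintain the secondary index row ourselves.
        idx_key = "idx:region:%s" % customer["region"]
        self.store.setdefault(idx_key, set()).add(customer["id"])

    def ids_by_region(self, region):
        return self.store.get("idx:region:%s" % region, set())


dao = DistributedCustomerDAO({})
dao.save({"id": 1, "region": "EMEA"})
dao.save({"id": 2, "region": "EMEA"})
```

Note there is no transaction wrapping the two writes – if the client dies between them, the index is stale. That is precisely the consistency worry discussed next.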
But how about consistency? We would have to update all clients at the same time to make sure everybody properly maintains the index, adheres to the new security rules, etc. Also, if we have more than one application, this gets complicated. Imagine 24×7.
So changes to the persistence layer require immediate distribution to all clients that use it. This is new and quite different from a classical RDBMS: “Let’s put an index on it” is not so simple anymore. There are several options for getting the persistence code distributed:
| Option | Approach | Assessment |
|---|---|---|
| A | Shut down clients, redeploy, restart | Normally not possible; there is also the risk of forgetting a client |
| B | Build some schema version checking: let clients check the DAO version on every access and reload the DAO code dynamically | While loading code dynamically is really cool, your QA department will probably not like it. You need good security measures as well… |
| C | Have an additional layer of servers that acts as the DAO layer | This seems to promise a solution to many problems, at the cost of those additional servers |
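Option B, the version check, could be sketched like this (a hypothetical scheme: the version lives under an assumed `meta:dao_version` row in the store, and this sketch merely refuses stale access rather than actually reloading code):

```python
SCHEMA_VERSION = 3  # bumped whenever the persistence rules change


class VersionCheckingDAO:
    """Sketch of option B: before every access, compare the
    client's compiled-in DAO version with the version published
    in the store. A real client would trigger a code reload
    instead of just raising."""

    def __init__(self, store):
        self.store = store  # dict stands in for the KV store

    def _check(self):
        current = self.store.get("meta:dao_version", SCHEMA_VERSION)
        if current != SCHEMA_VERSION:
            raise RuntimeError("stale DAO code, reload required")

    def get(self, key):
        self._check()
        return self.store.get(key)


store = {"row:1": "value", "meta:dao_version": 3}
dao = VersionCheckingDAO(store)
```

The extra read per access is the price of the check; in practice you would cache the version and re-check it only periodically.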
A separate set of DAO servers
So if you have the money for some additional servers, this is the way to go. It offers solutions to all the problems mentioned above.
Performance might be a problem. The beauty of shared-nothing comes from the independent life that each thread in your business logic can live. If you query Google, you might, in that moment, well have 20 computers available just for your single request. This additional layer should scale at the same rate as the other I/O streams of your application, possibly in a one-to-one relation as shown in the picture above. If you have some reasonable work to do in your DAO layer (such as encrypting some fields, or calculating hashes for indexes), this computing power is not only additional cost; it frees your business layer from that work.
So you have your separate DAO servers. How to update them now? Restarting them all at the same time also means a short downtime. So here you are challenged to write code that lets you do the things mentioned above at runtime, such as changing permission settings or adding index rules. After all, these servers can hold database maintenance code as well.
You might want to use load balancing between the client and the DAO nodes, which gives you the additional benefit of scaling and replacing nodes at runtime. The DAO nodes may well buffer calls or run them in a multithreaded fashion to give the clients better response times from the database. A firewall can offer additional safety in your datacenter.
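The client-side half of that load balancing can be as simple as rotating through the known DAO nodes. A minimal sketch, assuming node names and ignoring health checks and node replacement, which a real setup would need:

```python
import itertools


class DaoNodePool:
    """Round-robin selection over a set of DAO servers.
    A production pool would also skip dead nodes and accept
    new ones at runtime; this only shows the rotation."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(list(nodes))

    def next_node(self):
        return next(self._cycle)


pool = DaoNodePool(["dao-1", "dao-2", "dao-3"])
picked = [pool.next_node() for _ in range(4)]  # wraps back to dao-1
```

A hardware or DNS load balancer achieves the same effect without client-side logic, at the cost of one more piece of infrastructure.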
With all this freedom, don’t forget that such a design still does not offer many of the things you may have gotten used to from traditional RDBMSs – unless you put them into your DAO layer code yourself (if you are a hard-core database expert). But you may not need many of these things – and if the scaling benefits outweigh the potential loss of precision and accuracy, data storage will never limit your business again.