From ETL to Realtime Map-Reduce
Like many, you may have come to the realization that running M/R jobs is not really an ad-hoc adventure. I am not sure how that illusion arose; after all, the way we process data has changed little, even though there is much more data now. In this post I want to place some hyped terms into the big picture, followed in later posts by some tips on doing real-time data processing with HBase.
Mostly you will do these things with your big data:
- keep it for reference (perhaps your bills),
- collect it first, then dive into it for analysis (“Where do my customers actually come from?”, aka OLAP)
- or apply some algorithm to it and derive values for business decisions, possibly in real time (“customers that bought A also bought B…” or “Current Top-Tweets by Region”; call it OLTP or CEP)
If you already know your data and have the questions you need answered, you can go straight to data processing. Otherwise you may first need to make the collected data readable. We used to call that ETL; these days M/R also does well here, with the help of Hive, Pig or Cascading. Finally, there are many great tools out there for investigating such data; some place it into analysis cubes to make that work a bit more convenient for non-programming analysts.
Once you know what you are looking for, you can decide how current your answers should be. You can stick with the ETL-to-cube approach if it is enough to look at the answers once a week or so. Or you automate and improve your ETL process further (an M/R approach simplifies scaling here). Or you treat your incoming data as a stream of events and rebuild the ETL logic to operate in real time. As the “Load” part of ETL becomes obsolete, I replace it here with a “P” for “Processing”:
- turning aggregations into incremental aggregations (“select sum() over a week” may become “increment X for thisWeek”),
- keeping a calculation context over a longer period (“if X happened 5 hours before Y, then…”),
- handling unique-value aggregations (“how many unique visitors do I have over a week?”),
- needing more CPU cycles and overall I/O, as you cannot benefit from the batch processing of classic ETL tools,
- synchronization: if your data arrives through different channels and you want to manipulate shared data (such as an index),
- and your business analyst may still need some kind of good old analysis – those tools want to be loaded, so you may keep some kind of ETL alive. In other words, you probably have to add code and machines; you cannot simply reuse what you have.
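To make the first point concrete, here is a minimal sketch of turning a batch sum into an incremental aggregation. It uses a plain in-memory map as a stand-in for the store (with HBase this would be an atomic counter increment); the week-bucket key and event shape are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class IncrementalAggregation {
    // In-memory stand-in for a counter column in a key-value store.
    private final Map<String, Long> counters = new HashMap<>();

    // Instead of running "select sum(amount) over a week" at query
    // time, increment the week's bucket as each event arrives.
    public void onEvent(String week, long amount) {
        counters.merge(week, amount, Long::sum);
    }

    public long sumFor(String week) {
        return counters.getOrDefault(week, 0L);
    }

    public static void main(String[] args) {
        IncrementalAggregation agg = new IncrementalAggregation();
        agg.onEvent("2011-W42", 10);
        agg.onEvent("2011-W42", 5);
        agg.onEvent("2011-W43", 7);
        System.out.println(agg.sumFor("2011-W42")); // 15
        System.out.println(agg.sumFor("2011-W43")); // 7
    }
}
```

The aggregate is always current, at the price of one write per event instead of one batch scan per week.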
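The unique-value point is the tricky one: unlike a plain sum, a unique count cannot be incremented blindly, because you must first know whether the value was seen before. A minimal sketch with an exact set per week bucket (all names invented for illustration):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class UniqueVisitors {
    // One set of seen visitor ids per week bucket. A real system
    // might use a probabilistic sketch (e.g. HyperLogLog) instead
    // of an exact set, to bound memory per bucket.
    private final Map<String, Set<String>> seen = new HashMap<>();

    // Returns true only the first time a visitor is seen in a given
    // week, i.e. only when the unique count actually grows.
    public boolean onVisit(String week, String visitorId) {
        return seen.computeIfAbsent(week, w -> new HashSet<>())
                   .add(visitorId);
    }

    public int uniquesFor(String week) {
        Set<String> s = seen.get(week);
        return s == null ? 0 : s.size();
    }
}
```

Repeated visits within the same week leave the count unchanged, which is exactly the behavior the batch “count(distinct …)” gave you for free.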
The good thing is that the new, highly scalable key-value stores help to implement this with rather simple patterns (which indeed often look quite similar to M/R algorithms, which may justify the term “realtime”). My preferred toy is HBase, but most of the patterns can also be implemented with Cassandra, Hypertable, Redis or Tokyo Cabinet.
Read about some simple patterns that seem to occur repeatedly in the ETP stage in the next post.
It’s not complicated to build your own code to Extract your data, do whatever Transformation, and Process it, possibly even using data that is already in your data store. For HBase, check out Avro, Thrift or Protocol Buffers to conveniently talk to HBase using complex domain objects. To be scalable, you generally want to avoid complex synchronization, and some tools are out there to help you with that task; I guess WibiData offers something more ready-to-use.
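On the synchronization point: instead of locking, key-value stores typically offer an atomic check-and-set style operation (HBase calls it checkAndPut) that lets concurrent writers update shared data with a read-modify-retry loop. Here is a minimal in-memory sketch of that pattern, using an AtomicReference as a stand-in for the store; the names are made up:

```java
import java.util.concurrent.atomic.AtomicReference;

public class CasIndexUpdate {
    // Stand-in for a row holding a small shared index; with HBase
    // the compareAndSet below would be a checkAndPut on the row.
    private final AtomicReference<String> indexRow =
            new AtomicReference<>("");

    // Append an entry without locks: read the current value, build
    // the updated one, and retry if another writer got in between.
    public void append(String entry) {
        while (true) {
            String old = indexRow.get();
            String updated = old.isEmpty() ? entry : old + "," + entry;
            if (indexRow.compareAndSet(old, updated)) {
                return;
            }
        }
    }

    public String current() {
        return indexRow.get();
    }
}
```

The retry loop makes each writer's update atomic without ever blocking the others, which is what keeps the pattern scalable across channels.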