After a while we are meeting again, on 24 May 2012. You are invited! Check http://www.nosqlmunich.de/ .
Like many, you may have come to the realization that running M/R jobs is not really an ad-hoc adventure. It is not clear how such an illusion arose; after all, the way we process data has changed little, even though there is much more data now. I want to place some hyped terms into the big picture here, following up with some tips on doing real-time data processing with HBase in later posts.
Mostly you will do these things with your big data:
- keep it for reference (perhaps your bills),
- collect it first, then dive into it for analysis (“Where do my customers actually come from?”, aka OLAP)
- or apply some algorithm to it and derive business decision values, possibly in real time (“customers that bought A also bought B…” or “current top tweets by region”; call it OLTP or CEP)
If you already know your data and also have the question you need answers for, you can go straight to data processing. Otherwise you may need to make the collected data somehow readable first. We used to call that ETL; these days M/R also does well here, with the help of Hive, Pig or Cascading. Finally, there are many great tools out there to investigate such data; some place it into analysis cubes to make that work a bit more accessible for non-programming analysts.
Once you know what you are looking for, you can decide how current your answers should be. You can stick with the ETL-to-cube approach if it is enough to look at these answers once a week or so. Or you automate and improve your ETL process further (here an M/R approach simplifies scaling). Or you look at your incoming data as a stream of events and rebuild the ETL logic to operate in real time. As the “Load” part of ETL is then obsolete, I replace it here with a “P” for “Processing”:
- turning aggregations into incremental aggregations (“select sum() over a week” may become “increment X of thisWeek”)
- keeping a calculation context over a longer period (“if X happened 5 hours before Y, then…”)
- handling unique-value aggregations (“how many unique visitors do I have over a week?”)
- you may need more CPU cycles and overall I/O, as you cannot benefit from the batch processing of the classic ETL tools
- synchronization: if your data arrives through different channels and you want to manipulate shared data (such as an index, perhaps)
- your business analyst may still need some kind of good old analysis – these tools want to be loaded, and thus you may keep some kind of ETL alive. In other words, you probably have to add code and computers; you cannot simply reuse what you have.
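The first challenge above, turning a batch aggregation into an incremental one, can be sketched in a few lines. This is only an illustration: the dict stands in for a table of atomic counters (in HBase you would increment a counter column instead), and the row-key scheme `pageviews:<ISO week>` is a made-up naming convention.

```python
# Sketch: "select sum() over a week" becomes "increment X of thisWeek".
# The defaultdict stands in for a key-value store with atomic counters.
from collections import defaultdict
from datetime import date

store = defaultdict(int)  # row key -> counter

def record_pageview(day: date) -> None:
    """Bucket the event into its ISO week and increment that counter."""
    year, week, _ = day.isocalendar()
    store[f"pageviews:{year}-W{week:02d}"] += 1

# Three events: two in ISO week 21, one in week 22.
for d in [date(2012, 5, 21), date(2012, 5, 22), date(2012, 5, 28)]:
    record_pageview(d)

print(dict(store))  # {'pageviews:2012-W21': 2, 'pageviews:2012-W22': 1}
```

The point is that each event touches only its own bucket, so there is no weekly batch scan anymore; the weekly sum is always current.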
The good thing is that the new, highly scalable key-value stores help to implement this with rather simple patterns (which indeed often look quite similar to M/R algorithms, so it may justify the term “realtime”). My preferred toy is HBase, but most patterns can also be implemented with Cassandra, Hypertable, Redis or Tokyo Cabinet.
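One such pattern, sketched with a plain dict standing in for the key-value store: unique-visitor counting becomes a matter of idempotent writes. Each visitor ID becomes its own key (in HBase it could be a column qualifier per visitor), so duplicates collapse naturally and counting uniques is just counting keys. The bucket name and visitor IDs here are made up for illustration.

```python
# Sketch: unique-value aggregation via idempotent writes.
# Writing the same visitor twice leaves the store unchanged.
store = {}  # week bucket -> set of visitor ids

def record_visit(week: str, visitor_id: str) -> None:
    store.setdefault(week, set()).add(visitor_id)  # idempotent

# Five events, but only three distinct visitors.
for visitor in ["alice", "bob", "alice", "carol", "bob"]:
    record_visit("2012-W21", visitor)

print(len(store["2012-W21"]))  # 3
```

In a real deployment the set would not fit in memory for high-cardinality data; the store's sorted keys (or a probabilistic structure) take over that role, but the write-side logic stays this simple.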
Read about some simple patterns that seem to occur repeatedly in the ETP stage in the next post.
It’s not complicated to build your own code to Extract your data, do whatever Transformation, and Process it, possibly even using data that is already in your data store. For HBase, check out Avro, Thrift or Protocol Buffers to conveniently talk to HBase using complex domain objects. To be scalable, you generally want to avoid complex synchronization. Some tools are out there to help you with that task; I guess WibiData offers something more ready-to-use.
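The idea of storing complex domain objects in a key-value store can be sketched as follows. Here `json` is only a stand-in for a real serialization framework like Avro, Thrift or Protocol Buffers, and the `Customer` fields and row-key scheme are invented for illustration.

```python
# Sketch: serialize a domain object to bytes, store it under a row key,
# and rebuild the object on read. json stands in for Avro/Thrift/protobuf.
import json
from dataclasses import dataclass, asdict

@dataclass
class Customer:
    id: str
    name: str
    orders: list

store = {}  # stand-in for an HBase table: row key -> bytes

def put(c: Customer) -> None:
    store[f"customer:{c.id}"] = json.dumps(asdict(c)).encode("utf-8")

def get(customer_id: str) -> Customer:
    return Customer(**json.loads(store[f"customer:{customer_id}"]))

put(Customer(id="42", name="Ada", orders=["A", "B"]))
print(get("42").orders)  # ['A', 'B']
```

A schema-based format like Avro buys you what json does not: compact binary encoding and schema evolution, which matters once old and new readers share the same table.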
We are inviting you again to discuss NoSQL and Big Data matters. Stefan Seelmann will show us an example of a real-world integration of the two top-class (and now top-level) projects. You are welcome to bring a short presentation too.
When: 6 May 2010, 18:00, open end. We may get something to eat & drink from the pizza shop around the corner.
Place: eCircle GmbH, 80686 München, Nymphenburger Str. 86.
We have set the date for the next meeting. It’s four weeks before the Berlin meeting, which runs in parallel to THE CONFERENCE, so it is a perfect time to gather the right questions that can be discussed there.
I’ll post more about the talks later.
Christoph Rupp started with an introduction to his embeddable HamsterDB, a nice and light alternative to BerkeleyDB. Besides a comprehensive feature set, he impressed with the strong quality rules that are applied to each release. A fast embedded DB is certainly a different world from the massive scaling we do with Hadoop & Co, but why not imagine HBase shards using a really fast embedded piece of C code?
Afterwards we had some good discussions in different directions, spanning from Hadoop to all kinds of cloud challenges.
This was our second meeting, and I still assume there are more data fighters out there in the Munich area. Spread the word!
Looking forward to the next Open Hadoop Users Group meeting, probably mid-May 2010.