
Strategy | Dec 08, 2015

Insights on Big Data from Some of Credera’s Thought Leaders

Justin Bell

Last week I published a post that got a really good response.  The topic was architecture insights from some of Credera’s technology leaders.  As I mentioned, one of my favorite things about working at Credera is that every day I get to collaborate with our amazing team (my friends) on a wide variety of problems.  When a client asks me “what do you know about (fill in the blank technology topic)?”, I almost always have a few talented colleagues I can turn to for help.  We get together, talk through that client’s situation, problem, or question, and put our collective experience and skill sets together to help.

I thought it might be helpful to share some interesting perspectives from some of Credera’s top technology leaders here.  This week, the question is regarding “Big Data”.

How is “Big Data” changing the way you architect systems?

Andrew Stewart, Principal, Business Intelligence Practice

Big Data is a relative term.  I believe at any point in time we will always have some version of Big Data.  As technology advances, the big data of today will become the norm and there will be another new “big data” that we will be talking about.  Social and IoT are good examples of this: social was the big data topic a couple of years ago, and now IoT is the next wave.

The confluence of big data and advanced analytics is a perfect storm of new opportunity for companies.  Technology tends to outpace people’s skills; however, over the last 6-12 months it has become more common for people to understand what’s possible within the realm of analytics.  The rise of business analytics programs at universities is seeding a tremendous uptick in the analytical skills within companies.  The next shoe to drop is executive understanding of analytics.  It will be interesting to see how this continues to evolve as the future leaders of the millennial generation fill the executive suite.

Big Data has morphed the architecture for BI and DW environments to accommodate the speed (velocity), variety, and volume of data.  Everyone is talking about the three V’s, but of these, I’d say that velocity is the one that has challenged architecture the most.  “Real-time” now means within nanoseconds of an event occurring, and as volume increases it will be important to keep pace with that velocity.  Technologies like Storm, Spark, Kafka, and Go will continue to stretch our imagination of what’s possible.

Best new tool: SQL in Hadoop (e.g., Actian Vortex, Splice Machine, etc.)

Hadoop has been a mysterious technology for so many over the past few years, and MapReduce required specialized skills to develop.  SQL-in-Hadoop technologies are now starting to surface, allowing companies to leverage the power of Hadoop while providing a familiar SQL interface for querying the vast amounts of data available.
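Andrew’s point can be sketched in miniature.  The snippet below is illustrative only: it uses Python’s built-in SQLite as a stand-in for a SQL-in-Hadoop engine like Actian Vortex, and contrasts a hand-rolled aggregation (the kind of logic a MapReduce job encodes) with the familiar SQL statement that answers the same question.

```python
import sqlite3

# Hypothetical order records: (channel, amount).
orders = [("web", 120.0), ("store", 80.0), ("web", 45.5), ("store", 10.0)]

# Without a SQL layer: aggregate by hand, the way a MapReduce job would.
totals = {}
for channel, amount in orders:
    totals[channel] = totals.get(channel, 0.0) + amount

# With a SQL layer: the same question as one familiar statement.
# SQLite here stands in for a SQL-in-Hadoop engine; the query looks the same.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (channel TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", orders)
sql_totals = dict(db.execute(
    "SELECT channel, SUM(amount) FROM orders GROUP BY channel"))

assert totals == sql_totals  # both approaches agree
```

The SQL line is the point: an analyst who already knows GROUP BY can ask the question without learning a new programming model.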

Jason Goth, Principal Architect, Technical Architecture & Strategy Practice

First, we have to agree on what is meant by “big data”.  It’s a pretty overloaded term.  When I use the term “Big Data”, I am generally referring to a system with data volumes so large that traditional solutions (SAN/NAS, RDBMS, etc.) simply can’t handle them.  So you have to – by definition – do something different.

What is it, then, that you do differently?  My answer is “almost everything”.

For example:

  • Re-think Consistency: As data volumes grow, they have to be replicated, partitioned, etc., and that means giving up one of the CAP guarantees: Consistency, Availability, or Partition tolerance (see https://en.wikipedia.org/wiki/CAP_theorem for details).  Consistency is usually the one that has to go.  This can be a big challenge for business and technology folks alike – anyone who strives to hold onto that “Single Source of the Truth”

  • Re-think Programming Models: Do you expect to create some OO application that would manage a graph of billions of objects?  Not likely.  You’ll need to leverage MapReduce and other new approaches designed to interact with large, distributed data sets

  • Re-think Synchronous Interfaces: Very large volumes and real time processing means taking an asynchronous or event-driven approach in many cases.  These solutions are much harder to reason about and implement

  • Re-think Testing: In many big data applications, there is no way to know the “correct answer”.  For example, in a machine learning system that recommends products, what is the “right” product recommendation?  Won’t it change with every purchase?  This changes the way we think about testing systems

  • Re-think Hosting and Storage Strategies: With volumes approaching petabyte/exabyte scale, can you afford to take the same SAN/NAS hardware strategies?  How will you back up that volume of data?  Does it even make sense to try to restore it?
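Jason’s “re-think programming models” bullet can be sketched in a few lines of plain Python.  The function names below are hypothetical and the “cluster” is a single process, but the map / shuffle / reduce shape is the same one Hadoop distributes across machines.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs: one (word, 1) per word in the record.
    for word in record.lower().split():
        yield word, 1

def reduce_phase(key, values):
    # Combine all values emitted for one key.
    return key, sum(values)

def run_mapreduce(records):
    # Shuffle: group mapper output by key, then reduce each group.
    # A real framework does this across many machines; here it is in-process.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

counts = run_mapreduce(["big data big ideas", "data at scale"])
# counts["big"] == 2 and counts["data"] == 2
```

Notice there is no object graph at all: the programmer only supplies two small pure functions, which is what lets the framework scale the job out.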
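The “re-think testing” bullet can be illustrated the same way: when there is no single correct answer, tests assert properties of the output rather than exact values.  The recommender below is a deliberately toy stand-in, and every name in it is hypothetical.

```python
# Toy catalog and recommender; in a real system the model would be learned.
CATALOG = {"book", "laptop", "headphones", "desk"}

def recommend(purchase_history, k=2):
    # Suggest catalog items the user has not already bought.
    return sorted(CATALOG - set(purchase_history))[:k]

recs = recommend(["laptop"])

# Property-style checks: right count, valid items, nothing already owned.
# No assertion claims any one item is "the" correct recommendation.
assert len(recs) == 2
assert all(r in CATALOG for r in recs)
assert "laptop" not in recs
```

The assertions survive retraining or a changed catalog, which an exact-match test would not.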

Those are just a few of the dozen or so examples I can think of, but I hope they illustrate the point.  Implementing big data solutions requires breaking down some deeply rooted beliefs about how systems should be built and should behave.  Both business owners and technologists have to embrace these changes to really see the benefits of big data.

Rajesh Rao, Senior Architect, Business Intelligence Practice

We implemented a newish concept in the world of Hadoop called “SQL on Hadoop” at a recent client.  In essence, the concept is to take the complexities out of working with Hadoop directly (using MapReduce, Pig, Hive, etc.) by putting a SQL and scripting layer over Hadoop.  The idea is that analysts and programmers have traditionally been SQL-aware, and moving them to a rather complex programming paradigm is a steep challenge.  Hence, keep them in the familiar world of SQL but leverage the power of Hadoop underneath.  The product we used is Actian Vortex.  I think this technology has tremendous opportunity for growth, and there are other players here as well, such as Splice Machine and DataStax.

An honorable mention goes to DataStax Enterprise.  It is a complete OLTP and OLAP platform that an app can interact with for persistence and data operations, while providing a real-time analytics pipeline that does not impact the performance of the operational store.  It is built on Hadoop, Cassandra, Solr, and Spark, all of which are proven.  This is my first choice for any big data project that comes up in the next year.

On how Big Data is changing the way I architect systems, a quick anecdote: We were at a client last year bidding on a pure OLTP application to manage expenses, and my first few questions to the product owner and other client stakeholders were: “How much data will you be generating?  How are you using that data to improve your product offering?  Do you realize that the higher value lies (in this case) not in the application but in the data?”  Because they were collecting huge amounts of financial transaction data with geospatial content built in, that data was a gold mine.  My architecture recommendation to them started with ensuring adequate storage, warehousing capabilities, etc., because the architecture for the application itself was not going to be a challenge.

While that is one perspective on Big Data (seen as large data sets), to me Big Data is a “strategy” and not a technology.  The “legacy” or “traditional” definition of Big Data was based on Volume, Variety, and Velocity.  My personal addition is a fourth V – Value.

Big data can only work its magic if a business puts a well-defined data strategy in place before it starts collecting and processing information.  That strategy should be based on key business priorities; the data component is developed afterwards, with the aim of serving those priorities.  Often, clients come into a big data engagement thinking they have to store and manage data, but really that is a rabbit hole with no end.  Instead, if they know why they want the data, that frames their big data approach and balances their priorities – real-time or not, storage-heavy with historical analytics or analytics-heavy with point-in-time analytics, etc.

A Big Data project should not be measured by Time to Delivery; rather, it must be measured by Time to Answer (a.k.a. Value) and, further, by Time to Decision based on the data.  And the decision and value should be well known (at least reasonably well known) before tackling a big data project.

Gilbert Sharp, Senior Architect, Business Intelligence Practice

While this doesn’t directly address the question as asked, the following is a restatement of a synopsis I put together based on the TDWI conference I attended last Spring. Keep in mind that this particularly addresses things from a Business Intelligence perspective, and not necessarily an entire Enterprise Architecture perspective.

  • Big Data has been overhyped in the last few years, with the overhype now leading the concept into Gartner’s trough of disillusionment

  • While the benefits to be achieved are not up to the promises of the press, there is still an important role for Big Data to play

  • Big Data continues to be a big theme in the Data Warehousing/Business Intelligence arena but it appears that the EDW is not going away despite press that continues to imply Hadoop will replace data warehouses

  • In the realm of Business Intelligence/Analytics, the emphasis is shifting away from Big Data in terms of volume, which is a primarily technical issue that can be addressed in a number of ways, toward a primary focus on Big Data as a conceptual set of data (velocity, variety) that presents itself in new, often streaming, and generally non-relational formats: unstructured (e.g., video, images), self-structuring (e.g., IoT), and semi-structured (e.g., social media feeds)

  • Accommodating the impact of Big Data and the needs to introduce advanced analytics against Big Data while still supporting a more regulated approach to reporting needs has led to a growing concept of Business Intelligence being supported by an entire interconnected Data Ecosystem, and not just a database

  • As the Data Ecosystem grows, the concept of Data Governance needs to expand to include multiple levels of data certification to cover differences in the Big Data arena (such as a data lake) vs an EDW and other classic BI constructs in regards to things like timeliness, cleanliness, consistency and reliability

  • The introduction of Social Networks and the plethora of Social Media based data, usually covered under the umbrella of Big Data, will continue to be a driver of innovation and evolution in Business Intelligence Systems

Hopefully you find their insights as helpful as I do.  If you ever have a question or need some help, let me know.  Odds are that one of my friends at Credera can help you.

Have thoughts on this topic?  Comment or send me a note.
