Customer data used to start and end with the company – nice, orderly, structured, manageable data that could be loaded into traditional data warehouses. Today, companies face a tidal wave of new data streams arriving at unprecedented volume and speed. The sheer volume and variety of this big data demands a new way of thinking about data warehousing and analysis, writes Michael Haddad, BI and solutions architect and co-founder of Praxis Computing, the local partner of software company Pentaho.
We already know that contemporary enterprises are inundated with data, and that this data holds huge potential for them. Big data has enjoyed buzzword status of late, but a newer concept bubbling up is the “customer 360 view” – a coherent view of a customer enabled by big data analysis, drawn from existing IT systems, social media, the Internet of Things (IoT), and machine-generated data. Multi-source, well-analysed data can tell a company not just who its customers are and how many of its products they hold, but also where they shop, where they go on holiday, what they name their pets, how they feel about the company, and whether they also deal with its competitors. In the same vein, data can reveal public sentiment, and even enable governments to spot potential problems before they boil over into, for example, service delivery protests.
Big data-ready or not?
Knowing this is the first step. Modern enterprises then need to derive value from the massive volumes of data generated – to offer better service, to upsell, to promote loyalty and to extend their reach across the customer lifecycle. Data can be used to make sense of historical or trend information, or to predict events, enabling proactive policing or interventions at both a public and a private organisational level.
All of this, though, lies beyond the scope and capabilities of the traditional data warehouses currently employed by most governments and corporates. This is the dawning realisation facing organisations in the digital age: they are simply not equipped to store, manage or extract data of this size and variety effectively and cheaply.
Big bucks, big maintenance
Of course, extracting business intelligence (BI) from data isn’t a new concept, and for many years big IT companies have offered tools to help corporates do just that. These have evolved to keep up with big data, but they are not easy – or remotely cheap – to build into your business processes. Dropping a couple of million dollars on the initial implementation, and huge sums in licensing fees, is not uncommon.
Then you must run these systems and keep them tuned – through a combination of your own internal resources and the provider’s team. This expensive, resource-intensive model can have the effect of keeping big data capabilities in the hands of huge multinationals and highly profitable organisations only, giving them yet another competitive advantage.
A new model, new capabilities
It was this conundrum that sparked the idea for Pentaho. Its founders – who include some top minds in computing, such as James Dixon, who coined the phrase ‘data lake’ – chose to be market disrupters, combining a handful of open-source technologies to produce a powerful solution for accessing Hadoop file stores, extracting, transforming and loading data, and performing analysis and reporting. This is where Pentaho has really come into its own: the ability to blend data sources. And it is still an open-source initiative, with community and enterprise versions.
All that comes at a fraction of the price of the existing, dominant competitors. It really challenges the notion that expensive equals good, and in this way perhaps even “democratises” data access. It also offers organisations that prefer visibility into the systems they use the ability to see the source code and control the information flow, if they so desire. Pentaho is comparable in most ways to those expensive, proprietary options, and superior in others.
Pentaho Data Integration (PDI) has a drag-and-drop GUI for developers. Off-the-shelf transformation steps can be dragged onto the canvas to perform hundreds of data access, transformation, integration and persistence operations to and from almost every data source, database and file system, without specialist knowledge. That is why, at Praxis, we have been Pentaho resellers and system integrators for six years.
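To make the idea concrete for readers who think in code, here is a minimal Python sketch of the kind of extract-transform-load step a single PDI transformation encapsulates. The database, table, column and file names are hypothetical, and in practice this work is done visually on the PDI canvas rather than hand-coded:

```python
# A hypothetical ETL step: read customer rows from a relational source,
# aggregate spend per country, and persist the result as a CSV file.
# Table and column names (customers, country, total_spend) are illustrative only.
import csv
import sqlite3


def export_customer_summary(db_path: str, out_path: str) -> None:
    """Extract customer rows, aggregate spend per country, write a CSV."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT country, SUM(total_spend) AS spend "
            "FROM customers GROUP BY country"
        ).fetchall()
    finally:
        conn.close()

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["country", "spend"])
        writer.writerows(rows)


if __name__ == "__main__":
    export_customer_summary("warehouse.db", "customer_summary.csv")
```

In PDI, roughly the same flow would be a table-input step wired to a text-file-output step, configured through dialogs rather than code.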
A fortuitous acquisition
Last year Hitachi acquired Pentaho, and since the acquisition the Hitachi Hyper Scale-Out Platform (HSP) offers native integration with the Pentaho Enterprise Platform – giving customers a software-defined, hyper-converged platform for big data deployments.
For us at Praxis, this is an exciting time. Working with Hitachi Data Systems (HDS) opens up a whole new, much bigger market for Pentaho, and it takes Pentaho from being a slightly fringe product to one backed by a huge international corporation. That, I think, makes users and corporations feel more secure about incorporating it into their businesses, confident that it will deliver revelatory insights without the huge capital outlay. With this partnership, HDS offers a real “plug-and-play” solution, quickly implemented and flexible enough to deal with the expanding volumes of structured, semi-structured and unstructured data dumped into a data lake.
The way forward
All of this leaves companies with a choice: what to do with their existing data warehouse investments. Thankfully, there are strategies available to them – an “all-in” model that migrates all existing company data into the new storage and processing environment, and a “migration-lite” version. In the latter, you don’t abandon your existing servers, but instead run a Hadoop file store in parallel with, and integrated with, your existing data warehouse. With Pentaho, we can bring the two together and, in near real time, create new data sources and results for the user. We merge the two, integrating and conforming the data, and making it available through an analytic front end. End users, such as actuaries in a life insurance company, can access decades of historical data blended with social streams and IoT-generated data, without having to know the underlying programming languages or involve the development team for, say, new reporting formats.
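As an illustration only, the sketch below shows the blending idea in plain Python with pandas: conform a shared customer key across a warehouse extract and a data-lake extract, then join them into a single frame for the analytic front end. The file names, columns and join key are hypothetical, and Pentaho performs the equivalent steps through its visual tooling rather than hand-written code:

```python
# A conceptual sketch of data blending, not Pentaho's internal mechanism.
# File names, columns and the customer_id key are hypothetical.
import pandas as pd


def blend_sources(warehouse_csv: str, lake_parquet: str) -> pd.DataFrame:
    """Join historical warehouse records with data-lake signals on a conformed key."""
    history = pd.read_csv(warehouse_csv)       # e.g. decades of policy records
    signals = pd.read_parquet(lake_parquet)    # e.g. social and IoT-derived signals

    # Conform the join key on both sides so the sources line up cleanly.
    history["customer_id"] = history["customer_id"].astype(str).str.strip()
    signals["customer_id"] = signals["customer_id"].astype(str).str.strip()

    return history.merge(signals, on="customer_id", how="left")


blended = blend_sources("policy_history.csv", "customer_signals.parquet")
print(blended.head())
```

The point of the example is the conforming step: once both sources share a clean key, the blended result can be exposed to analysts through a front end without them touching the plumbing.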
The availability and pervasiveness of vast amounts of personal, anonymised, transactional, financial and demographic data is supported by our increased ability to store and analyse that data at a reasonable cost. Companies can access numerous data cleansing, verification and enrichment services to improve business information. This is not new. What is new is the option to purchase these data sets and keep them alongside their own, for use by their own data scientists, analysts, quants and high-end users now and in the future – all made possible by Hadoop, Pentaho and other (mainly open-source) technologies.
The next big conversations
The next step in this new data paradigm is for companies to begin having the necessary conversations about where they want their big data and BI strategies to be in the coming two years – and, critically, whether their current resources have the skill set to move from SQL-based storage configurations to the new world of storage technologies. Finally, they will need to find the technology mix that best serves their BI strategy.
At the same time, organisations must create the space, and develop the skills, for business analysts and data scientists to flourish.