Nobody can deny the adoption of cloud computing. According to Forbes, the cloud computing market will reach $411 billion by 2020. We all remember the debates in the late 2000s about whether cloud computing was just a fad or the “real thing.” A few years later, it was clear that the cloud was here to stay, and the debate shifted to whether a public cloud or a private cloud was the right way to go. Here we are, about to enter 2018, and companies have made significant progress in determining what’s right for them. The verdict is in. We’re in a new world, folks. Some applications work great on the public cloud. Some work great in a private cloud environment. Some will have to remain on premises. It looks like an on-premises + public cloud hybrid is here to stay!
The question is why. Doesn’t having multiple environments create more complexity? Rather than debate this endlessly, we thought we’d talk to the companies making this decision. So we talked to two of our customers: a large restaurant chain that feeds 1% of the world’s population daily, and a large coffee company with over 26,000 stores worldwide. Like most enterprises, they have adopted a hybrid architecture where analytics is done both in the cloud and on premises. Given their scale and complexity, Hadoop was the obvious analytics platform, but Hadoop is complicated, and it takes expensive admins to manage. Customers love the scale-out power of Hadoop but don’t want to deal with the complexities of managing the cluster. This is where the cloud comes in. Imagine someone else handling the elasticity and administration of the Hadoop cluster so you can simply focus on analyzing your data. What a novel idea! But that’s exactly the value proposition of a data lake in the cloud.
But while these customers enjoy the benefits of the cloud, including elastic compute and minimal administration, they also have a lot of data in legacy systems on premises. The data in these systems is valuable for analytics, but moving all of it, along with the related applications and analytics, to the cloud is not always immediately feasible given the cost and disruption. Many of these applications and analytics systems were built on legacy technology that doesn’t work in the cloud and would need to be completely redesigned. Anyone who has lived through a data migration can relate: it can be a long process fraught with high cost, major business disruption from planned application downtime, potential data loss, and more. That’s why we still have so many mainframes around. Consequently, most enterprises leave legacy data on premises and adopt a cloud-first policy for new projects. So now the restaurant chain may have data in Redshift on AWS and data in Oracle and Teradata in its own data centers, all of which it has to analyze to make intelligent decisions. Similarly, the coffee company continues to leverage Oracle, Teradata, and Cloudera environments on premises while building a new data lake using Power BI and ADLS on Microsoft Azure.
So things should be more complicated, right? I mean, now analytics has to span yet another platform. How does one find data across the cloud, multiple clouds, on-premises data lakes, and RDBMSs? Well, it doesn’t have to be complicated. We at Waterline have been helping Fortune 2000 companies across the globe automatically discover, understand, and govern their data with a single solution. To tie this diverse and distributed data estate together, our customers are deploying the Waterline Smart Data Catalog to automatically:
- Create a virtual view of the data across all data sources
- Show business and semantic context
- Show governance, compliance, and technical context
Let’s examine these one at a time, starting with the Waterline Smart Data Catalog. While many relational-centric product offerings like Alation or Informatica Enterprise Information Catalog claim Hadoop and object store support, that support is too primitive and narrow to be useful. For example, these catalogs do list all the files in an S3 bucket and let you search them by name and manually tag each file. But they don’t look inside! What if the labels are wrong or simply missing altogether? They also don’t do any of this at web scale. How is someone going to tag millions of files without really knowing what’s inside each file?
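To see why name-only search falls short, here is a minimal sketch in Python using entirely made-up file records (the file names, column names, and helper functions are hypothetical, invented purely for illustration). It contrasts a catalog that only searches object keys with one that profiles the contents of each file:

```python
# Hypothetical file inventory: each record is an object key plus the
# columns a content-aware profiler would discover inside the file.
files = [
    {"key": "customers_ssn.csv", "columns": ["name", "ssn"]},
    {"key": "extract_2017_11.csv", "columns": ["name", "ssn", "zip"]},  # opaque name
    {"key": "menu_items.csv", "columns": ["item", "price"]},
]

def search_by_name(term):
    """Name-only search, like a catalog that never opens the file."""
    return [f["key"] for f in files if term in f["key"]]

def search_by_content(term):
    """Content-aware search: look at the columns inside each file."""
    return [f["key"] for f in files if term in f["columns"]]

print(search_by_name("ssn"))     # finds only the helpfully named file
print(search_by_content("ssn"))  # also finds the opaquely named extract
```

The name-based search misses `extract_2017_11.csv` entirely, even though it contains the same sensitive SSN column; multiply that by millions of files and the need for automated, content-aware discovery becomes obvious.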
Read the rest here!