Data Vault and Cloud Enterprise Data Warehouse

Dan Linstedt: Data Vault and Cloud Enterprise Data Warehouse

A new frontier for changes to architecture and methodology

Recorded Live during Cloud Data Summit on October 16th, 2019


Contact Dan at danlinstedt.com

Full Transcript

Eric Axelrod: Welcome everybody. We are live now with Dan Linstedt, founder and creator of Data Vault, and he's going to talk about how Data Vault works on cloud data warehouse platforms. If you have any questions during the event, please drop them in the event chat and he will address those at the end. So Dan, the floor is yours.

Dan Linstedt: All right, thank you, Eric. Welcome, everybody, and thank you for attending. We're going to focus not on me but on the presentation. First, I need to thank the contributors to my presentation. A lot of people contributed ideas, and Eric is one of them.

So before we get started, let's just talk a little bit about a company I just founded called Data Vault Alliance. There's a lot of experience there. We've got seven authorized instructors worldwide and large partners for consulting around the world: Australia and New Zealand, and of course Europe, the US, North America, and so on. So, some interesting things. You want to take a look at datavaultalliance.com and see what we've got there.

So before we get started, you probably don’t know what a Data Vault is.

Dan Linstedt: So we're going to talk a little bit about Data Vault and why Data Vault, and then we'll get into cloud tech, and we'll talk a little bit about Data Vault on cloud tech. So don't feel lost about Data Vault. But my first question to you is: do you have one Death Star to rule them all? Have you built something like the Death Star, with a single point of failure, i.e. an exhaust port where the rebels can come in and simply explode what you've built? Or perhaps it's not the rebels that exploded it.

Perhaps it's big data. Perhaps it's real time. Perhaps it's your modeling solution. When you take a look at data warehousing, a lot of people like to think that the data warehouse is a bad idea today, or a bad name, or a problem to deal with. It turns out we need data warehousing.

Dan Linstedt: Whether we want to admit it or not, whether we want to admit that BI and analytics can do without data warehousing, there's no way to get rid of it. We definitely need a concept, at least a logical one, called the data warehouse. But when you look at third normal form, you see all these cascading change impacts. You see entire models being built up front before you can load anything. And then there are massive load dependencies. Third normal form as a data modeling technique for data warehouses doesn't work. And in the day and age of cloud vendors, you can't really do third normal form modeling in the cloud. It's possible in some instances where you get ACID compliance, but when you start scaling out big data systems or big data solutions, you get a lot of problems, especially with eventual consistency, which is commonly applied in big data solutions where you're dealing with tons and tons of data.

Dan Linstedt: And then there's this whole dimensional model. A lot of people start with dimensional models and they try this landing zone, PSA. And the latest thing is that the data virtualization vendors come along and they try to do this. Now, the problem here is we've been trying to do this for 20 years. We've been trying for the last 20 or 30 years to create our data warehouses in this fashion. And so my question to you is: have you seen it work? In some cases it does, especially if you're building silos. But in the cloud instance, we can no longer build silos. We can't afford to build a silo for finance, a silo for manufacturing, and a silo for sales and have it all work. The integration points don't mesh, and of course the answers don't mesh. So even simply throwing another tool like data virtualization on top of an existing platform (and don't get me wrong, data virtualization is great) won't solve these problems.

Dan Linstedt: So my question to you here is: what's the definition of insanity? And of course the answer is doing the same thing over and over again and expecting a different result. The sad part is that a lot of people are trying to do this again with a data lake, or today's buzzword, the data hub. I'm sure all of you have heard that. So let's talk a little bit about why this fails: complex patterns, high rates of production failures, slow load times, big volumes, all these things that the cloud tries to address. But remember, cloud data warehousing, cloud, is really just a platform. It's an elastically scalable platform if it's done right. Being a platform, it doesn't come with a methodology, it doesn't come with an architecture, it doesn't come with the people to make it work. It just says: hey, we can throw all our data in the cloud and try to scale.

Dan Linstedt: So you've got this architecture that did not work on premise. You throw it up in the cloud and you expect it to work on the cloud. Well, it might get you a little further down the road because you have this elastic scale, but eventually you're going to run out of space on the cloud as well. Eventually you're going to hit the same bottlenecks that you have today. So: landing zones that are not integrated by anything and don't create delta data, business rules that are complex, and you still have the siloed solutions and the federated information marts and all of that. But the biggest one out in the cloud is: how do you separate and protect your PII, your private or classified information? You can't really do that in the cloud unless your vendor helps you with solving that problem, and your architecture and your model are also at work to help you solve this problem.

Dan Linstedt: So we're going to talk a little bit now about Data Vault 2.0 and what it brings to the table. From an architectural perspective, I want to say: look, this is a logical diagram of your system. In Data Vault 2.0 we have systems architecture, data architecture, process architecture, methodology architecture, all kinds of different architectures at play. And it really doesn't matter what kind of platform you throw at it, whether you're in the cloud or on prem or you've got a hybrid solution. What matters is that you're building a data warehouse, that you split out your business rules from your data storage, and that your integration points are done by business key. So we split out data acquisition teams from data provisioning teams, or information provisioning teams, and we split out soft business rules. These soft business rules are the things that change all the time.

Dan Linstedt: And in fact the data virtualization engines are really, really good at this part. And this is where we start to talk about something that cloud does allow us to build: virtual information marts. So instead of building physical marts and moving the data one more time downstream, we virtualize those information marts and make it easier for a lot of these technologies to take over. So what is Data Vault 2.0? Well, Data Vault 2.0 is what we call a system of business intelligence containing the necessary components needed to accomplish the enterprise vision in both data warehousing and information delivery. Now, I hate to use the words data warehousing. I probably should use some sexier term: analytics, or BI, or enterprise BI solution, or data hub, or call it what you will. But when it comes down to brass tacks, a data warehouse is necessary, and inside the data warehouse, for the enterprise data warehouse vision, we still need a methodology.
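
[To make the idea of a virtual information mart concrete, here is a minimal sketch: instead of physically materializing a mart, you define it as a view directly over the raw Data Vault tables. This is an illustrative assumption, not an artifact from the talk; every table, column, and view name below is hypothetical, and the SQL targets Snowflake (the QUALIFY clause is Snowflake-specific).]

```python
# Minimal sketch: a "virtual information mart" as a view over raw
# Data Vault tables, so no data moves downstream. All names are
# hypothetical; the SQL dialect assumed here is Snowflake's.
CREATE_DIM_CUSTOMER = """
CREATE OR REPLACE VIEW dim_customer AS
SELECT
    h.hub_customer_hk AS customer_key,
    h.customer_id     AS customer_id,
    s.customer_name,
    s.customer_segment
FROM hub_customer h
JOIN sat_customer_details s
  ON s.hub_customer_hk = h.hub_customer_hk
QUALIFY ROW_NUMBER() OVER (            -- keep only the latest
    PARTITION BY s.hub_customer_hk     -- satellite row per key
    ORDER BY s.load_date DESC) = 1
"""

def create_virtual_mart(cursor) -> None:
    """Create (or replace) the virtual mart; soft business rules live
    in the view text and can change without touching stored data."""
    cursor.execute(CREATE_DIM_CUSTOMER)
```

[Because the mart is only a view, changing a soft business rule means redefining the view, not reloading or re-modeling the warehouse.]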

Dan Linstedt: We still need ways of working. We still need architecture and modeling and all of these things that come from the age-old practices of information engineering. Simply saying, I'm going to throw my data up into the cloud and then give everybody free rein, or free access to it, isn't going to get you to success. We've seen this before, and we're going to talk about this later, but we used to call this the federated query engine. We used to call it enterprise information integration, and when that failed, it all went away for a long time. And now it's back as data virtualization. Again, virtualization isn't bad, it's just another tool in the toolkit, but without a methodology you could have the best tool in the world and still not leverage it properly without a way of working. So it's vitally important that you focus on far more than just data modeling for your enterprise warehouse, and that you work with your architecture, work with your model, and work with a methodology to get best practices.

Dan Linstedt: So one of the things that the Data Vault brings to the table is what we call the Data Vault model. In the Data Vault we have the architecture, the methodology, the implementation, and the model, and the model is flexible and scalable. We have these components: one called the hub, which is the list of unique business keys; one called the link, which holds relationships and associations; and the other called the satellite, which is filled with descriptive data. So this is the model that we have. This is what we call a hub-and-spoke model, and we can adapt it without re-engineering. So one of the things you do need, if you're going to build an enterprise data platform, albeit a logical one. And when I say enterprise, what I mean is worldwide. I mean split into multiple instances.
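
[As a hedged sketch of those three components, the DDL below, issued from Python against a hypothetical warehouse, shows one hub, one link, and one satellite. The table names, column names, and types are illustrative assumptions, not a prescribed standard.]

```python
# Sketch of the three Data Vault model components. Names, types, and
# column choices are illustrative assumptions only.
DATA_VAULT_DDL = [
    # Hub: the unique list of business keys.
    """CREATE TABLE IF NOT EXISTS hub_customer (
           hub_customer_hk CHAR(32)     NOT NULL PRIMARY KEY, -- hash of the business key
           customer_id     VARCHAR(50)  NOT NULL,             -- the business key itself
           load_date       TIMESTAMP    NOT NULL,
           record_source   VARCHAR(100) NOT NULL
       )""",
    # Link: a relationship/association between hubs.
    """CREATE TABLE IF NOT EXISTS link_customer_product (
           link_customer_product_hk CHAR(32)     NOT NULL PRIMARY KEY,
           hub_customer_hk          CHAR(32)     NOT NULL,
           hub_product_hk           CHAR(32)     NOT NULL,
           load_date                TIMESTAMP    NOT NULL,
           record_source            VARCHAR(100) NOT NULL
       )""",
    # Satellite: descriptive, time-variant context hanging off a hub.
    """CREATE TABLE IF NOT EXISTS sat_customer_details (
           hub_customer_hk  CHAR(32)  NOT NULL,
           load_date        TIMESTAMP NOT NULL,
           customer_name    VARCHAR(200),
           customer_segment VARCHAR(50),
           hash_diff        CHAR(32), -- change-detection hash of the payload
           PRIMARY KEY (hub_customer_hk, load_date)
       )""",
]

def build_model(cursor) -> None:
    """Create the hub-and-spoke structures if they do not exist."""
    for statement in DATA_VAULT_DDL:
        cursor.execute(statement)
```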

Dan Linstedt: I'm not talking about a single monolithic enterprise data warehouse sitting on a server in one place. That's not what I'm talking about. In fact, that's not what I built when I worked at Lockheed Martin from 1990 to 2000, where I built Data Vault. We had 125 different source systems to integrate in under six months with a team of three and a half people, and we were successful. But one of the things about Lockheed Martin is that it was a global enterprise. Even in the 1990s it was a global enterprise: 125,000, sorry, 153,000 employees worldwide at the time, divided into seven sectors of business and 53 different companies, each with their own profit and loss. So while we didn't have the cloud, we actually had distributed data centers, and this was back when 10BASE-T was fast. So not only did we have to solve the problems at the architectural level, we had to solve the problems at the data modeling level for split services and split datasets.

Dan Linstedt: We weren't able to copy the data from one server to another. We had servers in Japan, servers in Australia, servers in the US, servers in different parts of the US. Some of that is helpful nowadays with the cloud; the cloud can absorb that. But even with PII and all the regulations, like the EU regulations, you still have to split your data into separate cloud instances. How are you going to get enterprise answers if you don't have the right data modeling context? If you try to build a single conformed dimension, for example, on a single server, you're not going to be able to answer your enterprise questions from that perspective, and this is where the Data Vault model really helps out. We can build on premise or in cloud using these hub techniques. As you see here, we have a hub for US customers and a hub for EU customers.

Dan Linstedt: Each on their respective cloud instances, we can apply different protection, different encryption, different mechanisms to protect the data. But the beautiful thing is we can link these things together. We can throw these links up either in the cloud or on premise or both, and we can run associated queries, again using a data virtualization tool, for example Denodo. If you're interested in names out there, there's Looker and Domo, and a few others that do these things. But a data virtualization tool on top of cloud technology, with the right data model in your enterprise warehouse, allows you to automatically take a look at this dataset and link these things together on the fly. If the rules change, if the business changes their mind on how these things integrate, you can certainly change the storage and the way these links work without actually impacting or affecting the auditability of your raw data sets.
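
[A hedged sketch of what that split can look like in practice: each region keeps its own customer hub, and its PII satellites, in its own instance, while a link relates the two by business key. The instance and table names below are invented for illustration, and the query assumes both hubs are reachable from one engine, for example through a virtualization layer or database shares.]

```python
# Hypothetical cross-region association: US and EU customer hubs live in
# separate instances/databases; a link table relates them by business key.
# PII satellites stay in their home jurisdiction and are never copied.
CROSS_REGION_QUERY = """
SELECT us.customer_id AS us_customer_id,
       eu.customer_id AS eu_customer_id
FROM   link_customer_sameas l
JOIN   us_db.dv.hub_customer us ON us.hub_customer_hk = l.us_customer_hk
JOIN   eu_db.dv.hub_customer eu ON eu.hub_customer_hk = l.eu_customer_hk
"""
```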

Dan Linstedt: Now, one thing I want to point out, backing up just a little bit: suppose I've got this model between customers and products and I want to add suppliers. Well, that's very, very easy to do, and this is another thing that you want. You don't want to just move to the cloud and try to scale your solution; you actually want to be able to scale your model with no re-engineering effort. You want to be able to add on to it, build incrementally, build different components in different parts of the world, whether on premise or in cloud, incrementally. A sketch of that additive change follows below. And again, I'm assuming here that you'll have multiple teams. You're going to have a US team; you might have four or five or eight different teams in the US, split across different geographical locations. In order to get them to succeed, they all need to work to the same level of methodology, and this is where the Data Vault methodology comes in.
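
[To illustrate that "no re-engineering" claim with a sketch, using hypothetical names again: adding suppliers is purely additive, one new hub and one new link, while every existing hub, link, and satellite, and everything already loaded into them, stays untouched.]

```python
# Additive model change: introduce suppliers without altering any
# existing table. Illustrative names and columns only.
EXTEND_MODEL_DDL = [
    """CREATE TABLE IF NOT EXISTS hub_supplier (
           hub_supplier_hk CHAR(32)     NOT NULL PRIMARY KEY,
           supplier_id     VARCHAR(50)  NOT NULL,
           load_date       TIMESTAMP    NOT NULL,
           record_source   VARCHAR(100) NOT NULL
       )""",
    """CREATE TABLE IF NOT EXISTS link_product_supplier (
           link_product_supplier_hk CHAR(32)     NOT NULL PRIMARY KEY,
           hub_product_hk           CHAR(32)     NOT NULL,
           hub_supplier_hk          CHAR(32)     NOT NULL,
           load_date                TIMESTAMP    NOT NULL,
           record_source            VARCHAR(100) NOT NULL
       )""",
]
```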

Dan Linstedt: Otherwise one team is going to produce one result, another team another result, and you're not going to be able to synchronize. Everybody has to follow the same standards, and this is where automation comes in.

Right? So this is something that we're looking at. Now let's take a look at some cloud issues, some data issues, and switch away from the intro to Data Vault to talk a little bit about Yoda's swamp, levitating your starship, and alignment with the Force. And then finally, try to answer the question: how do I become a Jedi master? Right? So, if you're stuck in Yoda's swamp: a long time ago, in a galaxy far, far away, MPP was the only way to scale. Little Miss MPP sat eating her curds and whey; along came an elastic spider and sat down beside her and scared the MPP away. Well, not really. This is your brain on MPP, and especially MPP on prem.

Dan Linstedt: This is what we called massively parallel processing, the shared-nothing solution. And it turns out that this was the way to go for a long, long time, especially if you have large data sets and need sub-second query response times. But along came a small company called Snowflake DB, which is not so small anymore. What really happened was the elastic spider decoupled Little Miss MPP into two different working components: compute and storage. And you can only do this in the cloud. The reason you can do this in the cloud, and only in the cloud, is that elastic storage is now handled by the cloud provider, and the cloud provider plays a crucial role in guaranteeing IO throughput and performance no matter where your data lives inside of that cloud instance. They guarantee the throughput to a company like Snowflake.

Dan Linstedt: So you don't have to think about it. Snowflake says to, say, Amazon or Google or the other one, Azure, Microsoft, excuse me: hey look, you've got to guarantee certain throughput between the storage and our compute nodes, and then we will scale our compute nodes; but regardless of that, the throughput remains constant, or consistent. And this is how your brain works in the clouds. May the cloud be with you. But honestly, this is important. Why? Because we no longer have to think or know about data co-location, data splits, data sharing. We no longer have to ship data across nodes to get joins to work inside of the platform. So this is a fundamental shift in thinking from over on the left side, when we did MPP the old way. And there are a number of new databases that still operate under this old paradigm.

Dan Linstedt: When we did it the old way, it was the data architect's responsibility to think about data layout, data splits, data partitioning, load balancing, and all of that. But in the new way, we don't have to think about that stuff. So can you levitate your starship without help? Can you actually sit there and do this? And the answer is, yeah, of course you can. But you've got to worry about data layout. You've got to worry about data distribution, data shipping. You have to worry about bringing new nodes online, indexing, cross-node joins, temp space for common results. You've got to worry about load balancing, hot-swapping data. These are all the things that data architects in an MPP environment using large, large data sets have to worry about. Now, I'll say one more thing about the old-school MPP environments: many of them, Hadoop excepted, require hard-structured or fixed-structure data.

Dan Linstedt: They're getting better at handling JSON and XML, but nonetheless, you still have all of these things to worry about. All the Jedi training in the world can't fix these problems. So what happens, if you're on the management side, is your people come to you and say: hey, I've got a new idea. Let me take Hadoop, which is open source, and rebuild it. Or let me take PostgreSQL and modify it myself for you. Let's build something on prem. I'm going to rebuild or reinvent the wheel for you. And of course they fail. So rather than reinventing the wheel, what can you do? What options do you have? Well, it's easier to redirect the flow of water than it is to stop it. So this is what Snowflake manages. If you move to a platform like the Snowflake database (and you may think, well, why isn't he mentioning Redshift, or BigQuery, or some of these other platforms that are out there; believe me, I've worked with them)...

Dan Linstedt: The Snowflake database is a game changer in the cloud. They are a disruptive force. If you've never heard of them, you need to go check them out. There's a reason why they have so much financial backing. There's a reason why customers are leaving their current MPP platforms and moving to Snowflake, and I'm going to give you a couple of big examples coming up here. But Snowflake manages all this stuff for you. It's like leveraging Yoda to help you levitate your starship. And the cloud storage provider manages all the rest of this stuff: the IO on the platform and the elasticity of the IO, keeping performance going, hot swapping, and recoverability. You don't have to worry about any of that, including hardware upgrades. Now, the nice thing about Snowflake, and this is where Snowflake really is different from, for instance, Redshift, or different from Azure

big data edition, which can run columnar format. But Azure big data edition runs MPP, even all the way down to the non-shared IO format, and Redshift runs MPP under the covers as well. Snowflake allows you to set up what's called a virtual compute cluster, and you don't have to worry about how many nodes, or what nodes, or what compute power is underneath it. They're sized like tee-shirt sizes: small, medium, large, extra large. And that's what you ask for, query by query. You can change the size query by query; sizing and resizing is instant. And these clusters, these nodes, are always on. So suppose you want to load-balance with a mining cluster: you don't have to load-balance, you just start a mining cluster and it's separate. And the beautiful thing is it all leverages exactly the same data set.
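
[Here is a minimal sketch of that tee-shirt sizing, using Python and the snowflake-connector-python package. Connection details, warehouse names, and the load procedure are placeholders of mine, not from the talk; resizing is one statement, and a separate workload simply gets its own warehouse over the same data.]

```python
# Sketch: per-workload virtual warehouses over one shared data set.
# Connection parameters, warehouse names, and the procedure are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
)
cur = conn.cursor()

# Resize the load warehouse for a heavy step, then size it back down.
cur.execute("ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'XLARGE'")
cur.execute("CALL my_heavy_load_procedure()")  # hypothetical workload
cur.execute("ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'SMALL'")

# A separate mining cluster: its own compute, the same underlying data,
# no rebalancing or data shipping required.
cur.execute("CREATE WAREHOUSE IF NOT EXISTS mining_wh WAREHOUSE_SIZE = 'MEDIUM'")
```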

Dan Linstedt: You don't have to move your data, you don't have to realign your data, you don't have to shift your data or ship your data to different parallel instances. If you've got a load cluster that needs more compute resources, start one up. So these are virtual compute clusters. They share this infinite elastic compute layer, and then of course they all share the same data. So Snowflake guarantees you throughput performance. Now, I'm not trying to sell you on Snowflake, but I've been in the industry around 30 years or so and I've never seen anything like it, so I do believe that they are seriously a game changer. And there's a reason why they're doing so well in the market today. That's not to say they're going to be the end game or the end result, and that's not to say they're one size fits all, but today, if you're trying to do data warehousing of any kind, or analytics of any kind, it is a serious platform to look at.

Dan Linstedt: So the vendors will say: no, I am your father. The vendors and your troops can claim they can do it. They claim they can rebuild it, and yeah, they can, but MPP is still MPP. In order to get MPP running in your environment, or even in a cloud, in a VM for example, the first thing you've got to do is spin up and configure the compute nodes. With Snowflake, you simply issue a single command, say set the size to extra large, and they're there, and the data's there as well. There's no balancing of the dataset, which, if you're running your own MPP, you've got to do yourself. And then of course the new nodes are offline until the data's finished redistributing and balancing, and then all of a sudden the new nodes are available for compute sharing. And then, only after all of those processes, do you get a chance to load-balance your queries and your processes, and, get this, on a shared logical compute cluster.

Dan Linstedt: Whereas again, in Snowflake, you can split those things out. Okay. So do you want, after all of this, to still be limited in your compute power? If the answer is yes, then by all means use Microsoft Azure big data edition. You'll be limited in your compute power based only on what you've purchased. Same thing with Teradata, same thing with Redshift. So you can be limited in your compute power; you can always buy more compute power, but then you have to go through the same steps here. So, okay, your vendor may take days or weeks to scale. This is the difference. Why will it take so long to scale? Because of the previous steps I just went through. It has to load-balance the datasets, and then you have to actually use architectural decisions to figure out which queries will work in those environments.

Dan Linstedt: Now, the difference in cost, on-prem versus cloud, has to be gauged depending on what you're doing and what platforms you choose, so we can't really get into that here. Your employees, of course, will want to reinvent the wheel: I can do it, I can build it. And they can build it, but it's going to be a lot more money and a lot more time. So it doesn't matter whether it's on prem or in cloud, the process of scaling is the same. However, in the cloud, using something like Snowflake, all the engineering is prebuilt for you, right? And you don't have to deal with dynamically added nodes, or reconfiguring, or sitting idle. They're just automatically there, automatically running. Whereas all types of load and mining queries in an MPP world all share the same compute cluster.

Dan Linstedt: So you can't stop change any more than you can stop the suns from setting; of course, on Tatooine in Star Wars, there are two suns that set on that planet. And having a lightsaber does not a Jedi make. So simply signing up for a cloud service and throwing your data up on the cloud doesn't make you a Jedi, okay? It is simply moving your stuff. If you've got garbage methodology, if you've got garbage standards, if you've got siloed solutions, or you've got the standard equivalent of a Kimball data warehouse that's in shambles today, and you simply move it to the cloud, all you've done is lifted and shifted garbage up into the cloud, okay? It does not clean up the mistakes that are present. Again, doing the same thing over and over again and expecting an enterprise result, that clean data result, a better result, is the definition of insanity, right?

Dan Linstedt: So instead, unlearn what you have learned and change the way you do things. Bring discipline, bring governance, bring standards, bring best practices. Bring a proven methodology that shows how to do these things, whether you're doing it on prem or in cloud. You see, to leverage cloud properly, you need a change in culture. You need a change in the way you're building your solution. So this is something that's important, okay? So, use the Force, let it flow through you; don't try to invent the Force yourself, right? If you're going to fix what's broken, you have got to change the way you build it. Leverage the proven methodology.

So let's talk about this: adventure, excitement, a Jedi craves not these things. The path to mastery comes from hard work and study, leveraging the lessons learned. One does not simply wake up and call themselves a Jedi master.

Dan Linstedt: It takes practice and consistency and discipline and standards and all of that. We want to build on top of the people that came before us, all the best practices that we have. And this includes disciplined agile delivery. This includes the right methodology, the right standards, the right architecture, and all of that. So

let's talk a little bit about what's going on with the propaganda that's out there. Let's cut through the hype.

Let's take a look at hype for a minute. You've heard this before. The propaganda: well, we used to call these things cycle time reduction. Then that got a bad name and changed to lean initiatives, which got a bad name, changed to business process re-engineering, which got a bad name, went to business process management, and so on. And it ended up being continuous improvement and agile today.

Dan Linstedt: Some more labels here that have changed names over the years: very large databases was what we called it in the eighties and nineties, then very large data warehousing, and then big data. And then somewhere I saw extreme data warehousing, which makes no sense. And then flat-file staging moved to data dumps; now we call it a landing zone. And then federated query engines, which is what we started with in the 80s, changed to enterprise information integration in the nineties, and today we call them data virtualization engines. There has been significant movement in the technology of these platforms; however, labels only last so long. And here's one: yesterday it was a data lake, today they're calling it a data hub. We've got a new label to learn. So your vendor says: don't re-engineer, just launch our platform in a VM in a cloud. See, you are cloud ready.

Dan Linstedt: And that's just not true. So, some interesting things to think about there. The empire will strike back: propaganda is everywhere, and the vendors blow a lot of smoke, like Senator Palpatine. Don't end up on the dark side of the Force. Existing vendor reps will tell you they can do this and you don't have to move to Snowflake. And yeah, they can do it, if you pay four to ten times the cost for the same functionality. So it really is a cost, or price-performance, play. I will say that the cost of Snowflake has been measured consistently and repeatedly against the cost of Teradata, the cost of Redshift, and the cost of Microsoft Azure big data edition; there are vendor cost analyses out there that have shown these things. I am not here to discuss cost, unfortunately.

Dan Linstedt: But internal IT departments will tell you they can build it rather than buy, and this is the old build-versus-buy question. The real kick in the pants is, you know, how much can they build it for, and how long does it take them to build the solution, when there's already one available, complete with engineering and support? And then of course, at the end of the day, you end up with 10x to 50x the bugs in production. Oh, we didn't think of that. And oh, the platform that we chose, we had to modify, so support for the modification is not available from the original vendors. So on and on and on. Some things to think about: it all sounds good until it doesn't scale elastically or on demand, or it fails in production. Right? So, something

to be aware of. Now, we're getting close to the end of the presentation here.

Dan Linstedt: I'm going to open it up for questions in just a few minutes, but I want to step off into engineering here and talk a little bit about case studies, show you some numbers from people who have leveraged not only cloud environments but Data Vaults in the cloud, and what they've gained from this. And if you want to talk to a couple of friends of mine, I'm certain I can hook you up with some folks in the industry, both customers as well as vendors, who can talk to you about the impact Data Vault has had on their organizations. So, engineer

for endurance. Data Vault, to summarize, is a system of business intelligence. One of the reasons why Data Vault is so strong, especially Data Vault 2.0, is we include people, we include process, and we include technology.

Dan Linstedt: We return to the roots of information engineering and we say: look, you need all three to make it successful. You've got to have an architecture, you've got to have a model, you've got to have a methodology, and you have to have implementation best practices. Absolutely important. And why not lift and shift? Well, you need a solid, scalable methodology. Remember, at Lockheed Martin I had to source 125 source systems with a team of three people, and we had six months to put it all into an enterprise data warehouse that spanned the globe. How do you do that? And yeah, we sourced PeopleSoft, i2, Oracle Financials, JD Edwards, SAP, even ADABAS, and some very arcane Pick Universe databases. How do you do that and build a solution for rockets, for launchpads, for supporting the NSA and NASA? Unless you have a scalable methodology, a process that drives the people, that the people can repeatably execute in an agile fashion, you won't get there from here.

Dan Linstedt: So these are some of the important points. You've got to have an engineered methodology for near-zero production errors. When you release to production with a big dataset, the last thing you want to be told is: oh, that failed, whether you're on the cloud or on prem, that failed for reason X or reason Y, and you have to go back and re-engineer it. Well, I'm going to tell you this: if you put 130 terabytes into a solution, release it to production, and then have to re-engineer 130 terabytes, that's going to be a huge hit to your agility. Never mind the fact that doing the re-engineering, moving 130 terabytes around and re-engineering the data model, is going to cost you, right? So you have to enable automated test cases and generation. You have to enable easy integration of data curation, AI and ML and deep learning algorithms, parallel, repeatable, scalable patterns, and so on.
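
[As a sketch of what "repeatable, scalable patterns" means in practice: every hub load can follow the same mechanical steps (normalize the business key, hash it, insert only unseen keys), which is exactly what makes code generation and automated test generation possible. The hashing convention below follows the common Data Vault 2.0 practice of hashing business keys; the helper names are illustrative, not from the talk.]

```python
# Sketch of a repeatable hub-load pattern: deterministic hash keys plus
# delta detection. Helper and column names are illustrative.
import hashlib
from datetime import datetime, timezone

def hash_key(*business_key_parts: str) -> str:
    """Deterministic hash of a (possibly composite) business key."""
    normalized = "||".join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def new_hub_rows(source_rows, existing_keys, record_source):
    """Return only rows whose business keys are not in the hub yet."""
    load_date = datetime.now(timezone.utc)
    rows = []
    for row in source_rows:
        hk = hash_key(row["customer_id"])
        if hk not in existing_keys:
            rows.append((hk, row["customer_id"], load_date, record_source))
    return rows

# Because the pattern is identical for every hub, generated tests can
# assert the same invariants everywhere, e.g. key normalization:
assert hash_key(" abc ") == hash_key("ABC")
```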

Dan Linstedt: So these are some of the things that you want to look at. Here are some case studies to back all those numbers up. SOX, Basel, GDPR, HIPAA: you want to meet compliance. AXA, the global insurance company, used Data Vault and cloud and on-prem technologies to meet GDPR regulations worldwide, with PII, Data Vault, and cell-level security. So, something to think about. You may have heard that Data Vault has too many joins, or that Data Vault can't scale. Well, I'm here to tell you both are wrong. This is the world's largest commercial Data Vault: 3.2 trillion records per day. That's 2.2 billion records in real time from IoT manufacturing devices per hour; every 60 minutes, 2.2 billion records moving into their environment. And then there's the insider-threat identity resolution system, which is using Data Vault business key attribution and identification.

Dan Linstedt: If you want to talk about agility, you want to reduce your turnaround time: these guys have also moved to the Snowflake platform recently and they've reduced turnaround time even further, but already from four weeks to two days. This is why a methodology is important. Okay. And then QSuper: they reached CMMI Level 5 repeatability and optimization seven weeks after leveraging Data Vault 2.0, and this was with a team of 10 people. So, some interesting things there. Let's talk a little bit about something I call a kickstart. This is a two-week process that we offer, which starts out with three days of training followed by seven working days of build. The productivity gains: four people, four days, two source systems, eight source tables, built the raw Data Vault, your data warehouse, and three star schemas. The previous effort, from the previous vendor, was built out at six months, 1.5 million dollars, and 30 people.

Dan Linstedt: That's ridiculous. Lockheed Martin, as I mentioned before, had 250 different source systems; we only got 125 in the first six months. Integrated into one Data Vault data warehouse servicing 5,000 business users, with sub-second query response times, in 1997, and that was on a Sybase system,

right? So that clearly wasn't cloud, but today it would be even faster on the cloud. So, agility in the methodology: Data Vault 2.0 brings the assurance that we can cope with increased velocity and change. And you can see cloud is beneficial at a price-performance and a data level, allowing you to do things with massively large data sets that you could not do on premise before, simply because of the cloud platform and the ability to scale dynamically. Right? So, split parallel teams working in the same way. All right, so, only 125 source systems? Yeah, in 1997, it was only 125 source systems.

Dan Linstedt: Three and a half people in IT. The rest of the competing IT teams were running with teams of at least 10 to 15 people, and for the same workload they were building in 90 days what we had a four-hour turnaround on. So this is the reason. So, cloud computing: you want split parallel teams working in the same way. Semantic master data integration, this is big. If you're going to go to the cloud, you need to govern your environment. Governance is number one. Security is also number one; they tie in that realm for number one for going to cloud. Understanding how you can scale and virtualize your mart selection is important. Some summary thoughts: cloud computing is here to stay. This is the future. This is the way to go. Cloud data platforms like Snowflake, in my personal opinion, are the right way to move forward.

Dan Linstedt: There are others, and I could talk to you all day about others. I've got some customers on Redshift and others on the Microsoft Azure big data platform. They will do things, and they will do things well, up to a certain point, and that's where they fall down and Snowflake takes over, which is an interesting statement. So Data Vault 2.0 can help you succeed with the other parts of the equation in the cloud, including the people, the process, the training, the technology. I believe that methodology is just as important as how you model your solution and how you leverage your cloud solution. And of course, governance is absolutely critical to all of this, as is security. So with that, I want

to thank you all for attending my session. I hope you got something out of it, and I'm happy to take questions or go offline and network with anyone who's interested.

Eric Axelrod: Thank you very much for the talk. It looks like we do have a couple of good questions in here.

Dan Linstedt: Is Snowflake still the leading case, even with a smaller amount of data? I believe the answer is yes. And the reason why is, again, price performance: you can actually size Snowflake down to what they call an extra-small, and you pay for only what you use.

Dan Linstedt: So the next question we got: any thoughts comparing Snowflake with the Cloudera cloud data warehouse? We're considering evaluating Cloudera. Yes, I will talk about Cloudera here briefly. Cloudera approached me about five years ago, four years ago. They wanted to put Data Vault on Cloudera, but Cloudera had some internal engineering issues that they may have solved, I don't know, honestly. But Cloudera has got some issues with Kudu, in the way the storage mechanism works, because Kudu is a storage architecture, and they have some issues with their in-memory database as well. Cloudera, in my personal opinion, cannot hold a candle to what Snowflake is doing. Cloudera is continuously losing market share, last I heard, and these are numbers that are publicly available, and everybody is now wondering; at least all of my customers have said: we don't know if Cloudera is actually going to survive going forward. But we can talk more about that later.

Dan Linstedt: Thank you for the comments, I appreciate it. Is NoSQL possible? Yes, NoSQL is possible. You can use different kinds of databases. NoSQL is possible on graph databases, and in fact Snowflake allows you to run R natively; it also allows you to run JavaScript and a few other things directly inside the database. They've got some interesting engineering for it. But as far as NoSQL away from Snowflake, not using the Snowflake platform, can you leverage that? Yeah. For certain specific business use cases, graph databases are a good thing to do, and Neo4j seems to be leading the charge there.

Dan Linstedt: Would I recommend Snowflake for big data, upwards of a hundred gig? Not necessarily. I think Snowflake is a game changer whether you have big data, whether you have 20 gig or a hundred gig or a couple of terabytes. Now obviously, a hundred gig and under I could run on my laptop these days using Microsoft SQL Server, so it's not a problem. That's not an issue at all, and I don't worry about that. If you want to use a hundred gig or under on an on-prem solution, you might not need cloud. But if you want to involve AI and machine learning, you probably need a lot of data to make those things work properly without false positives. And as far as bigger datasets or smaller datasets, to me, again, it's price performance.

Dan Linstedt: You have to look at what you're paying for with your current platform, and if you're on-prem under a hundred gigs, the chances of you actually paying less on a platform like Snowflake are slim to none. Now, that said, if you have complex data sets, what we call semi-structured, in other words XML and JSON to be precise, and they're super complex, Snowflake has capabilities that go way beyond anything SQL Server and Oracle offer today, including Teradata; Postgres comes close. But I would highly recommend that you move to Snowflake, or even give it a try under the 30-day eval I think they're offering these days, with your complex JSON and XML. I think you're going to be pleasantly surprised; there are a lot of features there. Okay. So, clients have been asking about NoSQL and Snowflake, to be precise again.
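
[A small hedged example of the semi-structured capability being described, against a hypothetical table: Snowflake stores JSON in a VARIANT column and lets you query into it with path notation and FLATTEN, with no up-front schema. Connection details and all names are made up for illustration.]

```python
# Sketch: loading and querying JSON in Snowflake from Python.
# Connection details and the table/fields are made up for illustration.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
)
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")
cur.execute("""INSERT INTO raw_events
               SELECT PARSE_JSON('{"order": {"id": 42, "lines": [{"sku": "A1"}]}}')""")

# Path notation reaches into nested JSON; LATERAL FLATTEN explodes arrays.
cur.execute("""
    SELECT e.payload:order.id::NUMBER AS order_id,
           l.value:sku::STRING        AS sku
    FROM   raw_events e,
           LATERAL FLATTEN(input => e.payload:order.lines) l
""")
print(cur.fetchall())
```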

Dan Linstedt: You can't do something like a graph query in Snowflake; they don't have a graph language that they interpret yet today. But their next big release is working on geospatial coordinates and geospatial queries. But again, you have user-defined functions; you can use JavaScript, for example. You can run R, and I believe you can run Python, right inside the platform itself. So there are ways to execute code that goes beyond SQL directly against the datasets that Snowflake leverages. Okay. White papers, samples available from implementations on Snowflake: what I forgot to tell you was this. Micron moved from Teradata to Snowflake, or is in the process of moving from Teradata to Snowflake, and they are a Data Vault implementation. I don't have a white paper from them, but I'm happy to talk with you offline about that.

Dan Linstedt: White paper samples around Snowflake itself? Yes. If you contact Snowflake, there's a bunch of information available from their sales team; they can talk to you about that and about the customers that have moved. They frequently run something called Snowflake for Breakfast events in local cities; you have to check for their schedule. These are usually very well attended, and they get you hands-on with their platform with a guided instructor who's really good. In terms of other implementation white papers, talk to me; we've got some case studies that we can share around Data Vaults. They're not necessarily Data Vaults on Snowflake, but we do have a number of customers that are Data Vault centric that are moving to Snowflake as we speak. Okay. So the last question here, I guess, is the last one; we could probably get time for a few more.

Dan Linstedt: How do open-source platforms work in this scenario? Again, Snowflake has what they call UDFs, user-defined functions, which you find in most other databases. The difference is Snowflake will run the code natively. So Snowflake can execute those things natively, like R and Python, directly against the datasets. Sorry, I'm repeating myself here, but there are ways to do it. In terms of running R and Python in Redshift, for example, that's a no-go; as far as I know, you have to do that through SQL in Redshift, and the same with Microsoft Azure big data edition today. My limited understanding of Microsoft Azure big data is that it cannot run the code natively; you can execute it, but it has to go outside the platform to do it. I do have more case studies and more successes to share with you in the future if you are interested in diving deeper, and they are not necessarily Snowflake related. Other than that, thank you very much. I'm happy to network with you.
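
[As a small sketch of the UDF capability mentioned here: Snowflake runs JavaScript UDFs inside the engine, created with plain DDL. The function below and its logic are hypothetical examples, not anything from the talk; note that JavaScript UDF arguments are referenced in uppercase inside the function body.]

```python
# Sketch: a JavaScript UDF that Snowflake executes natively in-engine.
# The function name and its logic are illustrative only.
CREATE_UDF = """
CREATE OR REPLACE FUNCTION normalize_key(s STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS $$
    return S ? S.trim().toUpperCase() : null;  // argument names are uppercased
$$
"""

def register_udf(cursor) -> None:
    """Create the UDF; it can then be called from any SQL query."""
    cursor.execute(CREATE_UDF)
```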

Eric Axelrod: And thank you very much, Dan. Do you want to drop your info on where people can learn more about Data Vault?

Dan Linstedt: datavaultalliance.com, so you can head over there. We have 535 practitioners today worldwide; that should be around a thousand by this time next year, and we're growing every day.
