{"id":20945,"date":"2021-09-24T23:13:21","date_gmt":"2021-09-24T16:13:21","guid":{"rendered":"https:\/\/renovacloud.com\/?p=20945"},"modified":"2024-12-02T17:27:25","modified_gmt":"2024-12-02T10:27:25","slug":"the-rise-of-cloud-data-warehouse","status":"publish","type":"post","link":"https:\/\/renovacloud.com\/en\/the-rise-of-cloud-data-warehouse\/","title":{"rendered":"THE RISE OF CLOUD DATA WAREHOUSE"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">With the rise of data-driven analytics, cross-functional data teams, and most importantly, the cloud, the phrase \u201ccloud data warehouse\u201d is nearly synonymous with agility and innovation. In many ways, the cloud makes data easier to manage, more accessible to a wider variety of users, and far faster to process.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">When it comes to selecting the right cloud data warehouse for your data platform, however, the answer isn\u2019t as straightforward. With the release of Amazon Redshift in 2013 followed by Snowflake, Google BigQuery, and others in the subsequent years, the market has become increasingly crowded. 
<\/span><span style=\"font-weight: 400;\">Add\u00a0<\/span><a href=\"https:\/\/aws.amazon.com\/big-data\/datalakes-and-analytics\/what-is-a-data-lake\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">data lakes<\/span><\/a><span style=\"font-weight: 400;\">\u00a0to the mix, and the decision becomes that much harder<\/span><span style=\"font-weight: 400;\">!<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Whether you\u2019re just getting started or are in the process of re-assessing your existing solution, here\u2019s what you need to know to choose the right data warehouse (or lake) for your data stack:<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">What makes a data warehouse\/lake?<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Data warehouses and lakes are the foundation of your data infrastructure, providing the storage, compute power, and contextual information about the data in your ecosystem. Like the engine of a car, these technologies are the workhorse of the data platform.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Data warehouses and lakes incorporate the following three main components:<\/span><\/p>\n<h4><span style=\"font-weight: 400;\">Metadata<\/span><\/h4>\n<p><span style=\"font-weight: 400;\">Warehouses and lakes typically offer a way to manage and track all the databases, schemas, and tables that you create. 
These objects are often accompanied by additional information such as schema, data types, user-generated descriptions, or even freshness and other statistics about the data.<\/span><\/p>\n<h4><span style=\"font-weight: 400;\">Storage<\/span><\/h4>\n<p><span style=\"font-weight: 400;\">Storage refers to the way in which the warehouse\/lake physically stores all the records that exist across all tables.\u00a0<\/span><\/p>\n<h4><span style=\"font-weight: 400;\">Compute<\/span><\/h4>\n<p><span style=\"font-weight: 400;\">Compute refers to the way in which the warehouse\/lake performs calculations on the data records it stores. This is the engine that allows users to \u201cquery\u201d data, ingest data, transform it \u2014 and more broadly, extract value from it. Frequently, these calculations are expressed via SQL.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Why choose a data warehouse?<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Data warehouses are fully integrated and managed solutions, making them simple to build and operate out-of-the-box. 
When using a data warehouse, you typically use metadata, storage, and compute from a single solution, built and operated by a single vendor.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Unlike data lakes, data warehouses typically require more structure and schema, which often forces better data hygiene and results in less complexity when reading and consuming data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Owing to their pre-packaged functionality and strong support for SQL, data warehouses facilitate fast, actionable querying, making them great for data analytics teams.<\/span><\/p>\n<h5><span style=\"font-weight: 400;\">Common data warehouse technologies include:<\/span><\/h5>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Amazon Redshift<\/b><span style=\"font-weight: 400;\">: The first widely popular (and readily available) cloud data warehouse, Amazon Redshift sits on top of Amazon Web Services (AWS) and leverages source connectors to pipe data from raw data sources into relational storage. Redshift\u2019s columnar storage structure and parallel processing make it ideal for analytic workloads.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Google BigQuery<\/b><span style=\"font-weight: 400;\">: Like Redshift, Google BigQuery leverages its parent company\u2019s proprietary cloud platform (Google Cloud), uses a columnar storage format, and takes advantage of parallel processing for quick querying. Unlike Redshift, BigQuery is a serverless solution that scales according to usage patterns.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Snowflake<\/b><span style=\"font-weight: 400;\">: Unlike Redshift or BigQuery, which rely on their proprietary clouds to operate, Snowflake\u2019s cloud data warehousing capabilities are powered by AWS, Google, Azure, and other public cloud infrastructure. 
Unlike Redshift, Snowflake allows users to pay separate fees for compute and storage, making the data warehouse a great option for teams looking for a more flexible pay structure.<\/span><\/li>\n<\/ul>\n<h3><span style=\"font-weight: 400;\">Why choose a data lake?<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Data lakes are the do-it-yourself version of a data warehouse, allowing data engineering teams to pick and choose the various metadata, storage, and compute technologies they want to use depending on the needs of their systems.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Data lakes are ideal for data teams looking to build a more customized platform, often supported by a handful (or more) of data engineers.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20939\" src=\"http:\/\/renovacloud.com\/wp-content\/uploads\/2021\/09\/datalake1.png\" alt=\"\" width=\"960\" height=\"540\" \/><\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\">Data lakes are often built with a combination of open source and closed source technologies, making them easy to customize and able to handle increasingly complex workflows. 
Image courtesy of Lior Gavish\/Monte Carlo.<\/span><\/i><\/p>\n<h5><span style=\"font-weight: 400;\">Some common features of data lakes include:<\/span><\/h5>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Decoupled storage and compute<\/b><span style=\"font-weight: 400;\">: Not only can this functionality allow for substantial cost savings, but it also facilitates parsing and enriching of the data for real-time streaming and querying.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Strong support for distributed compute<\/b><span style=\"font-weight: 400;\">: Distributed computing helps support the performance of large-scale data processing because it allows for better segmented query performance, more fault-tolerant design, and superior parallel data processing.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Customization and interoperability<\/b><span style=\"font-weight: 400;\">: Owing to their \u201cplug and chug\u201d nature, data lakes support data platform scalability by making it easy for different elements of your stack to play well together as the data needs of your company evolve and mature.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Largely built on open source technologies<\/b><span style=\"font-weight: 400;\">: This facilitates reduced vendor lock-in, and affords great customization, which works well for companies with large data engineering teams.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Ability to handle unstructured or weakly structured data<\/b><span style=\"font-weight: 400;\">: Data lakes and some data warehouses (like Snowflake and BigQuery) can support raw data, meaning that you have greater flexibility when it comes to working with your data, ideal for data scientists and data engineers. 
Working with raw data gives you more control over your aggregates and calculations.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Supports sophisticated non-SQL programming models<\/b><span style=\"font-weight: 400;\">: Many data lakes support\u00a0<\/span><a href=\"https:\/\/hadoop.apache.org\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Apache Hadoop<\/span><\/a><span style=\"font-weight: 400;\">,\u00a0<\/span><a href=\"https:\/\/spark.apache.org\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Apache Spark<\/span><\/a><span style=\"font-weight: 400;\">,\u00a0<\/span><a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/index.html\" rel=\"noopener\"><span style=\"font-weight: 400;\">PySpark<\/span><\/a><span style=\"font-weight: 400;\">, and other frameworks for advanced data science and machine learning.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">It\u2019s important to note that many warehouses, such as Snowflake and BigQuery, now support several of these functionalities, too.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Just when you thought the decision was tough enough, another data warehousing option has emerged as an increasingly popular one, particularly among data engineering teams.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Meet the data lakehouse, a solution that marries features of both data warehouses and data lakes, and as a result, combines traditional data analytics technologies with those built for more advanced computations (i.e., machine learning).<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-20935\" src=\"http:\/\/renovacloud.com\/wp-content\/uploads\/2021\/09\/datalake2.jpg\" alt=\"\" width=\"960\" height=\"540\" \/><\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\">The data lakehouse gives data teams even greater customizability, allowing them to store data on the cloud and 
leverage a warehouse solely for its compute engine. Image courtesy of Lior Gavish\/Monte Carlo.<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">Data lakehouses first came onto the scene when cloud warehouse providers began adding features that offer lake-style benefits, such as Redshift Spectrum or Delta Lake. Similarly, data lakes have been adding technologies that offer warehouse-style features, such as SQL functionality and schema.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Today, the historical differences between warehouses and lakes are narrowing so you can access the best of both worlds in one package.<\/span><\/p>\n<h5><span style=\"font-weight: 400;\">The following functionalities are helping data lakehouses further blur the lines between the two technologies:<\/span><\/h5>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>High-performance SQL<\/b><span style=\"font-weight: 400;\">: technologies like Presto and Spark provide a SQL interface at close to interactive speeds over data lakes. 
This opened the possibility of data lakes serving analysis and exploratory needs directly, without requiring summarization and ETL into traditional data warehouses.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Schema<\/b><span style=\"font-weight: 400;\">: file formats like Parquet introduced more rigid schema to data lake tables, as well as a columnar format for greater query efficiency.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Atomicity, Consistency, Isolation, and Durability (ACID)<\/b><span style=\"font-weight: 400;\">: lake technologies like\u00a0<\/span><a href=\"https:\/\/delta.io\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Delta Lake<\/span><\/a><span style=\"font-weight: 400;\">\u00a0and\u00a0<\/span><a href=\"https:\/\/hudi.apache.org\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Apache Hudi<\/span><\/a><span style=\"font-weight: 400;\">\u00a0introduced greater reliability in write\/read transactions, taking lakes a step closer to the highly desirable ACID properties that are standard in traditional database technologies.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Managed services<\/b><span style=\"font-weight: 400;\">: for teams that want to reduce the operational lift associated with building and running a data lake, cloud providers offer a variety of managed lake services. 
For example, Databricks offers a managed version of\u00a0<\/span><a href=\"https:\/\/hive.apache.org\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Apache Hive<\/span><\/a><span style=\"font-weight: 400;\">,\u00a0<\/span><a href=\"https:\/\/delta.io\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Delta Lake<\/span><\/a><span style=\"font-weight: 400;\">, and\u00a0<\/span><a href=\"https:\/\/spark.apache.org\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Apache Spark<\/span><\/a><span style=\"font-weight: 400;\">\u00a0while\u00a0<\/span><a href=\"https:\/\/aws.amazon.com\/athena\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Amazon Athena<\/span><\/a><span style=\"font-weight: 400;\">\u00a0offers a fully managed lake SQL query engine and\u00a0<\/span><a href=\"https:\/\/aws.amazon.com\/glue\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Amazon\u2019s Glue<\/span><\/a><span style=\"font-weight: 400;\">\u00a0offers a fully managed metadata service.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">With the rise of real-time data aggregation and streaming to inform lightspeed analytics (think Silicon Valley speeds:\u00a0<\/span><a href=\"https:\/\/eng.uber.com\/uber-big-data-platform\/\" rel=\"noopener\"><b>Uber<\/b><\/a><span style=\"font-weight: 400;\">,\u00a0<\/span><a href=\"https:\/\/doordash.engineering\/2020\/09\/25\/how-doordash-is-scaling-its-data-platform\/\" rel=\"noopener\"><b>DoorDash<\/b><\/a><span style=\"font-weight: 400;\">, and\u00a0<\/span><a href=\"https:\/\/medium.com\/airbnb-engineering\/data\/home\" rel=\"noopener\"><b>Airbnb<\/b><\/a><span style=\"font-weight: 400;\">), data lakehouses are likely to rise in popularity and relevance for data teams across industries in the coming years.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">So, what should you choose?<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">There\u2019s not an easy answer. 
In fact, it\u2019s no surprise that data teams frequently<\/span><a href=\"https:\/\/towardsdatascience.com\/migrating-to-snowflake-like-a-boss-6163293f0bcb\" rel=\"noopener\"><span style=\"font-weight: 400;\">\u00a0migrate from one data warehouse solution to another<\/span><\/a><span style=\"font-weight: 400;\">\u00a0as the needs of their data organization shift and evolve to meet the demands of data consumers (which, nowadays, is nearly every functional area in the business, from Marketing and Sales to Operations and HR).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">While data warehouses often make sense for data platforms whose primary use case is data analysis and reporting, data lakes are becoming increasingly user-friendly.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We find that regardless of the route you choose, it\u2019s important to apply the following best practices:<\/span><\/p>\n<h4><b>Choose the solution that maps to your company\u2019s data goals.<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">If your company only uses one or two key data sources on a regular basis for a select few workflows, then it might not make sense to build a data lake from scratch, both in terms of time and resources. But if your company is trying to use data to inform everything under the sun, then a hybrid warehouse-lake solution may just be your ticket to fast, actionable insights for users across roles.<\/span><\/p>\n<h4><b>Know who your core users will be.<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Will the primary users of your data platform be your company\u2019s business intelligence team, distributed across several different functions? What about a dedicated team of data engineers? Or a few groups of data scientists running A\/B tests with various data sets? All of the above? 
Regardless, choose the data warehouse\/lake\/lakehouse option that makes the most sense for the skill sets and needs of your users.<\/span><\/p>\n<h4><b>Don\u2019t forget data observability.<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Data warehouse, data lake, data lakehouse: it doesn\u2019t matter. All three solutions (and any combination of them) will require\u00a0<\/span><a href=\"https:\/\/towardsdatascience.com\/what-we-got-wrong-about-data-governance-365555993048\" rel=\"noopener\"><span style=\"font-weight: 400;\">a holistic approach to data governance<\/span><\/a><span style=\"font-weight: 400;\">\u00a0and\u00a0<\/span><a href=\"https:\/\/towardsdatascience.com\/data-quality-youre-measuring-it-wrong-8863e5ae6491\" rel=\"noopener\"><span style=\"font-weight: 400;\">data quality<\/span><\/a><span style=\"font-weight: 400;\">.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">After all, your data platform is only as powerful and reliable as the data that informs it; if your data is broken, missing, or otherwise inaccurate (we call this problem\u00a0<\/span><a href=\"https:\/\/towardsdatascience.com\/the-rise-of-data-downtime-841650cedfd5\" rel=\"noopener\"><span style=\"font-weight: 400;\">data downtime<\/span><\/a><span style=\"font-weight: 400;\">), it doesn\u2019t matter how advanced your pipelines are.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Your thoughtful investment in the latest and greatest data warehouse doesn\u2019t matter if you can\u2019t trust your data. 
To address this problem, some of the best data teams are leveraging\u00a0<\/span><a href=\"https:\/\/towardsdatascience.com\/what-is-data-observability-40b337971e3e\" rel=\"noopener\"><b>data observability<\/b><\/a><span style=\"font-weight: 400;\">, an end-to-end approach to monitoring and alerting for issues in your data pipelines.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">More on that in a future <\/span><a href=\"https:\/\/renovacloud.com\/news-events\/?lang=en\"><span style=\"font-weight: 400;\">article<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Author: Doron Shachar, CEO Renova Cloud<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>With the rise of data-driven analytics, cross-functional data teams, and most importantly, the cloud, the phrase \u201ccloud data warehouse\u201d is nearly synonymous with agility and innovation. In many ways, the cloud makes data easier to manage, more accessible to a wider variety of users, and far faster to process.\u00a0 When it comes to selecting the right 
[&#8230;]\n","protected":false},"author":11,"featured_media":20948,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[35],"class_list":["post-20945","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-and-analytics","tag-aws"],"_links":{"self":[{"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/posts\/20945","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/comments?post=20945"}],"version-history":[{"count":1,"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/posts\/20945\/revisions"}],"predecessor-version":[{"id":20946,"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/posts\/20945\/revisions\/20946"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/media\/20948"}],"wp:attachment":[{"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/media?parent=20945"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/categories?post=20945"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/renovacloud.com\/en\/wp-json\/wp\/v2\/tags?post=20945"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}