How Apache Hudi transformed Yuno’s data lake

Discover how Yuno transformed its data infrastructure with Apache Hudi, optimizing data lake performance, reducing costs by 70%, and enabling real-time data insights. Learn how Hudi's advanced features like time travel, indexing, and automated file management improved efficiency and scalability, revolutionizing Yuno's data management strategy.

As the Head of Data at Yuno, I have witnessed firsthand the challenges and opportunities that come with managing a modern data infrastructure. Data is at the heart of everything we do, driving insights, decisions, and innovations across the company.

However, with the increasing volume and complexity of data, we faced significant obstacles in maintaining efficiency, consistency, and cost-effectiveness. And so, our primary objective was to enhance our data management capabilities. We needed a solution that could provide better control over our tables, improve performance, and offer a high degree of organization.

The criteria for our new system were clear: it had to be ACID-compliant (atomicity, consistency, isolation, durability) and capable of efficiently handling upserts and deletes. After evaluating several options, Apache Hudi emerged as the ideal choice.

Apache Hudi is a data lake framework that simplifies data management on cloud storage by enabling efficient ingestion, updates, and deletes on large datasets. It also offers benefits such as incremental ingestion and excellent compatibility with real-time data sources.

The diagram illustrates the architecture of a lakehouse within a VPC (Virtual Private Cloud) environment.

High customization over different use cases

Before diving into specific strategies, it's important to understand the flexibility that Apache Hudi offers across different use cases. Whether your priority is optimizing read or write performance, Hudi provides options tailored to your specific needs.

1. COW and MOR

Apache Hudi offers a wide range of options, but the most fundamental choice you’ll make is selecting the table type that best suits your needs. This decision depends on whether you want to prioritize reading or writing performance.

If you need faster writing performance and can tolerate slower reads, the MOR (Merge on Read) strategy is a better fit.

If you want to optimize for faster reads at the expense of slower writes, you should go with the COW (Copy on Write) strategy.
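
To make the trade-off concrete, here is a minimal sketch of the write option that controls it, assuming a hypothetical PySpark write; the table and field names are illustrative placeholders, not our production settings:

```python
# The table type is a single write option; the rest of the write config is unchanged.
# COPY_ON_WRITE  -> rewrites base files on update: faster reads, slower writes.
# MERGE_ON_READ  -> appends delta logs merged at read time: faster writes, slower reads.
hudi_options = {
    "hoodie.table.name": "payments",                          # placeholder name
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",    # or "MERGE_ON_READ"
    "hoodie.datasource.write.recordkey.field": "payment_id",  # placeholder key
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}
```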

The table compares the Copy on Write (COW) and Merge on Read (MOR) table types.

2. Partitions aren't enough: what about indexes?

While the choice between COW and MOR is critical, it’s just one piece of the puzzle. As datasets grow, partitioning alone is not enough to ensure performance. This is where indexing becomes a crucial factor in improving query efficiency and reducing latency.

When dealing with massive datasets, common challenges often arise with operations like updates, upserts, or reading specific rows. Partitioning your table is essential, but it's only the starting point. As your data grows, even partitioned tables can become large, and you’ll need to efficiently identify which partition contains the specific row you’re looking for.

To reduce latency, minimize the amount of data read, and improve query performance, you’ll need more than just partitions—you’ll need to consider indexing. This is where Apache Hudi shines, offering robust support for multi-modal indexing.

Among the various indexing strategies that Hudi provides, the Record Level Index (RLI) stands out. RLI stores record keys in Hudi's metadata table using HFile, the key-value file format borrowed from HBase, delivering exceptional lookup performance compared to other global indexes.

However, not every use case requires a global index. For these situations, Hudi offers efficient alternatives, such as the well-known Bloom Index, tailored for specific needs without the overhead of a global index.
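
As a rough sketch, and assuming Hudi 0.14.x option names, the index choice comes down to a handful of write options; the values below are illustrative, not a recommendation:

```python
# Record Level Index: a global index stored in Hudi's metadata table.
rli_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.record.index.enable": "true",
    "hoodie.index.type": "RECORD_INDEX",
}

# Bloom index: a lighter, non-global alternative when updates stay within a partition.
bloom_options = {
    "hoodie.index.type": "BLOOM",
}
```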


3. Table maintenance

Beyond indexing, maintaining your tables effectively is essential for long-term performance optimization. This includes ensuring efficient file management, such as file sizing, clustering, cleaning, and compaction. Let’s take a closer look at how these features contribute to keeping your data processing smooth and efficient.

The clustering service is as crucial as indexing when it comes to performance. Query engines perform more efficiently when frequently queried data is physically ordered. Apache Hudi natively supports clustering with a variety of strategies to meet different needs.

The file sizing service addresses common issues like small file sizes, which can significantly slow down read performance in data lakes. When tables are fragmented into many small files, queries require more requests, leading to increased processing time. Proper file sizing also improves compression, as poorly sized files can lead to inefficient compression and, consequently, larger storage requirements.

The cleaning service is important for reclaiming space occupied by outdated versions of data. By cleaning up old versions, you can free up storage and maintain a more efficient table structure.

All of these features are fully supported by Apache Hudi. You can run these processes either inline (as part of your pipeline) or asynchronously as separate jobs, depending on your specific requirements.
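
The sketch below groups the maintenance-related options discussed above, using Hudi 0.14.x option names. The thresholds and column names are placeholders to tune per table, and each service can be switched to its async variant instead of running inline:

```python
# Illustrative table-maintenance options; values are placeholders, not recommendations.
maintenance_options = {
    # Clustering: physically sort frequently filtered columns.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.sort.columns": "merchant_id,created_date",  # placeholder columns
    # File sizing: avoid small files that slow down reads and hurt compression.
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # ~100 MB
    "hoodie.parquet.max.file.size": str(512 * 1024 * 1024),     # ~512 MB
    # Cleaning: reclaim space held by outdated file versions.
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",
    # Compaction: only relevant for MOR tables.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}
```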

4. Time travel for debugging and data auditing

In addition to performance optimizations, Hudi’s time travel feature adds another layer of value by improving data integrity and enabling efficient debugging and auditing. This capability plays a pivotal role in maintaining data quality over time.

One of the standout features of Apache Hudi is its time travel capability, which allows you to roll back to previous versions of tables. This feature has been a game changer for our debugging and data auditing processes, significantly improving the efficiency of our QA operations. The ability to access and review historical data states has enabled us to quickly identify and resolve issues, ultimately enhancing our overall data integrity.
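
For illustration, a time-travel read in PySpark might look like the following; the path and instant are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-time-travel").getOrCreate()

# Read the table as it existed at a past instant (placeholder path and timestamp).
snapshot_df = (
    spark.read.format("hudi")
    .option("as.of.instant", "2024-01-15 09:30:00.000")
    .load("s3://example-bucket/master/payments/")
)
snapshot_df.createOrReplaceTempView("payments_as_of")  # query the historical state with SQL
```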

5. Real-life COW example

To better grasp how a COW (Copy on Write) table operates with inline clustering and cleaning, take a look at the example below. This example can also serve as a practical reference when implementing similar configurations in your own environment.
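
Because the exact job is specific to our environment, the following is an illustrative PySpark sketch rather than our production code; the table name, paths, fields, and thresholds are placeholders:

```python
# Illustrative COW write with inline clustering and cleaning enabled.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cow-inline-example").getOrCreate()
df = spark.read.parquet("s3://example-bucket/raw/transactions/")  # placeholder source

cow_options = {
    "hoodie.table.name": "transactions",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "transaction_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "created_date",
    # Inline clustering every few commits.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.sort.columns": "merchant_id",
    # Inline cleaning of outdated file versions.
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.commits.retained": "10",
}

(df.write.format("hudi")
   .options(**cow_options)
   .mode("append")
   .save("s3://example-bucket/master/transactions/"))
```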

Overcoming implementation challenges

As with any advanced technology, implementing Apache Hudi comes with its own set of challenges. In our case, the complexity of Spark and the abundance of new Hudi options posed some difficulties. To address these, we developed templates for most of our use cases and incorporated DBT (Data Build Tool) into our workflow.

This abstraction allowed us to fully leverage Spark SQL and its high-performance built-in functions without getting overwhelmed by the complexities of Spark. By creating reusable templates, we significantly reduced development time and maintained consistency across our data processing pipelines.

Additionally, we structured our data lake following the medallion architecture, which typically includes:

The image illustrates Yuno's data lake medallion architecture at the 'Landing' stage.

Landing

We store the raw data without any transformations, keeping the original file format where applicable.

The image illustrates Yuno's data lake medallion architecture at the 'Raw' stage.

Raw

We convert the data to Parquet format for consumption, but we don’t perform any other type of data transformation.

The image illustrates Yuno's data lake medallion architecture at the 'Master' stage.

Master

This layer contains our Hudi tables; the source for a new model can be either a raw table or another master Hudi table.

This structure, along with DBT, ensures efficient data processing and supports our growing workloads.
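
As a hypothetical sketch of how data moves through these layers (bucket names, formats, and the modeling step are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("medallion-flow").getOrCreate()

# Landing -> Raw: convert the raw files to Parquet, with no other transformation.
landing_df = spark.read.json("s3://example-bucket/landing/payments/")
landing_df.write.mode("append").parquet("s3://example-bucket/raw/payments/")

# Raw -> Master: model the data and upsert it into a Hudi table.
raw_df = spark.read.parquet("s3://example-bucket/raw/payments/")
modeled_df = raw_df.dropDuplicates(["payment_id"])  # placeholder modeling step
(modeled_df.write.format("hudi")
    .option("hoodie.table.name", "payments")
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
    .option("hoodie.datasource.write.recordkey.field", "payment_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://example-bucket/master/payments/"))
```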

How we manage our resources

Managing resources efficiently is key to ensuring scalability as your data operations grow. To facilitate this, we created custom profiles in our DBT repository to allocate resources based on workload size and complexity.

To effectively manage our resources, we created different profiles—XS, S, M, L, and XL—within the profiles.yml file in our DBT repository. These profiles allow us to allocate resources appropriately based on the size and complexity of the workload, ensuring efficient and scalable performance across various use cases.

Integration with AWS Glue and creating templates to abstract Spark and Hudi complexity

After addressing initial challenges, we sought further simplifications. The integration of the dbt-glue connector allowed us to abstract the complexities of Spark and Hudi, giving us more control and efficiency in our operations.

The dbt-glue connector eliminated the need to manage a Spark cluster, allowing us to run everything seamlessly on AWS Glue. To further streamline our operations, we adapted our DBT repository with a fork of the dbt-glue connector to better address our Hudi requirements. By creating predefined templates, we simplified the implementation process and tailored our setup to work with Hudi version 0.14.1.

Throughout this process, we identified key options like Record Level Index (RLI), Glue Data Catalog sync, and minimum and maximum file sizes that are commonly used in our workflows. These options can be embedded into your code to customize your own templates and optimize your operations.
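
As an example of how such options might be grouped inside a template (Hudi 0.14.1 option names; the database and table names are placeholders, and on AWS Glue the Hive-style sync targets the Glue Data Catalog because it acts as the Hive metastore):

```python
# Sketch of template-level options: RLI, catalog sync, and file-sizing bounds.
template_options = {
    # Record Level Index.
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.record.index.enable": "true",
    "hoodie.index.type": "RECORD_INDEX",
    # Catalog sync (Glue Data Catalog via the Hive metastore interface).
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "analytics",  # placeholder
    "hoodie.datasource.hive_sync.table": "payments",      # placeholder
    # Minimum and maximum file sizes.
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
    "hoodie.parquet.max.file.size": str(512 * 1024 * 1024),
}
```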

By using the AWS Glue connector, we’ve significantly reduced the complexity of managing Spark and Hudi while maintaining high performance and flexibility.

Orchestration and outcomes of implementing Hudi

To ensure smooth and automated workflows, we used Airflow for orchestration, which simplified job scheduling and monitoring. This, combined with our AWS Glue setup, helped us achieve significant performance improvements and cost reductions.

To manage our data workflows efficiently, we used Airflow for orchestration, ensuring smooth operations without unnecessary complications. By leveraging Airflow, we were able to effectively schedule, monitor, and manage our ETL jobs with ease.

Paired with AWS Glue, we gained access to a scalable, serverless environment that eliminated the need for manual infrastructure management, allowing us to focus on our data processing tasks. This combination of tools helped streamline our operations and optimize performance.
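
To give an idea of what that orchestration can look like, here is a hypothetical Airflow DAG that triggers an existing Glue job; the DAG id, job name, schedule, and arguments are placeholders, not our actual pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="hudi_master_payments",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # placeholder schedule
    catchup=False,
) as dag:
    build_master_table = GlueJobOperator(
        task_id="build_master_payments",
        job_name="master_payments_hudi",          # existing Glue job (placeholder)
        script_args={"--target_table": "payments"},
        region_name="us-east-1",
        wait_for_completion=True,                  # block until the Glue run finishes
    )
```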

The results of implementing Apache Hudi were impressive. By incorporating Hudi’s advanced features, we saw a nearly 70% reduction in costs across our most resource-demanding processes. Hudi’s capabilities, such as time travel, indexing, and file management, drastically improved our data management efficiency. This not only helped us cut costs but also enhanced the overall performance of our data pipeline, making it more scalable and cost-effective.

Future plans: migrating workloads to the data lake

With these optimizations in place, our focus is now on the future. We plan to migrate high-performance workloads to the data lake, a move designed to further reduce costs and improve scalability.

Looking ahead, we plan to migrate high-performance workloads from our Snowflake warehouse to the data lake. This strategic move aims to reduce costs further and enable Snowflake to read directly from the data lake for certain models, thereby optimizing our resources and enhancing efficiency. By moving these workloads, we expect to leverage the scalability of our data lake while maintaining the analytical capabilities of Snowflake.

Our ultimate vision is to evolve into a data lakehouse, integrating all data operations across the company. This unified platform will drive insights and innovation, fostering a data-driven culture throughout the organization. A data lakehouse combines the best features of data lakes and data warehouses, providing a single platform for both analytical and operational workloads. This integration will enable us to manage our data more effectively and derive actionable insights in real-time.

The best part is that we're already working on our next step: bringing real-time data into our data lake with HoodieStreamer, another great tool from Hudi. This advancement will enable us to harness real-time data insights, propelling our operations to new heights. Real-time data processing allows us to react to changes instantly, improving our decision-making processes and operational efficiency.

As we continue to innovate and optimize our data infrastructure, we remain committed to leveraging cutting-edge technologies to drive efficiency and foster growth. If you have any questions or wish to learn more about our experience, feel free to reach out.