Understanding COW and MOR in Apache Hudi: Choosing the Right Storage Strategy

lency @lency2 · Nov 15, 2024

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a powerful framework designed for managing large datasets on cloud storage systems, enabling efficient data ingestion, storage, and retrieval. One of the key features of Hudi is its support for two distinct storage types: Copy-On-Write (COW) and Merge-On-Read (MOR). Each of these storage strategies has unique characteristics and serves different use cases. In this blog, we will explore COW and MOR.

Prerequisites

Before you begin, ensure you have the following installed on your local machine:

  • Docker
  • Docker Compose

Local Setup

To set up Apache Hudi locally, follow these steps:

  1. Clone the Repository:

git clone https://github.com/dnisha/hudi-on-localhost.git
cd hudi-on-localhost

  2. Start Docker Compose:

docker-compose up -d

  3. Access the Notebooks:

Username: minioadmin
Password: minioadmin

What is Copy-On-Write (COW)?

Copy-On-Write (COW) is a storage type in Apache Hudi in which every write produces new versions of the affected data files. When data is updated or inserted:

  • Hudi rewrites each data file that contains an affected record, creating a new version of the entire file.
  • The existing data file remains unchanged until the new file is successfully written.

This makes the operation atomic: it either succeeds completely or fails without leaving partial updates visible to readers. The trade-off is write cost, because even a single-record update rewrites a whole file.
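The copy-on-write behaviour described above can be modelled in a few lines of plain Python. This is a toy sketch, not Hudi code: the names `cow_upsert`, `FileVersion`, and the sample records are hypothetical, and a "file" is just an in-memory dictionary.

```python
from typing import Dict, List

# Toy model (assumption): one file group is a list of immutable file
# versions, and each version maps record key -> record.
FileVersion = Dict[str, dict]


def cow_upsert(versions: List[FileVersion], changes: Dict[str, dict]) -> List[FileVersion]:
    """Return a new version list with one extra, fully rewritten file version."""
    latest = versions[-1] if versions else {}
    # Copy every record of the latest version, then apply the changes on top.
    rewritten = {**latest, **changes}
    # Older versions are never modified; the new version only becomes
    # visible once it has been written in full (here: once it is appended).
    return versions + [rewritten]


# Hypothetical sample data: two records in a single file.
v0 = [{"t1": {"amount": 10}, "t2": {"amount": 20}}]
v1 = cow_upsert(v0, {"t2": {"amount": 25}})

print(len(v1))        # number of file versions after one update
print(v1[0]["t2"])    # the old version still holds the old value
print(v1[1]["t2"])    # the new version holds the updated value
```

Note that the unchanged record `t1` is copied into the new version as well, which is why small updates on large files are comparatively expensive with COW.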

Steps to Evaluate COW

1. Open the Notebook:

  • In your browser, navigate to hudi_cow_evaluation.ipynb.

2. Run Configuration Code:

  • Execute all configuration-related code in the notebook.
  • Ensure you specify the COPY_ON_WRITE table type, as shown in the provided image.
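The notebook's actual configuration cell is not reproduced here, so the following is only a sketch of what such a configuration commonly looks like. The option keys are standard Hudi datasource write options; the helper `hudi_write_options` and the field names `transaction_id` and `ts` are assumptions made for illustration, while `transactions` and `document` are inferred from the table path used later in this post.

```python
def hudi_write_options(table_name, table_type, record_key, partition_field, precombine_field):
    """Build a dictionary of Hudi write options for a Spark DataFrame writer."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unsupported table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        # This is the setting that selects COW or MOR.
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


# Field names below are assumptions, not taken from the notebook.
cow_options = hudi_write_options(
    table_name="transactions",
    table_type="COPY_ON_WRITE",
    record_key="transaction_id",
    partition_field="document",
    precombine_field="ts",
)

for key, value in sorted(cow_options.items()):
    print(f"{key} = {value}")
```

In a Spark session with the Hudi bundle available, a dictionary like this is typically passed to the writer as `df.write.format("hudi").options(**cow_options).mode("append").save(base_path)`, where `df` and `base_path` are placeholders for your own DataFrame and table location.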

3. Updating a Record:

a. Update a record in the document=34 partition of the COW table.

b. Since you are using the COPY_ON_WRITE table type, a new Parquet file will be created for this update. You can find this file in the bucket located at warehouse/cow/transactions/document=34.
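When inspecting the partition folder, it can help to group the Parquet files by file ID so that the old and new versions of the same file sit side by side. The sketch below assumes Hudi's usual base file naming of `<fileId>_<writeToken>_<instantTime>.parquet`; the helper `group_base_files` and the file names in `listing` are made up for illustration and are not taken from the actual bucket.

```python
from collections import defaultdict


def group_base_files(file_names):
    """Group Parquet base files by file ID, with instant times in ascending order."""
    groups = defaultdict(list)
    for name in file_names:
        if not name.endswith(".parquet"):
            continue  # skip metadata files and anything that is not a base file
        stem = name[: -len(".parquet")]
        # Assumed layout: <fileId>_<writeToken>_<instantTime>
        file_id, _write_token, instant_time = stem.rsplit("_", 2)
        groups[file_id].append(instant_time)
    return {file_id: sorted(times) for file_id, times in groups.items()}


# Hypothetical listing of a partition folder after one update.
listing = [
    "abc123-0_0-10-11_20241115100000000.parquet",
    "abc123-0_0-25-27_20241115101500000.parquet",
    ".hoodie_partition_metadata",
]

for file_id, instants in group_base_files(listing).items():
    print(file_id, "has", len(instants), "version(s):", instants)
```

Two entries under the same file ID with different instant times are what you would expect to see after a COW update: the earlier file is the previous version and the later one is the rewritten copy.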

What is Merge-On-Read (MOR)?

Merge-On-Read (MOR) is an alternative storage type in Apache Hudi that employs a different approach to data management. Here’s how it works:

  • Base Parquet Files and Log Files: In MOR, Hudi maintains a combination of base Parquet files alongside log files that capture incremental changes.
  • On-the-Fly Merging: When a snapshot read is executed, Hudi merges the base files and log files at query time, providing the most up-to-date view of the data.

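This approach allows for efficient handling of updates and inserts and enables faster write operations, as the system does not need to rewrite entire files for every change. The cost moves to the read side: queries that need the latest data must merge log files with base files until compaction folds the logs into new base files.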
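The base-file-plus-log-file idea can also be modelled as a toy in plain Python. As before, this is a sketch rather than Hudi code: `mor_upsert`, `snapshot_read`, `compact`, and the sample records are hypothetical names, and files are in-memory dictionaries.

```python
def mor_upsert(file_slice, changes):
    """Append the changes as a new log block; the base file is not rewritten."""
    return {"base": file_slice["base"], "log": file_slice["log"] + [changes]}


def snapshot_read(file_slice):
    """Merge base and log blocks at read time to get the latest view."""
    merged = dict(file_slice["base"])
    for block in file_slice["log"]:  # later blocks override earlier ones
        merged.update(block)
    return merged


def compact(file_slice):
    """Fold all log blocks into a new base file and start with an empty log."""
    return {"base": snapshot_read(file_slice), "log": []}


# Hypothetical sample data: one base file with two records, no log yet.
slice0 = {"base": {"t1": {"amount": 10}, "t2": {"amount": 20}}, "log": []}
slice1 = mor_upsert(slice0, {"t2": {"amount": 25}})

print(slice1["base"]["t2"])          # base file still holds the old value
print(snapshot_read(slice1)["t2"])   # merged view shows the update
print(compact(slice1))               # after compaction the log is empty
```

The write path only appends a small log block, while the merge work happens in `snapshot_read`, which mirrors where MOR spends its effort compared with COW.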
