Notion Builds In-House Data Lake to Power Analytics and Machine Learning

Facing massive data growth, productivity app Notion has ditched its traditional data pipeline for a custom-built data lake. This new architecture allows Notion to process and analyze its user data faster and at a lower cost, enabling the addition of new features like AI-powered functionality.

Notion's user base has grown exponentially, leading to a doubling of user data every 6-12 months. This surge placed a strain on their existing data pipeline, which relied on a combination of Postgres, Fivetran, and Snowflake. The key challenges included:

  • Maintaining Hundreds of Connectors: Managing 480 Fivetran connectors, one per Postgres shard, proved burdensome for Notion's engineers.
  • Data Freshness: The existing setup made data ingestion into Snowflake slow and expensive, because Snowflake is optimized for insert-heavy workloads rather than Notion's update-heavy one.
  • Limited Transformation Capabilities: Complex data transformations were difficult to implement through the standard SQL interface offered by off-the-shelf data warehouses.
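To see why an update-heavy workload strains an insert-optimized pipeline, consider a minimal sketch (the field names are hypothetical, not Notion's actual schema): repeated edits to the same record produce many change events, but only the latest state per key matters, so a sink that appends every event stores and scans far more data than one that compacts events into upserts.

```python
# Simplified sketch of update-heavy change-data-capture (CDC) traffic.
# Field names ("block_id", "text") are hypothetical, not Notion's schema.

def compact_events(events):
    """Collapse a stream of CDC events to the latest state per key,
    the way an upsert-based sink would, instead of appending every row."""
    latest = {}
    for event in events:
        latest[event["block_id"]] = event  # later events overwrite earlier ones
    return latest

# Ten edits to the same block: an append-only sink stores 10 rows,
# an upsert-based sink stores 1.
events = [{"block_id": "b1", "text": f"edit {i}"} for i in range(10)]
compacted = compact_events(events)
print(len(events), len(compacted))  # 10 rows appended vs 1 row upserted
```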

To address these issues, Notion built its own data lake leveraging open-source technologies like Debezium, Apache Kafka, Apache Spark, Apache Hudi, and AWS S3.

Here's a breakdown of the new architecture:

  • Debezium: This tool captures data changes in real time from Notion's Postgres databases and streams them to Apache Kafka.
  • Apache Spark: Spark processes and transforms the change data consumed from Kafka.
  • Apache Hudi: This framework enables efficient updates to the data lake stored in AWS S3, ensuring data consistency and allowing rollbacks if necessary.
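The table semantics that Hudi contributes to the pipeline above can be modeled in miniature. The following is a hypothetical Python sketch of what keyed upserts plus a commit timeline with rollback look like; the real system runs Spark jobs writing Hudi tables on S3, not this class.

```python
# Minimal, illustrative model of Hudi-style table semantics: keyed upserts
# recorded as commits, with rollback to the previous state. The actual
# pipeline is Debezium -> Kafka -> Spark -> Hudi on S3.
import copy

class MiniTable:
    def __init__(self):
        self.rows = {}     # record key -> current row
        self.commits = []  # snapshots forming a commit timeline

    def upsert(self, batch):
        """Apply a batch of change records as one atomic commit."""
        self.commits.append(copy.deepcopy(self.rows))
        for row in batch:
            if row.get("deleted"):
                self.rows.pop(row["id"], None)  # delete event
            else:
                self.rows[row["id"]] = row      # insert or update in place

    def rollback(self):
        """Undo the most recent commit."""
        self.rows = self.commits.pop()

table = MiniTable()
table.upsert([{"id": "b1", "text": "hello"}])
table.upsert([{"id": "b1", "text": "hello, world"}])  # an update, not an append
table.rollback()                                      # undo the last commit
print(table.rows["b1"]["text"])  # "hello"
```

Updates land in place rather than piling up as duplicate rows, and a bad ingestion batch can be rolled back, which is the consistency guarantee the article attributes to Hudi.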

This new in-house data lake offers Notion significant advantages:

  • Faster Data Processing: Ingestion times have been reduced from over a day to mere minutes or hours.
  • Cost Savings: Notion has achieved millions of dollars in cost savings through this new approach.
  • Improved Scalability: The data lake architecture is designed to handle Notion's ever-growing data volume.
  • Foundation for AI Features: The faster data processing allows Notion to integrate new features that rely on machine learning.

Notion still utilizes Snowflake and Fivetran for specific use cases where they are well-suited. However, the custom data lake provides Notion with the flexibility and scalability required to handle its massive and ever-growing data landscape.