Integration Patterns

Databricks

Learn best practices for integrating bem with Databricks. This guide outlines two common patterns and details how to use Auto Loader for efficient, incremental ingestion of your structured data into Delta Lake.

Updated 8/9/2025

For ingesting event data from bem into the Databricks Lakehouse Platform, the best architecture depends on your primary use case. We outline two common patterns below, with the Auto Loader approach being the definitive best practice for analytics.

Pattern 1: Operational-First via Production Database

This pattern is ideal when the primary, immediate need for the data is to power a live application, such as an internal dashboard or a user-facing feature.

Architecture: bem Event Subscription -> API Gateway -> Lambda -> Production DB (e.g., Postgres) -> CDC/ETL -> Databricks

How it Works:

  1. The bem webhook sends the event payload to your API Gateway, which triggers a Lambda function.
  2. The Lambda function writes the structured data directly to your production database (e.g., PostgreSQL, MySQL). This makes the data immediately available to your application for real-time workflows (see the sketch after this list).
  3. A separate process, either a periodic ETL job or a Change Data Capture (CDC) stream, then moves the data from your production database into your Databricks environment (landing in cloud storage for ingestion) for large-scale analytics and model training.
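
A minimal sketch of the write in step 2, assuming a Python Lambda behind an API Gateway proxy integration, the psycopg2 driver packaged as a Lambda layer, a DATABASE_URL environment variable, and a hypothetical bem_transformations table with a unique event_id column (the payload fields shown are illustrative, not bem's exact schema):

```python
import json
import os

import psycopg2  # assumes psycopg2 is available via a Lambda layer


def lambda_handler(event, context):
    # API Gateway (proxy integration) delivers the webhook body as a JSON string.
    payload = json.loads(event["body"])

    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn, conn.cursor() as cur:
            # Hypothetical table and columns; adapt them to the fields your pipeline emits.
            # ON CONFLICT assumes a unique constraint on event_id, making retries idempotent.
            cur.execute(
                """
                INSERT INTO bem_transformations (event_id, received_at, payload)
                VALUES (%s, now(), %s)
                ON CONFLICT (event_id) DO NOTHING
                """,
                (payload.get("id"), json.dumps(payload)),
            )
    finally:
        conn.close()

    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```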

When to use this pattern:

  • When you need to display results to an operator in an internal tool the moment they are processed.
  • When the data triggers an immediate transactional workflow in your application's backend.

Pattern 2: Analytics-First via Cloud Storage (Recommended)

This event-driven architecture is the most scalable, reliable, and cost-effective method for getting webhook data into Databricks for large-scale analytics and AI.

Architecture: bem Event Subscription -> AWS API Gateway -> AWS Lambda -> S3 Bucket -> Databricks Auto Loader -> Delta Lake Table
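
A minimal sketch of the landing step (API Gateway -> Lambda -> S3), assuming a Python Lambda with an API Gateway proxy integration and an illustrative bucket name supplied via a LANDING_BUCKET environment variable; the raw webhook body is written unchanged so no schema assumptions are baked into the landing zone:

```python
import json
import os
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # API Gateway (proxy integration) passes the raw webhook body through unchanged.
    body = event["body"]

    # Date-partitioned keys keep the bucket browsable and file discovery cheap.
    now = datetime.now(timezone.utc)
    key = f"raw/{now:%Y/%m/%d}/{now:%H%M%S}-{uuid.uuid4()}.json"

    s3.put_object(
        Bucket=os.environ.get("LANDING_BUCKET", "bem-events-landing"),  # hypothetical bucket name
        Key=key,
        Body=body.encode("utf-8"),
        ContentType="application/json",
    )
    return {"statusCode": 200, "body": json.dumps({"ok": True, "key": key})}
```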

Why this pattern is recommended:

  • Scalability & Decoupling: S3 acts as a durable, highly-available buffer. If your Databricks cluster is down or jobs are paused, event data from bem safely accumulates in S3. Auto Loader will automatically process the backlog once the cluster is active again, ensuring zero data loss.
  • Efficiency & Cost-Effectiveness: Auto Loader is highly optimized to discover new files in cloud storage, often using cloud notification services (like SQS) to avoid expensive directory listing operations. This scales more efficiently and at a lower cost than other methods.
  • Simplicity and Power: Auto Loader handles complex, real-world data problems out of the box. It automatically infers your data schema, gracefully handles schema evolution over time, and rescues data that might otherwise be lost due to malformed records.
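
On the Databricks side, an Auto Loader stream picks up whatever lands in the bucket. A minimal sketch, assuming illustrative S3 paths and a target table named main.bem.raw_events, to be run as a notebook or job where spark is already defined:

```python
# Incrementally ingest new JSON files from the S3 landing zone into a Delta table.
# Paths and the table name below are illustrative; replace them with your own.
landing_path = "s3://bem-events-landing/raw/"
schema_path = "s3://bem-events-landing/_schemas/bem_events"
checkpoint_path = "s3://bem-events-landing/_checkpoints/bem_events"

raw_events = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Auto Loader infers the schema, tracks it here, and evolves it as new fields appear.
    .option("cloudFiles.schemaLocation", schema_path)
    .load(landing_path)
)

(
    raw_events.writeStream
    .option("checkpointLocation", checkpoint_path)
    # availableNow processes any backlog and stops; remove it for continuous streaming.
    .trigger(availableNow=True)
    .toTable("main.bem.raw_events")
)
```

For high-volume buckets, you can also set the cloudFiles.useNotifications option to true so that Auto Loader discovers new files via queue notifications rather than directory listings, which is the notification-based behavior described above.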

The Best of Both Worlds: A Hybrid Approach

For maximum flexibility, the Lambda function in your integration can perform two actions in parallel:

  1. Write to your Production Database for immediate operational use.
  2. Drop the raw JSON event into S3 for reliable, decoupled ingestion into Databricks via Auto Loader.

This hybrid pattern serves both your real-time application needs and your long-term analytics requirements without compromise.
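
A minimal sketch of such a dual-write handler, combining the two sketches above under the same assumptions (psycopg2 via a Lambda layer, DATABASE_URL and LANDING_BUCKET environment variables, hypothetical table and field names). The writes are shown sequentially for simplicity; they could be parallelized or split across queues if either destination becomes a bottleneck:

```python
import json
import os
import uuid
from datetime import datetime, timezone

import boto3
import psycopg2  # assumes psycopg2 is available via a Lambda layer

s3 = boto3.client("s3")


def lambda_handler(event, context):
    body = event["body"]
    payload = json.loads(body)

    # 1. Operational write: make the structured result immediately available to the application.
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn, conn.cursor() as cur:
            # Hypothetical table and columns; assumes a unique constraint on event_id.
            cur.execute(
                "INSERT INTO bem_transformations (event_id, payload) VALUES (%s, %s) "
                "ON CONFLICT (event_id) DO NOTHING",
                (payload.get("id"), json.dumps(payload)),
            )
    finally:
        conn.close()

    # 2. Analytical write: land the raw event in S3 for Auto Loader to pick up later.
    now = datetime.now(timezone.utc)
    s3.put_object(
        Bucket=os.environ.get("LANDING_BUCKET", "bem-events-landing"),  # hypothetical bucket name
        Key=f"raw/{now:%Y/%m/%d}/{uuid.uuid4()}.json",
        Body=body.encode("utf-8"),
        ContentType="application/json",
    )

    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```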