Data Lake Integration Guide
This guide will help you integrate your Mambu data lake with your AWS environment. For more information on the Mambu data lake, see the Data Lake Overview.
Prerequisites for your AWS Account
Before you can integrate with your Mambu data lake, ensure your AWS environment meets the following general prerequisites:
- An active AWS account: You must have an active AWS account to establish cross-account access.
- Basic AWS IAM understanding: Familiarity with AWS Identity and Access Management (IAM) concepts such as IAM Users, IAM Roles, and Policies is beneficial for configuring access.
- Administrative access (for setup): Initial setup steps will require an IAM User or Role with sufficient administrative privileges within your own AWS account to create or modify IAM roles and policies.
- S3 Block Public Access: Ensure that the S3 Block Public Access settings in your AWS account (if they are enabled at the account level) do not inadvertently prevent the necessary cross-account S3 access.
The specific IAM permissions required for each integration option will be detailed in the respective sections of this guide.
Integration options
This guide presents two options for integrating with your Mambu data lake:
Integration option 1: Direct S3 access (Recommended for basic/development access)
This option provides direct access to the underlying S3 bucket that houses your data. This method is straightforward to set up and is ideal for initial testing, direct data loading operations, or for tools that prefer accessing S3 paths directly.
Required IAM Permissions
To enable direct S3 access from your AWS account to our Producer Data Lake bucket (s3://$bucket_that_we_will_share_with_you/), an IAM policy must be configured in both accounts.
You will need an IAM role in your account that is permitted to make S3 calls to our data lake bucket. Throughout this guide we use $YourDataEngineeringRole as an example name for this role (you can name it whatever you like).
- Log into your AWS account (Consumer Account).
- Navigate to the IAM console.
- Go to Roles and select your primary execution role (e.g., $YourDataEngineeringRole).
- Go to the Permissions tab.
- Click Add permissions > Create inline policy.
If your role is SSO-managed, you might not be able to attach inline policies directly. In that case, these permissions must be granted via an AWS SSO Permission Set applied to the user or group that assumes this role; consult your AWS administrator. For direct local testing, if you cannot modify the SSO role, create a separate, modifiable IAM role in your account with these permissions and ensure your SSO role has sts:AssumeRole on it.
- Click the JSON tab.
- Paste the following JSON:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:ListMultipartUploadParts",
"s3:GetObjectVersion",
"s3:GetBucketAcl",
"s3:GetObjectTagging",
"s3:GetObjectVersionTagging"
],
"Resource": [
"arn:aws:s3:::$bucket_that_we_will_share_with_you",
"arn:aws:s3:::$bucket_that_we_will_share_with_you/*"
]
}
]
}
- Click Review policy and enter a policy name (e.g., ReadDataLakeProducerBucket).
- Click Create policy.
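If you manage IAM with scripts rather than the console, the same inline policy can be generated and attached programmatically. The sketch below only builds the policy document; the boto3 `put_role_policy` call is commented out so it runs without AWS credentials, and the bucket and role names are placeholders from this guide.

```python
import json

READ_ACTIONS = [
    "s3:GetObject",
    "s3:ListBucket",
    "s3:GetBucketLocation",
    "s3:ListMultipartUploadParts",
    "s3:GetObjectVersion",
    "s3:GetBucketAcl",
    "s3:GetObjectTagging",
    "s3:GetObjectVersionTagging",
]

def build_read_policy(bucket: str) -> dict:
    """Build the read-only inline policy for the shared data lake bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": READ_ACTIONS,
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # bucket-level actions (ListBucket, ...)
                    f"arn:aws:s3:::{bucket}/*",  # object-level actions (GetObject, ...)
                ],
            }
        ],
    }

policy = build_read_policy("bucket_that_we_will_share_with_you")
print(json.dumps(policy, indent=2))

# To attach it (requires boto3 and credentials):
# import boto3
# boto3.client("iam").put_role_policy(
#     RoleName="YourDataEngineeringRole",
#     PolicyName="ReadDataLakeProducerBucket",
#     PolicyDocument=json.dumps(policy),
# )
```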
This will create your initial direct S3 access integration. Below is a sample PySpark script to test this direct access using Apache Spark and Python:
from pyspark.sql import SparkSession
DELTA_TABLE_S3_PATH = "s3a://$bucket_that_we_will_share_with_you/Gold/LoanProduct"
print("Initializing SparkSession with Delta Lake support...")
spark = (
SparkSession.builder.appName("MambuDataLake Demo")
.config(
"spark.jars.packages",
"io.delta:delta-spark_2.12:3.2.1,org.apache.hadoop:hadoop-aws:3.3.2",
)
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config(
"spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog",
)
.getOrCreate()
)
# --- Load and Process Delta Table ---
print(f"Attempting to load Delta table from: {DELTA_TABLE_S3_PATH}")
df = spark.read.format("delta").load(DELTA_TABLE_S3_PATH)
print(f"Record count: {df.count()}")
print("Schema:")
df.printSchema()
print("First 5 rows:")
df.show(5)
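The script above relies on the default S3A credential chain (environment variables, shared credentials file, or instance profile). If your environment does not resolve credentials automatically, you can pass the relevant S3A options to Spark explicitly. The keys below are standard hadoop-aws (S3A) settings; the region endpoint is an assumption based on the examples elsewhere in this guide.

```python
def s3a_credential_conf(
    provider: str = "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
) -> dict:
    """Spark config entries that control how S3A resolves AWS credentials."""
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider": provider,
        # Usually only needed when the bucket lives in a specific region:
        "spark.hadoop.fs.s3a.endpoint": "s3.ap-southeast-1.amazonaws.com",
    }

# Apply while building the session, e.g.:
# builder = SparkSession.builder.appName("MambuDataLake Demo")
# for key, value in s3a_credential_conf().items():
#     builder = builder.config(key, value)

for key, value in s3a_credential_conf().items():
    print(f"{key}={value}")
```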
What we will need from you
To enable your access, we will need the following information from your AWS environment. This will allow us to grant the necessary permissions in our Producer Data Lake bucket policy:
- Your AWS Account ID: This is your 12-digit AWS account identifier. (e.g., 123456789012).
- The ARN of the IAM Role or User in your AWS consumer account that you wish to grant access to: This role or user will be used by your applications (e.g., PySpark scripts, Athena, Glue) to access the data lake. (e.g., arn:aws:iam::123456789012:role/YourDataEngineeringRole).
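If you are unsure of these values, the sketch below assembles the role ARN from an account ID and role name (both example values from this guide), validating the 12-digit format along the way. The commented-out boto3 STS call returns your own account ID directly, assuming boto3 is configured with your credentials.

```python
import re

def role_arn(account_id: str, role_name: str) -> str:
    """Assemble the IAM role ARN to share with the Mambu Insights team."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are exactly 12 digits")
    return f"arn:aws:iam::{account_id}:role/{role_name}"

print(role_arn("123456789012", "YourDataEngineeringRole"))
# -> arn:aws:iam::123456789012:role/YourDataEngineeringRole

# To discover your own account ID (requires boto3 and credentials):
# import boto3
# print(boto3.client("sts").get_caller_identity()["Account"])
```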
What you will need from us (Mambu Insights team)
We will provide you with the specific details of your data lake bucket, which you will need to configure access from your side:
- Producer Data Lake S3 bucket name: This is the name of the S3 bucket in our AWS account where your data is stored. (e.g., $bucket_that_we_will_share_with_you).
- Example Gold Layer data path: A representative S3 path to a dataset in your Gold layer, which you can use for initial testing. (e.g., s3://$bucket_that_we_will_share_with_you/Gold/DepositProduct).
Integration option 2: Using the data lake with AWS Athena
AWS Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. It is serverless, meaning you don't need to manage any infrastructure, and you pay only for the queries you run. Athena integrates seamlessly with the AWS Glue Data Catalog, where your table schemas will be stored.
To start using the data lake with AWS Athena, first configure your AWS environment and IAM permissions as described in the following sections:
Producer account bucket policy configuration
This policy resides on our data lake bucket and explicitly grants your consumer account read access. The following steps are performed in the Producer account by the Mambu team and are included here for reference.
- Log into AWS Account 710271915571. This is the Producer account.
- Navigate to the S3 console.
- Go to Buckets and select $bucket_that_we_will_share_with_you.
- Go to the Permissions tab.
- Scroll down to Bucket policy and click Edit.
- Replace your entire existing policy content with the following JSON:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::710271915571:root"
},
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::$bucket_that_we_will_share_with_you",
"arn:aws:s3:::$bucket_that_we_will_share_with_you/*"
]
},
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::$your_account_number:root"
},
"Action": [
"s3:GetObject",
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:ListMultipartUploadParts",
"s3:GetObjectVersion",
"s3:GetBucketAcl",
"s3:GetObjectTagging",
"s3:GetObjectVersionTagging"
],
"Resource": [
"arn:aws:s3:::$bucket_that_we_will_share_with_you",
"arn:aws:s3:::$bucket_that_we_will_share_with_you/*"
]
}
]
}
This policy directly grants various read and metadata access permissions to any authenticated principal in your Consumer Account ($your_account_number). This is a broad policy for initial unblocking and testing. For production, you would typically scope this down to specific IAM roles or use Lake Formation.
- Click Save changes.
If you encounter errors saving related to Block Public Access, you may need to temporarily disable these settings at the bucket or account level:
- Go to the Block public access (bucket settings) section in the Permissions tab.
- Click Edit to uncheck Block all public access and save.
- Remember to re-enable the settings after testing for security.
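For reference, the two-statement producer bucket policy above can be expressed programmatically, which makes the difference between the owner statement (full control) and the consumer statement (read-only) easy to review. This is a minimal sketch; the bucket name and consumer account number are placeholders.

```python
import json

READ_ACTIONS = [
    "s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation",
    "s3:ListMultipartUploadParts", "s3:GetObjectVersion",
    "s3:GetBucketAcl", "s3:GetObjectTagging", "s3:GetObjectVersionTagging",
]

def producer_bucket_policy(bucket: str, producer_account: str,
                           consumer_account: str) -> dict:
    """Owner keeps full control; the consumer account gets read-only access."""
    resources = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Producer account root: full control over its own bucket
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{producer_account}:root"},
                "Action": "s3:*",
                "Resource": resources,
            },
            {   # Consumer account root: any authenticated principal may read
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{consumer_account}:root"},
                "Action": READ_ACTIONS,
                "Resource": resources,
            },
        ],
    }

print(json.dumps(producer_bucket_policy(
    "bucket_that_we_will_share_with_you", "710271915571", "123456789012"),
    indent=2))
```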
Consumer Account ($your_account_number) - IAM Role Permissions
You will need an IAM role in your consumer account that has permission to make S3 calls to our data lake bucket. This is typically the role you obtain via AWS SSO (for example, an AWSReservedSSO_* role such as AWSReservedSSO_data-insights-dev-pu_314b1b9d83b21963).
- Log into AWS Account $your_account_number (Consumer Account).
- Navigate to the IAM console.
- Go to Roles and select your primary execution role (e.g., AWSReservedSSO_data-insights-dev-pu_314b1b9d83b21963).
- Go to the Permissions tab.
- Click Add permissions > Create inline policy.
If your role is SSO-managed, you might not be able to attach inline policies directly. In that case, these permissions must be granted via an AWS SSO Permission Set applied to the user or group that assumes this role; consult your AWS administrator. For direct local testing, if you cannot modify the SSO role, create a separate, modifiable IAM role in your account ($your_account_number) with these permissions and ensure your SSO role has sts:AssumeRole on it.
- Click the JSON tab.
- Paste the following JSON:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:ListMultipartUploadParts",
"s3:GetObjectVersion",
"s3:GetBucketAcl",
"s3:GetObjectTagging",
"s3:GetObjectVersionTagging"
],
"Resource": [
"arn:aws:s3:::$bucket_that_we_will_share_with_you",
"arn:aws:s3:::$bucket_that_we_will_share_with_you/*"
]
}
]
}
- Click Review policy.
- Policy name: ReadDataLakeProducerBucket (or a descriptive name).
- Click Create policy.
Using the data lake with AWS Athena
3.3.1 Required IAM Permissions (Your AWS Account)
Your Athena execution role (which is implicitly assumed by Athena when you run queries) needs permissions to:
- Run Athena queries.
- Access the AWS Glue Data Catalog to retrieve table metadata.
- Read data from our S3 Data Lake bucket.
- Write query results to an S3 bucket in your account.
You will typically use an existing IAM role that has the AmazonAthenaFullAccess managed policy attached, and then add custom permissions to access our data lake.
- Log into AWS Account $your_account_number (Consumer Account).
- Navigate to the IAM console.
- Go to "Roles" and select the IAM role you use for Athena queries (e.g., the same $YourDataEngineeringRole role as above, or a dedicated Athena execution role).
- Go to the "Permissions" tab.
- Ensure the AmazonAthenaFullAccess managed policy is attached.
- Click "Add permissions" > "Create inline policy".
- Click the "JSON" tab.
- Paste the following JSON:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetObject",
"s3:GetBucketLocation",
"s3:ListMultipartUploadParts",
"s3:GetObjectVersion",
"s3:GetBucketAcl",
"s3:GetObjectTagging",
"s3:GetObjectVersionTagging"
],
"Resource": [
"arn:aws:s3:::$bucket_that_we_will_share_with_you",
"arn:aws:s3:::$bucket_that_we_will_share_with_you/*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetTable",
"glue:GetTables",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"arn:aws:glue:ap-southeast-1:710271915571:catalog",
"arn:aws:glue:ap-southeast-1:710271915571:database/$databasename",
"arn:aws:glue:ap-southeast-1:710271915571:table/$databasename/deposit_product",
"arn:aws:glue:ap-southeast-1:710271915571:table/$databasename/loan_product"
]
}
]
}
- Click "Review policy".
- Policy name: AthenaCrossAccountDataLakeAccess.
- Click "Create policy".
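The Glue resources in the policy above follow a fixed ARN layout: the catalog, the database, and one entry per table. A small sketch that assembles them, using the producer account and region from the policy (710271915571, ap-southeast-1) and the $databasename placeholder:

```python
def glue_resource_arns(region: str, account: str, database: str,
                       tables: list) -> list:
    """Build the Glue catalog/database/table ARNs needed by the Athena policy."""
    prefix = f"arn:aws:glue:{region}:{account}"
    arns = [f"{prefix}:catalog", f"{prefix}:database/{database}"]
    arns += [f"{prefix}:table/{database}/{t}" for t in tables]
    return arns

for arn in glue_resource_arns("ap-southeast-1", "710271915571",
                              "databasename",
                              ["deposit_product", "loan_product"]):
    print(arn)
```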
3.3.2 Glue Data Catalog Setup (Your AWS Account)
Athena queries data based on schemas defined in the AWS Glue Data Catalog. For Direct S3 Access, you will need to create or manage Glue tables in your Consumer Account ($your_account_number) that point directly to the S3 data in our Producer Account.
- Log into AWS Account $your_account_number (Consumer Account).
- Navigate to the AWS Glue console.
Create a Database:
- Go to "Databases" in the left navigation pane.
- Click "Add database".
- Database name: mambu_gold_data (or a name of your choice for your data catalog).
- Click "Create".
Create a Crawler to Discover Data & Schema:
- Go to "Crawlers" in the left navigation pane.
- Click "Create crawler".
- Crawler name: mambu-gold-data-crawler
- Click "Next".
- Data sources:
- Choose "Add data source".
- Data source type: S3.
- S3 path: s3://$bucket_that_we_will_share_with_you/Gold/ (the base path to your Gold layer in our Producer bucket).
- Click "Add S3 data source".
- Click "Next".
- IAM role: Click "Create an IAM role".
- IAM role name: AWSGlueCrawlerRole-MambuGold (or a descriptive name).
- Click "Create". (This role will automatically get permissions to access the S3 path you specified, but only within its own account by default.)
- CRITICAL: You need to manually edit this newly created IAM role to add cross-account S3 permissions to our Producer bucket:
  - Go to the IAM console, "Roles", and find AWSGlueCrawlerRole-MambuGold.
  - Attach a new inline policy (or managed policy) with the same S3 permissions JSON you attached to your Athena execution role in Section 3.3.1.
- Click "Next".
- Output configuration:
  - Database: Select mambu_gold_data.
  - Crawler schedule: Choose "Run on demand" for initial testing.
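If you prefer to script the crawler setup, the console steps above map onto a single Glue API call. The sketch below only assembles the request parameters (names are the examples used above; the role ARN is a placeholder); the boto3 call itself is commented out so the snippet runs offline.

```python
def crawler_request(bucket: str, role_arn: str) -> dict:
    """Parameters for glue.create_crawler mirroring the console steps above."""
    return {
        "Name": "mambu-gold-data-crawler",
        "Role": role_arn,
        "DatabaseName": "mambu_gold_data",
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/Gold/"}]},
        # "Run on demand" corresponds to simply omitting a Schedule.
    }

req = crawler_request("bucket_that_we_will_share_with_you",
                      "arn:aws:iam::123456789012:role/AWSGlueCrawlerRole-MambuGold")
print(req["Targets"]["S3Targets"][0]["Path"])

# To create the crawler (requires boto3 and credentials):
# import boto3
# boto3.client("glue").create_crawler(**req)
```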
Using the data lake with Java (Delta Kernel)
The data lake can also be read from JVM applications using the Delta Kernel library.
Maven/Gradle Dependencies
<dependencies>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-kernel-defaults</artifactId>
<version>0.3.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>3.3.4</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.4</version>
</dependency>
</dependencies>
Sample Code (MambuKernelReader.java)
package com.mambu;
import java.io.IOException;
import java.util.List;
import java.util.Optional;
import org.apache.hadoop.conf.Configuration;
import io.delta.kernel.Scan;
import io.delta.kernel.Snapshot;
import io.delta.kernel.Table;
import io.delta.kernel.data.ColumnarBatch;
import io.delta.kernel.data.FilteredColumnarBatch;
import io.delta.kernel.data.Row;
import io.delta.kernel.defaults.engine.DefaultEngine;
import io.delta.kernel.engine.Engine;
import io.delta.kernel.types.StructField;
import io.delta.kernel.types.StructType;
import io.delta.kernel.utils.CloseableIterator;
/**
* An example showing how to use the Delta Kernel to read a Mambu Data Lake table.
* This code connects to a table, reads its latest version, schema, and then
* iterates through the data.
*/
public class MambuKernelReader {
public static void main(String[] args) throws IOException {
// 1. Configure S3 access.
// This uses AWS environment variables for credentials by default.
Configuration hadoopConf = new Configuration();
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
hadoopConf.set("fs.s3a.endpoint", "s3.ap-southeast-1.amazonaws.com");
hadoopConf.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider");
// The table path in your S3 bucket
String tablePath = "s3a://$your_data_lake_s3_bucket/Silver/loanproduct/";
System.out.println("🚀 Reading Mambu Table with Delta Kernel: " + tablePath);
System.out.println("=".repeat(60));
// 2. Create a Delta Kernel Engine and Table object.
Engine engine = DefaultEngine.create(hadoopConf);
Table table = Table.forPath(engine, tablePath);
// 3. Get the latest snapshot of the table to read its state.
Snapshot snapshot = table.getLatestSnapshot(engine);
StructType schema = snapshot.getSchema(engine);
System.out.println("📊 Table Schema (first 10 fields):");
List<StructField> fields = schema.fields();
for (int i = 0; i < Math.min(10, fields.size()); i++) {
StructField field = fields.get(i);
System.out.printf(" - %s (%s)\n", field.getName(), field.getDataType());
}
if (fields.size() > 10) {
System.out.println(" ... and " + (fields.size() - 10) + " more fields.");
}
// 4. Build a scan to read the table data.
// This creates a plan to read all data from the latest snapshot.
Scan scan = snapshot.getScanBuilder(engine).build();
CloseableIterator<FilteredColumnarBatch> data = Scan.readData(
engine,
scan,
snapshot.getSchema(engine),
Optional.empty() // No filter predicate
);
// 5. Iterate through the data.
// The data is returned in columnar batches for efficiency.
System.out.println("\n🔍 Reading data...");
int rowCount = 0;
while(data.hasNext()) {
FilteredColumnarBatch batch = data.next();
try (CloseableIterator<Row> rows = batch.getRows()) {
while(rows.hasNext()) {
// You can process each row here
Row row = rows.next();
rowCount++;
}
}
}
System.out.println("\n✅ Successfully read " + rowCount + " rows from the table.");
System.out.println("💡 Note: Data is read in efficient columnar batches using Parquet format.");
}
}