<@U06TM7M40RK> I want to create lineage from the logic inside the S3 files. These S3 files have stored procedure logic inside. I need to extract the source table and final table; then I need to display them in the lineage representation. How this can be implemented?
Sample code present in S3 file
select * from "SCHEMA_NAME"."SOURCE_TABLE_NAME" where "Month_ID" < @month_id```
To create lineage from the logic inside S3 files containing stored procedure logic, you can follow these steps:
Extract SQL Logic: First, you need to extract the SQL logic from the S3 files. This involves reading the files and parsing the SQL statements to identify the source and target tables.
Parse SQL for Lineage: Use a SQL parser to analyze the SQL statements and extract lineage information. The parser should be able to identify the source table ("SCHEMA_NAME"."SOURCE_TABLE_NAME") and the target table ("SCHEMA_NAME"."FINAL_TABLE_NAME").
Emit Lineage to DataHub: Once you have the lineage information, you can use the DataHub SDK to emit this lineage. You will need to create lineage relationships between the source and target datasets programmatically.
Here is an example of how you can use the DataHub Python SDK to emit lineage:
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
DatasetLineageTypeClass,
UpstreamClass,
UpstreamLineageClass,
)
# Initialize the DataHub emitter
emitter = DatahubRestEmitter("http://localhost:8080")
# Define the source and target datasets
source_dataset_urn = make_dataset_urn(platform="your_platform", name="SCHEMA_NAME.SOURCE_TABLE_NAME", env="PROD")
target_dataset_urn = make_dataset_urn(platform="your_platform", name="SCHEMA_NAME.FINAL_TABLE_NAME", env="PROD")
# Create the lineage relationship
upstream = UpstreamClass(
dataset=source_dataset_urn,
type=DatasetLineageTypeClass.TRANSFORMED,
)
lineage = UpstreamLineageClass(
upstreams=[upstream],
)
# Emit the lineage to DataHub
emitter.emit_mce(
{
"entityType": "dataset",
"entityUrn": target_dataset_urn,
"aspectName": "upstreamLineage",
"aspect": lineage,
}
)
View Lineage in DataHub: After emitting the lineage, you can view it in the DataHub UI. Navigate to the datasets involved, and you should see the lineage relationships visualized.
This guide provides detailed instructions on how to add lineage between datasets, including using the DataHub API to programmatically emit lineage information.
If you need further assistance or more detailed steps, please let me know!