<@U06TM7M40RK> We are using dbt core as our source, target for the models is azure sql, and we use power BI to build dashboards based on the data in Azure SQL.
I want to ingest PowerBI including lineage. So for example, I will see a lineage from dbt to azuresql and from azuresql to powerbi. Below is the example of the recipe file taken from datahub documentation.
How should I adjust it for our purposes?
type: "powerbi"
config:
# Your Power BI tenant identifier
tenant_id: a949d688-67c0-4bf1-a344-e939411c6c0a
# Azure AD Application identifier
client_id: foo
# Azure AD App client secret
client_secret: bar
# Ingest elements of below PowerBi Workspace into Datahub
workspace_id_pattern:
allow:
- 4bd10256-e999-45dd-8e56-571c77153a5f
deny:
# Enable / Disable ingestion of ownership information for dashboards
extract_ownership: true
# Enable/Disable extracting workspace information to DataHub containers
extract_workspaces_to_containers: true
# Enable / Disable ingestion of endorsements.
# Please notice that this may overwrite any existing tags defined to ingested entities!
extract_endorsements_to_tags: false
# Optional -- This mapping is optional and only required to configure platform-instance for upstream tables
# A mapping of PowerBI datasource's server i.e host[:port] to data platform instance.
# :port is optional and only needed if your datasource server is running on non-standard port.
# For Google BigQuery the datasource's server is google bigquery project name
server_to_platform_instance:
<http://ap-south-1.snowflakecomputing.com|ap-south-1.snowflakecomputing.com>:
platform_instance: operational_instance
env: DEV
oracle-server:1920:
platform_instance: high_performance_production_unit
env: PROD
big-query-sales-project:
platform_instance: sn-2
env: QA
# Need admin_api, only ingest workspace that are modified since...
modified_since: "2023-02-10T00:00:00.0000000Z"
ownership:
# create powerbi user as datahub corpuser, false will still extract ownership of workspace/ dashboards
create_corp_user: false
# use email to build user urn instead of powerbi user identifier
use_powerbi_email: true
# remove email suffix like @acryl.io
remove_email_suffix: true
# only ingest user with certain authority
owner_criteria: ["ReadWriteReshareExplore","Owner","Admin"]
# wrap powerbi tables (datahub dataset) under 1 powerbi dataset (datahub container)
extract_datasets_to_containers: true
# only ingest dataset that are endorsed, like "Certified"
filter_dataset_endorsements:
allow:
- Certified
# extract powerbi dashboards and tiles
extract_dashboards: false
# extract powerbi dataset table schema
extract_dataset_schema: true
# Enable PowerBI dataset profiling
profiling:
enabled: false
# Pattern to limit which resources to profile
# Matched resource format is following:
# workspace_name.dataset_name.table_name
profile_pattern:
deny:
- .*
sink:
# sink config```
To adjust your PowerBI ingestion recipe for your specific use case, where you want to see lineage from dbt to Azure SQL and from Azure SQL to PowerBI, you need to ensure that the configuration captures the necessary lineage information. Here are some key points and adjustments you might consider:
Ensure Lineage Extraction: Make sure that the extract_lineage parameter is set to true to enable lineage extraction. This is crucial for capturing the lineage from Azure SQL to PowerBI.
Server to Platform Instance Mapping: You need to map your Azure SQL server to a platform instance in the server_to_platform_instance section. This mapping helps DataHub understand the relationship between your data sources and PowerBI.
Workspace and Dataset Configuration: Ensure that your workspace_id_pattern is correctly configured to include all relevant workspaces. This will ensure that all necessary PowerBI workspaces are ingested.
Ownership and Endorsements: If you want to capture ownership and endorsements, ensure that extract_ownership and extract_endorsements_to_tags are set according to your needs.
Profiling and Schema Extraction: If you want to profile datasets or extract schema information, ensure that profiling.enabled and extract_dataset_schema are set to true.
Here’s an example of how you might adjust your recipe:
Make sure to replace placeholders like azure-sql-server and azure_sql_instance with your actual Azure SQL server details. This configuration should help you capture the desired lineage from dbt to Azure SQL and from Azure SQL to PowerBI.
Would you like more detailed steps or further assistance on any of these points?
<@U06TM7M40RK> I still dont see columns and lineage four the sources on powerbi platform.
In my case, the azure-sql-server should be something like this? my-sql.http://database.windows.net|database.windows.net
Do I need the port?
What should I put instead of the azure_sql_instance?
To ensure that you see columns and lineage for the sources on the PowerBI platform, it’s important to correctly configure the server_to_platform_instance mapping in your ingestion recipe. Let’s address your questions step-by-step:
Azure SQL Server Format:
The format for your Azure SQL server should be my-sql.database.windows.net. You do not need to include http:// or https:// in the server name.
You typically do not need to specify a port unless your Azure SQL server is running on a non-standard port. The default port for SQL Server is 1433, which is usually implied.
Platform Instance:
The platform_instance is a logical name you assign to your Azure SQL instance within DataHub. It helps DataHub identify and map the data source to the PowerBI platform.
You can choose a descriptive name for your platform_instance, such as azure_sql_prod or azure_sql_dev, depending on your environment.
Here’s how you can adjust your configuration:
source:
type: "powerbi"
config:
tenant_id: a949d688-67c0-4bf1-a344-e939411c6c0a
client_id: foo
client_secret: bar
workspace_id_pattern:
allow:
- 4bd10256-e999-45dd-8e56-571c77153a5f
extract_ownership: true
extract_workspaces_to_containers: true
extract_endorsements_to_tags: false
extract_lineage: true
server_to_platform_instance:
my-sql.database.windows.net: # Use your actual Azure SQL server name
platform_instance: azure_sql_prod # Choose a descriptive name for your instance
env: PROD
modified_since: "2023-02-10T00:00:00.0000000Z"
ownership:
create_corp_user: false
use_powerbi_email: true
remove_email_suffix: true
owner_criteria: ["ReadWriteReshareExplore","Owner","Admin"]
extract_datasets_to_containers: true
filter_dataset_endorsements:
allow:
- Certified
extract_dashboards: true
extract_dataset_schema: true
profiling:
enabled: false
profile_pattern:
deny:
- .*
sink:
type: datahub-rest
config:
server: 'http://localhost:8080'
Make sure to replace my-sql.database.windows.net with your actual Azure SQL server name and choose an appropriate name for azure_sql_prod that reflects your environment.
Would you like more detailed steps or further assistance on any specific part of this process?
The admin_apis_only parameter in the PowerBI ingestion configuration is relevant when you want to use the PowerBI Admin API exclusively for metadata retrieval. This can be important for extracting comprehensive metadata, including lineage, but it comes with certain caveats.
When to Use admin_apis_only:
Comprehensive Metadata Extraction:
Using the Admin API allows you to extract more detailed metadata, including lineage, as it provides access to all PowerBI resources irrespective of whether the service principal has explicit access to them.
Caveats:
If admin_apis_only is set to true, certain metadata like report pages will not be ingested because the page API is not available in the PowerBI Admin API.
PowerBI parameters might not be resolved to actual values while processing M-Query for table lineage.
Dataset profiling is unavailable, as it requires access to the workspace API.
Configuration:
If you decide to use the Admin API for lineage extraction, you can enable the admin_apis_only parameter in your configuration:
Using admin_apis_only can be beneficial for extracting detailed lineage information, but you should be aware of the limitations and ensure that it aligns with your metadata extraction needs.
Would you like more detailed steps or further assistance on any specific part of this process?
<@U06TM7M40RK> with or without the admin_apis_only - I still don’t see upstream lineage to azure sql.
Do I need to re-run azure sql ingestion after ingesting the power bi ?
To ensure that you see upstream lineage from PowerBI to Azure SQL, it’s important to correctly sequence and configure your ingestion processes. Here are some steps and considerations:
Ingestion Order:
It is generally recommended to run the ingestion for both your data warehouse (Azure SQL in this case) and PowerBI. The order can matter if there are dependencies, but typically, you should ensure that both are ingested to capture the full lineage.
Re-run Azure SQL Ingestion:
If you have already ingested PowerBI and are not seeing the expected lineage, it might be beneficial to re-run the Azure SQL ingestion. This ensures that all metadata is up-to-date and any new configurations or changes are captured.
Verify Configuration:
Double-check that your ingestion configurations for both Azure SQL and PowerBI are correct and that they include all necessary parameters for lineage extraction.
Check for Known Issues:
There might be specific issues or limitations with the current version of DataHub that affect lineage visibility. Checking the <https://github.com/datahub-project/datahub/issues|DataHub GitHub issues> or <Slack discussions> can provide insights into any known problems.
Ensure Correct Mapping:
Make sure that the server_to_platform_instance mapping in your PowerBI configuration correctly matches the Azure SQL server details. This mapping is crucial for DataHub to understand the relationship between the data sources.
Debugging and Logs:
Enable detailed logging during the ingestion process to capture any warnings or errors that might indicate why lineage is not being captured as expected.
Here’s a general approach to follow:
Step 1: Ingest Azure SQL metadata into DataHub.
Step 2: Ingest PowerBI metadata into DataHub.
Step 3: Verify lineage in the DataHub UI.
If you have followed these steps and still do not see the expected lineage, it might be necessary to review the logs for any errors or warnings and ensure that all configurations are correct.
Would you like more detailed steps or further assistance on any specific part of this process?
<@U06TM7M40RK> To be more accurate, I’m adding here the recipe for the azure sql ingestion and for the power bi recipe.
Go over the server_to_platform_instance and in general let me know if there is any possible conflict that might impact the missing lineage.
Azure SQL recipe:
type: mssql-odbc
config:
# Coordinates
host_port: "<http://myserver-dev1-001.database.windows.net:1433|myserver-dev1-001.database.windows.net:1433>"
database: "custom-data"
# Credentials
username: $HYPERPLANE__KEYVAULT_SECRET__QQID
password: $HYPERPLANE__KEYVAULT_SECRET__QQPASS
include_tables: true
include_jobs: false
profiling:
enabled: true
profile_table_level_only: false # Only profile at the table level if set to true
include_field_distinct_count: true
include_field_min_value: true
include_field_max_value: true
include_field_mean_value: false
include_field_median_value: false
include_field_stddev_value: false
include_field_histogram: false
include_views: true
options:
connect_args:
authentication: ActiveDirectoryPassword
use_odbc: "True"
uri_args:
driver: "ODBC Driver 18 for SQL Server"
Encrypt: "No"
TrustServerCertificate: "No"
ssl: "False"
# Exclude specific tables
stateful_ingestion:
enabled: true # Enable stateful ingestion
remove_stale_metadata: true # Automatically remove stale datasets
pipeline_name: "azure_sql_ingestion_pipeline" # Consistent pipeline name
To ensure that the lineage from Azure SQL to PowerBI is correctly captured, let’s review the configurations in both your Azure SQL and PowerBI ingestion recipes, focusing on the server_to_platform_instance mapping and other potential conflicts.
Azure SQL Ingestion Recipe
Host Port:
Ensure that the host_port is correctly specified without the http:// prefix. It should be myserver-dev1-001.database.windows.net:1433.
ODBC Configuration:
The use_odbc parameter should be a boolean, not a string. It should be true instead of "True".
URI Arguments:
Ensure that the uri_args are correctly configured for your environment. The Encrypt and TrustServerCertificate settings should align with your security requirements.
PowerBI Ingestion Recipe
Server to Platform Instance Mapping:
The server_to_platform_instance mapping should match the host_port used in the Azure SQL ingestion. Ensure that it is specified without the http:// prefix, like this:
The admin_apis_only parameter is set to true, which is fine for detailed metadata extraction. Ensure that your Azure AD app has the necessary permissions to access the PowerBI Admin API.
Stateful Ingestion:
Ensure that stateful ingestion is enabled in both recipes to manage metadata changes effectively.
General Considerations
Consistency: Ensure that the platform_instance name (azure_sql_instance) is consistent across both recipes. This helps DataHub map the lineage correctly.
Environment: Verify that the env parameter is set correctly in both recipes. It should reflect the actual environment (e.g., PROD, DEV).
Logging and Debugging: Enable detailed logging during ingestion to capture any warnings or errors that might indicate why lineage is not being captured as expected.
By ensuring consistency and correctness in these configurations, you should be able to capture the desired lineage from Azure SQL to PowerBI. If issues persist, reviewing logs and checking for any specific errors or warnings during ingestion can provide further insights.
Would you like more detailed steps or further assistance on any specific part of this process?
<@U06TM7M40RK> About this suggestion: General Considerations
Consistency: Ensure that the platform_instance name (azure_sql_instance) is consistent across both recipes. This helps DataHub map the lineage correctly.
I don’t have platform_instance in my azure sql ingestion, so how can I make sure its the same?
To ensure consistency between your Azure SQL and PowerBI ingestion recipes, particularly regarding the platform_instance, you can follow these steps:
Add platform_instance to Azure SQL Ingestion:
While the platform_instance is not mandatory for all ingestion sources, adding it can help ensure that DataHub correctly maps the lineage between Azure SQL and PowerBI. You can add a platform_instance parameter to your Azure SQL ingestion configuration.
Ensure Consistent Naming:
Use the same platform_instance name in both the Azure SQL and PowerBI configurations. This consistency helps DataHub understand that the data from Azure SQL is the same as the data being used in PowerBI.
Here’s how you can modify your Azure SQL ingestion recipe to include the platform_instance:
By adding the platform_instance to your Azure SQL ingestion and ensuring it matches the one in your PowerBI configuration, you help DataHub correctly map the lineage between these two data sources.
Would you like more detailed steps or further assistance on any specific part of this process?
<@U06TM7M40RK> I added the platform instance to azure sql ingestion - still no column lineage and no lineage from azure sql to powerbi.
From the log I see this in results (is it ok that only 1 dashboard scanned?)
'last_state_non_deletable_entities': [],
'dashboards_scanned': 1,
'charts_scanned': 0,
'filtered_dashboards': [],
'filtered_charts': [],
'number_of_workspaces': 1,
'start_time': '2024-11-02 01:26:09.794352 (16 seconds ago)',
'running_time': '16 seconds',
'failures': [],
'warnings': [],
'infos': []}
Sink (datahub-rest) report:
{'total_records_written': 332,
'records_written_per_second': 20,
'warnings': [],
'failures': [],
'start_time': '2024-11-02 01:26:09.435398 (16.36 seconds ago)',
'current_time': '2024-11-02 01:26:25.796493 (now)',
'total_duration_in_seconds': 16.36,
'max_threads': 15,
'gms_version': 'v0.14.1',
'pending_requests': 0,
'main_thread_blocking_timer': '1.857 seconds'}```
And I also see this in the log:
```[2024-11-02 01:26:16,436] INFO {datahub.ingestion.source.powerbi.rest_api_wrapper.powerbi_api:325} - Unable to fetch dataset parameters for 946d58a0-39c5-4866-a9d4-ecdb337b0528: 'currentValue'
[2024-11-02 01:26:20,464] INFO {datahub.ingestion.source.powerbi.rest_api_wrapper.powerbi_api:325} - Unable to fetch dataset parameters for 07426e41-bdf1-45a1-b2f7-1df406390808: 'currentValue'
[2024-11-02 01:26:21,914] INFO {datahub.ingestion.source.powerbi.rest_api_wrapper.powerbi_api:415} - Skipping endorsements tag as extract_endorsements_to_tags is set to false
[2024-11-02 01:26:23,343] INFO {datahub.ingestion.source.powerbi.rest_api_wrapper.powerbi_api:175} - Skipping endorsements tags retrieval for report as extract_endorsements_to_tags is set to false
[2024-11-02 01:26:23,344] INFO {datahub.ingestion.source.powerbi.rest_api_wrapper.powerbi_api:466} - Skipping tag retrieval for dashboard as extract_endorsements_to_tags is set to false
[2024-11-02 01:26:23,344] INFO {datahub.ingestion.source.powerbi.rest_api_wrapper.powerbi_api:426} - Skipping independent datasets retrieval as extract_independent_datasets is set to false```
Based on the logs and the configuration you’ve shared, there are a few areas to consider for troubleshooting the missing column lineage and the lineage from Azure SQL to PowerBI:
Dashboard and Dataset Scanning:
The log indicates that only one dashboard was scanned. This might be expected if you have only one dashboard in the workspace or if the configuration limits the scope of ingestion. Ensure that your workspace_id_pattern in the PowerBI configuration is set to include all relevant workspaces.
Dataset Parameters:
The log messages about being unable to fetch dataset parameters ('currentValue') suggest there might be issues with parameter resolution in PowerBI datasets. This can affect lineage extraction if parameters are used in queries that define lineage.
Endorsements and Tags:
The log indicates that endorsements and tags are being skipped because extract_endorsements_to_tags is set to false. While this setting is not directly related to lineage, it might affect the completeness of metadata if endorsements are used to signify important datasets.
Independent Datasets:
The log mentions skipping independent datasets retrieval because extract_independent_datasets is set to false. If your PowerBI reports rely on datasets that are not directly linked to dashboards, this setting might need to be adjusted.
Column Lineage:
Ensure that extract_column_level_lineage is set to true in your PowerBI configuration, which you have already done. However, verify that the Azure SQL ingestion is also configured to support column-level metadata extraction.
Re-run Ingestion:
After making any configuration changes, re-run both the Azure SQL and PowerBI ingestion processes to ensure that all metadata is up-to-date.
Suggested Actions:
Review Workspace Configuration:
Ensure that all relevant workspaces are included in the workspace_id_pattern and that there are no filters inadvertently excluding dashboards or datasets.
Check Dataset Parameters:
Investigate the datasets mentioned in the log to understand why parameters cannot be fetched. This might require checking the PowerBI dataset configurations.
Enable Independent Datasets:
Consider setting extract_independent_datasets to true if your reports use datasets not directly linked to dashboards.
Verify Column Lineage Support:
Double-check that both Azure SQL and PowerBI configurations are set to support column-level lineage and that there are no known issues with the versions you are using.
Enable Detailed Logging:
Increase the logging level to capture more detailed information during ingestion, which might provide further insights into any issues.