Creating Custom Browse Paths for Datasets in DataHub using Python Codes

Original Slack Thread

This message was deleted.attachment

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: _Please continue to tag RunLLM (e.g <@U06TM7M40RK>) to have it continue generating responses!

To address the issue of displaying only the table name in the UI instead of the complete path, you can customize the browse paths aspect in DataHub. This involves modifying the metadata ingestion process to set custom browse paths for your datasets.

Based on the provided data sources, you can achieve this by emitting a custom browse path aspect from your ingestion source code. Here is a Python code snippet that demonstrates how to set a custom browse path for a dataset:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.metadata.schema_classes import MetadataChangeProposalWrapper, BrowsePathsClass

# Define the custom browse path
browse_path = BrowsePathsClass(
    paths=["/sql_server/instance_name"]
)

# Create a MetadataChangeProposalWrapper for the dataset
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType="UPSERT",
    entityUrn=make_dataset_urn(platform="sql_server", name="table_name", env="PROD"),
    aspectName="browsePaths",
    aspect=browse_path,
)

# Emit the MetadataChangeProposalWrapper
# Assuming you have an emitter instance
emitter.emit(mcp)

In this example:

  • Replace "sql_server" with your actual platform name.
  • Replace "instance_name" with your actual instance name.
  • Replace "table_name" with your actual table name.

This code sets the browse path to /sql_server/instance_name, which should display only the table name in the UI under the specified path.

Would you like more details on any specific part of this process?

Sources:

Certainly! Below is a full sample code for setting custom browse paths for a dataset on the MSSQL platform. This example demonstrates how to configure the ingestion process to display only the table name in the UI.

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.metadata.schema_classes import MetadataChangeProposalWrapper, BrowsePathsClass
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Initialize the DataHub REST emitter
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Define the custom browse path
browse_path = BrowsePathsClass(
    paths=["/mssql/instance_name"]
)

# Create a MetadataChangeProposalWrapper for the dataset
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType="UPSERT",
    entityUrn=make_dataset_urn(platform="mssql", name="table_name", env="PROD"),
    aspectName="browsePaths",
    aspect=browse_path,
)

# Emit the MetadataChangeProposalWrapper
emitter.emit(mcp)

print("Custom browse path set successfully.")

Steps to Run the Code:

  1. Install DataHub Python Client:
    Ensure you have the DataHub Python client installed. You can install it using pip:

    pip install acryl-datahub
    
  2. Configure the Emitter:
    Update the gms_server URL to point to your DataHub instance. The default is http://localhost:8080.

  3. Customize the Browse Path:

    • Replace "mssql" with your actual platform name.
    • Replace "instance_name" with your actual instance name.
    • Replace "table_name" with your actual table name.
  4. Run the Script:
    Execute the script to set the custom browse path for your dataset.

This code sets the browse path to /mssql/instance_name, which should display only the table name in the UI under the specified path.

Would you like more detailed steps or troubleshooting tips?

Sources:

<@U06TM7M40RK> - can you please help to create python code for showing the Sql Server on UI as below screenshot?

![attachment]({‘ID’: ‘F07L2QGBHTK’, ‘EDITABLE’: False, ‘IS_EXTERNAL’: False, ‘USER_ID’: ‘U07K90G6431’, ‘CREATED’: ‘2024-09-03 14:38:17+00:00’, ‘PERMALINK’: ‘Slack’, ‘EXTERNAL_TYPE’: ‘’, ‘TIMESTAMPS’: ‘2024-09-03 14:38:17+00:00’, ‘MODE’: ‘hosted’, ‘DISPLAY_AS_BOT’: False, ‘PRETTY_TYPE’: ‘JPEG’, ‘NAME’: ‘Image from iOS.jpg’, ‘IS_PUBLIC’: True, ‘PREVIEW_HIGHLIGHT’: None, ‘MIMETYPE’: ‘image/jpeg’, ‘PERMALINK_PUBLIC’: ‘https://slack-files.com/TUMKD5EGJ-F07L2QGBHTK-cfce9ea789’, ‘FILETYPE’: ‘jpg’, ‘EDIT_LINK’: None, ‘URL_PRIVATE’: ‘Slack’, ‘HAS_RICH_PREVIEW’: False, ‘TITLE’: ‘Image from iOS’, ‘IS_STARRED’: False, ‘PREVIEW_IS_TRUNCATED’: None, ‘URL_PRIVATE_DOWNLOAD’: ‘Slack’, ‘PREVIEW’: None, ‘PUBLIC_URL_SHARED’: False, ‘MESSAGE_TS’: ‘1725374301.444579’, ‘PARENT_MESSAGE_TS’: ‘1725370434.848879’, ‘MESSAGE_CHANNEL_ID’: ‘CUMUWQU66’, ‘_FIVETRAN_DELETED’: False, ‘LINES_MORE’: None, ‘LINES’: None, ‘SIZE’: 26528, ‘_FIVETRAN_SYNCED’: ‘2024-09-08 08:21:53.414000+00:00’})

Sure thing! I’m looking through the available data to answer your question. Hold on a second… :hourglass_flowing_sand:

<@U06TM7M40RK>
Can you Please help to give a complete python code sample for Sql Server ingestion and show it properly on UI with platform instance along with schema and database?

Certainly! Below is a complete Python code sample for ingesting metadata from a SQL Server (MSSQL) database into DataHub and configuring it to display properly on the UI with platform instance, schema, and database.

Step-by-Step Guide:

  1. Install Required Packages:
    Ensure you have the DataHub Python client and MSSQL ingestion plugin installed:

    pip install acryl-datahub acryl-datahub[mssql]
    
  2. Create the Ingestion Recipe:
    Create a YAML file (mssql_ingestion_recipe.yaml) to configure the ingestion process.

    source:
      type: mssql
      config:
        host_port: "localhost:1433"  # Update with your SQL Server host and port
        database: "DemoDatabase"     # Update with your database name
        username: "user"             # Update with your username
        password: "pass"             # Update with your password
        options:
          connect_args:
            TrustServerCertificate: "Yes"
            Encrypt: "yes"
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"  # Update with your DataHub server URL
    
  3. Run the Ingestion:
    Use the DataHub CLI to run the ingestion:

    datahub ingest -c mssql_ingestion_recipe.yaml
    
  4. Customize Browse Paths:
    After ingestion, you can customize the browse paths to display the platform instance, schema, and database properly.

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.metadata.schema_classes import MetadataChangeProposalWrapper, BrowsePathsClass
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    
    # Initialize the DataHub REST emitter
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    
    # Define the custom browse path
    browse_path = BrowsePathsClass(
        paths=["/mssql/DemoDatabase/schema_name/table_name"]
    )
    
    # Create a MetadataChangeProposalWrapper for the dataset
    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType="UPSERT",
        entityUrn=make_dataset_urn(platform="mssql", name="schema_name.table_name", env="PROD"),
        aspectName="browsePaths",
        aspect=browse_path,
    )
    
    # Emit the MetadataChangeProposalWrapper
    emitter.emit(mcp)
    
    print("Custom browse path set successfully.")
    

Full Python Code Sample:

import datahub.emitter.mce_builder as builder
from datahub.metadata.schema_classes import MetadataChangeProposalWrapper, BrowsePathsClass
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Initialize the DataHub REST emitter
emitter = DatahubRestEmitter(gms_server="<http://localhost:8080>")

# Define the custom browse path
browse_path = BrowsePathsClass(
    paths=["/mssql/DemoDatabase/schema_name/table_name"]
)

# Create a MetadataChangeProposalWrapper for the dataset
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType="UPSERT",
    entityUrn=builder.make_dataset_urn(platform="mssql", name="schema_name.table_name", env="PROD"),
    aspectName="browsePaths",
    aspect=browse_path,
)

# Emit the MetadataChangeProposalWrapper
emitter.emit(mcp)

print("Custom browse path set successfully.")

Explanation:

  • Ingestion Recipe: The YAML file configures the MSSQL source and DataHub REST sink.
  • Custom Browse Path: The Python script sets a custom browse path to display the platform instance, schema, and database properly in the UI.

Would you like more detailed steps or troubleshooting tips?

Sources:

To insert a dataset with platform, database, and schema information and display it properly in the DataHub UI, you need to follow these steps:

  1. Ingest Metadata from SQL Server: Use the DataHub ingestion framework to ingest metadata from your SQL Server database.
  2. Customize Browse Paths: Set custom browse paths to ensure the platform, database, and schema are displayed at the top level in the UI.

Below is a complete Python code sample that demonstrates how to achieve this.

Step-by-Step Guide

Step 1: Install Required Packages

Ensure you have the DataHub Python client and MSSQL ingestion plugin installed:

pip install acryl-datahub acryl-datahub[mssql]

Step 2: Create the Ingestion Recipe

Create a YAML file (mssql_ingestion_recipe.yaml) to configure the ingestion process.

source:
  type: mssql
  config:
    host_port: "localhost:1433"  # Update with your SQL Server host and port
    database: "DemoDatabase"     # Update with your database name
    username: "user"             # Update with your username
    password: "pass"             # Update with your password
    options:
      connect_args:
        TrustServerCertificate: "Yes"
        Encrypt: "yes"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Update with your DataHub server URL

Step 3: Run the Ingestion

Use the DataHub CLI to run the ingestion:

datahub ingest -c mssql_ingestion_recipe.yaml

Step 4: Customize Browse Paths

After ingestion, you can customize the browse paths to display the platform, database, and schema properly.

import datahub.emitter.mce_builder as builder
from datahub.metadata.schema_classes import MetadataChangeProposalWrapper, BrowsePathsClass
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Initialize the DataHub REST emitter
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Define the custom browse path
browse_path = BrowsePathsClass(
    paths=["/mssql/DemoDatabase/schema_name/table_name"]
)

# Create a MetadataChangeProposalWrapper for the dataset
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType="UPSERT",
    entityUrn=builder.make_dataset_urn(platform="mssql", name="schema_name.table_name", env="PROD"),
    aspectName="browsePaths",
    aspect=browse_path,
)

# Emit the MetadataChangeProposalWrapper
emitter.emit(mcp)

print("Custom browse path set successfully.")

Full Python Code Sample

import datahub.emitter.mce_builder as builder
from datahub.metadata.schema_classes import MetadataChangeProposalWrapper, BrowsePathsClass
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Initialize the DataHub REST emitter
emitter = DatahubRestEmitter(gms_server="<http://localhost:8080>")

# Define the custom browse path
browse_path = BrowsePathsClass(
    paths=["/mssql/DemoDatabase/schema_name/table_name"]
)

# Create a MetadataChangeProposalWrapper for the dataset
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType="UPSERT",
    entityUrn=builder.make_dataset_urn(platform="mssql", name="schema_name.table_name", env="PROD"),
    aspectName="browsePaths",
    aspect=browse_path,
)

# Emit the MetadataChangeProposalWrapper
emitter.emit(mcp)

print("Custom browse path set successfully.")

Explanation

  • Ingestion Recipe: The YAML file configures the MSSQL source and DataHub REST sink.
  • Custom Browse Path: The Python script sets a custom browse path to display the platform, database, and schema properly in the UI.

Summary

  1. Configure Ingestion: Create a YAML configuration file with the necessary connection details and options for SQL Server.
  2. Run Ingestion: Use the DataHub CLI to run the ingestion.
  3. Customize Browse Paths: Use the provided Python script to set custom browse paths for your datasets.

Would you like more detailed steps or troubleshooting tips?

Sources:

To create a URN (Uniform Resource Name) with platform, database, and container/schema in DataHub, you need to follow the URN format specified by DataHub. The URN format for a dataset typically includes the platform, the dataset name (which can include the database and schema), and the environment.

URN Format

The general format for a dataset URN in DataHub is:

urn:li:dataset:(urn:li:dataPlatform:<platform>,<database>.<schema>.<table>,<env>)

Example

For a SQL Server (MSSQL) dataset, the URN might look like this:

urn:li:dataset:(urn:li:dataPlatform:mssql,DemoDatabase.schema_name.table_name,PROD)

Python Code to Create URN

Here is a Python code snippet to create a URN with platform, database, and schema:

from datahub.emitter.mce_builder import make_dataset_urn

# Define the platform, database, schema, table, and environment
platform = "mssql"
database = "DemoDatabase"
schema = "schema_name"
table = "table_name"
env = "PROD"

# Create the dataset URN
dataset_urn = make_dataset_urn(platform=platform, name=f"{database}.{schema}.{table}", env=env)

print(f"Dataset URN: {dataset_urn}")

Explanation

  • Platform: The data platform (e.g., “mssql” for SQL Server).
  • Database: The name of the database (e.g., “DemoDatabase”).
  • Schema: The schema or container name (e.g., “schema_name”).
  • Table: The table name (e.g., “table_name”).
  • Environment: The environment (e.g., “PROD” for production).

Full Example

Here is a complete example that includes creating the URN and emitting it to DataHub:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.metadata.schema_classes import MetadataChangeProposalWrapper, BrowsePathsClass
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Initialize the DataHub REST emitter
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Define the platform, database, schema, table, and environment
platform = "mssql"
database = "DemoDatabase"
schema = "schema_name"
table = "table_name"
env = "PROD"

# Create the dataset URN
dataset_urn = make_dataset_urn(platform=platform, name=f"{database}.{schema}.{table}", env=env)

# Define the custom browse path
browse_path = BrowsePathsClass(
    paths=[f"/{platform}/{database}/{schema}/{table}"]
)

# Create a MetadataChangeProposalWrapper for the dataset
mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType="UPSERT",
    entityUrn=dataset_urn,
    aspectName="browsePaths",
    aspect=browse_path,
)

# Emit the MetadataChangeProposalWrapper
emitter.emit(mcp)

print(f"Dataset URN: {dataset_urn}")
print("Custom browse path set successfully.")

Explanation

  • Dataset URN: The URN is created using the make_dataset_urn function with the platform, database, schema, table, and environment.
  • Custom Browse Path: The browse path is set to display the platform, database, schema, and table properly in the UI.
  • MetadataChangeProposalWrapper: This wrapper is used to emit the custom browse path to DataHub.

Would you like more detailed steps or troubleshooting tips?

Sources: