Distinguishing Data Element Names with Different Semantics in DataHub

Original Slack Thread

<@U06TM7M40RK> How can I distinguish data elements that share the same name but have different semantics across data sources? Is there a namespace concept in DataHub?

To distinguish the same data element name with different semantics for different data sources in DataHub, you can use the concept of Platform Instances. This allows you to differentiate between datasets that might have the same name but belong to different instances of a platform, such as different databases or clusters.

Platform Instances

According to the <https://github.com/datahub-project/datahub/blob/master/docs/platform-instances.md|Platform Instances documentation>, DataHub supports the concept of platform instances to handle cases where the same data platform (e.g., MySQL) is deployed in multiple instances (e.g., different hosts or clusters). This is particularly useful when you have datasets with the same name but different semantics across these instances.

Example Configuration

To configure a platform instance, you can add an additional configuration parameter called platform_instance in your ingestion recipe. Here is an example of how you would configure a MySQL instance:

source:
  type: mysql
  config:
    # Coordinates
    host_port: localhost:3306
    platform_instance: core_finance
    database: dbname
    
    # Credentials
    username: root
    password: example

sink:
  # sink configs
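
As a rough illustration of how this plays out with more than one deployment, the sketch below builds two recipe dictionaries that differ only in host_port and platform_instance. The host names and the second instance name (core_marketing) are made up for the example, and the sink is left out just as in the YAML above.

```python
import copy
import json

# Base recipe mirroring the YAML above; all values are placeholders.
BASE_RECIPE = {
    "source": {
        "type": "mysql",
        "config": {
            "host_port": "localhost:3306",
            "database": "dbname",
            "username": "root",
            "password": "example",
        },
    },
    # The sink section is omitted here, as in the original recipe.
}


def recipe_for_instance(host_port: str, platform_instance: str) -> dict:
    """Return a copy of the base recipe bound to one MySQL deployment."""
    recipe = copy.deepcopy(BASE_RECIPE)
    recipe["source"]["config"]["host_port"] = host_port
    recipe["source"]["config"]["platform_instance"] = platform_instance
    return recipe


# Hypothetical hosts and instance names, one recipe per deployment.
finance = recipe_for_instance("finance-db.example.internal:3306", "core_finance")
marketing = recipe_for_instance("marketing-db.example.internal:3306", "core_marketing")

if __name__ == "__main__":
    print(json.dumps(finance, indent=2))
```

Both recipes point at a database called dbname, so the platform_instance value is the only thing that keeps the resulting datasets apart.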

URN Structure

The URN (Uniform Resource Name) in DataHub uniquely identifies a resource. It has three parts after the urn prefix: a namespace, an entity type, and an ID. For datasets, the ID combines the data platform, the dataset name (prefixed with the platform instance when one is configured), and the environment, so two datasets with the same name in different instances get different URNs.

As per the <https://datahubproject.io/docs/0.13.1/what/urn|URN documentation>, the URN for a dataset might look like this:

urn:li:dataset:(urn:li:dataPlatform:mysql,core_finance.dbname.table_name,PROD)

In this example:

  • urn:li:dataPlatform:mysql specifies the data platform.
  • core_finance.dbname.table_name is the dataset name: the platform instance (core_finance), followed by the database name and the table name.
  • PROD specifies the environment.
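
To make the effect concrete, here is a minimal sketch that assembles dataset URNs in the shape shown above using plain string formatting. The function name dataset_urn and the second instance core_marketing are made up for this example; it is an illustration of the URN layout, not DataHub's own implementation.

```python
from typing import Optional


def dataset_urn(
    platform: str,
    name: str,
    env: str = "PROD",
    platform_instance: Optional[str] = None,
) -> str:
    """Build a dataset URN string in the layout shown in this thread.

    Hypothetical helper: when a platform instance is given, it is
    prepended to the dataset name with a dot.
    """
    qualified = f"{platform_instance}.{name}" if platform_instance else name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{qualified},{env})"


# Same database and table name, two different (assumed) instances.
finance_urn = dataset_urn("mysql", "dbname.table_name", platform_instance="core_finance")
marketing_urn = dataset_urn("mysql", "dbname.table_name", platform_instance="core_marketing")

if __name__ == "__main__":
    print(finance_urn)
    print(marketing_urn)
```

The two URNs differ only in the instance prefix, which is what lets DataHub treat dbname.table_name in each deployment as a separate dataset.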

Namespace Concept

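DataHub uses the li namespace by default, but this can be changed if you fork DataHub. The namespace is part of every URN, so it applies to the whole deployment rather than to an individual data source. To separate same-named datasets across sources, the platform instance is the relevant mechanism.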
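
A small sketch of where the namespace sits inside a URN, assuming the urn:namespace:entity type:ID layout described in this thread. The names UrnParts and split_urn are invented for the example.

```python
from typing import NamedTuple


class UrnParts(NamedTuple):
    namespace: str
    entity_type: str
    entity_id: str


def split_urn(urn: str) -> UrnParts:
    """Split a URN into namespace, entity type, and ID.

    Only the first three colons are treated as separators, because the
    ID of a dataset URN itself contains a nested data platform URN.
    """
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0] != "urn":
        raise ValueError(f"not a URN of the expected shape: {urn!r}")
    return UrnParts(parts[1], parts[2], parts[3])


parts = split_urn(
    "urn:li:dataset:(urn:li:dataPlatform:mysql,core_finance.dbname.table_name,PROD)"
)

if __name__ == "__main__":
    print(parts.namespace)
    print(parts.entity_type)
    print(parts.entity_id)
```

Here the namespace comes out as li for every entity, while the platform instance only appears inside the ID.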

For more details, you can refer to the <https://datahubproject.io/docs/0.13.1/what/urn|URN documentation> and the <https://github.com/datahub-project/datahub/blob/master/docs/platform-instances.md|Platform Instances documentation>.

Would you like more details on any specific part of this process?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?

I’m not able to find any data sources to answer your question. Could you provide more details about the specific issue or question you have regarding DataHub? If this doesn’t help, please reach out to the DataHub support team for further assistance.