Troubleshooting Dataset Existence Issue in DataHub with PostgreSQL Data Source

Original Slack Thread

Hello team, I have this recipe to ingest data from Postgres, which ingests fine, but when I run datahub exists --urn "urn:li:dataset:(urn:li:dataPlatform:postgres,cdr.public.customer, PROD)" it returns false. Any thoughts on why that is?

    type: postgres
    config:
        host_port: xxxxxxx
        database: xxxx
        username: xxxx
        include_tables: true
        include_views: true
        platform_instance: customer_info
        profiling:
            enabled: true
            profile_table_level_only: true
        stateful_ingestion:
            enabled: true
        password: xxxxx
        env: PROD

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)


To help you more effectively, could you please provide the following information?

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery)

Based on the information provided, it seems like you are using the CLI for ingestion and integrating with a PostgreSQL data source. The issue you are facing might be related to the URN format or the ingestion process itself.

Possible Issues and Solutions

  1. URN Format: Ensure that the URN format is correct. The URN should follow the pattern urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>). In your case, it should be:

    urn:li:dataset:(urn:li:dataPlatform:postgres,cdr.public.customer,PROD)
    
  2. Platform Instance: Since you are using a platform_instance in your ingestion recipe, the URN should include the platform instance. According to the Platform Instances documentation (https://github.com/datahub-project/datahub/blob/master/docs/platform-instances.md), the URN should be:

    urn:li:dataset:(urn:li:dataPlatform:postgres,customer_info.cdr.public.customer,PROD)
    
  3. Check Entity Existence: You can use the datahub exists command to check if the entity exists in DataHub. Ensure that you are using the correct URN format:

    datahub exists --urn "urn:li:dataset:(urn:li:dataPlatform:postgres,customer_info.cdr.public.customer,PROD)"
    
  4. GraphQL Query: If the CLI command does not work, you can use a GraphQL query to check if the dataset exists. Here is an example of how to do it using Python:

    import requests
    import json
    
    graphql_endpoint = "http://<your-datahub-host>/api/graphql"
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer <your-access-token>"
    }
    payload = {
        "query": """
        query get_dataset($urn: String!) {
            dataset(urn: $urn) {
                urn
                name
            }
        }
        """,
        "variables": {
            "urn": "urn:li:dataset:(urn:li:dataPlatform:postgres,customer_info.cdr.public.customer,PROD)"
        }
    }
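    # Send the query; the timeout (an arbitrary 30 seconds) keeps the call from hanging indefinitely.
    response = requests.post(graphql_endpoint, data=json.dumps(payload), headers=headers, timeout=30)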
    response.raise_for_status()
    print(response.json())
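
To make the two URN shapes from points 1 and 2 concrete, here is a minimal sketch that assembles them as plain strings. `dataset_urn` is a hypothetical helper written for this thread, not part of the DataHub CLI or SDK, and it assumes the platform instance is simply prefixed to the dataset name with a dot, as in the examples above.

```python
from typing import Optional


def dataset_urn(platform: str, name: str, env: str = "PROD",
                platform_instance: Optional[str] = None) -> str:
    """Build a dataset URN string (hypothetical helper, not a DataHub API)."""
    # Strip stray whitespace from each part so the URN contains no spaces.
    platform, name, env = (part.strip() for part in (platform, name, env))
    if platform_instance:
        # Assumption: the platform instance is prefixed to the name with a dot.
        name = f"{platform_instance.strip()}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# URN without a platform instance (point 1).
print(dataset_urn("postgres", "cdr.public.customer"))
# URN with the platform instance from the recipe (point 2).
print(dataset_urn("postgres", "cdr.public.customer", platform_instance="customer_info"))
```

Either string can then be passed to datahub exists --urn or to the GraphQL query above.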
    

Next Steps

  1. Verify the URN format and try the datahub exists command again.
  2. If the issue persists, use the GraphQL query to check if the dataset exists.
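
If it is unclear which URN form was ingested, one option is to try each candidate in turn. The sketch below is only an illustration: `first_existing` is a hypothetical helper, and the `exists` argument stands for whatever lookup you actually use (for example a wrapper around the CLI command or the GraphQL query). The `fake_catalog` set is made-up demo data, not a real DataHub response.

```python
from typing import Callable, Iterable, Optional


def first_existing(urns: Iterable[str], exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first URN for which exists(urn) is true, or None if none match."""
    for urn in urns:
        if exists(urn):
            return urn
    return None


candidates = [
    "urn:li:dataset:(urn:li:dataPlatform:postgres,cdr.public.customer,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:postgres,customer_info.cdr.public.customer,PROD)",
]

# Made-up stand-in for a real lookup, used only so the sketch runs on its own.
fake_catalog = {"urn:li:dataset:(urn:li:dataPlatform:postgres,example.table,PROD)"}

match = first_existing(candidates, lambda urn: urn in fake_catalog)
print(match if match else "none of the candidate URNs was found")
```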

Would you like more detailed steps or further assistance on any of these points?

There was an extra space in the URN, so I removed it, and the command returned true.
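
For anyone hitting the same thing, a stray space like this can be caught before running the command. This is a minimal sketch using only the Python standard library; both function names are hypothetical, and it assumes no part of the URN legitimately contains whitespace.

```python
import re
from typing import List


def find_urn_whitespace(urn: str) -> List[int]:
    """Return the index of every whitespace character found in the URN."""
    return [match.start() for match in re.finditer(r"\s", urn)]


def strip_urn_whitespace(urn: str) -> str:
    """Remove all whitespace from the URN (assumes none of it is intentional)."""
    return re.sub(r"\s+", "", urn)


# The URN from the original question, with a space before PROD.
pasted = "urn:li:dataset:(urn:li:dataPlatform:postgres,cdr.public.customer, PROD)"

positions = find_urn_whitespace(pasted)
if positions:
    print(f"whitespace found at index(es) {positions}")
    print("cleaned:", strip_urn_whitespace(pasted))
```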