Troubleshooting Dataset Creation Using OpenAPI Approach

Original Slack Thread

I am trying to use http://localhost:8080/openapi/v3/entity/dataset POST to create a new dataset. I get a 200 response if async and a 202 if not async, but I don't see the dataset getting created. I want to use the OpenAPI approach to fit into an existing Node.js-based solution, even though it is not the recommended route. If I use the Python SDK and event emitter, it works. This is the sample JSON I am using for testing; I have tried different combinations, but none of them work.

                "urn": "urn:li:dataset:(urn:li:dataPlatform:BigQuery,xxx.yyy.poc_test,DEV)",
                "schemaMetadata": {
                    "value": {
                        "schemaName": "poc_test",                       
                        "platformSchema": {
                            "com.linkedin.schema.OtherSchema": {
                            "rawSchema": "__insert raw schema here__"
                            }
                        },                        
                        "fields": [
                            {
                                "nullable": "false",
                                "fieldPath": "application_id",
                                "description": "",
                                "isPartOfKey": "false",
                                "lastModified": {
                                    "actor": "urn:li:corpuser:ingestion",
                                    "time": updatedtime
                                },
                                "type": {
                                    "type": {
                                    "com.linkedin.schema.StringType": {}
                                    }
                                },
                                "recursive": "false",
                                "nativeDataType": "VARCHAR(50)"
                            },
                            {
                                "nullable": "false",
                                "fieldPath": "application_name",
                                "description": "",
                                "isPartOfKey": "false",
                                "lastModified": {
                                    "actor": "urn:li:corpuser:ingestion",
                                    "time": updatedtime
                                },
                                "type": {
                                    "type": {
                                    "com.linkedin.schema.StringType": {}
                                    }
                                },
                                "recursive": "false",
                                "nativeDataType": "VARCHAR(100)"
                            },
                            {
                                "nullable": "false",
                                "created": {
                                    "actor": "urn:li:corpuser:ingestion",
                                    "time": updatedtime
                                },
                                "fieldPath": "application_acronym",
                                "description": "",
                                "isPartOfKey": "false",
                                "lastModified": {
                                    "actor": "urn:li:corpuser:ingestion",
                                    "time": updatedtime
                                },
                                "type": {
                                    "type": {
                                    "com.linkedin.schema.DateType": {}
                                    }
                                },
                                "nativeDataType": "Date",
                                "recursive": "false"
                            }
                        ],
                        "platform": "urn:li:dataPlatform:BigQuery"
                    }
                }                
            }
1. Are you using UI or CLI for ingestion? API
2. Which DataHub version are you using? (e.g. 0.13.2)
3. What data source(s) are you integrating with? Test sample

To address your issue with creating a new dataset using the OpenAPI approach, let’s break down the steps and ensure everything is correctly configured.

Steps to Create a Dataset Using OpenAPI

  1. Ensure Correct Endpoint and Method:

    • You are using the correct endpoint http://localhost:8080/openapi/v3/entity/dataset with the POST method. Note that in recent DataHub versions the v3 entity endpoint expects the request body to be a JSON array of entity objects rather than a single object; confirm the exact shape for your version in the Swagger UI your instance exposes.
  2. Check JSON Payload:

    • Your payload must be valid JSON and structured according to the OpenAPI specification. In your sample, updatedtime is an unquoted placeholder (invalid JSON), and nullable, recursive, and isPartOfKey are strings ("false") where booleans are expected.
  3. Verify Response Codes:

    • A 200 response indicates the write was processed synchronously and succeeded.
    • A 202 response indicates the request was accepted for asynchronous processing: the write is queued and only becomes visible after the consumers process it, so a 202 does not guarantee the dataset was created. If a consumer rejects the payload, the failure shows up in the logs, not in the HTTP response. A minimal request sketch follows this list.
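
Below is a minimal sketch in TypeScript (Node 18+, which provides a global fetch) that exercises both modes. The array-shaped body, the async query parameter, and the datasetProperties smoke-test aspect reflect my reading of the v3 endpoint in recent DataHub versions; verify all of them against your instance's Swagger UI. DATAHUB_TOKEN is only needed if metadata service authentication is enabled.

// Sketch: POST a dataset entity to the OpenAPI v3 endpoint in both modes.
// Assumes Node 18+ (global fetch) and that the v3 endpoint accepts a JSON
// array of entity objects -- confirm against your instance's Swagger UI.

const GMS = "http://localhost:8080";
const TOKEN = process.env.DATAHUB_TOKEN; // only if auth is enabled

async function createDataset(runAsync: boolean): Promise<void> {
  const body = [
    {
      urn: "urn:li:dataset:(urn:li:dataPlatform:BigQuery,xxx.yyy.poc_test,DEV)",
      // Smallest useful aspect for a smoke test; swap in schemaMetadata later.
      datasetProperties: { value: { name: "poc_test" } },
    },
  ];

  const res = await fetch(`${GMS}/openapi/v3/entity/dataset?async=${runAsync}`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      ...(TOKEN ? { Authorization: `Bearer ${TOKEN}` } : {}),
    },
    body: JSON.stringify(body),
  });

  // Expect 200 for a synchronous write, 202 for an accepted async write.
  console.log(`async=${runAsync} -> HTTP ${res.status}`);
  console.log(await res.text());
}

createDataset(false).then(() => createDataset(true));

With async=false the write is applied before the response returns, which makes it the better mode while debugging: a 200 there means the entity really was persisted.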

Example JSON Payload

Here is a refined example of the JSON payload with those issues corrected (booleans instead of quoted strings, and a concrete epoch-millis timestamp in place of the updatedtime placeholder):

{
  "urn": "urn:li:dataset:(urn:li:dataPlatform:BigQuery,xxx.yyy.poc_test,DEV)",
  "schemaMetadata": {
    "value": {
      "schemaName": "poc_test",
      "platformSchema": {
        "com.linkedin.schema.OtherSchema": {
          "rawSchema": "__insert raw schema here__"
        }
      },
      "fields": [
        {
          "nullable": false,
          "fieldPath": "application_id",
          "description": "",
          "isPartOfKey": false,
          "lastModified": {
            "actor": "urn:li:corpuser:ingestion",
            "time": 1629840771000
          },
          "type": {
            "type": {
              "com.linkedin.schema.StringType": {}
            }
          },
          "nativeDataType": "VARCHAR(50)"
        },
        {
          "nullable": false,
          "fieldPath": "application_name",
          "description": "",
          "isPartOfKey": false,
          "lastModified": {
            "actor": "urn:li:corpuser:ingestion",
            "time": 1629840771000
          },
          "type": {
            "type": {
              "com.linkedin.schema.StringType": {}
            }
          },
          "nativeDataType": "VARCHAR(100)"
        },
        {
          "nullable": false,
          "fieldPath": "application_acronym",
          "description": "",
          "isPartOfKey": false,
          "lastModified": {
            "actor": "urn:li:corpuser:ingestion",
            "time": 1629840771000
          },
          "type": {
            "type": {
              "com.linkedin.schema.DateType": {}
            }
          },
          "nativeDataType": "Date"
        }
      ],
      "platform": "urn:li:dataPlatform:BigQuery"
    }
  }
}
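
To send this payload from a Node.js service, a sketch along the following lines should work. payload.json is a hypothetical file holding the JSON above, and the array wrapper around the payload is the same v3 assumption noted earlier.

// Sketch: send the schemaMetadata payload above as a synchronous write.
// payload.json is a hypothetical file containing the JSON shown above.
import { readFileSync } from "node:fs";

async function main(): Promise<void> {
  const payload = JSON.parse(readFileSync("payload.json", "utf8"));

  // Stamp a real epoch-millis timestamp into every field's lastModified,
  // in place of the updatedtime placeholder from the original sample.
  const now = Date.now();
  for (const field of payload.schemaMetadata.value.fields) {
    field.lastModified.time = now;
  }

  const res = await fetch(
    "http://localhost:8080/openapi/v3/entity/dataset?async=false",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify([payload]), // v3 batch shape: an array of entities
    }
  );

  console.log(res.status, await res.text());
}

main();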

Troubleshooting Steps

  1. Check Logs:

    • Review the DataHub GMS logs (and, for async writes, the consumer logs) for errors or warnings that might indicate why the dataset is not being created. With async writes, validation failures surface there rather than in the HTTP response.
  2. Verify DataHub Configuration:

    • Ensure that your DataHub instance is correctly configured to accept and process OpenAPI requests, and that the request carries a valid token if metadata service authentication is enabled.
    • Double-check the platform portion of your URN: the built-in platform URN is lowercase (urn:li:dataPlatform:bigquery). A capitalized BigQuery still produces a valid entity, but under an unrecognized platform, which can make it hard to find in the UI.
  3. Use the Python SDK as a Reference:

    • Since the Python SDK and event emitter work, compare the payloads and requests the SDK makes against your OpenAPI request to spot discrepancies. You can also read the entity back over the same API to confirm whether the write landed; see the sketch after this list.
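
A quick, status-code-independent way to confirm whether a write landed is to read the entity back. A sketch, assuming the v3 read route takes the URL-encoded URN as a path segment (verify the exact route in your Swagger UI):

// Sketch: read the dataset back to verify the write landed.
// Assumes GET /openapi/v3/entity/dataset/{urlEncodedUrn} -- verify the
// exact read route against your instance's Swagger UI.

const urn =
  "urn:li:dataset:(urn:li:dataPlatform:BigQuery,xxx.yyy.poc_test,DEV)";

async function verify(): Promise<void> {
  const res = await fetch(
    `http://localhost:8080/openapi/v3/entity/dataset/${encodeURIComponent(urn)}`
  );
  if (res.status === 404) {
    console.log("Entity not found -- the write never landed.");
  } else {
    console.log("Entity exists:", JSON.stringify(await res.json(), null, 2));
  }
}

verify();

A 404 here after a 202 on the write is a strong signal that the async consumer rejected the payload; the reason will be in the consumer logs.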

Would you like more detailed instructions or have any specific questions about this process?