Troubleshooting DataHub Metadata Ingestion Errors

Original Slack Thread

<@U06TM7M40RK> I'm using DataHub 0.14.0.2 and want to ingest a metadata file. Can you show an example metadata file, and the recipe.yml to send it to GMS?

To ingest a metadata file into DataHub 0.14.0.2 and send it to GMS (the Generalized Metadata Service), you need a metadata file and a corresponding ingestion recipe. Below are examples of both:

Example Metadata File
The file source does not accept arbitrary JSON; it expects DataHub's serialized MetadataChangeEvent (MCE) records, where each dataset is a snapshot carrying a list of aspects. Assume a JSON file named metadata.json with the following content:

[
    {
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": "urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)",
                "aspects": [
                    {
                        "com.linkedin.pegasus2avro.schema.SchemaMetadata": {
                            "schemaName": "table_A",
                            "platform": "urn:li:dataPlatform:bigquery",
                            "version": 0,
                            "created": {"time": 0, "actor": "urn:li:corpuser:ingestion"},
                            "lastModified": {"time": 0, "actor": "urn:li:corpuser:ingestion"},
                            "hash": "",
                            "platformSchema": {
                                "com.linkedin.pegasus2avro.schema.OtherSchema": {"rawSchema": ""}
                            },
                            "fields": [
                                {
                                    "fieldPath": "col1",
                                    "nativeDataType": "STRING",
                                    "type": {"type": {"com.linkedin.pegasus2avro.schema.StringType": {}}}
                                },
                                {
                                    "fieldPath": "col2",
                                    "nativeDataType": "INTEGER",
                                    "type": {"type": {"com.linkedin.pegasus2avro.schema.NumberType": {}}}
                                }
                            ]
                        }
                    }
                ]
            }
        }
    }
]

An entry for table_B follows the same pattern; append it as a second element of the top-level array.
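Writing the nested MCE envelope by hand is error-prone once you have more than a couple of tables. As a sketch, a small script using only the Python standard library can generate metadata.json; the class-name strings follow DataHub's serialized-MCE convention, while the helper functions themselves are illustrative, not part of any DataHub API:

```python
import json

SNAPSHOT_CLASS = "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot"
SCHEMA_CLASS = "com.linkedin.pegasus2avro.schema.SchemaMetadata"


def schema_field(field_path: str, native_type: str, type_class: str) -> dict:
    """One SchemaMetadata field, with DataHub's nested logical-type wrapper."""
    return {
        "fieldPath": field_path,
        "nativeDataType": native_type,
        "type": {"type": {type_class: {}}},
    }


def dataset_mce(name: str, platform: str, fields: list) -> dict:
    """Wrap a SchemaMetadata aspect for one dataset in an MCE envelope."""
    audit = {"time": 0, "actor": "urn:li:corpuser:ingestion"}
    return {
        "proposedSnapshot": {
            SNAPSHOT_CLASS: {
                "urn": f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},PROD)",
                "aspects": [
                    {
                        SCHEMA_CLASS: {
                            "schemaName": name,
                            "platform": f"urn:li:dataPlatform:{platform}",
                            "version": 0,
                            "created": audit,
                            "lastModified": audit,
                            "hash": "",
                            "platformSchema": {
                                "com.linkedin.pegasus2avro.schema.OtherSchema": {
                                    "rawSchema": ""
                                }
                            },
                            "fields": fields,
                        }
                    }
                ],
            }
        }
    }


fields = [
    schema_field("col1", "STRING", "com.linkedin.pegasus2avro.schema.StringType"),
    schema_field("col2", "INTEGER", "com.linkedin.pegasus2avro.schema.NumberType"),
]
mces = [dataset_mce(name, "bigquery", fields) for name in ("table_A", "table_B")]
with open("metadata.json", "w") as f:
    json.dump(mces, f, indent=4)
```

Running the script writes a metadata.json with one MCE per table.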

Example Ingestion Recipe
Create a YAML file named recipe.yml with the following content:

source:
  type: "file"
  config:
    path: "./metadata.json"
    file_extension: ".json"
    read_mode: "AUTO"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL
    token: "your-auth-token"  # Replace with your DataHub auth token if required

Running the Ingestion
To run the ingestion, use the DataHub CLI with the following command:

datahub ingest -c recipe.yml

This command will read the metadata from metadata.json and ingest it into your DataHub instance.
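Since a malformed file is a common cause of a failed file-source run, it can help to sanity-check that metadata.json parses as a top-level JSON array before invoking the CLI. A minimal stdlib check (the sample records written here are placeholders):

```python
import json


def check_metadata_file(path: str) -> int:
    """Return the record count, or raise if the file is not a JSON array."""
    with open(path) as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return len(records)


# Write a minimal two-record placeholder file, then check it.
sample = [{"proposedSnapshot": {}}, {"proposedSnapshot": {}}]
with open("metadata.json", "w") as f:
    json.dump(sample, f)

print(check_metadata_file("metadata.json"))
```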

Would you like more detailed steps or additional information on any specific part of the process?

<@U06TM7M40RK> Then, if I ingest the example above, will datasetProperties be created automatically as well?

Ingesting the example above will not automatically create datasetProperties; that file only carries schema information. To populate datasetProperties, you need to add a DatasetProperties aspect explicitly to each dataset entry in the metadata file.

Example Metadata File with datasetProperties
Here is an updated example of the metadata.json file that includes datasetProperties:

[
    {
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": "urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)",
                "aspects": [
                    {
                        "com.linkedin.pegasus2avro.schema.SchemaMetadata": {
                            "schemaName": "table_A",
                            "platform": "urn:li:dataPlatform:bigquery",
                            "version": 0,
                            "created": {"time": 0, "actor": "urn:li:corpuser:ingestion"},
                            "lastModified": {"time": 0, "actor": "urn:li:corpuser:ingestion"},
                            "hash": "",
                            "platformSchema": {
                                "com.linkedin.pegasus2avro.schema.OtherSchema": {"rawSchema": ""}
                            },
                            "fields": [
                                {
                                    "fieldPath": "col1",
                                    "nativeDataType": "STRING",
                                    "type": {"type": {"com.linkedin.pegasus2avro.schema.StringType": {}}}
                                },
                                {
                                    "fieldPath": "col2",
                                    "nativeDataType": "INTEGER",
                                    "type": {"type": {"com.linkedin.pegasus2avro.schema.NumberType": {}}}
                                }
                            ]
                        }
                    },
                    {
                        "com.linkedin.pegasus2avro.dataset.DatasetProperties": {
                            "description": "This is table A",
                            "customProperties": {
                                "property1": "value1",
                                "property2": "value2"
                            }
                        }
                    }
                ]
            }
        }
    }
]

As before, an entry for table_B ("This is table B") follows the same pattern in the top-level array.
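In MCE terms, datasetProperties is just one more entry in a snapshot's aspects array, so it can be attached programmatically as well. A hedged stdlib sketch (the aspect class name follows DataHub's serialized-MCE convention; the helper function itself is illustrative):

```python
import json

PROPS_CLASS = "com.linkedin.pegasus2avro.dataset.DatasetProperties"


def dataset_properties_aspect(description: str, custom: dict) -> dict:
    """Build a DatasetProperties aspect entry for an MCE snapshot."""
    return {
        PROPS_CLASS: {
            "description": description,
            "customProperties": custom,
        }
    }


snapshot = {
    "urn": "urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)",
    # A SchemaMetadata aspect would normally sit alongside this one.
    "aspects": [],
}
snapshot["aspects"].append(
    dataset_properties_aspect(
        "This is table A",
        {"property1": "value1", "property2": "value2"},
    )
)
mce = {
    "proposedSnapshot": {
        "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": snapshot
    }
}
print(json.dumps(mce, indent=2))
```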

Example Ingestion Recipe
The ingestion recipe (recipe.yml) remains the same:

source:
  type: "file"
  config:
    path: "./metadata.json"
    file_extension: ".json"
    read_mode: "AUTO"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL
    token: "your-auth-token"  # Replace with your DataHub auth token if required

Running the Ingestion
To run the ingestion, use the DataHub CLI with the following command:

datahub ingest -c recipe.yml

This command will read the metadata from metadata.json and ingest it into your DataHub instance, including the datasetProperties.

Would you like more detailed steps or additional information on any specific part of the process?
