Uploading Metadata Information from a Database to a File in DataHub

Original Slack Thread

<@U06TM7M40RK> How can I upload a database's metadata information to a file?

To write metadata extracted from a database to a file, you can use DataHub's file sink. This outputs the metadata to a file, which is useful for decoupling metadata extraction from the process of pushing it into DataHub. Here's the general approach:

  1. Set Up the File Sink: You need to configure your ingestion recipe to use the file sink. This will output the metadata to a specified file path.

  2. Create a YAML Configuration: Prepare a YAML configuration file for the ingestion process. Here is an example configuration:

    source:
      type: <your_source_type>
      config:
        <source_specific_config>
    
    sink:
      type: file
      config:
        filename: ./path/to/mce/file.json
    
    • Replace <your_source_type> with the type of your data source (e.g., mysql, postgres).
    • Replace <source_specific_config> with the specific configuration required for your data source.
    • The filename parameter specifies the path to the file where the metadata will be written.
  3. Run the Ingestion: Use the DataHub CLI to run the ingestion process with the specified configuration file:

    datahub ingest -c path/to/your/config.yaml
    

This process will extract metadata from your database and write it to the specified file in JSON format.
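
For example, a filled-in recipe for a MySQL source might look like the following sketch (the host, database name, credentials, and output path are placeholder values to replace with your own):

    source:
      type: mysql
      config:
        host_port: "localhost:3306"        # placeholder host and port
        database: "my_database"            # placeholder database name
        username: "datahub_reader"         # placeholder credentials
        password: "example-password"

    sink:
      type: file
      config:
        filename: "./mysql_metadata.json"  # metadata events are written to this file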

Would you like more detailed steps or examples on any part of this process?

<@U06TM7M40RK> Sorry, I meant how to send it from a file to GMS.

To send metadata from a file to DataHub's GMS (Generalized Metadata Service), you can use the file source in your ingestion recipe. This reads metadata from the file and pushes it to the DataHub GMS. Here's how you can do it:

  1. Prepare the Metadata File: Ensure your metadata file (e.g., metadata.json) is formatted correctly. It should contain metadata in a structure that DataHub can understand, such as datasetProperties and schemaMetadata.

  2. Create an Ingestion Recipe: Prepare a YAML configuration file for the ingestion process. Here is an example configuration:

    source:
      type: file
      config:
        filename: "./metadata.json"
    
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"  # Replace with your DataHub instance URL
        token: "your-auth-token"  # Replace with your DataHub auth token if required
    
    • The filename parameter specifies the path to your metadata file.
    • The server parameter in the sink configuration specifies the URL of your DataHub GMS instance.
  3. Run the Ingestion: Use the DataHub CLI to run the ingestion process with the specified configuration file:

    datahub ingest -c path/to/your/ingestion_recipe.yaml
    

This command will read the metadata from the specified file and send it to your DataHub GMS instance.
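
If you prefer not to hard-code the auth token, DataHub ingestion recipes support environment-variable expansion, so the sink section could reference a variable instead (the variable name here is just an example):

    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"
        token: "${DATAHUB_GMS_TOKEN}"  # resolved from the environment when the recipe runs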

Would you like more detailed steps or additional examples on any specific part of this process?

<@U06TM7M40RK> How should the contents of the file referenced by filename: "./metadata.json" be structured?

To structure the metadata.json file for ingestion into DataHub, you need to ensure it contains the necessary metadata aspects such as datasetProperties and schemaMetadata. Here’s an example of how you can structure your metadata.json file:

[
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
      "description": "This is a sample dataset",
      "customProperties": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "schemaMetadata",
    "aspect": {
      "schemaName": "my-dataset-schema",
      "platform": "urn:li:dataPlatform:mysql",
      "version": 0,
      "created": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "lastModified": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "fields": [
        {
          "fieldPath": "field1",
          "nativeDataType": "VARCHAR",
          "type": {
            "type": "string"
          },
          "description": "This is field1"
        },
        {
          "fieldPath": "field2",
          "nativeDataType": "INT",
          "type": {
            "type": "number"
          },
          "description": "This is field2"
        }
      ]
    }
  }
]

Explanation

  1. datasetProperties:

    • entityType: Specifies the type of entity, which is dataset in this case.
    • entityUrn: The unique identifier for the dataset.
    • changeType: The type of change, which is UPSERT to update or insert the record.
    • aspectName: The name of the aspect, which is datasetProperties.
    • aspect: Contains the properties of the dataset, such as description and customProperties.
  2. schemaMetadata:

    • entityType: Specifies the type of entity, which is dataset.
    • entityUrn: The unique identifier for the dataset.
    • changeType: The type of change, which is UPSERT.
    • aspectName: The name of the aspect, which is schemaMetadata.
    • aspect: Contains the schema metadata, including schemaName, platform, version, hash, platformSchema, created, lastModified, and fields.

This structure ensures that the metadata is correctly formatted for ingestion into DataHub.
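
The file source expects the same serialization that the file sink writes, so a reliable way to get a working template is to run an ingestion with the file sink (as in the first answer) and copy its output. Assuming your CLI version includes the check command, you can also validate a hand-written file before ingesting it:

    datahub check metadata-file ./metadata.json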

Would you like more detailed steps or additional examples on any specific part of this process?

Hi :slight_smile: ! Did this interaction resolve your issue or do you still require assistance ?

Please tag @ RunLLM if you still need further assistance !

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps ! ><