Ingesting Metadata Directly from Code into DataHub without Connecting to Datasource

Original Slack Thread

<@U06TM7M40RK> is it possible to ingest metadata directly from code without connecting to the actual datasource

Yes, it is possible to ingest metadata directly from code without connecting to the actual datasource. This can be achieved by using the DataHub ingestion framework to create and send Metadata Change Proposals (MCPs) directly to DataHub.

Here are the steps to achieve this:

  1. Create Metadata Change Proposals (MCPs): You can create MCPs in your code to represent the metadata you want to ingest.
  2. Send MCPs to DataHub: Use the DataHub ingestion framework to send these MCPs to DataHub via HTTP or Kafka.

For example, you can use the DataHub Python library to construct a MetadataChangeEvent (MCE) and send it with the REST emitter. Below is a basic example of how you might do this:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    MetadataChangeEventClass,
    DatasetSnapshotClass,
    DatasetPropertiesClass,
)

# Initialize the DataHub REST emitter
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Create a dataset URN
dataset_urn = make_dataset_urn(platform="mysql", name="my_dataset", env="PROD")

# Create a MetadataChangeEvent
mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn=dataset_urn,
        aspects=[
            DatasetPropertiesClass(
                description="This is a sample dataset",
                customProperties={"key": "value"}
            )
        ]
    )
)

# Emit the MetadataChangeEvent to DataHub
emitter.emit_mce(mce)

This example demonstrates how to create a dataset metadata change event and send it to DataHub using the REST emitter. You can customize the metadata aspects as needed.

For more detailed information, you can refer to the <https://datahubproject.io/docs/metadata-ingestion/as-a-library|DataHub Metadata Ingestion as a Library> guide.

Would you like more detailed steps or additional examples?


Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

I’m not able to find any data sources to answer your question. Could you provide more details about the specific issue you are facing or the information you are seeking? If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@U06TM7M40RK> the example you shared adds only description and custom properties. Can you share an example to add schema metadata

Sure thing! I’m looking through the available data to answer your question. Hold on a second… :hourglass_flowing_sand:

<@U06TM7M40RK> the example you shared adds only description and custom properties. Can you share an example to add schema metadata using metadata change event

<@U06TM7M40RK> how to create dataset schema using python rest emitter