Creating a Custom Source for Describing Apps and Updating Metadata Manually

Original Slack Thread

Hi
I am deploying my DataHub test environment and I have ClickHouse clusters, Kafka topics, and developers’ apps that put data into Kafka.

It is not a problem to ingest metadata from ClickHouse or from Kafka, but I wonder if I can create a simple custom source that would describe some app? Maybe there is a way to create some container with metadata written and updated manually?

:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To create a custom source for describing an app and manually updating its metadata, you can indeed use DataHub’s extensible ingestion framework. Here’s a step-by-step guide to help you achieve this:

Steps to Create a Custom Source

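  1. Build a Custom Source:

    • Write a Python class that extends DataHub’s Source base class and yields your app’s metadata; the “Example Code for Custom Source” section below sketches one.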
  2. Package and Install the Custom Source:

    • Package your custom source as a Python package and install it in the environment where you are running the DataHub CLI.
    • Example structure and YAML configuration are provided in the guide.
  3. Ingest Metadata Using the Custom Source:

    • Once your custom source is ready and installed, you can reference it in your ingestion recipe YAML file.
    • Example YAML configuration:
      source:
        type: my-source.custom_ingestion_source.MySourceClass
        config:
          # place for your custom config defined in the configModel
      
  4. Manual Metadata Updates:

    • For manual updates, you can use the DataHub CLI or Python SDK to emit metadata change proposals (MCPs).
    • You can also create JSON-encoded MetadataChangeProposal objects and ingest them using the “file” source.
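
The “file” route in step 4 can be sketched with nothing but the standard library. This is a minimal sketch, not a definitive format reference: the field layout (entityType, entityUrn, changeType, aspectName, and an aspect wrapped under a "json" key) is an assumption based on the shape DataHub commonly writes to metadata files, so check it against a file produced by your own DataHub version. The helper names build_app_mcp and write_mcp_file and the file name my_app_mcps.json are hypothetical.

```python
import json
from pathlib import Path


def build_app_mcp(entity_urn: str, description: str, custom_properties: dict) -> dict:
    """Return one MetadataChangeProposal as a plain dict.

    ASSUMPTION: this field layout mirrors what DataHub's file sink usually
    emits; verify it against your DataHub version before relying on it.
    """
    return {
        "entityType": "dataset",
        "entityUrn": entity_urn,
        "changeType": "UPSERT",
        "aspectName": "datasetProperties",
        "aspect": {
            "json": {
                "description": description,
                "customProperties": custom_properties,
            }
        },
    }


def write_mcp_file(path: Path, mcps: list) -> None:
    # Store the proposals as a single JSON array so they can be edited by hand.
    path.write_text(json.dumps(mcps, indent=2))


if __name__ == "__main__":
    mcp = build_app_mcp(
        "urn:li:dataset:(urn:li:dataPlatform:app,my_app,PROD)",
        "This is my custom app",
        {"owner_team": "example-team"},  # illustrative value only
    )
    write_mcp_file(Path("my_app_mcps.json"), [mcp])
```

The resulting file can be edited by hand and re-ingested with a recipe whose source type is the “file” source, which keeps manual updates to a text edit plus one ingestion run.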

Example Code for Custom Source

Here’s a simplified example of what your custom source class might look like:

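from datahub.configuration.common import ConfigModel
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.api.source import Source, SourceReport
from datahub.ingestion.api.workunit import MetadataWorkUnit
from datahub.metadata.schema_classes import DatasetPropertiesClass


class MySourceConfig(ConfigModel):
    # Fields here map to the `config:` block of the recipe.
    # `app_name` is an illustrative field, not part of DataHub.
    app_name: str = "my_app"


class MySourceClass(Source):
    def __init__(self, config: MySourceConfig, ctx: PipelineContext):
        super().__init__(ctx)
        self.config = config
        self.report = SourceReport()

    @classmethod
    def create(cls, config_dict, ctx):
        config = MySourceConfig.parse_obj(config_dict)
        return cls(config, ctx)

    # Note: newer DataHub releases prefer overriding get_workunits_internal.
    def get_workunits(self):
        # Describe the app with an aspect, then wrap it in a work unit;
        # the pipeline expects work units rather than bare metadata events.
        mcp = MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn(
                platform="app", name=self.config.app_name, env="PROD"
            ),
            aspect=DatasetPropertiesClass(description="This is my custom app"),
        )
        yield MetadataWorkUnit(id=f"app-{self.config.app_name}", mcp=mcp)

    def get_report(self):
        # Return the same report object each time so counters persist.
        return self.report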

Ingesting Metadata Manually

You can also manually create and ingest metadata using the Python SDK:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Initialize the emitter (point gms_server at your DataHub GMS endpoint)
emitter = DatahubRestEmitter(gms_server='http://localhost:8080')

# Create a MetadataChangeProposal. In recent releases the wrapper fills in
# the entity type, change type, and aspect name, which a bare
# MetadataChangeProposalClass would require you to set by hand.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform='app', name='my_app', env='PROD'),
    aspect=DatasetPropertiesClass(
        description='This is my custom app',
        customProperties={'key': 'value'}
    )
)

# Emit the MCP
emitter.emit_mcp(mcp)
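
To find the entity afterwards, it helps to know what make_dataset_urn returns: a plain string. The sketch below mirrors that string with a hypothetical stand-in function, dataset_urn, so it runs without DataHub installed. It only illustrates the URN layout and does not replace the SDK helper.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Hypothetical stand-in that mirrors the string make_dataset_urn builds."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# The platform name ("app") and dataset name ("my_app") are free-form labels
# chosen in this thread, not built-in DataHub identifiers.
print(dataset_urn("app", "my_app"))
# prints: urn:li:dataset:(urn:li:dataPlatform:app,my_app,PROD)
```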

well… this seems too complex

is there a way to just “draw” a rectangle in my lineage map and call it my_app or something?