Understanding the acryl-datahub-dagster-plugin and the dagster-datahub integration for using DataHub with Dagster

Original Slack Thread

Does anyone know the difference between the acryl-datahub-dagster-plugin (from DataHub) and the dagster-datahub integration (from Dagster)? Are they both required to integrate DataHub with Dagster, or are they independent ways of doing the same thing? I ask because I followed the documentation to install the acryl plugin, and the sensor is firing, but it does not seem to do anything when assets are materialized. This leads me to believe that I either have to add additional DataHub metadata to the asset definitions or need something else to make this work. Has anyone else tried this integration successfully?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


To address your questions:

Difference Between acryl-datahub-dagster-plugin and dagster-datahub Integration

  1. acryl-datahub-dagster-plugin:

    • This plugin is developed and maintained by the DataHub team.
    • It provides a Dagster sensor that emits metadata to DataHub after every Dagster run.
    • The sensor can capture asset materialization, input, and output events, and can be configured to extract custom asset lineage information (a minimal setup sketch follows this list).
    • For setup and configuration, you can refer to the <https://github.com/datahub-project/datahub/blob/master/docs/lineage/dagster.md|DataHub Dagster Integration Guide>.
  2. dagster-datahub Integration:

    • This integration is maintained by the Dagster team.
    • It allows you to emit metadata to DataHub, including pipeline metadata, job and op run information, and lineage information.
    • The integration is designed to work seamlessly with Dagster’s native features, such as sensors and schedules.
    • For more details, you can check the <https://datahubproject.io/docs/0.13.1/lineage/dagster|Dagster Integration Documentation>.
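
For reference, wiring the DataHub sensor into a Dagster code location looks roughly like the sketch below. It follows the DataHub integration guide linked above; the server URL is a placeholder, and your Definitions would also include your assets and jobs.

    from dagster import Definitions
    from datahub.ingestion.graph.client import DatahubClientConfig
    from datahub_dagster_plugin.sensors.datahub_sensors import (
        DatahubDagsterSourceConfig,
        make_datahub_sensor,
    )

    # Point the sensor at your DataHub GMS endpoint (placeholder URL).
    config = DatahubDagsterSourceConfig(
        datahub_client_config=DatahubClientConfig(server="http://localhost:8080"),
    )

    datahub_sensor = make_datahub_sensor(config=config)

    # Register the sensor in your code location so the Dagster daemon
    # evaluates it after each run completes.
    defs = Definitions(sensors=[datahub_sensor])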

Are Both Required?

Both packages aim at the same goal of integrating Dagster with DataHub, but they are maintained by different teams and take different approaches: the acryl-datahub-dagster-plugin captures run metadata automatically via a sensor, while dagster-datahub primarily provides emitter resources for emitting metadata explicitly from your ops (see the sketch below). You do not need both; choose one based on your requirements and the features you need.
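
By contrast with the sensor-based plugin, the dagster-datahub package exposes DataHub emitter resources that you call explicitly from your ops or assets; nothing is captured automatically. Here is a rough sketch, assuming the datahub_rest_resource from that package yields a DatahubRestEmitter and accepts a connection URL in its config (check the package docs for the exact resource names and config schema in your installed version):

    from dagster import job, op
    from dagster_datahub import datahub_rest_resource
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    @op(required_resource_keys={"datahub"})
    def emit_dataset_properties(context):
        # Emit a single metadata aspect by hand (dataset name is a placeholder).
        context.resources.datahub.emit(
            MetadataChangeProposalWrapper(
                entityUrn=make_dataset_urn(platform="hive", name="my_table", env="PROD"),
                aspect=DatasetPropertiesClass(description="Set from a Dagster op"),
            )
        )

    @job(
        resource_defs={
            # "connection" is assumed to be the GMS URL config key; verify it
            # against your installed dagster-datahub version.
            "datahub": datahub_rest_resource.configured(
                {"connection": "http://localhost:8080"}
            )
        }
    )
    def emit_metadata_job():
        emit_dataset_properties()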

Issues with Asset Materialization

If the sensor is firing but not capturing asset materialization events, you might need to ensure that the asset definitions include the necessary metadata. Here are a few steps you can take:

  1. Verify Configuration:

    • Ensure that the capture_asset_materialization option is set to True in the sensor config.
    • Check that platform_instance and the other relevant options are set correctly (a configuration sketch covering these options follows this list).
  2. Custom Asset Lineage:

    • You can implement custom logic to extract asset lineage by passing an asset_lineage_extractor callback to the sensor config. For example:
    from typing import Dict

    from dagster import RunStatusSensorContext

    from datahub.ingestion.graph.client import DataHubGraph
    from datahub_dagster_plugin.client.dagster_generator import (
        DagsterGenerator,
        DatasetLineage,
    )

    def asset_lineage_extractor(
        context: RunStatusSensorContext,
        dagster_generator: DagsterGenerator,
        graph: DataHubGraph,
    ) -> Dict[str, DatasetLineage]:
        dataset_lineage: Dict[str, DatasetLineage] = {}

        # Populate dataset_lineage here by extracting the run's input and
        # output assets from the context before returning it.
        return dataset_lineage
    
  3. Debugging:

    • Check the Dagster daemon logs for any DataHub-related log messages.
    • Ensure that the DataHub GMS service is up and reachable; the sketch below includes a simple connectivity check.
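
Putting these pieces together, here is a hedged sketch of a sensor configuration that enables materialization capture, sets a platform instance, plugs in the asset_lineage_extractor defined above, and sanity-checks GMS connectivity. Field names follow the DataHub integration guide; the URLs and instance name are placeholders.

    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub_dagster_plugin.sensors.datahub_sensors import (
        DatahubDagsterSourceConfig,
        make_datahub_sensor,
    )

    config = DatahubDagsterSourceConfig(
        datahub_client_config=DatahubClientConfig(server="http://localhost:8080"),
        platform_instance="my_instance",        # placeholder instance name
        capture_asset_materialization=True,     # emit asset materialization events
        capture_input_output=True,              # emit input/output (lineage) events
        asset_lineage_extractor=asset_lineage_extractor,  # callback from step 2
    )

    datahub_sensor = make_datahub_sensor(config=config)

    # Quick GMS connectivity check: test_connection() fails loudly if the
    # server is unreachable or misconfigured.
    DataHubGraph(DatahubClientConfig(server="http://localhost:8080")).test_connection()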

Would you like more detailed steps or specific code examples for any of these points?
