Troubleshooting Data Lineage in DataHub Spark Listener Configuration

Original Slack Thread

<@U06TM7M40RK> help me with 1:*

I’m here to help with questions specifically about DataHub. Could you please provide more details or rephrase your question so I can assist you more effectively? If you have any specific issues or topics related to DataHub, feel free to mention them!

<@U06TM7M40RK> how to filter 1:* in data lineage?

example

silver dataset:
a
b
c
d

gold dataset:
A
B
E
F
G

but my lineage shows

a -> A, B, E, F, G
b -> A, B, E, F, G

why is this happening and how do I solve it?

To address the issue of incorrect lineage where each dataset in the “silver” project is showing lineage to all datasets in the “gold” project, we need to ensure that the lineage is being ingested and represented correctly. Here are some steps and considerations to help troubleshoot and resolve this issue:

  1. Check Ingestion Configuration:
    Ensure that your ingestion configurations for both the “silver” and “gold” projects are correctly set up to include accurate lineage information. Verify that the lineage is being captured correctly in your dbt or other ETL configurations.

  2. Enable Stateful Ingestion:
    Stateful ingestion keeps state between ingestion runs and can remove stale metadata, such as outdated lineage edges, rather than letting incorrect edges persist. Ensure that stateful ingestion is enabled in your configuration; a minimal sketch follows this list.

  3. Ingest Both Projects Together:
    Instead of ingesting the “silver” and “gold” projects separately, try ingesting them together in a single pipeline. This can help ensure that the lineage information is correctly merged and not duplicated.

  4. Check Logs for Errors:
    Review the logs from the ingestion process to identify any errors or warnings that might indicate why lineage information is being incorrectly displayed. Logs can provide valuable insights into what might be going wrong.

  5. Use the DataHub CLI to Inspect Lineage:
    Use the DataHub CLI to inspect the lineage information for specific datasets, for example: datahub get --urn '<dataset urn>' --aspect upstreamLineage. This can help you determine whether the lineage is being ingested correctly but not displayed properly in the UI.

  6. Consider Version Compatibility:
    Ensure that the version of DataHub you are using supports the features you need. Some features might not be available or fully functional in older versions.
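
As a minimal sketch of step 2, here is a hypothetical recipe run through the DataHub Python Pipeline API. It assumes a dbt source and a datahub-rest sink; the pipeline name, file paths, platform, and server address are all placeholders to adapt to your setup:

    from datahub.ingestion.run.pipeline import Pipeline

    # Hypothetical recipe: adjust the source type, file paths, platform,
    # and server address to match your environment.
    pipeline = Pipeline.create(
        {
            # A stable pipeline_name is required for stateful ingestion.
            "pipeline_name": "silver_gold_dbt_ingestion",
            "source": {
                "type": "dbt",
                "config": {
                    "manifest_path": "/path/to/manifest.json",
                    "catalog_path": "/path/to/catalog.json",
                    "target_platform": "postgres",  # assumption: your warehouse platform
                    # Keeps state between runs and soft-deletes stale metadata,
                    # such as lineage edges that no longer exist.
                    "stateful_ingestion": {
                        "enabled": True,
                        "remove_stale_metadata": True,
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()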

If you need more detailed steps or examples for any specific part of this process, please let me know!

<@U06TM7M40RK> My process uses the Spark listener with the following configuration

(ignore the commented-out lines)

    'spark.extraListeners': 'datahub.spark.DatahubSparkListener',
    'spark.datahub.rest.server': 'http://10.252.1.91:32406',
    # 'spark.extraListeners': 'io.openlineage.spark.agent.OpenLineageSparkListener',
    # 'spark.openlineage.transport.type': 'http',
    # 'spark.openlineage.transport.url': f'http://{__IP_ADDRESS_SERVER}:{__PORT_MARQUEZ_API}',
    # 'spark.openlineage.host': f'http://{__IP_ADDRESS_SERVER}:{__PORT_MARQUEZ_API}',
    # 'spark.openlineage.transport.endpoint': '/api/v1/lineage',
    # 'spark.openlineage.namespace': namespace_openlineage,  # NAMESPACE NAMED AFTER THE DATA SOURCE, e.g. bemtevi-tio, protheus, ...
    'spark.datahub.flow_name': app,  # APPLICATION NAMED AFTER THE INDICATOR / TABLE
    # 'spark.datahub.metadata.remove_partition_pattern': '^1521/',
    'spark.datahub.metadata.include_scheme': 'false',
    'spark.datahub.metadata.dataset.materialize': 'true',
    'spark.datahub.metadata.dataset.experimental_include_schema_metadata': 'true',
    'spark.datahub.lineage.captureColumnLevel': 'true',
    'spark.datahub.coalesce_jobs': 'true',
    'spark.datahub.rest.token': __DATAHUB_TOKEN

To ensure that your Spark application is correctly configured to use the DataHub Spark Listener for capturing lineage, let’s review the configuration settings you’ve provided and address any potential issues:

  1. Configuration Review:

    • spark.extraListeners: This is correctly set to datahub.spark.DatahubSparkListener, which is necessary for capturing lineage with DataHub.
    • spark.datahub.rest.server: Ensure that the URL is correct and accessible from your Spark environment. This is the endpoint where lineage data will be sent.
    • spark.datahub.flow_name: This should be set to a meaningful name for your application or data flow.
    • spark.datahub.metadata.include_scheme: Setting this to false is recommended to avoid including the scheme in dataset URNs.
    • spark.datahub.metadata.dataset.materialize: Setting this to true ensures that datasets are materialized in DataHub.
    • spark.datahub.metadata.dataset.experimental_include_schema_metadata: This should be true if you want to include schema metadata.
    • spark.datahub.lineage.captureColumnLevel: Setting this to true enables column-level lineage capture.
    • spark.datahub.coalesce_jobs: Setting this to true coalesces all Spark jobs in the application into a single data job. Note that this merges the inputs and outputs of every job in the run, which can make each input appear linked to every output and may explain the 1:* lineage described earlier.
    • spark.datahub.rest.token: Ensure that this token is valid and has the necessary permissions to send data to DataHub.
  2. Troubleshooting Steps:

    • Verify Network Connectivity: Ensure that the Spark environment can reach the DataHub server at the specified URL; a minimal connectivity check follows this list.
    • Check Logs: Review the Spark application logs for any errors or warnings related to the DataHub listener; these can point to configuration or connectivity problems.
    • Token Validation: Confirm that the token provided in spark.datahub.rest.token is correct and has the necessary permissions.
    • Version Compatibility: Ensure that the versions of Spark and the DataHub Spark Lineage library are compatible. Refer to the <https://datahubproject.io/docs/0.14.0/metadata-integration/java/acryl-spark-lineage|DataHub Spark Lineage documentation> for details.
  3. Additional Considerations:

    • If you are using other lineage tools like OpenLineage, ensure that there are no conflicts in the configuration. Since those lines are commented out, they should not interfere, but it’s good to be aware of potential conflicts if they are uncommented in the future.
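
As a minimal sketch of the first two troubleshooting bullets, the snippet below verifies connectivity and token acceptance before launching the Spark job. It assumes the server URL from your configuration and uses a placeholder token:

    from datahub.emitter.rest_emitter import DatahubRestEmitter

    # Assumption: the same server URL and token as in the Spark configuration above.
    emitter = DatahubRestEmitter(
        gms_server="http://10.252.1.91:32406",
        token="<your DataHub token>",
    )

    # test_connection() raises an exception if the server is unreachable
    # or the token is rejected, so this fails fast with a clear error.
    emitter.test_connection()
    print("DataHub GMS is reachable and the token was accepted.")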

Would you like more detailed steps or further assistance with specific configurations?

<@U06TM7M40RK> now you know my configuration, show me how to fix the first problem

To address the issue where your lineage shows each dataset in the “silver” project linking to all datasets in the “gold” project, we need to ensure that the lineage is being captured accurately. Here are some steps to help resolve this:

  1. Verify Column-Level Lineage Configuration:

    • Ensure that spark.datahub.lineage.captureColumnLevel is set to true, which you have already done. This setting is crucial for capturing detailed lineage at the column level.
  2. Check DataHub Spark Listener Configuration:

    • Make sure that the datahub-spark-lineage library is correctly configured and that your operations are supported for column-level lineage extraction. This involves ensuring that the correct version of the library is being used and that all necessary configurations are set.
  3. Inspect Lineage Emission:

    • Use the DataHub Python API to programmatically emit metadata for the input datasets if they are not automatically captured. This can help ensure that the correct lineage is being emitted.

    Here’s an example of how to emit upstream lineage for an output dataset using the Python SDK:

    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.dataset import Upstream, DatasetLineageType, UpstreamLineage
    
    # Define dataset URNs
    def datasetUrn(tbl):
        return builder.make_dataset_urn("postgres", tbl)
    
    # Define upstream dataset
    upstream = Upstream(
        dataset=datasetUrn("input_table"),
        type=DatasetLineageType.TRANSFORMED,
    )
    
    # Create upstream lineage
    upstreamLineage = UpstreamLineage(
        upstreams=[upstream],
    )
    
    # Create metadata change proposal
    lineageMcp = MetadataChangeProposalWrapper(
        entityUrn=datasetUrn("output_table"),
        aspect=upstreamLineage,
    )
    
    # Create an emitter to the GMS REST API
    emitter = DatahubRestEmitter("http://localhost:8080")
    
    # Emit metadata
    emitter.emit_mcp(lineageMcp)
    
  4. Review and Adjust Lineage Logic:

    • Ensure that the logic in your Spark jobs correctly reflects the lineage you expect. If transformations or joins are not accurately represented, it could lead to incorrect lineage.
  5. Check for Overlapping Configurations:

    • If you have other lineage tools configured (e.g., OpenLineage), ensure that they are not interfering with DataHub’s lineage capture. Since those configurations are commented out, they should not affect the current setup.
  6. Inspect DataHub UI:

    • Use the DataHub UI to verify the lineage and make any necessary manual adjustments. The UI allows you to edit lineage so that it accurately reflects your data flows; a programmatic way to inspect the stored lineage is sketched below.
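
To complement point 6, here is a minimal sketch that fetches the upstreamLineage aspect stored in DataHub for one gold dataset, so you can confirm which upstreams were actually ingested. The server URL matches your configuration; the dataset URN is a placeholder to replace:

    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import UpstreamLineageClass

    # Assumption: the same GMS endpoint (and token, if required) as in the
    # Spark configuration above.
    graph = DataHubGraph(DatahubClientConfig(server="http://10.252.1.91:32406"))

    # Placeholder URN: replace with the URN of one of your gold datasets.
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,gold.A,PROD)"

    lineage = graph.get_aspect(entity_urn=dataset_urn, aspect_type=UpstreamLineageClass)
    if lineage is None:
        print("No upstreamLineage aspect found for this dataset.")
    else:
        for upstream in lineage.upstreams:
            print(upstream.dataset, upstream.type)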

Would you like more detailed steps or further assistance with specific configurations?

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><