Troubleshooting Stateful Ingestion Exception in CLI

Original Slack Thread

Hi <@U06TM7M40RK>

Getting the exception below when running stateful ingestion from the CLI:

    raise PipelineInitError(f"Failed to {step}: {e}") from e
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (mysql): pipeline_name must be provided if stateful ingestion is enabled.

The error you’re encountering indicates that the pipeline_name is missing in your configuration file, which is required when stateful ingestion is enabled. The pipeline_name is a mandatory field that uniquely identifies the ingestion pipeline and is used to save and retrieve checkpoint states.
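To illustrate why a stable name matters, here is a toy sketch of a checkpoint store keyed by pipeline name. It is not DataHub's implementation: `CheckpointStore`, `save`, `load_previous`, and `stale_entities` are hypothetical names invented for this example. The only point is that a later run can find the previous run's state when both runs use the same key.

```python
from typing import Dict, Optional, Set


class CheckpointStore:
    """Toy in-memory store; hypothetical, not part of DataHub."""

    def __init__(self) -> None:
        self._state: Dict[str, Set[str]] = {}

    def save(self, pipeline_name: str, seen_urns: Set[str]) -> None:
        # State is filed under the pipeline name, so the name must be stable.
        self._state[pipeline_name] = set(seen_urns)

    def load_previous(self, pipeline_name: str) -> Optional[Set[str]]:
        # Returns None when no earlier run used this name.
        return self._state.get(pipeline_name)


def stale_entities(previous: Optional[Set[str]], current: Set[str]) -> Set[str]:
    # Entities seen in the last run but missing from this one.
    return set() if previous is None else previous - current
```

Without a name there is no key to look up, which is consistent with the configuration being rejected before the run starts.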

According to the <https://datahubproject.io/docs/0.13.1/metadata-ingestion/docs/dev_guides/stateful|Stateful Ingestion documentation>, pipeline_name must be provided in your YAML recipe when stateful ingestion is enabled. A GitHub issue in datahub-project/datahub and a Slack discussion confirm this.

Here is an example of how you can modify your YAML recipe to include the pipeline_name:

source:
  type: "mysql"
  config:
    username: <user_name>
    password: <password>
    host_port: <host_port>
    database: <database_name>
    # Rest of the source specific params ...
    stateful_ingestion:  # nested under source.config
      enabled: true
      remove_stale_metadata: true
pipeline_name: "my_mysql_pipeline_1"  # Top-level key; mandatory when stateful ingestion is enabled

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
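
A quick pre-flight check can catch a missing name before the CLI is run. The sketch below assumes the recipe has already been parsed into a plain dict and that stateful_ingestion is nested under source.config. `check_recipe` is a hypothetical helper, not a DataHub function; it only mirrors the rule stated in the error message.

```python
from typing import Any, Dict, List


def check_recipe(recipe: Dict[str, Any]) -> List[str]:
    """Hypothetical pre-flight check; not a DataHub function."""
    problems: List[str] = []
    source = recipe.get("source") or {}
    config = source.get("config") or {}
    # Assumption: stateful_ingestion is nested under source.config.
    stateful = config.get("stateful_ingestion") or {}
    if stateful.get("enabled") and not recipe.get("pipeline_name"):
        problems.append(
            "pipeline_name must be provided if stateful ingestion is enabled"
        )
    return problems


# Recipe as a dict, e.g. the result of parsing the YAML file.
recipe = {
    "source": {
        "type": "mysql",
        "config": {"stateful_ingestion": {"enabled": True}},
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}

for problem in check_recipe(recipe):
    print(problem)
```

Adding a top-level pipeline_name to the dict, as in the recipe, makes the check return an empty list.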

Would you like more details on configuring stateful ingestion or any other aspect of DataHub?


Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?



Hi there! I’d be happy to help you with your DataHub questions. To assist you more effectively, could you please provide the following information?

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using (e.g., 0.12.0)?
  3. What data source(s) are you integrating with DataHub (e.g., BigQuery, PostgreSQL)?

This information will help me provide you with the most relevant guidance.


<@U06TM7M40RK>

  1. Using CLI ingestion
  2. v0.13.3
  3. MySQL

Sure thing! I’m looking through the available data to answer your question. Hold on a second… :hourglass_flowing_sand:

<@U06TM7M40RK>
Any solutions?