The job is stuck; it is running without any data being ingested.
This version of datahub supports report-to functionality
exec datahub ingest run -c /tmp/datahub/ingest/5b09a6e6-032a-4f96-8864-7e59bdbdbbec/recipe.yml --report-to /tmp/datahub/logs/5b09a6e6-032a-4f96-8864-7e59bdbdbbec/artifacts/ingestion_report.json
[2025-01-07 09:26:58,164] INFO {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.1.2
[2025-01-07 09:26:58,252] INFO {datahub.ingestion.run.pipeline:239} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-datahub-gms:8080
/tmp/datahub/ingest/venv-dbt-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/dbt/dbt_core.py:387: ConfigurationWarning: We recommend using platform_instance instead of env. While specifying env does still work, we intend to deprecate it in the future.
config = DBTCoreConfig.parse_obj(config_dict)
[2025-01-07 09:26:59,051] INFO {datahub.ingestion.source.state.stateful_ingestion_base:241} - Stateful ingestion will be automatically enabled, as datahub-rest sink is used or datahub_api is specified
[2025-01-07 09:26:59,058] INFO {datahub.ingestion.run.pipeline:256} - Source configured successfully.
[2025-01-07 09:26:59,058] INFO {datahub.cli.ingest_cli:128} - Starting metadata ingestion
To address the issue of your DBT ingestion job being stuck without any data being ingested, here are some steps and checks you can perform:
Check Recipe Configuration: Ensure that your ingestion recipe is correctly configured, since misconfigurations can cause ingestion to fail. Verify that all paths and settings in your recipe are correct, especially the paths to your DBT artifacts (e.g., manifest.json and catalog.json).
Verify AWS S3 Access: Ensure that your DataHub instance has the necessary permissions to access the AWS S3 bucket where your DBT artifacts are stored. This includes checking IAM roles and policies.
Inspect Logs for Errors: Check the detailed logs for any errors or warnings that might indicate what is going wrong. You can view the output logs by clicking DETAILS on the corresponding ingestion run in the UI.
Network and Connectivity: Ensure that there are no network issues preventing DataHub from accessing the S3 bucket. This includes checking for DNS resolution issues or network firewalls.
Stateful Ingestion: The logs indicate that stateful ingestion is enabled. Ensure that the stateful ingestion configuration is correct and that there are no issues with the state store.
Version Compatibility: Ensure that the versions of DataHub and the DBT ingestion plugin you are using are compatible. Sometimes, upgrading to a newer version of DataHub can resolve issues.
Use platform_instance Instead of env: The warning in the logs suggests using platform_instance instead of env. Update your configuration to use platform_instance to avoid potential deprecation issues.
Run with Debug Logs: Running the DataHub CLI with debug logs enabled can provide more insights into what might be going wrong. Use the --debug flag to enable detailed logging:
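# The recipe path below is a placeholder; substitute the recipe used for this run
# (for UI-based runs it is the path shown in the log output above)
datahub --debug ingest run -c /path/to/recipe.yml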
Your recipe for ingesting DBT metadata from AWS S3 into DataHub seems mostly correct, but there are a few things to verify and consider:
AWS Credentials: Ensure that the environment variables AWS_ACCESS_KEY_S3_DBT_STAGING and AWS_SECRET_KEY_S3_DBT_STAGING are correctly set in your environment where the ingestion is running. These should be accessible to the DataHub ingestion process.
AWS Connection Configuration: The aws_connection configuration seems correct, but make sure that the IAM user associated with these credentials has the necessary permissions to access the specified S3 bucket and objects.
Use of env: The logs you provided earlier mentioned a warning about using platform_instance instead of env. Consider replacing env: PROD with platform_instance: PROD to avoid potential deprecation issues.
S3 Path Format: Ensure that the S3 paths are correctly formatted and accessible. The paths should not include angle brackets (< and >). They should be specified as plain strings, like s3://abc/manifest.json.
Meta Mapping: The meta_mapping configuration looks correct for adding a tag based on the presence of PII. Ensure that the DBT metadata actually contains the contains_pii field for this mapping to work.
Debugging: If the job is still stuck, consider running the ingestion with debug logs enabled (for example, using the --debug flag shown above) to get more detailed information about what might be going wrong. A sketch of what the full recipe could look like is shown below.
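A minimal sketch, assuming a dbt-on-S3 setup like the one described in this thread: the bucket paths, region, and target_platform are placeholders, the contains_pii property only applies if it actually exists in your dbt meta, and the secret names and GMS address are taken from this thread and the logs above.

source:
  type: dbt
  config:
    # Plain S3 paths, no angle brackets (placeholders)
    manifest_path: "s3://your-bucket/dbt/manifest.json"
    catalog_path: "s3://your-bucket/dbt/catalog.json"
    target_platform: snowflake        # assumption: set to your warehouse platform
    platform_instance: PROD           # instead of env, per the ConfigurationWarning above
    aws_connection:
      aws_access_key_id: "${AWS_ACCESS_KEY_S3_DBT_STAGING}"
      aws_secret_access_key: "${AWS_SECRET_KEY_S3_DBT_STAGING}"
      aws_region: eu-west-1           # assumption: set to your bucket's region
    stateful_ingestion:
      enabled: true
    meta_mapping:
      contains_pii:                   # only works if your dbt meta defines this property
        match: True
        operation: add_tag
        config:
          tag: pii
sink:
  type: datahub-rest
  config:
    server: http://datahub-datahub-gms:8080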
To enable debug logging in the DataHub UI, you can follow these steps:
Advanced Debug Mode: If you are using the DataHub UI, there might be an option to enable an advanced or debug mode directly from the UI. This can often be found in the settings or configuration section of the UI. However, the specific steps to enable this mode are not detailed in the provided data sources.
Environment Variable: You can set the DATAHUB_DEBUG environment variable to true to enable debug logging for DataHub components. This can be done by modifying the environment variables for the DataHub services running in your environment. For example, if you are using Docker, you can set this environment variable in your Docker configuration.
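For example, with a docker-compose deployment an override sketch could look like this (the service names are assumptions; match them to your compose file):

# docker-compose.override.yml -- service names are illustrative
services:
  datahub-gms:
    environment:
      - DATAHUB_DEBUG=true
  datahub-actions:
    environment:
      - DATAHUB_DEBUG=true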
Helm Charts: If you are deploying DataHub using Helm charts, you can modify the Helm values file to include the DATAHUB_DEBUG environment variable for the relevant DataHub components. Here is an example configuration for the GMS component:
gms:
  env:
    - name: DATAHUB_DEBUG
      value: "true"
After modifying the values file, apply the changes using the Helm upgrade command:
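# release name, chart reference, values file, and namespace are placeholders for your deployment
helm upgrade --install datahub datahub/datahub --values values.yaml -n datahub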
Accessing Logs: Once debug mode is enabled, you may need to access the logs directly from the container or server where DataHub is running. Debug logs might not be redirected to stdout by default, so you may need to check the log files directly, typically found at /tmp/datahub/logs.
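If DataHub runs on Kubernetes, one way to look at those files is to exec into the actions pod; the pod name, namespace, and run directory below are placeholders:

kubectl exec -it <datahub-actions-pod> -n <namespace> -- ls /tmp/datahub/logs
kubectl exec -it <datahub-actions-pod> -n <namespace> -- tail -f /tmp/datahub/logs/<run-id>/<log-file>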
For more detailed instructions or if you have specific questions about this process, please let me know.
To run the ingestion process in DataHub, you should use the acryl-datahub-actions pod. This pod is specifically responsible for executing ingestion jobs, including metadata crawling and ingestion tasks.
According to the <Deploying to AWS | DataHub documentation>, the acryl-datahub-actions pod powers UI-based ingestion and can be configured to handle different ingestion recipes. It is the primary pod involved in running ingestion tasks within a Kubernetes cluster.
If you are using the DataHub UI to set up and run ingestion jobs, the datahub-actions pod will be the one executing these jobs. Ensure that this pod is running and healthy to successfully perform ingestion tasks.
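For example, a quick health check on Kubernetes could look like this (the pod name and namespace depend on your Helm release and are placeholders):

kubectl get pods -n <namespace> | grep actions
kubectl logs -f <acryl-datahub-actions-pod> -n <namespace>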
Would you like more details on configuring the acryl-datahub-actions pod for ingestion or any other specific aspect of DataHub’s ingestion process?
Sources:
<Slack Conversation on DataHub Ingestion Pods>
<@U06TM7M40RK> WARNING: These logs appear to be stale. No new logs have been received since 2025-01-07 09:39:43.350516 (168 seconds ago). However, the ingestion process still appears to be running and may complete normally.
The warning message you are seeing indicates that the logs for your ingestion process have not been updated for a certain period (168 seconds in this case), but the ingestion process is still running. This situation can occur for several reasons:
Network Latency or Delays: There might be network latency or delays in log transmission, which can cause the logs to appear stale even though the process is still running.
Ingestion Process is Busy: The ingestion process might be busy processing a large amount of data or performing complex operations, which can temporarily halt log updates.
Log Buffering: Logs might be buffered and not immediately flushed to the output, causing a delay in log updates.
Resource Constraints: If the system running the ingestion process is under heavy load or has limited resources (CPU, memory), it might slow down the logging process.
Here are a few things you can do in this situation:
Check System Resources: Ensure that the system running the ingestion process has sufficient resources (CPU, memory) to handle the workload. You can monitor system resources to identify any bottlenecks (see the example after this list).
Monitor Progress: You can monitor the progress of the ingestion process through the DataHub UI. If new assets are becoming searchable, it indicates that the ingestion is making progress.
Cancel and Restart: If the process seems stuck for an extended period, you might consider canceling and restarting the ingestion. According to the <https://github.com/datahub-project/datahub/blob/master/docs/ui-ingestion.md|DataHub UI Ingestion documentation>, you can cancel a hanging ingestion run by clicking Cancel on the problematic run.
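For example, on Kubernetes a quick check of the actions pod's resource usage might look like this (requires the metrics-server; the pod name and namespace are placeholders):

kubectl top pod <acryl-datahub-actions-pod> -n <namespace>
kubectl describe pod <acryl-datahub-actions-pod> -n <namespace>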
Would you like more detailed guidance on any of these points or further assistance on a specific part of the process?