Troubleshooting Data Ingestion Issues with DataHub on GCP

Original Slack Thread

Hi everyone, I’ve deployed DataHub on GCP, and I’m facing some issues with data ingestion. I have data stored in Snowflake, and while the connection is working fine, every time I trigger the ingestion manually, I get 0 assets ingested.

I even tried ingesting a simple CSV file and encountered the same issue. I’m wondering if this might be related to the GCP deployment setup?

Has anyone faced a similar issue or has insights on how to resolve this? Any advice would be greatly appreciated. Thanks in advance!

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g., <@U06TM7M40RK>) to have it continue generating responses!

To address the issue of 0 assets ingested when trying to ingest data from Snowflake and a CSV file into DataHub, there are several potential areas to investigate:

  1. Configuration Check: Ensure that your ingestion configuration is correctly set up. This includes verifying that the Snowflake account ID, warehouse, username, password, and role are correctly specified in your ingestion recipe. For CSV ingestion, ensure that the file path and format are correctly specified.

  2. Role and Permissions: Verify that the Snowflake role used for ingestion has the necessary permissions to access the databases, schemas, and tables you intend to ingest. According to the <https://datahubproject.io/docs/0.14.0/quick-ingestion-guides/snowflake/setup|Snowflake Ingestion Guide>, you need to grant the appropriate usage and select permissions on the databases and schemas.

  3. Ingestion Logs: Check the ingestion logs for any warnings or errors that might indicate why no assets are being ingested. Logs can provide insights into issues such as connectivity problems, permission errors, or misconfigurations.

  4. DataHub Deployment: Since you mentioned deploying DataHub on GCP, ensure that all necessary services are running correctly and that there are no network issues preventing DataHub from accessing Snowflake or the CSV file. You can check the status of your DataHub services using GCP’s monitoring tools.

  5. Test Connection: If the test connection step is hanging or not providing feedback, it might indicate network or configuration issues. Try running the ingestion from a local environment to see if the issue persists, as suggested in a related Slack discussion.

  6. Resource Allocation: Ensure that your GCP deployment has sufficient resources allocated to handle the ingestion process. Insufficient memory or CPU can cause the ingestion process to hang or fail.
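
As a rough illustration of points 1 and 4, here is a minimal Python sketch of a preflight check that could be run from the machine or pod that executes ingestion. The helper names `missing_recipe_fields` and `can_reach` are hypothetical and not part of DataHub. The field list simply mirrors the settings named in point 1, so a recipe that uses a different authentication method would likely need a different list. The recipe is assumed to be loaded as a dict with the usual `source` / `config` layout.

```python
import socket

# Settings that point 1 says to verify in a Snowflake recipe.
# Assumption: the recipe uses username/password authentication.
REQUIRED_SNOWFLAKE_FIELDS = ("account_id", "warehouse", "username", "password", "role")


def missing_recipe_fields(recipe: dict) -> list[str]:
    """Return the required fields that are absent or empty in source.config."""
    config = recipe.get("source", {}).get("config", {})
    return [name for name in REQUIRED_SNOWFLAKE_FIELDS if not config.get(name)]


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False
```

To use it, pass your parsed recipe to `missing_recipe_fields` and call `can_reach` with your Snowflake hostname (typically `<account_id>.snowflakecomputing.com`, though that form is an assumption you should confirm against your account URL). An empty list and a `True` result only rule out the most basic problems; the ingestion logs from point 3 remain the main source of detail.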

Would you like more detailed steps on any of these troubleshooting points or further assistance on a specific part of the process?

<@U06TM7M40RK> tell me more on Resource Allocation?

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps!