<@U06TM7M40RK> Trying to ingest CSV data, and getting the following error:
```
for wu in itertools.islice(
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 147, in auto_workunit_reporter
for wu in stream:
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 245, in auto_browse_path_v2
for urn, batch in _batch_workunits_by_urn(stream):
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 460, in _batch_workunits_by_urn
for wu in stream:
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 360, in auto_fix_duplicate_schema_field_paths
for wu in stream:
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 164, in auto_materialize_referenced_tags_terms
for wu in stream:
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 102, in auto_status_aspect
for wu in stream:
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/source/csv_enricher.py", line 647, in get_workunits_internal
if not row["resource"]:
KeyError: 'resource'
[2025-01-02 03:15:33,613] ERROR {datahub.ingestion.run.pipeline:739} - Pipeline Error: Ingestion pipeline raised an unexpected exception!Pipeline Error: Ingestion pipeline raised an unexpected exception!: 'resource'
[2025-01-02 03:15:33,615] INFO {datahub.ingestion.reporting.file_reporter:54} - Wrote FAILURE report successfully to <_io.TextIOWrapper name='/tmp/datahub/logs/abcdbbff-cf25-4247-a7c5-912d89a18358/artifacts/ingestion_report.json' mode='w' encoding='UTF-8'>
[2025-01-02 03:15:33,618] INFO {datahub.cli.ingest_cli:143} - Finished metadata ingestion
Pipeline finished with at least 1 failures; produced 0 events in 0.08 seconds.
```
The error you’re encountering, `KeyError: 'resource'`, means that the code tried to read a key named "resource" from a dictionary (a parsed row of your CSV file), but that key does not exist. The failure occurs in the `get_workunits_internal` method of `csv_enricher.py`.
Possible Causes:
1. CSV File Format: The CSV file you are using might not have a column named `resource`. This column is mandatory for the csv-enricher to function correctly.
2. CSV File Encoding: CSV files saved from editors like Excel can carry a byte-order mark (BOM) or hidden characters; a BOM, for example, turns the first header into `\ufeffresource`, so the lookup for `resource` fails.
3. Incorrect CSV Configuration: The ingestion recipe might not be set up correctly to read the CSV file, for example pointing at the wrong file or using a delimiter that does not match it.
Steps to Resolve:
1. Check CSV File: Ensure that your CSV file has a column named `resource`. Here is an example of a correctly formatted CSV file:
```
resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain
"urn:li:dataset:(urn:li:dataPlatform:snowflake,datahub.growth.users,PROD)",,[urn:li:glossaryTerm:Users],[urn:li:tag:HighQuality],[urn:li:corpuser:lfoe|urn:li:corpuser:jdoe],TECHNICAL_OWNER,description for users table,urn:li:domain:Engineering
```
2. CSV Configuration in Recipe: Ensure that your ingestion recipe correctly points to the CSV file and uses the delimiter the file actually uses. A minimal source block is sketched below; a complete recipe appears later in this thread.
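In this sketch the filename is a placeholder, and `array_delimiter` (which separates multiple values, such as owners, within a single cell) is shown with what I believe is its default value:
```
source:
  type: "csv-enricher"
  config:
    filename: "./my_metadata.csv"   # placeholder; point this at your CSV file
    delimiter: ","                  # must match the delimiter used in the file
    array_delimiter: "|"            # separates multiple values within one cell
```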
3. CSV File Encoding: Save your CSV file with UTF-8 encoding. This can often be done in your text editor or spreadsheet software by selecting "Save As" and choosing the appropriate encoding; a command-line sketch follows below.
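On a Unix-like system you can check and convert the encoding with the standard file and iconv tools; the filenames and the source encoding (WINDOWS-1252) here are assumptions to adjust:
```
# Report the detected encoding (filenames are placeholders).
file my_metadata.csv

# Convert to UTF-8; change -f to whatever encoding was detected.
iconv -f WINDOWS-1252 -t UTF-8 my_metadata.csv > my_metadata_utf8.csv
```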
Would you like more detailed steps or further assistance on any specific part of this process?
For testing purposes, you can use the sample CSV file provided in the DataHub GitHub repository. This file contains the required `resource` column and is suitable for testing the CSV Enricher module.
To successfully ingest the CSV file using the DataHub CSV Enricher, you need to create a configuration file (YAML format) that specifies the source type, the path to the CSV file, and the sink configuration. Here’s a step-by-step guide on how to set up the configuration:
1. Prepare the Ingestion Recipe: Create a YAML file (e.g., ingestion_recipe.yaml) with the following content:
```
source:
  type: "csv-enricher"
  config:
    filename: "path/to/csv_enricher_test_data.csv"  # Update this path to the location of your CSV file
    write_semantics: PATCH  # Use PATCH to append metadata or OVERRIDE to replace existing metadata
    delimiter: ","  # Ensure this matches the delimiter used in your CSV file

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Update this to your DataHub server URL
```
2. Run the Ingestion: Use the DataHub CLI to execute the ingestion process with the following command:
```
datahub ingest -c path/to/ingestion_recipe.yaml
```
Key Configuration Details:
- `filename`: Path to the CSV file you want to ingest.
- `write_semantics`: Whether to append to existing metadata (PATCH) or replace it (OVERRIDE).
- `delimiter`: The character used to separate fields in the CSV file (commonly a comma).
- `server`: The URL of the DataHub server where the metadata will be ingested.
Make sure your CSV file is accessible and correctly formatted with the required columns, including the ‘resource’ column.
Would you like more detailed steps or examples on any part of this process?
The error message “Failed to connect to DataHub with DataHubRestEmitter: configured to talk to http://localhost:8080” indicates that the DataHub CLI is unable to establish a connection to the DataHub server at the specified URL. Here are some steps to troubleshoot and resolve this issue:
1. Verify DataHub Server Status: Ensure that the DataHub server is running and accessible. You can check this by opening the DataHub UI in a web browser at http://localhost:8080. If the server is not running, start it.
2. Check Network Connectivity: Make sure that no network issues prevent your machine from reaching the DataHub server. You can use tools like curl or ping to test connectivity:
```
curl -v http://localhost:8080
```
3. Correct Server URL: Double-check the server URL in your ingestion recipe. If you are running DataHub on a different host or port, update the server field in your YAML configuration file accordingly.
4. Port Forwarding (if applicable): If you are running DataHub on Kubernetes, ensure that port forwarding is correctly configured to allow access to the DataHub server (see the sketch after this list).
5. Firewall and Security Settings: Ensure that no firewall rules or security settings are blocking access to the DataHub server.
6. Logs and Error Messages: Check the logs of the DataHub server for any error messages that might provide more context on why the connection is failing.
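For the Kubernetes case in step 4, a port-forward sketch follows. The namespace and service name are assumptions based on the standard DataHub Helm chart; list the actual names with kubectl get svc -n <namespace> and adjust:
```
# Forward local port 8080 to the GMS service inside the cluster.
# Service name and namespace are assumptions; verify with `kubectl get svc`.
kubectl port-forward svc/datahub-datahub-gms 8080:8080 -n datahub
```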
If you continue to experience issues, please provide more details about your setup, such as whether you are using Docker, Kubernetes, or another deployment method.
Would you like more detailed steps on any specific part of this troubleshooting process?
If you are able to successfully connect to http://localhost:9002 using curl, but data ingestion is failing, it suggests that the DataHub server might be running on a different port than the one specified in your ingestion configuration. Here are the steps you can take to resolve this issue:
1. Update Ingestion Configuration: Ensure that the server URL in your ingestion configuration file points to the correct port. If your DataHub server is accessible at http://localhost:9002, update your YAML configuration file to reflect this:
```
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:9002"  # Update to the correct port
```
2. Verify DataHub Server Configuration: Double-check the configuration of your DataHub server to ensure it is listening on the port you expect. This can typically be set in the server's configuration files or environment variables.
3. Check for Multiple Instances: Make sure you do not have multiple DataHub instances running on different ports, which can cause confusion about which instance to connect to.
4. Restart Services: After making changes to the configuration, restart the DataHub services so that all settings are applied (see the sketch after this list).
5. Logs and Error Messages: Review the logs of both the DataHub server and the ingestion process for any error messages that might provide additional context on why the ingestion is failing.
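If you are running DataHub via Docker Compose (e.g., the quickstart), a restart sketch for step 4 follows; the service names datahub-gms and datahub-frontend-react are assumptions based on the quickstart compose file and may differ in your deployment:
```
# Restart the core DataHub containers after a configuration change.
# Service names are assumptions; list yours with `docker compose ps`.
docker compose restart datahub-gms datahub-frontend-react
```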
If you continue to experience issues, please provide more details about your setup, such as whether you are using Docker, Kubernetes, or another deployment method, and any specific error messages you are encountering during ingestion.
Would you like more detailed steps on any specific part of this troubleshooting process?