Dealing with KeyError in CSV Data Ingestion

Original Slack Thread

<@U06TM7M40RK> Trying to ingest CSV data, and getting the following error:

    for wu in itertools.islice(
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 147, in auto_workunit_reporter
    for wu in stream:
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 245, in auto_browse_path_v2
    for urn, batch in _batch_workunits_by_urn(stream):
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 460, in _batch_workunits_by_urn
    for wu in stream:
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 360, in auto_fix_duplicate_schema_field_paths
    for wu in stream:
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 164, in auto_materialize_referenced_tags_terms
    for wu in stream:
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 102, in auto_status_aspect
    for wu in stream:
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/source/csv_enricher.py", line 647, in get_workunits_internal
    if not row["resource"]:
KeyError: 'resource'
[2025-01-02 03:15:33,613] ERROR    {datahub.ingestion.run.pipeline:739} - Pipeline Error: Ingestion pipeline raised an unexpected exception!Pipeline Error: Ingestion pipeline raised an unexpected exception!: 'resource'
[2025-01-02 03:15:33,615] INFO     {datahub.ingestion.reporting.file_reporter:54} - Wrote FAILURE report successfully to <_io.TextIOWrapper name='/tmp/datahub/logs/abcdbbff-cf25-4247-a7c5-912d89a18358/artifacts/ingestion_report.json' mode='w' encoding='UTF-8'>
[2025-01-02 03:15:33,618] INFO     {datahub.cli.ingest_cli:143} - Finished metadata ingestion
Pipeline finished with at least 1 failures; produced 0 events in 0.08 seconds.

The error you’re encountering, KeyError: 'resource', indicates that the code is trying to access a key named “resource” in a dictionary (likely a row from your CSV file), but this key does not exist. This issue is specifically occurring in the get_workunits_internal method of the csv_enricher.py file.
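
A minimal sketch of why this raises, assuming (as the row["resource"] lookup in the traceback suggests) that the enricher reads each CSV row as a dictionary keyed by the header row; the two-line file here is hypothetical:

    import csv
    import io

    # Hypothetical file whose header says "Resource" rather than "resource".
    data = io.StringIO("Resource,description\nfoo,bar\n")
    row = next(csv.DictReader(data))  # keys come straight from the header row
    print(row["resource"])  # KeyError: 'resource' (lookups are case-sensitive)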

Possible Causes:

  1. CSV File Format: The CSV file you are using might not have a column named “resource” (the lookup is case-sensitive). This column is mandatory for the csv-enricher to function correctly.
  2. CSV File Encoding: Sometimes, CSV files saved from different editors (like Excel) might have encoding issues or hidden characters that can cause parsing problems.
  3. Incorrect CSV Configuration: The configuration in your ingestion recipe might not be correctly set up to read the CSV file.

Steps to Resolve:

  1. Check CSV File: Ensure that your CSV file has a column named “resource” (exact spelling, lowercase). Here is an example of a correctly formatted CSV file; a short header check is sketched after this list:

    resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,datahub.growth.users,PROD)",,[urn:li:glossaryTerm:Users],[urn:li:tag:HighQuality],[urn:li:corpuser:lfoe|urn:li:corpuser:jdoe],TECHNICAL_OWNER,description for users table,urn:li:domain:Engineering
    
  2. CSV Configuration in Recipe: Ensure that your ingestion recipe correctly points to the CSV file and uses the correct delimiters. Here is an example of a correct configuration:

    source:
      type: "csv-enricher"
      config:
        filename: /path/to/your/csvfile.csv
        write_semantics: PATCH
        delimiter: ","
    
  3. CSV File Encoding: Save your CSV file with UTF-8 encoding. This can often be done in your text editor or spreadsheet software by selecting “Save As” and choosing the appropriate encoding. Be aware that Excel’s “CSV UTF-8” option prepends a byte-order mark (BOM), which can make the first header parse as '\ufeffresource' instead of 'resource'; the check below reads with an encoding that strips it.
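
A quick sanity check that covers steps 1 and 3 together; a sketch, assuming your file lives at the hypothetical path below (encoding='utf-8-sig' transparently drops a leading BOM):

    import csv

    CSV_PATH = "/path/to/your/csvfile.csv"  # hypothetical path, adjust to your file

    # Parse only the header row and flag a missing "resource" column.
    with open(CSV_PATH, newline="", encoding="utf-8-sig") as f:
        header = csv.DictReader(f).fieldnames
    print("Parsed header:", header)
    if not header or "resource" not in header:
        print("No 'resource' column found; check spelling, case, and delimiter.")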

Would you like more detailed steps or further assistance on any specific part of this process?

<@U06TM7M40RK> Can you point me to a public url containing csv data with a ‘resource’ property that I can use to test with?

For testing purposes, you can use the sample CSV file provided in the DataHub GitHub repository. This file contains the necessary ‘resource’ property and is suitable for testing the CSV Enricher module.

You can find the sample CSV file (“CSV Enricher Test Data”) at the following URL: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/tests/integration/csv-enricher/csv_enricher_test_data.csv

This file includes the ‘resource’ column and other necessary fields, making it a good starting point for testing your ingestion setup.
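
To pull the file down for local testing, the GitHub page above has a raw counterpart (the standard blob to raw.githubusercontent.com translation; the exact path is assumed from the link above). A minimal fetch sketch:

    import urllib.request

    # Raw counterpart of the GitHub page linked above (assumed path).
    URL = (
        "https://raw.githubusercontent.com/datahub-project/datahub/master/"
        "metadata-ingestion/tests/integration/csv-enricher/csv_enricher_test_data.csv"
    )
    urllib.request.urlretrieve(URL, "csv_enricher_test_data.csv")
    print("Saved csv_enricher_test_data.csv")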

Would you like more detailed instructions on how to use this file for testing?

<@U06TM7M40RK> What configs do I need to successfully ingest that csv file?

To successfully ingest the CSV file using the DataHub CSV Enricher, you need to create a configuration file (YAML format) that specifies the source type, the path to the CSV file, and the sink configuration. Here’s a step-by-step guide on how to set up the configuration:

  1. Prepare the Ingestion Recipe: Create a YAML file (e.g., ingestion_recipe.yaml) with the following content:

    source:
      type: "csv-enricher"
      config:
        filename: "path/to/csv_enricher_test_data.csv"  # Update this path to the location of your CSV file
        write_semantics: PATCH  # Use PATCH to append metadata or OVERRIDE to replace existing metadata
        delimiter: ","  # Ensure this matches the delimiter used in your CSV file
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"  # Update this to your DataHub server URL
    
  2. Run the Ingestion: Use the DataHub CLI to execute the ingestion process with the following command:

    datahub ingest -c path/to/ingestion_recipe.yaml
    
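If you prefer to drive the ingestion from Python instead of the CLI, DataHub’s ingestion library exposes the same pipeline programmatically. A sketch using the Pipeline API from datahub.ingestion.run.pipeline, with the same placeholder paths and server URL as above:

    from datahub.ingestion.run.pipeline import Pipeline

    # Same recipe as the YAML above, expressed as a dict.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "csv-enricher",
                "config": {
                    "filename": "path/to/csv_enricher_test_data.csv",
                    "write_semantics": "PATCH",
                    "delimiter": ",",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()  # surfaces ingestion failures as an exception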

Key Configuration Details:

  • filename: Path to the CSV file you want to ingest.
  • write_semantics: Determines whether to append (PATCH) or replace (OVERRIDE) existing metadata.
  • delimiter: The character used to separate fields in the CSV file (commonly a comma ,).
  • server: The URL of your DataHub server where the metadata will be ingested.

Make sure your CSV file is accessible and correctly formatted with the required columns, including the ‘resource’ column.

Would you like more detailed steps or examples on any part of this process?

<@U06TM7M40RK> Now I get the following error:
Failed to connect to DataHub with DataHubRestEmitter: configured to talk to http://localhost:8080

The error message “Failed to connect to DataHub with DataHubRestEmitter: configured to talk to http://localhost:8080” indicates that the DataHub CLI is unable to establish a connection to the DataHub server at the specified URL. Here are some steps to troubleshoot and resolve this issue:

  1. Verify DataHub Server Status: Ensure that the DataHub server is running and accessible. You can check this by opening the DataHub UI in a web browser (http://localhost:9002 in a default quickstart; the REST sink itself talks to GMS, which defaults to http://localhost:8080). If the server is not running, you may need to start it.

  2. Check Network Connectivity: Make sure that there are no network issues preventing your machine from reaching the DataHub server. You can use tools like curl or ping to test connectivity (a Python equivalent using the DataHub client is sketched after this list):

    curl -v http://localhost:8080
    
  3. Correct Server URL: Double-check the server URL in your ingestion recipe. If you are running DataHub on a different host or port, update the server field in your YAML configuration file accordingly.

  4. Port Forwarding (if applicable): If you are using a setup like Kubernetes, ensure that port forwarding is correctly configured to allow access to the DataHub server.

  5. Firewall and Security Settings: Ensure that there are no firewall rules or security settings blocking access to the DataHub server.

  6. Logs and Error Messages: Check the logs of the DataHub server for any error messages that might provide more context on why the connection is failing.
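
Beyond curl, you can reproduce the exact handshake the ingestion sink performs by using DataHub’s REST emitter directly. A sketch (adjust the URL, and pass a token to DatahubRestEmitter if your server requires authentication):

    from datahub.emitter.rest_emitter import DatahubRestEmitter

    # Same client class the datahub-rest sink uses under the hood.
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.test_connection()  # raises if no DataHub GMS answers at this address
    print("Connection OK")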

If you continue to experience issues, please provide more details about your setup, such as whether you are using Docker, Kubernetes, or another deployment method.

Would you like more detailed steps on any specific part of this troubleshooting process?

<@U06TM7M40RK> curl -v http://localhost:9002 is successful, but data ingestion is not. What now?

If you can connect to http://localhost:9002 with curl but data ingestion still fails, the port you tested is probably not the one the ingestion sink needs: in a default quickstart, port 9002 serves the DataHub web frontend, while the datahub-rest sink talks to the metadata service (GMS), which defaults to port 8080. Here are the steps you can take to resolve this issue:

  1. Update Ingestion Configuration: Ensure that the server URL in your ingestion recipe points at a GMS endpoint rather than the web UI. Common candidates are GMS directly on port 8080 or the frontend’s /api/gms proxy path (the probe sketched after this list checks both). Once you have identified the right endpoint, update your YAML configuration file to reflect it:

    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:9002/api/gms"  # or the GMS address itself, e.g. http://localhost:8080
    
  2. Verify DataHub Server Configuration: Double-check the configuration of your DataHub server to ensure it is set to listen on the correct port. This can typically be configured in the server’s configuration files or environment variables.

  3. Check for Multiple Instances: Ensure that there are no multiple instances of DataHub running on different ports, which might cause confusion about which instance to connect to.

  4. Restart Services: After making changes to the configuration, restart the DataHub services to ensure that all settings are applied correctly.

  5. Logs and Error Messages: Review the logs of both the DataHub server and the ingestion process for any error messages that might provide additional context on why the ingestion is failing.
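
To identify which endpoint is actually the metadata service, you can probe the usual candidates for the /config endpoint that GMS serves. A sketch; the candidate list assumes a default quickstart layout and may need adjusting for your deployment:

    import urllib.request

    # Likely GMS endpoints in a default quickstart: GMS directly on 8080,
    # or the web frontend's GMS proxy on 9002 under /api/gms.
    CANDIDATES = [
        "http://localhost:8080",
        "http://localhost:9002/api/gms",
    ]

    for base in CANDIDATES:
        try:
            with urllib.request.urlopen(f"{base}/config", timeout=5) as resp:
                print(f"{base}: HTTP {resp.status}, looks like a live GMS endpoint")
        except Exception as exc:
            print(f"{base}: not reachable ({exc})")

Whichever candidate answers /config with HTTP 200 is the value to put in the sink’s server field.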

If you continue to experience issues, please provide more details about your setup, such as whether you are using Docker, Kubernetes, or another deployment method, and any specific error messages you are encountering during ingestion.

Would you like more detailed steps on any specific part of this troubleshooting process?

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><