Troubleshooting CSV File Ingestion and Metadata Production

Original Slack Thread

<@U06TM7M40RK> I’m trying to ingest a CSV file and get the error No metadata was produced by the source.

The error “No metadata was produced by the source” means the ingestion run completed, but the source did not emit any metadata, so nothing was sent to the sink. This usually comes down to an incorrect configuration or a problem with the CSV file’s format.

Here are some potential causes and solutions:

  1. CSV File Format: Ensure that your CSV file is correctly formatted and contains the necessary columns. For the csv-enricher source, the resource column is mandatory. According to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/csv_enricher.py|CSV Enricher Source Code>, the CSV should have columns like resource, subresource, glossary_terms, tags, owners, ownership_type, description, and domain.

  2. Configuration: Verify that your ingestion configuration is correctly set up. Here is an example configuration for ingesting a CSV file using the csv-enricher source:

    source:
      type: "csv-enricher"
      config:
        filename: "/path/to/your/csvfile.csv"
        write_semantics: PATCH
        delimiter: ","
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    
  3. File Accessibility: Ensure that the file path specified in the configuration is correct and that the file is accessible from the environment where the ingestion command runs. If you are using Docker, you may need to mount the local directory into the container (see the example after this list).

  4. Logs and Errors: Check the logs for any specific errors or warnings that might indicate what is going wrong. The logs can provide valuable insights into why the ingestion process is not producing any metadata.
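
For step 3 above, if you run the ingestion from the DataHub CLI Docker image, the CSV and the recipe must be mounted into the container so that the filename path resolves inside it. A minimal sketch, assuming the acryldata/datahub-ingestion image (whose entrypoint is the datahub CLI); the host path and image tag are illustrative, so adjust them to your setup:

    # Mount the host directory containing the CSV and recipe at /data in the container,
    # then reference the in-container paths in the recipe and on the command line.
    docker run --rm \
      -v /path/on/host:/data \
      acryldata/datahub-ingestion:head \
      ingest -c /data/config.yml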

Would you like more detailed steps or further assistance on any specific part of the process?

<@U06TM7M40RK> I’m not using the enricher. I’m using ingestion source: other.

If you are using the “other” ingestion source and still seeing the error “No metadata was produced by the source,” the configuration or the data format is probably not aligned with what the source expects. Here are some steps to troubleshoot and resolve the issue:

  1. Verify Configuration: Ensure that your configuration file is correctly set up for the “other” source type. The configuration should specify the correct path to your CSV file and any other necessary parameters.

  2. Check CSV File Format: Make sure that your CSV file is formatted correctly and contains the necessary columns that the ingestion source expects. If the source type is “other,” it might require a specific format or set of columns.

  3. Review Logs: Check the ingestion logs for errors or warnings that explain why the source produced no metadata; running the CLI with verbose logging can help here (see the command after this list).

  4. Use a Supported Source Type: If possible, use a source type that matches your data format. If your data is in CSV form, a source that explicitly supports CSV, such as csv-enricher, is likely the better fit.

  5. Consult Documentation: Refer to the <https://datahubproject.io/docs/0.14.0/metadata-ingestion|DataHub documentation> for details on configuring different source types and ensuring that your setup aligns with the expected configurations.
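
For the log review in step 3, running the CLI with verbose logging usually shows why the source emitted nothing. A minimal sketch, assuming a recent DataHub CLI where the global --debug flag is available:

    # Re-run the ingestion with debug-level logging and inspect the output
    datahub --debug ingest -c path/to/your/config.yml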

Would you like more detailed steps or further assistance on any specific part of the process?

<@U06TM7M40RK> give me an example CSV file that can be ingested like any other data source

In DataHub, CSV files are typically ingested with the csv-enricher source, which reads metadata such as glossary terms, tags, owners, descriptions, and domains from the CSV and applies it to existing entities. Below is an example CSV file and a corresponding ingestion configuration that you can use as a reference.

Example CSV File

Here is a sample CSV file format that includes metadata such as resource, glossary terms, tags, and descriptions:

resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain,ownership_type_urn
"urn:li:dataset:(urn:li:dataPlatform:postgres,example_db.example_table,PROD)",,"[urn:li:glossaryTerm:ExampleTerm]","[urn:li:tag:ExampleTag]","[urn:li:corpuser:admin]",TECHNICAL_OWNER,"Description for example table","urn:li:domain:Engineering","urn:li:ownershipType:example-ownership-type"
"urn:li:dataset:(urn:li:dataPlatform:postgres,example_db.example_table,PROD)",column1,"[urn:li:glossaryTerm:ColumnTerm]","[urn:li:tag:ColumnTag]",,"","Description for column 1",,

Example Ingestion Configuration

Here is an example YAML configuration for ingesting the above CSV file using the csv-enricher source:

source:
  type: "csv-enricher"
  config:
    filename: "/path/to/your/example.csv"  # Update this path to the actual location of your CSV file
    write_semantics: PATCH
    delimiter: ","
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Update this if your DataHub server runs on a different address

Steps to Ingest

  1. Prepare the CSV File: Ensure your CSV file is saved and accessible at the specified path.
  2. Create the Configuration File: Save the above YAML configuration as config.yml.
  3. Run the Ingestion: Use the DataHub CLI to run the ingestion:
    datahub ingest -c path/to/your/config.yml
    

This setup ingests the metadata from the CSV file into DataHub, enriching the referenced datasets and columns with the specified glossary terms, tags, owners, descriptions, and domains.
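
If you want to confirm that the metadata landed, one quick check is to fetch the enriched dataset back with the CLI. A minimal sketch using the example URN from the CSV above; it assumes the datahub CLI is already configured to talk to your DataHub instance (for example via datahub init):

    # Fetch the dataset and verify the tags, terms, owners, and description are present
    datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:postgres,example_db.example_table,PROD)"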

Would you like more detailed instructions or help with a specific part of the process?
