Ingesting Metadata from Delta Lake Tables in MinIO using DataHub

Original Slack Thread

How do I ingest metadata from Delta Lake tables in MinIO?

To ingest metadata from Delta Lake tables stored in MinIO using DataHub, you can follow these steps:

  1. Configuration: You need to configure your DataHub ingestion recipe to point to your MinIO setup. MinIO is S3-compatible, so you can use the S3 configuration options. Here’s a sample configuration:

    source:
      type: delta-lake
      config:
        env: DEV
        base_path: "s3://my-test-bucket/delta_tables/sales"
        s3:
          aws_config:
            aws_access_key_id: "miniouser"
            aws_secret_access_key: "miniopassword"
            aws_endpoint_url: "http://localhost:9000"
            aws_region: "us-east-1"
    sink:
      type: file
      config:
        filename: "/path/to/output/delta_lake_minio_mces.json"
    

    This configuration specifies the base_path where your Delta Lake tables are stored in MinIO. The aws_endpoint_url should point to your MinIO server, and you should provide the appropriate access credentials.
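
    If you want the metadata to land directly in a running DataHub instance instead of a local file, you can swap the file sink for the REST sink. The server URL below is an assumption; point it at your own DataHub GMS endpoint:

    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"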

  2. Run the Ingestion Pipeline: Use the DataHub CLI to run the ingestion pipeline with the above recipe. This will read the Delta Lake table metadata from MinIO and emit it to the configured sink.
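
    For example, assuming the recipe above is saved as delta_lake_minio.yaml (the filename is illustrative) and the Delta Lake plugin is installed, the run looks like this:

    # Install the Delta Lake ingestion plugin (one-time setup)
    pip install 'acryl-datahub[delta-lake]'

    # Run the recipe with the DataHub CLI
    datahub ingest -c delta_lake_minio.yaml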

  3. Troubleshooting: If you encounter errors connecting to MinIO, make sure the aws_endpoint_url does not have a trailing slash; a trailing slash is a known cause of failures reported in community Slack discussions.
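
    As a quick sanity check before re-running ingestion, you can confirm that the endpoint and credentials actually reach the bucket, for example with the AWS CLI (the values below are taken from the sample recipe and are assumptions about your setup):

    # List the Delta table path through the MinIO endpoint;
    # a valid Delta table should contain a _delta_log/ directory
    AWS_ACCESS_KEY_ID=miniouser AWS_SECRET_ACCESS_KEY=miniopassword \
      aws --endpoint-url http://localhost:9000 s3 ls s3://my-test-bucket/delta_tables/sales/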

  4. Verify the Output: After running the pipeline, verify the output file to ensure that the metadata has been correctly ingested.
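
    With the file sink from the sample recipe, one quick check is to look for dataset URNs in the emitted JSON. This assumes the default JSON layout written by the file sink:

    # List the unique dataset URNs emitted by the ingestion run
    grep -o 'urn:li:dataset:[^"]*' /path/to/output/delta_lake_minio_mces.json | sort -u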

