<@U06TM7M40RK> how to ingest from delta lake tables in minio?
To ingest metadata from Delta Lake tables stored in MinIO using DataHub, you can follow these steps:
1. Configuration: You need to configure your DataHub ingestion recipe to point to your MinIO setup. MinIO is S3-compatible, so you can use the S3 configuration options. Here's a sample configuration:

```yaml
source:
  type: delta-lake
  config:
    env: DEV
    base_path: "s3://my-test-bucket/delta_tables/sales"
    s3:
      aws_config:
        aws_access_key_id: "miniouser"
        aws_secret_access_key: "miniopassword"
        aws_endpoint_url: "http://localhost:9000"
        aws_region: "us-east-1"

sink:
  type: file
  config:
    filename: "/path/to/output/delta_lake_minio_mces.json"
```
This configuration specifies the `base_path` where your Delta Lake tables are stored in MinIO. The `aws_endpoint_url` should point to your MinIO server, and you should provide the appropriate access credentials.
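Before ingesting, it can help to confirm that the table path actually exists in MinIO. A minimal sketch using the MinIO client, assuming `mc` is installed and reusing the endpoint and credentials from the recipe above (the alias name `localminio` is illustrative):

```bash
# Register the MinIO server under a local alias (alias name is arbitrary)
mc alias set localminio http://localhost:9000 miniouser miniopassword

# Confirm the Delta table directory exists; it should contain a _delta_log/ folder
mc ls localminio/my-test-bucket/delta_tables/sales
```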
2. Run the Ingestion Pipeline: Use the DataHub CLI to run the ingestion pipeline with the above configuration. This will read the Delta Lake tables from MinIO and ingest the metadata into DataHub.
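A minimal sketch of the CLI invocation, assuming the recipe above is saved as `delta_lake_minio.yaml` (the filename is illustrative):

```bash
# Install the DataHub CLI with the Delta Lake plugin, if not already present
pip install 'acryl-datahub[delta-lake]'

# Run the recipe; the file sink will write events to the configured output path
datahub ingest -c delta_lake_minio.yaml
```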
3. Troubleshooting: If you encounter issues, ensure that the `aws_endpoint_url` does not have a trailing slash, as this can cause errors, as noted in a Slack discussion (see Sources below).
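If ingestion cannot reach MinIO at all, a quick liveness probe can separate connectivity problems from recipe problems. This sketch assumes MinIO's standard health endpoint on the default port:

```bash
# Expect HTTP 200 if the MinIO server is up and reachable
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9000/minio/health/live
```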
4. Verify the Output: After running the pipeline, verify the output file to ensure that the metadata has been correctly ingested.
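One way to spot-check the output file, using the file sink path from the recipe above. The exact event shape can vary by DataHub version, so treat the `jq` filter as illustrative:

```bash
# Pretty-print the beginning of the emitted metadata events
python -m json.tool /path/to/output/delta_lake_minio_mces.json | head -n 40

# If jq is available, list the entity URNs that were ingested
jq -r '.[].entityUrn // empty' /path/to/output/delta_lake_minio_mces.json
```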
Would you like more detailed instructions or a breakdown of any specific part of this process?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/tests/integration/delta_lake/test_delta_lake_minio.py|GitHub - Delta Lake MinIO Test>
- <Slack Discussion on MinIO Configuration>