We are doing a POC on DataHub. As part of the POC we are testing S3 ingestion and exploring all of its options. Currently I am working on partition detection in S3 ingestion, adding all the partition details in the recipe. When I run the ingestion, DataHub considers only one file from the list of sample files, and whatever schema is present in that file is displayed as the metadata in the UI.
Is there any option to consider more than one file when determining the schema from the sample files?
Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!
Are you using UI or CLI for ingestion?
Which DataHub version are you using? (e.g. 0.12.0)
What data source(s) are you integrating with DataHub? (e.g. BigQuery)
<@U06TM7M40RK> Can you help with the questions in this thread?
We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to give its responses a 👍 or 👎 so other folks in the Community know if you’re still looking for help on the topic!
To address your question about considering more than one file to determine the schema from sample files during S3 ingestion in DataHub, the short answer is that DataHub does not merge schemas from multiple files. Instead, it extracts the schema from a single file, typically the last updated file in the dataset.
Detailed Explanation
Schema Extraction from a Single File:
According to the <https://github.com/datahub-project/datahub/pull/7202|GitHub pull request>, DataHub’s S3 ingestion process extracts schema metadata from the last updated file in a dataset. This means that if a dataset consists of multiple Parquet files, the schema is inferred from the most recently updated file, not by merging schemas from all files.
Current Limitations:
As discussed in the <Slack conversation>, there are limitations in the S3 ingestion process where only one file’s schema is considered. This can be problematic if there are schema changes across different files in the dataset.
Example Configuration
Here is an example of how you might configure the S3 ingestion source in your DataHub ingestion recipe:
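A minimal sketch of such a recipe, assuming a hypothetical bucket layout like `s3://my-bucket/my-data/<table>/year=YYYY/month=MM/*.parquet` (the bucket name, prefix, partition keys, region, and sink URL are all placeholders to replace with your own values):

```yaml
source:
  type: s3
  config:
    path_specs:
      # {table} becomes the dataset name; {partition_key[i]}={partition[i]}
      # tells DataHub how the folder hierarchy encodes partitions.
      - include: "s3://my-bucket/my-data/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet"
    aws_config:
      aws_region: us-east-1
    env: PROD
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

With this layout, each folder directly under `my-data` maps to a dataset and the `year=`/`month=` folders are detected as partitions; the schema itself is still inferred from a single (most recently updated) file, as described above.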
Schema Consistency: Ensure that all Parquet files in a dataset have a consistent schema to avoid issues during ingestion.
Schema Updates: If schema changes are expected, consider updating the ingestion process to handle schema evolution more gracefully, possibly by implementing custom logic to merge schemas before ingestion.
Would you like more details on any specific part of this process?
I see from the response that the ingestion process finds the latest file when multiple files are available in the folder and reads the metadata from that latest file. In that case it should be fine, because if the schema changes in the future, those changes will be picked up when we re-run the ingestion.