We are doing a POC on DataHub. As part of the POC we are testing S3 ingestion and exploring all of its options. Currently I am working on partition detection in S3 ingestion, adding all the partition details in the recipe. When I run the ingestion, DataHub considers only one file from the list of sample files, and whatever schema is present in that file is displayed as the metadata in the UI.
Is there any option to consider more than one file when determining the schema from the sample files?
Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!
Are you using UI or CLI for ingestion?
Which DataHub version are you using? (e.g. 0.12.0)
What data source(s) are you integrating with DataHub? (e.g. BigQuery)
<@U06TM7M40RK> Can you help with the questions in this thread?
We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to give its responses a 👍 or 👎 so other folks in the Community know if you’re still looking for help on the topic!
To address your question about considering more than one file to determine the schema from sample files during S3 ingestion in DataHub, the short answer is that DataHub does not merge schemas from multiple files. Instead, it extracts the schema from a single file, typically the last updated file in the dataset.
Detailed Explanation
Schema Extraction from a Single File:
According to the <https://github.com/datahub-project/datahub/pull/7202|GitHub pull request>, DataHub’s S3 ingestion process extracts schema metadata from the last updated file in a dataset. This means that if a dataset consists of multiple Parquet files, the schema is inferred from the most recently updated file, not by merging schemas from all files.
Current Limitations:
As discussed in the <Slack conversation>, there are limitations in the S3 ingestion process where only one file’s schema is considered. This can be problematic if there are schema changes across different files in the dataset.
Example Configuration
Here is an example of how you might configure the S3 ingestion source in your DataHub ingestion recipe:
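A minimal sketch of such a recipe, assuming a hypothetical bucket layout like `s3://my-bucket/my-data/<table>/year=YYYY/month=MM/*.parquet` (the bucket name, prefix, partition keys, region, and sink URL are all placeholders to replace with your own values):

```yaml
source:
  type: s3
  config:
    path_specs:
      # {table} becomes the dataset name; {partition_key[i]}={partition[i]}
      # tells DataHub how the folder hierarchy encodes partitions.
      - include: "s3://my-bucket/my-data/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet"
    aws_config:
      aws_region: us-east-1
    env: PROD
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

With this layout, each folder directly under `my-data` maps to a dataset and the `year=`/`month=` folders are detected as partitions; the schema itself is still inferred from a single (most recently updated) file, as described above.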
Schema Consistency: Ensure that all Parquet files in a dataset have a consistent schema to avoid issues during ingestion.
Schema Updates: If schema changes are expected, consider updating the ingestion process to handle schema evolution more gracefully, possibly by implementing custom logic to merge schemas before ingestion.
Would you like more details on any specific part of this process?
I see from the response that the ingestion process finds the latest file when multiple files are available in the folder and reads the metadata from that latest file. In that case it should be fine, because if the schema changes in the future, those changes will be picked up when we re-run the ingestion.