Does DataHub merge schemas in Parquet files with S3 ingestion?

Original Slack Thread

Does DataHub merge schemas in Parquet files with S3 ingestion?

To address your question: no, DataHub does not merge schemas from multiple Parquet files during S3 ingestion. Instead, it extracts the schema from a single file, typically the most recently updated file in the dataset.

Detailed Explanation

  1. Schema Extraction from a Single File:
    According to the <https://github.com/datahub-project/datahub/pull/7202|GitHub pull request>, DataHub’s S3 ingestion process extracts schema metadata from the last updated file in a dataset. This means that if a dataset consists of multiple Parquet files, the schema is inferred from the most recently updated file, not by merging schemas from all files. A minimal sketch of this behaviour is shown after this list.

  2. Schema Inference Process:
    The <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/s3/README.md|S3 ingestion documentation> confirms that schemas for Parquet files are extracted as provided. This implies that the schema is taken directly from the file without any merging or aggregation of schemas from multiple files.

  3. Current Limitations:
    As discussed in the original Slack conversation, there are limitations in the S3 ingestion process where only one file’s schema is considered. This can be problematic if there are schema changes across different files in the dataset.
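
For illustration, here is a minimal Python sketch (not DataHub’s actual implementation) of the behaviour described in point 1: it finds the most recently modified Parquet file under an S3 prefix and reads only that file’s schema. The bucket, prefix, and region are placeholders; it assumes pyarrow is installed and AWS credentials are available in the environment.

import pyarrow.parquet as pq
from pyarrow import fs

# Placeholder region/bucket/prefix; assumes AWS credentials in the environment.
s3 = fs.S3FileSystem(region="us-east-1")
selector = fs.FileSelector("your-bucket/path/to/files", recursive=True)

# List every Parquet file under the prefix and keep the newest one.
parquet_files = [
    info
    for info in s3.get_file_info(selector)
    if info.is_file and info.path.endswith(".parquet")
]
latest = max(parquet_files, key=lambda info: info.mtime)

# Only this single file's schema is read; nothing is merged across files.
print(latest.path)
print(pq.read_schema(latest.path, filesystem=s3))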

Example Configuration

Here is an example of how you might configure the S3 ingestion source in your DataHub ingestion recipe:

source:
    type: s3
    config:
        path_specs:
            - include: 's3://your-bucket/path/to/files/*.parquet'
        aws_config:
            aws_access_key_id: YOUR_ACCESS_KEY
            aws_secret_access_key: YOUR_SECRET_KEY
            aws_region: YOUR_AWS_REGION
        env: PROD
        profiling:
            enabled: false
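
To run the recipe above with the DataHub CLI, install the CLI with the S3 plugin and point the ingest command at the recipe file (the file name s3_recipe.yaml is just an example):

pip install 'acryl-datahub[s3]'
datahub ingest -c s3_recipe.yaml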

Recommendations

  • Schema Consistency: Ensure that all Parquet files in a dataset have a consistent schema to avoid issues during ingestion.
  • Schema Updates: If schema changes are expected, consider updating the ingestion process to handle schema evolution more gracefully, possibly by implementing custom logic to merge schemas before ingestion (a sketch of this approach is shown below).
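
As a starting point for the second recommendation, here is a hedged Python sketch (not a built-in DataHub feature) that reads the schema of every Parquet file under a prefix and unifies them with pyarrow. Because unify_schemas raises an error on incompatible field types, it also doubles as the consistency check from the first recommendation. Bucket, prefix, and region are placeholders.

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Placeholder region/bucket/prefix; assumes AWS credentials in the environment.
s3 = fs.S3FileSystem(region="us-east-1")
selector = fs.FileSelector("your-bucket/path/to/files", recursive=True)

# Collect the schema of every Parquet file in the dataset.
schemas = [
    pq.read_schema(info.path, filesystem=s3)
    for info in s3.get_file_info(selector)
    if info.is_file and info.path.endswith(".parquet")
]

# Build a single superset schema; incompatible column types raise an error,
# which surfaces schema drift across the files in the dataset.
merged = pa.unify_schemas(schemas)
print(merged)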
