Does DataHub merge schemas in Parquet files with S3 ingestion?

Original Slack Thread

Does DataHub merge schemas in Parquet files with S3 ingestion?

To address your question: no, DataHub does not merge schemas from multiple Parquet files during S3 ingestion. Instead, it extracts the schema from a single file, typically the most recently updated file in the dataset.

Detailed Explanation

  1. Schema Extraction from a Single File:
    According to the <https://github.com/datahub-project/datahub/pull/7202|GitHub pull request>, DataHub’s S3 ingestion process extracts schema metadata from the last updated file in a dataset. This means that if a dataset consists of multiple Parquet files, the schema is inferred from the most recently updated file, not by merging schemas from all files. A minimal sketch of this behaviour is shown after this list.

  2. Schema Inference Process:
    The <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/s3/README.md|S3 ingestion documentation> confirms that schemas for Parquet files are extracted as provided. This implies that the schema is taken directly from the file without any merging or aggregation of schemas from multiple files.

  3. Current Limitations:
    As discussed in the original Slack conversation, there are limitations in the S3 ingestion process where only one file’s schema is considered. This can be problematic if there are schema changes across different files in the dataset.
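
For illustration, here is a minimal Python sketch (not DataHub’s actual implementation) of the behaviour described in point 1: it finds the most recently modified Parquet file under an S3 prefix and reads only that file’s schema. The bucket, prefix, and region are placeholders; it assumes pyarrow is installed and AWS credentials are available in the environment.

import pyarrow.parquet as pq
from pyarrow import fs

# Placeholder region/bucket/prefix; assumes AWS credentials in the environment.
s3 = fs.S3FileSystem(region="us-east-1")
selector = fs.FileSelector("your-bucket/path/to/files", recursive=True)

# List every Parquet file under the prefix and keep the newest one.
parquet_files = [
    info
    for info in s3.get_file_info(selector)
    if info.is_file and info.path.endswith(".parquet")
]
latest = max(parquet_files, key=lambda info: info.mtime)

# Only this single file's schema is read; nothing is merged across files.
print(latest.path)
print(pq.read_schema(latest.path, filesystem=s3))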

Example Configuration

Here is an example of how you might configure the S3 ingestion source in your DataHub ingestion recipe:

source:
    type: s3
    config:
        path_specs:
            - include: 's3://your-bucket/path/to/files/*.parquet'
        aws_config:
            aws_access_key_id: YOUR_ACCESS_KEY
            aws_secret_access_key: YOUR_SECRET_KEY
            aws_region: YOUR_AWS_REGION
        env: PROD
        profiling:
            enabled: false
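
To run the recipe above with the DataHub CLI, install the CLI with the S3 plugin and point the ingest command at the recipe file (the file name s3_recipe.yaml is just an example):

pip install 'acryl-datahub[s3]'
datahub ingest -c s3_recipe.yaml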

Recommendations

  • Schema Consistency: Ensure that all Parquet files in a dataset have a consistent schema to avoid issues during ingestion.
  • Schema Updates: If schema changes are expected, consider updating the ingestion process to handle schema evolution more gracefully, possibly by implementing custom logic to merge schemas before ingestion (a sketch of this approach is shown below).
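
As a starting point for the second recommendation, here is a hedged Python sketch (not a built-in DataHub feature) that reads the schema of every Parquet file under a prefix and unifies them with pyarrow. Because unify_schemas raises an error on incompatible field types, it also doubles as the consistency check from the first recommendation. Bucket, prefix, and region are placeholders.

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Placeholder region/bucket/prefix; assumes AWS credentials in the environment.
s3 = fs.S3FileSystem(region="us-east-1")
selector = fs.FileSelector("your-bucket/path/to/files", recursive=True)

# Collect the schema of every Parquet file in the dataset.
schemas = [
    pq.read_schema(info.path, filesystem=s3)
    for info in s3.get_file_info(selector)
    if info.is_file and info.path.endswith(".parquet")
]

# Build a single superset schema; incompatible column types raise an error,
# which surfaces schema drift across the files in the dataset.
merged = pa.unify_schemas(schemas)
print(merged)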
