<@U06TM7M40RK> I want to scan an Iceberg table in a Glue catalog. I used the Glue ingestion recipe and it worked. However, it did not work using the Iceberg ingestion recipe. What are the differences between ingesting an Iceberg table with the Glue recipe versus the Iceberg recipe?
The differences between ingesting an Iceberg table using the Glue ingestion recipe versus the Iceberg ingestion recipe primarily revolve around the configuration and the underlying mechanisms used to access the metadata.
Glue Ingestion Recipe:
- The Glue ingestion recipe is specifically designed to work with AWS Glue as the catalog. It uses AWS Glue APIs to fetch metadata about the tables.
- This method is typically more straightforward if your tables are already registered in Glue, as it leverages Glue’s capabilities to manage and retrieve metadata.
- According to the Glue source configuration documentation, you need to ensure that the AWS credentials and region are correctly configured to access the Glue catalog. A minimal recipe sketch follows this list.
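For concreteness, here is a minimal sketch of a Glue recipe, assuming a local DataHub instance; the region, the `extract_transforms` flag, and the sink address are placeholders to adapt:
```yaml
# Hypothetical minimal Glue recipe -- region and sink server are placeholders.
# AWS credentials are typically picked up from the standard credential chain.
source:
  type: glue
  config:
    aws_region: us-east-1      # region of your Glue catalog (placeholder)
    extract_transforms: true   # also extract Glue job/transform metadata
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080   # your DataHub GMS endpoint (placeholder)
```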
Iceberg Ingestion Recipe:
- The Iceberg ingestion recipe is more generic and can be used with various catalog types, including REST, SQL, and others, as specified in the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/iceberg|Iceberg Ingestion Source Configuration>.
- This method requires a more detailed configuration, especially if you are using a REST or SQL-based catalog. The configuration must align with the pyiceberg library’s requirements.
- If you are using the Iceberg ingestion recipe with a Glue catalog, you need to ensure that the configuration is correctly set up to use Glue as the catalog type, and that all necessary dependencies, like `pyiceberg[glue]`, are installed. A sketch of such a recipe follows this list.
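As an illustration, a hedged sketch of an Iceberg recipe pointed at a Glue catalog; the catalog name `my_glue_catalog` is a placeholder, and the property keys under it are passed through to pyiceberg, so verify them against your pyiceberg version:
```yaml
# Hypothetical Iceberg recipe using AWS Glue as the catalog backend.
# Assumes the plugin extras are installed, e.g.:
#   pip install 'acryl-datahub[iceberg]' 'pyiceberg[glue]'
source:
  type: iceberg
  config:
    env: PROD
    catalog:
      my_glue_catalog:          # arbitrary catalog name (placeholder)
        type: glue              # tells pyiceberg to load its Glue catalog
        glue.region: us-east-1  # pyiceberg property key; verify for your version
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080   # your DataHub GMS endpoint (placeholder)
```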
Dependencies and Environment:
- The Iceberg ingestion recipe might require additional dependencies, such as `pyiceberg`, which need to be installed in the environment where the ingestion is running. This is highlighted in the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/iceberg|Iceberg source configuration documentation>.
- If you encounter issues with the Iceberg ingestion recipe, it might be due to missing dependencies or an incorrect configuration of the catalog type.
If you are facing issues with the Iceberg ingestion recipe, it would be helpful to verify the configuration and ensure that all required dependencies are installed. Would you like more detailed steps on configuring the Iceberg ingestion recipe or troubleshooting common issues?
Sources:
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/iceberg|Iceberg Ingestion Source Configuration>
<@U06TM7M40RK> But do the Iceberg and Glue ingestion recipes bring back anything different for an Iceberg table? Any lineage differences?
It seems likely that the differences between using the Iceberg ingestion recipe and the Glue ingestion recipe for an Iceberg table primarily relate to how metadata is accessed and potentially the level of detail in lineage information.
Metadata Access:
- The Glue ingestion recipe accesses metadata through AWS Glue, which might provide a more integrated view if your data ecosystem heavily relies on AWS services. This can include additional metadata managed by Glue, such as job configurations and transformations.
- The Iceberg ingestion recipe, on the other hand, accesses metadata directly from the Iceberg catalog, which might be more suitable if you are using a non-AWS catalog or need to leverage Iceberg-specific features.
Lineage Information:
- According to the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/glue|Glue Configuration Details>, Glue ingestion supports table-level lineage by default. This means it can track data transformations and dependencies as managed by Glue.
- The Iceberg ingestion recipe might provide lineage information based on the Iceberg catalog’s capabilities. If the catalog supports detailed lineage tracking, this can be reflected in the ingestion process.
Differences in Lineage Detail:
- If your Iceberg tables are part of a broader data pipeline managed by Glue, using the Glue ingestion recipe might provide more comprehensive lineage details, including job-level transformations and dependencies.
- The Iceberg ingestion recipe might focus more on the table-level lineage as defined by the Iceberg catalog, which could be less detailed if the catalog does not manage job-level metadata.
In summary, the choice between the two recipes might affect the granularity and scope of the lineage information you receive, depending on how your data ecosystem is structured and the capabilities of the catalog you are using. Would you like more detailed information on configuring lineage tracking for either recipe?
Sources:
- <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/glue|Glue Configuration Details>
<@U06TM7M40RK> Does Iceberg have lineage with Snowflake? How do I configure it?
To configure lineage between Iceberg tables and Snowflake in DataHub, you need to ensure that both data sources are properly ingested and that lineage information is captured during the ingestion process. Here’s a general approach to achieve this:
Ingest Iceberg Tables:
- Use the Iceberg ingestion recipe to ingest metadata from your Iceberg tables. Ensure that the catalog type (e.g., Hive, Glue, REST) is correctly configured in the ingestion recipe.
- Make sure that the necessary dependencies, such as `pyiceberg`, are installed in your environment.
Ingest Snowflake Metadata:
- Use the Snowflake ingestion recipe to ingest metadata from your Snowflake instance. This includes configuring the Snowflake account, warehouse, database, and schema details in the ingestion recipe.
- Enable lineage extraction by setting the `include_table_lineage` and `include_column_lineage` flags to `true` in the Snowflake ingestion configuration. This allows DataHub to capture lineage based on the SQL queries and operations performed in Snowflake. A recipe sketch follows this list.
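As an illustration, a sketch of a Snowflake recipe with both lineage flags enabled; the account, warehouse, credentials, and sink address are placeholders:
```yaml
# Hypothetical Snowflake recipe with lineage extraction enabled.
source:
  type: snowflake
  config:
    account_id: my_account_id          # Snowflake account identifier (placeholder)
    warehouse: COMPUTE_WH              # warehouse for metadata queries (placeholder)
    username: datahub_user             # service user (placeholder)
    password: "${SNOWFLAKE_PASSWORD}"  # injected from the environment
    include_table_lineage: true        # table-level lineage from query history
    include_column_lineage: true       # column-level lineage via SQL parsing
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080      # your DataHub GMS endpoint (placeholder)
```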
Configure Lineage Extraction:
- Ensure that both ingestion recipes are configured to extract lineage information. For Snowflake, this involves parsing SQL queries to determine data flows between tables.
- For Iceberg, lineage extraction will depend on the capabilities of the catalog and the ingestion configuration.
Run Ingestion Jobs:
- Execute the ingestion jobs for both Iceberg and Snowflake (e.g., via the DataHub CLI: `datahub ingest -c recipe.yml`). This will populate DataHub with metadata and lineage information for both data sources.
Visualize Lineage in DataHub:
- Once the ingestion jobs are complete, you can visualize the lineage in the DataHub UI. Navigate to the dataset pages for your Iceberg and Snowflake tables and view the lineage tab to see upstream and downstream dependencies.
It’s important to note that the ability to capture lineage between Iceberg and Snowflake will depend on the specific operations and transformations performed in your data pipelines. If there are SQL queries or data transformations that involve both Iceberg and Snowflake tables, DataHub should be able to capture and represent this lineage.
Would you like more detailed steps on configuring the ingestion recipes or any specific aspect of lineage extraction?
Sources:
- <https://datahubspace.slack.com/archives/CUMUWQU66/p1720578726.113079>
- <https://github.com/datahub-project/datahub/pull/11384>
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><