<@U06TM7M40RK> if I am using the ingestion for a source using a DataHub connector, and also a different method (e.g. an LLM service) to write lineage to DataHub, is there any risk that the ingestion will overwrite the lineage and additional metadata written by the LLM service? (For context: I will run the ingestion using the connector daily, but the LLM service once a week.)
When using multiple methods to write lineage to DataHub, such as a DataHub connector for ingestion and a separate LLM service, there is a risk of overwriting lineage and other metadata. According to the <https://datahubproject.io/docs/0.14.0/generated/lineage/lineage-feature-guide|DataHub Lineage Feature Guide>, lineage added manually or programmatically may conflict with lineage extracted automatically, leading to unwanted overwrites: the ingestion source's picture of lineage is presumed to be more accurate and can replace lineage added by other means.
To mitigate this risk, manage the timing and scope of each method's writes carefully. In your setup, each daily connector run could replace the lineage the weekly LLM service wrote unless the connector emits lineage incrementally. Enabling incremental lineage on the connector, and having the LLM service emit its lineage as patches rather than full aspect writes, minimizes these conflicts.
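If you control how the LLM service writes its lineage, emitting it as a patch is one way to add lineage without restating the whole aspect. Below is a minimal sketch, assuming the DataHub Python SDK's DatasetPatchBuilder and REST emitter; the URNs, platform, and GMS endpoint are illustrative placeholders, not values from your setup.

    # Hedged sketch: the LLM service emits an upstream lineage edge as a PATCH,
    # so it adds to existing lineage instead of restating the whole aspect.
    # All URNs and the GMS endpoint are illustrative placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetLineageTypeClass, UpstreamClass
    from datahub.specific.dataset import DatasetPatchBuilder

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    downstream = make_dataset_urn(platform="snowflake", name="db.schema.target_table", env="PROD")
    upstream = make_dataset_urn(platform="snowflake", name="db.schema.source_table", env="PROD")

    # Build a patch that adds one upstream edge to the downstream dataset.
    patch = DatasetPatchBuilder(downstream).add_upstream_lineage(
        UpstreamClass(dataset=upstream, type=DatasetLineageTypeClass.TRANSFORMED)
    )

    # Each patch builds into one or more MetadataChangeProposals the REST emitter can send.
    for mcp in patch.build():
        emitter.emit(mcp)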
Would you like more details on how to manage these updates or any specific configurations that might help in your scenario?
Sources:
- <https://datahubproject.io/docs/0.14.0/generated/lineage/lineage-feature-guide|DataHub Lineage Feature Guide>
<@U06TM7M40RK> is there a configuration in the ingestion recipe to avoid overwriting some values from ingestion if they already exist in DataHub?
Yes, there is a way to configure your ingestion recipe to avoid overwriting certain metadata values in DataHub if they already exist. This can be achieved using PATCH semantics in transformers, which add or update metadata without overwriting existing values.
For example, if you want to add tags to datasets without overwriting existing tags, you can use the simple_add_dataset_tags transformer with the semantics: PATCH configuration. Here's how you can configure your ingestion recipe:
transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
      semantics: PATCH  # This ensures existing tags are preserved
This configuration adds the specified tags to the datasets without overwriting any tags that are already present. Note that semantics: PATCH generally requires the pipeline to reach your DataHub instance (for example via the datahub-rest sink), since existing metadata has to be fetched before it is merged.
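If the tags come from a separate process (such as the LLM service) rather than the ingestion recipe, the Python SDK's patch builder offers similar add-without-replace behavior. Here is a minimal sketch, assuming DatasetPatchBuilder and a REST emitter; the dataset URN, tag URN, and GMS endpoint are placeholders:

    # Hedged sketch: add a tag to a dataset via a PATCH so existing tags are kept.
    # The dataset URN, tag URN, and GMS endpoint are illustrative placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import TagAssociationClass
    from datahub.specific.dataset import DatasetPatchBuilder

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    dataset_urn = make_dataset_urn(platform="snowflake", name="db.schema.my_table", env="PROD")

    # The patch adds one tag association rather than replacing the globalTags aspect.
    patch = DatasetPatchBuilder(dataset_urn).add_tag(TagAssociationClass(tag="urn:li:tag:NeedsDocumentation"))
    for mcp in patch.build():
        emitter.emit(mcp)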
Would you like more detailed instructions or have any specific questions about this process?
Sources:
- https://datahubspace.slack.com/archives/CUMUWQU66/p1726216517.161159
- https://datahubspace.slack.com/archives/CUMUWQU66/p1722867483.803079
<@U06TM7M40RK> is it possible not to overwrite the lineage?
Yes, it is possible to configure your ingestion process to avoid overwriting existing lineage in DataHub. This can be achieved with the incremental_lineage configuration option, which emits lineage incrementally rather than restating it on each run.
When incremental_lineage is set to true, the ingestion process adds new lineage information to the existing data without replacing it. This is particularly useful if you want to preserve the history of lineage changes over time and avoid overwriting manually added lineage.
Here is an example configuration for enabling incremental lineage in a Snowflake ingestion:
pipeline_name: snowflake-lineage-ingestion
source:
  type: snowflake
  config:
    account_id: ${SNOWFLAKE_ACCOUNT_ID}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    username: ${SNOWFLAKE_USERNAME}
    password: ${SNOWFLAKE_PASSWORD}
    role: ${SNOWFLAKE_ROLE}
    include_view_lineage: true
    include_table_lineage: true
    include_usage_stats: true
    include_column_lineage: true
    incremental_lineage: true  # Enable incremental lineage
    stateful_ingestion:
      enabled: true
sink:
  type: "datahub-rest"
  config:
    server: ${DATAHUB_GMS_ENDPOINT}
This configuration ensures that new lineage information is added without overwriting existing lineage data.
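As a usage note: if you save the recipe above to a file (for example snowflake_lineage.yaml, a placeholder name), you would typically run it on your daily schedule with the DataHub CLI, e.g. datahub ingest -c snowflake_lineage.yaml, and let the weekly LLM service write its lineage separately with patch-style emits as sketched earlier.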
Would you like more details on this configuration or have any specific questions about implementing it?
Sources:
- https://datahubspace.slack.com/archives/CUMUWQU66/p1716403406.590709
- https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/file-based-lineage
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><