<@U06TM7M40RK> in the datahub spark listener, is there a way to strip partitions from s3 paths?
Yes, in the DataHub Spark Listener, you can strip partitions from S3 paths using the spark.datahub.metadata.remove_partition_pattern configuration option. This parameter lets you specify a regular expression whose matches are stripped from dataset URNs. According to the <https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/acryl-spark-lineage/README.md|Spark Lineage Beta README>, you can configure it as follows:
spark.datahub.metadata.remove_partition_pattern=/partition=\d+
This configuration will transform paths like database/table/partition=123 to database/table by removing the matched partition segment.
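For example, here is a minimal PySpark sketch of wiring this up when building a Spark session. The package version, server address, and application name are illustrative placeholders, so adjust them for your environment:

```python
from pyspark.sql import SparkSession

# Minimal sketch: attach the DataHub Spark listener and strip
# "partition=<digits>" segments from emitted dataset URNs.
# The jar version and GMS address below are placeholders.
spark = (
    SparkSession.builder
    .appName("partition-strip-example")
    .config("spark.jars.packages", "io.acryl:acryl-spark-lineage:0.2.16")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "http://localhost:8080")
    .config("spark.datahub.metadata.remove_partition_pattern", r"/partition=\d+")
    .getOrCreate()
)
```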
If you need further assistance or a more detailed explanation, feel free to ask!
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/acryl-spark-lineage/README.md|Spark Lineage Beta README>
<@U06TM7M40RK> What is the difference between the options spark.datahub.metadata.remove_partition_pattern and spark.datahub.file_partition_regexp?
It seems likely that the options spark.datahub.metadata.remove_partition_pattern and spark.datahub.file_partition_regexp serve different purposes in the context of DataHub’s Spark integration, particularly for handling partition patterns in dataset paths.
- spark.datahub.metadata.remove_partition_pattern: This option removes partition patterns from dataset URNs. It allows you to specify a regular expression pattern to strip partition information from the dataset path. For example, setting this option to /partition=\d+ would transform paths like database/table/partition=123 to database/table. This is useful for normalizing dataset URNs by removing dynamic partition information.
- spark.datahub.file_partition_regexp: This option strips the partition part from a path if the path ends with the specified regular expression. It operates on file paths directly, rather than on the URN. For instance, you might use a pattern like year=.*/month=.*/day=.* to remove date-based partitioning from file paths. This option is particularly useful when you want to manage how file paths are represented in the metadata.
The distinction between these two options is that remove_partition_pattern is more about normalizing dataset URNs by removing partition information, while file_partition_regexp is about managing how file paths are processed and represented, especially when dealing with specific partitioning schemes like date-based partitions.
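To make the contrast concrete, here is a small standalone Python sketch that mimics the two removal semantics with plain regular expressions; it illustrates the patterns themselves, not the listener’s internal code:

```python
import re

urn_path = "database/table/partition=123"
file_path = "s3://bucket/events/year=2024/month=06/day=01"

# remove_partition_pattern style: delete the matched partition segment
# from the dataset URN path.
print(re.sub(r"/partition=\d+", "", urn_path))
# -> database/table

# file_partition_regexp style: strip a trailing date-partition suffix
# from the file path.
print(re.sub(r"/year=.*/month=.*/day=.*$", "", file_path))
# -> s3://bucket/events
```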
If you need more detailed information or examples, please let me know!
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/acryl-spark-lineage/README.md|Spark Lineage Beta README>
<@U06TM7M40RK> can you give an example of using the spark.datahub.metadata.remove_partition_pattern option?
Certainly! The spark.datahub.metadata.remove_partition_pattern option is used to remove partition patterns from dataset URNs in DataHub’s Spark integration. Here’s an example of how you can use this option:
Suppose you have a dataset path like s3://my-bucket/data/table/partition=2023/01/01, and you want to remove the partition information so that the dataset is represented as s3://my-bucket/data/table in DataHub. You can achieve this by setting the spark.datahub.metadata.remove_partition_pattern option with an appropriate regular expression.
Here’s how you can configure it:
spark.datahub.metadata.remove_partition_pattern=/partition=\d+/\d+/\d+
In this example, the regular expression /partition=\d+/\d+/\d+ matches and removes the partition part of the path, which consists of a year, month, and day. This configuration will transform paths like s3://my-bucket/data/table/partition=2023/01/01 to s3://my-bucket/data/table.
This option is particularly useful for normalizing dataset URNs by removing dynamic partition information, which can vary over time.
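You can sanity-check the pattern locally before putting it in your Spark config. This quick Python check mimics the removal with re.sub, rather than reproducing the listener’s internal code:

```python
import re

path = "s3://my-bucket/data/table/partition=2023/01/01"
pattern = r"/partition=\d+/\d+/\d+"

# The listener removes the matched span; substituting an empty string
# mimics that behavior.
print(re.sub(pattern, "", path))
# -> s3://my-bucket/data/table
```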
If you need further assistance or more examples, feel free to ask!
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/acryl-spark-lineage/README.md|Spark Lineage Beta README>
<@U06TM7M40RK> what if I want to replace all characters after the equal sign until I get to the next partition?
To strip all characters after the equal sign up to the next path separator, you can use a regular expression with the spark.datahub.metadata.remove_partition_pattern option. Keep in mind that this option removes whatever the pattern matches; it does not support substituting a replacement value.
Here’s an example configuration:
spark.datahub.metadata.remove_partition_pattern=/([^/]+)=([^/]+)
In this example, the regular expression /([^/]+)=([^/]+) matches a partition key-value pair in the path, including the leading slash, so the entire key=value segment is removed. If you instead want to keep the partition key and drop only the value after the equal sign, a narrower pattern such as =[^/]+ would match just the equal sign and the value.
Note that the option itself cannot substitute a placeholder such as “REPLACED” for the matched values; it only removes them. If you need that kind of rewriting, it would have to happen outside the listener, for example in custom post-processing of the emitted metadata.
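As a quick local check of the two patterns discussed above, the following plain-Python sketch mimics the removal semantics (the bucket and partition names are made up):

```python
import re

path = "s3://bucket/db/table/year=2024/month=06"

# /([^/]+)=([^/]+) removes each whole key=value segment, key included.
print(re.sub(r"/[^/]+=[^/]+", "", path))
# -> s3://bucket/db/table

# =[^/]+ keeps the keys and drops only the values after the equal sign.
print(re.sub(r"=[^/]+", "", path))
# -> s3://bucket/db/table/year/month
```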
If you need more detailed steps or further assistance, please let me know!
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/acryl-spark-lineage/README.md|Spark Lineage Beta README>
<@U06TM7M40RK> Does the config spark.datahub.metadata.remove_partition_pattern=/([^/]+)=([^/]+) remove just 1 partition?
The configuration spark.datahub.metadata.remove_partition_pattern=/([^/]+)=([^/]+) matches a single partition key=value segment per match. Whether one application removes one pair or all pairs depends on how the pattern is applied: a replace-first style application removes only the first match, while a replace-all style application removes every non-overlapping match in a single pass. If only the first match is removed in your environment, you would need to apply the pattern iteratively to strip all pairs.
For example, if your path is s3://bucket-name/database/case/table/partition_key1=value1/partition_key2=value2, removing only the first match would strip partition_key1=value1 and leave partition_key2=value2, whereas a replace-all application would strip both in one pass.
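The difference between the two semantics is easy to see in plain Python, independent of the listener:

```python
import re

path = ("s3://bucket-name/database/case/table/"
        "partition_key1=value1/partition_key2=value2")
pattern = r"/([^/]+)=([^/]+)"

# Replace-first semantics: only the first key=value segment is removed.
print(re.sub(pattern, "", path, count=1))
# -> s3://bucket-name/database/case/table/partition_key2=value2

# Replace-all semantics: every matching segment goes in a single pass.
print(re.sub(pattern, "", path))
# -> s3://bucket-name/database/case/table
```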
If you need more detailed guidance on how to implement this in your specific context, please let me know!
Sources:
- <https://datahubspace.slack.com/archives/CUMUWQU66/p1720625480.052599|DataHub Slack discussion>
- <https://datahubproject.io/docs/0.13.1/metadata-integration/java/spark-lineage#configuration-instructions--databricks|Spark Lineage configuration instructions (Databricks)>