Including the Catalog Name in Spark Emitted Table Names in Databricks

Original Slack Thread

When using the Spark listener in Databricks, is it possible to include the catalog name as part of the emitted table name?

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g. <@U06TM7M40RK>) to have it continue generating responses!

To include the catalog name as part of the emitted table name when using the Spark listener in Databricks, you need to configure the Spark lineage emitter accordingly. According to the linked Slack discussion, the user had to concatenate the catalog name onto the platform instance value in the spark.datahub.metadata.dataset.platformInstance configuration so that the catalog name is included in the dataset URN. Without this workaround, the upstream and downstream table URNs are missing the catalog component.
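Below is a minimal sketch of that workaround, assuming the DataHub Spark lineage listener (acryl-spark-lineage). The instance name my_instance, the catalog name, and the DataHub endpoint are placeholders, and on Databricks these settings would normally be set in the cluster's Spark config rather than in a SparkSession builder:

```python
from pyspark.sql import SparkSession

catalog_name = "main"  # hypothetical Unity Catalog name

spark = (
    SparkSession.builder
    .appName("lineage-example")
    # Attach the DataHub Spark lineage listener.
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    # Placeholder DataHub GMS endpoint.
    .config("spark.datahub.rest.server", "http://datahub-gms:8080")
    # Workaround: fold the catalog name into the platform instance so the
    # emitted dataset URNs carry the catalog component.
    .config(
        "spark.datahub.metadata.dataset.platformInstance",
        f"my_instance.{catalog_name}",
    )
    .getOrCreate()
)
```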

However, this approach may not scale if a single job runs tasks against Unity Catalog tables from multiple catalogs, since the platform instance is set once per Spark application. The available sources do not describe a more scalable solution or a direct configuration option that includes the catalog name automatically, without this manual concatenation.
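For illustration only, one way to cope with the per-application limitation is to submit a separate run per catalog, each with its own platform instance value. This is a hedged sketch, not a documented pattern; the job script, instance name, and catalog list are hypothetical:

```python
import subprocess

# Hypothetical catalogs handled by the same logical job.
for catalog in ["sales", "marketing"]:
    subprocess.run(
        [
            "spark-submit",
            "--conf", "spark.extraListeners=datahub.spark.DatahubSparkListener",
            # One platform instance per run, so each run's emitted
            # URNs carry the correct catalog component.
            "--conf",
            f"spark.datahub.metadata.dataset.platformInstance=my_instance.{catalog}",
            "my_job.py", "--catalog", catalog,
        ],
        check=True,
    )
```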

Would you like more detailed information on how to set up this configuration or explore other related aspects?

Sources:

  • Slack discussion

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><