Troubleshooting Data Ingestion Issues from Databricks to DataHub

Original Slack Thread

Hi Team,

I am facing an issue while ingesting data from Databricks, and I would appreciate your help. Below is the error message that I am getting:
"Execution finished with errors.
{'exec_id': '54d01ae5-6f58-4506-becd-20dad04be90c',
 'infos': ['2023-10-18 17:07:21.194599 INFO: Starting execution for task with name=RUN_INGEST',
  "2023-10-18 17:07:35.870711 INFO: Failed to execute 'datahub ingest', exit code 1",
  '2023-10-18 17:07:35.875322 INFO: Caught exception EXECUTING task_id=54d01ae5-6f58-4506-becd-20dad04be90c, name=RUN_INGEST, '
  'stacktrace=Traceback (most recent call last):\n'
  '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 140, in execute_task\n'
  '    task_event_loop.run_until_complete(task_future)\n'
  '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
  '    return future.result()\n'
  '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 282, in execute\n'
  '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
  "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': }"

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


Sure, I can help with that! To assist you more effectively, could you please provide the following information?

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.13.1)
  3. What data source(s) are you integrating with DataHub? (e.g., Databricks)

Additionally, please share the ingestion recipe or configuration you are using. This will help in diagnosing the issue more accurately.

Based on the error message you provided, it seems like there might be an issue with the ingestion task execution. According to the <Ingestion | DataHub documentation>, common reasons for ingestion failures include:

  • Recipe Misconfiguration: Ensure that your recipe provides the required or expected configuration for the ingestion source.
  • Failure to Resolve Secrets: Verify that the names of the secrets referenced in your recipe match those that have been created (see the sketch after this list).
  • Connectivity/Network Reachability: Ensure that the network where DataHub is deployed has access to the data source you are trying to reach.
  • Authentication Issues: If you've enabled Metadata Service Authentication, make sure to provide a Personal Access Token in your recipe configuration.
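
To illustrate the recipe and secrets points above, here is a minimal sketch of a UI-managed recipe that references a secret. The secret name DATABRICKS_TOKEN and the workspace URL are hypothetical placeholders; the ${...} reference must exactly match the name of a secret created in the DataHub UI.

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    # ${DATABRICKS_TOKEN} must match a secret created under Ingestion > Secrets;
    # a name mismatch shows up as a failure to resolve secrets.
    token: '${DATABRICKS_TOKEN}'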

Would you like more detailed steps on any of these troubleshooting points or further assistance on a specific part of the process?

Sources:

  • <Ingestion | DataHub documentation>

Hi Team,

When I tested the connection from DataHub to Databricks Unity Catalog, I got the error below:

"Databricks Unity Catalog Connection Test
Connection Failed
A connection was not able to be established with Databricks Unity Catalog.

Internal FailureFailed to parse config due to 1 validation error for UnityCatalogSourceConfig root When include_hive_metastore is set, warehouse_id must be set. (type=value_error)"

Kindly suggest.

<@U06TM7M40RK> Please find the details below.

  1. Are you using UI or CLI for ingestion? UI only.
  2. Which DataHub version are you using? v0.13.3rc1.
  3. What data source(s) are you integrating with DataHub? Databricks.

Thank you for providing the additional details. Based on the error message and the information you provided, it looks like the issue is related to the include_hive_metastore setting in your configuration. When include_hive_metastore is set to true, you must also provide a warehouse_id.

Steps to Resolve the Issue

  1. Set warehouse_id: Ensure that you have specified a warehouse_id in your configuration. This is required when include_hive_metastore is set to true.

  2. Disable include_hive_metastore: If you do not need to include the Hive Metastore, you can set include_hive_metastore to false.

Here is an example configuration for both scenarios:

Example with warehouse_id:

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    # other configurations

Example without include_hive_metastore:

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    include_hive_metastore: false
    # other configurations

Additional Considerations

  • Permissions: Ensure that your Databricks service principal has the necessary permissions as outlined in the <Databricks | DataHub Unity Catalog documentation>.
  • Profiling Configuration: If you are using profiling, make sure that the warehouse_id is also set in the profiling configuration.

Example Profiling Configuration

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    profiling:
      method: "ge"
      enabled: true
      warehouse_id: 'your-warehouse-id'
    # other configurations

Would you like more detailed steps or further assistance on a specific part of the process?


Great, the connection succeeded. Thanks <@U06TM7M40RK> for your help.

But when we try to ingest some sample data, we get an error with the following recipe:

"

  type: unity-catalog
  config:
    workspace_url: '<https://your-databricks-workspace-url>'
    token: 'your-databricks-token'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    profiling:
      method: "ge"
      enabled: true
      warehouse_id: 'your-warehouse-id'
    # other configurations```
In the error message we can see that we can provide a warehouse ID, but we don't have any such thing. We have a cluster; in that cluster we have a metastore (we have a metastore ID), in the metastore we have schemas (we have schema IDs), and in those schemas we have tables holding the data:

(cluster -> metastore -> schema -> tables -> data)

We are trying to ingest that data into DataHub.

Kindly suggest.

It looks like you are trying to ingest data from Databricks Unity Catalog into DataHub, but you are encountering issues related to the warehouse_id. The warehouse_id is required when include_hive_metastore is set to true, and it is also necessary for profiling.

Understanding the warehouse_id

In Databricks, a SQL Warehouse (formerly known as SQL Endpoint) is required to run SQL queries. The warehouse_id is the identifier for this SQL Warehouse. If you do not have a SQL Warehouse, you will need to create one in your Databricks workspace.

Steps to Create a SQL Warehouse

  1. Navigate to SQL Warehouses:

    • Go to your Databricks workspace.
    • Click on the SQL icon on the sidebar.
    • Navigate to the SQL Warehouses tab.
  2. Create a New SQL Warehouse:

    • Click on the “Create SQL Warehouse” button.
    • Fill in the required details such as name, cluster size, and other configurations.
    • Click “Create” to provision the SQL Warehouse.
  3. Obtain the warehouse_id:

    • Once the SQL Warehouse is created, you can find its warehouse_id in the URL of the SQL Warehouse details page or through the Databricks API (see the sketch after this list).
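
As a rough sketch of where to look (the hostname and ID below are hypothetical placeholders), both the details page URL and the warehouse's "HTTP path" in its connection details end in the warehouse_id, which is the value your recipe needs:

# SQL Warehouse details page URL -- the last path segment is the warehouse_id:
#   https://your-databricks-workspace-url/sql/warehouses/1234567890abcdef
# Connection details "HTTP path" for the same warehouse:
#   /sql/1.0/warehouses/1234567890abcdef
warehouse_id: '1234567890abcdef'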

Example Configuration with warehouse_id

Here is how your configuration should look once you have the warehouse_id:

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    profiling:
      method: "ge"
      enabled: true
      warehouse_id: 'your-warehouse-id'
    # other configurations
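
If you only want to ingest a subset of catalogs, schemas, or tables, the unity-catalog source also supports allow/deny regex patterns. Here is a minimal sketch assuming the standard catalog_pattern/table_pattern options; the catalog and table names used are hypothetical:

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    warehouse_id: 'your-warehouse-id'
    # Only ingest the 'samples' catalog and skip any tables ending in _tmp.
    catalog_pattern:
      allow:
        - '^samples$'
    table_pattern:
      deny:
        - '.*_tmp$'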

Additional Configuration

Make sure that your Databricks service principal has the necessary permissions to access the SQL Warehouse, metastore, schemas, and tables. The permissions required are detailed in the <Databricks | DataHub Unity Catalog documentation>.

Example Permissions

  • Metastore Admin Role: Required for accessing the metastore.
  • USE CATALOG Privilege: Required for accessing catalogs.
  • USE SCHEMA Privilege: Required for accessing schemas.
  • SELECT Privilege: Required for accessing tables.

Troubleshooting

If you encounter any issues, please check the following:

  • Ensure the warehouse_id is correctly specified.
  • Verify that the service principal has the necessary permissions.
  • Check the DataHub logs for any specific error messages that can provide more insight.

Would you like more detailed steps on any of these points or further assistance with a specific part of the process?


Hi <@U06TM7M40RK>, thanks a lot for your assistance.
I need some more assistance from you, if possible: is there any way to ingest sample data as well? We didn't find any option for that.
Kindly check the attached screenshot (image.png).
