Setting Up BigQuery as an Ingestion Source for DataHub on GKE

Original Slack Thread

<@U06TM7M40RK> I have deployed DataHub Cloud on GKE. I want to set up BigQuery as an ingestion source. How can I do that?

To set up BigQuery as an ingestion source for DataHub on Google Kubernetes Engine (GKE), you can follow these steps:

  1. Set Up a Service Account:

    • In your GCP project, create a service account and grant it the roles needed to read BigQuery metadata (for example, BigQuery Metadata Viewer, BigQuery Data Viewer, and BigQuery Job User).
  2. Create a Service Account Key:

    • Generate a JSON key for the service account and download it. This key is what DataHub will use to authenticate.
  3. Configure Secrets in DataHub:

    • In the DataHub UI, go to Ingestion > Secrets and store the key's fields (such as the private key) as secrets so your recipe can reference them without embedding credentials in plain text.
  4. Create an Ingestion Recipe:

    • Write a recipe that uses the bigquery source type, points at your project, and references the stored secrets for credentials.
  5. Run the Ingestion:

    • Use the DataHub CLI or UI to run the ingestion process. You can schedule the ingestion to run at regular intervals if needed.
  6. Verify the Ingestion:

    • After running the ingestion, check the DataHub UI to verify that the metadata has been correctly ingested and is visible.
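
As a sketch of step 4, a minimal recipe might look like the following. The project ID, service account email, secret names, and sink address are all placeholders you would replace with your own values:

```yaml
source:
  type: bigquery
  config:
    project_ids:
      - my-gcp-project                  # placeholder: your GCP project ID
    credential:
      project_id: my-gcp-project
      private_key_id: "${BQ_PRIVATE_KEY_ID}"   # secrets stored in DataHub
      private_key: "${BQ_PRIVATE_KEY}"
      client_email: "datahub-ingest@my-gcp-project.iam.gserviceaccount.com"
      client_id: "123456789"
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"   # placeholder: your GMS endpoint
```

If you create the ingestion source through the UI, you can omit the `sink` section; for the CLI, save this as `recipe.yaml` and run `datahub ingest -c recipe.yaml`.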

Would you like more detailed steps or additional information on any specific part of the process?

<@U06TM7M40RK> I don't want to use service account keys because of security constraints; I'd rather use Workload Identity Federation. Is that possible?

It seems likely that using Workload Identity Federation (WIF) for BigQuery ingestion in DataHub is not currently supported out-of-the-box. According to discussions in the Slack community and <https://forum.datahubproject.io/t/using-oauth-tokens-for-bigquery-authentication-with-aws-workload-identity/475|forum posts>, there is interest in supporting WIF, but it is not yet implemented.

Currently, DataHub primarily supports authentication using a JSON key file for BigQuery ingestion. However, there are some workarounds and community contributions that might help you achieve your goal. For example, you can modify the BigQuery ingestion code to use OAuth tokens or other forms of credentials.
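
If you do experiment with a workaround on GKE, the cluster side of Workload Identity is configured by binding a Kubernetes service account to a Google service account and annotating it. A sketch, assuming a hypothetical `datahub-actions` Kubernetes service account in a `datahub` namespace running the ingestion pods:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: datahub-actions        # placeholder: the KSA your ingestion pods run as
  namespace: datahub
  annotations:
    iam.gke.io/gcp-service-account: datahub-ingest@my-gcp-project.iam.gserviceaccount.com
```

Note that this only makes Google's client libraries pick up the identity via Application Default Credentials; it does not by itself change the DataHub BigQuery source, which currently expects key-file credentials in the recipe.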

If you are open to contributing, you might consider raising a pull request to add support for WIF or service account impersonation. The community is generally supportive of such contributions.

Would you like more detailed guidance on any of these points or help with a specific workaround?

<@U06TM7M40RK> Both the Slack thread and the link you gave are inconclusive. Can you talk with someone from Acryl about whether this is possible to implement?

It seems likely that the current documentation and community discussions do not provide a definitive answer regarding the implementation of Workload Identity Federation (WIF) for BigQuery ingestion in DataHub. Unfortunately, I don’t have the capability to directly communicate with Acryl representatives or any external parties.

However, you can reach out to the Acryl team directly for more specific guidance. According to the <https://datahubproject.io/docs/0.14.0/managed-datahub/managed-datahub-overview#more-questions|DataHub Cloud documentation>, you can fill out a form to have someone from the Acryl team contact you. This might be the best way to get a definitive answer or assistance with implementing WIF for your use case.

If you have any other questions or need further assistance, feel free to ask!

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><