Switching Database in DataHub's Quickstart Setup

Original Slack Thread

<@U06TM7M40RK> If we use PostgreSQL instead of MySQL as the metastore in DataHub, how should the quickstart be set up?

To switch from MySQL to PostgreSQL in DataHub’s quickstart setup, you can follow these steps:

  1. Use the PostgreSQL Docker Compose File: According to a <Slack conversation>, you can replace the MySQL configuration in your Docker Compose file with the PostgreSQL configuration. You can refer to the <https://github.com/datahub-project/datahub/blob/master/docker/postgres/docker-compose.postgres.yml|docker-compose.postgres.yml> file as a guide.

  2. Modify the Docker Compose File: Replace the MySQL service and environment variables with those for PostgreSQL. Ensure that the PostgreSQL service is properly configured with the necessary environment variables such as POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB.

  3. Update Environment Variables: Ensure that the environment variables in your Docker Compose file are set to connect to the PostgreSQL database. This includes setting the DATAHUB_DB_NAME and other relevant variables as shown in another <Slack conversation>.

  4. Run the Setup: Once the Docker Compose file is updated, you can run the setup using Docker Compose commands to bring up the DataHub services with PostgreSQL as the backend.
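
As a rough sketch, the relevant portion of a PostgreSQL-based compose file might look like the following. The service and variable names here are assumptions modeled on the upstream docker-compose.postgres.yml; check them against the actual file before use:

```yaml
# Hypothetical excerpt, modeled on docker/postgres/docker-compose.postgres.yml
services:
  postgres:
    image: postgres
    environment:
      - POSTGRES_USER=datahub
      - POSTGRES_PASSWORD=datahub
      - POSTGRES_DB=datahub
    ports:
      - "5432:5432"

  datahub-gms:
    environment:
      # Point the GMS datasource at PostgreSQL instead of MySQL
      - EBEAN_DATASOURCE_HOST=postgres:5432
      - EBEAN_DATASOURCE_URL=jdbc:postgresql://postgres:5432/datahub
      - EBEAN_DATASOURCE_DRIVER=org.postgresql.Driver
      - EBEAN_DATASOURCE_USERNAME=datahub
      - EBEAN_DATASOURCE_PASSWORD=datahub
```

After editing the file, `docker compose up -d` with that file should bring up the stack against PostgreSQL.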

Would you like more detailed instructions on any of these steps?

Sources:

  • <Slack conversation on changing to PostgreSQL>
  • <Slack conversation on PostgreSQL configuration>

<@U06TM7M40RK> Is there no way to specify the meta-store DB via a CLI command?

I’m not able to find any data sources to answer your question. You might want to specify which CLI tool or command you are referring to within DataHub. If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@U06TM7M40RK> Can a different DB be specified instead of MySQL via a CLI command?

The data sources provided do not explicitly mention a CLI command for directly specifying a different database (such as PostgreSQL) instead of MySQL for the DataHub meta-store. However, you can configure the database connection by modifying the environment variables or configuration files used by DataHub.

For instance, when deploying DataHub using Docker or Helm, you can specify the database type and connection details in the configuration files or environment variables. This is typically done by setting the appropriate JDBC URL, driver, username, and password for the desired database (e.g., PostgreSQL).

If you are using Docker, you can modify the docker-compose.yml file to use PostgreSQL by setting the appropriate environment variables for the DataHub services. Similarly, if you are using Helm, you can update the values.yaml file to configure the database connection.
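
For the Helm path, the database connection is typically expressed in values.yaml rather than on the CLI. A hedged sketch follows; the key names are assumptions based on the datahub Helm chart and should be verified against the chart version you deploy:

```yaml
# Hypothetical values.yaml excerpt for the datahub Helm chart
global:
  sql:
    datasource:
      host: "postgres:5432"
      port: "5432"
      url: "jdbc:postgresql://postgres:5432/datahub"
      driver: "org.postgresql.Driver"
      username: "datahub"
      password:
        secretRef: postgresql-secrets   # Kubernetes Secret holding the DB password
        secretKey: postgres-password
```
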

If you need more detailed guidance on how to configure these settings, please let me know!

<@U06TM7M40RK> When operating on Kubernetes, can Neo4j be run as well?

Yes, you can operate Neo4j as part of your DataHub deployment on Kubernetes. Here are the steps and considerations based on the available data sources:

  1. Neo4j as a Graph Backend: DataHub supports using Neo4j as a graph backend. You can configure DataHub to use Neo4j by setting the GRAPH_SERVICE_IMPL environment variable to neo4j. This can be done in your Kubernetes deployment configurations.

  2. Kubernetes Deployment: When deploying DataHub on Kubernetes, you can include Neo4j as part of your setup. According to the <https://datahubproject.io/docs/0.14.0/deploy/kubernetes|DataHub Kubernetes deployment guide>, you can deploy Neo4j along with other dependencies like Elasticsearch, MySQL, and Kafka using Helm charts.

  3. Configuration: You will need to configure the values.yaml file to enable Neo4j. Set the enabled field for Neo4j to true in the prerequisites chart and configure the connection details accordingly.

  4. Managed Services: If you prefer not to manage Neo4j yourself, you can use a managed Neo4j service and configure DataHub to connect to it by specifying the appropriate connection details in your Kubernetes setup.

Would you like more detailed steps on configuring Neo4j in your Kubernetes deployment?

<@U06TM7M40RK> Can the Neo4j Community Edition also be run on Kubernetes?

Yes, you can operate the Neo4j Community Edition on Kubernetes as part of your DataHub deployment. Here are the steps and considerations:

  1. Helm Charts: DataHub provides Helm charts that include Neo4j as one of the dependencies. You can use these charts to deploy Neo4j Community Edition on your Kubernetes cluster. According to the <https://datahubproject.io/docs/0.14.0/deploy/kubernetes|DataHub Kubernetes deployment guide>, you can deploy Neo4j along with other dependencies like Elasticsearch, MySQL, and Kafka.

  2. Configuration: In the values.yaml file for the prerequisites chart, you can enable Neo4j by setting the enabled field to true. You can also configure other settings such as the Neo4j password and database name.

  3. Resource Requirements: Ensure that your Kubernetes cluster has sufficient resources to run Neo4j, as it can be resource-intensive depending on your data size and query complexity.

  4. Metrics and Monitoring: If you are interested in extracting metrics from the Neo4j Community Edition pod, you might face limitations as it does not natively support some enterprise features like advanced monitoring. However, you can still use basic monitoring tools available in Kubernetes.

Would you like more detailed instructions on deploying Neo4j Community Edition on Kubernetes?

Hi :slight_smile: ! Did this interaction resolve your issue or do you still require assistance ?

Please tag @RunLLM if you still need further assistance !

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps ! ><