Deploying DataHub via Docker and Moving to Production: Security Concerns and Best Practices

Original Slack Thread

Hey folks!

We’re exploring DataHub for our company. I just tried the quick start and it was GREAT.

We would like to deploy via Docker but I saw this in the documentation: https://datahubproject.io/docs/quickstart/#move-to-production

But then there’s a deploying with Docker guide? https://datahubproject.io/docs/docker

Would someone mind shedding a little light on deploying via Docker in production? Are there any security concerns?

In particular this piece. Does this mean DataHub services can be accessed by anyone? Or just anyone with access to the machine Docker is running on?

Exposed Ports
DataHub’s services, and its backend data stores, use the Docker default behavior of binding to all interface addresses. This makes it useful for development but is not recommended in a production environment.

Would also love to learn more about this - is there a guide about the steps needed to move from Quickstart to production (if not, I would write one once I’m finished). We are currently thinking about using Kubernetes for production, but would also be open to continue with Docker.

I can provide a bit of insight:

By default, DataHub services can be accessed by anyone who has network access to the Docker containers or Kubernetes pods. Using K8s (or cloud provider security policies), you can control which ports on each pod are exposed to the outside world using proxies. The key services by default will appear on localhost:9002, localhost:8080, localhost:9092, localhost:3306, … and a few others on the local machine where DataHub is deployed.
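To see which of those ports are actually reachable from another machine, a quick TCP probe can help. This is only a rough sketch: the port list is copied from the message above, the function names are made up for illustration, and a successful connection says nothing about which service answered.

```python
import socket

# Ports mentioned in the reply above. Which service sits behind each one
# depends on your deployment, so treat this list as a starting point.
DATAHUB_PORTS = [9002, 8080, 9092, 3306]


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def scan(host: str, ports=DATAHUB_PORTS) -> dict:
    """Map each port to whether it accepted a TCP connection from this machine."""
    return {port: is_port_open(host, port) for port in ports}
```

Run `scan("<address of your Docker host>")` from a machine that should not have access. Ideally only the frontend port comes back as reachable.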

If you are not publicly exposing the ports where DataHub runs (e.g. localhost:8080) outside of the node where DataHub is deployed, there is no security concern. But when you deploy, you are expected to understand how to set up security rules that prevent unauthorized access to the internal DataHub services, surfacing only the officially public endpoints (localhost:9002, where the DataHub frontend service is served).
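With plain Docker, one way to do this is to publish internal ports on the loopback interface only. Docker accepts an `IP:HOST_PORT:CONTAINER_PORT` form for published ports, and a mapping without an IP binds to all interfaces. Below is a minimal sketch that flags mappings not restricted to loopback. The helper names, the example mappings, and the choice of 9002 as the only public port are assumptions for illustration, not taken from DataHub's own compose files.

```python
from typing import Iterable, List, Optional

LOOPBACK = "127.0.0.1"


def host_ip(mapping: str) -> Optional[str]:
    """Return the host IP of a Docker short-syntax port mapping, or None.

    Handles "CONTAINER", "HOST:CONTAINER" and "IP:HOST:CONTAINER" (IPv4 only).
    An optional "/tcp" or "/udp" suffix is ignored.
    """
    parts = mapping.split("/")[0].split(":")
    return parts[0] if len(parts) == 3 else None


def non_loopback_mappings(
    mappings: Iterable[str],
    public_ports: frozenset = frozenset({"9002"}),  # assumed: frontend only
) -> List[str]:
    """List mappings that are not meant to be public yet are not bound to loopback."""
    flagged = []
    for mapping in mappings:
        container_port = mapping.split("/")[0].split(":")[-1]
        if container_port in public_ports:
            continue  # intentionally public, so skip it
        if host_ip(mapping) != LOOPBACK:
            flagged.append(mapping)
    return flagged
```

For example, `non_loopback_mappings(["9002:9002", "8080:8080", "127.0.0.1:3306:3306"])` flags only `"8080:8080"`, which could be rewritten as `"127.0.0.1:8080:8080"`.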

If you only intend to expose DataHub to your organization (e.g. on a private network not accessible from the internet), you should generally be A.O.K. The disclaimers noted are primarily a warning to closely consider the security implications before making DataHub “production”, and a reminder to think about how traffic reaches your DataHub instance wherever it is hosted.

That being said, there is no mandated set of rules that dictates what constitutes a “production” grade deployment. “Production-grade” will be unique to your organization’s requirements and needs.

This is super helpful and detailed. I suspected as much but wanted to confirm. THANK YOU!