Managing Observability of Crawler Jobs in a DataHub Environment

Original Slack Thread

Hi datahubers! Question about observability of crawler jobs: when you manage these jobs, how are you handling observability?

My team is running crawler jobs as k8s jobs. I guess this is a common pattern if you are using the ingestion UI, of course. My question is about the observability of these crawler jobs.

Our current strategy is to base observability on metrics from the k8s jobs. While handling observability at the infra level works, it has some limitations. Instead I would love to operate observability at the application level: number of events produced, number of warnings, number of assets, etc. Has anyone here been able to handle observability at this level, without parsing the logs? :unamused:
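(For illustration, a minimal sketch of what "application-level" metrics could look like, assuming the crawler can dump a structured run report instead of only logs. The report keys below are hypothetical, loosely modeled on the summary a DataHub ingestion run prints; they are not an actual DataHub API.)

```python
# Sketch: flatten a structured crawler run report into metric name -> value
# pairs. The report shape is an assumption for illustration; adapt the keys
# to whatever your crawler actually emits.

def report_to_metrics(report: dict) -> dict:
    """Flatten a crawler run report into numeric metrics."""
    return {
        "crawler_events_produced": report.get("events_produced", 0),
        "crawler_warnings_total": len(report.get("warnings", [])),
        "crawler_failures_total": len(report.get("failures", [])),
        "crawler_assets_scanned": report.get("assets_scanned", 0),
    }

sample_report = {
    "events_produced": 1200,
    "warnings": ["table foo has no schema"],
    "failures": [],
    "assets_scanned": 57,
}
print(report_to_metrics(sample_report))
```

Once the numbers exist as a flat dict, any exporter can pick them up without touching the logs.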

Beyond the feedback from the community, is there any ongoing work to cover this <@U0121TRV0FL>? I remember some discoverability work on this a long time ago; is there any outcome from that?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

This is for CLI ingestion.
Version 0.13… however, the question is generic and not specific to any version.
It's not a source-specific question.

Hey <@U027ZS25RFS>! I’m not sure I’m following the use case - are you talking about capturing metrics from crawlers that are running in k8s (and crawling some other system/app/whatever), and then emitting those events out to DataHub? Meaning, you would have a Crawler as an Entity within DataHub, and emit observability metrics as corresponding metadata?

Or, are you talking about crawling any application(s) running within the k8s cluster, and want to capture metrics without parsing logs?

(many apologies if these are n00b questions :melting_face:)

Sorry, I haven’t explained myself correctly :sweat_smile:. No, this is not about extending the graph entity model to hold more information in the data catalog.

The use case here is: “as a DataHub operator managing the infrastructure supporting the data catalogue plus about 50 crawlers, I want to know how healthy my crawlers are”.

The information that we currently have is mainly the crawler logs, which are nice for troubleshooting but not very practical for automation.
My request is for crawlers to additionally emit metrics that I could dispatch to e.g. Grafana to build dashboards and alarms.
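(One way to bridge the gap today, sketched under assumptions: wrap each crawler run so it writes a run summary, then render that summary in Prometheus text exposition format for a Pushgateway or a node_exporter textfile collector that Grafana can graph. The metric names and the `crawler` label below are illustrative, not a DataHub feature.)

```python
# Sketch: render crawler metrics in Prometheus text exposition format,
# e.g. to push to a Pushgateway after each k8s job finishes. Names and
# labels here are assumptions for illustration.

def to_prometheus(metrics: dict, crawler: str) -> str:
    """Emit one 'name{crawler="..."} value' line per metric."""
    lines = [
        f'{name}{{crawler="{crawler}"}} {value}'
        for name, value in sorted(metrics.items())
    ]
    return "\n".join(lines) + "\n"

metrics = {
    "crawler_events_produced": 1200,
    "crawler_warnings_total": 1,
    "crawler_last_run_success": 1,
}
print(to_prometheus(metrics, "bigquery-prod"))
```

From there, alarms like "warnings > 0" or "no successful run in 24h" become ordinary Grafana/Prometheus alert rules instead of log-parsing jobs.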