Discussing Kafka Ingestion with Schema Registry and Bootstrap Server

Original Slack Thread

Hi Team,

Can you please help us with a few questions about Kafka ingestion using the schema registry and bootstrap server?
• I see that schemas in Confluent’s Kafka Schema Registry are ingested as a Dataset with DatasetSubType Topic instead of Schema; ideally we expect to see both Topic and Schema as entities. How do you suggest we ingest Kafka topics and schemas from the schema registry for a given Kafka cluster?
• If a topic is associated with multiple schemas, how can we show all schemas for that topic in the DataHub UI? Also, is there a property that shows the latest version of the ingested Avro schema in the UI?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

Sharing screenshots of the ingested topic CDS-DQ-T5-RULE-OP (from the DataHub UI) and of the schema CDS-DQ-T5-RULE-OP-value (from Confluent’s schema registry).

You can hit the “raw” button in the datahub UI to see the original avro schema.

We don’t currently support multiple schemas for a single topic

Hi <@U01GZEETMEZ> Thanks for sharing. Is there any way to locate the exact schema name for a given topic from the DataHub UI?

Is that not displayed in the “raw” schema view?

<@U01GZEETMEZ> no. The raw schema view just replicates the Avro schema, but not the schema name itself.

What should the schema name be? I thought TC05OP is the name

If there’s some other name, I don’t think we ingest that information right now

Hi <@U01GZEETMEZ> that is the record name; the schema name is different from it.

Got it. Then we don’t ingest the schema name, only the topic name and raw schema
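To make the record name vs. schema (subject) name distinction concrete, here is a minimal sketch. It assumes the usual shape of a Schema Registry response for `GET /subjects/<subject>/versions/latest` (`subject`, `version`, `id`, `schema`). Only the subject and record name come from this thread; the version, id, and field list are invented for illustration.

```python
import json

# Hypothetical response body for
# GET /subjects/CDS-DQ-T5-RULE-OP-value/versions/latest
# The version, id, and field list are made up for illustration.
sample_response = {
    "subject": "CDS-DQ-T5-RULE-OP-value",
    "version": 3,
    "id": 101,
    "schema": json.dumps(
        {
            "type": "record",
            "name": "TC05OP",
            "fields": [{"name": "rule_id", "type": "string"}],
        }
    ),
}


def describe(response: dict) -> dict:
    """Separate the registry-level subject from the Avro record name."""
    # The registry returns the schema as a JSON-encoded string.
    avro_schema = json.loads(response["schema"])
    return {
        "subject": response["subject"],
        "version": response["version"],
        "record_name": avro_schema.get("name"),
    }


info = describe(sample_response)
print(info)
```

The raw schema view only exposes what is inside `schema` (hence the record name), while the subject and version live outside it in the registry response.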

ic. Can we have a feature to ingest the schema along with the topic?
The reason is that the schema name is more descriptive than the topic name. Also, if the subject name strategy is not TopicNameStrategy, we would not be able to identify which schema is associated with a topic. We could leverage the Dataset subcategory “Schema” (which can be seen in the filters in the DataHub UI) to display just the schemas.

Sure - feel free to open a PR :slightly_smiling_face:

> also if subject name strategy is not TopicNameStrategy then we would not be able to identify what schema is associated with topic
AFAIK the Kafka crawler only fetches schemas when the strategy is TopicNameStrategy; otherwise the crawler cannot determine which subject corresponds to a topic
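For context, a rough sketch of how Confluent's three subject name strategies build a subject. It illustrates why only TopicNameStrategy lets a subject be derived from the topic name alone. The function name and the `com.example` namespace are hypothetical, and this is not the crawler's actual code.

```python
def subject_name(strategy: str, topic: str, record_fullname: str, is_key: bool = False) -> str:
    """Return the registry subject implied by a Confluent subject name strategy."""
    suffix = "key" if is_key else "value"
    if strategy == "TopicNameStrategy":
        # Depends only on the topic, so it can be guessed without reading the schema.
        return f"{topic}-{suffix}"
    if strategy == "RecordNameStrategy":
        # Depends only on the fully qualified record name; the topic is not part of it.
        return record_fullname
    if strategy == "TopicRecordNameStrategy":
        # Needs both the topic and the fully qualified record name.
        return f"{topic}-{record_fullname}"
    raise ValueError(f"unknown strategy: {strategy}")


topic = "CDS-DQ-T5-RULE-OP"
record = "com.example.TC05OP"  # hypothetical namespace, record name from the thread

for strategy in ("TopicNameStrategy", "RecordNameStrategy", "TopicRecordNameStrategy"):
    print(strategy, "->", subject_name(strategy, topic, record))
```

With the last two strategies, a topic can legitimately map to several subjects (one per record type), which is the multiple-schemas-per-topic case raised at the top of the thread.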

Hi <@U027ZS25RFS> As per confluent_schema_registry.py, I see it supports TopicNameStrategy / TopicRecordNameStrategy / TopicNameStrategy-with-environment as well as user-provided subjects.
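The "user-provided subjects" option could look roughly like the recipe below. This is a sketch only: it assumes the Kafka source's `connection.bootstrap`, `connection.schema_registry_url`, and `topic_subject_map` options behave as described in the DataHub Kafka source docs, so please verify against your DataHub version. The hosts, the sink, the `build_kafka_recipe` helper, and the mapped subject name are placeholders.

```python
def build_kafka_recipe(bootstrap: str, schema_registry_url: str, topic_subject_map: dict) -> dict:
    """Build a DataHub ingestion recipe (as a dict) for the Kafka source."""
    return {
        "source": {
            "type": "kafka",
            "config": {
                "connection": {
                    "bootstrap": bootstrap,
                    "schema_registry_url": schema_registry_url,
                },
                # Keys are "<topic>-key" / "<topic>-value"; values are the
                # registry subjects to use for that topic's key / value schema.
                "topic_subject_map": topic_subject_map,
            },
        },
        # Placeholder sink; point it at your own DataHub GMS endpoint.
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }


recipe = build_kafka_recipe(
    bootstrap="localhost:9092",  # placeholder broker
    schema_registry_url="http://localhost:8081",  # placeholder registry
    topic_subject_map={
        # Hypothetical subject for a topic that does not follow TopicNameStrategy.
        "CDS-DQ-T5-RULE-OP-value": "com.example.TC05OP",
    },
)
print(recipe["source"]["config"]["topic_subject_map"])
```

A dict like this can be passed to `Pipeline.create(recipe)` from `datahub.ingestion.run.pipeline`, or written out as the equivalent YAML for the CLI. Note that the map holds one value subject per topic, so it does not by itself surface multiple schemas for a single topic.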

I remember missing schemas for topics following TopicRecordNameStrategy
This was also reported here https://github.com/datahub-project/datahub/issues/6999
Of course, I may be wrong or this may have been fixed