Troubleshooting Metadata Ingestion from MS Dataverse with SQLAlchemy and DataHub

Original Slack Thread

Hi team,
Could you help me out. I’m trying to ingest metadata from MS Dataverse, using:

  1. Cdata Python connector for MS Dataverse as SQLAlchemy dialect
  2. SQLAlchemy (generic) recipe from datahub
  3. datahub ingest cli , version 0.13.0 (test both from 0.11.0 to 0.13.0 and GMS version: 0.12.0)
  4. worked well with TABLETYPE = “VIEW” for “Entities” schema but failed with TABLETYPE =“TABLE” for “System” (Note: “Entities” doesn’t have TABLE, just VIEW)

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)
  1. CLI
  2. Datahub version: 0.13.0
  3. Datasource: Microsoft Dataverse

Log ingest for “System” as below:

[2024-04-02 07:49:48,543] INFO     {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.0
[2024-04-02 07:49:48,620] INFO     {datahub.ingestion.run.pipeline:238} - Sink configured successfully. DataHubRestEmitter: configured to talk to <http://datahub-gms:8080>
[2024-04-02 07:49:49,524] INFO     {datahub.ingestion.run.pipeline:255} - Source configured successfully.
[2024-04-02 07:49:49,525] INFO     {datahub.cli.ingest_cli:128} - Starting metadata ingestion
2024-04-02 07:49:50,074 INFO sqlalchemy.engine.Engine SELECT SchemaName FROM sys_schemas
[2024-04-02 07:49:50,074] INFO     {sqlalchemy.engine.base:1853} - SELECT SchemaName FROM sys_schemas
/2024-04-02 07:49:50,075 INFO sqlalchemy.engine.Engine [raw sql] ()
[2024-04-02 07:49:50,075] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ()
2024-04-02 07:49:50,189 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:49:50,189] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:49:50,211 INFO sqlalchemy.engine.Engine SELECT TableName FROM SYS_TABLES WHERE TableType='TABLE' AND SchemaName=?
[2024-04-02 07:49:50,211] INFO     {sqlalchemy.engine.base:1853} - SELECT TableName FROM SYS_TABLES WHERE TableType='TABLE' AND SchemaName=?
2024-04-02 07:49:50,211 INFO sqlalchemy.engine.Engine [raw sql] ('System',)
[2024-04-02 07:49:50,211] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('System',)
|2024-04-02 07:49:58,376 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:49:58,376] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:49:58,377 INFO sqlalchemy.engine.Engine SELECT TableName FROM SYS_TABLES WHERE TableType='VIEW' AND SchemaName=?
[2024-04-02 07:49:58,377] INFO     {sqlalchemy.engine.base:1853} - SELECT TableName FROM SYS_TABLES WHERE TableType='VIEW' AND SchemaName=?
2024-04-02 07:49:58,377 INFO sqlalchemy.engine.Engine [raw sql] ('System',)
[2024-04-02 07:49:58,377] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('System',)
\2024-04-02 07:50:02,592 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:50:02,592] INFO     {sqlalchemy.engine.base:1087} - COMMIT

Cli report:
{'cli_version': '0.13.0',
 'cli_entry_location': '/usr/local/lib/python3.10/site-packages/datahub/__init__.py',
 'py_version': '3.10.11 (main, May 23 2023, 13:58:30) [GCC 10.2.1 20210110]',
 'py_exec_path': '/usr/local/bin/python',
 'os_details': 'Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.31',
 'mem_info': '213.16 MB',
 'peak_memory_usage': '213.16 MB',
 'disk_info': {'total': '527.37 GB', 'used': '88.36 GB', 'used_initally': '88.36 GB', 'free': '412.15 GB'},
 'peak_disk_usage': '88.36 GB',
 'thread_count': 2,
 'peak_thread_count': 2}
Source (sqlalchemy) report:
{'events_produced': 6,
 'events_produced_per_sec': 0,
 'entities': {'container': ['urn:li:container:7dff0b80aea6abeca2e6ae6ffbccf54c', 'urn:li:container:ba36059006c1e45fe451971c5c381040']},
 'aspects': {'container': {'containerProperties': 2, 'status': 1, 'dataPlatformInstance': 1, 'subTypes': 1, 'browsePathsV2': 1}},
 'warnings': {},
 'failures': {},
 'soft_deleted_stale_entities': [],
 'tables_scanned': 0,
 'views_scanned': 0,
 'entities_profiled': 0,
 'filtered': [],
 'num_view_definitions_parsed': 0,
 'num_view_definitions_failed_parsing': 0,
 'num_view_definitions_failed_column_parsing': 0,
 'view_definitions_parsing_failures': [],
 'start_time': '2024-04-02 07:49:49.290276 (13.35 seconds ago)',
 'running_time': '13.35 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 0,
 'records_written_per_second': 0,
 'warnings': [],
 'failures': [],
 'start_time': '2024-04-02 07:49:48.617808 (14.03 seconds ago)',
 'current_time': '2024-04-02 07:50:02.645401 (now)',
 'total_duration_in_seconds': 14.03,
 'max_threads': 15,
 'gms_version': 'v0.12.0',
 'pending_requests': 0}

⏳ Pipeline running successfully so far; produced 6 events in 13.35 seconds.
[2024-04-02 07:50:02,740] INFO     {datahub.cli.ingest_cli:141} - Finished metadata ingestion
-
Cli report:
{'cli_version': '0.13.0',
 'cli_entry_location': '/usr/local/lib/python3.10/site-packages/datahub/__init__.py',
 'py_version': '3.10.11 (main, May 23 2023, 13:58:30) [GCC 10.2.1 20210110]',
 'py_exec_path': '/usr/local/bin/python',
 'os_details': 'Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.31',
 'mem_info': '216.87 MB',
 'peak_memory_usage': '216.87 MB',
 'disk_info': {'total': '527.37 GB', 'used': '88.36 GB', 'used_initally': '88.36 GB', 'free': '412.15 GB'},
 'peak_disk_usage': '88.36 GB',
 'thread_count': 2,
 'peak_thread_count': 2}
Source (sqlalchemy) report:
{'events_produced': 13,
 'events_produced_per_sec': 0,
 'entities': {'container': ['urn:li:container:7dff0b80aea6abeca2e6ae6ffbccf54c', 'urn:li:container:ba36059006c1e45fe451971c5c381040']},
 'aspects': {'container': {'containerProperties': 2, 'status': 2, 'dataPlatformInstance': 2, 'subTypes': 2, 'browsePathsV2': 4, 'container': 1}},
 'warnings': {},
 'failures': {},
 'soft_deleted_stale_entities': [],
 'tables_scanned': 0,
 'views_scanned': 0,
 'entities_profiled': 0,
 'filtered': [],
 'num_view_definitions_parsed': 0,
 'num_view_definitions_failed_parsing': 0,
 'num_view_definitions_failed_column_parsing': 0,
 'view_definitions_parsing_failures': [],
 'start_time': '2024-04-02 07:49:49.290276 (13.69 seconds ago)',
 'running_time': '13.69 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 0,
 'records_written_per_second': 0,
 'warnings': [],
 'failures': [],
 'start_time': '2024-04-02 07:49:48.617808 (14.37 seconds ago)',
 'current_time': '2024-04-02 07:50:02.985571 (now)',
 'total_duration_in_seconds': 14.37,
 'max_threads': 15,
 'gms_version': 'v0.12.0',
 'pending_requests': 0}

 Pipeline finished successfully; produced 13 events in 13.69 seconds.```

• Log ingest for “Entities” schema as below:

[2024-04-02 07:52:38,066] INFO     {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.0
[2024-04-02 07:52:38,147] INFO     {datahub.ingestion.run.pipeline:238} - Sink configured successfully. DataHubRestEmitter: configured to talk to <http://datahub-gms:8080>
[2024-04-02 07:52:38,995] INFO     {datahub.ingestion.run.pipeline:255} - Source configured successfully.
[2024-04-02 07:52:38,996] INFO     {datahub.cli.ingest_cli:128} - Starting metadata ingestion
2024-04-02 07:52:39,553 INFO sqlalchemy.engine.Engine SELECT SchemaName FROM sys_schemas
[2024-04-02 07:52:39,553] INFO     {sqlalchemy.engine.base:1853} - SELECT SchemaName FROM sys_schemas
/2024-04-02 07:52:39,553 INFO sqlalchemy.engine.Engine [raw sql] ()
[2024-04-02 07:52:39,553] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ()
2024-04-02 07:52:39,673 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:52:39,673] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:52:39,695 INFO sqlalchemy.engine.Engine SELECT TableName FROM SYS_TABLES WHERE TableType='TABLE' AND SchemaName=?
[2024-04-02 07:52:39,695] INFO     {sqlalchemy.engine.base:1853} - SELECT TableName FROM SYS_TABLES WHERE TableType='TABLE' AND SchemaName=?
2024-04-02 07:52:39,695 INFO sqlalchemy.engine.Engine [raw sql] ('Entities',)
[2024-04-02 07:52:39,695] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('Entities',)
2024-04-02 07:52:44,653 INFO sqlalchemy.engine.Engine COMMIT
|[2024-04-02 07:52:44,653] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:52:44,662 INFO sqlalchemy.engine.Engine SELECT TableName FROM SYS_TABLES WHERE TableType='VIEW' AND SchemaName=?
[2024-04-02 07:52:44,662] INFO     {sqlalchemy.engine.base:1853} - SELECT TableName FROM SYS_TABLES WHERE TableType='VIEW' AND SchemaName=?
2024-04-02 07:52:44,662 INFO sqlalchemy.engine.Engine [raw sql] ('Entities',)
[2024-04-02 07:52:44,662] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('Entities',)
2024-04-02 07:52:44,669 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:52:44,669] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:52:44,671 INFO sqlalchemy.engine.Engine SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
[2024-04-02 07:52:44,671] INFO     {sqlalchemy.engine.base:1853} - SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
2024-04-02 07:52:44,671 INFO sqlalchemy.engine.Engine [raw sql] ('MultiSelectPickListAttributeMetaData', 'Entities')
[2024-04-02 07:52:44,671] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('MultiSelectPickListAttributeMetaData', 'Entities')
2024-04-02 07:52:44,741 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:52:44,741] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:52:44,749 INFO sqlalchemy.engine.Engine SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
[2024-04-02 07:52:44,749] INFO     {sqlalchemy.engine.base:1853} - SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
2024-04-02 07:52:44,749 INFO sqlalchemy.engine.Engine [raw sql] ('MultiSelectPickListOptions', 'Entities')
[2024-04-02 07:52:44,749] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('MultiSelectPickListOptions', 'Entities')
2024-04-02 07:52:44,763 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:52:44,763] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:52:44,771 INFO sqlalchemy.engine.Engine SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
[2024-04-02 07:52:44,771] INFO     {sqlalchemy.engine.base:1853} - SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
2024-04-02 07:52:44,771 INFO sqlalchemy.engine.Engine [raw sql] ('PickListAttributeMetaData', 'Entities')
[2024-04-02 07:52:44,771] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('PickListAttributeMetaData', 'Entities')
2024-04-02 07:52:44,786 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:52:44,786] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:52:44,793 INFO sqlalchemy.engine.Engine SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
[2024-04-02 07:52:44,793] INFO     {sqlalchemy.engine.base:1853} - SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
2024-04-02 07:52:44,793 INFO sqlalchemy.engine.Engine [raw sql] ('PickListOptions', 'Entities')
[2024-04-02 07:52:44,793] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('PickListOptions', 'Entities')
2024-04-02 07:52:44,805 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:52:44,805] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:52:44,811 INFO sqlalchemy.engine.Engine SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
[2024-04-02 07:52:44,811] INFO     {sqlalchemy.engine.base:1853} - SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
2024-04-02 07:52:44,811 INFO sqlalchemy.engine.Engine [raw sql] ('StateAttributeMetadata', 'Entities')
[2024-04-02 07:52:44,811] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('StateAttributeMetadata', 'Entities')
2024-04-02 07:52:44,824 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:52:44,824] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:52:44,831 INFO sqlalchemy.engine.Engine SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
[2024-04-02 07:52:44,831] INFO     {sqlalchemy.engine.base:1853} - SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
2024-04-02 07:52:44,831 INFO sqlalchemy.engine.Engine [raw sql] ('StateAttributeOptions', 'Entities')
[2024-04-02 07:52:44,831] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('StateAttributeOptions', 'Entities')
2024-04-02 07:52:44,842 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:52:44,842] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:52:44,850 INFO sqlalchemy.engine.Engine SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
[2024-04-02 07:52:44,850] INFO     {sqlalchemy.engine.base:1853} - SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
2024-04-02 07:52:44,850 INFO sqlalchemy.engine.Engine [raw sql] ('StatusAttributeMetadata', 'Entities')
[2024-04-02 07:52:44,850] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('StatusAttributeMetadata', 'Entities')
2024-04-02 07:52:44,864 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:52:44,864] INFO     {sqlalchemy.engine.base:1087} - COMMIT
2024-04-02 07:52:44,870 INFO sqlalchemy.engine.Engine SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
[2024-04-02 07:52:44,870] INFO     {sqlalchemy.engine.base:1853} - SELECT CatalogName,SchemaName,TableName,ColumnName,DataType,NumericScale,IsNullable,Ordinal,IsAutoIncrement,IsKey,NumericPrecision FROM sys_tablecolumns WHERE TableName=? AND SchemaName=?
2024-04-02 07:52:44,870 INFO sqlalchemy.engine.Engine [raw sql] ('StatusAttributeOptions', 'Entities')
[2024-04-02 07:52:44,870] INFO     {sqlalchemy.engine.base:1858} - [raw sql] ('StatusAttributeOptions', 'Entities')
2024-04-02 07:52:44,881 INFO sqlalchemy.engine.Engine COMMIT
[2024-04-02 07:52:44,881] INFO     {sqlalchemy.engine.base:1087} - COMMIT
\[2024-04-02 07:52:44,933] INFO     {datahub.cli.ingest_cli:141} - Finished metadata ingestion

Cli report:
{'cli_version': '0.13.0',
 'cli_entry_location': '/usr/local/lib/python3.10/site-packages/datahub/__init__.py',
 'py_version': '3.10.11 (main, May 23 2023, 13:58:30) [GCC 10.2.1 20210110]',
 'py_exec_path': '/usr/local/bin/python',
 'os_details': 'Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.31',
 'mem_info': '272.49 MB',
 'peak_memory_usage': '272.49 MB',
 'disk_info': {'total': '527.37 GB', 'used': '88.36 GB', 'used_initally': '88.36 GB', 'free': '412.15 GB'},
 'peak_disk_usage': '88.36 GB',
 'thread_count': 2,
 'peak_thread_count': 2}
Source (sqlalchemy) report:
{'events_produced': 61,
 'events_produced_per_sec': 9,
 'entities': {'container': ['urn:li:container:7dff0b80aea6abeca2e6ae6ffbccf54c', 'urn:li:container:a2cbc87138bc4d5824c9bb7693e2b238'],
              'dataset': ['urn:li:dataset:(urn:li:dataPlatform:dataverse,Entities.MultiSelectPickListAttributeMetaData,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:dataverse,Entities.MultiSelectPickListOptions,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:dataverse,Entities.PickListAttributeMetaData,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:dataverse,Entities.PickListOptions,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:dataverse,Entities.StateAttributeMetadata,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:dataverse,Entities.StateAttributeOptions,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:dataverse,Entities.StatusAttributeMetadata,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:dataverse,Entities.StatusAttributeOptions,PROD)']},
 'aspects': {'container': {'containerProperties': 2, 'status': 2, 'dataPlatformInstance': 2, 'subTypes': 2, 'browsePathsV2': 4, 'container': 1},
             'dataset': {'container': 8,
                         'status': 8,
                         'datasetProperties': 8,
                         'schemaMetadata': 8,
                         'subTypes': 8,
                         'viewProperties': 8,
                         'browsePathsV2': 16}},
{--- omitted --- }

 Pipeline finished successfully; produced 61 events in 6.39 seconds.```

And recipe:

source:
type: sqlalchemy
config:
# Coordinates
connect_uri: “cds:///?OrganizationUrl=https://maskedserver.dynamics.com/InitiateOAuth=REFRESH&amp;OAuthSettingsLocation=‘/OAuthSettings.txt’&amp;SSLServerCert=*&amp;schema=System;”
platform: dataverse
#database: CData
#options:
# connect_args:
# connect_timeout: 20

sink:
type: “datahub-rest”
config:
server: “http://datahub-gms:8080