Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!
- Are you using UI or CLI for ingestion?
- Which DataHub version are you using? (e.g. 0.12.0)
- What data source(s) are you integrating with DataHub? (e.g. BigQuery)
<@U06TM7M40RK> Can you help with the questions in this thread?
Please continue to tag RunLLM (e.g., <@U06TM7M40RK>) to have it continue generating responses! We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66|> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions, and don’t forget to rate its responses so other folks in the Community know if you’re still looking for help on the topic and to help us improve the assistant!
I’m here to help with questions specifically about DataHub. Could you please provide more details or rephrase your question to include relevant information about your DataHub setup or issue? For example:
- Are you using UI or CLI for ingestion?
- Which DataHub version are you using? (e.g., 0.12.0)
- What data source(s) are you integrating with DataHub? (e.g., BigQuery)
This information will help me assist you more effectively. Hint: Mention @runllm in the thread for followups.
- UI
- v0.13.3
- Redshift
<@U06TM7M40RK>
Great, thanks for providing the details! Since you are using UI ingestion on DataHub v0.13.3 with Redshift, here are the steps to set up Redshift ingestion:
Step-by-Step Guide to Set Up Redshift Ingestion
1. Create a DataHub User in Redshift:
- Connect to your Amazon Redshift cluster with your admin user, using a SQL client such as SQL Workbench/J or the Amazon Redshift Query Editor.
- Create a Redshift user that will perform the metadata extraction if you don’t have one already. For example:
```sql
CREATE USER datahub WITH PASSWORD 'Datahub1234';
```
2. Grant Necessary Permissions:
- Grant the following permissions to your datahub user:
```sql
ALTER USER datahub WITH SYSLOG ACCESS UNRESTRICTED;
GRANT SELECT ON pg_catalog.svv_table_info TO datahub;
GRANT SELECT ON pg_catalog.svl_user_info TO datahub;
```
3. Configure Secrets in DataHub:
- Within DataHub, navigate to the Ingestion tab in the top-right corner of your screen.
- Open the Secrets tab and click Create new secret.
- Enter a name like REDSHIFT_PASSWORD, enter your DataHub Redshift user’s password, optionally add a description, and click Create.
4. Configure the Ingestion Source:
- Navigate to the Sources tab and create a new Redshift source. For full configuration details, see:
- <https://datahubproject.io/docs/0.13.1/quick-ingestion-guides/redshift/configuration|Redshift Configuration Guide>
- <https://datahubproject.io/docs/0.13.1/quick-ingestion-guides/redshift/setup|Redshift Setup Guide>
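For reference, the UI form fields above correspond to a recipe roughly like the following sketch. The host, database, and schema names are placeholders, and `${REDSHIFT_PASSWORD}` refers to the secret created above; field names follow the DataHub Redshift source docs, so verify them against your version:

```yaml
source:
  type: redshift
  config:
    host_port: my-cluster.example.us-east-1.redshift.amazonaws.com:5439  # placeholder
    database: dataplatform          # placeholder database name
    username: datahub               # the user created above
    password: '${REDSHIFT_PASSWORD}'  # secret created in the Secrets tab
    schema_pattern:
      deny:
        - staging                   # example: skip schemas full of dbt temp tables
    include_table_lineage: true
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080   # placeholder GMS endpoint
```

When ingesting through the UI, only the `source` section is typically entered in the recipe editor; the sink is managed for you.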
[2024-08-08 14:42:54,138] INFO {datahub.ingestion.reporting.file_reporter:52} - Wrote UNKNOWN report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/6eec71e1-edc4-4b23-88f7-0b21f290325d/ingestion_report.json' mode='w' encoding='UTF-8'>
[2024-08-08 14:42:54,139] INFO {datahub.cli.ingest_cli:133} - Source (redshift) report:
{'num_tables_fetch_sample_values_failed': 0,
'num_tables_classification_attempted': 0,
'num_tables_classification_failed': 0,
'num_tables_classification_found': 0,
'info_types_detected': {},
'window_end_time': '2024-08-08 14:41:14.043432+00:00 (1 minute and 40.1 seconds ago)',
'window_start_time': '2024-08-07 00:00:00+00:00 (1 day, 14 hours and 42 minutes ago)',
'ingestion_stage': 'Lineage Extraction at 2024-08-08 14:41:16.891120+00:00',
'ingestion_stage_durations': {'Metadata Extraction at 2024-08-08 14:41:14.224686+00:00': 2.67},
'events_produced': 11,
'events_produced_per_sec': 0,
'entities': {'container': ['urn:li:container:a88d7a48ed3cf00c20c68324d65466ff', 'urn:li:container:a214233bcaa8908aabeb84ee8d750cb0']},
'aspects': {'container': {'containerProperties': 2, 'status': 2, 'dataPlatformInstance': 2, 'subTypes': 2, 'browsePathsV2': 2, 'container': 1}},
'warnings': {},
'failures': {},
'soft_deleted_stale_entities': [],
'tables_scanned': 0,
'views_scanned': 0,
'entities_profiled': 0,
'filtered': ['dataplatform._airbyte_raw_vault',
'dataplatform.dev_eoin_travers',
'dataplatform.dev_jamie_coe',
'dataplatform.dev_kevin_longe',
'dataplatform.dev_tableau',
'dataplatform.raw_vault',
'dataplatform.staging',
'dataplatform.staging_vault',
'dataplatform.tmp',
'dataplatform.unmind_datalake',
'... sampled of 16 total elements'],
'num_view_definitions_parsed': 0,
'num_view_definitions_failed_parsing': 0,
'num_view_definitions_failed_column_parsing': 0,
'view_definitions_parsing_failures': [],
'profiling_skipped_not_updated': {},
'profiling_skipped_size_limit': {},
'profiling_skipped_row_limit': {},
'profiling_skipped_other': {},
'num_tables_not_eligible_profiling': {},
'upstream_lineage': {},
'usage_extraction_sec': {},
'lineage_extraction_sec': {},
'table_processed': {},
'table_filtered': {},
'view_filtered': {},
'view_processed': {},
'table_cached': {'dataplatform.prod': 281, 'dataplatform.prod_vault': 1},
'view_cached': {'dataplatform.prod': 5},
'metadata_extraction_sec': {'dataplatform.prod': 0.2, 'dataplatform.prod_vault': 0.1},
'operational_metadata_extraction_sec': {},
'lineage_mem_size': {},
'tables_in_mem_size': {'dataplatform': '16.79 KB'},
'views_in_mem_size': {'dataplatform': '946 bytes'},
'num_operational_stats_filtered': 0,
'num_repeated_operations_dropped': 0,
'num_usage_stat_skipped': 0,
'num_lineage_tables_dropped': 0,
'num_lineage_dropped_query_parser': 0,
'num_lineage_dropped_not_support_copy_path': 0,
'lineage_start_time': '2024-08-07 00:00:00+00:00 (1 day, 14 hours and 42 minutes ago)',
'lineage_end_time': '2024-08-08 14:41:14.043432+00:00 (1 minute and 40.1 seconds ago)',
'stateful_lineage_ingestion_enabled': True,
'stateful_usage_ingestion_enabled': True,
'num_unresolved_temp_columns': 0,
'start_time': '2024-08-08 14:41:14.053329 (1 minute and 40.09 seconds ago)',
'running_time': '1 minute and 40.09 seconds'}
[2024-08-08 14:42:54,140] INFO {datahub.cli.ingest_cli:136} - Sink (datahub-rest) report:
{'total_records_written': 11,
'records_written_per_second': 0,
'warnings': [],
'failures': [],
'start_time': '2024-08-08 14:41:12.111555 (1 minute and 42.03 seconds ago)',
'current_time': '2024-08-08 14:42:54.139777 (now)',
'total_duration_in_seconds': 102.03,
'max_threads': 15,
'gms_version': '',
'pending_requests': 0}
[2024-08-08 14:42:54,475] ERROR {datahub.entrypoints:201} - Command failed: Invalid expression / Unexpected token. Line 1, Col: 163.
shift_prod", "target_name": "prod", "node_id": "model.unmind_data_platform.stg_slack_app_opened"} */[4m\[0mnalter table "dataplatform"."staging"."stg_slack_app_opened" rename to "stg_slack_app_opened__dbt_ba
Traceback (most recent call last):
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/entrypoints.py", line 188, in main
sys.exit(datahub(standalone_mode=False, **kwargs))
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
raise e
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
res = func(*args, **kwargs)
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 201, in run
ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 185, in run_ingestion_and_check_upgrade
ret = await ingestion_future
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 139, in run_pipeline_to_completion
raise e
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 131, in run_pipeline_to_completion
pipeline.run()
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 405, in run
for wu in itertools.islice(
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 147, in auto_stale_entity_removal
for wu in stream:
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/incremental_lineage_helper.py", line 116, in auto_incremental_lineage
yield from stream
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 171, in auto_workunit_reporter
for wu in stream:
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 253, in auto_browse_path_v2
for urn, batch in _batch_workunits_by_urn(stream):
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 391, in _batch_workunits_by_urn
for wu in stream:
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 184, in auto_materialize_referenced_tags
for wu in stream:
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 91, in auto_status_aspect
for wu in stream:
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 468, in get_workunits_internal
yield from self.extract_lineage(
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 987, in extract_lineage
lineage_extractor.populate_lineage(
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 659, in populate_lineage
table_renames, all_tables_set = self._process_table_renames(
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 851, in _process_table_renames
schema, prev_name, new_name = parse_alter_table_rename(
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 131, in parse_alter_table_rename
parsed_query = sqlglot.parse_one(query, dialect="redshift")
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/sqlglot/__init__.py", line 124, in parse_one
result = dialect.parse(sql, **opts)
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/sqlglot/dialects/dialect.py", line 478, in parse
return self.parser(**opts).parse(self.tokenize(sql), sql)
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/sqlglot/parser.py", line 1145, in parse
return self._parse(
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/sqlglot/parser.py", line 1214, in _parse
self.raise_error("Invalid expression / Unexpected token")
File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/sqlglot/parser.py", line 1255, in raise_error
raise error
sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 1, Col: 163.
shift_prod", "target_name": "prod", "node_id": "model.unmind_data_platform.stg_slack_app_opened"} */[4m\[0mnalter table "dataplatform"."staging"."stg_slack_app_opened" rename to "stg_slack_app_opened__dbt_ba
<@U06TM7M40RK>
The error you’re encountering during the Redshift ingestion process, `Invalid expression / Unexpected token`, comes from parsing SQL query logs, specifically `ALTER TABLE ... RENAME TO ...` statements. This issue has been reported by other users as well.
Steps to Address the Issue
1. Check for Known Issues and Fixes:
- A similar problem was reported in a GitHub issue ("Redshift serverless ingestion fails with unexpected token in v0.13.3rc1"), and a potential fix was suggested: modify the redshift_schema.py file to handle literal newline characters in the query text: query_text=row[field_names.index("query_text")].replace(r"\n", "\n")
2. Update DataHub to the Latest Version:
- Ensure you are using the latest version of DataHub, as newer releases may include fixes or improvements related to this issue.
3. Modify the Ingestion Recipe:
- If you are using a custom ingestion recipe, ensure that it is correctly configured to handle table renames and other SQL operations. You might need to adjust the schema_pattern or other configurations to avoid parsing issues.
4. Review the Redshift Lineage Code:
- The error occurs in the parse_alter_table_rename function within the lineage.py file. You can review the code to ensure it correctly handles the SQL syntax used in your Redshift instance. The relevant code can be found <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/redshift/lineage.py|here>.
Example Code Snippet for Handling Newline Characters
Here is how you can modify the redshift_schema.py file to handle literal newline characters in the query text:
```python
# In redshift_schema.py: replace literal "\n" sequences in the logged
# query text with real newlines before the SQL is parsed.
query_text = row[field_names.index("query_text")].replace(r"\n", "\n")
```
Additional Resources
- <https://datahubproject.io/docs/metadata-ingestion/source_docs/redshift|Redshift Ingestion Configuration Guide>
- <https://github.com/datahub-project/datahub/issues|DataHub GitHub Issues>
Would you like more detailed steps or further assistance with any specific part of the process?
Sources:
- GitHub Issue: Redshift serverless ingestion fails with unexpected token in v0.13.3rc1
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/redshift/lineage.py|GitHub File: Redshift Lineage Code>
- <https://datahubproject.io/docs/metadata-ingestion/source_docs/redshift|Redshift Ingestion Configuration Guide>
- <https://github.com/datahub-project/datahub/issues|DataHub GitHub Issues>