Troubleshooting Redshift Ingestion Error: Invalid Expression / Unexpected Token

Original Slack Thread


Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: Please continue to tag RunLLM (e.g. <@U06TM7M40RK>) to have it continue generating responses! We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66|> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions, and don’t forget to give its responses a :+1: or :-1: so other folks in the Community know whether you’re still looking for help on the topic and to help us improve the assistant!

I’m here to help with questions specifically about DataHub. Could you please provide more details or rephrase your question to include relevant information about your DataHub setup or issue? For example:

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery)

This information will help me assist you more effectively.

  1. UI
  2. v0.13.3
  3. Redshift

<@U06TM7M40RK>

Great, thanks for providing the details! Since you are using the UI for ingestion, DataHub version 0.13.3, and integrating with Redshift, here are the steps to set up Redshift ingestion in DataHub:

Step-by-Step Guide to Set Up Redshift Ingestion

  1. Create a DataHub User in Redshift:

    • Connect to your Amazon Redshift cluster using an SQL client such as SQL Workbench/J or Amazon Redshift Query Editor with your Admin user.
    • Create a Redshift user that will be used to perform the metadata extraction if you don’t have one already. For example:
      CREATE USER datahub WITH PASSWORD 'Datahub1234';
      
  2. Grant Necessary Permissions:

    • Grant the following permissions to your datahub user:
      ALTER USER datahub WITH SYSLOG ACCESS UNRESTRICTED;
      GRANT SELECT ON pg_catalog.svv_table_info to datahub;
      GRANT SELECT ON pg_catalog.svl_user_info to datahub;
      
  3. Configure Secrets in DataHub:

    • Within DataHub, navigate to the Ingestion tab in the top right corner of your screen.
    • Navigate to the Secrets tab and click Create new secret.
    • Enter a name like REDSHIFT_PASSWORD, enter your DataHub Redshift user’s password, optionally add a description, and click Create. This securely stores the password within DataHub.
  4. Configure the Ingestion Source:

    • Navigate to the Sources:
[2024-08-08 14:42:54,138] INFO     {datahub.ingestion.reporting.file_reporter:52} - Wrote UNKNOWN report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/6eec71e1-edc4-4b23-88f7-0b21f290325d/ingestion_report.json' mode='w' encoding='UTF-8'>
[2024-08-08 14:42:54,139] INFO     {datahub.cli.ingest_cli:133} - Source (redshift) report:
{'num_tables_fetch_sample_values_failed': 0,
 'num_tables_classification_attempted': 0,
 'num_tables_classification_failed': 0,
 'num_tables_classification_found': 0,
 'info_types_detected': {},
 'window_end_time': '2024-08-08 14:41:14.043432+00:00 (1 minute and 40.1 seconds ago)',
 'window_start_time': '2024-08-07 00:00:00+00:00 (1 day, 14 hours and 42 minutes ago)',
 'ingestion_stage': 'Lineage Extraction at 2024-08-08 14:41:16.891120+00:00',
 'ingestion_stage_durations': {'Metadata Extraction at 2024-08-08 14:41:14.224686+00:00': 2.67},
 'events_produced': 11,
 'events_produced_per_sec': 0,
 'entities': {'container': ['urn:li:container:a88d7a48ed3cf00c20c68324d65466ff', 'urn:li:container:a214233bcaa8908aabeb84ee8d750cb0']},
 'aspects': {'container': {'containerProperties': 2, 'status': 2, 'dataPlatformInstance': 2, 'subTypes': 2, 'browsePathsV2': 2, 'container': 1}},
 'warnings': {},
 'failures': {},
 'soft_deleted_stale_entities': [],
 'tables_scanned': 0,
 'views_scanned': 0,
 'entities_profiled': 0,
 'filtered': ['dataplatform._airbyte_raw_vault',
              'dataplatform.dev_eoin_travers',
              'dataplatform.dev_jamie_coe',
              'dataplatform.dev_kevin_longe',
              'dataplatform.dev_tableau',
              'dataplatform.raw_vault',
              'dataplatform.staging',
              'dataplatform.staging_vault',
              'dataplatform.tmp',
              'dataplatform.unmind_datalake',
              '... sampled of 16 total elements'],
 'num_view_definitions_parsed': 0,
 'num_view_definitions_failed_parsing': 0,
 'num_view_definitions_failed_column_parsing': 0,
 'view_definitions_parsing_failures': [],
 'profiling_skipped_not_updated': {},
 'profiling_skipped_size_limit': {},
 'profiling_skipped_row_limit': {},
 'profiling_skipped_other': {},
 'num_tables_not_eligible_profiling': {},
 'upstream_lineage': {},
 'usage_extraction_sec': {},
 'lineage_extraction_sec': {},
 'table_processed': {},
 'table_filtered': {},
 'view_filtered': {},
 'view_processed': {},
 'table_cached': {'dataplatform.prod': 281, 'dataplatform.prod_vault': 1},
 'view_cached': {'dataplatform.prod': 5},
 'metadata_extraction_sec': {'dataplatform.prod': 0.2, 'dataplatform.prod_vault': 0.1},
 'operational_metadata_extraction_sec': {},
 'lineage_mem_size': {},
 'tables_in_mem_size': {'dataplatform': '16.79 KB'},
 'views_in_mem_size': {'dataplatform': '946 bytes'},
 'num_operational_stats_filtered': 0,
 'num_repeated_operations_dropped': 0,
 'num_usage_stat_skipped': 0,
 'num_lineage_tables_dropped': 0,
 'num_lineage_dropped_query_parser': 0,
 'num_lineage_dropped_not_support_copy_path': 0,
 'lineage_start_time': '2024-08-07 00:00:00+00:00 (1 day, 14 hours and 42 minutes ago)',
 'lineage_end_time': '2024-08-08 14:41:14.043432+00:00 (1 minute and 40.1 seconds ago)',
 'stateful_lineage_ingestion_enabled': True,
 'stateful_usage_ingestion_enabled': True,
 'num_unresolved_temp_columns': 0,
 'start_time': '2024-08-08 14:41:14.053329 (1 minute and 40.09 seconds ago)',
 'running_time': '1 minute and 40.09 seconds'}
[2024-08-08 14:42:54,140] INFO     {datahub.cli.ingest_cli:136} - Sink (datahub-rest) report:
{'total_records_written': 11,
 'records_written_per_second': 0,
 'warnings': [],
 'failures': [],
 'start_time': '2024-08-08 14:41:12.111555 (1 minute and 42.03 seconds ago)',
 'current_time': '2024-08-08 14:42:54.139777 (now)',
 'total_duration_in_seconds': 102.03,
 'max_threads': 15,
 'gms_version': '',
 'pending_requests': 0}
[2024-08-08 14:42:54,475] ERROR    {datahub.entrypoints:201} - Command failed: Invalid expression / Unexpected token. Line 1, Col: 163.
  shift_prod", "target_name": "prod", "node_id": "model.unmind_data_platform.stg_slack_app_opened"} */\nalter table "dataplatform"."staging"."stg_slack_app_opened" rename to "stg_slack_app_opened__dbt_ba
Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/entrypoints.py", line 188, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
    raise e
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
    res = func(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 201, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 185, in run_ingestion_and_check_upgrade
    ret = await ingestion_future
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 139, in run_pipeline_to_completion
    raise e
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 131, in run_pipeline_to_completion
    pipeline.run()
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 405, in run
    for wu in itertools.islice(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 147, in auto_stale_entity_removal
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/incremental_lineage_helper.py", line 116, in auto_incremental_lineage
    yield from stream
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 171, in auto_workunit_reporter
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 253, in auto_browse_path_v2
    for urn, batch in _batch_workunits_by_urn(stream):
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 391, in _batch_workunits_by_urn
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 184, in auto_materialize_referenced_tags
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 91, in auto_status_aspect
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 468, in get_workunits_internal
    yield from self.extract_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 987, in extract_lineage
    lineage_extractor.populate_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 659, in populate_lineage
    table_renames, all_tables_set = self._process_table_renames(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 851, in _process_table_renames
    schema, prev_name, new_name = parse_alter_table_rename(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 131, in parse_alter_table_rename
    parsed_query = sqlglot.parse_one(query, dialect="redshift")
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/sqlglot/__init__.py", line 124, in parse_one
    result = dialect.parse(sql, **opts)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/sqlglot/dialects/dialect.py", line 478, in parse
    return self.parser(**opts).parse(self.tokenize(sql), sql)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/sqlglot/parser.py", line 1145, in parse
    return self._parse(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/sqlglot/parser.py", line 1214, in _parse
    self.raise_error("Invalid expression / Unexpected token")
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/sqlglot/parser.py", line 1255, in raise_error
    raise error
sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 1, Col: 163.
  shift_prod", "target_name": "prod", "node_id": "model.unmind_data_platform.stg_slack_app_opened"} */\nalter table "dataplatform"."staging"."stg_slack_app_opened" rename to "stg_slack_app_opened__dbt_ba
<@U06TM7M40RK>

The Invalid expression / Unexpected token error you’re encountering during Redshift ingestion comes from parsing SQL query text, specifically the ALTER TABLE ... RENAME TO ... statements that dbt generates when it swaps tables. This issue has been reported by other users as well.

Steps to Address the Issue

  1. Check for Known Issues and Fixes:

    • A similar problem was reported in a GitHub issue on the datahub-project/datahub repository, and a potential fix was suggested there. The fix involves modifying the redshift_schema.py file so that literal newline sequences in the query text are converted into real newlines:
      query_text = row[field_names.index("query_text")].replace(r"\n", "\n")
      
  2. Update DataHub to the Latest Version:

    • Ensure you are using the latest version of DataHub, as there might be fixes or improvements related to this issue in newer releases.
  3. Modify the Ingestion Recipe:

    • If you are using a custom ingestion recipe, ensure that it is correctly configured to handle table renames and other SQL operations. You might need to adjust the schema_pattern or other configurations to avoid parsing issues.
  4. Review the Redshift Lineage Code:
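The recipe adjustment mentioned in step 3 can be sketched concretely. The following is a hypothetical Redshift recipe expressed as the Python dict equivalent of a YAML recipe; the host, database, schema names, and sink address are placeholders, and the `schema_pattern` / `include_table_lineage` options are assumptions based on the documented Redshift source config — verify them against your DataHub version before use.

```python
# Hypothetical Redshift ingestion recipe (dict form of a YAML recipe).
# All values are placeholders -- adapt them to your environment.
recipe = {
    "source": {
        "type": "redshift",
        "config": {
            "host_port": "my-cluster.example.us-east-1.redshift.amazonaws.com:5439",
            "database": "dataplatform",
            "username": "datahub",
            "password": "${REDSHIFT_PASSWORD}",  # resolved from the DataHub secret
            # Denying the dbt staging schemas may reduce exposure to the
            # ALTER TABLE ... RENAME statements that trip the parser.
            "schema_pattern": {"deny": ["staging", "staging_vault", "tmp"]},
            # Disabling lineage extraction entirely is a blunter workaround
            # for the crash, at the cost of losing lineage metadata.
            "include_table_lineage": False,
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://datahub-gms:8080"},
    },
}
```

Since the thread uses UI ingestion, the same settings can be entered in the YAML view of the ingestion source editor.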

Example Code Snippet for Handling Newline Characters

Here is an example of how you can modify the redshift_schema.py file to handle newline characters in the query text:

# Modify the redshift_schema.py file
query_text = row[field_names.index("query_text")].replace(r"\n", "\n")
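The effect of that one-line change can be illustrated in isolation. The query text below is a shortened, hypothetical version of the dbt-generated statement from the traceback; the key point is that r"\n" matches the literal two-character sequence backslash + n (as returned by Redshift’s system tables), not a real newline.

```python
# Shortened, hypothetical version of the query text from the traceback.
# The "\\n" here is the literal two-character sequence backslash + n,
# which the SQL parser trips over.
raw_query_text = (
    '/* {"target_name": "prod"} */\\n'
    'alter table "dataplatform"."staging"."stg_example" '
    'rename to "stg_example__dbt_backup"'
)

# r"\n" matches backslash + n; replacing it with a real newline yields
# SQL that a parser can tokenize normally.
normalized = raw_query_text.replace(r"\n", "\n")

print(normalized.splitlines()[1])  # the ALTER TABLE statement on its own line
```

After normalization, the dbt comment and the ALTER TABLE statement sit on separate lines, so the stray backslash that triggered the ParseError is gone.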


Would you like more detailed steps or further assistance with any specific part of the process?
