Troubleshooting Detailed Logging in DataHub CLI Upgrade to v0.13.3

Original Slack Thread

Hi Team
After upgrading the CLI to v0.13.3 I don’t see the logs below anymore. All I see is the job report:

2024-06-20 07:11:57,326 INFO sqlalchemy.engine.Engine DESCRIBE FORMATTED acctlnkds.tcic_acct_pgm_sta_hist
[2024-06-20 07:11:57,326] INFO     {sqlalchemy.engine.Engine:1863} - DESCRIBE FORMATTED acctlnkds.tcic_acct_pgm_sta_hist
Are these logs disabled completely? Without these logs it is not possible to track progress or to know which dataset was ingested last.

In my case there are millions of datasets, and if the ingestion fails for any reason (network issue, permission error, HiveException, SerDe error), I need to know which schema it completed up to and then re-run with the remaining schemas.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


Hi! To better assist you, could you please provide the following information?

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery)

Based on your initial message, it seems you are using the CLI for ingestion and have upgraded to DataHub CLI version 0.13.3. You mentioned that you are not seeing detailed logs anymore and only see the job report. This change could be related to recent updates in the logging configuration or ingestion framework.

To address your concerns:

  1. Logging Configuration: There might have been changes in the logging configuration with the new version. You can check the <https://github.com/datahub-project/datahub/blob/master/docs/cli.md|DataHub CLI documentation> to see if there are any updates or configurations you need to adjust to re-enable detailed logging.
  2. Ingestion Progress: If detailed logs are crucial for your ingestion process, especially for tracking progress and handling errors, you might need to adjust the logging level or configuration in your ingestion scripts.

Would you like more detailed steps on how to check and adjust the logging configuration for DataHub CLI?

Yes

<@U06TM7M40RK>
Send more detailed steps

<@U06TM7M40RK>
Yes

To re-enable detailed logging in DataHub CLI version 0.13.3 and ensure you can track the ingestion progress and handle errors effectively, follow these detailed steps:

Step 1: Verify Logging Configuration
First, ensure that the logging configuration is set to capture detailed logs. The DataHub CLI uses the logback configuration for logging. You can adjust the logging level by modifying the logback.xml file.

Example logback.xml Configuration:

<configuration>
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="STDOUT" />
    </root>

    <logger name="org.apache" level="DEBUG" />
    <logger name="org.apache.hive" level="DEBUG" />
    <logger name="sqlalchemy.engine" level="DEBUG" />
</configuration>

Step 2: Apply Custom Logging Configuration
To apply a custom logging configuration in DataHub CLI, you can set the JAVA_OPTS environment variable to use your custom logback.xml file.

Example Command:

export JAVA_OPTS="-Dlogback.configurationFile=/path/to/your/logback.xml"
datahub ingest -c your_recipe.yaml
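
Caveat: logback and JAVA_OPTS configure DataHub’s Java components (such as the GMS server), while the datahub CLI itself is a Python application, so the settings above may not affect a local ingestion run. A minimal sketch of the Python-logging equivalent of the logback levels above, run from a wrapper script before starting the pipeline (an unofficial workaround, not a documented DataHub hook):

import logging.config

# Mirror the logback configuration above with Python's logging module:
# INFO at the root, plus the sqlalchemy.engine logger, which emits each
# executed SQL statement at INFO.
logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(levelname)-5s %(name)s - %(message)s"},
    },
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "plain",
        },
    },
    "root": {"level": "INFO", "handlers": ["stdout"]},
    "loggers": {
        "sqlalchemy.engine": {"level": "INFO"},
    },
})

With this in place, running the pipeline from the same script (as shown further down in this thread) should print the per-statement log lines again.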

Step 3: Verify Ingestion Logs
After running the ingestion command, verify that the detailed logs are being captured as expected. The logs should include detailed information about the ingestion process, including the progress and any errors encountered.

Step 4: Handle Ingestion Failures
If ingestion fails due to network issues, permission errors, or other exceptions, review the detailed logs to identify the last successfully ingested dataset. You can then modify your ingestion recipe to resume from the last successful point; a sketch of this follows the example recipe below.

Example Ingestion Recipe:

source:
  type: hive
  config:
    username: your_username
    password: your_password
    host_port: your_hive_host:10000
    database: your_database
    # Add other necessary configurations

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
    token: your_token
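
A sketch of the resume step, written with the Pipeline API that appears later in this thread. The database_pattern field is the allow/deny filter exposed by DataHub’s two-tier SQL sources such as Hive; verify the exact field name against the Hive source docs for your CLI version, and treat the regexes as placeholders:

from datahub.ingestion.run.pipeline import Pipeline

# Hypothetical resume: schemas matching "^[a-l].*" finished before the
# failure, so deny them and re-ingest only the remainder.
pipeline = Pipeline.create({
    "source": {
        "type": "hive",
        "config": {
            "username": "your_username",
            "password": "your_password",
            "host_port": "your_hive_host:10000",
            "database_pattern": {
                "deny": ["^[a-l].*"],  # schemas already ingested successfully
            },
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080", "token": "your_token"},
    },
})
pipeline.run()
pipeline.raise_from_status()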

Would you like more specific examples or further assistance with any of these steps?

<@U06TM7M40RK>
How to print the query being executed during ingestion?

To print the SQL queries being executed during ingestion, you can enable detailed logging for the SQLAlchemy engine, which is commonly used in DataHub for database interactions. This can be done by configuring the logging settings in your environment or ingestion script.

Step-by-Step Guide:

  1. Modify Logging Configuration:
    Ensure that the logging configuration captures detailed SQLAlchemy logs. You can do this by modifying the logback.xml file or by setting up logging in your Python script.

  2. Set Up Logging in Python Script:
    If you are running the ingestion via a Python script, you can configure logging directly in the script.

Example Python Script with Logging Configuration:

import logging

from datahub.ingestion.run.pipeline import Pipeline

# Configure logging: keep the root logger at INFO so the output stays
# readable, and enable the sqlalchemy.engine logger, which prints each
# SQL statement as it is executed.
logging.basicConfig(level=logging.INFO)
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

# Define the ingestion recipe and create the pipeline
pipeline = Pipeline.create({
    "source": {
        "type": "hive",
        "config": {
            "username": "your_username",
            "password": "your_password",
            "host_port": "your_hive_host:10000",
            "database": "your_database"
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": "http://localhost:8080",
            "token": "your_token"
        }
    }
})

# Run the pipeline
pipeline.run()
pipeline.raise_from_status()

  3. Set Up Logging in Shell Environment:
    If you prefer to configure logging via the shell environment, you can set the SQLALCHEMY_ECHO environment variable to True.

Example Shell Command:

export SQLALCHEMY_ECHO=True
datahub ingest -c your_recipe.yaml
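
Caveat: SQLALCHEMY_ECHO is a Flask-SQLAlchemy configuration key, not an environment variable that plain SQLAlchemy (or, as far as I can tell, the datahub CLI) reads, so the command above may have no effect. Per the SQLAlchemy documentation, echo=True is simply shorthand for setting the sqlalchemy.engine logger to INFO, so a wrapper-script fallback is:

import logging

# SQLAlchemy's documented equivalent of echo=True: each executed statement
# is logged at INFO on the sqlalchemy.engine logger.
logging.basicConfig()
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)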

Verify Logs:
After setting up the logging configuration, run your ingestion command. The SQL queries being executed should now be printed in the console output or log files.

Would you like more specific examples or further assistance with any of these steps?

<@U06TM7M40RK>
SQLALCHEMY_ECHO is not working

I’m not able to find any data sources to answer your question.

Could you provide more details on your ingestion setup, such as the specific DataHub version, the database type you’re integrating with, and any custom configurations you might have?

If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@U06TM7M40RK>
Logging in Shell Environment is not working

I’m not able to find any data sources to answer your question.

Could you specify which DataHub version you are using and the exact command you are running for ingestion?

If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@UV14447EU> <@U01GZEETMEZ>
Can you please help me with the detailed logging during ingestion?
I tried DATAHUB_DEBUG=true; this prints the query but also the cURL request to GMS, so I can’t use it because it prints unnecessary data.

Is it possible to get the old logging behavior back (query + CLI report)?
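
One possible workaround (an untested sketch, not an official DataHub feature): drive the ingestion from a small Python wrapper and set per-logger levels, keeping sqlalchemy.engine at INFO for the query lines while raising the levels of the loggers that produce the HTTP noise. The noise-logger names below (urllib3, requests, datahub.emitter.rest_emitter) are assumptions based on the libraries the REST sink uses; match them against whatever logger names appear in your DATAHUB_DEBUG output.

import logging

from datahub.ingestion.run.pipeline import Pipeline

# Keep the overall output at INFO so the CLI-style report stays readable.
logging.basicConfig(level=logging.INFO)

# SQLAlchemy logs each executed statement at INFO on this logger, which
# restores the per-query lines seen before the upgrade.
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

# Assumed noise sources: the REST emitter talks to GMS via requests/urllib3,
# so raising these levels should hide the per-request cURL-style output.
logging.getLogger("urllib3").setLevel(logging.WARNING)
logging.getLogger("requests").setLevel(logging.WARNING)
logging.getLogger("datahub.emitter.rest_emitter").setLevel(logging.WARNING)

pipeline = Pipeline.create({
    "source": {
        "type": "hive",
        "config": {
            "host_port": "your_hive_host:10000",
            "database": "your_database",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080", "token": "your_token"},
    },
})
pipeline.run()
pipeline.pretty_print_summary()  # prints the job report, as the CLI does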