Troubleshooting Metadata Ingestion Stuck at Cloning Git Repository

Original Slack Thread

<@U06TM7M40RK>, I am running metadata ingestion for the lookml source, but the stage is stuck at cloning the git repository.
Logs for the job:
[2024-12-31 15:06:01,777] INFO {datahub.ingestion.source.git.git_import:56} - ⏳ Cloning repo 'git@gitlab.com:group/looker-dev-cluster.git', this can take some time...

The lookml ingestion hanging at the “cloning git repository” stage is usually related to the configuration of the repo_ssh_locator or the SSH key used to access the repository. Based on the available data sources, here are a few things to check:

  1. SSH Key Configuration: Ensure that the SSH key provided in your configuration is correctly set up as a deploy key for the GitLab repository. According to the Looker source page in the DataHub documentation, you need to generate a private-public SSH key pair and add the public key to your GitLab repository as a deploy key with read access.

  2. Repo SSH Locator: The repo_ssh_locator should be correctly set to the URL you want to use for cloning. If you’re using a non-standard Git host, you might need to manually set this parameter. As per the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/configuration/git.py|Git configuration file>, the repo_ssh_locator is inferred for GitHub and GitLab, but for other hosts, it needs to be set manually.

  3. Network and Connectivity: Ensure that there are no network issues preventing the connection to the GitLab server. This includes checking firewall settings, VPNs, or any other network configurations that might block the connection. A quick way to test this outside DataHub is sketched after this list.

  4. GitLab Repository Access: Double-check that the repository URL is correct and that the credentials provided have the necessary permissions to access the repository.
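
A common reason the clone hangs is that ssh is silently waiting on an interactive prompt, such as an unverified host key or a passphrase-protected key. The following is a minimal sketch for reproducing the clone outside DataHub; the key path and clone target are hypothetical and should be adjusted to your setup:

    import os
    import subprocess

    # Hypothetical paths -- adjust the deploy key location and clone target.
    key = os.path.expanduser("~/.ssh/looker_deploy_key")
    repo = "git@gitlab.com:group/looker-dev-cluster.git"

    # 1. Check that the deploy key authenticates against GitLab at all.
    #    BatchMode makes ssh fail fast instead of hanging on an interactive
    #    prompt (unknown host key, key passphrase), which is a common cause
    #    of a clone that never progresses.
    subprocess.run(
        ["ssh", "-i", key, "-o", "BatchMode=yes", "-T", "git@gitlab.com"],
        check=False,
    )

    # 2. Reproduce the clone DataHub performs, with verbose ssh output so
    #    any stall is visible in the logs.
    subprocess.run(
        ["git", "clone", "--depth", "1", repo, "/tmp/looker-clone-test"],
        env={**os.environ, "GIT_SSH_COMMAND": f"ssh -i {key} -o BatchMode=yes -v"},
        check=True,
    )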

Would you like more detailed guidance on any of these steps?

<@U06TM7M40RK>, How can I get metrics for metadata ingestion in DataHub?

To obtain metrics for metadata ingestion in DataHub, you can use the DatahubIngestionRunSummaryClass and DatahubIngestionCheckpointClass, which provide detailed information about the ingestion process. Here are some key points to consider:

  1. DatahubIngestionRunSummaryClass: This class provides a summary of a DataHub ingestion run. It includes various metrics such as the number of work units committed, number of events, number of entities, number of aspects, and more. You can access properties like numWorkUnitsCommitted, numEvents, numEntities, numAspects, numWarnings, and numErrors to get detailed metrics about the ingestion run. For more details, refer to the Models section of the DataHub documentation.

  2. DatahubIngestionCheckpointClass: This class provides a checkpoint of a DataHub ingestion run for a given job. It includes properties such as timestampMillis, pipelineName, platformInstanceId, config, state, and more. This can be useful for tracking the state and configuration of an ingestion run. More information can be found in the Models section of the DataHub documentation.

  3. Logging and Monitoring: Ensure that your ingestion pipelines are configured to log detailed information. This can help you monitor the ingestion process and identify any issues or bottlenecks.

  4. Ingestion Reports: Review the ingestion reports generated by DataHub. These reports provide insights into the ingestion process, including the number of records processed, warnings, and errors. A programmatic sketch follows this list.
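
For programmatic access to these metrics, the ingestion pipeline object exposes its run report directly. Below is a minimal sketch using the acryl-datahub Python package; the file-based recipe is a placeholder for your own source configuration, and the field listing at the end assumes the generated model classes are available in datahub.metadata.schema_classes:

    from datahub.ingestion.run.pipeline import Pipeline
    from datahub.metadata.schema_classes import DatahubIngestionRunSummaryClass

    # Placeholder recipe: a file-based source feeding the default REST sink.
    # Substitute your own source (e.g. lookml) configuration here.
    pipeline = Pipeline.create(
        {
            "source": {"type": "file", "config": {"path": "./sample_mcps.json"}},
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()

    # The source report carries the per-run metrics: work units produced,
    # warnings, failures, and timing.
    print(pipeline.source.get_report().as_string())

    # Combined source + sink summary, as printed by the `datahub ingest` CLI.
    pipeline.pretty_print_summary()

    # The generated model class mentioned above lists the summary fields
    # (numWorkUnitsCommitted, numEvents, numEntities, ...):
    print([f.name for f in DatahubIngestionRunSummaryClass.RECORD_SCHEMA.fields])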

Would you like more detailed guidance on how to implement or access these metrics?

<@U06TM7M40RK>, Are there any metrics which are pre-exposed by DataHub?

DataHub exposes several metrics out of the box that can be used to monitor various aspects of the system. Here are some key points based on the available data sources:

  1. GraphQL Metrics: According to a Slack conversation, there are metrics related to GraphQL queries such as metrics_com_datahub_graphql_GraphQLController_searchAcrossEntities_Count, which tracks the number of times the searchAcrossEntities query is called. This can be useful for monitoring search activity within DataHub.

  2. Ingestion Metrics: The metrics_com_linkedin_metadata_resources_entity_AspectResource_ingestProposal_Count metric is mentioned as a way to track the validation of incoming metadata changes. This can be useful for monitoring the write path and ingestion activities.

  3. Monitoring and Tracing: DataHub supports monitoring through tools like Prometheus and Grafana. According to the <https://datahubproject.io/docs/0.14.0/advanced/monitoring|Monitoring DataHub documentation>, you can set up tracing and metrics measurements using OpenTelemetry and Dropwizard Metrics. This setup allows you to track various system components and performance metrics. A sketch of scraping these counters follows this list.

  4. Analytics Tab: There is an Analytics tab in DataHub that provides metrics like Weekly Active Users (WAU), Monthly Active Users (MAU), and usage statistics. This can help you understand user engagement and system usage.
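
As a rough illustration, once an exporter is enabled you can scrape and filter the counters mentioned above. The endpoint URL below is an assumption; the actual host and port depend on how your deployment exposes the metrics:

    import urllib.request

    # Hypothetical endpoint: the real host/port depend on how the JMX /
    # Prometheus exporter is wired up in your deployment (see the
    # monitoring docs linked above).
    METRICS_URL = "http://datahub-gms:4318/metrics"

    with urllib.request.urlopen(METRICS_URL) as resp:
        body = resp.read().decode("utf-8")

    # Filter for the ingestion and GraphQL counters mentioned above.
    for line in body.splitlines():
        if "ingestProposal" in line or "searchAcrossEntities" in line:
            print(line)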

If you need more detailed information on setting up or accessing these metrics, feel free to ask!

<@U06TM7M40RK>, What is ingestProposal?

The ingestProposal endpoint is part of DataHub’s Rest.li API and is used for ingesting individual metadata aspects into the system. It allows you to propose changes to the metadata graph by submitting a MetadataChangeProposal (MCP). Here’s a brief overview based on the available data sources:

  1. Purpose: The ingestProposal endpoint is designed to handle metadata changes at a granular level by ingesting individual aspects of an entity. This is part of DataHub’s flexible ingestion architecture, which supports both synchronous and asynchronous models.

  2. Usage: To use the ingestProposal endpoint, you need to provide a JSON payload that includes the entity type, entity URN, change type (such as UPSERT or DELETE), aspect name, and a JSON-serialized aspect. This payload corresponds to the PDL schema defined for the aspect.

  3. Example: Here is an example of how to use the ingestProposal endpoint via a curl command (a Python SDK equivalent follows this list):

    curl --location --request POST 'http://localhost:8080/aspects?action=ingestProposal' \
    --header 'X-RestLi-Protocol-Version: 2.0.0' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "proposal" : {
        "entityType": "dataset",
        "entityUrn" : "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
        "changeType" : "UPSERT",
        "aspectName" : "datasetUsageStatistics",
        "aspect" : {
          "value" : "{ \"timestampMillis\":1629840771000,\"uniqueUserCount\" : 10, \"totalSqlQueries\": 20, \"fieldCounts\": [ {\"fieldPath\": \"col1\", \"count\": 20}, {\"fieldPath\" : \"col2\", \"count\": 5} ]}",
          "contentType": "application/json"
        }
      }
    }'
    
  4. Deprecation of Legacy Methods: The ingestProposal method is part of a move away from legacy snapshot models, focusing instead on ingesting aspects directly for more flexibility and efficiency.
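
The DataHub Python SDK wraps this same endpoint, so the curl call above can also be expressed through the emitter API. A minimal sketch (the aspect value is illustrative):

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    # One aspect per proposal; entityType and changeType (UPSERT) are
    # inferred by the wrapper.
    mcp = MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
        aspect=DatasetPropertiesClass(description="Sample Hive dataset"),
    )
    emitter.emit_mcp(mcp)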

For more detailed information, you can refer to the <https://datahubproject.io/docs/0.14.0/api/restli/restli-overview|Rest.li API documentation>.

Would you like more detailed guidance on how to implement or use this endpoint?

<@U06TM7M40RK>, How is the ingest endpoint different from ingestProposal?

When we do metadata ingestion, does it call ingestProposal?

The ingest and ingestProposal endpoints in DataHub serve different purposes and are used in different contexts for metadata ingestion. Here’s a breakdown based on the available data sources:

  1. Ingest Endpoint:

    • The ingest endpoint is used for ingesting multiple aspects about a particular entity at the same time. This is part of the legacy method of ingesting entities via snapshots.
    • It allows you to create or update an entity by providing a snapshot that includes multiple aspects. This method is being deprecated in favor of more granular aspect-based ingestion.
    • An example of using the ingest endpoint is to create a dataset with multiple aspects like ownership, institutional memory, and schema metadata.
  2. IngestProposal Endpoint:

    • The ingestProposal endpoint is designed for ingesting individual aspects of an entity. It focuses on making changes at a more granular level by proposing changes to specific aspects.
    • This endpoint is part of the newer approach that emphasizes the use of MetadataChangeProposal (MCP) for more flexible and efficient metadata management.
    • The ingestProposal method is used to submit a single aspect change, which aligns with the atomicity contract of the metadata model.
  3. Metadata Ingestion Process:

    • When performing metadata ingestion, the process typically involves calling the ingestProposal endpoint to handle individual aspect changes. This is part of the shift towards using MCPs for metadata changes.
    • The ingestProposal method is preferred for its flexibility and ability to handle changes asynchronously or synchronously, depending on the configuration.

In summary, the ingestProposal endpoint is more aligned with the current best practices for metadata ingestion in DataHub, focusing on aspect-level changes, while the ingest endpoint is part of the legacy approach that is being phased out.
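
To make the contrast concrete, here is a sketch of both paths through the Python SDK: the legacy snapshot (MCE) bundling several aspects into one call, versus one MetadataChangeProposal per aspect. The URN and aspect values are illustrative:

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetPropertiesClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        StatusClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    urn = "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"

    # Legacy path (`ingest`): a snapshot (MCE) bundles several aspects
    # into a single call.
    emitter.emit_mce(
        MetadataChangeEventClass(
            proposedSnapshot=DatasetSnapshotClass(
                urn=urn,
                aspects=[
                    DatasetPropertiesClass(description="Sample Hive dataset"),
                    StatusClass(removed=False),
                ],
            )
        )
    )

    # Current path (`ingestProposal`): one proposal per aspect.
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(entityUrn=urn, aspect=StatusClass(removed=False))
    )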

Would you like more detailed guidance on how to use these endpoints?
