Ingesting a CSV file into DataHub using version 0.13 and configuring the ingestion from the DataHub interface

Original Slack Thread

Hello everyone. I want to ingest a CSV file located on my computer into DataHub. What are the steps to follow, and how do I properly configure the ingestion from the DataHub interface? I’m using version 0.13, and the final goal is to build a Data Catalog over BigQuery, but for the sake of this message I just want to ingest a CSV file from my PC into DataHub.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

It will probably be easier to directly ingest from BigQuery using our BigQuery source (https://datahubproject.io/docs/next/generated/ingestion/sources/bigquery/) than to ingest from a CSV. CSV files don’t contain metadata as rich as databases / data warehouses, so we’d have to do type inference on the data to be able to ingest them properly. We currently don’t have an ingestion source that does this on CSV files.
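For reference, a minimal BigQuery recipe might look something like the sketch below. This is an illustrative example, not from the thread: the project ID and REST server address are placeholder assumptions, and in the UI the sink section is usually filled in for you.

```
source:
  type: bigquery
  config:
    # Placeholder project ID - replace with your own GCP project
    project_ids:
      - my-gcp-project
sink:
  type: datahub-rest
  config:
    # Assumed default for a local quickstart deployment
    server: http://localhost:8080
```

When configuring from the DataHub UI, you would paste only the source portion into the recipe editor and supply credentials separately.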

Actually, you can use the S3 source (https://datahubproject.io/docs/next/generated/ingestion/sources/s3/) to ingest a CSV file. Despite the source name, you can use this one to ingest local files. Specify as follows:

```
type: s3
config:
  path_specs:
    - include: "./relative/directory"
```

to ingest all files in a directory. Or I believe you can specify a single CSV file directly.
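Putting that together, a fuller local-CSV recipe might look like the sketch below. The file path, the `*.csv` glob, and the sink server address are assumptions for illustration; check the S3 source docs for the exact `path_specs` pattern syntax your version supports.

```
source:
  type: s3
  config:
    path_specs:
      # Placeholder local path and glob - adjust to where your CSV lives
      - include: "/home/user/data/*.csv"
sink:
  type: datahub-rest
  config:
    # Assumed default for a local quickstart deployment
    server: http://localhost:8080
```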

Thank you, Andrew! Yes, I know that ingesting from BigQuery is easier and more powerful; I just wanted to try every possibility :slightly_smiling_face: