Searching Tables with Non-English Descriptions in DataHub UI

Original Slack Thread

Hi, can the DataHub UI search find tables whose documentation or field descriptions contain the query?

Yes, these are searchable fields that are queried by default when doing a search.

<@UV5UEC3LN> I tried, but it doesn’t return the table whose field description contains my query. I wonder if search is restricted to English, because I wrote my description in Chinese. Thanks.

What description did you write? Was it more than 3 characters and was your search query also more than 3 characters?

<@UV5UEC3LN> A field description is 日频k线的开盘价格 (“opening price of the daily candlestick”), and my query is 开盘 (“opening”). I also tried a query with more than 3 characters, e.g. 开盘价格 (“opening price”). It still returns nothing.

I also tested a query in English, and it works: it finds the table whose column descriptions contain the query.

Or is there an option to set in DataHub/Elasticsearch that makes them support UTF-8 characters?

Ah, so this is a limitation on how we’re dividing “words.” 日频k线的开盘价格 is treated as a single “word,” and within a word we only do prefix-based fuzzy matching. So, for example, a query for 日频k线 should match your dataset, but 开盘价格 would not, because it’s a partial match at the end of the “word.”
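
To make the prefix-only behavior concrete, here is a minimal Python sketch. It is not DataHub’s or Elasticsearch’s actual code: the function names `prefix_tokens` and `matches` are hypothetical, and it assumes the whole description is indexed as one “word” with no minimum token length.

```python
def prefix_tokens(word: str) -> set[str]:
    """Return every leading substring of `word` (its edge n-grams)."""
    return {word[:i] for i in range(1, len(word) + 1)}


def matches(query: str, word: str) -> bool:
    """A query matches only if it equals one of the word's prefixes."""
    return query in prefix_tokens(word)


# Assumption: the description is treated as a single "word".
description = "日频k线的开盘价格"

print(matches("日频k线", description))   # True: the query is a prefix of the word
print(matches("开盘价格", description))  # False: the query sits at the end of the word
print(matches("开盘", description))      # False: the query sits in the middle of the word
```

Under this simplified model, only queries that start at the first character of the description can match, which is consistent with the behavior reported in the thread.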

I can definitely understand why this wouldn’t be an ideal tokenization strategy for Chinese, where each character can be a full word or even more on its own, but fixing it would probably require custom, language-specific tokenization strategies configured in Elasticsearch.

You may want to look into:

This will require in-depth knowledge of Elasticsearch tokenization and analyzers to get working in a desirable way, though.
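
As a rough illustration of what a custom analyzer configuration can look like, the sketch below builds Elasticsearch index settings as a Python dict and prints them as JSON. The `ngram` tokenizer and the `index.max_ngram_diff` setting are standard Elasticsearch features, but the names `description_ngram_tokenizer` and `description_ngram`, and the gram sizes, are hypothetical choices for this example. How such settings would be wired into DataHub’s own indices is an open question that this thread does not answer.

```python
import json

MIN_GRAM = 1  # assumption: single Chinese characters should be searchable
MAX_GRAM = 4  # assumption: queries longer than 4 characters are rare

index_settings = {
    "settings": {
        # Elasticsearch rejects ngram tokenizers whose max_gram - min_gram
        # exceeds index.max_ngram_diff, so raise the limit to fit.
        "index": {"max_ngram_diff": MAX_GRAM - MIN_GRAM},
        "analysis": {
            "tokenizer": {
                # Hypothetical name; emits substrings from every position.
                "description_ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": MIN_GRAM,
                    "max_gram": MAX_GRAM,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                # Hypothetical name; would be referenced from a field mapping.
                "description_ngram": {
                    "type": "custom",
                    "tokenizer": "description_ngram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        },
    }
}

# ensure_ascii=False keeps any non-ASCII text readable in the output.
print(json.dumps(index_settings, indent=2, ensure_ascii=False))
```

Elasticsearch also ships a built-in `cjk` analyzer that indexes overlapping character bigrams for Chinese, Japanese, and Korean text, which may be a simpler starting point than a hand-built ngram analyzer.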

What you’ll probably want here is full ngram matching instead of just prefix matching. We don’t do this by default because it is very expensive in terms of both performance and Elasticsearch index size.
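
The cost difference can be seen with a small Python sketch that counts the tokens each strategy would produce for the description from this thread. The helper names `edge_ngrams` and `full_ngrams` are hypothetical, and the sketch generates substrings of every length, whereas a real index would normally cap the gram size.

```python
def edge_ngrams(word: str) -> list[str]:
    """Prefixes only: one token per length, all starting at position 0."""
    return [word[:end] for end in range(1, len(word) + 1)]


def full_ngrams(word: str) -> list[str]:
    """Every contiguous substring, starting at every position."""
    return [
        word[start:end]
        for start in range(len(word))
        for end in range(start + 1, len(word) + 1)
    ]


description = "日频k线的开盘价格"  # 9 characters

print(len(edge_ngrams(description)))          # 9
print(len(full_ngrams(description)))          # 45
print("开盘价格" in edge_ngrams(description))  # False
print("开盘价格" in full_ngrams(description))  # True
```

For a word of n characters, prefixes grow linearly (n tokens) while uncapped full ngrams grow quadratically (n(n+1)/2 tokens), which is why full ngram matching finds 开盘价格 but inflates the index.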

Thanks for your help. I have been experimenting with Elasticsearch tokenization these days, but I haven’t gotten it working yet.