Estimator choice for mapping string independent variable to string categorical dependent variableNLP software for classification of large datasetsMethods for automated synonym detectionEstimating dependent variable as sum of functions of independent variablesHow to apply machine learning to fuzzy matchingPython - How to intuit word from abbreviated text using NLP?What is the difference between “feature” and “independent variable”?What's the ideal way to include dictionaries (gazetteer) in spaCy to improve NER?Does linear regression work with a categorical independent variable & continuous dependent variable?relation in between a categorical dependent variable and combination of independent variablesHow to compare the continuous dependent variable with categorical independent variable?

Why would Basel III prevent price discovery at credit markets?

Meaning of " brothers in arms"

Why is destructor not called in operator delete?

numpy 1D array: mask elements that repeat more than n times

Right way to say I disagree with the design but ok I will do

Did Bercow say he would have sent the EU extension-request letter himself, had Johnson not done so?

Why buy a first class ticket on Southern trains?

Uncountably many functions coinciding only finitely many values

Continents with simplex noise

C function to check the validity of a date in DD.MM.YYYY format

What is the difference between Anführer and Führer?

Has Turkey released the Greek soldiers they apprehended in 2018?

Is Dom based XSS still a valid security concern in modern browsers?

Echo bracket symbol to terminal

Double feature: Weightier

Replacing triangulated categories with something better

Why is the past tense of vomit generally spelled 'vomited' rather than 'vomitted'?

How offensive is Fachidiot?

Is velocity a valid measure of team and process improvement?

Run "cd" command as superuser in Linux

Precious Stone, as Clear as Diamond

Uncooked peppers and garlic in olive oil fizzled when opened

How can I determine if two vertices on a polygon are consecutive?

AniPop - The anime downloader

Estimator choice for mapping string independent variable to string categorical dependent variable

NLP software for classification of large datasetsMethods for automated synonym detectionEstimating dependent variable as sum of functions of independent variablesHow to apply machine learning to fuzzy matchingPython - How to intuit word from abbreviated text using NLP?What is the difference between “feature” and “independent variable”?What's the ideal way to include dictionaries (gazetteer) in spaCy to improve NER?Does linear regression work with a categorical independent variable & continuous dependent variable?relation in between a categorical dependent variable and combination of independent variablesHow to compare the continuous dependent variable with categorical independent variable?

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty
margin-bottom:0;

I'm attempting to build a predictive model that can map text-based vendor-provided descriptions of a service to around 800 standardized service codes, based on a training set of about 13,000 correctly mapped services.

Each standardized service code also has a standardized description, which is usually similar to the vendor-provided description (i.e, some of the words used are the same), but not identical. Descriptions are typically 3-10 word in length

My main issue is that I'm not sure what type of estimator will be appropriate for this problem.

I have tried using simple fuzzy matching approaches, including:

Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches

Trying to find the standardized service description with the minimum Levenshtein distance

These have not worked particularly well due to the use of synonymous but different word choices within the vendor-provided and standardized descriptions.

I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.

Which type of estimator can I use to solve this problem?

asked Mar 28 at 21:35

J-Sharp

You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms

– Ken4scholars
Mar 29 at 8:37

Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?

– J-Sharp
Mar 29 at 14:20

You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word

– Ken4scholars
Mar 29 at 16:50

Got it, thank you very much @Ken4scholars

– J-Sharp
Apr 1 at 18:07

add a comment
|

My main issue is that I'm not sure what type of estimator will be appropriate for this problem.

I have tried using simple fuzzy matching approaches, including:

Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches

Trying to find the standardized service description with the minimum Levenshtein distance

These have not worked particularly well due to the use of synonymous but different word choices within the vendor-provided and standardized descriptions.

I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.

Which type of estimator can I use to solve this problem?

asked Mar 28 at 21:35

J-Sharp

You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms

– Ken4scholars
Mar 29 at 8:37

Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?

– J-Sharp
Mar 29 at 14:20

You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word

– Ken4scholars
Mar 29 at 16:50

Got it, thank you very much @Ken4scholars

– J-Sharp
Apr 1 at 18:07

add a comment
|

My main issue is that I'm not sure what type of estimator will be appropriate for this problem.

I have tried using simple fuzzy matching approaches, including:

Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches

Trying to find the standardized service description with the minimum Levenshtein distance

These have not worked particularly well due to the use of synonymous but different word choices within the vendor-provided and standardized descriptions.

I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.

Which type of estimator can I use to solve this problem?

asked Mar 28 at 21:35

J-Sharp

My main issue is that I'm not sure what type of estimator will be appropriate for this problem.

I have tried using simple fuzzy matching approaches, including:

Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches

Trying to find the standardized service description with the minimum Levenshtein distance

These have not worked particularly well due to the use of synonymous but different word choices within the vendor-provided and standardized descriptions.

I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.

Which type of estimator can I use to solve this problem?

machine-learning nlp data-science

asked Mar 28 at 21:35

J-Sharp

asked Mar 28 at 21:35

J-Sharp

asked Mar 28 at 21:35

J-Sharp

asked Mar 28 at 21:35

J-Sharp

asked Mar 28 at 21:35

J-Sharp

You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms

– Ken4scholars
Mar 29 at 8:37

Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?

– J-Sharp
Mar 29 at 14:20

You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word

– Ken4scholars
Mar 29 at 16:50

Got it, thank you very much @Ken4scholars

– J-Sharp
Apr 1 at 18:07

add a comment
|

You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms

– Ken4scholars
Mar 29 at 8:37

Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?

– J-Sharp
Mar 29 at 14:20

You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word

– Ken4scholars
Mar 29 at 16:50

Got it, thank you very much @Ken4scholars

– J-Sharp
Apr 1 at 18:07

You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms

– Ken4scholars
Mar 29 at 8:37

Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?

– J-Sharp
Mar 29 at 14:20

You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word

– Ken4scholars
Mar 29 at 16:50

Got it, thank you very much @Ken4scholars

– J-Sharp
Apr 1 at 18:07

add a comment
|

0

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55407207%2festimator-choice-for-mapping-string-independent-variable-to-string-categorical-d%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

0

active

oldest

votes

0

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Styjun

0

Your Answer

Post as a guest

0

0

Post as a guest

Popular posts from this blog

밀양 대씨 역사 각주 함께 보기 둘러보기 메뉴밀양 대씨

1973년 목차 사건 문화 탄생 사망 노벨상 달력 둘러보기 메뉴

0

Your Answer

Sign up or log in

Post as a guest

Post as a guest

0

0

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

밀양 대씨 역사 각주 함께 보기 둘러보기 메뉴밀양 대씨

1973년 목차 사건 문화 탄생 사망 노벨상 달력 둘러보기 메뉴