Estimator choice for mapping string independent variable to string categorical dependent variableNLP software for classification of large datasetsMethods for automated synonym detectionEstimating dependent variable as sum of functions of independent variablesHow to apply machine learning to fuzzy matchingPython - How to intuit word from abbreviated text using NLP?What is the difference between “feature” and “independent variable”?What's the ideal way to include dictionaries (gazetteer) in spaCy to improve NER?Does linear regression work with a categorical independent variable & continuous dependent variable?relation in between a categorical dependent variable and combination of independent variablesHow to compare the continuous dependent variable with categorical independent variable?
Why would Basel III prevent price discovery at credit markets?
Meaning of " brothers in arms"
Why is destructor not called in operator delete?
numpy 1D array: mask elements that repeat more than n times
Right way to say I disagree with the design but ok I will do
Did Bercow say he would have sent the EU extension-request letter himself, had Johnson not done so?
Why buy a first class ticket on Southern trains?
Uncountably many functions coinciding only finitely many values
Continents with simplex noise
C function to check the validity of a date in DD.MM.YYYY format
What is the difference between Anführer and Führer?
Has Turkey released the Greek soldiers they apprehended in 2018?
Is Dom based XSS still a valid security concern in modern browsers?
Echo bracket symbol to terminal
Double feature: Weightier
Replacing triangulated categories with something better
Why is the past tense of vomit generally spelled 'vomited' rather than 'vomitted'?
How offensive is Fachidiot?
Is velocity a valid measure of team and process improvement?
Run "cd" command as superuser in Linux
Precious Stone, as Clear as Diamond
Uncooked peppers and garlic in olive oil fizzled when opened
How can I determine if two vertices on a polygon are consecutive?
AniPop - The anime downloader
Estimator choice for mapping string independent variable to string categorical dependent variable
NLP software for classification of large datasetsMethods for automated synonym detectionEstimating dependent variable as sum of functions of independent variablesHow to apply machine learning to fuzzy matchingPython - How to intuit word from abbreviated text using NLP?What is the difference between “feature” and “independent variable”?What's the ideal way to include dictionaries (gazetteer) in spaCy to improve NER?Does linear regression work with a categorical independent variable & continuous dependent variable?relation in between a categorical dependent variable and combination of independent variablesHow to compare the continuous dependent variable with categorical independent variable?
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty
margin-bottom:0;
I'm attempting to build a predictive model that can map text-based vendor-provided descriptions of a service to around 800 standardized service codes, based on a training set of about 13,000 correctly mapped services.
Each standardized service code also has a standardized description, which is usually similar to the vendor-provided description (i.e, some of the words used are the same), but not identical. Descriptions are typically 3-10 word in length
My main issue is that I'm not sure what type of estimator will be appropriate for this problem.
I have tried using simple fuzzy matching approaches, including:
- Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches
- Trying to find the standardized service description with the minimum Levenshtein distance
These have not worked particularly well due to the use of synonymous but different word choices within the vendor-provided and standardized descriptions.
I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.
Which type of estimator can I use to solve this problem?
machine-learning nlp data-science
add a comment
|
I'm attempting to build a predictive model that can map text-based vendor-provided descriptions of a service to around 800 standardized service codes, based on a training set of about 13,000 correctly mapped services.
Each standardized service code also has a standardized description, which is usually similar to the vendor-provided description (i.e, some of the words used are the same), but not identical. Descriptions are typically 3-10 word in length
My main issue is that I'm not sure what type of estimator will be appropriate for this problem.
I have tried using simple fuzzy matching approaches, including:
- Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches
- Trying to find the standardized service description with the minimum Levenshtein distance
These have not worked particularly well due to the use of synonymous but different word choices within the vendor-provided and standardized descriptions.
I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.
Which type of estimator can I use to solve this problem?
machine-learning nlp data-science
You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms
– Ken4scholars
Mar 29 at 8:37
Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?
– J-Sharp
Mar 29 at 14:20
You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word
– Ken4scholars
Mar 29 at 16:50
Got it, thank you very much @Ken4scholars
– J-Sharp
Apr 1 at 18:07
add a comment
|
I'm attempting to build a predictive model that can map text-based vendor-provided descriptions of a service to around 800 standardized service codes, based on a training set of about 13,000 correctly mapped services.
Each standardized service code also has a standardized description, which is usually similar to the vendor-provided description (i.e, some of the words used are the same), but not identical. Descriptions are typically 3-10 word in length
My main issue is that I'm not sure what type of estimator will be appropriate for this problem.
I have tried using simple fuzzy matching approaches, including:
- Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches
- Trying to find the standardized service description with the minimum Levenshtein distance
These have not worked particularly well due to the use of synonymous but different word choices within the vendor-provided and standardized descriptions.
I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.
Which type of estimator can I use to solve this problem?
machine-learning nlp data-science
I'm attempting to build a predictive model that can map text-based vendor-provided descriptions of a service to around 800 standardized service codes, based on a training set of about 13,000 correctly mapped services.
Each standardized service code also has a standardized description, which is usually similar to the vendor-provided description (i.e, some of the words used are the same), but not identical. Descriptions are typically 3-10 word in length
My main issue is that I'm not sure what type of estimator will be appropriate for this problem.
I have tried using simple fuzzy matching approaches, including:
- Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches
- Trying to find the standardized service description with the minimum Levenshtein distance
These have not worked particularly well due to the use of synonymous but different word choices within the vendor-provided and standardized descriptions.
I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.
Which type of estimator can I use to solve this problem?
machine-learning nlp data-science
machine-learning nlp data-science
asked Mar 28 at 21:35
J-SharpJ-Sharp
1
1
You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms
– Ken4scholars
Mar 29 at 8:37
Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?
– J-Sharp
Mar 29 at 14:20
You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word
– Ken4scholars
Mar 29 at 16:50
Got it, thank you very much @Ken4scholars
– J-Sharp
Apr 1 at 18:07
add a comment
|
You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms
– Ken4scholars
Mar 29 at 8:37
Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?
– J-Sharp
Mar 29 at 14:20
You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word
– Ken4scholars
Mar 29 at 16:50
Got it, thank you very much @Ken4scholars
– J-Sharp
Apr 1 at 18:07
You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms
– Ken4scholars
Mar 29 at 8:37
You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms
– Ken4scholars
Mar 29 at 8:37
Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?
– J-Sharp
Mar 29 at 14:20
Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?
– J-Sharp
Mar 29 at 14:20
You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word
– Ken4scholars
Mar 29 at 16:50
You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word
– Ken4scholars
Mar 29 at 16:50
Got it, thank you very much @Ken4scholars
– J-Sharp
Apr 1 at 18:07
Got it, thank you very much @Ken4scholars
– J-Sharp
Apr 1 at 18:07
add a comment
|
0
active
oldest
votes
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55407207%2festimator-choice-for-mapping-string-independent-variable-to-string-categorical-d%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55407207%2festimator-choice-for-mapping-string-independent-variable-to-string-categorical-d%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
You can first use a summarization technique to extract the most useful words from the description and then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes choosing the service code which is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms
– Ken4scholars
Mar 29 at 8:37
Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?
– J-Sharp
Mar 29 at 14:20
You can do an extractive summarization using TF-IDF to calculate the relevance of each word in the description and rank them, then select the highest ranked word
– Ken4scholars
Mar 29 at 16:50
Got it, thank you very much @Ken4scholars
– J-Sharp
Apr 1 at 18:07