Estimator choice for mapping string independent variable to string categorical dependent variable



















I'm attempting to build a predictive model that can map text-based vendor-provided descriptions of a service to around 800 standardized service codes, based on a training set of about 13,000 correctly mapped services.



Each standardized service code also has a standardized description, which is usually similar to the vendor-provided description (i.e., some of the words are the same) but not identical. Descriptions are typically 3-10 words long.



My main issue is that I'm not sure what type of estimator will be appropriate for this problem.



I have tried using simple fuzzy matching approaches, including:



  • Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches

  • Trying to find the standardized service description with the minimum Levenshtein distance

These have not worked particularly well due to the use of synonymous but different word choices within the vendor-provided and standardized descriptions.
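For reference, the two baselines above can be sketched roughly as follows. This is only a minimal illustration, not the exact code I ran: the `STANDARD` dictionary, the service codes, and the example descriptions are made up, and the helper names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_overlap(a: str, b: str) -> int:
    """Number of distinct lowercase words shared by the two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))


# Hypothetical standardized descriptions keyed by made-up service codes.
STANDARD = {
    "S001": "vehicle oil change",
    "S002": "tire rotation and balancing",
    "S003": "brake pad replacement",
}


def best_by_overlap(vendor_text: str) -> str:
    """Code whose description shares the most words with the vendor text."""
    return max(STANDARD, key=lambda c: word_overlap(vendor_text, STANDARD[c]))


def best_by_levenshtein(vendor_text: str) -> str:
    """Code whose description has the smallest edit distance to the vendor text."""
    return min(STANDARD, key=lambda c: levenshtein(vendor_text.lower(), STANDARD[c]))
```

With these toy entries, a vendor description such as "automobile lubricant swap" shares no words with any standardized description, which is the synonym problem in miniature.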



I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.



Which type of estimator can I use to solve this problem?




































  • You can first use a summarization technique to extract the most useful words from the description, then use a word embedding technique like word2vec to calculate the similarity between the summary words and the service codes, choosing the service code that is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms.

    – Ken4scholars
    Mar 29 at 8:37
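A rough sketch of the embedding-similarity idea in the comment above, assuming each description is represented by the average of its word vectors and compared by cosine similarity. The 3-dimensional vectors in `VECTORS` are invented for illustration; in practice they would come from a trained model such as word2vec. The service codes and descriptions are also hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors, made up for illustration only.
# Real vectors would come from a trained embedding model such as word2vec.
VECTORS = {
    "oil": [0.9, 0.1, 0.0],
    "lubricant": [0.8, 0.2, 0.0],
    "change": [0.1, 0.9, 0.0],
    "swap": [0.2, 0.8, 0.1],
    "tire": [0.0, 0.1, 0.9],
    "rotation": [0.1, 0.2, 0.8],
}

# Hypothetical standardized descriptions keyed by made-up service codes.
STANDARD = {"S001": "oil change", "S002": "tire rotation"}


def mean_vector(text):
    """Average the vectors of the known words in text; None if none are known."""
    words = [w for w in text.lower().split() if w in VECTORS]
    if not words:
        return None
    dim = len(next(iter(VECTORS.values())))
    return [sum(VECTORS[w][k] for w in words) / len(words) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def most_similar_code(vendor_text):
    """Service code whose description vector is closest to the vendor text."""
    query = mean_vector(vendor_text)
    if query is None:
        return None
    scores = {}
    for code, description in STANDARD.items():
        vec = mean_vector(description)
        if vec is not None:
            scores[code] = cosine(query, vec)
    return max(scores, key=scores.get) if scores else None
```

Because "lubricant" and "swap" sit close to "oil" and "change" in this toy vector space, "lubricant swap" maps to the oil-change code even though the two descriptions share no words.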











  • Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?

    – J-Sharp
    Mar 29 at 14:20











  • You can do extractive summarization using TF-IDF to calculate the relevance of each word in the description, rank the words, and then select the highest-ranked word.

    – Ken4scholars
    Mar 29 at 16:50
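The TF-IDF ranking described above might look something like this in plain Python. It is a sketch under simple assumptions: whitespace tokenization, term frequency normalized by description length, and an unsmoothed `log(N / df)` inverse document frequency. The function name `tfidf_rank` and the example descriptions are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(docs):
    """For each document, return its words sorted from most to least relevant.

    Relevance is tf * idf, where tf is the word's share of the document and
    idf is log(N / number of documents containing the word).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    ranked = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked.append(sorted(scores, key=lambda w: (-scores[w], w)))
    return ranked


# Hypothetical vendor descriptions.
descriptions = ["annual service fee", "monthly service fee", "annual inspection"]
ranking = tfidf_rank(descriptions)
top_words = [words[0] for words in ranking]
```

Words that appear in many descriptions (here "service" and "fee") score low, so the top-ranked word per description tends to be the one that distinguishes it from the rest.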











  • Got it, thank you very much @Ken4scholars

    – J-Sharp
    Apr 1 at 18:07

















machine-learning nlp data-science
















asked Mar 28 at 21:35









J-Sharp



































