Estimator choice for mapping string independent variable to string categorical dependent variable


I'm attempting to build a predictive model that can map text-based vendor-provided descriptions of a service to around 800 standardized service codes, based on a training set of about 13,000 correctly mapped services.



Each standardized service code also has a standardized description, which is usually similar to the vendor-provided description (i.e., some of the words are the same) but not identical. Descriptions are typically 3 to 10 words long.



My main issue is that I'm not sure what type of estimator will be appropriate for this problem.



I have tried using simple fuzzy matching approaches, including:



  • Counting matching words/characters between the vendor-provided and standardized service descriptions and selecting the one with the most matches

  • Trying to find the standardized service description with the minimum Levenshtein distance

These have not worked particularly well, because the vendor-provided and standardized descriptions often use different but synonymous words.
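For reference, the two fuzzy matching baselines described above might look roughly like the following. This is only a minimal sketch: the service codes and descriptions in `STANDARD` are invented for illustration, and the function names are hypothetical rather than taken from any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def word_overlap(a: str, b: str) -> int:
    """Number of distinct lowercase words shared by two descriptions."""
    return len(set(a.lower().split()) & set(b.lower().split()))


# Hypothetical standardized descriptions keyed by made-up service codes.
STANDARD = {
    "S001": "vehicle oil change",
    "S002": "tire rotation and balancing",
}


def match_by_overlap(vendor_text: str) -> str:
    """Pick the code whose description shares the most words."""
    return max(STANDARD, key=lambda code: word_overlap(vendor_text, STANDARD[code]))


def match_by_levenshtein(vendor_text: str) -> str:
    """Pick the code whose description has the smallest edit distance."""
    return min(STANDARD, key=lambda code: levenshtein(vendor_text.lower(), STANDARD[code]))
```

With this toy data, a vendor description such as "automobile lubricant replacement" shares no words with "vehicle oil change", which is the synonym problem in miniature: neither score gives the matcher anything useful to go on.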



I have also considered using a decision tree, but it seems infeasible given 800+ possible outcomes.



Which type of estimator can I use to solve this problem?











  • You can first use a summarization technique to extract the most useful words from the description, then use a word embedding technique such as word2vec to calculate the similarity between the summary words and the service codes, choosing the service code that is most similar. Simple word distance techniques like Levenshtein distance won't work well here because of synonyms.

    – Ken4scholars
    Mar 29 at 8:37











  • Thanks for that @Ken4scholars. Are there any specific summarization techniques that you can recommend?

    – J-Sharp
    Mar 29 at 14:20











  • You can do extractive summarization using TF-IDF: calculate the relevance of each word in the description, rank the words, then select the highest-ranked word.

    – Ken4scholars
    Mar 29 at 16:50











  • Got it, thank you very much @Ken4scholars

    – J-Sharp
    Apr 1 at 18:07
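The approach suggested in the comments above (rank words by TF-IDF, then compare embeddings) could be sketched as follows. This is an illustration under several assumptions, not a tested solution: the 3-dimensional vectors in `EMBEDDINGS` are invented stand-ins for real word2vec vectors, the service codes and descriptions are made up, the IDF uses one common smoothed variant, and keeping the top `k` words (rather than a single word) is a choice made here.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_top_words(doc, corpus, k=2):
    """Rank the words of `doc` by TF-IDF against `corpus`; return the top k."""
    n_docs = len(corpus)
    doc_freq = Counter(w for d in corpus for w in set(tokenize(d)))
    counts = Counter(tokenize(doc))
    total = sum(counts.values())
    scores = {
        # Smoothed IDF, so words missing from the corpus do not divide by zero.
        w: (c / total) * math.log((1 + n_docs) / (1 + doc_freq[w]))
        for w, c in counts.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy "embeddings" invented for illustration. Real vectors would come from
# a trained model such as word2vec.
EMBEDDINGS = {
    "oil": (0.9, 0.1, 0.0),
    "lubricant": (0.8, 0.2, 0.0),
    "change": (0.1, 0.9, 0.0),
    "replacement": (0.2, 0.8, 0.1),
    "tire": (0.0, 0.1, 0.9),
    "rotation": (0.1, 0.2, 0.8),
}

# Hypothetical standardized descriptions keyed by made-up service codes.
STANDARD = {"S001": "oil change", "S002": "tire rotation"}


def mean_vector(words):
    """Average the embeddings of the known words; None if none are known."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return None
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_code(vendor_text, vendor_corpus, k=2):
    """Return the code whose description is closest to the vendor keywords."""
    query = mean_vector(tfidf_top_words(vendor_text, vendor_corpus, k))
    if query is None:
        return None

    def score(code):
        target = mean_vector(tokenize(STANDARD[code]))
        return cosine(query, target) if target is not None else -1.0

    return max(STANDARD, key=score)


corpus = [
    "lubricant replacement for car",
    "rotation of tire for car",
    "battery check for car",
]
print(best_code("lubricant replacement for car", corpus))
```

In this toy setup, "lubricant replacement" lands on the "oil change" code even though the two descriptions share no words, because the invented vectors place the synonyms close together. How well this works in practice would depend entirely on the quality of the real embeddings.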

















machine-learning nlp data-science






asked Mar 28 at 21:35 by J-Sharp






















