How to get Vocabulary with weights for tf-idf word bags in ml.net?

The documentation of ML.NET shows how to use context.Transforms.Text.ProduceWordBags to get word bags. The method takes Transforms.Text.NgramExtractingEstimator.WeightingCriteria as one of the parameters, so it's possible to request TfIdf weights to be used. The simplest example would be:



// Get a small dataset as an IEnumerable and then read it as an ML.NET data set.
var ml = new MLContext();
IEnumerable<SamplesUtils.DatasetUtils.SampleTopicsData> data = SamplesUtils.DatasetUtils.GetTopicsData();
var trainData = ml.Data.LoadFromEnumerable(data);

// Build unigram word bags over the "Review" text column, requesting TF-IDF weights.
var pipeline = ml.Transforms.Text.ProduceWordBags("bags", "Review", ngramLength: 1, weighting: Transforms.Text.NgramExtractingEstimator.WeightingCriteria.TfIdf);

var transformer = pipeline.Fit(trainData);
var transformed_data = transformer.Transform(trainData);


That's all fine, but how do I get the actual results out of transformed_data?



I did some digging in a debugger, but I'm still quite confused about what's actually happening here.



First of all, running the pipeline adds three extra columns to transformed_data:



[screenshot]



After getting a preview of the data, I can see what's in these columns. To make things clearer, here's what GetTopicsData returns, which is what we're running our transform on:



animals birds cats dogs fish horse
horse birds house fish duck cats
car truck driver bus pickup
car truck driver bus pickup horse


That's exactly what I'm seeing in the very first bags column, typed as Vector<string>:



[screenshot]



Moving on to the second bags column, typed as Vector<Key<UInt32, 0-12>> (no idea what 0-12 is here btw.).



This one has a KeyValues annotation on it, and it looks like for each row it maps the words to indexes in a global Vocabulary array.



[screenshot]



The Vocabulary array is part of Annotations:



[screenshot]



So that's promising. You'd think the last bags column, typed as Vector<Single, 13>, would have the weights for each of the words! Unfortunately, that's not what I'm seeing. First of all, the same Vocabulary array is present in Annotations:



[screenshot]



And the values in rows are 1/0, which is not what TfIdf should return:



[screenshot]



So to me that looks more like "is word i from the Vocabulary present in the current row" and not its TfIdf weight, which is what I'm trying to get.
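To make concrete what I'd expect instead, here is a minimal sketch that computes TF-IDF by hand for the four rows above. It uses the textbook formula (word count in the row times ln of total rows over rows containing the word), which is an assumption: ML.NET may use a different variant. `TfIdfByHand` and `Compute` are hypothetical names, and nothing here calls ML.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdfByHand
{
    // Textbook TF-IDF (an assumption, not necessarily ML.NET's exact formula):
    // weight = (count of the word in the row) * ln(rows / rows containing the word).
    public static Dictionary<string, double>[] Compute(string[] rows)
    {
        string[][] tokens = rows
            .Select(r => r.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            .ToArray();

        // Document frequency: in how many rows each word appears at least once.
        Dictionary<string, int> docFreq = tokens
            .SelectMany(t => t.Distinct())
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());

        return tokens
            .Select(t => t
                .GroupBy(w => w)
                .ToDictionary(
                    g => g.Key,
                    g => g.Count() * Math.Log((double)rows.Length / docFreq[g.Key])))
            .ToArray();
    }

    public static void Main()
    {
        var rows = new[]
        {
            "animals birds cats dogs fish horse",
            "horse birds house fish duck cats",
            "car truck driver bus pickup",
            "car truck driver bus pickup horse",
        };

        Dictionary<string, double>[] weights = Compute(rows);
        for (int i = 0; i < weights.Length; i++)
        {
            string line = string.Join(", ",
                weights[i].Select(kv => $"{kv.Key}={kv.Value:F3}"));
            Console.WriteLine($"row {i}: {line}");
        }
    }
}
```

The four rows contain 13 distinct words, which is consistent with the Vector<Single, 13> type. Under this formula "horse" (present in three of the four rows) gets ln(4/3) ≈ 0.288, words present in two rows get ln(2) ≈ 0.693, and words present in a single row, such as "animals", get ln(4) ≈ 1.386, so no row would come out as plain 1/0.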










  • OK. I think I got it. There is a bug in ML.NET that basically ignores the weighting parameter in this scenario and always uses Tf.

    – MarcinJuraszek
    Mar 28 at 22:32
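For anyone who wants to check this themselves, here is a minimal sketch of dumping the final bags column next to its slot names. It assumes ML.NET 1.x (the Microsoft.ML NuGet package); the method names may differ in the preview version used in the question. `Doc` and `WordBagDump` are hypothetical names, and the rows are typed in by hand rather than loaded through SamplesUtils.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

// Hypothetical input type with a single text column.
public class Doc
{
    public string Review { get; set; }
}

public static class WordBagDump
{
    public static void Main()
    {
        var ml = new MLContext();
        var docs = new[]
        {
            new Doc { Review = "animals birds cats dogs fish horse" },
            new Doc { Review = "horse birds house fish duck cats" },
            new Doc { Review = "car truck driver bus pickup" },
            new Doc { Review = "car truck driver bus pickup horse" },
        };
        IDataView trainData = ml.Data.LoadFromEnumerable(docs);

        var pipeline = ml.Transforms.Text.ProduceWordBags(
            "bags", "Review", ngramLength: 1,
            weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf);
        IDataView transformed = pipeline.Fit(trainData).Transform(trainData);

        // Slot names annotate each position of the output vector with its word.
        VBuffer<ReadOnlyMemory<char>> slotNames = default;
        transformed.Schema["bags"].GetSlotNames(ref slotNames);
        string[] names = slotNames.DenseValues().Select(n => n.ToString()).ToArray();

        // Print every non-zero weight in each row next to its word.
        foreach (VBuffer<float> row in transformed.GetColumn<VBuffer<float>>("bags"))
        {
            float[] values = row.DenseValues().ToArray();
            var pairs = names.Zip(values, (n, v) => new { n, v }).Where(p => p.v != 0);
            Console.WriteLine(string.Join(", ", pairs.Select(p => $"{p.n}={p.v}")));
        }
    }
}
```

If the weighting parameter is honored, rows sharing a common word such as "horse" should show a smaller weight for it than for words unique to one row; if every printed value is 1, the output is effectively plain term frequency.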

















c# tf-idf ml.net






asked Mar 28 at 21:52









MarcinJuraszek
