How to get Vocabulary with weights for tf-idf word bags in ml.net?
The documentation of ML.NET shows how to use context.Transforms.Text.ProduceWordBags to get word bags. The method takes Transforms.Text.NgramExtractingEstimator.WeightingCriteria as one of the parameters, so it's possible to request TfIdf weights to be used. The simplest example would be:
var ml = new MLContext();
// Get a small dataset as an IEnumerable and then read it as an ML.NET data set.
IEnumerable<SamplesUtils.DatasetUtils.SampleTopicsData> data = SamplesUtils.DatasetUtils.GetTopicsData();
var trainData = ml.Data.LoadFromEnumerable(data);
// "review" is the input column name; it must match a column in trainData.
var pipeline = ml.Transforms.Text.ProduceWordBags("bags", "review", ngramLength: 1, weighting: Transforms.Text.NgramExtractingEstimator.WeightingCriteria.TfIdf);
// Fit builds the vocabulary; Transform adds the output columns.
var transformer = pipeline.Fit(trainData);
var transformed_data = transformer.Transform(trainData);
That's all fine, but how do I get the actual results out of transformed_data?
I did some digging in a debugger, but I'm still quite confused about what's actually happening here.
First of all, running the pipeline adds three extra columns to transformed_data, all named bags.
After getting a preview of the data I can see what's in these columns. To make things clearer, here's what GetTopicsData returns, which is what we're running our transform on:
animals birds cats dogs fish horse
horse birds house fish duck cats
car truck driver bus pickup
car truck driver bus pickup horse
That's exactly what I'm seeing in the very first bags column, typed as Vector<string>.
Moving on to the second bags column, typed as Vector<Key<UInt32, 0-12>> (no idea what 0-12 is here, by the way).
This one has a KeyValues annotation on it, and it looks like for each row it maps the words to indexes in a global Vocabulary array.
The Vocabulary array is part of its Annotations.
So that's promising. You'd think the last bags column, typed as Vector<Single, 13>, would have the weights for each of the words! Unfortunately, that's not what I'm seeing. First of all, the same Vocabulary array is present in its Annotations.
And the values in the rows are all 1 or 0, which is not what TfIdf should return.
So to me that looks more like "is word i from the Vocabulary present in the current row" and not the TfIdf weight of that word, which is what I'm trying to get.
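For comparison, here is a rough worked example of what TF-IDF weights could look like for the first sample row. It assumes the textbook definition (raw term count times the natural log of N over document frequency); ML.NET's exact formula, log base, and any smoothing are not confirmed here and may differ. Since no word repeats within a row, plain term counts would be exactly 1 or 0, which matches the values observed above, whereas the weights below would not.

```latex
% Assumption: textbook TF-IDF; ML.NET's actual formula may differ.
% N = 4 rows in the sample data; df(t) = number of rows containing t.
\[
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \ln\frac{N}{\operatorname{df}(t)},
\qquad N = 4
\]
% Row 1: "animals birds cats dogs fish horse" (every tf = 1)
\begin{align*}
\text{animals, dogs}     &: \operatorname{df} = 1, & 1 \cdot \ln\tfrac{4}{1} &\approx 1.386 \\
\text{birds, cats, fish} &: \operatorname{df} = 2, & 1 \cdot \ln\tfrac{4}{2} &\approx 0.693 \\
\text{horse}             &: \operatorname{df} = 3, & 1 \cdot \ln\tfrac{4}{3} &\approx 0.288
\end{align*}
% All other vocabulary words have tf = 0 in row 1, so their weight is 0.
```

Under this assumed definition, "horse" would get the lowest nonzero weight in row 1 because it appears in three of the four rows.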
c# tf-idf ml.net
asked Mar 28 at 21:52 by MarcinJuraszek
OK. I think I got it. There is a bug in ML.NET which basically ignores the weighting parameter in this scenario and always uses Tf.
– MarcinJuraszek, Mar 28 at 22:32