How to compare millions of minhashed documents on elasticsearch?
I have many documents stored in Elasticsearch, each with a minhashed field (based on content similarity). I would like to compare all of them with each other to find documents with similar hashes, ideally using the Elasticsearch API, but a fuzzy query does not help because it only allows a maximum edit distance of 2.
I am also looking for a possible Node.js implementation if this cannot be done in Elasticsearch. My first approach was to retrieve the IDs and minhash values (hex strings) of all documents from Elasticsearch, store them in an array, and sort them in lexicographical order. Then I would only have to compare each document with its k nearest neighbours in the sorted array by edit distance, which gives about n*k comparisons instead of n*(n-1)/2. What do you think of this approach?
elasticsearch string-comparison fuzzy-search minhash
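The sort-and-compare idea described in the question can be sketched roughly as follows. This is only an illustration under assumptions: the input shape `{ id, hash }`, the function names `editDistance` and `windowedPairs`, and the `maxDistance` threshold are hypothetical and not taken from any library. One caveat worth keeping in mind: lexicographic order only places hashes next to each other when they share a prefix, so two hashes that differ in an early character can end up far apart and be missed.

```javascript
// Levenshtein edit distance between two strings (two-row dynamic programming).
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Sort by hash, then compare each document only with the next k entries
// in sorted order: roughly n*k comparisons instead of n*(n-1)/2.
function windowedPairs(docs, k, maxDistance) {
  const sorted = [...docs].sort((x, y) =>
    x.hash < y.hash ? -1 : x.hash > y.hash ? 1 : 0
  );
  const pairs = [];
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j <= i + k && j < sorted.length; j++) {
      const d = editDistance(sorted[i].hash, sorted[j].hash);
      if (d <= maxDistance) pairs.push([sorted[i].id, sorted[j].id, d]);
    }
  }
  return pairs;
}

// Made-up example data: hex strings standing in for stored minhash values.
const docs = [
  { id: "a", hash: "0a1b2c3d" },
  { id: "b", hash: "ff00ff00" },
  { id: "c", hash: "0a1b2c3e" },
];
const pairs = windowedPairs(docs, 1, 2);
console.log(pairs);
```

With a window of k = 1, only adjacent entries in the sorted array are compared, so the cost grows with n*k plus the sort itself.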
This node.js module might help: github.com/duhaime/minhash
– Val
Mar 25 at 8:42
I was already using exactly this module ;). But now I have the problem of how to compare those hashed documents efficiently. I stored the hashes as terms in Elasticsearch, but I don't know how to compare them because similar hash values are not stored in similar "buckets"; I only have the plain minhash values. That's my dilemma. And Elasticsearch only allows fuzzy search up to an edit distance of 2, which is useless in my case.
– user2774480
Mar 25 at 8:44
OK, but you can still compute the similarity between each pair by calling the jaccard() method for the KNN, right?
– Val
Mar 25 at 8:50
Yeah, sure, but as the minhash value is stored inside each document, this would mean about n^2 comparisons when comparing each pair, which is too inefficient. I thought there was some kind of "trick" with LSH and some sort of bucketing approach, but I'm not sure how to interpret that from the papers. I guess I didn't understand it so well.
– user2774480
Mar 25 at 8:52
Also see this answer: stackoverflow.com/a/41254259/4604579
– Val
Mar 25 at 8:53
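The bucketing "trick" mentioned in the comments is usually the banding step of LSH, and a minimal sketch of it might look like this. It assumes each document's minhash signature is available as an array of integers (the question only mentions hex strings, so that conversion is left out), and all names here (`bandKeys`, `candidatePairs`, `estimateJaccard`) are hypothetical, not the API of the module linked above. The signature is split into bands, each band becomes an exact-match bucket key, and only documents that share at least one key are compared.

```javascript
// Split a signature into `bands` bands; each band becomes one bucket key.
// The band index is part of the key so that bands never collide with each other.
function bandKeys(signature, bands) {
  const rows = Math.floor(signature.length / bands);
  const keys = [];
  for (let b = 0; b < bands; b++) {
    keys.push(b + ":" + signature.slice(b * rows, (b + 1) * rows).join("-"));
  }
  return keys;
}

// Group document ids by bucket key, then emit every pair that shares a bucket.
function candidatePairs(docs, bands) {
  const buckets = new Map();
  for (const doc of docs) {
    for (const key of bandKeys(doc.signature, bands)) {
      if (!buckets.has(key)) buckets.set(key, []);
      buckets.get(key).push(doc.id);
    }
  }
  const found = new Set();
  for (const ids of buckets.values()) {
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        found.add([ids[i], ids[j]].sort().join("|"));
      }
    }
  }
  return [...found];
}

// Estimated Jaccard similarity: fraction of signature positions that agree.
function estimateJaccard(a, b) {
  let same = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) same++;
  return same / a.length;
}

// Made-up signatures of length 4, split into 2 bands of 2 rows each.
const signed = [
  { id: "a", signature: [1, 2, 3, 4] },
  { id: "b", signature: [1, 2, 9, 9] },
  { id: "c", signature: [7, 7, 7, 7] },
];
const candidates = candidatePairs(signed, 2);
console.log(candidates, estimateJaccard(signed[0].signature, signed[1].signature));
```

With b bands of r rows each, two documents with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^r)^b, so the choice of b and r sets the similarity threshold. Because the band keys are matched exactly, one option (an assumption, not something confirmed in this thread) would be to index them as terms in Elasticsearch and look up candidates with exact-match queries instead of fuzzy ones.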
asked Mar 25 at 8:34 by user2774480