How to compare millions of minhashed documents on elasticsearch?

I have a large number of documents stored in Elasticsearch, each with a minhashed field (based on content similarity). I would like to compare all of them with each other through the Elasticsearch API to find documents with similar hashes, but I can't use a fuzzy query because it allows a maximum edit distance of only 2, which is useless here.

I am also looking for a possible Node.js implementation if this cannot be done in Elasticsearch. My first approach was to retrieve the IDs and minhash values (hex strings) of every document in Elasticsearch, store them in an array, and sort them in lexicographical order. Then I would only have to compare each document with its k nearest neighbours in the sorted array, based on edit distance, so I would need only n*k comparisons instead of n*(n-1)/2. What do you think of this approach?
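The approach described above can be sketched in Node.js as follows. This is a minimal sketch, not a tested solution: the input shape (`{ id, hash }` objects), the function names `editDistance` and `sortedNeighbourPairs`, and the `maxDistance` threshold are assumptions made for illustration. It sorts the hex strings and compares each one only with the next k entries, so it performs at most n*k comparisons.

```javascript
// Levenshtein edit distance between two strings, using a rolling row.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// docs: hypothetical array of { id, hash } objects, hash being a hex string.
// Returns [idA, idB, distance] for neighbours within maxDistance.
function sortedNeighbourPairs(docs, k, maxDistance) {
  const sorted = [...docs].sort((x, y) =>
    x.hash < y.hash ? -1 : x.hash > y.hash ? 1 : 0
  );
  const pairs = [];
  for (let i = 0; i < sorted.length; i++) {
    // Compare only with the next k entries in sorted order.
    for (let j = i + 1; j <= i + k && j < sorted.length; j++) {
      const d = editDistance(sorted[i].hash, sorted[j].hash);
      if (d <= maxDistance) pairs.push([sorted[i].id, sorted[j].id, d]);
    }
  }
  return pairs;
}
```

One thing to keep in mind is that lexicographic order is dominated by the leading characters. In a hypothetical list with the hashes `00ff12`, `00ff13` and `f0ff12`, the first and last are one edit apart but sort to opposite ends, so a window of k = 1 misses that pair.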










  • This node.js module might help: github.com/duhaime/minhash

    – Val
    Mar 25 at 8:42

  • I am already using exactly this module ;). My problem now is how to compare the hashed documents efficiently. I stored the hashes as terms in Elasticsearch, but I don't know how to compare them because similar hash values are not grouped into shared "buckets"; I only have the plain minhash values. That's my dilemma. Elasticsearch's fuzzy search also only allows an edit distance of up to 2, which is useless in my case.

    – user2774480
    Mar 25 at 8:44

  • OK, but you can still compute the similarity between each pair by calling the jaccard() method for the KNN, right?

    – Val
    Mar 25 at 8:50

  • Yeah, sure, but since the minhash value is stored inside each document, comparing every pair would take about n^2 comparisons, which is too inefficient. I thought there was some kind of "trick" with LSH and a bucketing approach, but I'm not sure how to interpret it from the papers. I guess I didn't understand it that well.

    – user2774480
    Mar 25 at 8:52

  • Also see this answer: stackoverflow.com/a/41254259/4604579

    – Val
    Mar 25 at 8:53
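
The bucketing "trick" with LSH mentioned in the comments is usually called banding. Below is a minimal sketch, assuming each document's minhash signature is available as an array of integers of equal length rather than as a single hex string. The names `bandKeys`, `candidatePairs` and `estimatedJaccard` are hypothetical and not part of any library mentioned here, and document IDs are assumed not to contain `|`. The signature is cut into bands, each band becomes a bucket key, and only documents that share at least one bucket are compared.

```javascript
// Split a signature into fixed-size bands and turn each band into a bucket key.
// The band index is part of the key so that equal values in different
// positions do not collide.
function bandKeys(signature, bandSize) {
  const keys = [];
  for (let start = 0; start + bandSize <= signature.length; start += bandSize) {
    const band = start / bandSize;
    keys.push(`${band}:${signature.slice(start, start + bandSize).join('.')}`);
  }
  return keys;
}

// docs: hypothetical array of { id, signature } objects.
// Returns unique candidate pairs as "idA|idB" strings.
function candidatePairs(docs, bandSize) {
  const buckets = new Map();
  for (const { id, signature } of docs) {
    for (const key of bandKeys(signature, bandSize)) {
      if (!buckets.has(key)) buckets.set(key, []);
      buckets.get(key).push(id);
    }
  }
  const pairs = new Set();
  for (const ids of buckets.values()) {
    // Only documents that landed in the same bucket are paired up.
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        pairs.add([ids[i], ids[j]].sort().join('|'));
      }
    }
  }
  return [...pairs];
}

// Fraction of signature positions that match; this estimates Jaccard similarity.
function estimatedJaccard(a, b) {
  let same = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) same++;
  }
  return same / a.length;
}
```

Smaller bands produce more candidates (fewer missed pairs, more comparisons), and larger bands produce fewer. Because the band keys are plain strings, they could also be stored as exact-match terms alongside each document, so that candidates are found by an exact lookup rather than a fuzzy one.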

















elasticsearch string-comparison fuzzy-search minhash






asked Mar 25 at 8:34

user2774480











