How to compare millions of minhashed documents on Elasticsearch?

I have a large number of documents stored in Elasticsearch, each with a minhash field computed from its content. I would like to compare all of them with each other through the Elasticsearch API to find documents with similar hashes, but a fuzzy query does not help because it only allows an edit distance of up to 2.

If this cannot be done in Elasticsearch, I am also open to a Node.js implementation. My first idea was to retrieve the IDs and minhash values (hex strings) of all documents from Elasticsearch, store them in an array, and sort them lexicographically. I would then compare each document only with its k nearest neighbours in the sorted array, based on edit distance, which gives about n*k comparisons instead of n*(n-1)/2. What do you think of this approach?
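
To make the idea above concrete, here is a minimal Node.js sketch of the sorted-neighbour comparison. The document shape (`{ id, hash }`), the function names, and the `maxDistance` threshold are assumptions made for illustration; they do not come from Elasticsearch or from any minhash library.

```javascript
// Levenshtein edit distance between two strings (two-row dynamic programming).
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// docs: array of { id, hash }, where hash is a hex string (hypothetical shape).
// Sorts by hash and compares each document only with the next k entries,
// so roughly n*k distances are computed instead of n*(n-1)/2.
function sortedNeighbourPairs(docs, k, maxDistance) {
  const sorted = [...docs].sort((x, y) =>
    x.hash < y.hash ? -1 : x.hash > y.hash ? 1 : 0
  );
  const pairs = [];
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j <= i + k && j < sorted.length; j++) {
      const distance = editDistance(sorted[i].hash, sorted[j].hash);
      if (distance <= maxDistance) {
        pairs.push({ a: sorted[i].id, b: sorted[j].id, distance });
      }
    }
  }
  return pairs;
}
```

Calling `sortedNeighbourPairs(docs, 10, 4)` would compare every document with its 10 successors in sorted order and keep the pairs within edit distance 4. One caveat with this sketch: lexicographic order only keeps hashes that share a prefix close together, so two hashes that differ in an early character can end up far apart and be missed.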










  • This node.js module might help: github.com/duhaime/minhash

    – Val
    Mar 25 at 8:42

  • I am already using exactly this module ;). My problem now is how to compare the hashed documents efficiently. I stored the hashes as terms in Elasticsearch, but I don't know how to compare them, because similar hash values are not stored in similar "buckets". I only have the plain minhash values. That's my dilemma. And Elasticsearch's fuzzy search only allows an edit distance of up to 2, which is useless in my case.

    – user2774480
    Mar 25 at 8:44

  • OK, but you can still compute the similarity between each pair by calling the jaccard() method for the KNN, right?

    – Val
    Mar 25 at 8:50

  • Yeah, sure, but since the minhash value is stored inside each document, comparing each pair would take about n^2 comparisons, which is too inefficient. I thought there was some kind of "trick" with LSH and a bucketing approach, but I am not sure how to interpret that from the papers. I guess I didn't understand it that well.

    – user2774480
    Mar 25 at 8:52

  • Also see this answer: stackoverflow.com/a/41254259/4604579

    – Val
    Mar 25 at 8:53
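
The bucketing "trick" mentioned in the comments above is usually called LSH banding, and a minimal Node.js sketch of it might look like the following. It assumes each minhash signature is available as an array of integers of equal length, keyed by document ID, and that IDs do not contain the `|` character; the function names and the `rowsPerBand` parameter are hypothetical and not taken from any library.

```javascript
// Split each signature into bands of `rowsPerBand` values. Documents whose
// values agree on an entire band end up in the same bucket.
function buildBuckets(signatures, rowsPerBand) {
  const buckets = new Map(); // bucket key -> array of document ids
  for (const [id, signature] of Object.entries(signatures)) {
    for (let start = 0; start + rowsPerBand <= signature.length; start += rowsPerBand) {
      const band = signature.slice(start, start + rowsPerBand);
      // Prefix with the band offset so equal values in different bands stay apart.
      const key = `${start}:${band.join(",")}`;
      if (!buckets.has(key)) buckets.set(key, []);
      buckets.get(key).push(id);
    }
  }
  return buckets;
}

// Candidate pairs are documents that share at least one bucket.
function candidatePairs(buckets) {
  const pairs = new Set();
  for (const ids of buckets.values()) {
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        pairs.add([ids[i], ids[j]].sort().join("|"));
      }
    }
  }
  return [...pairs];
}

// Estimated Jaccard similarity: fraction of positions where two signatures agree.
function estimateJaccard(a, b) {
  let equal = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) equal++;
  }
  return equal / a.length;
}
```

Only the candidate pairs then need the full similarity check, which avoids the n^2 comparison as long as the buckets stay small. Fewer rows per band produce more candidates (higher recall, more work), while more rows per band produce fewer, so the band size is a tuning choice rather than a fixed value.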

















elasticsearch string-comparison fuzzy-search minhash






asked Mar 25 at 8:34
user2774480











