How Does the Hashing Trick in Machine Learning Work?


I have a large categorical dataset and a feedforward ANN that I am using for classification. I programmed the machine learning model in Excel VBA (the only programming language I currently have access to).



I have 150 categories in my dataset that I need to process. I have tried binary encoding and one-hot encoding; however, because of the number of categories, the resulting vectors are often too large for VBA to handle and I end up with a memory error.



I’d like to give the hashing trick a go and see if it works any better. However, I don't understand how to do this in Excel.



I have reviewed the following links to try and understand it:



https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing



https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f



https://en.wikipedia.org/wiki/Vowpal_Wabbit



I still don’t completely understand it. Here is what I have done so far. I used the code example from the following question to create a hash sequence for my categorical data:
Generate short hash string based using VBA



Using the code above, I have been able to produce collision-free numerical hash sequences. However, what do I do now? Does the hash sequence need to be converted to a binary vector? This is where I get lost.



I have provided a small example of my data so far below. Would somebody be able to show me, step by step, how the hashing trick works (preferably in Excel)?



CATEGORY    HASH SEQUENCE
STEEL       37152
PLASTIC     31081
ALUMINUM    2310
BRONZE      9364









      excel machine-learning hash hashcode






asked Mar 28 at 16:24 by junfanbl

          1 Answer
The hashing trick keeps out-of-vocabulary ("fake") words from taking up extra memory. In a regular bag-of-words (BoW) model, you have one dimension per word in the vocabulary. This means that a misspelled word and its correct spelling each take up a separate dimension, if the misspelled word is in the model at all. If the misspelled word is not in the model, then (depending on your model) you might ignore it completely. This adds up over time. "Misspelled word" here is just an example of any word that is not in the vocabulary used to create the vectors you trained your model with. As a result, any model trained this way cannot adapt to new vocabulary without being trained all over again.
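
To make the fixed-vocabulary limitation concrete, here is a minimal Python sketch (not VBA, but the logic should port directly). The category names are taken from the question's sample data; `COPPER` and the function name `one_hot` are made up for illustration.

```python
# Fixed vocabulary learned at training time: one dimension per known category.
vocabulary = ["STEEL", "PLASTIC", "ALUMINUM", "BRONZE"]
index_of = {word: i for i, word in enumerate(vocabulary)}


def one_hot(category):
    """Return a one-hot vector; categories outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    if category in index_of:
        vector[index_of[category]] = 1
    return vector


steel_vec = one_hot("STEEL")    # [1, 0, 0, 0]
copper_vec = one_hot("COPPER")  # [0, 0, 0, 0], COPPER (hypothetical) is out of vocabulary
```

The vector length is tied to the vocabulary, so supporting `COPPER` would mean adding a dimension and retraining.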



The hashing method lets you incorporate out-of-vocabulary words, at the cost of some potential accuracy loss. It also lets you bound your memory use. The method starts by defining a hash function that takes some input (typically a word) and maps it to an output value within a predetermined range. You might choose a hash function that outputs one of 2^16 values (0 to 2^16 - 1), say. Then you know your output vectors will always be capped at size 2^16 (an arbitrary value, really), so you can prevent memory issues. Further, hash functions have "collisions": hash(a) might equal hash(b) for two different inputs. With an appropriately large output range this is very rare, but it is possible. This means that you lose some accuracy. But since the hash function can in principle take any input string, it can map out-of-vocabulary words to a new vector of the same size as the original vectors used to train the model. Because your new data vector is the same size as those used to train the model previously, you can use it to refine your model instead of being forced to train a new one.
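
As a rough sketch of that idea in Python (again, the same steps should carry over to VBA): hash the string, take the result modulo the vector length, and increment that position. The vector length of 16, the use of `zlib.crc32` as the hash, and the names `bucket` and `hashed_vector` are all assumptions chosen for illustration; `COPPER` is a hypothetical unseen category.

```python
import zlib

N_BUCKETS = 16  # assumed vector length for illustration; the text above uses 2**16


def bucket(category, n_buckets=N_BUCKETS):
    """Map any string to an index in the range 0 .. n_buckets - 1."""
    # crc32 is only a stand-in; any deterministic hash function would do.
    return zlib.crc32(category.encode("utf-8")) % n_buckets


def hashed_vector(categories, n_buckets=N_BUCKETS):
    """Build a fixed-size count vector from a list of category strings."""
    vector = [0] * n_buckets
    for category in categories:
        # Two categories that collide share a position, so their counts add up.
        vector[bucket(category, n_buckets)] += 1
    return vector


known = hashed_vector(["STEEL"])
unseen = hashed_vector(["COPPER"])  # never seen in training, still the same length
```

If you already have integer hashes like the ones in the question's table, taking each value modulo the vector length should give you the index in the same way, and the resulting vector is what you feed to the network.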






answered Mar 28 at 16:43 by Evan Mata
































