What should be my hive partitioning strategy and view strategy so that query can efficiently run and return results within 10 seconds


My use case involves two data sources:
1. Source1 (as the speed layer)
2. A Hive external table on top of S3 (as the batch layer)



I am using Presto to query both data sources through a view.
I want to create a view that unions the data from both sources, like: "create view test as select * from Source1.table union all select * from hive.table"
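
A minimal sketch of such a view in Presto SQL, spelled out with an explicit column list. The catalog, schema, and table names (`source1.default.logs_speed`, `hive.default.logs_batch`) are placeholders, not names from this setup. Listing the columns instead of using `select *` keeps the two branches of the `UNION ALL` aligned even if the Hive table carries extra columns, such as partition columns.

```sql
-- Presto sketch; all catalog, schema, and table names are hypothetical.
CREATE VIEW hive.default.test AS
SELECT "timestamp", logtype, company, category
FROM source1.default.logs_speed      -- speed layer: last 24 hours
UNION ALL
SELECT "timestamp", logtype, company, category
FROM hive.default.logs_batch;        -- batch layer: Hive external table on S3
```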



We keep 24 hours of data in Source1; after 24 hours the data is migrated to S3 via Hive.



The columns of the Source1 tables are: timestamp, logtype, company, category



Users will query the data by timestamp range (the last 15/30 minutes, last x hours, last x days, last x months, etc.).
Examples: "select * from test where timestamp > (now() - interval '15' minute)", "select * from test where timestamp > (now() - interval '12' hour)", "select * from test where timestamp > (now() - interval '1' day)"



To satisfy these queries I need to partition the Hive table. The user should also not need to be aware of the underlying strategy, i.e. a user querying the last x minutes of data should not have to care whether Presto reads it from Source1 or from Hive.



What should my Hive partitioning strategy and view strategy be so that queries run efficiently and return results within 10 seconds?










      hive bigdata data-warehouse presto data-partitioning






      edited Apr 2 at 14:32







      unknown_k

















      asked Mar 27 at 16:06









      unknown_k

























1 Answer

For Hive, partition on a column that will be used in query filters.

In your case that is timestamp. However, partitioning directly on timestamp would create a partition for every second (or millisecond), depending on the data in the column.

A better solution is to derive columns such as year, month, day, and hour from the timestamp and use these as the partition columns.
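
As a rough HiveQL sketch of that layout: the table name `logs_batch`, the ORC storage format, and the S3 path are assumptions made for illustration, and only the four data columns and the four partition columns come from this thread. Identifiers are backtick-quoted because some of them are also SQL keywords.

```sql
-- HiveQL sketch; table name, storage format, and location are hypothetical.
CREATE EXTERNAL TABLE logs_batch (
  `timestamp` TIMESTAMP,
  `logtype`   STRING,
  `company`   STRING,
  `category`  STRING
)
PARTITIONED BY (`year` INT, `month` INT, `day` INT, `hour` INT)
STORED AS ORC
LOCATION 's3://example-bucket/logs_batch/';

-- Register one hourly partition after its files have been written.
ALTER TABLE logs_batch ADD IF NOT EXISTS
  PARTITION (`year` = 2019, `month` = 3, `day` = 27, `hour` = 16);
```

With hourly partitions, one year of data is roughly 8,760 partitions, compared with one partition per second if the raw timestamp were used.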



The same strategy will work for Kudu. Be advised, however, that it could create hot-spotting: all newly arriving records go to the same (most recent) partition, which limits insert (and possibly query) performance.

To overcome this, use one additional column as a hash partition along with the timestamp-derived columns,

e.g. year, month, day, hour, logtype
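
To illustrate the first point for the Hive side, here is a sketch of a Presto query whose filter names the partition columns suggested above. The table name `hive.default.logs_batch` and the literal dates are hypothetical. The predicates on year, month, and day are what allow partitions to be skipped; the timestamp predicate then trims rows inside the partitions that are read.

```sql
-- Presto sketch; the table name and the literal values are hypothetical.
SELECT "timestamp", logtype, company, category
FROM hive.default.logs_batch
WHERE "year" = 2019 AND "month" = 3 AND "day" = 27      -- partition columns
  AND "timestamp" >= TIMESTAMP '2019-03-27 12:00:00';   -- exact bound within that day
```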






          answered Mar 29 at 16:22









          shanmuga















• Since you must use timestamp as the only filter criterion, you are limited to partitioning on the timestamp. But creating a partition for every timestamp (second) will create too many partitions in Hive and will be bad for performance.

  – shanmuga
  Apr 1 at 8:59
















