What should be my hive partitioning strategy and view strategy so that query can efficiently run and return results within 10 seconds
My use case: I have two data sources:
1. Source1 (as the speed layer)
2. A Hive external table on top of S3 (as the batch layer)
I am using Presto to query data from both sources through a view.
I want to create a view that unions the data from both sources, like: "create view test as select * from Source1.table union all select * from hive.table"
We keep 24 hours of data in Source1; after 24 hours, that data is migrated to S3 via Hive.
The columns of the Source1 tables are: timestamp, logtype, company, category
Users will query data by timestamp range (the last 15/30 minutes, the last x hours, the last x days, the last x months, etc.).
Examples: "select * from test where timestamp > (now() - interval '15' minute)", "select * from test where timestamp > (now() - interval '12' hour)", "select * from test where timestamp > (now() - interval '1' day)"
To satisfy these queries I need to partition the Hive table, and the user should not have to be aware of the underlying strategy, i.e. a user querying the last x minutes of data should not need to care whether Presto reads it from Source1 or from Hive.
What should my Hive partitioning strategy and view strategy be so that queries run efficiently and return results within 10 seconds?
hive bigdata data-warehouse presto data-partitioning
asked Mar 27 at 16:06 by unknown_k
edited Apr 2 at 14:32
1 Answer
For Hive, partition on a column that will be used in query filters.
In your case that is timestamp. However, partitioning directly on timestamp would create a partition for every second (or millisecond), depending on the data in the column.
A better solution is to derive columns like year, month, day, hour from timestamp and use these as the partition columns.
The same strategy will work for Kudu, but be advised that it can cause hot-spotting: all newly arriving records go to the same (most recent) partition, which limits insert (and possibly query) performance.
To overcome this, use one additional column as a hash partition alongside the timestamp-derived columns,
e.g. year, month, day, hour, logtype
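As a rough illustration of the derived-column idea, here is a minimal Python sketch (not part of the original answer). It derives year, month, day, hour from a timestamp and lists the hourly partitions that a timestamp-range query would need to touch. The function names, the sample window, and the use of naive datetimes without time zones are assumptions made for the example; the key=value path layout follows the usual Hive partition directory convention.

```python
from datetime import datetime, timedelta


def partition_values(ts: datetime) -> dict:
    """Derive the partition columns (year, month, day, hour) from a timestamp."""
    return {"year": ts.year, "month": ts.month, "day": ts.day, "hour": ts.hour}


def partitions_for_range(start: datetime, end: datetime) -> list:
    """List the hourly partitions that overlap the half-open range [start, end)."""
    if start >= end:
        return []
    # Truncate the start to the top of its hour so the first partition is included.
    current = start.replace(minute=0, second=0, microsecond=0)
    result = []
    while current < end:
        result.append(partition_values(current))
        current += timedelta(hours=1)
    return result


def partition_path(values: dict) -> str:
    """Render the values in the key=value directory layout used for Hive partitions."""
    return "/".join(f"{key}={values[key]}" for key in ("year", "month", "day", "hour"))


# Hypothetical query window: the 90 minutes ending at 2019-04-01 10:15.
end = datetime(2019, 4, 1, 10, 15)
start = end - timedelta(minutes=90)
paths = [partition_path(p) for p in partitions_for_range(start, end)]
print(paths)
```

The point of the sketch is that a range filter on timestamp only helps if it can be translated into predicates on the partition columns; here the 90-minute window maps to three hourly partitions instead of a scan over the whole table.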
answered Mar 29 at 16:22 by shanmuga
Since timestamp must be the only filter criterion, you are limited to partitioning on timestamp. But creating a partition for every distinct timestamp (every second) would create far too many partitions in Hive and would hurt performance.
– shanmuga
Apr 1 at 8:59
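To put rough numbers on "too many partitions", the following Python sketch (an illustration, not something the commenter posted) computes an upper bound on partitions created per year at different granularities. It assumes data arrives in every interval and a 365-day year; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def partitions_per_year(granularity_seconds: int, days: int = 365) -> int:
    """Upper bound on partitions created per year at the given granularity."""
    return days * SECONDS_PER_DAY // granularity_seconds


per_second = partitions_per_year(1)
per_hour = partitions_per_year(60 * 60)
per_day = partitions_per_year(SECONDS_PER_DAY)

print(f"per-second partitions/year: {per_second:,}")
print(f"per-hour partitions/year:   {per_hour:,}")
print(f"per-day partitions/year:    {per_day:,}")
```

Under these assumptions, per-second partitioning yields about 31.5 million partitions per year versus 8,760 at hourly granularity. Adding another partition column such as logtype multiplies these counts by the number of distinct values in that column.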