Pre-filter solution for AWS Glue / Redshift (before loading into S3)
I am working on AWS Glue scheduled transformation scripts that go from Redshift tables (or views) to Redshift tables.
Some of my tables are huge, and I do not want to load the full table into S3 files; for some of them, I would also like an incremental solution.
So I am looking for a pre-filter solution for AWS Glue / Redshift, applied before the data is loaded into S3.
I know that if you work directly on an S3 bucket you can use predicate pushdown to avoid loading the data you don't care about, but it seems that this is not supported on Redshift tables.
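(For reference, this is the kind of pushdown I mean. A minimal sketch for an S3-backed catalog table; the database, table, and partition names are made up:)

# Predicate pushdown on an S3-backed catalog table: Glue prunes partitions
# before reading, so only data matching the predicate is ever loaded.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db",
    table_name="my_s3_table",
    push_down_predicate="year = '2019' AND month = '03'")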
Do you know of a similar solution for Redshift, or a workaround?
Thank you.
amazon-redshift aws-glue
asked Mar 22 at 9:02
QTH
1 Answer
I'm new to AWS Glue; the solution I've found for now is to go through glueContext.read with a pushed-down subquery:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Wrap the filtering query in a derived table: Spark's JDBC reader accepts a
# subquery wherever a table name is expected, and Redshift then executes the
# WHERE clause so only the matching rows are transferred.
qry_for_df = """
(SELECT *
 FROM table1
 WHERE field1 = 'something') AS subq
"""

df = (glueContext.read
      .format("jdbc")
      .option("driver", "com.amazon.redshift.jdbc.Driver")
      .option("url", "jdbc:redshift://…")  # cluster endpoint elided
      .option("dbtable", qry_for_df)
      .option("user", "username")
      .option("password", "XXXXXXXX")
      .load())
Some considerations:
I haven't found a way to use AWS Glue's predefined JDBC connection for this, so I'm adding the driver JAR path s3://your-bucket/path/RedshiftJDBC42XXX.jar to the job as a dependent JARs path.
I haven't included a way to store the password and URL, but here is a link to Amazon's best practices, as a suggestion:
https://aws.amazon.com/blogs/big-data/use-aws-glue-to-run-etl-jobs-against-non-native-jdbc-data-sources/
You can take the parameters from your catalog connection metadata, but the database name is missing and the connection name has to be hard-coded or passed via job parameters (a usage sketch follows the snippet below):
cnx_parameters = glueContext.extract_jdbc_conf("Your_Cnx_Name", catalog_id=None)
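A hypothetical sketch of how the extracted connection parameters could feed the JDBC read above (key names follow the documented extract_jdbc_conf output; "Your_Cnx_Name" and "mydb" are placeholders, and the database name is appended manually because it is missing from the conf):

# Reuse the catalog connection's credentials instead of hard-coding them.
cnx = glueContext.extract_jdbc_conf("Your_Cnx_Name", catalog_id=None)
df = (glueContext.read
      .format("jdbc")
      .option("driver", "com.amazon.redshift.jdbc.Driver")
      .option("url", cnx["url"] + "/mydb")  # append the database name yourself
      .option("dbtable", qry_for_df)
      .option("user", cnx["user"])
      .option("password", cnx["password"])
      .load())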
answered Mar 24 at 11:52
fernolimits

Hello, and thank you for your help. It deserves a try; to be honest, I have not found another solution.
– QTH
Mar 25 at 14:21
It would be nice if you shared the results. Did it work?
– fernolimits
Mar 27 at 17:32
I did not succeed with that, so I went with another solution: I created a view. It is a little bit ugly, but it works well with no effort :-/
– QTH
Mar 28 at 9:34
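(For reference, a hypothetical sketch of the view workaround mentioned in the last comment; all names are placeholders. The filter lives in a Redshift view, and the Glue job reads the view instead of the base table, so Redshift applies the filter before anything is staged to S3:)

# In Redshift, define the filtered view once:
#   CREATE VIEW filtered_table1 AS
#   SELECT * FROM table1 WHERE field1 = 'something';
# Then point the Glue job at the view through the Data Catalog (assuming the
# view has been crawled or registered like a table).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db",
    table_name="filtered_table1",
    redshift_tmp_dir=args["TempDir"])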