Pre-filter solution for AWS Glue / Redshift (before loading into S3)



I am working on scheduled AWS Glue transformation scripts that read from Redshift tables (or views) and write to Redshift tables.

Some of my tables are huge, and I do not want to load the full table into S3 files; for some of them, I would also like an incremental solution. So I am looking for a pre-filter solution for AWS Glue / Redshift (one that applies before the data is loaded into S3).



I know that if you work directly on an S3 bucket you can use predicate push-down to avoid loading the data you don't care about, but it seems this is not supported for Redshift tables.
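For reference, predicate push-down on a partitioned, S3-backed Data Catalog table looks like this (a minimal sketch; the database, table, and partition column names are hypothetical):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# push_down_predicate prunes S3 partitions before any data is read.
# It only applies to partitioned tables stored on S3, not to JDBC
# sources such as Redshift. Names below are hypothetical examples.
frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_huge_table",
    push_down_predicate="year = '2019' and month = '03'",
)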



Do you know of a similar solution for Redshift, or a workaround?



Thank you










amazon-redshift aws-glue






asked Mar 22 at 9:02 by QTH
1 Answer
I'm new to AWS Glue; the solution I've found so far uses glueContext.read:



import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Wrap the filtered query in a subquery alias so it can be passed as the
# JDBC "dbtable" option; the WHERE clause then runs inside Redshift.
qry_for_df = """
(SELECT *
 FROM table1
 WHERE field1 = 'something') as subq
"""

df = (glueContext.read.format("jdbc")
      .option("driver", "com.amazon.redshift.jdbc.Driver")
      .option("url", "jdbc:redshift://…")
      .option("dbtable", qry_for_df)
      .option("user", "username")
      .option("password", "XXXXXXXX")
      .load())


Some considerations:
I haven't found a way to use AWS Glue's predefined JDBC connection for this, so I add the JAR library path s3://your-bucket/path/RedshiftJDBC42XXX.jar to the job.
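For reference, a dependent JAR can be attached through the job's --extra-jars special parameter; a minimal sketch with boto3, where the job name, IAM role, script location, and bucket paths are hypothetical placeholders:

import boto3

glue = boto3.client("glue")

# Attach the Redshift JDBC driver JAR via the --extra-jars argument.
# All names and S3 paths below are hypothetical.
glue.create_job(
    Name="redshift-prefilter-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/prefilter_job.py",
    },
    DefaultArguments={
        "--extra-jars": "s3://your-bucket/path/RedshiftJDBC42XXX.jar",
        "--TempDir": "s3://your-bucket/temp/",
    },
)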



I haven't included a way to store the password and URL, but here is a link to Amazon's best practices as a suggestion:
https://aws.amazon.com/blogs/big-data/use-aws-glue-to-run-etl-jobs-against-non-native-jdbc-data-sources/



You can take the connection parameters from your Glue metadata, but the database name is missing and the connection name has to be hard-coded or passed as a job parameter:
cnx_parameters = glueContext.extract_jdbc_conf("Your_Cnx_Name", catalog_id=None)
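For instance, the extracted configuration could feed the JDBC reader like this (a sketch, assuming extract_jdbc_conf returns url, user, and password keys for your connection; the connection name, database name, and query are placeholders):

# Hypothetical sketch: reuse the credentials of a Glue connection
# named "Your_Cnx_Name" for the JDBC read. The returned URL lacks the
# database name, so it is appended manually ("mydb" is a placeholder).
cnx = glueContext.extract_jdbc_conf("Your_Cnx_Name", catalog_id=None)

df = (glueContext.read.format("jdbc")
      .option("driver", "com.amazon.redshift.jdbc.Driver")
      .option("url", cnx["url"] + "/mydb")
      .option("dbtable", qry_for_df)
      .option("user", cnx["user"])
      .option("password", cnx["password"])
      .load())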






answered Mar 24 at 11:52 by fernolimits
• Hello, and thank you for your help. It deserves a try and, to be honest, I have not found another solution.

  – QTH
  Mar 25 at 14:21











• It would be nice if you shared the results. Did it work?

  – fernolimits
  Mar 27 at 17:32











• I did not succeed with that, so I went with another solution: I created a view. It is a bit ugly, but it works well with no effort. :-/

  – QTH
  Mar 28 at 9:34
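For reference, such a pre-filtering view might look like this in Redshift (a hypothetical sketch; the table and column names mirror the answer's placeholders):

CREATE VIEW filtered_table1 AS
SELECT *
FROM table1
WHERE field1 = 'something';

The Glue job can then read filtered_table1 instead of the full table, so only the filtered rows are unloaded to S3.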











