Pre-filter solution for AWS Glue / Redshift (before loading into S3)



I am working on scheduled AWS Glue transformation scripts that read from Redshift tables (or views) and write to Redshift tables.

Some of my tables are huge, and I do not want to load the full table into S3 files; for some of them, I would also like an incremental solution. So I am looking for a pre-filter solution for AWS Glue / Redshift (one that applies before the data is loaded into S3).



I know that if you work directly on an S3 bucket you can use predicate push-down to avoid loading the data you don't care about, but it seems this is not supported for Redshift tables.
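For reference, predicate push-down on a partitioned, S3-backed Data Catalog table looks like this (a minimal sketch; the database, table, and partition column names are hypothetical):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# push_down_predicate prunes S3 partitions before any data is read.
# It only applies to partitioned tables stored on S3, not to JDBC
# sources such as Redshift. Names below are hypothetical examples.
frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_huge_table",
    push_down_predicate="year = '2019' and month = '03'",
)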



Do you know of a similar solution for Redshift, or a workaround?



Thank you










amazon-redshift aws-glue






asked Mar 22 at 9:02 by QTH
1 Answer
I'm new to AWS Glue; the solution I've found so far uses glueContext.read:



import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Wrap the filtered query in a subquery alias so it can be passed as the
# JDBC "dbtable" option; the WHERE clause then runs inside Redshift.
qry_for_df = """
(SELECT *
 FROM table1
 WHERE field1 = 'something') as subq
"""

df = (glueContext.read.format("jdbc")
      .option("driver", "com.amazon.redshift.jdbc.Driver")
      .option("url", "jdbc:redshift://…")
      .option("dbtable", qry_for_df)
      .option("user", "username")
      .option("password", "XXXXXXXX")
      .load())


Some considerations:
I haven't found a way to use AWS Glue's predefined JDBC connection for this, so I add the JAR library path s3://your-bucket/path/RedshiftJDBC42XXX.jar to the job.
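For reference, a dependent JAR can be attached through the job's --extra-jars special parameter; a minimal sketch with boto3, where the job name, IAM role, script location, and bucket paths are hypothetical placeholders:

import boto3

glue = boto3.client("glue")

# Attach the Redshift JDBC driver JAR via the --extra-jars argument.
# All names and S3 paths below are hypothetical.
glue.create_job(
    Name="redshift-prefilter-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/prefilter_job.py",
    },
    DefaultArguments={
        "--extra-jars": "s3://your-bucket/path/RedshiftJDBC42XXX.jar",
        "--TempDir": "s3://your-bucket/temp/",
    },
)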



I haven't included a way to store the password and URL, but here is a link to Amazon's best practices as a suggestion:
https://aws.amazon.com/blogs/big-data/use-aws-glue-to-run-etl-jobs-against-non-native-jdbc-data-sources/



You can take the connection parameters from your Glue metadata, but the database name is missing and the connection name has to be hard-coded or passed as a job parameter:
cnx_parameters = glueContext.extract_jdbc_conf("Your_Cnx_Name", catalog_id=None)
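For instance, the extracted configuration could feed the JDBC reader like this (a sketch, assuming extract_jdbc_conf returns url, user, and password keys for your connection; the connection name, database name, and query are placeholders):

# Hypothetical sketch: reuse the credentials of a Glue connection
# named "Your_Cnx_Name" for the JDBC read. The returned URL lacks the
# database name, so it is appended manually ("mydb" is a placeholder).
cnx = glueContext.extract_jdbc_conf("Your_Cnx_Name", catalog_id=None)

df = (glueContext.read.format("jdbc")
      .option("driver", "com.amazon.redshift.jdbc.Driver")
      .option("url", cnx["url"] + "/mydb")
      .option("dbtable", qry_for_df)
      .option("user", cnx["user"])
      .option("password", cnx["password"])
      .load())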






answered Mar 24 at 11:52 by fernolimits
• Hello, and thank you for your help. It deserves a try and, to be honest, I have not found another solution.

  – QTH
  Mar 25 at 14:21











• It would be nice if you shared the results. Did it work?

  – fernolimits
  Mar 27 at 17:32











• I did not succeed with that, so I went with another solution: I created a view. It is a bit ugly, but it works well with no effort. :-/

  – QTH
  Mar 28 at 9:34
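For reference, such a pre-filtering view might look like this in Redshift (a hypothetical sketch; the table and column names mirror the answer's placeholders):

CREATE VIEW filtered_table1 AS
SELECT *
FROM table1
WHERE field1 = 'something';

The Glue job can then read filtered_table1 instead of the full table, so only the filtered rows are unloaded to S3.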











