Terraform AWS Athena to use Glue catalog as db
























I'm confused as to how I should use terraform to connect Athena to my Glue Catalog database.



I use



resource "aws_glue_catalog_database" "catalog_database" {
  name = "${var.glue_db_name}"
}

resource "aws_glue_crawler" "datalake_crawler" {
  database_name = "${var.glue_db_name}"
  name          = "${var.crawler_name}"
  role          = "${aws_iam_role.crawler_iam_role.name}"
  description   = "${var.crawler_description}"
  table_prefix  = "${var.table_prefix}"
  schedule      = "${var.schedule}"

  s3_target {
    path = "s3://${var.data_bucket_name[0]}"
  }

  s3_target {
    path = "s3://${var.data_bucket_name[1]}"
  }
}


to create a Glue DB and a crawler to crawl S3 buckets (here only two), but I don't know how to link the Athena query service to the Glue DB. In the Terraform documentation for Athena, there doesn't appear to be a way to connect Athena to a Glue catalog, only to an S3 bucket. Clearly, however, Athena can be integrated with Glue.



How can I terraform an Athena database to use my Glue catalog as its data source rather than an S3 bucket?
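In case it helps, here is the shape of the `aws_athena_database` resource I was looking at. Note that (as it turned out) its `bucket` argument is for query results output, not source data; the resource and bucket names below are placeholders:

```hcl
# Sketch only: resource and bucket names are placeholders. The "bucket"
# argument holds Athena query *results*, not the source data -- the
# tables themselves come from the Glue catalog.
resource "aws_s3_bucket" "athena_results" {
  bucket = "my-athena-query-results"
}

resource "aws_athena_database" "queries" {
  name   = "${var.glue_db_name}"
  bucket = "${aws_s3_bucket.athena_results.bucket}"
}
```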





























  • Did you run the crawler? Did it create AWS Glue tables? If you do not define aws_glue_catalog_table resources with terraform that point to their respective S3 locations, the crawler will need to run at least once to create the tables. Once they are created, your Glue DB and the tables should become visible in Athena, even without defining a terraform aws_athena_database resource. – Martin, Mar 28 at 15:29

  • I've run the crawler. There are no tables created in the Glue db. To be honest, I don't even know if I've given the crawler the correct role/policies. I've created a gist that includes the code I'm using: gist.github.com/stevenranney/3f5545e1e736266807b1f337e0be58e0 – Steven, Mar 28 at 18:52

  • I suggest also adding the tag "terraform" (or "terraform-aws") to the question. – Martin, Apr 4 at 6:38
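For reference, an aws_glue_catalog_table of the kind Martin mentions, defined directly in Terraform rather than created by the crawler, might look like this. The column, location, and format details are illustrative assumptions (the SerDe shown assumes Parquet data):

```hcl
# Illustrative only: the column, path, and Parquet SerDe choices here
# are assumptions, not taken from the question's actual data.
resource "aws_glue_catalog_table" "products" {
  name          = "products"
  database_name = "${aws_glue_catalog_database.catalog_database.name}"
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://${var.data_bucket_name[0]}/products"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    columns {
      name = "product_id"
      type = "string"
    }
  }
}
```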

















Tags: amazon-web-services, terraform, aws-glue, terraform-provider-aws, aws-glue-data-catalog






asked Mar 12 at 19:08 by Steven, edited Apr 4 at 12:42







2 Answers
































Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:



Crawler role and role policy:



  • The assume_role_policy of the IAM role needs only Glue as principal

  • The IAM role policy allows actions for Glue, S3, and logs

  • The Glue actions and resources can probably be narrowed down to the ones really needed (this is work in progress)

  • The S3 actions are limited to those needed by the crawler

resource "aws_iam_role" "glue_crawler_role" {
  name = "analytics_glue_crawler_role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}


resource "aws_iam_role_policy" "glue_crawler_role_policy" {
  name   = "analytics_glue_crawler_role_policy"
  role   = "${aws_iam_role.glue_crawler_role.id}"
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetBucketAcl",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::analytics-product-data"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    }
  ]
}
EOF
}



S3 Bucket, Glue Database and Crawler:



resource "aws_s3_bucket" "product_bucket" {
  bucket = "analytics-product-data"
  acl    = "private"
}

resource "aws_glue_catalog_database" "analytics_db" {
  name = "inventory-analytics-db"
}

resource "aws_glue_crawler" "product_crawler" {
  database_name = "${aws_glue_catalog_database.analytics_db.name}"
  name          = "analytics-product-crawler"
  role          = "${aws_iam_role.glue_crawler_role.arn}"

  schedule = "cron(0 0 * * ? *)"

  configuration = "{\"Version\": 1.0, \"CrawlerOutput\": {\"Partitions\": {\"AddOrUpdateBehavior\": \"InheritFromTable\"}, \"Tables\": {\"AddOrUpdateBehavior\": \"MergeNewColumns\"}}}"

  schema_change_policy {
    delete_behavior = "DELETE_FROM_DATABASE"
  }

  s3_target {
    path = "s3://${aws_s3_bucket.product_bucket.bucket}/products"
  }
}
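As an aside, the crawler configuration JSON doesn't have to be a hand-escaped string; on newer Terraform (0.12+, an assumption about your setup) jsonencode produces the same value more readably:

```hcl
# Same crawler configuration as the escaped string, built with
# jsonencode (Terraform 0.12+) inside the aws_glue_crawler resource:
configuration = jsonencode({
  Version = 1.0
  CrawlerOutput = {
    Partitions = { AddOrUpdateBehavior = "InheritFromTable" }
    Tables     = { AddOrUpdateBehavior = "MergeNewColumns" }
  }
})
```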





















































    I had many things wrong in my Terraform code. To start with:



    1. The S3 bucket argument in the aws_athena_database code refers to the bucket for query output, not the data the table should be built from.

    2. I had set up my aws_glue_crawler to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db.


    3. I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was



      resource "aws_iam_role_policy_attachment" "crawler_attach" {
        policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
        role       = "${aws_iam_role.crawler_iam_role.name}"
      }


      After creating a second policy that explicitly allowed S3 access to all of the buckets I wanted to crawl, and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.



    The second policy:



    resource "aws_iam_policy" "crawler_bucket_policy" {
      name        = "crawler_bucket_policy"
      path        = "/"
      description = "Gives crawler access to buckets"
      policy      = <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1553807998309",
          "Action": "*",
          "Effect": "Allow",
          "Resource": "*"
        },
        {
          "Sid": "Stmt1553808056033",
          "Action": "s3:*",
          "Effect": "Allow",
          "Resource": "arn:aws:s3:::bucket0"
        },
        {
          "Sid": "Stmt1553808078743",
          "Action": "s3:*",
          "Effect": "Allow",
          "Resource": "arn:aws:s3:::bucket1"
        },
        {
          "Sid": "Stmt1553808099644",
          "Action": "s3:*",
          "Effect": "Allow",
          "Resource": "arn:aws:s3:::bucket2"
        },
        {
          "Sid": "Stmt1553808114975",
          "Action": "s3:*",
          "Effect": "Allow",
          "Resource": "arn:aws:s3:::bucket3"
        },
        {
          "Sid": "Stmt1553808128211",
          "Action": "s3:*",
          "Effect": "Allow",
          "Resource": "arn:aws:s3:::bucket4"
        }
      ]
    }
    EOF
    }
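    For completeness, attaching that policy is one more resource (a sketch; the resource label is arbitrary, and the role name follows my earlier code):

```hcl
# Attaches the second (bucket) policy to the same crawler role that
# already has the AWSGlueServiceRole managed policy attached.
resource "aws_iam_role_policy_attachment" "crawler_bucket_attach" {
  policy_arn = "${aws_iam_policy.crawler_bucket_policy.arn}"
  role       = "${aws_iam_role.crawler_iam_role.name}"
}
```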



    I'm confident that I can get away from hardcoding the bucket names in this policy, but I don't yet know how to do that.
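    One way to drop the hardcoding, sketched under the assumption that the bucket names already live in a list variable like var.data_bucket_name (as in my crawler code) and that Terraform 0.12+ for expressions are available:

```hcl
# Sketch: builds the ARN list from a list variable instead of one
# statement per hardcoded bucket. Assumes Terraform 0.12+ and a
# list variable var.data_bucket_name.
resource "aws_iam_policy" "crawler_bucket_policy" {
  name        = "crawler_bucket_policy"
  path        = "/"
  description = "Gives crawler access to buckets"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "AllowCrawlerBuckets"
        Action   = "s3:*"
        Effect   = "Allow"
        Resource = [for b in var.data_bucket_name : "arn:aws:s3:::${b}"]
      }
    ]
  })
}
```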





























    • crawler_bucket_policy has one statement allowing action "*" on resource "*" – this seems very open; is it really necessary? Our current setup, which I'll list as an answer below, allows action "glue:*" on resource "*" in addition to the s3 actions. – Martin, Mar 29 at 23:42












First answer by Martin: answered Mar 30 at 0:06, edited Mar 30 at 0:34
            0














            I had many things wrong in my Terraform code. To start with:



            1. The S3 bucket argument in the aws_athena_database code refers to the bucket for query output not the data the table should be built from.

            2. I had set up my aws_glue_crawler to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db.


            3. I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was



              resource "aws_iam_role_policy_attachment" "crawler_attach" 
              policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
              role = "$aws_iam_role.crawler_iam_role.name"



              after setting a second policy that explicitly allowed all S3 access to all of the buckets I wanted to crawl and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.



            The second policy:



            resource "aws_iam_policy" "crawler_bucket_policy" 
            name = "crawler_bucket_policy"
            path = "/"
            description = "Gives crawler access to buckets"
            policy = <<EOF

            "Version": "2012-10-17",
            "Statement": [

            "Sid": "Stmt1553807998309",
            "Action": "*",
            "Effect": "Allow",
            "Resource": "*"
            ,

            "Sid": "Stmt1553808056033",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::bucket0"
            ,

            "Sid": "Stmt1553808078743",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::bucket1"
            ,

            "Sid": "Stmt1553808099644",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::bucket2"
            ,

            "Sid": "Stmt1553808114975",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::bucket3"
            ,

            "Sid": "Stmt1553808128211",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::bucket4"

            ]

            EOF



            I'm confident the bucket names can be factored out of this policy rather than hardcoded, but I don't yet know how to do that.
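            One possible way to avoid hardcoding the bucket names is to drive the policy from a list variable and build the resource ARNs with `formatlist` and `jsonencode`. This is a sketch, not part of the original answer; the variable name `crawler_buckets` is an assumption:

            ```hcl
            # Hypothetical sketch: generate the bucket ARNs from a list variable
            # instead of repeating one statement per hardcoded bucket.
            variable "crawler_buckets" {
              type    = "list"
              default = ["bucket0", "bucket1", "bucket2", "bucket3", "bucket4"]
            }

            resource "aws_iam_policy" "crawler_bucket_policy" {
              name        = "crawler_bucket_policy"
              path        = "/"
              description = "Gives crawler access to buckets"
              policy      = <<EOF
            {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Action": "s3:*",
                  "Effect": "Allow",
                  "Resource": ${jsonencode(formatlist("arn:aws:s3:::%s", var.crawler_buckets))}
                }
              ]
            }
            EOF
            }
            ```

            Adding a bucket then only means appending to the list, and the single statement carries a JSON array of ARNs instead of one statement per bucket.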






            share|improve this answer























            • crawler_bucket_policy has one statement allowing action "*" on resource "*" - this seems very open, is it really necessary? Our current setup, which I'll list as an answer below, allows action "glue:*" on resource "*" in addition to the s3 actions.

              – Martin
              Mar 29 at 23:42
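              The narrower setup the comment describes (allow `glue:*` broadly, but scope `s3:*` to the crawled buckets) could be sketched roughly as follows; the resource name and bucket ARNs here are illustrative assumptions, not Martin's actual configuration:

              ```hcl
              # Sketch of a narrower crawler policy: Glue actions on all resources,
              # S3 actions only on the buckets (and objects) being crawled.
              resource "aws_iam_policy" "scoped_crawler_policy" {
                name   = "scoped_crawler_policy"
                policy = <<EOF
              {
                "Version": "2012-10-17",
                "Statement": [
                  {
                    "Action": "glue:*",
                    "Effect": "Allow",
                    "Resource": "*"
                  },
                  {
                    "Action": "s3:*",
                    "Effect": "Allow",
                    "Resource": [
                      "arn:aws:s3:::bucket0",
                      "arn:aws:s3:::bucket0/*"
                    ]
                  }
                ]
              }
              EOF
              }
              ```

              Note the second statement lists both the bucket ARN and the `/*` object ARN, since bucket-level and object-level S3 actions are authorized against different resource ARNs.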
















            answered Mar 28 at 22:20









            Steven
