Terraform AWS Athena to use Glue catalog as db Announcing the arrival of Valued Associate #679: Cesar Manara Planned maintenance scheduled April 23, 2019 at 23:30 UTC (7:30pm US/Eastern) Data science time! April 2019 and salary with experience The Ask Question Wizard is Live!AWS Accessing Redshift in a VPCCan crawlers update imported tables in AWS Glue?AWS Glue ETL Job fails with AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'HIVE_INVALID_METADATA in Amazon AthenaHow does AWS Glue ETL job retrieve data?Creating a Glue Data Catalog Table within a Glue JobAWS Glue multi-tenant files normalization into the common schemaWhy is my AWS Glue crawler not creating any tables?Adding multiple S3 paths to glue crawler with terraformAWS Glue - avro to parquet - Glue job getting an empty frame from catalog
Unix AIX passing variable and arguments to expect and spawn
Why do C and C++ allow the expression (int) + 4*5?
Is Bran literally the world's memory?
Trying to enter the Fox's den
What is the ongoing value of the Kanban board to the developers as opposed to management
Lights are flickering on and off after accidentally bumping into light switch
Pointing to problems without suggesting solutions
How to ask rejected full-time candidates to apply to teach individual courses?
A journey... into the MIND
2 sample t test for sample sizes - 30,000 and 150,000
What helicopter has the most rotor blades?
How is an IPA symbol that lacks a name (e.g. ɲ) called?
How to create a command for the "strange m" symbol in latex?
What is the difference between 准时 and 按时?
Who can become a wight?
Why does my GNOME settings mention "Moto C Plus"?
What documents does someone with a long-term visa need to travel to another Schengen country?
A German immigrant ancestor has a "Registration Affidavit of Alien Enemy" on file. What does that mean exactly?
How to charge percentage of transaction cost?
Who's this lady in the war room?
Checking IFI enabled on SQL server below 2016
Protagonist's race is hidden - should I reveal it?
Weaponising the Grasp-at-a-Distance spell
Marquee sign letters
Terraform AWS Athena to use Glue catalog as db
Announcing the arrival of Valued Associate #679: Cesar Manara
Planned maintenance scheduled April 23, 2019 at 23:30 UTC (7:30pm US/Eastern)
Data science time! April 2019 and salary with experience
The Ask Question Wizard is Live!AWS Accessing Redshift in a VPCCan crawlers update imported tables in AWS Glue?AWS Glue ETL Job fails with AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'HIVE_INVALID_METADATA in Amazon AthenaHow does AWS Glue ETL job retrieve data?Creating a Glue Data Catalog Table within a Glue JobAWS Glue multi-tenant files normalization into the common schemaWhy is my AWS Glue crawler not creating any tables?Adding multiple S3 paths to glue crawler with terraformAWS Glue - avro to parquet - Glue job getting an empty frame from catalog
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;
I'm confused as to how I should use terraform to connect Athena to my Glue Catalog database.
I use
resource "aws_glue_catalog_database" "catalog_database"
name = "$var.glue_db_name"
resource "aws_glue_crawler" "datalake_crawler"
database_name = "$var.glue_db_name"
name = "$var.crawler_name"
role = "$aws_iam_role.crawler_iam_role.name"
description = "$var.crawler_description"
table_prefix = "$var.table_prefix"
schedule = "$var.schedule"
s3_target
path = "s3://$var.data_bucket_name[0]"
s3_target
path = "s3://$var.data_bucket_name[1]"
to create a Glue DB and the crawler to crawl an s3 bucket (here only two), but I don't know how I link the Athena query service to the Glue DB. In the terraform documentation for Athena
, there doesn't appear to be a way to connect Athena to a Glue cataglog but only to an S3 Bucket. Clearly, however, Athena can be integrated with Glue.
How can I terraform an Athena database to use my Glue catalog as it's data source rather than an S3 bucket?
amazon-web-services terraform aws-glue terraform-provider-aws aws-glue-data-catalog
add a comment |
I'm confused as to how I should use terraform to connect Athena to my Glue Catalog database.
I use
resource "aws_glue_catalog_database" "catalog_database"
name = "$var.glue_db_name"
resource "aws_glue_crawler" "datalake_crawler"
database_name = "$var.glue_db_name"
name = "$var.crawler_name"
role = "$aws_iam_role.crawler_iam_role.name"
description = "$var.crawler_description"
table_prefix = "$var.table_prefix"
schedule = "$var.schedule"
s3_target
path = "s3://$var.data_bucket_name[0]"
s3_target
path = "s3://$var.data_bucket_name[1]"
to create a Glue DB and the crawler to crawl an s3 bucket (here only two), but I don't know how I link the Athena query service to the Glue DB. In the terraform documentation for Athena
, there doesn't appear to be a way to connect Athena to a Glue cataglog but only to an S3 Bucket. Clearly, however, Athena can be integrated with Glue.
How can I terraform an Athena database to use my Glue catalog as it's data source rather than an S3 bucket?
amazon-web-services terraform aws-glue terraform-provider-aws aws-glue-data-catalog
1
Did you run the crawler? Did it create AWS Glue tables? If you do not defineaws_glue_catalog_table
resources with terraform that point to their respective S3 locations, the crawler will need to run at least once to create the tables. Once they are created your Glue DB and the tables should become visible in Athena, even without defining a terraformaws_athena_database
resource.
– Martin
Mar 28 at 15:29
I've run the crawler. There are no tables created in the Glue db. To be honest, I don't even know if I've given the crawler the correct role/policies: I've created a gist that includes the code I'm using: gist.github.com/stevenranney/3f5545e1e736266807b1f337e0be58e0
– Steven
Mar 28 at 18:52
I suggest to also add tag "terraform" (or "terraform-aws") to the question.
– Martin
Apr 4 at 6:38
add a comment |
I'm confused as to how I should use terraform to connect Athena to my Glue Catalog database.
I use
resource "aws_glue_catalog_database" "catalog_database"
name = "$var.glue_db_name"
resource "aws_glue_crawler" "datalake_crawler"
database_name = "$var.glue_db_name"
name = "$var.crawler_name"
role = "$aws_iam_role.crawler_iam_role.name"
description = "$var.crawler_description"
table_prefix = "$var.table_prefix"
schedule = "$var.schedule"
s3_target
path = "s3://$var.data_bucket_name[0]"
s3_target
path = "s3://$var.data_bucket_name[1]"
to create a Glue DB and the crawler to crawl an s3 bucket (here only two), but I don't know how I link the Athena query service to the Glue DB. In the terraform documentation for Athena
, there doesn't appear to be a way to connect Athena to a Glue cataglog but only to an S3 Bucket. Clearly, however, Athena can be integrated with Glue.
How can I terraform an Athena database to use my Glue catalog as it's data source rather than an S3 bucket?
amazon-web-services terraform aws-glue terraform-provider-aws aws-glue-data-catalog
I'm confused as to how I should use terraform to connect Athena to my Glue Catalog database.
I use
resource "aws_glue_catalog_database" "catalog_database"
name = "$var.glue_db_name"
resource "aws_glue_crawler" "datalake_crawler"
database_name = "$var.glue_db_name"
name = "$var.crawler_name"
role = "$aws_iam_role.crawler_iam_role.name"
description = "$var.crawler_description"
table_prefix = "$var.table_prefix"
schedule = "$var.schedule"
s3_target
path = "s3://$var.data_bucket_name[0]"
s3_target
path = "s3://$var.data_bucket_name[1]"
to create a Glue DB and the crawler to crawl an s3 bucket (here only two), but I don't know how I link the Athena query service to the Glue DB. In the terraform documentation for Athena
, there doesn't appear to be a way to connect Athena to a Glue cataglog but only to an S3 Bucket. Clearly, however, Athena can be integrated with Glue.
How can I terraform an Athena database to use my Glue catalog as it's data source rather than an S3 bucket?
amazon-web-services terraform aws-glue terraform-provider-aws aws-glue-data-catalog
amazon-web-services terraform aws-glue terraform-provider-aws aws-glue-data-catalog
edited Apr 4 at 12:42
Steven
asked Mar 12 at 19:08
StevenSteven
1,6601132
1,6601132
1
Did you run the crawler? Did it create AWS Glue tables? If you do not defineaws_glue_catalog_table
resources with terraform that point to their respective S3 locations, the crawler will need to run at least once to create the tables. Once they are created your Glue DB and the tables should become visible in Athena, even without defining a terraformaws_athena_database
resource.
– Martin
Mar 28 at 15:29
I've run the crawler. There are no tables created in the Glue db. To be honest, I don't even know if I've given the crawler the correct role/policies: I've created a gist that includes the code I'm using: gist.github.com/stevenranney/3f5545e1e736266807b1f337e0be58e0
– Steven
Mar 28 at 18:52
I suggest to also add tag "terraform" (or "terraform-aws") to the question.
– Martin
Apr 4 at 6:38
add a comment |
1
Did you run the crawler? Did it create AWS Glue tables? If you do not defineaws_glue_catalog_table
resources with terraform that point to their respective S3 locations, the crawler will need to run at least once to create the tables. Once they are created your Glue DB and the tables should become visible in Athena, even without defining a terraformaws_athena_database
resource.
– Martin
Mar 28 at 15:29
I've run the crawler. There are no tables created in the Glue db. To be honest, I don't even know if I've given the crawler the correct role/policies: I've created a gist that includes the code I'm using: gist.github.com/stevenranney/3f5545e1e736266807b1f337e0be58e0
– Steven
Mar 28 at 18:52
I suggest to also add tag "terraform" (or "terraform-aws") to the question.
– Martin
Apr 4 at 6:38
1
1
Did you run the crawler? Did it create AWS Glue tables? If you do not define
aws_glue_catalog_table
resources with terraform that point to their respective S3 locations, the crawler will need to run at least once to create the tables. Once they are created your Glue DB and the tables should become visible in Athena, even without defining a terraform aws_athena_database
resource.– Martin
Mar 28 at 15:29
Did you run the crawler? Did it create AWS Glue tables? If you do not define
aws_glue_catalog_table
resources with terraform that point to their respective S3 locations, the crawler will need to run at least once to create the tables. Once they are created your Glue DB and the tables should become visible in Athena, even without defining a terraform aws_athena_database
resource.– Martin
Mar 28 at 15:29
I've run the crawler. There are no tables created in the Glue db. To be honest, I don't even know if I've given the crawler the correct role/policies: I've created a gist that includes the code I'm using: gist.github.com/stevenranney/3f5545e1e736266807b1f337e0be58e0
– Steven
Mar 28 at 18:52
I've run the crawler. There are no tables created in the Glue db. To be honest, I don't even know if I've given the crawler the correct role/policies: I've created a gist that includes the code I'm using: gist.github.com/stevenranney/3f5545e1e736266807b1f337e0be58e0
– Steven
Mar 28 at 18:52
I suggest to also add tag "terraform" (or "terraform-aws") to the question.
– Martin
Apr 4 at 6:38
I suggest to also add tag "terraform" (or "terraform-aws") to the question.
– Martin
Apr 4 at 6:38
add a comment |
2 Answers
2
active
oldest
votes
Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:
Crawler role and role policy:
- The assume_role_policy of the IAM role needs only Glue as principal
- The IAM role policy allows actions for Glue, S3, and logs
- The Glue actions and resources can probably be narrowed down to the ones really needed (this is work in progress)
- The S3 actions are limited to those needed by the crawler
resource "aws_iam_role" "glue_crawler_role"
name = "analytics_glue_crawler_role"
assume_role_policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Action": "sts:AssumeRole",
"Principal":
"Service": "glue.amazonaws.com"
,
"Effect": "Allow",
"Sid": ""
]
EOF
resource "aws_iam_role_policy" "glue_crawler_role_policy"
name = "analytics_glue_crawler_role_policy"
role = "$aws_iam_role.glue_crawler_role.id"
policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Effect": "Allow",
"Action": [
"glue:*",
],
"Resource": [
"*"
]
,
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket",
"s3:GetBucketAcl",
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::analytics-product-data"
]
,
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:*:*:/aws-glue/*"
]
]
EOF
S3 Bucket, Glue Database and Crawler:
resource "aws_s3_bucket" "product_bucket"
bucket = "analytics-product-data"
acl = "private"
resource "aws_glue_catalog_database" "analytics_db"
name = "inventory-analytics-db"
resource "aws_glue_crawler" "product_crawler"
database_name = "$aws_glue_catalog_database.analytics_db.name"
name = "analytics-product-crawler"
role = "$aws_iam_role.glue_crawler_role.arn"
schedule = "cron(0 0 * * ? *)"
configuration = ""Version": 1.0, "CrawlerOutput": "Partitions": "AddOrUpdateBehavior": "InheritFromTable" , "Tables": "AddOrUpdateBehavior": "MergeNewColumns" "
schema_change_policy
delete_behavior = "DELETE_FROM_DATABASE"
s3_target
path = "s3://$aws_s3_bucket.product_bucket.bucket/products"
add a comment |
I had many things wrong in my Terraform code. To start with:
- The
S3
bucket argument in theaws_athena_database
code refers to the bucket for query output not the data the table should be built from. - I had set up my
aws_glue_crawler
to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db. I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was
resource "aws_iam_role_policy_attachment" "crawler_attach"
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
role = "$aws_iam_role.crawler_iam_role.name"
after setting a second policy that explicitly allowed all
S3
access to all of the buckets I wanted to crawl and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.
The second policy:
resource "aws_iam_policy" "crawler_bucket_policy"
name = "crawler_bucket_policy"
path = "/"
description = "Gives crawler access to buckets"
policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Sid": "Stmt1553807998309",
"Action": "*",
"Effect": "Allow",
"Resource": "*"
,
"Sid": "Stmt1553808056033",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket0"
,
"Sid": "Stmt1553808078743",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket1"
,
"Sid": "Stmt1553808099644",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket2"
,
"Sid": "Stmt1553808114975",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket3"
,
"Sid": "Stmt1553808128211",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket4"
]
EOF
I'm confident that I can get away from hardcoding the bucket names in this policy but I don't yet know how to do that.
crawler_bucket_policy has one statement allowing action"*"
on resource"*"
- this seems very open, is it really necessary? Our current setup, which I'll list as an answer below, allows action"glue:*"
on resource"*"
in addition to the s3 actions.
– Martin
Mar 29 at 23:42
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55129035%2fterraform-aws-athena-to-use-glue-catalog-as-db%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:
Crawler role and role policy:
- The assume_role_policy of the IAM role needs only Glue as principal
- The IAM role policy allows actions for Glue, S3, and logs
- The Glue actions and resources can probably be narrowed down to the ones really needed (this is work in progress)
- The S3 actions are limited to those needed by the crawler
resource "aws_iam_role" "glue_crawler_role"
name = "analytics_glue_crawler_role"
assume_role_policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Action": "sts:AssumeRole",
"Principal":
"Service": "glue.amazonaws.com"
,
"Effect": "Allow",
"Sid": ""
]
EOF
resource "aws_iam_role_policy" "glue_crawler_role_policy"
name = "analytics_glue_crawler_role_policy"
role = "$aws_iam_role.glue_crawler_role.id"
policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Effect": "Allow",
"Action": [
"glue:*",
],
"Resource": [
"*"
]
,
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket",
"s3:GetBucketAcl",
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::analytics-product-data"
]
,
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:*:*:/aws-glue/*"
]
]
EOF
S3 Bucket, Glue Database and Crawler:
resource "aws_s3_bucket" "product_bucket"
bucket = "analytics-product-data"
acl = "private"
resource "aws_glue_catalog_database" "analytics_db"
name = "inventory-analytics-db"
resource "aws_glue_crawler" "product_crawler"
database_name = "$aws_glue_catalog_database.analytics_db.name"
name = "analytics-product-crawler"
role = "$aws_iam_role.glue_crawler_role.arn"
schedule = "cron(0 0 * * ? *)"
configuration = ""Version": 1.0, "CrawlerOutput": "Partitions": "AddOrUpdateBehavior": "InheritFromTable" , "Tables": "AddOrUpdateBehavior": "MergeNewColumns" "
schema_change_policy
delete_behavior = "DELETE_FROM_DATABASE"
s3_target
path = "s3://$aws_s3_bucket.product_bucket.bucket/products"
add a comment |
Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:
Crawler role and role policy:
- The assume_role_policy of the IAM role needs only Glue as principal
- The IAM role policy allows actions for Glue, S3, and logs
- The Glue actions and resources can probably be narrowed down to the ones really needed (this is work in progress)
- The S3 actions are limited to those needed by the crawler
resource "aws_iam_role" "glue_crawler_role"
name = "analytics_glue_crawler_role"
assume_role_policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Action": "sts:AssumeRole",
"Principal":
"Service": "glue.amazonaws.com"
,
"Effect": "Allow",
"Sid": ""
]
EOF
resource "aws_iam_role_policy" "glue_crawler_role_policy"
name = "analytics_glue_crawler_role_policy"
role = "$aws_iam_role.glue_crawler_role.id"
policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Effect": "Allow",
"Action": [
"glue:*",
],
"Resource": [
"*"
]
,
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket",
"s3:GetBucketAcl",
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::analytics-product-data"
]
,
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:*:*:/aws-glue/*"
]
]
EOF
S3 Bucket, Glue Database and Crawler:
resource "aws_s3_bucket" "product_bucket"
bucket = "analytics-product-data"
acl = "private"
resource "aws_glue_catalog_database" "analytics_db"
name = "inventory-analytics-db"
resource "aws_glue_crawler" "product_crawler"
database_name = "$aws_glue_catalog_database.analytics_db.name"
name = "analytics-product-crawler"
role = "$aws_iam_role.glue_crawler_role.arn"
schedule = "cron(0 0 * * ? *)"
configuration = ""Version": 1.0, "CrawlerOutput": "Partitions": "AddOrUpdateBehavior": "InheritFromTable" , "Tables": "AddOrUpdateBehavior": "MergeNewColumns" "
schema_change_policy
delete_behavior = "DELETE_FROM_DATABASE"
s3_target
path = "s3://$aws_s3_bucket.product_bucket.bucket/products"
add a comment |
Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:
Crawler role and role policy:
- The assume_role_policy of the IAM role needs only Glue as principal
- The IAM role policy allows actions for Glue, S3, and logs
- The Glue actions and resources can probably be narrowed down to the ones really needed (this is work in progress)
- The S3 actions are limited to those needed by the crawler
resource "aws_iam_role" "glue_crawler_role"
name = "analytics_glue_crawler_role"
assume_role_policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Action": "sts:AssumeRole",
"Principal":
"Service": "glue.amazonaws.com"
,
"Effect": "Allow",
"Sid": ""
]
EOF
resource "aws_iam_role_policy" "glue_crawler_role_policy"
name = "analytics_glue_crawler_role_policy"
role = "$aws_iam_role.glue_crawler_role.id"
policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Effect": "Allow",
"Action": [
"glue:*",
],
"Resource": [
"*"
]
,
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket",
"s3:GetBucketAcl",
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::analytics-product-data"
]
,
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:*:*:/aws-glue/*"
]
]
EOF
S3 Bucket, Glue Database and Crawler:
resource "aws_s3_bucket" "product_bucket"
bucket = "analytics-product-data"
acl = "private"
resource "aws_glue_catalog_database" "analytics_db"
name = "inventory-analytics-db"
resource "aws_glue_crawler" "product_crawler"
database_name = "$aws_glue_catalog_database.analytics_db.name"
name = "analytics-product-crawler"
role = "$aws_iam_role.glue_crawler_role.arn"
schedule = "cron(0 0 * * ? *)"
configuration = ""Version": 1.0, "CrawlerOutput": "Partitions": "AddOrUpdateBehavior": "InheritFromTable" , "Tables": "AddOrUpdateBehavior": "MergeNewColumns" "
schema_change_policy
delete_behavior = "DELETE_FROM_DATABASE"
s3_target
path = "s3://$aws_s3_bucket.product_bucket.bucket/products"
Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:
Crawler role and role policy:
- The assume_role_policy of the IAM role needs only Glue as principal
- The IAM role policy allows actions for Glue, S3, and logs
- The Glue actions and resources can probably be narrowed down to the ones really needed (this is work in progress)
- The S3 actions are limited to those needed by the crawler
resource "aws_iam_role" "glue_crawler_role"
name = "analytics_glue_crawler_role"
assume_role_policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Action": "sts:AssumeRole",
"Principal":
"Service": "glue.amazonaws.com"
,
"Effect": "Allow",
"Sid": ""
]
EOF
resource "aws_iam_role_policy" "glue_crawler_role_policy"
name = "analytics_glue_crawler_role_policy"
role = "$aws_iam_role.glue_crawler_role.id"
policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Effect": "Allow",
"Action": [
"glue:*",
],
"Resource": [
"*"
]
,
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket",
"s3:GetBucketAcl",
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::analytics-product-data"
]
,
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:*:*:/aws-glue/*"
]
]
EOF
S3 Bucket, Glue Database and Crawler:
resource "aws_s3_bucket" "product_bucket"
bucket = "analytics-product-data"
acl = "private"
resource "aws_glue_catalog_database" "analytics_db"
name = "inventory-analytics-db"
resource "aws_glue_crawler" "product_crawler"
database_name = "$aws_glue_catalog_database.analytics_db.name"
name = "analytics-product-crawler"
role = "$aws_iam_role.glue_crawler_role.arn"
schedule = "cron(0 0 * * ? *)"
configuration = ""Version": 1.0, "CrawlerOutput": "Partitions": "AddOrUpdateBehavior": "InheritFromTable" , "Tables": "AddOrUpdateBehavior": "MergeNewColumns" "
schema_change_policy
delete_behavior = "DELETE_FROM_DATABASE"
s3_target
path = "s3://$aws_s3_bucket.product_bucket.bucket/products"
edited Mar 30 at 0:34
answered Mar 30 at 0:06
MartinMartin
7118
7118
add a comment |
add a comment |
I had many things wrong in my Terraform code. To start with:
- The
S3
bucket argument in theaws_athena_database
code refers to the bucket for query output not the data the table should be built from. - I had set up my
aws_glue_crawler
to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db. I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was
resource "aws_iam_role_policy_attachment" "crawler_attach"
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
role = "$aws_iam_role.crawler_iam_role.name"
after setting a second policy that explicitly allowed all
S3
access to all of the buckets I wanted to crawl and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.
The second policy:
resource "aws_iam_policy" "crawler_bucket_policy"
name = "crawler_bucket_policy"
path = "/"
description = "Gives crawler access to buckets"
policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Sid": "Stmt1553807998309",
"Action": "*",
"Effect": "Allow",
"Resource": "*"
,
"Sid": "Stmt1553808056033",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket0"
,
"Sid": "Stmt1553808078743",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket1"
,
"Sid": "Stmt1553808099644",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket2"
,
"Sid": "Stmt1553808114975",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket3"
,
"Sid": "Stmt1553808128211",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket4"
]
EOF
I'm confident that I can get away from hardcoding the bucket names in this policy but I don't yet know how to do that.
crawler_bucket_policy has one statement allowing action"*"
on resource"*"
- this seems very open, is it really necessary? Our current setup, which I'll list as an answer below, allows action"glue:*"
on resource"*"
in addition to the s3 actions.
– Martin
Mar 29 at 23:42
add a comment |
I had many things wrong in my Terraform code. To start with:
- The
S3
bucket argument in theaws_athena_database
code refers to the bucket for query output not the data the table should be built from. - I had set up my
aws_glue_crawler
to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db. I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was
resource "aws_iam_role_policy_attachment" "crawler_attach"
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
role = "$aws_iam_role.crawler_iam_role.name"
after setting a second policy that explicitly allowed all
S3
access to all of the buckets I wanted to crawl and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.
The second policy:
resource "aws_iam_policy" "crawler_bucket_policy"
name = "crawler_bucket_policy"
path = "/"
description = "Gives crawler access to buckets"
policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Sid": "Stmt1553807998309",
"Action": "*",
"Effect": "Allow",
"Resource": "*"
,
"Sid": "Stmt1553808056033",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket0"
,
"Sid": "Stmt1553808078743",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket1"
,
"Sid": "Stmt1553808099644",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket2"
,
"Sid": "Stmt1553808114975",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket3"
,
"Sid": "Stmt1553808128211",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket4"
]
EOF
I'm confident that I can get away from hardcoding the bucket names in this policy but I don't yet know how to do that.
crawler_bucket_policy has one statement allowing action"*"
on resource"*"
- this seems very open, is it really necessary? Our current setup, which I'll list as an answer below, allows action"glue:*"
on resource"*"
in addition to the s3 actions.
– Martin
Mar 29 at 23:42
add a comment |
I had many things wrong in my Terraform code. To start with:
- The
S3
bucket argument in theaws_athena_database
code refers to the bucket for query output not the data the table should be built from. - I had set up my
aws_glue_crawler
to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db. I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was
resource "aws_iam_role_policy_attachment" "crawler_attach"
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
role = "$aws_iam_role.crawler_iam_role.name"
after setting a second policy that explicitly allowed all
S3
access to all of the buckets I wanted to crawl and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.
The second policy:
resource "aws_iam_policy" "crawler_bucket_policy"
name = "crawler_bucket_policy"
path = "/"
description = "Gives crawler access to buckets"
policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Sid": "Stmt1553807998309",
"Action": "*",
"Effect": "Allow",
"Resource": "*"
,
"Sid": "Stmt1553808056033",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket0"
,
"Sid": "Stmt1553808078743",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket1"
,
"Sid": "Stmt1553808099644",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket2"
,
"Sid": "Stmt1553808114975",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket3"
,
"Sid": "Stmt1553808128211",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket4"
]
EOF
I'm confident that I can get away from hardcoding the bucket names in this policy but I don't yet know how to do that.
I had many things wrong in my Terraform code. To start with:
- The
S3
bucket argument in theaws_athena_database
code refers to the bucket for query output not the data the table should be built from. - I had set up my
aws_glue_crawler
to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db. I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was
resource "aws_iam_role_policy_attachment" "crawler_attach"
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
role = "$aws_iam_role.crawler_iam_role.name"
after setting a second policy that explicitly allowed all
S3
access to all of the buckets I wanted to crawl and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.
The second policy:
resource "aws_iam_policy" "crawler_bucket_policy"
name = "crawler_bucket_policy"
path = "/"
description = "Gives crawler access to buckets"
policy = <<EOF
"Version": "2012-10-17",
"Statement": [
"Sid": "Stmt1553807998309",
"Action": "*",
"Effect": "Allow",
"Resource": "*"
,
"Sid": "Stmt1553808056033",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket0"
,
"Sid": "Stmt1553808078743",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket1"
,
"Sid": "Stmt1553808099644",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket2"
,
"Sid": "Stmt1553808114975",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket3"
,
"Sid": "Stmt1553808128211",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket4"
]
EOF
I'm confident that I can get away from hardcoding the bucket names in this policy but I don't yet know how to do that.
answered Mar 28 at 22:20
StevenSteven
1,6601132
1,6601132
crawler_bucket_policy has one statement allowing action"*"
on resource"*"
- this seems very open, is it really necessary? Our current setup, which I'll list as an answer below, allows action"glue:*"
on resource"*"
in addition to the s3 actions.
– Martin
Mar 29 at 23:42
add a comment |
crawler_bucket_policy has one statement allowing action"*"
on resource"*"
- this seems very open, is it really necessary? Our current setup, which I'll list as an answer below, allows action"glue:*"
on resource"*"
in addition to the s3 actions.
– Martin
Mar 29 at 23:42
crawler_bucket_policy has one statement allowing action
"*"
on resource "*"
- this seems very open, is it really necessary? Our current setup, which I'll list as an answer below, allows action "glue:*"
on resource "*"
in addition to the s3 actions.– Martin
Mar 29 at 23:42
crawler_bucket_policy has one statement allowing action
"*"
on resource "*"
- this seems very open, is it really necessary? Our current setup, which I'll list as an answer below, allows action "glue:*"
on resource "*"
in addition to the s3 actions.– Martin
Mar 29 at 23:42
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55129035%2fterraform-aws-athena-to-use-glue-catalog-as-db%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Did you run the crawler? Did it create AWS Glue tables? If you do not define
aws_glue_catalog_table
resources with terraform that point to their respective S3 locations, the crawler will need to run at least once to create the tables. Once they are created your Glue DB and the tables should become visible in Athena, even without defining a terraformaws_athena_database
resource.– Martin
Mar 28 at 15:29
I've run the crawler. There are no tables created in the Glue db. To be honest, I don't even know if I've given the crawler the correct role/policies: I've created a gist that includes the code I'm using: gist.github.com/stevenranney/3f5545e1e736266807b1f337e0be58e0
– Steven
Mar 28 at 18:52
I suggest to also add tag "terraform" (or "terraform-aws") to the question.
– Martin
Apr 4 at 6:38