Glue

Experimental
Creates: Assets

Configure in the UI

This plugin can be configured directly in the Marmot UI with a step-by-step wizard.

The Glue plugin discovers and catalogs AWS Glue resources including jobs, databases, tables and crawlers. It captures metadata such as job configurations, table schemas, crawler schedules and database properties. Iceberg-managed tables are automatically skipped (use the dedicated Iceberg plugin instead).
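The Iceberg skip described above can be pictured with a small sketch. It is not the plugin's actual implementation: it assumes table descriptions shaped like the AWS Glue `GetTables` response and assumes an Iceberg-managed table is recognizable by a `table_type` parameter set to `ICEBERG`. The function names are hypothetical.

```python
def is_iceberg_table(table: dict) -> bool:
    """Return True if a Glue table description looks Iceberg-managed.

    Assumption: Iceberg tables carry Parameters["table_type"] == "ICEBERG".
    """
    params = table.get("Parameters") or {}
    return str(params.get("table_type", "")).upper() == "ICEBERG"


def tables_to_catalog(tables: list) -> list:
    """Keep only the tables that are not Iceberg-managed."""
    return [t for t in tables if not is_iceberg_table(t)]
```

Tables with no `Parameters` key at all are treated as regular Glue tables and kept.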

Required Permissions

Example Configuration


credentials:
  region: "us-east-1"
  profile: "production"
  role: "<role>"
tags:
  - "aws"
discover_jobs: true
discover_databases: true
discover_tables: true
discover_crawlers: true

Configuration

The following configuration options are available:

| Property | Type | Required | Description |
|---|---|---|---|
| credentials | AWSCredentials | false | AWS credentials configuration |
| discover_crawlers | bool | false | Whether to discover Glue crawlers |
| discover_databases | bool | false | Whether to discover Glue databases |
| discover_jobs | bool | false | Whether to discover Glue jobs |
| discover_tables | bool | false | Whether to discover Glue tables |
| external_links | []ExternalLink | false | External links to show on all assets |
| filter | Filter | false | Filter discovered assets by name (regex) |
| include_tags | []string | false | List of AWS tags to include as metadata. By default, all tags are included. |
| tags | TagsConfig | false | Tags to apply to discovered assets |
| tags_to_metadata | bool | false | Convert AWS tags to Marmot metadata |
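To make the `include_tags` and `filter` options more concrete, here is a rough Python sketch of the behavior they describe. It is an illustration only: the function names are hypothetical, and the `include`/`exclude` pattern lists are an assumed shape for the Filter type, which this page does not spell out.

```python
import re
from typing import Dict, List, Optional


def tags_to_metadata(aws_tags: Dict[str, str],
                     include_tags: Optional[List[str]] = None) -> Dict[str, str]:
    """Copy AWS tags into metadata; an empty include list means all tags."""
    if not include_tags:
        return dict(aws_tags)
    wanted = set(include_tags)
    return {key: value for key, value in aws_tags.items() if key in wanted}


def name_passes_filter(name: str,
                       include: Optional[List[str]] = None,
                       exclude: Optional[List[str]] = None) -> bool:
    """Apply regex name filtering (assumed include/exclude semantics)."""
    # An exclude match always wins over an include match in this sketch.
    if exclude and any(re.search(pattern, name) for pattern in exclude):
        return False
    if include:
        return any(re.search(pattern, name) for pattern in include)
    return True
```

For example, `name_passes_filter("prod_orders", include=["^prod_"])` accepts the asset, while adding `exclude=["orders$"]` rejects it.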

Available Metadata

The following metadata fields are available:

| Field | Type | Description |
|---|---|---|
| catalog_id | string | ID of the Data Catalog |
| classification | string | Classification of the table data (csv, parquet, json, etc.) |
| classifiers | string | Custom classifiers used by the crawler |
| connections | string | Connections used by the job |
| create_time | string | Date and time the database was created |
| create_time | string | Date and time the table was created |
| created_on | string | Date and time the job was created |
| creation_time | string | Date and time the crawler was created |
| database_name | string | Target database for the crawler |
| database_name | string | Name of the database containing the table |
| description | string | Description of the database |
| glue_version | string | Glue version used by the job |
| input_format | string | Hadoop input format class |
| last_crawl_error | string | Error message from the last crawl |
| last_crawl_status | string | Status of the last crawl |
| last_crawl_time | string | Start time of the last crawl |
| last_modified_on | string | Date and time the job was last modified |
| last_updated | string | Date and time the crawler was last updated |
| location | string | S3 location of the table data |
| location_uri | string | Location of the database |
| max_capacity | float64 | Maximum number of DPU that can be allocated |
| max_retries | int | Maximum number of retries |
| number_of_workers | int32 | Number of workers allocated to the job |
| output_format | string | Hadoop output format class |
| owner | string | Owner of the table |
| parameters | string | Database parameters |
| partition_keys | string | Partition key columns |
| recrawl_behavior | string | Recrawl behavior policy |
| retention | int32 | Retention period in days |
| role | string | IAM role ARN assigned to the job |
| role | string | IAM role ARN assigned to the crawler |
| schedule | string | Cron schedule expression |
| schema_delete_behavior | string | Behavior when schema objects are deleted |
| schema_update_behavior | string | Behavior when schema changes are detected |
| script_location | string | S3 location of the job script |
| security_configuration | string | Security configuration applied to the job |
| serde | string | Serialization/deserialization library |
| state | string | Current state of the crawler (READY, RUNNING, STOPPING) |
| table_type | string | Type of table (EXTERNAL_TABLE, VIRTUAL_VIEW, etc.) |
| targets | string | Summary of crawler targets |
| timeout | int32 | Job timeout in minutes |
| type | string | Job command type (glueetl, pythonshell, gluestreaming) |
| update_time | string | Date and time the table was last updated |
| worker_type | string | Worker type (Standard, G.1X, G.2X, etc.) |
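As a way to read the job-related rows above, the following sketch maps a Glue job description (shaped like the `Job` structure returned by the AWS Glue `GetJob` API) onto the listed metadata field names. The mapping is an assumption made for illustration, not the plugin's source, and `job_metadata` is a hypothetical helper.

```python
from datetime import datetime


def _to_text(value):
    """Render timestamps as ISO 8601 strings; pass None through."""
    if value is None:
        return None
    return value.isoformat() if isinstance(value, datetime) else str(value)


def job_metadata(job: dict) -> dict:
    """Build the job metadata fields from a Glue Job description."""
    command = job.get("Command") or {}
    connections = (job.get("Connections") or {}).get("Connections") or []
    fields = {
        "type": command.get("Name"),
        "script_location": command.get("ScriptLocation"),
        "role": job.get("Role"),
        "glue_version": job.get("GlueVersion"),
        "worker_type": job.get("WorkerType"),
        "number_of_workers": job.get("NumberOfWorkers"),
        "max_capacity": job.get("MaxCapacity"),
        "max_retries": job.get("MaxRetries"),
        "timeout": job.get("Timeout"),
        "security_configuration": job.get("SecurityConfiguration"),
        # Joined into one string because the table lists this field as string.
        "connections": ", ".join(connections) or None,
        "created_on": _to_text(job.get("CreatedOn")),
        "last_modified_on": _to_text(job.get("LastModifiedOn")),
    }
    # Omit fields the job does not define rather than emitting empty values.
    return {key: value for key, value in fields.items() if value is not None}
```

Fields missing from the job description are left out of the result, so a job without `MaxCapacity` simply has no `max_capacity` entry.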