Introduction to Amazon AppFlow
Amazon AppFlow is a fully managed cloud ETL service that enables businesses to securely transfer data bidirectionally between a limited set of Software-as-a-Service (SaaS) applications, such as Salesforce, Marketo, and Slack, and AWS services such as Amazon S3 and Amazon Redshift, with minimal code and setup time.
Each ETL task set up to move data is referred to as a “Flow”. Amazon AppFlow allows you to run data flows at almost any scale and frequency: on a schedule, in response to a business event, or on demand. You can apply data transformation functions such as filtering and validation to generate rich, ready-to-use data as part of the flow, avoiding the need for additional processing steps. AppFlow automatically encrypts data in motion and, for SaaS applications that are integrated with AWS PrivateLink, lets you prevent data from flowing over the public Internet, lowering security risks.
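As a minimal illustration, once a flow has been configured in the AppFlow console, an on-demand run can be triggered programmatically with boto3. The flow name below is purely a placeholder for this sketch.

```python
import boto3

# Assumes a flow named "salesforce-to-s3" was already created in AppFlow;
# the name and region are illustrative only.
appflow = boto3.client("appflow", region_name="us-east-1")

# Trigger an on-demand run of the flow.
response = appflow.start_flow(flowName="salesforce-to-s3")
print(response["flowArn"], response["executionId"])
```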
Introduction to AWS Glue
AWS Glue is a fully managed ETL (Extract, Transform, and Load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably across multiple data stores and data streams. AWS Glue minimizes the cost, complexity, and time spent creating ETL jobs. AWS Glue is made up of three components: a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that generates Python or Scala code automatically, and a flexible scheduler that manages dependency resolution, job monitoring, and retries. Since AWS Glue is serverless, there is no infrastructure to set up or manage.
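For example, the Data Catalog is typically populated by a crawler. The sketch below creates and starts one with boto3; the bucket, IAM role, database, and crawler names are placeholders, not values from this article.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans an S3 prefix and registers the discovered
# tables in the Glue Data Catalog. All names here are placeholders.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
)

# Run the crawler once; it can also be put on a schedule.
glue.start_crawler(Name="sales-data-crawler")
```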
Understanding the Differences between Amazon AppFlow and AWS Glue
AWS Glue and Amazon AppFlow belong to the “Big Data Tools” category of the tech stack.
Here are a few primary differences between these Cloud Integration platforms:
1) Technical Knowledge
Amazon AppFlow provides a low-code, easy-to-use interface for configuring data flows, following a workflow-based ETL tool model. A Flow in Amazon AppFlow can be configured by almost anyone, including people with minimal technical skills, which reduces reliance on the engineering team.
AWS Glue, on the other hand, is a code-based ETL tool that runs jobs on Apache Spark. Engineers who need to customize a generated ETL job must be familiar with Spark, and because the job code is written in Scala or Python, they need to know one of those languages as well. As a result, not every data professional will be able to tailor generated ETL jobs to their unique requirements.
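To give a sense of what that code looks like, here is a minimal PySpark sketch of a Glue job that reads a Catalog table and writes it to S3 as Parquet. The database, table, and bucket names are hypothetical.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard boilerplate of an AWS Glue PySpark job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog (names are placeholders).
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Write the data to S3 in Parquet format.
glueContext.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/processed/orders/"},
    format="parquet",
)

job.commit()
```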
2) Transformations
Amazon AppFlow’s data transformation capabilities are limited to operations such as masking data or filtering out unwanted records. More complex transformations, such as currency conversion or date format standardization, are not possible.
AWS Glue provides a collection of built-in transformations that you can use to process your data, including ApplyMapping, DropFields, DropNullFields, Filter, Join, Map, MapToCollection, Relationalize, RenameField, ResolveChoice, SelectFields, SelectFromCollection, Spigot, SplitFields, SplitRows, and Unbox. These transformations are called from your ETL script.
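As a short sketch of how two of these built-in transformations are used inside a job script, the example below applies Filter and then DropFields to a DynamicFrame. The Catalog database, table, and field names are assumptions for illustration.

```python
from awsglue.transforms import Filter, DropFields
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load a Catalog table (names are placeholders for this sketch).
opportunities = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="opportunities"
)

# Filter: keep only records whose stage is "Closed Won".
closed = Filter.apply(
    frame=opportunities,
    f=lambda record: record["StageName"] == "Closed Won",
)

# DropFields: remove a column that is not needed downstream.
trimmed = DropFields.apply(frame=closed, paths=["Description"])
```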
3) Incremental Update of Data
Each flow run in Amazon AppFlow transfers the whole dataset from source to destination; there is no way to restrict the transfer to only the data that has changed since the last run. If only a few records changed in your SaaS application, you still end up consuming a large portion of your AppFlow quota (and money).
AWS Glue jobs use job bookmarks to process only the data added since the last job run. A job bookmark captures the state of various job elements, including sources, transformations, and targets. Similarly, for an Amazon S3 data source, an incremental crawl only crawls folders that have been added since the last crawler run; without this option the crawler re-crawls the entire dataset, so enabling it saves significant time and money.
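A minimal sketch of turning bookmarks on when starting a job run with boto3 is shown below; the job name is a placeholder.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start a job run with bookmarks enabled so only data added since the last
# successful run is processed. The job name is illustrative.
glue.start_job_run(
    JobName="orders-etl-job",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```

Inside the job script, each source read also needs a transformation_ctx argument so that Glue can track bookmark state for that source.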
4) Data Load Limitations
Each flow defined in Amazon AppFlow transfers data from a single object or entity in your SaaS application. Consider Salesforce as an example: Salesforce exposes more than 800 different objects, including opportunities, deals, customer contacts, accounts, and so on. To transfer data from all of these Salesforce objects, you would have to create and configure 800+ flows one at a time.
With AWS Glue grouping enabled, a benchmark AWS Glue ETL job can process more than one million files using the standard AWS Glue worker type. The optional groupSize parameter lets you specify how much data each Spark task reads and processes as a single AWS Glue DynamicFrame partition.
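A short sketch of reading many small files with grouping enabled is shown below; the S3 path and group size are assumed values, not figures from the benchmark mentioned above.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read a large number of small JSON files from S3, grouping them so each
# Spark task reads roughly 10 MB per DynamicFrame partition.
events = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-bucket/events/"],  # placeholder path
        "groupFiles": "inPartition",
        "groupSize": "10485760",  # target group size in bytes (~10 MB)
    },
    format="json",
)
```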
5) Integrations Support
Amazon AppFlow allows you to move data between the SaaS applications you use every day and AWS services such as Amazon S3 and Amazon Redshift. AppFlow currently supports around 22 SaaS applications. It does not support importing advertising data from platforms such as Google Ads, Facebook Ads, or Twitter Ads, or data from other business applications such as Intercom, Shopify, and HubSpot, among many others.
AWS Glue natively supports data stored in MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, DynamoDB, and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases running on Amazon EC2 in your Virtual Private Cloud (Amazon VPC). AWS Glue also supports data streams from Amazon MSK, Amazon Kinesis Data Streams, and Apache Kafka.
To access data from sources that AWS Glue does not natively support, you can also write custom Scala or Python code and import custom libraries and JAR files into your AWS Glue ETL jobs. The AWS Glue Elastic Views preview currently supports Amazon DynamoDB as a source, with Amazon Aurora and Amazon RDS support on the way. Currently supported targets are Amazon Redshift, Amazon S3, and Amazon OpenSearch Service (the successor to Amazon Elasticsearch Service), with support for Amazon Aurora, Amazon RDS, and Amazon DynamoDB to follow.
6) Data Transfer Scheduling
In Amazon AppFlow, flows can be configured to run on a predefined schedule, to be triggered when an event occurs, or to run as a one-time, on-demand transfer. This is useful if your team relies on data being refreshed from source to destination on a regular basis.
In AWS Glue, you can define a time-based schedule for your crawlers and jobs using Unix-like cron syntax. Times are specified in Coordinated Universal Time (UTC), and the minimum precision of a schedule is 5 minutes.
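As a small sketch, a scheduled trigger for a Glue job can be created with boto3 using a cron expression; the trigger and job names below are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Schedule a job to run every day at 06:00 UTC using cron syntax.
# Trigger and job names are illustrative.
glue.create_trigger(
    Name="daily-orders-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "orders-etl-job"}],
    StartOnCreation=True,
)
```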
This blog discussed a few of the primary differences between Amazon AppFlow and AWS Glue in detail, along with a brief overview of the two Cloud Integration platforms from AWS.