As AI and analytics become table stakes for businesses, there is a need to manage and use many different sources of data. These sources can be of many kinds: structured, semi-structured, and unstructured. As I outlined in my blog earlier this year, one of the top challenges faced by data scientists is getting high-quality data at the right time so that they can spend their time on analysis and insights rather than on data management.
AWS Glue comes as a savior. At its core, it is a fully managed cloud ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. Beyond that, I feel it can be a powerful tool for accelerating your strategic data initiatives.
In this blog we’ll look at when to use it, and what to think about before implementing it.
What is AWS Glue?
Enterprises often sit on mines of data without fully extracting the monetary benefits of using the resulting insights for operational improvements, customer experience, and revenue growth.
On average, almost 70% of a data scientist’s time is spent cleaning incoming data, aggregating the sources, and making them ready to be used for analysis.
This is where AWS Glue comes in. It helps you monetize your data debt, and sets up the roadmap for higher overall organizational maturity.
At the highest level, AWS Glue has three core components: a data catalog, ETL processing, and scheduling. Thinking about the architecture this way naturally shifts the focus to outcomes rather than the procedural aspects of data management. We think this is a very nice soft benefit.
The AWS Glue Data Catalog is used to track all data, data locations, indexes, run metrics, and schemas. It makes the data accessible by converting unstructured data into schema-oriented table formats. It uses “crawlers” and “classifiers” to help identify the data schema and create it in the Data Catalog. AWS Glue crawlers can crawl both file-based (e.g., unstructured) and table-based (structured) data stores via Java Database Connectivity (JDBC) or native interfaces. Glue can securely connect to multiple data sources and saves time by removing the need to reconnect every time.
Together, the three components significantly reduce manual effort by automating the entire data discovery and aggregation process. Consequently, they also reduce the probability of errors.
The second major benefit is that AWS Glue is serverless. This means that there is no infrastructure to set up or manage, which in turn makes it faster to go from data to insights.
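To make the crawler idea more concrete, here is a minimal sketch of how a crawler definition covering both a file-based and a JDBC data store might be assembled in Python. The helper name, bucket, role ARN, database, and connection name are all hypothetical; the dictionary keys follow the parameters of boto3's `create_crawler` call as we understand them, and the actual call is left as a comment because it needs AWS credentials.

```python
def build_crawler_config(name, role_arn, database, s3_paths,
                         jdbc_targets=(), schedule=None):
    """Assemble keyword arguments for boto3's glue.create_crawler.

    All names passed in are illustrative assumptions, not values
    from a real account.
    """
    # File-based stores: one S3 target per folder to crawl.
    targets = {"S3Targets": [{"Path": path} for path in s3_paths]}

    # Table-based stores: each JDBC target refers to a Glue connection
    # that must already exist in the account.
    if jdbc_targets:
        targets["JdbcTargets"] = [
            {"ConnectionName": connection, "Path": path}
            for connection, path in jdbc_targets
        ]

    config = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": targets,
    }
    if schedule:
        config["Schedule"] = schedule  # Glue cron expression
    return config


crawler_config = build_crawler_config(
    name="customer-data-crawler",                       # hypothetical
    role_arn="arn:aws:iam::123456789012:role/GlueRole", # hypothetical
    database="analytics_catalog",                       # hypothetical
    s3_paths=["s3://example-bucket/customers/"],        # hypothetical
    jdbc_targets=[("example-jdbc-connection", "salesdb/%")],
    schedule="cron(0 2 * * ? *)",                       # daily at 02:00 UTC
)

# To create the crawler for real (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_crawler(**crawler_config)
```

Keeping the configuration as plain data makes it easy to review and version before anything is created in the account.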
Illustrative Case Study
On a recent project, the operations team was facing challenges in getting their key customer analytics reports out on time. There was a significant level of manual effort needed for data pipelining (integration) from multiple sources. The data had to undergo several levels of adjustment to be ready for analysis. Inconsistencies between data sources had to be resolved very frequently. This overhead was affecting reporting and business intelligence latencies negatively. As a result, the client team was encountering challenges in their ability to engage their customers in data-driven performance conversations to create the right strategies for the next customer marketing cycle.
To streamline the business reporting and analytics process, we used Glue as a supporting tool alongside the regular process flow. Multiple files containing millions of records were processed every cycle. Some of the data was structured (e.g., customer data), some was semi-structured (e.g., advertising information), and the rest was unstructured (e.g., user-generated content).
An important point to note is that we masked the data (using PySpark) that was processed by Glue to maintain compliance and reduce risk. Processing was in memory, data was access restricted, and only masked data was allowed as output. We customized the sorting and filtering in the business reports and other output in a way that no sensitive information was compromised.
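As an illustration of the masking idea, the sketch below replaces sensitive fields with a keyed SHA-256 digest before a record is written out. It is a plain-Python stand-in and not the project's actual PySpark code; the field names and the secret key are hypothetical. In a PySpark job, a comparable effect could likely be achieved with the built-in `pyspark.sql.functions.sha2` column function.

```python
import hashlib
import hmac

# Hypothetical key; in practice this would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def mask_value(value, key=SECRET_KEY):
    """Return a keyed SHA-256 hex digest of the value.

    The same input always maps to the same digest, so masked columns
    can still be joined or counted without exposing the raw value.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record, sensitive_fields):
    """Return a copy of the record with sensitive fields masked."""
    masked = {}
    for field, value in record.items():
        if field in sensitive_fields and value is not None:
            masked[field] = mask_value(str(value))
        else:
            masked[field] = value
    return masked


raw_record = {"customer_id": 42, "email": "jane@example.com", "region": "EMEA"}
masked_record = mask_record(raw_record, sensitive_fields={"email"})
```

Only `masked_record` would be allowed downstream; `raw_record` stays inside the processing step.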
The data sources were segregated into specific folders. Then the AWS Glue crawlers helped us crawl the files and structure the data properly. Glue split the jobs and ran them in parallel so we could truly leverage the power of the cloud. With the latest Glue updates, we also now get a much better UI and faster processing.
Since AWS Glue is serverless, we were able to start quickly with a low-cost, pay-as-you-go option. One of the key benefits was the trade-off between time and cost. We went from a largely manual process to an automated one, which was a huge time saving.
Finally, there are inherent benefits of a cloud-based data infrastructure over the traditional approach. We were able to store and process very high volumes of data, which otherwise would have required significant scalability management overhead for on-premises infrastructure.
What to watch out for?
One consideration is that as you adopt any cloud-based solution, costs can quickly rise. One of the ways we were able to contain that was by being conscious of what we needed. For example, for structured input we used a limited set of AWS Glue features, such as crawlers that looked at the incoming file formats and validated them on an automated schedule. Since we had a lot of files, we also found it very useful to use crawlers to help redesign the files.
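A rough back-of-the-envelope estimate can help keep costs visible before a job is scheduled. The sketch below assumes Glue jobs are billed per DPU-hour (data processing unit); the default rate of 0.44 is a placeholder assumption that varies by region and over time, and the function ignores minimum billing durations, so treat the result as an approximation and check the current AWS pricing page.

```python
def estimate_glue_job_cost(dpus, runtime_minutes,
                           rate_per_dpu_hour=0.44, runs_per_month=1):
    """Estimate the monthly cost of a recurring Glue job.

    rate_per_dpu_hour is an assumed placeholder, not a quoted price.
    Minimum billing durations are deliberately ignored.
    """
    hours_per_run = runtime_minutes / 60
    monthly_cost = dpus * hours_per_run * rate_per_dpu_hour * runs_per_month
    return round(monthly_cost, 2)


# Hypothetical job: 10 DPUs, 30 minutes per run, once a day for 30 days.
monthly_estimate = estimate_glue_job_cost(
    dpus=10, runtime_minutes=30, runs_per_month=30
)
```

Running the same numbers for a few alternative configurations (fewer DPUs, less frequent runs) makes the time-versus-cost trade-off explicit.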
We also ran a lot of processing outside the Glue environment using Python because the processing within a managed cloud service can quickly get expensive.
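As one illustration of the kind of lightweight step that can run in plain Python outside Glue, the sketch below checks an incoming CSV file's header against the columns a report expects. The column names and sample data are hypothetical, and this is not the project's actual code.

```python
import csv
import io


def validate_header(csv_text, expected_columns):
    """Compare a CSV header row against the expected column names."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader, [])]

    missing = [col for col in expected_columns if col not in header]
    unexpected = [col for col in header if col not in expected_columns]
    return {
        "ok": not missing and not unexpected,
        "missing": missing,
        "unexpected": unexpected,
    }


# Hypothetical incoming file and expected schema.
sample_csv = "customer_id,email,signup_date\n1,jane@example.com,2020-01-01\n"
header_check = validate_header(sample_csv, ["customer_id", "email", "region"])
```

Files that fail the check can be set aside before any paid processing is triggered.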
Finally, as with any technology tool, sound knowledge and real-life experience are required to maximize the benefits you can get from AWS Glue. Creating a robust data management strategy with a top-down approach based on design thinking principles is very helpful. It allows you to link the enterprise rollout of AWS Glue to the associated business benefits at every stage.
We recommend that you give AWS Glue a try. It can help you get started quickly with your data management efforts, and ROI can be realized in as little as 6-8 weeks.
We can help you run a pilot, and conduct a discovery workshop to create a roadmap that matches your business goals.
Take our data strategy assessment to see where you are on your data management maturity.