A Journey Toward Real-time Insight Generation and Decision Automation
- by Amar Patel
- March 15, 2018
- cloud migration • sql server • data warehouse • etl
- 9 minutes to read.
Over the last 12 months, I’ve spoken with numerous analytics leads at large and small enterprises across numerous sectors; many of these leads are responsible for creating reports, analytical insights, and management information for their products and services at national and international levels.
Despite their differences, there are common issues and themes that surface time and time again. Obviously, making the best use of the growing data at hand is a key challenge; this includes determining the right tools and processes needed to remain competitive within noisy and increasingly disrupted sectors. One retail organization I spoke to was consolidating its organizational transformation program, as well as using best-of-breed digital partnerships for online commerce and automated logistics for home delivery; on the surface, this particular organization certainly seemed like it was actively preparing to remain competitive.
However, competition is intense, and that intensity is likely to increase by at least an order of magnitude as the penetration of players like Amazon (Go, Fresh, and Whole Foods) grows. April 2017 - Amazon Strategy Teardown: Building New Business Pillars In AI, Next-Gen Logistics, And Enterprise Cloud Apps
Retail is not the only industry facing this level of competition, fragmentation, and disruption; we’re seeing it across finance, manufacturing, transport, construction, and others.
At the moment, many of these organizations’ current technologies allow them to generate insights and reporting on a weekly, monthly, and annual basis. They are very much working in a batch-processing world, with traditional ETL (extract, transform, load) pipelines and no apparent plans to introduce real-time or shorter-horizon insights.
For us, the conversation always gets interesting when we start outlining what a real-time world might look like to them.
Why should real-time matter?
Simply put, it’s something that all enterprises, not just those in retail, cannot afford to ignore. Let’s put the Amazon effect to one side, though just for a brief moment; judging by the sheer number of “Amazon” mentions in earnings calls alone, everyone should have Amazon at least in their peripheral vision.
It’s clear to us at 47 Degrees that many of our clients are pursuing real-time data and insights agendas; for example, we’re helping them power B2C consumer rewards and notification programs across retail, finance, and gaming.
Consumers, and increasingly businesses, are demanding an experience that gives each of them exactly what they need, when they need it, and nothing more or less. In our opinion, they are more likely to transact with the brands that deliver it.
In essence, by not leveraging real-time data processing and insight generation, you limit your agility and ability to respond, and with them your likelihood of any kind of outperformance - see the quote below from McKinsey. 4 Mar 2017 - Capturing value from your customer data
Many organizations continue their analytics work in spreadsheets, populated by back-end SQL databases that are themselves populated by ETL processes (automated and manual) from various back-end systems.
Given the above metric from McKinsey’s survey, while many enterprises currently use data to drive effective decision making, they are leaving money on the table. Ultimately, we are talking about operating profit left on the table simply by not having a real-time data strategy in place, not to mention the reduced likelihood of other non-Pareto gains through machine learning, real-time targeting, fraud detection, and much more.
So, how do you begin on a journey to all this good stuff? Well, here are some pointers, most of which should be common sense:
- Move from batch systems and processes to micro-batch and real-time systems and processes, for a shorter time to insight, or near real-time insight.
- Remove processing bottlenecks (many of which probably live within the spreadsheet ecosystem) and introduce more automation, freeing up teams to spend more time experimenting with data.
- Formalize a data lifecycle at national, group, and department levels, to help standardize data operations and management.
- Start a variety of POCs (proofs of concept) across the data lifecycle stages to identify and confirm return on investment.
- Leverage elastic compute and storage in the cloud for infrastructure cost and time savings - there’s really no point in trying to compete with AWS, Microsoft, or Google.
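The first pointer, moving from batch to micro-batch, can be illustrated with a minimal sketch: rather than processing a whole day’s records in one job, records are grouped into small time windows and processed as each window closes. The five-minute window and the event shape here are illustrative assumptions, not part of any particular product.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # illustrative micro-batch window size

def window_start(ts, window=WINDOW):
    """Truncate a timestamp to the start of its micro-batch window."""
    seconds = int(window.total_seconds())
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % seconds)

def micro_batches(events, window=WINDOW):
    """Group (timestamp, record) events into micro-batches keyed by window start."""
    batches = defaultdict(list)
    for ts, record in events:
        batches[window_start(ts, window)].append(record)
    return dict(batches)

# Three sales events: the first two fall into the same 5-minute window.
events = [
    (datetime(2018, 3, 15, 9, 1), {"sku": "A", "qty": 2}),
    (datetime(2018, 3, 15, 9, 4), {"sku": "B", "qty": 1}),
    (datetime(2018, 3, 15, 9, 7), {"sku": "A", "qty": 5}),
]
batches = micro_batches(events)
```

Each micro-batch can then be fed to the existing transformation logic, which is what makes this an incremental step rather than a rewrite.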
When thinking about the points above, it also helps to think of data as a product that is created, has a lifecycle, and is subject to a variety of workloads and transformations, much like any other internal product or service in your organization.
Many enterprises still run their own dedicated infrastructure or co-location; simply migrating their ETL, batch processing, and data storage to the cloud gives them a flexible compute and storage capability that surpasses anything they could achieve themselves. Here’s an example of what a standard architecture could look like on the Google Cloud Platform (GCP):
This, in turn, enables them to start moving from large-scale batch workloads to micro-batch workloads as their back-end infrastructure modernizes over time. It also gives them the ability to carry out data processing and transformation upon ingest (Cloud Dataflow is a fully-managed stream and batch data processing service based on Apache Beam).
Over time, as back-end systems allow, the retailer would be able to take advantage of real-time ingest (via Cloud Pub/Sub), and continue to leverage the processing and storage capability in the same data warehouse as the ETL data.
The great thing about the GCP stack is that it is a highly modular set of fully-managed services, so any enterprise could initially migrate ETL, data processing, and visualization to the cloud for infrastructure cost savings. Over time, and as capability within the enterprise grows (technology and human), it could leverage real-time data streaming into the same stack. Here are some key features and products to highlight:
Cloud Pub/Sub
Pub/Sub can be seen as message-oriented middleware in the cloud, providing a number of high-utility use cases - in our case, asynchronous workflows and data streaming from various processes, devices, or systems. It enables the real-time ingest of information for processing and analysis.
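As a minimal sketch of the "data streaming from various processes" use case, the snippet below serializes a back-end event as JSON and publishes it to a Pub/Sub topic using the google-cloud-pubsub client library. The project, topic, and event fields are hypothetical names for illustration.

```python
import json
from datetime import datetime, timezone

def encode_event(source, payload):
    """Serialize a back-end event as UTF-8 JSON, a typical Pub/Sub message body."""
    return json.dumps({
        "source": source,
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }).encode("utf-8")

def publish_event(project_id, topic_id, source, payload):
    """Publish one event to a Cloud Pub/Sub topic (requires google-cloud-pubsub)."""
    from google.cloud import pubsub_v1  # imported lazily; needs GCP credentials
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, encode_event(source, payload))
    return future.result()  # blocks until the server-assigned message ID arrives

# Usage (against a real project):
# publish_event("my-retail-project", "pos-events", "store-042", {"sku": "A", "qty": 2})
```

A downstream subscriber (for example, a Dataflow pipeline) would decode the same JSON envelope, which is why the serialization is kept in its own function.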
Google Cloud Storage (GCS) and Storage Transfer Service
GCS is a unified object store and simply acts as a staging point for data being loaded by various back-end systems before processing. This staging point creates an opportunity to ensure data can be standardized before downstream services start processing.
The transfer service (not shown in the architecture) provides a mechanism to drop data into cloud storage - for example, one-time transfers, recurring transfers, and periodic synchronization between data sources and data sinks. It removes the headache of manually managing batch loads, adding to the level of automation between internal and cloud-based processing.
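To make the staging-point idea concrete, here is a small sketch using the google-cloud-storage client library: a date-partitioned path convention standardizes where each back-end system drops its extracts. The bucket name, path layout, and system/dataset names are illustrative assumptions.

```python
from datetime import date

def staging_path(system, dataset, day, filename):
    """Build a conventional date-partitioned staging path within a GCS bucket."""
    return "staging/{}/{}/{:%Y/%m/%d}/{}".format(system, dataset, day, filename)

def upload_to_staging(bucket_name, system, dataset, day, local_file, filename):
    """Upload a local extract into the staging area (requires google-cloud-storage)."""
    from google.cloud import storage  # imported lazily; needs GCP credentials
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(staging_path(system, dataset, day, filename))
    blob.upload_from_filename(local_file)

# Usage (against a real bucket):
# upload_to_staging("acme-data-staging", "erp", "orders", date.today(),
#                   "/tmp/orders.csv", "orders.csv")
```

Because every system writes to the same predictable layout, downstream services can discover a day’s files by prefix instead of being told about each load.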
Stackdriver
While not critical to the operation of data processing and analysis, Stackdriver (not shown in the architecture) allows everyone to sleep better at night by providing very powerful monitoring, logging, and diagnostics, ensuring all the data processing workloads and any downstream applications are healthy and performing optimally. Given that monitoring can also be embedded into your own infrastructure, it provides a holistic view of the data supply chains within your business.
Cloud Dataflow
Cloud Dataflow is a service (based on Apache Beam) for transforming and enriching data in stream (real-time) and batch (historical) modes with equal reliability and expressiveness. It provides a unified programming model and a managed service for developing and executing a wide variety of data processing patterns, including ETL. Cloud Dataflow unlocks transformational use cases across industries, including:
- Clickstream, Point-of-Sale, and segmentation analysis in retail
- Fraud detection in financial services
- Personalized user experience in gaming
- IoT analytics in manufacturing, healthcare, and logistics
Google contributed the Cloud Dataflow programming model and SDKs to the Apache Software Foundation, thus giving birth to the Apache Beam project, which is fast becoming a de facto tool in the data processing space. Jan 2018 - Apache Beam lowers barriers to entry for big data processing technologies
Google BigQuery
Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets. It is effectively a fully-managed data warehouse in the cloud. BigQuery was designed for analyzing data on the order of billions of rows, using a SQL-like syntax. It runs on Google’s cloud infrastructure and can be accessed with a REST-oriented API.
Data from BigQuery can be consumed by various applications for regular or ad hoc workloads - e.g., end-of-day reporting using tools like Cloud Dataproc (GCP’s fully-managed Hadoop and Spark service), reporting in Data Studio (or other BI / visualization tools), or emerging data science initiatives using Cloud Datalab (which is built on the open-source Jupyter core).
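As a sketch of the end-of-day reporting workload, the snippet below runs a parameterized aggregation through the google-cloud-bigquery client library. The project, dataset, table, and column names are hypothetical, and the exact client-library surface may differ between versions.

```python
# Parameterized end-of-day aggregation; table and column names are illustrative.
END_OF_DAY_SALES_SQL = """
    SELECT sku, SUM(qty) AS units
    FROM `{project}.{dataset}.sales`
    WHERE DATE(sold_at) = @day
    GROUP BY sku
    ORDER BY units DESC
"""

def end_of_day_report(project_id, dataset, day):
    """Run the end-of-day aggregation in BigQuery (requires google-cloud-bigquery)."""
    from google.cloud import bigquery  # imported lazily; needs GCP credentials
    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("day", "DATE", day)]
    )
    sql = END_OF_DAY_SALES_SQL.format(project=project_id, dataset=dataset)
    return list(client.query(sql, job_config=job_config).result())

# Usage (against a real project):
# rows = end_of_day_report("my-retail-project", "warehouse", "2018-03-15")
```

The query parameter keeps the report date out of the SQL string itself, so the same statement can be scheduled daily or re-run ad hoc for any historical day.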
Alongside a fully managed service, there are a couple of benefits to Google’s approach to infrastructure. Firstly, many GCP services have an open-source core, giving every customer the freedom to migrate off GCP at any time, onto any other infrastructure of their choice (though they would likely need to hire a small army of DevOps engineers and CREs to manage it). Secondly, GCP includes a free tier; for example, BigQuery has two free allowances: one for storage (10 GB) and one for analysis (1 TB of queries per month), encouraging its use for prototyping and testing.
We’re excited by the sheer variety of data innovation options ahead of many of our clients and prospects, some of which could drive the next 5-10 years of landmark growth for them - not just in terms of data processing and visualization, but also in terms of creating real-time data supply chains at the heart of their businesses. Ultimately, that is our mission.
If you’re thinking you need help in any of this, we’d be happy to explore the options relevant to your business.