With Big Data projects, the challenge is to clean and filter the huge amounts of information involved. It takes a lot of work to get to the point where business value can be extracted.
Over the coming year, Google will focus on releasing cloud tools and services that ease development tasks, while helping companies monitor their Big Data operations. At its I/O developer conference in June, the company unveiled a number of new products to that end.
Google Cloud Platform lets developers build, test and deploy applications on Google’s infrastructure, with computing, storage and application services for Web, mobile, or backend solutions. The platform is a set of modular cloud-based services allowing you to create anything from simple websites to complex applications.
The tech giant has introduced a cloud computing service called Google Cloud Dataflow - billed as a way of more easily moving, processing, and analysing vast amounts of digital information. According to Urs Hölzle (who oversaw the creation of Google’s global network of data centres), it’s designed to help companies deal with petabytes of data - as in, millions of gigabytes.
Dataflow is based on Google’s FlumeJava data-pipeline tool and its MillWheel stream-processing system, and is seen as the company’s answer to Amazon’s Elastic MapReduce and Kinesis, all in one package.
Batch processing is a way of crunching data already collected, while stream processing involves analysing data in near real-time as it comes off the Net. Many organisations need both types of analysis, and Cloud Dataflow puts them under one umbrella.
Designed to be relatively simple, Dataflow handles very large datasets and complex workflows. All jobs use the same code, and Dataflow automatically optimises pipelines and manages the infrastructure.
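The idea that "all jobs use the same code" can be pictured with a small sketch in plain Python. This is not the Dataflow SDK: the function names and sample data below are invented for illustration. It only shows one transform being applied, unchanged, to a bounded batch and to a generator standing in for a live stream.

```python
from typing import Iterable, Iterator


def count_long_words(lines: Iterable[str], min_len: int = 5) -> Iterator[int]:
    """Yield, for each input line, how many words have at least min_len letters."""
    for line in lines:
        yield sum(1 for word in line.split() if len(word) >= min_len)


def live_feed() -> Iterator[str]:
    """Hypothetical stand-in for an unbounded source; here it yields two lines."""
    yield "goal scored in stoppage time"
    yield "match ends"


# Batch: a bounded collection that already exists.
batch_result = list(count_long_words(["cloud dataflow pipeline", "big data"]))

# Stream: records are consumed one at a time as they arrive.
stream_result = list(count_long_words(live_feed()))
```

The transform never needs to know whether its input is finite, which is the property the article attributes to Dataflow.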
A live demo at Google I/O involved streaming World Cup data against historical information, to spot anomalies. The system could be set to automatically take actions when something was detected.
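How the demo detected anomalies was not described, so the following is only one plausible approach, assuming a simple z-score test: each live value is compared with the mean and standard deviation of the historical data. The function name, the threshold and the sample figures are all hypothetical.

```python
from statistics import mean, stdev
from typing import Iterable, Iterator, Sequence


def find_anomalies(
    history: Sequence[float],
    live: Iterable[float],
    threshold: float = 3.0,  # assumed cut-off, in standard deviations
) -> Iterator[float]:
    """Yield live values lying more than `threshold` deviations from the historical mean."""
    baseline = mean(history)
    spread = stdev(history)
    for value in live:
        if spread > 0 and abs(value - baseline) / spread > threshold:
            yield value


# Made-up numbers purely for illustration.
history = [100, 102, 98, 101, 99]
live = [101, 97, 140, 100]
anomalies = list(find_anomalies(history, live))
```

An automated action, as mentioned in the demo, could be triggered at the point where a value is yielded.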
Compute Engine and App Engine
Google sees cloud computing as a potentially enormous market, one to rival online advertising (its primary revenue source).
With Google Compute Engine (the company’s "infrastructure-as-a-service" cloud) and Google App Engine, it now offers cloud services allowing companies and independent developers to build and run large software applications. Google also revealed a number of support services.
Google Cloud Monitoring is designed to help find and fix unusual behaviour across an application stack. Based on technology from Google's recent acquisition of Stackdriver, Cloud Monitoring provides metrics, dashboards and alerts for Cloud Platform. It integrates out of the box with over a dozen popular server applications, most of them open source, including Apache, Nginx, MongoDB, MySQL, Tomcat, IIS, Redis, and Elasticsearch.
To help isolate the root cause of performance bottlenecks, Cloud Trace analyses the time spent by your application on request processing. You can also compare performance between various releases of your application using latency distributions.
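Comparing latency distributions between releases usually means comparing percentiles rather than averages. The sketch below uses only the Python standard library and is not the Cloud Trace API; the release names and latency samples are invented.

```python
from statistics import quantiles
from typing import Dict, Sequence


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return the 50th, 95th and 99th percentile of a set of request latencies."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical samples: release B has the same typical latency but a slow tail.
release_a = list(range(100, 200))
release_b = list(range(100, 190)) + [400] * 10

summary_a = latency_summary(release_a)
summary_b = latency_summary(release_b)
```

Here both releases share the same median, yet the 95th percentile exposes the regression in release B, which is the kind of difference a latency distribution makes visible.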
Cloud Debugger can be used to identify problems in production applications, without affecting their performance. It gives a full stack trace, and snapshots of all local variables for any watchpoint you set in your code - while your application runs undisturbed.
Google Cloud Save provides a simple API for saving, retrieving, and synchronising user data to the cloud and across devices, without needing to code up the backend. Data is saved in Google Cloud Datastore, making it accessible from Google App Engine or Google Compute Engine via the existing Datastore API.
Cloud Save is currently in private beta, but will be available for general use "soon".
Tooling has been added to Android Studio, simplifying the process of adding an App Engine backend to mobile apps. There are now three built-in App Engine backend module templates: Java Servlet, Java Endpoints and an App Engine backend with Google Cloud Messaging.
With Big Data analysis, timing is everything. As Greg DeMichillie, director of product management for Google's cloud team, put it, "Knowing there was a trend isn't helpful if you find out a week later." What's required is data analysis in real time - or as close to real time as you can get.
BigQuery is a way of almost instantly asking questions of massive datasets. You can bulk load data by using a job, or stream records individually.
Queries can execute asynchronously in the background, and be polled for status. Using the Google Cloud Console, you can access a history of your jobs and queries with the rest of your Cloud Platform resources.
Queries are written in BigQuery's SQL dialect, and the API offers both synchronous and asynchronous query methods. Both are handled by a job, but the "synchronous" option takes a timeout value: the call waits until the job has finished, or the timeout has elapsed, before returning.
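The difference between the two styles can be sketched without any BigQuery client at all. `FakeQueryJob` and `wait_for` below are hypothetical names, not part of any Google library; the job simply becomes ready after a fixed delay so that the waiting logic can be shown.

```python
import time
from typing import List, Optional


class FakeQueryJob:
    """Hypothetical stand-in for a query job; not a BigQuery class."""

    def __init__(self, rows: List[dict], duration_s: float) -> None:
        self._rows = rows
        self._ready_at = time.monotonic() + duration_s

    def done(self) -> bool:
        """Asynchronous style: callers poll this to check the job status."""
        return time.monotonic() >= self._ready_at

    def rows(self) -> List[dict]:
        if not self.done():
            raise RuntimeError("job still running")
        return self._rows


def wait_for(job: FakeQueryJob, timeout_s: float, poll_s: float = 0.01) -> Optional[List[dict]]:
    """Synchronous style: block until the job is done or the timeout elapses.

    Returns None if the job is still running when the timeout is reached.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if job.done():
            return job.rows()
        time.sleep(poll_s)
    return job.rows() if job.done() else None
```

A caller that gets `None` back has not lost the query: the job keeps running in the background and can be polled again later.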
There are separate interfaces for administration and developers. Access at both project and dataset levels can be controlled via the Google APIs Console.
The first 100 GB of data processed each month is free. Monthly billing will vary, but the BigQuery website has a Pricing Calculator to help get a sense of what an application running on Google Cloud Platform could cost.
Google looks to position itself as the cloud provider most dedicated to making developers’ lives easy. As with Big Data, it's automating much of the process - and exposing some of its in-house technologies along the way.
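As a rough illustration of how the free allowance affects a bill, the sketch below assumes a flat per-gigabyte rate for data processed beyond the first 100 GB. The rate is a parameter the caller must supply, because no price is quoted here; the figure used in the example is a placeholder, not Google's price, and real billing may include other charges.

```python
FREE_GB_PER_MONTH = 100  # free monthly allowance quoted in the article


def monthly_query_cost(gb_processed: float, price_per_gb: float) -> float:
    """Estimate a month's query charge under an assumed flat per-GB rate."""
    billable_gb = max(0.0, gb_processed - FREE_GB_PER_MONTH)
    return billable_gb * price_per_gb


# Placeholder rate, chosen only to make the arithmetic easy to follow.
example_rate = 0.01
light_month = monthly_query_cost(80, example_rate)   # within the free allowance
heavy_month = monthly_query_cost(350, example_rate)  # 250 GB billable
```

For real estimates, the Pricing Calculator remains the authoritative tool.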