How to Move Big Data in the Cloud

By Sarah Lahav

Large cloud service providers offer seemingly infinite capacity and elasticity, and they are increasingly moving beyond infrastructure into Big Data services. As a result, it’s now more appealing than ever for enterprises to consider swapping their on-premises Big Data capabilities for cloud-based data services.

But how do companies move their Big Data initiatives to the cloud? Firstly, there are two main considerations to take into account:

  • Working out how and where the processing will run
  • Working out how to get your data close to that processing capacity, so that ideally both application and data are in the cloud (barring any data-sensitivity issues)

There are three common options for processing:

1. Roll your own

If you have specialist Big Data needs and a team of expert data scientists and Big Data administrators, then you might choose to create your own Big Data implementation in the cloud. In AWS, for example, this could mean deploying your own virtual machines, installing the software, and connecting the network and data stores. This takes time and specialist skills to deploy and manage, so it should be reserved for special cases: it’s the costliest and slowest way to do Big Data in the cloud.

2. Use a specialist provider on top of the cloud

You can use a Big Data specialist that runs on top of the cloud, such as Cloudera or Hortonworks, to help you build and run your cloud-based Big Data implementation. These providers offer tooling and support across the Big Data lifecycle. They essentially speed up the “roll your own” approach, so that you get there faster than you could by yourself.

3. Use the cloud service provider’s data services

If, like most of the corporate world, you want all the upside of Big Data without the deployment downsides, then you can immediately “pass go” by using the cloud service providers’ own data solutions, such as AWS Lambda and Azure Machine Learning. This is typically the cheapest and fastest way to do Big Data in the cloud.

Then there is the issue of the distance between the processing and the data.

Beware of Data Gravity

Dave McCrory, the founder of four virtualization/cloud startups, once described “data gravity” as a restrictive force that gets stronger as data gets larger. This force makes data hard to move, in both time and cost, and it also attracts applications, which increase the gravity further. Big Data is ultimately about applications working on very large datasets, so data gravity is a very pertinent issue.

If you have collected large datasets on-premises and you need them to be analyzed by a cloud-based Big Data system, then you have to somehow connect the cloud application to your local data. Thankfully, this has become easier with dedicated fiber connections between the client network and the cloud service provider. For example, an Azure ExpressRoute connection is a good way both to increase security and to get sufficient bandwidth to transfer large local datasets to the cloud.

Trying to accomplish this over a “normal” Internet connection to the cloud is unlikely to be practical because of latency and bandwidth constraints, and doing it over a corporate WAN can also be extremely costly. Another alternative is to ship storage media to the cloud provider, using facilities such as AWS Import/Export.
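To see why bandwidth matters so much, a rough back-of-the-envelope calculation helps. The sketch below is illustrative only: the function names are hypothetical, terabytes are treated as decimal (10^12 bytes), and the 80% link efficiency is an assumed figure rather than a measured one. Real transfers will also be affected by latency, protocol overhead, and competing traffic.

```python
def transfer_days(dataset_tb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the days needed to move dataset_tb terabytes over a link.

    link_mbps is the nominal link speed in megabits per second.
    efficiency is the assumed fraction of that speed actually sustained.
    """
    if dataset_tb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("expected dataset_tb >= 0, link_mbps > 0, 0 < efficiency <= 1")
    total_bits = dataset_tb * 1e12 * 8            # decimal terabytes to bits
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    seconds = total_bits / usable_bits_per_second
    return seconds / 86_400                       # seconds in a day


def faster_to_ship(dataset_tb: float, link_mbps: float, shipping_days: float,
                   efficiency: float = 0.8) -> bool:
    """Return True if shipping storage media would beat the network transfer."""
    return transfer_days(dataset_tb, link_mbps, efficiency) > shipping_days


if __name__ == "__main__":
    # Compare a 100 TB dataset across three nominal link speeds.
    for mbps in (100, 1_000, 10_000):
        days = transfer_days(100, mbps)
        print(f"{mbps:>6} Mbps: {days:.1f} days for 100 TB")
```

Under these assumptions, 100 TB takes about 116 days at 100 Mbps and a little over one day at 10 Gbps, which is why a dedicated high-bandwidth connection or shipped storage media becomes attractive as datasets grow.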

Advice on Getting Started

If you are just getting started with Big Data in the cloud, make your first steps easier by using a cloud service provider’s data services. If you later outgrow that service and have specialist needs to fulfill, consider seeking expert advice on how to meet them.

Sarah Lahav

SysAid Technologies’ first employee, Sarah is now CEO and has been a vital link between SysAid and its customers since 2003. As CEO, she takes a hands-on role in evolving SysAid to meet the dynamic needs of service managers. Previously, Sarah was VP Customer Relations at SysAid and developed SysAid’s Certification Training program, advancing the teaching methods and training technology that are in place today. Sarah holds a B.Sc. in Industrial Engineering, specializing in Information Technology, from The Open University in Israel, and spends her free time with her three beautiful children.

 

Timothy King

Editor, Data and Analytics at Solutions Review
Timothy leads Solutions Review's Business Intelligence, Data Integration and Data Management areas of focus. He is recognized as one of the top authorities in Big Data, and the number-one authority in enterprise middleware. Timothy has also been named one of the world's top-75 most influential business journalists by Richtopia.