Evaluating Apache Cassandra as a Cloud Database [PDF]

The amount of information residing only in the cloud is relatively small today, especially compared to what it will be in less than two years, when a predicted 10 percent of all data will be maintained in a cloud. Moving to a cloud-based infrastructure requires choosing a database that can fully use all the benefits the cloud provides: transparent elasticity, transparent scalability, high availability, strong security, easy data distribution, data redundancy, support for all data formats, simple manageability, and low cost.

The amount of information that currently resides only in the cloud is small, but that’s about to change. A recent study by IT industry analyst group IDC estimates that cloud computing accounts for less than 2 percent of IT spending today, but by 2015, nearly 20 percent of all information will be “touched” (stored or processed) in a cloud. Moreover, IDC predicts that by that same year, as much as 10 percent of all data will be maintained in a cloud.

Despite the growing movement toward cloud computing, some IT professionals remain standoffish toward the idea of porting a company’s data onto a public cloud computing platform such as Amazon, Rackspace, and others. This position is understandable, given the current confusion over whether running a database in a cloud environment actually delivers tangible benefits – technical and otherwise – over keeping that same data on-premise.

Whether deciding to move a small or significant amount of data to a cloud database, today’s IT decision-makers need to understand whether the solution they’re considering is designed and/or implemented in a way that utilizes all the benefits and promises of cloud computing. This paper examines those key characteristics and discusses how Apache Cassandra™ stacks up from an evaluation perspective.

Why Move to a Cloud Database?

First, it should be understood that a cloud database is more than simply taking traditional relational database management system (RDBMS) software and running an instance of it on a cloud platform such as Amazon. Such a deployment in no way maximizes the capabilities of a cloud-computing environment.

But what constitutes a cloud-ready database? What features and functionalities must the database have to deliver on the potential that cloud computing offers? What follows is a discussion of some of the key promises of the cloud and the types of features a database should have to supply real benefits in a cloud environment.

  • The Cloud Promises Transparent Elasticity
  • The Cloud Promises Transparent Scalability
  • The Cloud Promises High Availability
  • The Cloud Promises Easy Data Distribution
  • The Cloud Promises Redundancy
  • The Cloud Promises Support for All Data Types
  • The Cloud Promises Easier Manageability
  • The Cloud Promises Lower Cost

What Is Apache Cassandra?

Apache Cassandra is a highly scalable and high-performance distributed database management system that excels at being a real-time datastore (i.e., the “system of record”) for online/transactional applications that need extremely fast read and write operations. Cassandra can manage the distribution of data across multiple data centers and offers incremental scalability with no single point of failure.

Cassandra was originally incubated at Facebook and is based upon Google’s BigTable and Amazon’s Dynamo software. The end result is an extremely scalable and fault-tolerant data infrastructure that solves both small and big data problems, handles write-intensive user traffic, delivers sub-millisecond caching layer reads, and supports demanding workloads involving petabytes of data.

Why Cassandra?

Cassandra is built with the assumption that failures can and will occur in a data center or cloud infrastructure. Therefore, data redundancy to protect against hardware failure and other data loss scenarios is built into and managed transparently by Cassandra. Furthermore, this capability can be configured so that big data applications can use a single large database distributed across multiple, geographically dispersed data centers, between different physical racks in a data center, and between public cloud providers and on-premise managed data centers.

Download Evaluating Apache Cassandra as a Cloud Database [PDF]

[Free Cloud Service] Cloud Database and Open Source DBs

Have you ever found yourself spending precious time on installing, connecting and configuring a database, or other supporting systems used by your application? Here are a bunch of cloud database services, open source databases, CMSs and payment gateways that are free and ready to ride on the cloud.

Xeround

Xeround is a service that replaces your existing MySQL database and provides seamless MySQL scalability and high availability. Xeround runs in the cloud, allowing you to scale your database up or out automatically, and maintain availability even in the event of failure. You can run your database on Amazon Web Services and on Rackspace, on HP Cloud Services, as well as via the Heroku, Engine Yard, PHP Fog and AppHarbor platforms. Additional cloud service providers are being added on an ongoing basis.

FREE Cloud Database

This is the “Do-it-Yourself” approach – setting up a cloud instance on your chosen cloud service provider and then installing and configuring a database on it. The following open source databases are ready to deploy on a cloud virtual machine.

MySQL

MySQL is the world’s most popular open source database. Whether you are a fast growing web property, technology ISV or large enterprise, MySQL can cost-effectively help you deliver high performance, scalable database applications.

MySQL Downloads (Generally Available)

Running MySQL on Amazon EC2 with EBS (Elastic Block Store)

CouchDB

Apache CouchDB™ is a database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API. CouchDB is a database that completely embraces the web. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript. CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of CouchDB. And you can distribute your data, or your apps, efficiently using CouchDB’s incremental replication. CouchDB supports master-master setups with automatic conflict detection.

MongoDB

MongoDB (from “humongous”) is a scalable, high-performance, open source NoSQL database written in C++.

MongoDB is a document database that provides high performance, high availability, and easy scalability.

  • Document Database
    • Documents (objects) map nicely to programming language data types.
    • Embedded documents and arrays reduce need for joins.
    • Dynamic schema makes polymorphism easier.
  • High Performance
    • Embedding makes reads and writes fast.
    • Indexes can include keys from embedded documents and arrays.
    • Optional streaming writes (no acknowledgments).
  • High Availability
    • Replicated servers with automatic master failover.
  • Easy Scalability
    • Automatic sharding distributes collection data across machines.
    • Eventually-consistent reads can be distributed over replicated servers.

Download MongoDB

Setting Up MongoDB on Amazon EC2

Cassandra

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

Download Cassandra

Setting Up Cassandra on Amazon EC2 

Neo4j

Neo4j is a robust (fully ACID) transactional property graph database: a high-performance NoSQL graph database with all the features of a mature and robust database. Due to its graph data model, Neo4j is highly agile and fast for connected data operations. The programmer works with an object-oriented, flexible network structure rather than with strict and static tables, yet enjoys all the benefits of a fully transactional, enterprise-strength database. For many applications that work with connected data, Neo4j is said to offer performance improvements on the order of 1000x or more compared to relational databases.

Setting Up Neo4j on Amazon EC2

Hadoop

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Setting Up Hadoop on Amazon EC2

Why NoSQL? [PDF]

By all accounts, the consensus of IT professionals and industry database experts seems to be that NoSQL is here to stay. A recent study performed by a media firm on NoSQL market growth forecasts a very strong compound annual growth rate (CAGR) of 21 percent for NoSQL technology from 2013-2018. Such growth and increased adoption have prompted one technology writer to go so far as to say: “NoSQL is the stuff of the Internet age.”

The term “NoSQL” is sometimes misused and abused by various software vendors and technology professionals. In general, NoSQL refers to progressive data management engines that go beyond legacy relational databases in satisfying the needs of today’s modern business applications. A very flexible data model, horizontal scalability, distributed architectures, and the use of languages and interfaces that are “not only” SQL typically characterize NoSQL technology.

While what defines NoSQL databases has been much more clearly articulated today than just a few years ago, what still puzzles some IT professionals is when and why NoSQL databases should be used. When will a traditional RDBMS suffice for an application and when is a NoSQL database more appropriate?

This paper discusses six of the most common reasons NoSQL databases are being deployed today, and highlights how Apache Cassandra™ and DataStax Enterprise fulfill those use cases.

Why NoSQL? [PDF]

FreeNAS: Open Source Storage Platform

FreeNAS is an Open Source Storage Platform based on FreeBSD and supports sharing across Windows, Apple, and UNIX-like systems. FreeNAS 8 includes ZFS, which supports high storage capacities and integrates file systems and volume management into a single piece of software.

FreeNAS can look like a Windows server or an iSCSI target, among other server types. It’s managed by a Web interface that’s more intuitive than some commercial storage appliances we’ve used. And FreeNAS offers the innovative ZFS file system, with built-in integrity checks, flexible and virtually unlimited scalability, and good performance.

Download: FreeNAS 8.3.1-RC1

Top 10 Cloud Job Titles

  1. Cloud Architect
  2. Cloud Software Engineer
  3. Cloud Sales Executive
  4. Cloud Engineer
  5. Cloud Developer
  6. Cloud Systems Administrator
  7. Cloud Consultant
  8. Cloud Systems Engineer
  9. Cloud Network Engineer
  10. Cloud Product Manager

via Dice.com

Cloud architect

Job description: Spearhead the development and implementation of cloud-based initiatives to ensure that systems are scalable, reliable, secure, supportable, and achieve business and IT performance and budgetary objectives.

Required credentials: B.S. in computer science or engineering; 10+ years experience in large-scale, multi-platform networks; expert in Shell, VBScript, Perl or Python; expert knowledge of Linux and Windows; significant experience designing, installing and administrating virtualized environments.

Requested credentials: Experience working with public cloud providers; expert understanding of firewall and load balancing concepts; prior work creating PCI-compliant solutions.

Cloud software engineer

Job description: Responsible for design and development of distributed software modules that integrate with cloud service providers.

Required credentials: B.S. in computer science or engineering; 2+ years professional experience in software development; work experience with ETL (Extract-Transform-Load) tools and techniques; work experience with system configuration and deployment automation technologies; hands-on programming experience on a Linux/Unix operating system; excellent understanding of at least one compiled-code language.

Requested credentials: Experience in deploying software to cloud computing infrastructure; experience in SOA technologies; ability to provide accurate ETA for software modules.

Cloud sales: cloud sales executive, cloud sales representative, cloud sales consultant, cloud sales manager

Job description: Develop and grow a book of outsourced cloud business with C-level professionals in midsize and enterprise-level customers.

Required credentials: Bachelor’s degree in business administration and 5-10 years business experience in client-facing roles, with some of that spent in outsourcing or systems integration; highly effective communication skills; strong understanding and successful experience in building strategic and/or developmental partnerships at the C-level within midsize and large corporations; demonstrated consistent quota attainment in selling infrastructure, IT, cloud and security services.

Requested credentials: Ability to travel more than 50% of the time on the job.

Cloud engineer

Job description: Plan and conduct technical tasks associated with the implementation and maintenance of internal enterprise-shared virtualization infrastructure.

Required credentials: B.S. in computer science; 5+ years of implementation experience with highly virtualized shared infrastructure, platforms or applications architecture at a large enterprise or service provider.

Requested credentials: Vendor-specific virtualization certification such as VMware Certified Professional.

Cloud services developer

Job description: Design and build the multi-platform customer-facing tools — such as sales interfaces and management portals — that serve as the gateway into how end users consume the underlying cloud services.

Required credentials: B.S. in computer science or computer engineering; 5 years of experience with cloud architecture and design; 5 years of experience architecting and deploying Web services on SOA platforms (examples: Amazon EC2, Heroku, Azure, Rackspace); 5 years of experience with PHP, Python, Java, or C++ with software development methodologies like Agile.

Cloud systems administrator

Job description: Configure and maintain the systems that comprise the underlying cloud platform. Troubleshoot when problems arise and plan for future cloud capacity requirements.

Required credentials: B.S. in computer science or computer engineering; 3 years of experience in operating system administration; 3 years of experience in supporting enterprise-level platform installations; strong Linux command-line skills; experience in performance monitoring and capacity planning for enterprise platforms.

Requested credentials: Knowledge of cloud-based development.

Cloud consultant

Job description: Conduct technical studies and evaluations of business area requirements and recommend appropriate cloud technology options to IT management.

Required credentials: At least 8 years of related IT consulting experience; outstanding understanding of cloud technologies available and vendors providing cloud services; top-notch communication skills.

Cloud systems engineer

Job description: Build the virtual systems that support the cloud implementation.

Required credentials: B.S. in computer science, information technology or related technical degree; 5-10 years of systems engineering experience, holistic understanding of the Internet and hosting from the network layer up through the application layer; experience in a 24×7 hosting environment.

Requested credentials: Experience with monitoring tools, scripting, configuration management, clustering, Drupal and Internet security.

Cloud network engineer

Job description: Perform the implementation, operational support, maintenance and optimization of network hardware, software and communication links of the cloud infrastructure.

Required credentials: Related degree in computer science; 4 years of in-depth network engineering experience; proven deep understanding of TCP/IP, subnetting, DNS, DHCP, NAT and routing; strong knowledge of Layer 2 network protocols; strong knowledge of Layer 3 IP routing; proven scripting abilities in one or more languages (Perl, Shell or Python).

Requested credentials: Cisco Certified Network Associate (CCNA)/Cisco Certified Network Professional (CCNP) certification.

Cloud product manager

Job description: Perform product planning for cloud-based offerings including creating product concept and strategy documents, creating requirements specifications, identifying product positioning and enabling the sales processes (licensing, pricing, packaging, benefits, etc.).

Required credentials: Bachelor’s degree in business or computer sciences or equivalent work experience; minimum of 3 years of experience working with a software development company that deploys its offerings using a SaaS or cloud-based model; very strong communication skills.

Requested credentials: Advanced degree in business or computer sciences.

Top 10 Things to Know About Cloud Load Testing

  1. Set performance benchmarks
    Make sure your goals are realistic for your business and industry
  2. Review analytics data
  3. Identify what users are doing on your website (Critical Scenarios)
  4. Utilize real browsers
  5. Identify your testing team and collaboration opportunities
  6. Detail execution requirements
    Test within a live production environment as much as possible.
  7. Define your test iterations
  8. Set your testing timeline
  9. Monitor and log, log, log.
  10. Cross-correlate results
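
Items 1 and 9 above (set a benchmark, then log everything) can be tied together with a small check that compares logged response times against the benchmark. The following is only a minimal sketch: the sample timings and the 500 ms target are hypothetical values chosen for illustration, and a real test would read the numbers from your own load-testing tool's logs.

```shell
#!/bin/sh
set -eu

# Hypothetical response times in milliseconds, one per request, as a load
# test might log them. Real values would come from your own tooling.
samples="120 95 310 150 101 87 990 140 133 160"

# Benchmark chosen for illustration only: 95% of requests under 500 ms.
target_ms=500

# Sort numerically, then pick the nearest-rank 95th percentile:
# rank = ceil(0.95 * n), computed with integer arithmetic.
p95=$(printf '%s\n' $samples | sort -n | awk '
    { v[NR] = $1 }
    END {
        rank = int((95 * NR + 99) / 100)
        print v[rank]
    }')

if [ "$p95" -le "$target_ms" ]; then
    verdict="PASS"
else
    verdict="FAIL"
fi
echo "p95=${p95}ms target=${target_ms}ms verdict=${verdict}"
```

With these ten samples the single 990 ms request is the nearest-rank 95th percentile, so the check fails even though the typical request is fast. That is the kind of outlier an average would hide.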

[Free Cloud Service]: Load Testing

These are cloud-based applications that allow you to load-test your web application. All of these tools offer free plans, but they are not unlimited, so be sure to check usage limits within the free plan.

LoadStorm

LoadStorm™ is a web load test tool. It solves the problem of expensive traditional performance testing tools and the hassle of hosting open source tools.

CloudTest

Run mobile app and web performance tests faster, more rigorously and at scale, for less cost. CloudTest lets you build, execute and analyze load tests on a single powerful platform. Then run these tests on SOASTA’s Global Test Cloud to any traffic level, up to millions of geographically dispersed mobile and web users. Whether you’re working with a large or small app, CloudTest is the logical choice for testing. When you need to test faster and even bigger, CloudTest makes it possible.

CloudTest Lite – Free Edition

CloudTest Lite is an enterprise-class, functional and performance testing platform that removes the barriers to performance testing web and mobile applications at any stage.

BlazeMeter

BlazeMeter was founded with the goal of making it easy for everyone to run sophisticated, large-scale performance and load tests quickly, easily and affordably.

BlazeMeter allows you to run massive load tests in the cloud. Cloud computing is an ideal solution for load testing. It allows creating massive loads within minutes, but only requires you to pay for what you use. And BlazeMeter’s BlazeCluster™ technology lets you create massive-scale load tests previously only possible with expensive testing environments. In addition to the advantages it has as a load testing cloud, BlazeMeter leverages the popular open-source performance testing framework, Apache JMeter. In fact, it’s the only 100% compatible JMeter testing service.

Neustar Website Load Testing

OpenShift Cloud Release Management

OpenShift allows developers to stand up their own release-management pipeline, including: local development, a single-gear staging environment, and a production-quality, cloud-scaling clustered environment – all in two simple steps!

PREP-WORK:

You’ll need to sign up for a free OpenShift account, and have the latest versions of the git and rubygems command-line tools available in your local development environment.

Run sudo gem install rhc, and rhc setup to configure your OpenShift “Red Hat Cloud” rhc command-line tools, and link your local development environment with your OpenShift Online account:

sudo gem install rhc
rhc setup

Step 1: Local Development and Staging

Start by provisioning a new cloud-hosted application. First, you’ll need to choose an APP_NAME and CARTRIDGE_TYPE to supply the initial starter-code for your app. Type rhc cartridge list to see a complete list of cartridge types and/or available application-hosting languages.

When ready, start up a new gear using the rhc command-line tools.

This command will output a local Git repo, so make sure to run it from within a folder where you would like to store your development source:

rhc app create APP_NAME CARTRIDGE_TYPE

For example, you can create a new application named onegear using node.js, by typing:

rhc app create onegear nodejs-0.6

This single-gear application hosting environment will be configured with a remote git source code repository (or repo) and several other services. It will also create a local clone of your application source, stored in a folder matching your chosen APP_NAME.

The rhc app create command output should also include a live web URL where your application code has been staged.

For local development and testing work, enter your local application folder (cd APP_NAME) and make a few changes to your code. If you followed the above example and created a new node.js application, running npm start will allow you to test your changes locally.

Step 2: Clustered, Auto-Scaling Environments

Now that we have established our application’s development and staging environments, let’s add a multi-gear auto-scaling stage to the mix!

This time, when provisioning your gear:

  • Add a -s flag to the rhc app create command to take advantage of OpenShift’s auto-scaling capabilities.
  • Use the --no-git flag to avoid creating an extra copy of the default application source, since we already have locally-available application code from Step 1.

rhc app create --no-git -s multigear nodejs-0.6

The above command should return a new remote Git URL that your local project will need to be made aware of.

Run the following command from within your application source folder, inserting your own NEW_GIT_URL address. In these examples, I’ve chosen to name the Git remote for our clustered app, production:

git remote add production -m master NEW_GIT_URL

To see a list of each of your remote Git repositories, enter:

git remote -v

Now that you have a secondary Git remote configured, you’ll need to force push your project’s commit history over to your multi-gear application instance. This will establish a consistent commit history across each of your environments, and deploy the latest version of your code to your production cluster:

git push -f production master

You can optionally rename your staging environment’s Git remote label to help clarify the purpose of that machine. By default, OpenShift sets your primary application’s remote Git URL name to origin.

git remote rename origin staging

I’ll be using the name staging when pushing to my app’s single-gear environment in the remaining examples.

OPTIONAL STEP – DOMAIN ALIASING:

If you have registered a domain name that you would like to use with your application, our F.A.Q. contains some excellent notes on configuring domain name aliasing on OpenShift.

Release Management Workflows

Now that you have successfully established each of your application hosting environments, the following deployment workflows should be possible:

1. Local Development

With node.js, you can enter your local project (cd APP_NAME), and start up a development server with npm start.

Make a few changes and reload your local server to test your edits locally. When you’re happy with your changes, check them into your local repo before promoting them on to the next application-hosting stage.

git diff   # to review your changes
git add    # each file name that is ready to be deployed 
git commit # to record a batch of changes locally

2. Deployment to Staging

After reviewing and committing your changes locally, you can update your application staging environment with your latest work:

git push staging # or simply, git push

Then open a browser to this gear’s APP_URL to test your latest code. OpenShift does a lot of work behind the scenes to update your live app whenever this simple git push operation occurs.

3. Deployment to Auto-Scaled Environments

If the changes look good in staging, you can make them available on your cloud-scaled application environment as well:

git push production # deploy to a gear w/ the Git remote label of "production"

Easy, right?
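
The staging-to-production promotion can be rehearsed without an OpenShift account. The sketch below is an illustration under stated assumptions, not part of the OpenShift tooling: two local bare repositories (staging.git and production.git, both hypothetical names) stand in for the two gears, and git ls-remote is used to promote exactly the commit that staging is running.

```shell
#!/bin/sh
set -eu

# Throwaway setup: two local bare repositories stand in for the staging and
# production gears (an assumption made so the sketch runs anywhere).
workdir=$(mktemp -d)
cd "$workdir"
git init -q --bare staging.git
git init -q --bare production.git
git init -q app
cd app
git config user.email "dev@example.com"
git config user.name "Demo Developer"
git remote add staging ../staging.git
git remote add production ../production.git

echo "hello" > index.txt
git add index.txt
git commit -q -m "first release"

# Workflow 2: deploy the commit to staging. Naming the remote branch in full
# keeps the sketch independent of the local default branch name.
git push -q staging HEAD:refs/heads/master

# Workflow 3: promote exactly what staging is running, not whatever the
# local HEAD happens to be by the time you promote.
staged=$(git ls-remote staging refs/heads/master | cut -f1)
git push -q production "$staged:refs/heads/master"

live=$(git ls-remote production refs/heads/master | cut -f1)
echo "staging=$staged production=$live"
```

Comparing the two hashes printed at the end is a quick way to confirm that production is serving the same revision that was tested in staging.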

Emergency Rollbacks

Occasionally, an application bug may manage to sneak past your development and staging-area tests and make it through to your production hosting environment.

If needed, you can quickly revert the status of a gear to a previous “known good” version of your application code:

git log # to review your list of changes

The output should look something like this:

commit b021aaa7effee77a1ed501813c2fa8f65ba100de 
Author: ryanj <ryanj@redht.com>
Date:   Sun Feb 12 13:44:42 2013 -0800

    changing the application header text

commit 28c5555352a902c549c965da30cf7559c80f328e
Author: Template builder <builder@example.com>
Date:   Thu Feb 7 15:36:23 2013 -0500

    Creating template

For example, if I knew that the previous release of my code was bug-free, the following command should allow me to reset my staging server to the older commit hash listed in the log (28c5555352a902c549c965da30cf7559c80f328e). Because this moves the remote branch backwards, which is not a fast-forward, the push needs the -f flag:

git push -f staging 28c5555352a902c549c965da30cf7559c80f328e:master

Resetting your production environment to the same commit revision only requires a small change:

git push -f production 28c5555352a902c549c965da30cf7559c80f328e:master

Once the bug has been identified and patched in your local development environment, you can repeat the previous deployment steps (without a commit hash) to update your application with your latest project source.
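
A rollback like this is worth rehearsing before it is needed. The following is a minimal sketch, assuming only that git is installed: a local bare repository (staging.git, a hypothetical name) stands in for the staging gear, and the commit messages are borrowed from the sample log above purely for illustration.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory; nothing here touches a real OpenShift gear.
workdir=$(mktemp -d)
cd "$workdir"

# A local bare repository stands in for the remote staging gear (assumption).
git init -q --bare staging.git
git init -q app
cd app
git config user.email "dev@example.com"
git config user.name "Demo Developer"
git remote add staging ../staging.git

# Two commits: a known-good release, then a change that turns out to be buggy.
echo "v1" > index.txt
git add index.txt
git commit -q -m "Creating template"
good=$(git rev-parse HEAD)

echo "v2 (buggy)" > index.txt
git commit -q -am "changing the application header text"
bad=$(git rev-parse HEAD)

# Deploy the latest (buggy) code to the stand-in staging remote.
git push -q staging HEAD:refs/heads/master

# Roll the remote branch back to the known-good commit. Moving a branch
# backwards is not a fast-forward, so the push has to be forced.
git push -q -f staging "$good:refs/heads/master"

deployed=$(git --git-dir=../staging.git rev-parse refs/heads/master)
echo "rolled back to: $deployed"
```

Note that the local branch still points at the buggy commit after the rollback; only the remote was moved, which is why a later plain git push redeploys your latest work once the fix is committed.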

Machine Learning in Action

Machine Learning in Action is a unique book that blends the foundational theories of machine learning with the practical realities of building tools for everyday data analysis. You’ll use the flexible Python programming language to build programs that implement algorithms for data classification, forecasting, recommendations, and higher-level features like summarization and simplification.

A machine is said to learn when its performance improves with experience. Learning requires algorithms and programs that capture data and ferret out the interesting or useful patterns. Once the specialized domain of analysts and mathematicians, machine learning is becoming a skill needed by many.

Machine Learning in Action is a clearly written tutorial for developers. It avoids academic language and takes you straight to the techniques you’ll use in your day-to-day work. Many (Python) examples present the core algorithms of statistical data processing, data analysis, and data visualization in code you can reuse. You’ll understand the concepts and how they fit in with tactical tasks like classification, forecasting, recommendations, and higher-level features like summarization and simplification.

Readers need no prior experience with machine learning or statistical processing. Familiarity with Python is helpful.

PART 1 CLASSIFICATION

  • Machine learning basics
  • Classifying with k-Nearest Neighbors
  • Splitting datasets one feature at a time: decision trees
  • Classifying with probability theory: naïve Bayes
  • Logistic regression
  • Support vector machines
  • Improving classification with the AdaBoost meta algorithm

PART 2 FORECASTING NUMERIC VALUES WITH REGRESSION

  • Predicting numeric values: regression
  • Tree-based regression

PART 3 UNSUPERVISED LEARNING

  • Grouping unlabeled items using k-means clustering
  • Association analysis with the Apriori algorithm
  • Efficiently finding frequent itemsets with FP-growth

PART 4 ADDITIONAL TOOLS

  • Using principal component analysis to simplify data
  • Simplifying data with the singular value decomposition
  • Big data and MapReduce

Machine Learning in Action

VMware vFabric Hyperic

VMware vFabric™ Hyperic® is the application monitoring component of the VMware vFabric Cloud Application Platform, enabling IT professionals to manage the performance and availability of custom Web applications in physical, virtual and cloud environments. Its unique ability to automatically discover, inventory, and monitor servers, regardless of type or location, enables application operations teams to ensure that business-critical apps run without fail. Hyperic collects a vast range of performance data (50,000 metrics across 75 application technologies) and is easily extended to monitor any component in your application stack.

With fast deployment, a fully extensible framework and user interface, and support for both virtualized and physical infrastructure at scale, Hyperic is the standard for Web operations teams of all sizes.

VMware vFabric Hyperic provides the automation and built-in intelligence necessary for Web operations teams to manage applications at scale. Designed to power scalable, secure Web applications, Hyperic saves time and ensures comprehensive management of business-critical services.