Cassandra: The Foundation Big Data Building Block

As Chief Technology Officer and co-founder at Instaclustr, Ben sets the technical direction for the company, identifying new features and capability.

Ben is based in our Redwood City office and was recognized as an Apache Cassandra MVP at the Cassandra Summit in 2015. He is active in the community, often speaking at local meetups and presenting at related conferences.

Instaclustr has been providing a managed service for Apache Cassandra since 2013. Our solution is delivered in the cloud on AWS, Azure and SoftLayer and also through Heroku as an add-on. We also provide an enterprise managed capability for clients in private datacenters and consulting and support services to a wide range of clients. We have seen the wide range of Cassandra use cases in action and this post aims to share some of our experiences.

To state the obvious, we here at Instaclustr are massive fans of Apache Cassandra. We have built our company and our managed service around this database technology and the awesomeness of its capability.

There are a lot of well-documented use cases and amazing companies providing examples of how they are using Apache Cassandra, but as a managed service we get to see firsthand the power of this technology and what it can do for our diverse customer base.

Suffice to say that our experience over the last couple of years has left us even more convinced that Apache Cassandra is the foundation technology for the next wave of global-scale applications and solutions.

Our use cases

Of course, we see the same use cases as those identified on Planet Cassandra. Here is our take on each of them:

  • Fraud detection. This use case from Planet Cassandra is very active in our environment. In most cases the application involves identifying anomalies through data mining and deep analytics to flag security-related events of interest.
  • Messaging. Several of our customers run social media and data-sharing applications with messaging services at their core.
  • IoT. This is probably the most common of our customers' use cases. We have many customers, representing a wide range of industries, using Cassandra as an IoT solution. We also work with a number of customers who are providing IoT platforms to their own customer base.
  • Catalogues & Playlists. We haven't seen this use case as often as the others. Where we do, the data models and usage patterns typical of catalogues and playlists are usually a small part of a much larger application.
  • Recommendation & Personalisation. Many of our customers are using the power of personalisation. This is very common within the AdTech industry, but some of our customers are also building unique learning platforms that are personalised to an individual student.

The most popular industries? We have a large customer base within the AdTech space, where performance and scalability are the key metrics. We also have core customers in the FinTech industry, where personalization, high availability and security are all important, and several customers in the EdTech space developing specialized and personalized learning platforms.

Another interesting insight is that we have an amazingly diverse client base that ranges from personal projects to early-stage start-ups all the way through to 140-year-old, billion-dollar companies looking to transform and enhance their business. We can see firsthand that you don't have to be a large company to be working with large datasets.

We have been with several of our original customers on the journey from an initial 3-node cluster through to large production clusters with separate staging and testing environments.

Diverse use cases help us improve every day

The beauty of having grown our customer base so rapidly and widely is that we have benefited from gaining insight and understanding into the wide application of this technology and the details of specific use cases. This provides us with a unique perspective of the deployment of Apache Cassandra. We see its adaptability, but we also see its complexity and its temperament when it is not handled well.

We see the specific nuances associated with operating an efficient production grade environment and cluster for all of the different use cases. Having such a wide range of different deployments under our care is giving us an ever increasing richness in our own data that we are now analysing through our Instametrics monitoring environment. This is helping us to continually improve our capability and to continue to automate and refine our service offering.

We have also been in the unique position of growing with our customers and helping them scale, in some cases rapidly. This also provides us with insight into how to build out a cluster or environment efficiently when an application goes viral, or the application has to ingest vast amounts of data.

With great power comes great responsibility

There is no doubt that Apache Cassandra provides great power, but the trade-off is that this also comes with a certain level of complexity. You can't expect a database technology like Apache Cassandra to scale rapidly, provide high-throughput performance and be continuously on without some work on your part.

Continued monitoring, maintenance and performance tuning are important activities that must go with any database and associated technology environment to keep it operating efficiently and effectively. But probably just as important is good design and planning up front.

We often see that the data layer follows on from the application. That is, the time, effort and focus at first for many start-ups go into the application and what the customer is building on the frontend. This is often necessary to demonstrate a concept to an investor or to simply get things up and running quickly while finding market fit. This approach means that the data is often an afterthought.

When the data is an afterthought we often see that the application and database will work okay at the beginning, but it is when they try to scale that things get ugly with Cassandra. If you don’t treat the data layer with a certain amount of mechanical sympathy, and you don’t plan effectively from the start, then there can be consequences down the track.
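
To make the "works okay at the beginning" pattern concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from any customer deployment: the workload figures, the `Workload` type and the function names are all hypothetical, and it assumes a simple IoT-style table where every reading is a fixed-size row. It compares how much data lands in a single partition when the partition key is the device id alone versus the device id plus a day bucket.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Workload:
    """Hypothetical IoT workload: one device writing fixed-size readings."""
    readings_per_second: float
    bytes_per_reading: int


def partition_bytes(workload: Workload, days_in_partition: float) -> float:
    """Approximate bytes accumulated in one partition over the given span."""
    rows = workload.readings_per_second * SECONDS_PER_DAY * days_in_partition
    return rows * workload.bytes_per_reading


def unbucketed_partition_bytes(workload: Workload, days_running: float) -> float:
    """Partition key is the device id alone, so the partition never stops growing."""
    return partition_bytes(workload, days_running)


def daily_bucket_partition_bytes(workload: Workload, days_running: float) -> float:
    """Partition key is (device id, day), so a partition holds at most one day."""
    return partition_bytes(workload, min(days_running, 1))


if __name__ == "__main__":
    # Assumed figures for illustration only: 1 reading per second, 100 bytes each.
    device = Workload(readings_per_second=1, bytes_per_reading=100)
    for days in (1, 30, 365):
        grown = unbucketed_partition_bytes(device, days) / 1e6
        capped = daily_bucket_partition_bytes(device, days) / 1e6
        print(f"day {days:>3}: unbucketed {grown:10.1f} MB, daily bucket {capped:6.1f} MB")
```

Under these assumed numbers the two designs look identical on day one, which is why the problem tends to stay hidden early on. Only the unbucketed partition keeps growing as the application scales, and oversized partitions are a commonly reported source of operational trouble in Cassandra. The right bucket size depends on your own write rate and query patterns.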

We see that customers who plan and design their data architecture and infrastructure effectively from the start tend to prosper, and scaling and performance are not an issue for them. However, if you neglect to plan effectively and neglect to perform continued maintenance and tuning, then the promises that Apache Cassandra brings can instead leave you feeling an enormous amount of pain.

And yes we can speak from experience. Bringing your infrastructure and database back from the brink can be a difficult and painful experience.

You are much better off doing the work up front. Even if your environment works well initially, it is when you get to the point of having to scale that you will start to see issues.

If you design the architecture and infrastructure yourself, get an independent expert with some experience to validate your work. Check and check again. Doing it right the first time will set you up for efficient scaling, high performance and a continuously on environment and save you weeks of pain when you have terabytes of data structured the wrong way. Again, your application might work okay at the start, but it is when it comes to the point of scaling that we see most of the issues arise for our customers.

Rebecca Mills

Rebecca Mills is a Junior Evangelist at DataStax, the company that delivers Apache Cassandra™ to the enterprise. She is interested in the new Cassandra user experience and is involved making it more approachable for everyone. She holds a B.Sc in biochemistry from Memorial University of Newfoundland and has been known to go on a bit about genome analysis.