Testing Big Data and Advanced Analytics stack components is crucial to ensure success and confidence in your target implementation. Separating the application from the infrastructure/platform that enables it is standard practice. Clearly separating the testing of the two, however, may not be as standardized as one would think.
It is important to test your data and analytics architecture without relying on an existing, pre-defined, hardened application. Attempting to test both the data-enabling architecture and the application implemented on top of it in tandem (and thereby relying on testing to harden both) risks providing neither confidence nor consistency in results.
Here we discuss the core steps necessary to build separation between your data platform and application, as well as appropriate testing methodologies for verifying platform performance, reliability, and readiness without the overhead and uncertainty of relying on the client application to be available, tested, and hardened.
The end goal is to certify the platform at each step of the build, then move upward through a verified stack and compare and publish results against established benchmarks for your reference hardware.
YOUR DATA AND ANALYTICS PLATFORM
Horizontally scalable Big Data platforms, such as Hadoop, process very large volumes of data and are highly resource intensive. Platform testing is crucial to ensure the success of your critical Big Data or Advanced Analytics initiatives.
A poorly or improperly designed system may suffer performance degradation and could fail to meet its requirements. Testing should ideally include performance testing of job completion time, memory utilization, data throughput, and similar system metrics. It should also include failover testing to verify that data processing continues seamlessly when data nodes fail.
These testing procedures are most likely already conducted as part of your standardized SDLC. However, they typically require the application itself to be present as part of the QA cycle.
This is where identifying and implementing platform-level testing methodologies, benchmark utilities, and best practices and processes become the core focus.
THE DRIVERS FOR CHANGE
There are a number of factors that drive the need to add specific platform component testing into standard practices:
Experience-based drivers (those observed throughout delivery efforts)
Root cause analysis is consistently hindered by outstanding questions regarding platform, network, server OS, and application component configuration, logging, and monitoring
No body of baseline metrics exists against which the operating state of a cluster can be compared. This is a glaring deficiency: there are no standard workloads or benchmark numbers that define “good”
No evidence exists that the platform architecture has ever been fully and correctly configured
Mismatched or ill-advised component-level settings, found only through deep-dive discovery, could have been resolved before application-level testing flagged “potential” issues and triggered multi-level root cause analysis (RCA)
The number of programs that will use the platform is increasing rapidly, and without a holistic view of the entire platform, each program has the potential to impact the others
Your user base is expanding along with the programs, which means the platform must support (and be tested for) a multitude of different load and performance expectations
The increase in programs means more lines of business on the platform and therefore more potential for data collisions
Data load job activity is increasing, with a corresponding impact on performance
SO WHAT SHOULD WE DO DIFFERENTLY?
With the above drivers in mind, you should incorporate the following components into your testing methodology:
Release-driven platform regression testing, which should include both performance and functional testing
Data quality regression testing across all ETL processes and data integrity checks (compared against the baseline). This guards against data structure drift and data conformity drift
Monitoring (standard and synthetic) based on agreed benchmarks, with alerting on deviation from those benchmarks
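As a rough illustration of benchmark-based regression and alerting, the sketch below compares one run's observed metrics against agreed baselines and reports anything that regresses beyond a tolerance. The metric names, baseline values, and tolerances are hypothetical placeholders, not measurements from any real cluster; the baselines you agree for your own reference hardware would replace them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """Agreed baseline for one platform metric (all values are illustrative)."""
    name: str
    baseline: float
    tolerance_pct: float            # allowed deviation, as a percentage of baseline
    higher_is_better: bool = False  # e.g. throughput; False for durations or memory


def deviation_pct(benchmark, observed):
    """Signed deviation of an observed value from the baseline, in percent."""
    return (observed - benchmark.baseline) / benchmark.baseline * 100.0


def find_regressions(benchmarks, observed):
    """Return alert messages for missing metrics and regressions beyond tolerance."""
    alerts = []
    for b in benchmarks:
        if b.name not in observed:
            alerts.append(f"{b.name}: no observation recorded")
            continue
        dev = deviation_pct(b, observed[b.name])
        # A regression is a drop for "higher is better" metrics, a rise otherwise.
        if b.higher_is_better:
            regressed = dev < -b.tolerance_pct
        else:
            regressed = dev > b.tolerance_pct
        if regressed:
            alerts.append(f"{b.name}: {dev:+.1f}% vs baseline {b.baseline}")
    return alerts


# Hypothetical baselines for a reference cluster.
BENCHMARKS = [
    Benchmark("sort_job_seconds", baseline=600.0, tolerance_pct=10.0),
    Benchmark("write_throughput_mb_s", baseline=400.0, tolerance_pct=10.0,
              higher_is_better=True),
    Benchmark("peak_memory_gb", baseline=48.0, tolerance_pct=15.0),
]

if __name__ == "__main__":
    run = {"sort_job_seconds": 690.0,
           "write_throughput_mb_s": 410.0,
           "peak_memory_gb": 50.0}
    for alert in find_regressions(BENCHMARKS, run):
        print("ALERT", alert)
```

The same check could be run after each release (regression) and on a schedule against synthetic workloads (monitoring), so that both share one definition of "good".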
We also recommend expanding platform testing in the following ways:
Resiliency/failover/chaos testing, done component by component (both hardware and software components included)
Soak testing (partial load sustained for a significant duration, 72+ hours)
Data load optimization testing
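One way to evaluate a soak test is to look for slow drift in a metric that should stay flat under constant partial load, such as memory in use. The sketch below fits a least-squares line to periodic samples and expresses the fitted growth over the window as a percentage of the fitted starting value. The sample data, the 6-hour sampling interval, and the 5% threshold are assumptions chosen for illustration only.

```python
def fit_line(samples):
    """Least-squares fit of (hours_elapsed, value) samples; returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / den
    return slope, mean_v - slope * mean_t


def soak_drift_pct(samples):
    """Fitted change across the sampled window, as a percentage of the fitted start value."""
    slope, intercept = fit_line(samples)
    t_start = min(t for t, _ in samples)
    t_end = max(t for t, _ in samples)
    start_value = intercept + slope * t_start
    return slope * (t_end - t_start) / start_value * 100.0


def has_drift(samples, max_drift_pct=5.0):
    """True if the metric moved more than max_drift_pct in either direction."""
    return abs(soak_drift_pct(samples)) > max_drift_pct


if __name__ == "__main__":
    # Hypothetical heap usage (GB) sampled every 6 hours across a 72-hour soak,
    # growing by 0.1 GB per hour.
    heap_used_gb = [(h, 40.0 + 0.1 * h) for h in range(0, 73, 6)]
    print(f"drift over window: {soak_drift_pct(heap_used_gb):.1f}%")
    print("drift detected:", has_drift(heap_used_gb))
```

A linear fit is a deliberately simple choice; it will not catch step changes or periodic behavior, so it is best treated as a first-pass signal rather than a complete soak analysis.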
With a clear understanding of the background on this subject, the drivers for change, and an agreed starting point ('point A'), we are prepared to move forward on resolving these systemic shortcomings.
In the next parts of this series, we'll discuss test methodology, configuration, synthetic data generation, and further ideas on choosing the right tests and metrics to support platform test certification.