
Quality Assurance In Analytics (Part 2)

Below you will find a test methodology and configuration walkthrough for creating baseline test results at each of the core steps in the buildout and development of a big data platform and application.

Baseline platform testing

Establishing baseline stats for the platform involves building test scripts for each of the individual components once they are configured (prior to application code being deployed). From there, each component has a ‘point A’ to reference and reconfirm as new application versions are deployed and tested. The flow would be:

  1. Server build (racked, stacked, wired)

  2. OS configuration

  3. Network test and traffic/packet capture - test for expected patterns, throughput, routing efficiency

  4. Software component configuration - test for drift/compliance

  5. Component baseline test, run prior to data ingestion using existing suites of tools/utils. The results are then compared with vendor benchmarks for similar hardware to confirm that the configuration above is in line with cluster performance expectations

  6. ETL application and data ingestion

  7. Component test from above using full production data sizes and profiles: ‘our data’

  8. Application deployment and performance testing

Following this flow, we can identify (and eliminate) potential issues at each step before further layers are built on top. In effect, this isolates both component changes and application changes, so deviations show up immediately. The result of this effort is a set of measures that can be referred back to and re-confirmed or back-tested each time something changes on the platform (new hardware, network configuration, security/OS/software component patches, etc).
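As a minimal sketch of re-confirming a ‘point A’, the Python below compares freshly measured metrics with a stored baseline and reports anything that moved by more than a tolerance. The function name `compare_to_baseline`, the metric names, the numbers, and the 10% tolerance are all made up for illustration; a real baseline would come from the component tests described above.

```python
def compare_to_baseline(baseline, current, tolerance=0.10):
    """Return metrics whose relative change from the baseline exceeds tolerance.

    Baseline values are assumed to be non-zero. A metric that was not
    re-measured is reported with the value None.
    """
    deviations = {}
    for name, expected in baseline.items():
        if name not in current:
            deviations[name] = None
            continue
        change = (current[name] - expected) / expected
        if abs(change) > tolerance:
            deviations[name] = round(change, 3)
    return deviations


# Hypothetical numbers: a stored baseline and a re-run after an OS patch.
baseline = {"disk_write_mb_s": 480.0, "network_gbit_s": 9.4, "sort_minutes": 42.0}
after_patch = {"disk_write_mb_s": 470.0, "network_gbit_s": 7.1, "sort_minutes": 43.5}

print(compare_to_baseline(baseline, after_patch))
```

With these sample values only the network metric is flagged, since it dropped by roughly a quarter while the other two stayed within the tolerance.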


Data Testing

Testing for both data quality and consistency is a necessary next step toward a performant, stable platform. It’s important to break down the data-related testing into several parts:

  1. Cardinality (ensure uniqueness)

  2. Regex pattern validation (does a given element conform as expected)

    1. This also ensures appropriate test creation for a given field during validation testing

  3. Range (validation that all data is within acceptable limits)
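The three checks above could be sketched in Python roughly as follows. The helper names, the sample rows, the ID pattern, and the age limits are assumptions chosen for illustration, and each helper returns the offending values rather than a simple pass/fail so that failures are easy to inspect.

```python
import re
from collections import Counter


def check_unique(values):
    """Cardinality: return the values that occur more than once."""
    return [value for value, count in Counter(values).items() if count > 1]


def check_pattern(values, pattern):
    """Regex: return the values that do not fully match the pattern."""
    compiled = re.compile(pattern)
    return [value for value in values if compiled.fullmatch(value) is None]


def check_range(values, low, high):
    """Range: return the values outside the inclusive limits."""
    return [value for value in values if not (low <= value <= high)]


# Hypothetical sample records.
rows = [
    {"id": "A-001", "age": 34},
    {"id": "A-002", "age": 151},
    {"id": "a-3", "age": 27},
    {"id": "A-001", "age": 40},
]
ids = [row["id"] for row in rows]
ages = [row["age"] for row in rows]

duplicates = check_unique(ids)                 # ['A-001']
malformed = check_pattern(ids, r"[A-Z]-\d{3}")  # ['a-3']
out_of_range = check_range(ages, 0, 120)       # [151]
print(duplicates, malformed, out_of_range)
```

On a real platform the same checks would normally run inside the processing engine rather than on in-memory lists; the logic stays the same.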

Performance testing

Performance testing for big data applications involves huge volumes of structured and unstructured data, and it requires a testing approach designed for data at that scale.

Performance testing is executed in this order:

  1. Set up the big data cluster that is to be tested for performance

  2. Identify comparable reference architectures for the cluster build and harvest expected performance characteristics.

  3. Identify and design corresponding workloads.

  4. Prepare individual clients that are compatible with the monitoring platforms (custom scripts are created for this)

  5. Execute the test and analyze the results (if objectives are not met, tune the component and re-execute)

  6. Optimize Configuration
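The execute-and-analyze step can be sketched as a small timing harness. This is an illustration only: `time_workload`, `meets_objective`, the stand-in `sample_workload`, and the 2-second objective are hypothetical, and a real run would submit a job to the cluster instead of sorting a list locally.

```python
import statistics
import time


def time_workload(workload, repeats=5):
    """Run the workload several times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def meets_objective(measured_seconds, objective_seconds):
    """True when the measured time is within the stated objective."""
    return measured_seconds <= objective_seconds


def sample_workload():
    # Stand-in for a real job: sort a reversed list of integers.
    sorted(range(200_000, 0, -1))


median_seconds = time_workload(sample_workload)
if meets_objective(median_seconds, objective_seconds=2.0):
    print(f"objective met: median {median_seconds:.4f}s")
else:
    print(f"objective missed: median {median_seconds:.4f}s, tune and re-execute")
```

The median is used here rather than the mean so that a single slow run (a cold cache, for example) does not dominate the result.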

Regression testing

Full regression would be done using the above methods combined, at every opportunity throughout the SDLC in preparation for release. Success here depends heavily on the DevOps tooling provided. Additionally, as new components come into play, Ops/Eng would be tasked with supporting, developing, and educating the various teams on necessary changes to existing tests and validations.

Test environment needs

Test environment needs depend on the type of architecture being tested. For big data testing, the test environment should provide:

  • Enough storage space to hold and process a production-equivalent amount of data

  • A cluster with distributed nodes and data, running on a known ratio/fraction of production horsepower

  • Dedicated instances should be used to keep test results trustworthy. A reasonable duty schedule should maximize the investment in the hardware while still leaving windows in which an environment is dedicated to performance workloads


Environment Isolation

Isolate operational components from one another (for instance, heavy lifting components from serving layers), and then validate each of them for configuration, performance, and usage expectations.

Heavy lifting component examples:

  • Spark

  • MapR

  • Hive LLAP

Real-time component examples:

  • HBase

  • Titan/Janus

  • Solr

  • Impala

  • Phoenix

Testing Approach

Because the data can be highly complex (large volumes of both unstructured and structured data), testing should keep these factors in mind:

  1. The speed of data consumption (the data insertion rate)

  2. The speed at which queries are processed while that data is being read (the data retrieval rate)

  3. Synchronization requirements when multiple streams are merged into one during curation

Since the system is made up of multiple components, it is important to test each component in isolation first, before testing them all together. Performance testers therefore need to be well-versed in big data frameworks and technologies. This also entails the use of market tools for analytics and for other ancillary business functions that rely on data residing on the platform.
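To make the insertion and retrieval rates concrete, here is a rough sketch that measures both against a single component in isolation. A plain Python dict stands in for the data store, and `records_per_second` and the record layout are hypothetical; against a real component the two lambdas would be replaced by client write and read calls.

```python
import time


def records_per_second(operation, records):
    """Apply operation to every record and return the achieved rate."""
    start = time.perf_counter()
    for record in records:
        operation(record)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(records) / elapsed if elapsed > 0 else float("inf")


store = {}  # stand-in for the component under test
records = [(key, f"value-{key}") for key in range(100_000)]

insert_rate = records_per_second(lambda r: store.__setitem__(r[0], r[1]), records)
read_rate = records_per_second(lambda r: store[r[0]], records)

print(f"insertion rate: {insert_rate:,.0f} records/s")
print(f"retrieval rate: {read_rate:,.0f} records/s")
```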

Continuum: Testing to Monitoring

With the right approach thought through up front, test tools (scripts, automation, etc) can be repurposed to monitor the live running environment long after testing completes, guarding against functional or performance regression.
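One way to picture this reuse is a single check function that is called once from a test and repeatedly from a monitoring loop. The sketch below is hypothetical throughout: `check_latency`, `monitor`, the 0.5-second threshold, and the canned readings are illustrative, and a production monitor would normally hand alerts to an existing alerting system rather than a list.

```python
import time


def check_latency(measure, threshold_seconds):
    """Shared check used by both the test suite and the monitor."""
    observed = measure()
    return observed <= threshold_seconds, observed


def monitor(measure, threshold_seconds, interval_seconds, cycles,
            alert=print, sleep=time.sleep):
    """Re-run the same check on a schedule and report each failure."""
    failures = 0
    for _ in range(cycles):
        ok, observed = check_latency(measure, threshold_seconds)
        if not ok:
            failures += 1
            alert(f"latency {observed:.3f}s exceeds {threshold_seconds:.3f}s")
        sleep(interval_seconds)
    return failures


# Canned readings stand in for live measurements.
readings = iter([0.12, 0.95, 0.20])
alerts = []
failures = monitor(lambda: next(readings), threshold_seconds=0.5,
                   interval_seconds=0, cycles=3, alert=alerts.append)
print(failures, alerts)
```

Because the threshold lives in one shared function, tightening it for testing automatically tightens it for monitoring as well.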



Summary

  • As data engineering and data analytics advance to the next level, big data testing is inevitable.

  • Big data processing can be batch, real-time, or interactive

  • The three stages of testing big data applications are:

    • Data staging validation - including data profiling, CDC, triggers, schedules.

    • Data transformation validation via Map/Reduce, Spark, COTS products.

    • Output validation phase - including data quality.

  • Architecture testing is an important phase of big data testing, as a poorly designed system may lead to unexpected errors and degraded performance

  • Performance testing for big data includes verifying:

    • Data throughput

    • Data processing

    • Sub-component performance

  • Big data testing is very different from traditional data testing in terms of data, infrastructure, and validation tools

  • Big data testing challenges include virtualization, test automation, and dealing with large datasets. Performance testing of big data applications also requires focused attention and effort to yield insight and useful results.

In all, baseline testing of a big data platform is necessary for a multitude of reasons. To that end, we strongly recommend a metered approach to the testing efforts across all phases of the platform buildout and the SDLC related to the application itself. While the value of the time and money spent may not be obvious at the beginning, it is well spent once the quality of the final product is taken into account.


