Approaches to unit testing big data

Imagine you are designing a system and you want to start writing tests that verify functionality, but also performance and scalability. Are there any techniques you can share for handling large sets of data in different environments?

I would strongly recommend prioritizing functionality tests (using TDD as your development workflow) before working on performance and scalability tests. TDD will ensure your code is well designed and loosely coupled, which will make it much, much easier down the road to create automated performance and scalability tests. When your code is loosely coupled, you get control over your dependencies. When you have control over your dependencies, you can create any configuration you want for any high-level test you want to write.

Do some functionality tests first. Also consider a few risk-management techniques; refer to the post "How to handle risk Management in Big data", which should help you.

Separate the different types of test:
Functional testing should come first, starting with unit tests on small amounts of mock data.
Next come integration tests, with small amounts of data in a data store, though obviously not the same instance as the store with the large data sets.
You may be able to reduce your development effort by doing performance and scalability tests together.
One important tip: your test data set should be as realistic as possible. Use production data, anonymizing as necessary. Because big data performance depends on the statistical distributions in the data, you don't want to use synthetic data. For example, if you use fake user data that is basically the same user info a million times, you will get very different scalability results than with real-life messy user data that has a wide distribution of values.
For more specific advice, I'd need to know the technology you're using. On Hadoop, look at MRUnit. For relational databases, DbUnit. Apache Bigtop can provide inspiration, though it is aimed at core Hadoop projects rather than specific application-level projects.
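For instance, a minimal MRUnit test of a mapper might look like the sketch below; the WordCountMapper here is a tiny invented example, not from the question:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {

    // A tiny hypothetical mapper under test: emits (token, 1) per word.
    static class WordCountMapper
            extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                context.write(new Text(token), new IntWritable(1));
            }
        }
    }

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void emitsOneCountPerToken() throws Exception {
        // Small, fully controlled mock input; no cluster required.
        mapDriver.withInput(new LongWritable(0), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest();
    }
}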

There are two situations that I encountered:
Large HDFS datasets that serve as a data warehouse or a data sink for other applications
Apps with HBase or other distributed databases
Tips for unit testing in both cases:
a. Test the different functional components of the app first; there is no special rule for Big Data apps. Just like any other app, unit testing should ascertain whether the different components of an app are working as expected. Then you can integrate functions/services/components, etc. to do the SIT, if applicable.
b. Specifically, if HBase or any other distributed database is involved, test what is required of the DB. For example, distributed databases often do not support ACID properties the way traditional databases do; instead they are constrained by the CAP theorem (Consistency, Availability, Partition tolerance), and usually only 2 of the 3 can be guaranteed. For most RDBMSs it is CA; for HBase it is usually CP, and for Cassandra AP. As a designer or test planner you should know, depending on your application's features, which CAP constraint applies to your distributed database, and create your test plan accordingly to check the real situation.
Regarding performance: again, a lot depends on the infrastructure and app design, and sometimes certain software implementations are more taxing than others. You might check the amount of partitioning, for example; it's all case-based.
Regarding scalability: the very advantage of a Big Data implementation is that it is easily scalable compared to a traditional architecture. I have never thought of this as something to test. For most Big Data apps you can scale easily, and horizontal scaling in particular is very easy, so I'm not sure anyone thinks about testing scalability for most apps.

For testing and measuring performance, you could use static data sources and input (it could be a huge dump file or a SQLite DB).
You can create a test and include it in your integration build so that if a particular function call takes more than X seconds, an error is thrown.
As you build up more of your system, you will see that number increase and break your test.
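As a minimal sketch of that idea with JUnit 4, the timeout attribute fails the test when a call runs past its budget (DataImporter, the dump-file path, and the 5-second budget are all invented for illustration):

import org.junit.Test;

public class ImportPerformanceTest {

    // Fails the test if the call runs longer than 5 seconds.
    // DataImporter and the dump-file path are hypothetical stand-ins.
    @Test(timeout = 5000)
    public void bulkImportStaysWithinBudget() throws Exception {
        DataImporter importer = new DataImporter("testdata/large-dump.sqlite");
        importer.importAll();
    }
}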
You could spend 20% of your time getting 80% of the functionality; the rest of the 80% goes to performance and scalability :)
Scalability: think about a service-oriented architecture, so that you can put a load balancer in between and increase your state/processing capacity by simply adding new hardware/services to your system.

Is it OK to write test cases for save/update/persist methods - whether by mock or by calling real methods?

I know what the advantages are and I use fake data when I am working with more complex systems.
What if I am developing something simple, where I can easily set up my environment against a real database, the data being accessed is so small that access time is not a factor, and I am only running a few tests?
Is it still important to create fake data, or can I skip the extra coding and go right to the real thing?
When I said real database I do not mean a production database, I mean a test database, but using a real live DBMS and the same schema as the real database.
The reasons to use fake data instead of a real DB are:
Speed. If your tests are slow you aren't going to run them. Mocking the DB can make your tests run much faster than they otherwise might.
Control. Your tests need to be the sole source of your test data. When you use fake data, your tests choose which fakes you will be using. So there is no chance that your tests are spoiled because someone left the DB in an unfamiliar state.
Order Independence. We want our tests to be runnable in any order at all. The input of one test should not depend on the output of another. When your tests control the test data, the tests can be independent of each other.
Environment Independence. Your tests should be runnable in any environment. You should be able to run them while on the train, or in a plane, or at home, or at work. They should not depend on external services. When you use fake data, you don't need an external DB.
Now, if you are building a small little application, and by using a real DB (like MySQL) you can achieve the above goals, then by all means use the DB. I do. But make no mistake, as your application grows you will eventually be faced with the need to mock out the DB. That's OK, do it when you need to. YAGNI. Just make sure you DO do it WHEN you need to. If you let it go, you'll pay.
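As a minimal sketch of what such a fake might look like (the User and UserRepository types are invented for illustration, not from the question):

import java.util.ArrayList;
import java.util.List;

// Hypothetical domain type and repository interface.
class User {
    private final String name;
    User(String name) { this.name = name; }
    String getName() { return name; }
}

interface UserRepository {
    void save(User user);
    List<User> findByName(String name);
}

// The fake: fast, fully controlled by the test, order- and environment-independent.
class InMemoryUserRepository implements UserRepository {
    private final List<User> users = new ArrayList<>();

    public void save(User user) {
        users.add(user);
    }

    public List<User> findByName(String name) {
        List<User> matches = new ArrayList<>();
        for (User u : users) {
            if (u.getName().equals(name)) {
                matches.add(u);
            }
        }
        return matches;
    }
}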
It sort of depends on what you want to test. Often you want to test the actual logic in your code, not the data in the database, so setting up a complete database just to run your tests is a waste of time.
Also consider the amount of work that goes into maintaining your tests and test database. Testing your code against a database often means you are testing your application as a whole instead of the different parts in isolation. This often results in a lot of work keeping both the database and the tests in sync.
And the last problem is that the test should run in isolation so each test should either run on its own version of the database or leave it in exactly the same state as it was before the test ran. This includes the state after a failed test.
Having said that, if you really want to test against your database, you can. There are tools that help with setting up and tearing down a database, like DbUnit.
I've seen people try to create unit tests like this, but it almost always turns out to be much more work than it is actually worth. Most abandoned it halfway through the project, and many abandoned TDD completely, letting that experience color their view of unit testing in general.
So I would recommend keeping tests simple and isolated, and encapsulating your code well enough that it becomes possible to test it in isolation.
As long as the real DB does not get in your way, and you can go faster that way, I would be pragmatic and go for it.
In unit-test, the "test" is more important than the "unit".
I think it depends on whether your queries are fixed inside the repository (the better option, IMO), or whether the repository exposes composable queries; for example - if you have a repository method:
IQueryable<Customer> GetCustomers() {...}
Then your UI could request:
var foo = GetCustomers().Where(x => SomeUnmappedFunction(x));

bool SomeUnmappedFunction(Customer customer) {
    return customer.RegionId == 12345 && customer.Name.StartsWith("foo");
}
This will pass against an object-based fake repository (LINQ-to-Objects will simply invoke the method in memory), but it will fail against an actual database implementation, since the query provider cannot translate the unmapped function into SQL. Of course, you can nullify this by having the repository handle all queries internally (no external composition); for example:
Customer[] GetCustomers(int? regionId, string nameStartsWith, ...) {...}
Because this can't be composed, you can check the DB and the UI independently. With composable queries, you are forced to use integration tests throughout if you want it to be useful.
It rather depends on whether the DB is automatically set up by the test, and also whether the database is isolated from other developers.
At the moment it may not be a problem (e.g. with only one developer). However, with manual database setup, setting up the database is an extra impediment to running tests, and that is a very bad thing.
If you're just writing a simple one-off application that you absolutely know will not grow, I think a lot of "best practices" just go right out the window.
You don't need to use DI/IOC or have unit tests or mock out your db access if all you're writing is a simple "Contact Us" form. However, where to draw the line between a "simple" app and a "complex" one is difficult.
In other words, use your best judgment, as there is no hard-and-fast answer to this.
It is OK to do that for this scenario, as long as you don't see them as "unit" tests. Those would be integration tests. You also want to consider whether you will be manually testing through the UI again and again, as you might just automate your smoke tests instead. Given that, you might even consider not doing the integration tests at all, and just working at the functional/UI test level (as those will already be covering the integration).
As others have pointed out, it is hard to draw the line between complex and non-complex, and you would usually know when it is too late :(. If you are already used to doing them, I am sure you won't get much overhead. If that were not the case, you could learn from it :)
Assuming that you want to automate this, the most important thing is that you can programmatically generate your initial condition. It sounds like that's the case, and even better, you're testing real-world data.
However, there are a few drawbacks:
Your real database might not cover certain conditions in your code. With fake data, you can force that behavior to happen.
And as you point out, you have a simple application; when it becomes less simple, you'll want to have tests that you can categorize as unit tests and system tests. The unit tests should target a simple piece of functionality, which will be much easier to do with fake data.
One advantage of fake repositories is that your regression / unit testing is consistent since you can expect the same results for the same queries. This makes it easier to build certain unit tests.
There are several disadvantages if your code modifies data (i.e., it is not read-only queries):
- If you have an error in your code (which is probably why you're testing), you could end up breaking the production database. And even if you don't break it outright, you are still writing test data into a live system.
- If the production database changes over time, and especially while your code is executing, you may lose track of the test data you added and have a hard time cleaning it out of the database later.
- Production queries from other systems accessing the database may treat your test data as real data and this can corrupt results of important business processes somewhere down the road. For example, even if you marked your data with a certain flag or prefix, can you assure that anyone accessing the database will adhere to this schema?
Also, some databases are regulated by privacy laws, so depending on your contract and who owns the main DB, you may or may not be legally allowed to access real data.
If you need to run against a production database, I would recommend running against a copy, which you can easily create during off-peak hours.
If it's a really simple application and you can't see it growing, I see no problem running your tests on a real DB. If, however, you think this application will grow, it's important that you account for that in your tests.
Keep everything as simple as you can, and if you require more flexible testing later on, make it so. Plan ahead though, because you don't want to have a huge application in 3 years that relies on old and hacky (for a large application) tests.
The downsides to running tests against your database are lack of speed and the complexity of setting up your database state before running tests.
If you have control over this there is no problem in running the tests directly against the database; it's actually a good approach because it simulates your final product better than running against fake data. The key is to have a pragmatic approach and see best practice as guidelines and not rules.

Port monolithic server application to web services gotchas

Has anyone done a port from a monolithic server application to a service and what are the hidden ‘gotchas’ I need to be aware of to accurately estimate the cost of this?
We have an existing single threaded monolithic server application written in Java. It performs fine as it sits, but we want to start extending it. Extending it would mean many more people would use it and the server would not be able to handle the extra load. There was a significant development investment in this code base, and the code base is large. The cost of multithreading the server would be crazy.
I had a harebrained idea about breaking it up into logical service components, removing them from the application and hosting them on Axis2 or Tomcat, and pushing them into a SOA cloud.
I have written many services for Axis2, worked plenty with SOA clouds, and written multiple monolithic servers, and it seems straightforward: eliminate as much shared state as possible and push it into a DB of some kind, pull logical services out of the monolithic app, and repeat until done.
But I have a bad feeling that the devil is in the details and I am certain I am not the first person to have an idea like this.
In my experience with these types of system/architecture migrations, the time sinks and killers tend to be:
Identifying all use cases and functional requirements. Roughly 50% will not be documented. Of the 50% that are, 50% will be incorrect. Of the 50% that are both documented and correct, 50% will no longer be valid.
Ensuring backwards compatibility. Just because you are moving to a new architecture generally doesn't mean that all clients will move with you at the same time.
Migrating existing data into new structure/model/architecture. This always takes a lot longer than you think. Take the worst case scenario you can imagine in terms of time/cost, double it and you'll still be short by about half.
Access control model.
Documenting your services in a clear and useful way. Your shiny new SOA architecture won't be worth squat if no one can use it.
Performance testing. It's mind-boggling how often this is skipped or done at the very end of the project. Establish performance scenarios, testing infrastructure, and baseline values first thing, then continually test and measure against them (a sketch of such a baseline test follows this list). This should be to an architect what unit testing is to a developer.
These are the things that make projects fail, go over budget and over time if you don't address them early in the project.
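As a minimal sketch of that baseline idea (the CheckoutService client, its endpoint, and the 200 ms budget are all invented for illustration, not part of the question):

import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class CheckoutServiceBaselineTest {

    // Baseline established early in the project, then continually re-measured.
    private static final long BASELINE_MILLIS = 200;

    @Test
    public void submitOrderStaysWithinBaseline() throws Exception {
        // CheckoutService and its endpoint are hypothetical stand-ins.
        CheckoutService service = new CheckoutService("http://test-server:8080/checkout");
        long start = System.nanoTime();
        service.submitOrder("sample-order-42");
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        assertTrue("submitOrder took " + elapsedMillis + " ms; baseline is "
                + BASELINE_MILLIS + " ms", elapsedMillis <= BASELINE_MILLIS);
    }
}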

Advantage Data Access Layer

At work I'm trying to get an n-tier model implemented in a large existing PHP application.
I have to convince my seniors, since they don't see the point of an extra DA layer because of performance. The code currently queries the DB in the business logic and does its calculations in the loop while retrieving data from the result set: a low performance cost.
I have tried convincing them with the obvious reasons: transparency ('we can read SQL'), changing the database ('won't happen').
Their argument is that if the data access is done by a separate layer, a data set has to be created and then looped over again in the business layer, costing performance.
Also, creating this n-tier model will mean a lot of work with no 'real' payoff.
Is this a performance issue, and therefore a logical reason to say no to a separate DA layer?
I think you touch an important point: Hand-optimized SQL without an additional abstraction layer can be faster. However, that comes at a price.
The question will probably be: Does the benefit of the additional speed outweigh the benefit of the database access layer e.g. encapsulating the SQL specific knowledge so that the engineers can focus on the business logic (domain layer).
In most cases you probably will find that the performance of a database abstraction layer will be good enough provided the implementation was done by an expert in this matter. If done properly the double buffer/looping can be avoided to a large degree.
I suspect there is only a small proportion of applications (my guess is no more than 20%) where performance is so critical that the abstraction layer is not an option.
But there is also a hybrid solution possible: Use the abstraction layer for the 80% of the modules where flexibility and convenience trumps speed and talk straight to the database in the 20% of the modules where speed is critical.
I probably would vote in favor of the abstraction layer and then optimize performance where needed (which may be achieved with means other than talking directly to the database).
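To make the encapsulation concrete, here is a minimal sketch of the kind of seam a DA layer introduces; it is written in Java for illustration (the question is about PHP, but the shape is the same), and the Order and OrderDao names are invented:

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

// Hypothetical domain object.
class Order {
    final long id;
    final BigDecimal total;
    Order(long id, BigDecimal total) { this.id = id; this.total = total; }
}

// The business layer depends only on this interface, not on SQL.
interface OrderDao {
    List<Order> findByCustomer(long customerId) throws SQLException;
}

class JdbcOrderDao implements OrderDao {
    private final DataSource dataSource;

    JdbcOrderDao(DataSource dataSource) { this.dataSource = dataSource; }

    public List<Order> findByCustomer(long customerId) throws SQLException {
        // The SQL lives here, not in the business logic.
        String sql = "SELECT id, total FROM orders WHERE customer_id = ?";
        try (Connection c = dataSource.getConnection();
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setLong(1, customerId);
            try (ResultSet rs = ps.executeQuery()) {
                List<Order> orders = new ArrayList<>();
                while (rs.next()) {
                    orders.add(new Order(rs.getLong("id"), rs.getBigDecimal("total")));
                }
                return orders;
            }
        }
    }
}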
In my opinion the data access layer is outdated technology compared to what is available today: it is overly complicated and, I would argue, unproven. It checks and validates each SQL data type in a while loop, and .NET has serious app-domain issues where executing code across class files takes more time because .NET assemblies are not tightly coupled. As proof of my argument about overhead: Suse Linux runs very smoothly in 256 MB of RAM, but Windows 7 or Windows XP does not. Moreover, .NET claims automatic memory management, which is not in fact entirely true; it leaves a lot of unused memory on the heap, which results in significant performance loss in a DAL architecture. On top of that, the effort involved in a DAL is 95% more than a direct connection to the database using stored procedures. Don't use a DAL; instead use WCF or XML web services.

Does three-tier architecture ever work?

We have been building three-tier architectures for over a decade now. Dividing the presentation, logic, and data tiers is supposed to allow us to exchange each layer individually should the need ever arise, be it through changed requirements or new technologies.
I have never seen it working in practice...
Mostly because (at least) one of the following reasons:
The three-tier concept was only visible in the source code (e.g. package naming in Java), which was then deployed as one tied-together package.
The code representing each layer was nicely bundled in its own deployable format but then thrown into the same process (e.g. an "enterprise container").
Each layer was run in its own process, sometimes even on different machines, but the layers were statically wired to each other, so replacing one of them meant breaking all of them.
Thus what you usually end up with is a monolithic, tightly coupled system that does not deliver what its architecture promised.
I therefore think "three-tier architecture" is a total misnomer. The true benefit it brings is that the code is logically sound. But that's at "write time", not at "run time". A better name would be something like "layered by responsibility". In any case, the word "architecture" is misleading.
What are your thoughts on this? How could a working three-tier architecture be achieved? By that I mean one which holds its promises: allowing you to pull out a layer without affecting the other ones. The system should survive that and be in a well-defined state afterwards.
Thanks!
The true purpose of layered architectures (both logical and physical tiers) isn't to make it easy to replace a layer (which is quite rare), but to make it easy to make changes within a layer without affecting the others (and as Ben notes, to facilitate scalability, consistency, and security) - which works all the time all around us.
One example of a 3-tier architecture is a typical database-driven web application:
End-user's web browser
Server-side web application logic
Database engine
In every system there is the nice, elegant architecture dreamed up at the beginning, and then the hairy mess once it's finally in production, full of hundreds of bug fixes, special-case handlers, and other typically nasty changes made to address specific issues not anticipated during design.
I don't think the problems you've described are specific to three-tier architecture at all.
If you haven't seen it working, you may just have bad luck. I've worked on projects that serve several UIs (presentation) from one web service (logic). In addition, we swapped data providers via configuration (data) so we could use a low-cost database while developing and Oracle in higher environments.
Sure, there's always some duplication - maybe you add validation in the UI for responsiveness and then validate again in the logic layer - but overall, a clean separation is possible and nice to work with.
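A minimal sketch of that configuration-driven provider swap (the DataProvider interface, the implementations, and the data.provider property are invented for illustration):

interface DataProvider {
    String fetchGreeting();
}

class EmbeddedDbProvider implements DataProvider {
    public String fetchGreeting() { return "hello from the low-cost dev database"; }
}

class OracleProvider implements DataProvider {
    public String fetchGreeting() { return "hello from Oracle"; }
}

class DataProviderFactory {
    // The implementation is chosen by configuration, not code, so the logic
    // layer never knows which database sits underneath it.
    static DataProvider create() {
        String impl = System.getProperty("data.provider", "embedded");
        return "oracle".equals(impl) ? new OracleProvider() : new EmbeddedDbProvider();
    }
}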
Once you accept that n-tier's major benefits, namely scalability, logical consistency, and security, could not easily be achieved through other means, the question of whether any of the tiers can be replaced outright without breaking the others becomes more like asking whether there's any icing on the cake.
Any operating system will have a similar kind of architecture, or else it won't work. The presentation layer is independent of the hardware layer, which is abstracted into drivers that implement a certain interface. The data is handled using logic that changes depending on the type of data being read (think NTFS vs. FAT32 vs. EXT3 vs. CD-ROM). Linux can run on just about any hardware you can throw at it and it will still look and behave the same because the abstractions between the layers insulate each other from changes within a single layer.
One of the biggest practical benefits of the 3-tier approach is that it makes it easy to split up work. You can easily have a DBA and a business analyst or two building a data layer, a traditional programmer building the server-side app code, and a graphic/web designer building the UI. The three teams still need to communicate, of course, but this allows for much smoother development in most cases. In this regard, I see the 3-tier approach working reliably every day, and that is enough for me, even if I cannot count on "interchangeable parts", so to speak.

Is Unit Testing Suitable for BPM Development?

I'm currently working on a large BPM project at work which uses the Global 360 BPM tool-set called Process 360. Just to give some background: this product works like a lot of other BPM solutions in that you design multiple "process maps" which define the flow of a particular business process you're trying to model, and each process map consists of multiple task nodes connected together which perform particular functions (calling web services, etc.).
Currently we're experiencing some pretty serious issues during the QA phases of our releases because the tool-set provides no way to automate testing of the process map routes. So when a large and complex process is developed and handed over to our test team, a large number of issues often crop up. While you'd obviously expect some issues to come out of QA, I can't help feeling that a lot of these bugs could have been spotted during development if we had some sort of automated testing framework we could use to build up a set of unit tests proving the various routes through the process map(s).
At the moment the only real development testing that occurs is more akin to functional testing performed by the developers, documented as a set of manual steps per test case. The problem with this approach is that it's very time-consuming for the developers to run manually and, because of this, also relatively error-prone. Also, because we're usually on a pretty tight schedule, the tests are often not executed often enough to spot issues early.
As I mentioned earlier, there isn't a way provided by the current tool-set to perform this sort of automated testing. Which actually got me thinking: why? Being very new to the whole BPM scene, my assumption was that this was just a feature lacking in the product, but I also wonder whether "unit testing" just isn't traditionally done in the BPM world. Perhaps it just isn't well suited to this sort of work?
I'd be interested to know if anyone else has ever encountered these sorts of issues, and also what - if anything - can be done to improve things.
I have seen something about that, though it is not Global 360 related: using BPELUnit for testing processes.
I develop a workflow tool and there is an increased demand for opening the test tools used for testing the engine to end-users.
I've done "unit" testing with K2.net 2003, another commercial BPM. I'd really call this integration testing, because it requires a test server and it's relatively slow. However, it is automated.
There's a good discussion of this in the book Professional K2 blackpearl (it applies to K2.net 2003 as well).
In order to apply it to your platform, the tool-set has to have an API that permits starting process instances, obtaining work items, completing work items, etc. You write tests using any supported language (I used C#) and a testing framework (I used NUnit). If the API supports synchronous calls, this is easier to do. For each test (a code sketch follows the list):
Start the process under test
Progress the work item to a decision point
Set process instance data appropriately
Complete the work item
Assert that the work item is now at the expected activity
Delete or complete the process instance
Base test classes or helper methods can make this easier. You could even write a DSL for testing maps.
Essentially you want full "test coverage" of the process/map - test every decision point and insure that the correct branch is taken.
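To make those steps concrete, here is a rough sketch against an entirely hypothetical engine API (BpmClient, WorkItem, and the process and activity names are invented stand-ins for whatever your tool-set exposes), written in Java with JUnit rather than C#/NUnit:

import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class LoanApprovalProcessTest {

    // BpmClient and WorkItem are hypothetical stand-ins for your tool-set's API.
    @Test
    public void highValueLoanRoutesToManagerApproval() throws Exception {
        BpmClient client = BpmClient.connect("http://test-server/api");

        // Start the process under test.
        long instanceId = client.startProcess("LoanApproval");

        // Progress the work item to the decision point.
        WorkItem item = client.openWorkItem(instanceId, "CaptureDetails");

        // Set process instance data appropriately.
        item.setField("loanAmount", 250000);

        // Complete the work item.
        item.complete();

        // Assert that the work item is now at the expected activity.
        assertEquals("ManagerApproval", client.currentActivity(instanceId));

        // Delete the process instance so tests stay independent.
        client.deleteProcessInstance(instanceId);
    }
}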
There are two aspects of BPM that are related yet not identical.
There's BPM that tool and technology vendors advocate for which is all about features.
There's also BPM that Enterprise Architects advocate for which is all about establishing Centers of Excellence.
The former is where a company buys some software.
The latter is where a company makes systemic and inherent changes to the behavior of their IT workers.
The former is supposed to be in the service of the latter but that isn't necessarily so. Acquiring the former is necessary but not sufficient for achieving the latter.
I don't know how well Global 360 supports what is known as Test-Driven Development, but JBoss jBPM does provide some tool support for easily writing JUnit tests.
However, the tool can't and won't force developers to write them or to embrace TDD principles.