Conceptual difference between seeding and migration - database-migration

We use knex for database migrations, and we now need to introduce database seeding to the mix. Seeding is a concept in knex, and the seed files are run by separate commands.
Currently, we need to seed the database with certain config values, however, it is not clear to me if this should be done using seeding or migration. I lean towards migration, because the data are unlikely to change, and by using migration, the data will be inserted into the correct schema, regardless of later schema changes and which point in time the seeding is executed. However, it seems a bit unwise to go against such a common pattern as seeding.
My question then, is if there is a practical difference between seeding and migration that I am not aware of, and if it is OK to insert data using the migration process.

If it is a data that should be inserted to DB just once and it should be same in every installation of APP, then migrations is good place to do it.
If data changes between installations, then seeds sounds like something that could be used.
I don't think that seeds should be used with knex for anything else except for maybe inserting some test data for tests to run (even in that case it is usually better to not to use knex's seed feature, but write your own that suits your tests better).

I think Mikael Lepistö makes a great point in their answer.
I feel that it can depend on the data and where you are expecting it to come from.
A good example of data good for migration is static enumerations. These are most likely defined in your code and will not change regardless of environment and is closest to the code that generates it, so migrations make more sense.
A good example for seeding is data generated at run time or from another system. This data could be different based on the environment and integrations, so a seed would better suit.
The idea that seeding is only for test data I do not agree with, but it does lend itself to test data, as this is data dependent on an environment.

Related

is it ok to write test cases for save/update/persist methods - whether it be mock or by calling real methods [duplicate]

I know what the advantages are and I use fake data when I am working with more complex systems.
What if I am developing something simple and I can easily set up my environment in a real database and the data being accessed is so small that the access time is not a factor, and I am only running a few tests.
Is it still important to create fake data or can I forget the extra coding and skip right to the real thing?
When I said real database I do not mean a production database, I mean a test database, but using a real live DBMS and the same schema as the real database.
The reasons to use fake data instead of a real DB are:
Speed. If your tests are slow you aren't going to run them. Mocking the DB can make your tests run much faster than they otherwise might.
Control. Your tests need to be the sole source of your test data. When you use fake data, your tests choose which fakes you will be using. So there is no chance that your tests are spoiled because someone left the DB in an unfamiliar state.
Order Independence. We want our tests to be runnable in any order at all. The input of one test should not depend on the output of another. When your tests control the test data, the tests can be independent of each other.
Environment Independence. Your tests should be runnable in any environment. You should be able to run them while on the train, or in a plane, or at home, or at work. They should not depend on external services. When you use fake data, you don't need an external DB.
Now, if you are building a small little application, and by using a real DB (like MySQL) you can achieve the above goals, then by all means use the DB. I do. But make no mistake, as your application grows you will eventually be faced with the need to mock out the DB. That's OK, do it when you need to. YAGNI. Just make sure you DO do it WHEN you need to. If you let it go, you'll pay.
It sort of depends what you want to test. Often you want to test the actual logic in your code not the data in the database, so setting up a complete database just to run your tests is a waste of time.
Also consider the amount of work that goes into maintaining your tests and testdatabase. Testing your code with a database often means your are testing your application as a whole instead of the different parts in isolation. This often result in a lot of work keeping both the database and tests in sync.
And the last problem is that the test should run in isolation so each test should either run on its own version of the database or leave it in exactly the same state as it was before the test ran. This includes the state after a failed test.
Having said that, if you really want to test on your database you can. There are tools that help setting up and tearing down a database, like dbunit.
I've seen people trying to create unit test like this, but almost always it turns out to be much more work then it is actually worth. Most abandoned it halfway during the project, most abandoning ttd completely during the project, thinking the experience transfer to unit testing in general.
So I would recommend keeping tests simple and isolated and encapsulate your code good enough it becomes possible to test your code in isolation.
As far as the Real DB does not get in your way, and you can go faster that way, I would be pragmatic and go for it.
In unit-test, the "test" is more important than the "unit".
I think it depends on whether your queries are fixed inside the repository (the better option, IMO), or whether the repository exposes composable queries; for example - if you have a repository method:
IQueryable<Customer> GetCustomers() {...}
Then your UI could request:
var foo = GetCustomers().Where(x=>SomeUnmappedFunction(x));
bool SomeUnmappedFunction(Customer customer) {
return customer.RegionId == 12345 && customer.Name.StartsWith("foo");
}
This will pass for an object-based fake repo, but will fail for actual db implementations. Of course, you can nullify this by having the repository handle all queries internally (no external composition); for example:
Customer[] GetCustomers(int? regionId, string nameStartsWith, ...) {...}
Because this can't be composed, you can check the DB and the UI independently. With composable queries, you are forced to use integration tests throughout if you want it to be useful.
It rather depends on whether the DB is automatically set up by the test, also whether the database is isolated from other developers.
At the moment it may not be a problem (e.g. only one developer). However (for manual database setup) setting up the database is an extra impediment for running tests, and this is a very bad thing.
If you're just writing a simple one-off application that you absolutely know will not grow, I think a lot of "best practices" just go right out the window.
You don't need to use DI/IOC or have unit tests or mock out your db access if all you're writing is a simple "Contact Us" form. However, where to draw the line between a "simple" app and a "complex" one is difficult.
In other words, use your best judgment as there is no hard-and-set answer to this.
It is ok to do that for the scenario, as long as you don't see them as "unit" tests. Those would be integration tests. You also want to consider if you will be manually testing through the UI again and again, as you might just automated your smoke tests instead. Given that, you might even consider not doing the integration tests at all, and just work at the functional/ui tests level (as they will already be covering the integration).
As others as pointed out, it is hard to draw the line on complex/non complex, and you would usually now when it is too late :(. If you are already used to doing them, I am sure you won't get much overhead. If that were not the case, you could learn from it :)
Assuming that you want to automate this, the most important thing is that you can programmatically generate your initial condition. It sounds like that's the case, and even better you're testing real world data.
However, there are a few drawbacks:
Your real database might not cover certain conditions in your code. If you have fake data, you cause that behavior to happen.
And as you point out, you have a simple application; when it becomes less simple, you'll want to have tests that you can categorize as unit tests and system tests. The unit tests should target a simple piece of functionality, which will be much easier to do with fake data.
One advantage of fake repositories is that your regression / unit testing is consistent since you can expect the same results for the same queries. This makes it easier to build certain unit tests.
There are several disadvantages if your code (if not read-query only) modifies data:
- If you have an error in your code (which is probably why you're testing), you could end up breaking the production database. Even if you didn't break it.
- if the production database changes over time and especially while your code is executing, you may lose track of the test materials that you added and have a hard time later cleaning it out of the database.
- Production queries from other systems accessing the database may treat your test data as real data and this can corrupt results of important business processes somewhere down the road. For example, even if you marked your data with a certain flag or prefix, can you assure that anyone accessing the database will adhere to this schema?
Also, some databases are regulated by privacy laws, so depending on your contract and who owns the main DB, you may or may not be legally allowed to access real data.
If you need to run on a production database, I would recommend running on a copy which you can easily create during of-peak hours.
It's a really simple application, and you can't see it growing, I see no problem running your tests on a real DB. If, however, you think this application will grow, it's important that you account for that in your tests.
Keep everything as simple as you can, and if you require more flexible testing later on, make it so. Plan ahead though, because you don't want to have a huge application in 3 years that relies on old and hacky (for a large application) tests.
The downsides to running tests against your database is lack of speed and the complexity for setting up your database state before running tests.
If you have control over this there is no problem in running the tests directly against the database; it's actually a good approach because it simulates your final product better than running against fake data. The key is to have a pragmatic approach and see best practice as guidelines and not rules.

When should SQLite not be used for testing in Django if a different RDBMS(E.g. PostgreSQL) is used in production and development?

This article advises to use SQLite for tests even if you use another RDBMS (I use PostgreSQL) in development and production. I tried SQLite for one test case and it ran faster indeed (~18.8 times faster, 0.5s vs 9.4s!).
In which cases could using SQLite result in different test results than if I used PostgreSQL?
Only if I would test a piece of code that contains a raw SQL query?
Any time the query generator might produce queries that behave differently on different platforms despite the efforts of the query generator's platform abstractions. Differences in regular expressions, collations and sorting, different levels of strictness about aggregates and grouping, use of full-text search or other extension features, use of anything but the most utterly simple functions and operators, etc.
Also, as you noted, any time you run raw SQL.
It's moderately reasonable to run tests on SQLite during iterative development, but you really need to run them on the same DB you're going to deploy on before you push to production. Otherwise you'll get bitten by some query where different engines have different capabilities to prove transitive equality through joins and GROUP BY or are differently permissive of queries, so a query will work on one then fail on the other.
You should also test against PostgreSQL on a reasonable data set before pushing changes live in order to find obvious performance regressions that'll be an issue in production. It makes little sense to do this on SQLite, where often totally different queries will be fast or slow.
I'm surprised you're seeing the kind of speed difference you report. I'd want to look into why the tests run so much slower on PostgreSQL and what you can do about it, since in production it's clearly not going to have the same kind of performance difference. I wrote a bit about this in optimise PostgreSQL for fast testing.
The performance characteristics will be very different in most cases. Often faster. It's typically good for testing because the SQLite engine does not need to take into account multiple client access. SQLite only allowed one thread to access it at once. This greatly reduces a lot of the overhead and complexity compared to other RDBMSs.
As far as raw queries go, there are going to be lot of features that SQLite does not support compared to Postgres or another RDBMS. Stay away from raw queries as much as possible to keep your code portable. The exception will be when you need to optimize specific queries for production. In those cases you can keep a setting in settings.py to check if you are on production and run the generic filters instead of a raw query. There are many types of generic raw queries that will not need this sort of checking though.
Also, the fact that a SQLite DB is just a file, it makes it very simple to tear down and start over for testing.

Best practice approach for automated testing

This is a very strange request for advice for which I truly feel there is no real answer. In my project I have archiving routines on various objects that have been consumed for logical calculations, I archive these items for the sake of audit trail and to check up on calculation errors or prove correctiveness at a later stage. I am working with Entity Framework and things are slightly different to perhaps your own project.
I consume the original object, modify it directly, create a clone of the modified item, revert the original item from store and save changes accordingly. An object is not reverted to original if never consumed by a calculation, in these instances, I save directly over that object along with the various relationships that exist with further objects.
This may sound long winded, but I assure you - it seems the easiest so far in terms of my workings with EF in my situation.
My trouble with these archiving routines is, that over time as I introduce further functionality - I sometimes, without knowing, break critical code to a point where I have to regression test the entire solution over, from beginning to end, to ensure that the archiving requirements remain intact.
Is there any unit test approach or automated methodology for testing these sorts of requirements. It would speed up deployment of packages cutting down on my own manual testing.
Any advice or links to simlar situations appreciated.
I think there are two pieces to this problem you are describing:
First you need some unit tests that you can build which will represent technical requirements of the system. Think of the unit tests as the rules which you have set up to technically accomplish the goal that the end user desires. In this way, I would craft unit tests that you can feel confident will break if a technical assumption you had made about the system fails because of a code change. Remember to keep the unit tests at the unit level so that you don't have a large amount of dependencies interacting to fail a test. A unit test should test exactly one thing. If you do this, when you make code changes you can run all your unit tests and immediately know what assumptions you had made about the system which are now not being met.
I would also set up some sort of integration functional tests which are automated. I think in your problem domain it would make sense to set up integrated tests which are similar to unit tests (you can use the same tool.) Here you will want to take bigger pieces of functionality, perhaps pipes which data flows through the system and test that the correct series of transformations occur on the data.
One best practice is to make sure the tests can be run in any order. You could separate the produce routines from the archive routines, perhaps by using "gold" data on the archive routing.
The number one best practice for unit tests is just do it! Beyond that, I'd like to recommend xUnit Test Patterns: Refactoring Test Code by Gerard Meszaros.

How do I break this down into Unit Tests?

I have a method that is called on an object to perform some business logic and add it to the database.
The object is a Transaction, and part of the business logic requires searching the databses for related accounts and history items on the account.
There are then a series of comparisons and operations that need to bring back information from the account and apply it to the transaction before the transaction is then passed on to other people and written to the database.
The only way I can think of for testing this currently is within the test to create an account and the relevant history information, then to construct a transaction for each different scenario and capture the information written to the DB for the transaction and information being passed on, however this feels like its testing way too much in one test. Each scenario would be performed in a separate unit test, with the test construction refactored out into separate methods, but the actual piece of code targetted by the test is over 500 lines long.
I guess this question is more about refactoring than unit testing, but in this case they go hand in hand.
If anyone has any advice (good or bad) then I'd be glad to hear it.
EDIT:
Pseudo code:
Find account for transaction
Do validation on transaction codes and values
Update transaction with info from account
Get related history from account Handle different transaction codes and values (6 different combinations, each with different logic)
Update the transaction again with new account info (resulting from business logic)
Send transaction to clients
I would appreciate it if you had some pseudocode on this question, but just following it over I would:
Create interfaces for the data access objects that directly access the database - this way you can pass in an object that only pretends (e.g., mocks) that it accesses the database. This object would then return results consistent with the results your database would return, without actually doing any DB call. Your object could also simulate scenarios such as rolling back data to its original state.
Extract each "scenario" into a single method each - that is the essence of a unit. If your method is 500 lines long then there must be contiguous blocks in there that can be extracted. Write a unit test for each, if appropriate.
If your unit test is testing too much, that probably means your method is doing too much - You can extract methods by identifying the different things you are testing and then putting them in their own methods. Rinse and repeat until you only need one test for each method.
Transactions "passed on to other people" sounds like a code smell - a transaction in and by itself should only be one contiguous unit. If you need different users to finish your transaction, you're doing too much; keep track of your data's state on the DB instead, in terms of flags or such, not in terms of a DB transaction.
Separating out units from existing legacy code can be extremely tricky and time consuming. Check out Working Effectively With Legacy Code for a variety of tried and tested techniques to make things more manageable.
This depends on what you would like to test. Would you like to test the database transaction? Would you like to test the business transaction or something else? Try to use mockups for things you would not like to test. With mockups you can concentrate on certain test objectives.
Yes you -can- rewrite you current code so that it can be unit tested according
to all guidelines and best-practices.
However, that can be expensive, and You should estimate the cost and compare that
against the earnings...
The earnings is that you might discover a problem with the code and also, if done right, the reduction of the complexity as the result of the refactoring.
Both factors might save some time - in the future.
The cost is the time and effort you have to spend both refactor your code,
writing the test cases and also the extra time you might have to spend in the future
to maintain the test-cases and the mocking code - and that can be significant costs.
You are comparing a known cost against a future risk and I am sure a lot of smart guys knows how to do that, but it's obvious that You can actually spend an infinite time refactoring and mocking without ever reducing the risk of failure to zero (or even at all if the code and problem is complex and you are messing things up when refactoring), so you need to find a balance here.
In this case, as the code is old, it might be ok to be "sloppy" or "pragmatic" and do black box testing - or top-down testing and just test the interface (or abstraction) without bothering to mock the database. And yes, you can argue that this is not a unit test but instead a system test or a function test practice.
...But, it might give the best value for your money - or your employers/customers money - or more time with your significant other (or at least more time to watch discovery channel.)
If you have old code, allow black box testing, allow dependencies between the tests and compile a sequence of test that sets up the test data and manipulates it, and its at least tested automatically while not tested 100%.

Random data in Unit Tests?

I have a coworker who writes unit tests for objects which fill their fields with random data. His reason is that it gives a wider range of testing, since it will test a lot of different values, whereas a normal test only uses a single static value.
I've given him a number of different reasons against this, the main ones being:
random values means the test isn't truly repeatable (which also means that if the test can randomly fail, it can do so on the build server and break the build)
if it's a random value and the test fails, we need to a) fix the object and b) force ourselves to test for that value every time, so we know it works, but since it's random we don't know what the value was
Another coworker added:
If I am testing an exception, random values will not ensure that the test ends up in the expected state
random data is used for flushing out a system and load testing, not for unit tests
Can anyone else add additional reasons I can give him to get him to stop doing this?
(Or alternately, is this an acceptable method of writing unit tests, and I and my other coworker are wrong?)
There's a compromise. Your coworker is actually onto something, but I think he's doing it wrong. I'm not sure that totally random testing is very useful, but it's certainly not invalid.
A program (or unit) specification is a hypothesis that there exists some program that meets it. The program itself is then evidence of that hypothesis. What unit testing ought to be is an attempt to provide counter-evidence to refute that the program works according to the spec.
Now, you can write the unit tests by hand, but it really is a mechanical task. It can be automated. All you have to do is write the spec, and a machine can generate lots and lots of unit tests that try to break your code.
I don't know what language you're using, but see here:
Java
http://functionaljava.org/
Scala (or Java)
http://github.com/rickynils/scalacheck
Haskell
http://www.cs.chalmers.se/~rjmh/QuickCheck/
.NET:
http://blogs.msdn.com/dsyme/archive/2008/08/09/fscheck-0-2.aspx
These tools will take your well-formed spec as input and automatically generate as many unit tests as you want, with automatically generated data. They use "shrinking" strategies (which you can tweak) to find the simplest possible test case to break your code and to make sure it covers the edge cases well.
Happy testing!
This kind of testing is called a Monkey test. When done right, it can smoke out bugs from the really dark corners.
To address your concerns about reproducibility: the right way to approach this, is to record the failed test entries, generate a unit test, which probes for the entire family of the specific bug; and include in the unit test the one specific input (from the random data) which caused the initial failure.
There is a half-way house here which has some use, which is to seed your PRNG with a constant. That allows you to generate 'random' data which is repeatable.
Personally I do think there are places where (constant) random data is useful in testing - after you think you've done all your carefully-thought-out corners, using stimuli from a PRNG can sometimes find other things.
I am in favor of random tests, and I write them. However, whether they are appropriate in a particular build environment and which test suites they should be included in is a more nuanced question.
Run locally (e.g., overnight on your dev box) randomized tests have found bugs both obvious and obscure. The obscure ones are arcane enough that I think random testing was really the only realistic one to flush them out. As a test, I took one tough-to-find bug discovered via randomized testing and had a half dozen crack developers review the function (about a dozen lines of code) where it occurred. None were able to detect it.
Many of your arguments against randomized data are flavors of "the test isn't reproducible". However, a well written randomized test will capture the seed used to start the randomized seed and output it on failure. In addition to allowing you to repeat the test by hand, this allows you to trivially create new test which test the specific issue by hardcoding the seed for that test. Of course, it's probably nicer to hand-code an explicit test covering that case, but laziness has its virtues, and this even allows you to essentially auto-generate new test cases from a failing seed.
The one point you make that I can't debate, however, is that it breaks the build systems. Most build and continuous integration tests expect the tests to do the same thing, every time. So a test that randomly fails will create chaos, randomly failing and pointing the fingers at changes that were harmless.
A solution then, is to still run your randomized tests as part of the build and CI tests, but run it with a fixed seed, for a fixed number of iterations. Hence the test always does the same thing, but still explores a bunch of the input space (if you run it for multiple iterations).
Locally, e.g., when changing the concerned class, you are free to run it for more iterations or with other seeds. If randomized testing ever becomes more popular, you could even imagine a specific suite of tests which are known to be random, which could be run with different seeds (hence with increasing coverage over time), and where failures wouldn't mean the same thing as deterministic CI systems (i.e., runs aren't associated 1:1 with code changes and so you don't point a finger at a particular change when things fail).
There is a lot to be said for randomized tests, especially well written ones, so don't be too quick to dismiss them!
If you are doing TDD then I would argue that random data is an excellent approach. If your test is written with constants, then you can only guarantee your code works for the specific value. If your test is randomly failing the build server there is likely a problem with how the test was written.
Random data will help ensure any future refactoring will not rely on a magic constant. After all, if your tests are your documentation, then doesn't the presence of constants imply it only needs to work for those constants?
I am exaggerating however I prefer to inject random data into my test as a sign that "the value of this variable should not affect the outcome of this test".
I will say though that if you use a random variable then fork your test based on that variable, then that is a smell.
In the book Beautiful Code, there is a chapter called "Beautiful Tests", where he goes through a testing strategy for the Binary Search algorithm. One paragraph is called "Random Acts of Testing", in which he creates random arrays to thoroughly test the algorithm. You can read some of this online at Google Books, page 95, but it's a great book worth having.
So basically this just shows that generating random data for testing is a viable option.
Your co-worker is doing fuzz-testing, although he doesn't know about it. They are especially valuable in server systems.
One advantage for someone looking at the tests is that arbitrary data is clearly not important. I've seen too many tests that involved dozens of pieces of data and it can be difficult to tell what needs to be that way and what just happens to be that way. E.g. If an address validation method is tested with a specific zip code and all other data is random then you can be pretty sure the zip code is the only important part.
if it's a random value and the test fails, we need to a) fix the object and b) force ourselves to test for that value every time, so we know it works, but since it's random we don't know what the value was
If your test case does not accurately record what it is testing, perhaps you need to recode the test case. I always want to have logs that I can refer back to for test cases so that I know exactly what caused it to fail whether using static or random data.
You should ask yourselves what is the goal of your test.
Unit tests are about verifying logic, code flow and object interactions. Using random values tries to achieve a different goal, thus reduces test focus and simplicity. It is acceptable for readability reasons (generating UUID, ids, keys,etc.).
Specifically for unit tests, I cannot recall even once this method was successful finding problems. But I have seen many determinism problems (in the tests) trying to be clever with random values and mainly with random dates.
Fuzz testing is a valid approach for integration tests and end-to-end tests.
Can you generate some random data once (I mean exactly once, not once per test run), then use it in all tests thereafter?
I can definitely see the value in creating random data to test those cases that you haven't thought of, but you're right, having unit tests that can randomly pass or fail is a bad thing.
If you're using random input for your tests you need to log the inputs so you can see what the values are. This way if there is some edge case you come across, you can write the test to reproduce it. I've heard the same reasons from people for not using random input, but once you have insight into the actual values used for a particular test run then it isn't as much of an issue.
The notion of "arbitrary" data is also very useful as a way of signifying something that is not important. We have some acceptance tests that come to mind where there is a lot of noise data that is no relevance to the test at hand.
I think the problem here is that the purpose of unit tests is not catching bugs. The purpose is being able to change the code without breaking it, so how are you going to know that you break
your code when your random unit tests are green in your pipeline, just because they doesn't touch the right path?
Doing this is insane for me. A different situation could be running them as integration tests or e2e not as a part of the build, and just for some specific things because in some situations you will need a mirror of your code in your asserts to test that way.
And having a test suite as complex as your real code is like not having tests at all because who
is going to test your suite then? :p
A unit test is there to ensure the correct behaviour in response to particular inputs, in particular all code paths/logic should be covered. There is no need to use random data to achieve this. If you don't have 100% code coverage with your unit tests, then fuzz testing by the back door is not going to achieve this, and it may even mean you occasionally don't achieve your desired code coverage. It may (pardon the pun) give you a 'fuzzy' feeling that you're getting to more code paths, but there may not be much science behind this. People often check code coverage when they run their unit tests for the first time and then forget about it (unless enforced by CI), so do you really want to be checking coverage against every run as a result of using random input data? It's just another thing to potentially neglect.
Also, programmers tend to take the easy path, and they make mistakes. They make just as many mistakes in unit tests as they do in the code under test. It's way too easy for someone to introduce random data, and then tailor the asserts to the output order in a single run. Admit it, we've all done this. When the data changes the order can change and the asserts fail, so a portion of the executions fail. This portion needn't be 1/2 I've seen exactly this result in failures 10% of the time. It takes a long time to track down problems like this, and if your CI doesn't record enough data about enough of the runs, then it can be even worse.
Whilst there's an argument for saying "just don't make these mistakes", in a typical commercial programming setup there'll be a mix of abilities, sometimes relatively junior people reviewing code for other junior people. You can write literally dozens more tests in the time it takes to debug one non-deterministic test and fix it, so make sure you don't have any. Don't use random data.
In my experience unit tests and randomized tests should be separated. Unit tests serve to give a certainty of the correctness of some cases, not only to catch obscure bugs.
All that said, randomized testing is useful and should be done, separately from unit tests, but it should test a series of randomized values.
I can't help to think that testing 1 random value with every run is just not enough, neither to be a sufficient randomized test, neither to be a truly useful unit test.
Another aspect is validating the test results. If you have random inputs, you have to calculate the expected output for it inside the test. This will at some level duplicate the tested logic, making the test only a mirror of the tested code itself. This will not sufficiently test the code, since the test might contain the same errors the original code does.
This is an old question, but I wanted to mention a library I created that generates objects filled with random data. It supports reproducing the same data if a test fails by supplying a seed. It also supports JUnit 5 via an extension.
Example usage:
Person person = Instancio.create(Person.class);
Or a builder API for customising generation parameters:
Person person = Instancio.of(Person.class)
.generate(field("age"), gen -> gen.ints.min(18).max(65))
.create();
Github link has more examples: https://github.com/instancio/instancio
You can find the library on maven central:
<dependency>
<groupId>org.instancio</groupId>
<artifactId>instancio-junit</artifactId>
<version>LATEST</version>
</dependency>
Depending on your object/app, random data would have a place in load testing. I think more important would be to use data that explicitly tests the boundary conditions of the data.
We just ran into this today. I wanted pseudo-random (so it would look like compressed audio data in terms of size). I TODO'd that I also wanted deterministic. rand() was different on OSX than on Linux. And unless I re-seeded, it could change at any time. So we changed it to be deterministic but still psuedo-random: the test is repeatable, as much as using canned data (but more conveniently written).
This was NOT testing by some random brute force through code paths. That's the difference: still deterministic, still repeatable, still using data that looks like real input to run a set of interesting checks on edge cases in complex logic. Still unit tests.
Does that still qualify is random? Let's talk over beer. :-)
I can envisage three solutions to the test data problem:
Test with fixed data
Test with random data
Generate random data once, then use it as your fixed data
I would recommend doing all of the above. That is, write repeatable unit tests with both some edge cases worked out using your brain, and some randomised data which you generate only once. Then write a set of randomised tests that you run as well.
The randomised tests should never be expected to catch something your repeatable tests miss. You should aim to cover everything with repeatable tests, and consider the randomised tests a bonus. If they find something, it should be something that you couldn't have reasonably predicted; a real oddball.
How can your guy run the test again when it has failed to see if he has fixed it? I.e. he loses repeatability of tests.
While I think there is probably some value in flinging a load of random data at tests, as mentioned in other replies it falls more under the heading of load testing than anything else. It is pretty much a "testing-by-hope" practice. I think that, in reality, your guy is simply not thinkng about what he is trying to test, and making up for that lack of thought by hoping randomness will eventually trap some mysterious error.
So the argument I would use with him is that he is being lazy. Or, to put it another way, if he doesn't take the time to understand what he is trying to test it probably shows he doesn't really understand the code he is writing.