Unit Testing & Primary Keys

I am new to Unit Testing and think I might have dug myself into a corner.
In your Unit Tests, what is the better way to handle primary keys?
Hopefully an example will provide some context. Say I create several instances of an object (let's say Person).
My unit test is to verify that the correct relationships are being created.
My code creates Homer and his children Bart and Lisa. He also has friends Barney, Karl and Lenny.
I've separated my data layer behind an interface. My preference is to keep the primary key simple, e.g. on Save, Person.PersonID = new Random().Next(10000); rather than hard-coding values such as Barney.PersonID = 9110 and Homer.PersonID = 3243.
It doesn't matter what the primary key is, it just needs to be unique.
Any thoughts???
EDIT:
Sorry I haven't made it clear. My project is set up to use Dependency Injection. The data layer is totally separate. The focus of my question is: what is practical?

I have a class called "Unique" which produces unique objects (strings, integers, etc.). It makes sure they're unique per test by keeping an internal static counter. That counter value is incremented per key generated and included in the key somehow.
So when I'm setting up my test:
var foo = new Foo
{
    ID = Unique.Integer()
};
I like this as it communicates that the value is not important for this test, just the uniqueness.
I have a similar class 'Some' that does not guarantee uniqueness. I use it when I need an arbitrary value for a test. It's useful for enums and entity objects.
None of these are thread-safe or anything like that; it's strictly test code.
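A minimal sketch of what such a helper might look like (the names and details here are illustrative, not the poster's actual code):
public static class Unique
{
    private static int _counter;   // per-test-run counter; not thread-safe, as noted above

    public static int Integer()
    {
        return ++_counter;
    }

    public static string String(string prefix = "unique")
    {
        // embed the counter so every generated string is distinct
        return prefix + "-" + (++_counter);
    }
}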

There are several possible corners you may have dug yourself into that could ultimately lead to the question that you're asking.
Maybe you're worried about re-using primary keys and overwriting or incorrectly loading data that's already in the database (say, if you're testing against a dev database as opposed to a clean test database). In that case, I'd recommend you set up your unit tests to create their records' PKs using whatever sequence a normal application would or to test in a clean, dedicated testing database.
Maybe you're concerned about the efficacy of your code with PKs beyond a simple 1,2,3. Rest assured, this isn't something one would typically test for in a straightforward application, because most of it is outside the concern of your application: generating a number from a sequence is the DB vendor's problem, keeping track of a number in memory is the runtime/VM's problem.
Maybe you're just trying to learn what the best practice is for this sort of thing. I would suggest you set up the database by inserting records before executing your test cases, using the same facilities that your application itself will use to insert records; presumably your application code will rely on a database-vended sequence number for PKs, and if so, use that. Finally, after your test cases have executed, your tests should roll back any changes they made to the database to ensure the test is idempotent over multiple executions. This is my sorry attempt at describing a design pattern called test fixtures.

Consider using GUIDs. They're unique across space and time, meaning that even if two different computers generated them at the exact same instant, they would be different. In other words, for all practical purposes they're guaranteed to be unique. Random numbers are never a good choice here; there is a considerable risk of collision.
You can generate a Guid using the static class and method:
Guid.NewGuid();
Assuming this is C#.
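For example, assuming Person's key is a Guid (a uniqueidentifier column) rather than an int, the setup from the question might become something like:
var homer = new Person { PersonID = Guid.NewGuid() };
var bart  = new Person { PersonID = Guid.NewGuid() };
// the actual values don't matter; only their uniqueness does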
Edit:
Another thing, if you just want to generate a lot of test data without having to code it by hand or write a bunch of for loops, look into NBuilder. It might be a bit tough to get started with (Fluent methods with method chaining aren't always better for readability), but it's a great way to create a huge amount of test data.
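As a rough sketch of the idea, using NBuilder's classic fluent API (the exact syntax varies between versions, and the PersonID property is assumed from the question):
using FizzWare.NBuilder;

var people = Builder<Person>.CreateListOfSize(50)
    .All()
        .With(p => p.PersonID = Guid.NewGuid())   // give every generated Person a unique key
    .Build();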

Why use random numbers? Does the numeric value of the key matter? I would just use a sequence in the database and call nextval.

The essential problem with database unit testing is that primary keys do not get reused. Rather, the database creates a new key each time you create a new record, even if you delete the record with the original key.
There are two basic ways to deal with this:
Read the generated primary key from the database and use it in your tests, or
Use a fresh copy of the database each time you test.
You could put each test in a transaction and roll the transaction back when the test completes, but rolling back transactions doesn't always work with Primary Keys; the database engine will still not reuse keys that have been generated once (in SQL Server anyway).

When a test executes against a database through another piece of code, it ceases to be a unit test. It is called an "integration test" because you are testing the interactions of different pieces of code and how they "integrate" together. Not that it really matters, but it's fun to know.
When you perform a test, the following things should occur:
Begin a db transaction
Insert known (possibly bogus) test items/entities
Call the (one and only one) function to be tested
Test the results
Rollback the transaction
These things should happen for each and every test. With NUnit, you can get away with writing steps 1 and 5 just once in a base class and then inheriting from that in each test class. NUnit will execute [SetUp]- and [TearDown]-decorated methods in a base class.
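A minimal sketch of steps 1 and 5 in an NUnit base class (the class name, connection handling and connection string are illustrative, not prescribed here):
using System.Data.SqlClient;
using NUnit.Framework;

public abstract class DatabaseTestBase
{
    protected SqlConnection Connection;
    protected SqlTransaction Transaction;

    [SetUp]
    public void BeginTransaction()
    {
        Connection = new SqlConnection("<your test connection string>");
        Connection.Open();
        Transaction = Connection.BeginTransaction();   // step 1
    }

    [TearDown]
    public void RollbackTransaction()
    {
        Transaction.Rollback();                        // step 5
        Connection.Dispose();
    }
}
Each concrete test class inherits from this, so only steps 2-4 remain to be written per test.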
In step 2, if you're using SQL, you'll have to write your queries such that they return the PK numbers back to your test code.
INSERT INTO Person(FirstName, LastName)
VALUES ('Fred', 'Flintstone');
SELECT SCOPE_IDENTITY(); --SQL Server example, other db vendors vary on this.
Then you can do this
INSERT INTO Person(FirstName, LastName, SpouseId)
VALUES('Wilma', 'Flintstone', @husbandId);
SET @wifeId = SCOPE_IDENTITY();
UPDATE Person SET SpouseId = @wifeId
WHERE Person.Id = @husbandId;
SELECT @wifeId;
or whatever else you need.
In step 4, if you use SQL, you have to re-SELECT your data and test the values returned.
Steps 2 and 4 are less complicated if you are lucky enough to be able to use a decent ORM like (N)Hibernate (or whatever).
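For instance, with an NHibernate-style session the same Fred/Wilma setup might look roughly like this (the Person mapping, Id and Spouse properties are assumptions made for illustration):
var fred = new Person { FirstName = "Fred", LastName = "Flintstone" };
session.Save(fred);                          // step 2: the ORM populates the generated Id

var wilma = new Person { FirstName = "Wilma", LastName = "Flintstone", Spouse = fred };
fred.Spouse = wilma;
session.Save(wilma);
session.Flush();

// step 4: re-load and assert
var reloaded = session.Get<Person>(fred.Id);
Assert.AreEqual(wilma.Id, reloaded.Spouse.Id);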

Related

Synchronized Model instances in Django

I'm building a model for a Django project (my first Django project) and noticed
that instances of a Django model are not synchronized.
a_note = Notes.objects.create(message="Hello") # pk=1
same_note = Notes.objects.get(pk=1)
same_note.message = "Good day"
same_note.save()
a_note.message # Still is "Hello"
a_note is same_note # False
Is there a built-in way to make model instances with the same primary key to be
the same object? If yes, (how) does this maintain a globally consistent state of all
model objects, even in the case of bulk updates or changing foreign keys
and thus making items enter/exit related sets?
I can imagine some sort of registry in the model class, which could at least handle simple cases (i.e. it would fail in cases of bulk updates or a change in foreign keys). However, the static registry makes testing more difficult.
I intend to build the (domain) model with high-level functions to do complex
operations which go beyond the simple CRUD
actions of Django's Model class. (Some classes of my model have an instance
of a Django Model subclass, as opposed to being an instance of a subclass. This
is by design to prevent direct access to the database which might break consistencies and to separate the business logic from the purely data access related Django Model.) A complex operation might touch and modify several components. As a developer
using the model API, it's impossible to know which components are out of date after
calling a complex operation. Automatically synchronized instances would mitigate this issue. Are there other ways to overcome this?
TL;DR "Is there a built-in way to make model instances with the same primary key to be the same object?" No.
A Python object in memory isn't the same thing as a row in your database. So when you create a_note and then fetch same_note from the db, those are two different objects in memory, even though they both represent the same underlying row in your database. When you fetch same_note, in fact, you instantiate a new Notes object and initialise it with the values fetched from the database.
Then you change and save same_note, but the a_note object in memory isn't changed. If you did a_note.refresh_from_db() you would see that a_note.message was changed.
Now a_note is same_note will always be False because the location in memory of these two objects will always be different. Two variables are the same (is is True) if they point to the same object in memory.
But a_note == same_note will return True at any time, since Django defines two model instances to be equal if their pk is the same.
Note that if the complexity you're talking about is that in the case of multiple requests one request might change underlying values that are being used by another request, then use F() expressions to avoid race conditions.
Within one request, since everything is sequential and single-threaded, there's no risk of variables going out of sync: you know the order in which things are done and therefore can always call refresh_from_db() when you know a previous method call might have changed the database value.
Note also: having two variables holding the same row means you'll have performed two queries to your db, which is the one thing you want to avoid at all costs. So you should think about why you have this situation in the first place.

Database unit test

I am hoping to get some advice on a unit test I am writing to test some db entries.
The function I am testing seeds the database if no records are found.
func Seed(db *gorm.DB) {
    var data []Data
    db.Find(&data)
    if len(data) == 0 {
        // do seed default data
    }
}
What I can't quite seem to get going is the test for that if len check. I am using a test DB that I can nuke whenever, so it is not an issue if I just need to force an empty DB on the function.
The function itself works and I just want to make sure I get that covered.
Any advice would be great.
Thanks!
It really depends, there are so many ways of addressing this based on your risk level and the amount of time you want to invest to mitigate those risks.
You could write a unit test that asserts you're able to detect and act on your logic (i.e. seeding when empty and ignoring when full) without any database.
If you would like to test the logic as well as your program's ability to speak to MySQL correctly through the gorm library, you could:
Have a test where you call Seed with no users in the DB; after calling it, your test could select from Users and make sure the expected entries were created by the len(users) == 0 conditional
Have a test where the test creates a single entry and calls Seed, after which it asserts that no seed data was inserted.
It can get more complicated. If Seed is selecting a subset of data, then your test could insert two users, one that qualifies and one that doesn't, and make sure that no new users are seeded.

Container for in-memory representation of a DB table

Let's say I have a (MySQL) DB. I want to automate the update of this database via an application, that will:
1. Import from DB
2. Calculate updated data
3. Export back updated data
The timing is important: I don't want to import while calculating; in fact, I don't want any queries then. I want to import the table(s) as a whole, then calculate. So, my question is: if a row is represented by an instance of a class, what container do I put these objects into?
A vector? A set? What about ordered vs. unordered? Just use what seems best for my case according to big O times? Any special traps to fall into here? Is this case no different than with data "born in memory", so the only things to consider besides size overhead are "do I want the lookup or the insertion to be faster" ?
Probably the best route is to use some ORM, but let's say I don't want to.
I've seen some apps use boost::unordered_set, and I wondered if there is a particular reason for its use...
I use a jdbc-like interface as the connector (libmysqlcpp).
I do not think the container you should use can be guessed from so little information. It mainly depends on the data size and type and the algorithm you will run.
But my main concern with such a design is that it will quickly choke your network or your database. If you have a big table you'll:
select all the data from the table
retrieve all the data over the network
process part (some columns?) or the entirety of the data on your machine
push the data over the network
update your rows (or erase/replace maybe)
Why don't you consider working directly on the MySQL server? You could create a user-defined function that works directly on the data, saving the network round trips and even taking advantage of the fact that MySQL is built to handle gigantic amounts of data, a quantity that an in-memory container is not built to handle.

Where to put pre-flush entity state data in Zend where EntityRepository can access it?

I have a Person that is many-to-one with a Family. The system takes in many rows of data with multiple persons at the same time, which may or may not belong to the same families. I don't have the information about families beforehand.
When I process a Person to enter into the system, I check whether I need to add its Family to the database first. I naturally ask the FamilyRepository about that, but even if I've already created and persisted the same Family, the FamilyRepository still doesn't know about this, since it's written to the database only at flush().
The solution would be to temporarily add a reference to somewhere during the PrePersist of the newly created Family, and make the FamilyRepository check from that place as well as from the database.
But where should this temporary persisted-but-not-yet-flushed entity reference go, so that I can access it from the entity's repository?
Alternative solutions I don't like:
The code that does the adding (PersonService->insertPersons()) could of course keep track of the persisted entities, but this seems like a non-optimal solution since it is not a general solution and that code would have to be put each place that adds data.
I could just flush after each addition, but I'd prefer not to flush until all the data has been processed.
I could also loop through $entityManager->getUnitOfWork()->getScheduledEntityInsertions() and find entries there, but that seems more like a hack than an actual solution.
I'm not entirely clear on what you're trying to do, but it sounds like you can handle this by manually handling your transactions (assuming you're using the ORM, anyway. Not sure about transaction support in ODM).
By wrapping your whole import in a transaction, you can make incremental flushes so that SELECTs issued by the repository will return the data, but you can still rollback the entire thing if something goes wrong:
<?php
$em->beginTransaction();
$familyRepository = $em->getRepository('Family');

foreach ($personData as $p) {
    $lastname = $p['lastname'];

    $person = new Person();
    $person->setLastname($lastname);

    $family = $familyRepository->findOneByLastname($lastname);
    if (! $family) {
        $family = new Family();
        $family->setLastname($lastname);
        $em->persist($family);
    }

    $person->setFamily($family);
    $em->persist($person);
    $em->flush();
}
$em->commit();

How to delete all database data with NHibernate?

Is it possible to delete all data in the database using NHibernate? I want to do that before starting each unit test. Currently I drop my database and create it again, but this is not an acceptable solution for me.
==========================================================
Ok, here are the results. I am testing this on a database (Postgre). I will test CreateSchema(1), DanP solution(2) and apollodude217 solution(3). I run the tests 5 times with each method and take the average time.
Round 1 - 10 tests
(1) - ~26 sec
(2) - 9,0 sec
(3) - 9,3 sec
Round 2 - 100 tests
(1) - Come on, I will not do that on my machine
(2) - 12,6 sec
(3) - 18,6 sec
I think that it is not necessary to test with more tests.
I'm using the SchemaExport class and recreate the schema before each test. This is almost like dropping the database, but it's only dropping and recreating the tables. I assume that deleting all data from each table is not faster than this; it could even be slower.
Our unit tests usually run on Sqlite in memory, which is very fast. This database exists only as long as the connection is open, so the whole database is recreated for each test. We switch to Sqlserver by changing the build configuration.
Personally, I use a stored procedure to do this, but it may be possible with Executable HQL (see this post for more details: http://fabiomaulo.blogspot.com/2009/05/nh21-executable-hql.html )
Something along the lines of session.Delete("from object");
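A rough sketch of that approach with executable HQL (NHibernate 2.1+); the entity names are placeholders, and you would repeat this for each mapped class, deleting children before parents to respect foreign keys:
using (var tx = session.BeginTransaction())
{
    // bulk-delete statements run as single SQL DELETEs and bypass the session cache
    session.CreateQuery("delete from Person").ExecuteUpdate();
    session.CreateQuery("delete from Family").ExecuteUpdate();
    tx.Commit();
}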
I do not claim this is faster, but you can do something like this for each mapped class:
// untested
var entities = MySession.CreateCriteria(typeof(MappedClass)).List<MappedClass>();
foreach(var entity in entities)
MySession.Delete(entity); // please optimize
This (alone) will not work in at least 2 cases:
When there is data that must be in your database when the app starts up.
When you have a type where the identity property's unsaved-value is "any".
A good alternative is having a backup of the initial DB state and restoring it when starting tests (this can be complex or not, depending on the DB)
Re-creating the database is a good choice, especially for unit testing. If the creation script is too slow you could take a backup of the database and use it to restore the DB to an initial state before each test.
The alternative would be to write a script that would drop all foreign keys in the database then delete/truncate all tables. This would not reset any autogenerated IDs or sequences however. This doesn't seem like an elegant solution and it is definitely more time consuming.
In any case, this is not something that should be done through an ORM, and that applies to any ORM, not just NHibernate.
Why do you reject the re-creation option? What are your requirements? Is the schema too complex? Does someone else design the database? Do you want to avoid file fragmentation?
Another solution might be to create a stored procedure that wipes the data. In your test setup or initialization method, run the stored procedure first.
However, I am not sure if this is quicker than any of the other methods, as we don't know the size of the database and the number of rows likely to be deleted. Also, I would not recommend deploying this stored procedure to the live server, for safety reasons!
HTH