I have a method which does many things, which means many side effects are created. For instance, say a single call to a REST API returns a JSON object with many fields set. If we want to check each individual field, should we have one test method containing many assertEquals calls, or one test method per field validation, each containing a single assertEquals?
Similarly, a method can have many other side effects, e.g. saving to the database, sending emails, etc. In this case should I have one unit test method per side effect?
In addition, if there are multiple test inputs per SUT method, will this affect the decision of how many test methods are created?
Moreover, the side effects might all relate to one story, i.e. the story says these sub-functionalities belong to it. In that case, shouldn't they belong to the same test method? Because if the requirement changes, then all the test methods per side effect need to be changed as well. Would that be manageable?
If we want to check each individual field, should we have one test method containing many assertEquals calls, or one test method per field validation, each containing a single assertEquals?
Worrying about the number of asserts in a test case is like worrying about how many lines are in a function. It's a first order approximation of complexity, but it's not really what you should be concerned with.
What you should be concerned with is how hard that test is to maintain, and how good its diagnostics are. When the test fails, will you be able to figure out why? A lot of that depends on how good your assertEquals is with large data structures.
json = call_rest();
want = { ...whatever you expect... };
assertEquals( json.from_json(), want );
If that just tells you they're not equal, that's not very useful; you then have to go in manually and look at json and want. If it dumps out both data structures, that's also not very useful: you have to look for the differences by eye.
But if it presents a useful diff of the two data structures, that is useful. For example, Perl's Test2 will produce diagnostics like this.
use Test2::Bundle::Extended;
is { foo => 23, bar => 42, baz => 99 },
   { foo => 22, bar => 42, zip => 99 };
done_testing;
# +-------+------------------+---------+------------------+
# | PATH | GOT | OP | CHECK |
# +-------+------------------+---------+------------------+
# | {foo} | 23 | eq | 22 |
# | {zip} | <DOES NOT EXIST> | | 99 |
# | {baz} | 99 | !exists | <DOES NOT EXIST> |
# +-------+------------------+---------+------------------+
Then there's the question of maintenance, and again this comes down to how good your assertEquals is. A single assertEquals is easy to read and maintain.
json = call_rest();
want = { ...whatever you expect... };
assertEquals( json.from_json(), want );
There's very little code, and want is very clear about what is expected.
Multiple asserts, meanwhile, get very wordy.
json = call_rest();
have = json.from_json();
assertEquals( have['thiskey'], 23 );
assertEquals( have['thatkey'], 42 );
assertEquals( have['foo'], 99 );
...and so on...
It can be difficult to know which assert failed: you have to tell by line number or (if your test suite supports it) by manually naming each assert, which is more work, more maintenance, and one more thing to get wrong.
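For example, Python's unittest lets you label each check with subTest so the failure report names the field without a separate test method per field; a minimal sketch (the REST response is faked with a literal dict here):

import unittest

class RestResponseTest(unittest.TestCase):
    def test_fields(self):
        # Stand-in for call_rest().from_json(); a real test would parse the API response.
        have = {"thiskey": 23, "thatkey": 42, "foo": 99}
        expected = {"thiskey": 23, "thatkey": 42, "foo": 99}
        for field, want in expected.items():
            with self.subTest(field=field):
                self.assertEqual(have[field], want)

if __name__ == "__main__":
    unittest.main()

When a field mismatches, the failure is reported with that field's label, and the remaining subtests still run.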
On the other hand, individual assertEquals calls allow for more flexibility. For example, what if you only want to check certain fields? What if there is a range of acceptable values?
# bar just has to contain a number
assert( have['bar'].is_number );
But some test suites support this via a single assert.
want = {
    thiskey: 23,
    thatkey: 42,
    foo: 99,
    bar: is_number
}
assertEquals( json.from_json(), want );
is_number is a special object that tells assert it's not a normal equality check, but to just check that the value is a number. If your test suite supports this style, it's generally superior to writing out a bunch of asserts. The declarative approach means less code to write, read, and maintain.
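This style can also be approximated by hand when the test suite doesn't ship such matchers. A minimal Python sketch, assuming a hand-rolled is_number matcher and a made-up response (none of this is a real library API):

import json
import numbers

class IsNumber:
    # Matcher object: compares equal to any numeric value.
    def __eq__(self, other):
        return isinstance(other, numbers.Number)
    def __repr__(self):
        return "is_number"

is_number = IsNumber()

def test_response_fields():
    # Stand-in for call_rest(); normally this would be the parsed JSON body.
    have = json.loads('{"thiskey": 23, "thatkey": 42, "foo": 99, "bar": 3.14}')
    want = {"thiskey": 23, "thatkey": 42, "foo": 99, "bar": is_number}
    assert have == want

Because the matcher only overrides equality, the whole comparison stays a single assert, and a runner like pytest can still show its usual dict diff.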
The answer is: it depends on how good your testing tools are!
Let's say there is a function:
type config struct {
    percent float64
    flat    float64
}

func Calculate(c *config, arrayofdata []float64) float64 {
    result := 0.0
    for _, data := range arrayofdata {
        value1 := data * c.percent
        value2 := c.flat
        result += math.Min(value1, value2)
    }
    return result
}
This Calculate function simply computes, per data point, the flat value or the percentage of the data, whichever is lower, and aggregates the results.
If I were to write tests for this, how would you do it?
Do I have to have multiple tests, one for each trivial scenario?
Say, when value1 < value2?
TestCalculate/CorrectValue_FlatValueLessThanPercentValue
Say, when value1 > value2?
TestCalculate/CorrectValue_FlatValueEqualToPercentValue
Check that flat is added per data point? So for 3 elements of arrayofdata, result = 3*config.flat?
TestCalculate/CorrectValue_FlatValuePerData
All these seem very trivial and could simply be combined into one test. What is the recommended way?
Like say a test where
config { percent: 1, flat: 20}
And then you pass an arrayofdata where each element exercises one of the cases written above:
arrayofdata: {
    1,  // 1*percent < flat
    40, // 40*percent > flat
}
And the result would be correct if we add up the values, so you also cover the case where there is more than one element in arrayofdata.
Is this a better approach: one test, but combining the details?
And then separate tests for other cases, like zero elements in arrayofdata, etc.
I would recommend following the common practices in Clean Code, by Robert C. Martin. Two of the guidelines discussed there are "One Assert per Test" and "Single Concept per Test".
When we have asserts that test more than one scenario and things start failing, it can become difficult to figure out which part is no longer passing. When that happens, you're going to end up splitting that unit test into three separate tests anyway.
Might as well start them out as three separate tests; just keep your tests clean and expressive so that if you come back to them months later you can figure out which unit test is doing what :)
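To make that concrete, here is a minimal sketch of what those three separate tests might look like (the question's Calculate function is ported to Python purely for illustration, and the test names are hypothetical):

def calculate(percent, flat, data_points):
    # Python port of the question's Calculate: per data point, take the
    # lower of data*percent and flat, then aggregate.
    return sum(min(d * percent, flat) for d in data_points)

def test_percent_value_used_when_below_flat():
    assert calculate(percent=1.0, flat=20.0, data_points=[1.0]) == 1.0

def test_flat_value_used_when_percent_value_above_flat():
    assert calculate(percent=1.0, flat=20.0, data_points=[40.0]) == 20.0

def test_result_is_aggregated_per_data_point():
    assert calculate(percent=1.0, flat=20.0, data_points=[40.0, 40.0, 40.0]) == 60.0

Each test stays small, and a failure immediately tells you which behaviour broke.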
Can I have a one-line regex that matches the values between pipe "|" delimiters, independent of the number of items between the pipes? E.g. I have the following regex:
^(.*?)\|(.*?)\|(.*?)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)$
which works only if I have 12 items. How can I make the same work for e.g. 6 items as well?
([^|]+)+
This is the pattern I've used in the past for that purpose. It matches one or more groups that do not contain the pipe delimiter.
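To see the idea in action outside the tool, here is a quick illustration (Python is used purely for demonstration; the outer capturing group is dropped because findall already returns every match):

import re

value = "foo1|foo2|foo3"

# Each run of characters that contains no pipe is one item, so the number
# of items does not need to be known up front.
items = re.findall(r"[^|]+", value)
print(items)  # ['foo1', 'foo2', 'foo3']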
For Adobe Classification Rule Builder (CRB), there is no way to write a regex that will match an arbitrary number of your pattern and push each match to a $n capture group. Most regex engines do not allow for this, though some languages offer ways to do it more or less effectively, e.g. as returned arrays. But CRB doesn't offer that sort of thing.
But it's mostly pointless to want this anyway, since there's nothing upstream or downstream that dynamically/automatically accommodates this sort of thing.
For example, there's no way in the CRB interface to dynamically populate the output value with an arbitrary $1$2$3[$n..] value, nor is there a way to dynamically generate an arbitrary number of rules in the rule set.
In addition, Adobe Analytics (AA) does not offer arbitrary on-the-fly classification column generation either (unless you want to write a script using the Classification API, but the same cannot be done with CRBs).
For example if you have
s.eVar1='foo1|foo2';
And you want to classify this into 2 classification columns/reports, you have to go and create them in the classification interface. And then let's say your next value sent in is:
s.eVar1='foo1|foo2|foo3';
Well AA does not automatically create a new classification level for you; you have to go in and add a 3rd, and so on.
So overall, even though it is not possible to return an arbitrary number of captured groups $n in a CRB, there isn't really a reason you need to.
Perhaps it would help if you explained what you are actually trying to do overall? For example, what report(s) do you expect to see?
One common reason I see this sort of "wish" come up is when someone wants to track something like header or breadcrumb navigation links that have an arbitrary depth to them. So they push e.g. a breadcrumb
Home > Electronics > Computers > Monitors > LED Monitors
...or whatever to an eVar (but pipe delimited, based on your question), and then they want to break this up into classified columns.
And the problem is, it could be an arbitrary length. But as mentioned, setting up classifications and rules for them doesn't really accommodate this sort of thing.
Usually the best practice for a scenario like this is to look at the raw data and see how many levels represent the bulk of your data, on average. For example, if you look at your raw eVar report and see that, even though values with upwards of 5 or 6 levels can be found, most values on average have between 1 and 3 levels, then you should create 4 classification columns. The first 3 classifications represent the first 3 levels, and the 4th one gets everything else.
So going back to the example value:
Home|Electronics|Computers|Monitors|LED Monitors
You can have:
Level1 => Home
Level2 => Electronics
Level3 => Computers
Level4+ => Monitors|LED Monitors
Then you set up a CRB with 4 rules, one for each of the levels. And you'd use the same regex in all 4 rule rows:
^([^|]+)(?:\|([^|]+))?(?:\|([^|]+))?(?:\|(.+))?
Which will return the following captured groups to use in the CRB outputs:
$1 => Home
$2 => Electronics
$3 => Computers
$4 => Monitors|LED Monitors
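If you want to sanity-check that regex and its captured groups outside the CRB interface, here is a quick sketch (Python is used purely for illustration; the groups mirror the $1-$4 outputs above):

import re

pattern = r"^([^|]+)(?:\|([^|]+))?(?:\|([^|]+))?(?:\|(.+))?"

m = re.match(pattern, "Home|Electronics|Computers|Monitors|LED Monitors")
print(m.groups())
# ('Home', 'Electronics', 'Computers', 'Monitors|LED Monitors')

# Shorter values simply leave the trailing groups empty:
print(re.match(pattern, "Home|Electronics").groups())
# ('Home', 'Electronics', None, None)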
Yeah, this isn't the same as having a classification column for every possible length, but it is more practical, because when it comes to analytics, you shouldn't really try to be too granular about things in the first place.
But if you absolutely need to have something for every possible number of delimited values, you will need to find out what the maximum possible is and make that many, hard-coded.
Or as an alternative to classifications, consider one of the following alternatives:
Use a list prop
Use a list variable (e.g. list1)
Use a Merchandising eVar (product variable syntax)
This isn't exactly the same thing, and they each have their caveats, but you didn't provide details for what you are ultimately trying to get out of the reports, so this may or may not be something you can work with.
Well anyways, hopefully some of this is food for thought for you.
I am looking for a way to quickly and efficiently look up duplicate entries in some sort of table structure. Here's my problem:
I have objects of type Cargo where each Cargo has a list of other associated Cargo. Each cargo also has a type associated with it.
so for example:
class Cargo {
    int cargoType;
    std::list<Cargo*> associated;
};
Now, for each cargo object and its associated cargo, there is a certain value assigned based on their types. This evaluation is performed by classes that implement CargoEvaluator.
I also have a CargoEvaluatorManager which basically handles connecting everything together. CargoEvaluators are registered with the CargoEvaluatorManager; then, to evaluate cargo, I call CargoEvaluatorManager.EvaluateCargo(Cargo* cargo).
Here's the current state:
class CargoEvaluatorManager {
    std::vector<std::vector<CargoEvaluator*>> evaluators;

public:
    double EvaluateCargo(Cargo* cargo)
    {
        double value = 0.0;
        for (auto& associated : cargo->associated) {
            auto evaluator = evaluators[cargo->cargoType][associated->cargoType];
            if (evaluator != nullptr)
                value += evaluator->Evaluate(cargo, associated);
        }
        return value;
    }
};
So to recap and mention a few extra points:
CargoEvaluatorManager stores CargoEvaluators in a 2-D array using cargo types as indices. The entire 2-D vector is initialized with nullptrs. When a CargoEvaluator is registered, resizing the array and the other bits and pieces I haven't shown here are handled appropriately.
I had tried using a map with std::pair as a key to look up different evaluators, but it was too slow.
This only allows one CargoEvaluator per combination of cargo types. I potentially want to have multiple cargo evaluators as well.
I am calling EvaluateCargo tens of billions of times. I am aware my current implementation is not the most efficient, and I am looking for alternatives.
What I am looking for
As stated above, I want to do much of what I've outlined, with the exception that I want to allow multiple Evaluators for each pair of Cargo types. What I envision, naively, is a table like this:
--------------------------------------------
|int type 1 | int type 2 | CargoEvaluator* |
|------------------------------------------|
| 5 | 7 | Eval1* |
| 5 | 7 | Eval2* |
| 4 | 6 | Eval3* |
--------------------------------------------
The lookup should be symmetric in that the set (5,7) and (7,5) should resolve to the same entries. For speed I don't mind preloading duplicate entries.
There are maybe 3x or more associated Cargo in the list than there are Evaluators, if that factors into things.
Performance is crucial, as mentioned above!
For bonus points, each cargo evaluator may have an additional penalty value associated with it that does not depend on how many associated Cargo a Cargo has. In other words: if a row in the table above is looked up, I want to call double Evaluator->AddPenalty() once and only once each time EvaluateCargo is called. I cannot store any instance variables, since that would cause multithreading issues.
One added constraint is that I need to be able to identify the CargoEvaluators associated with a particular cargo type, meaning that hashing the two cargo types together is not a viable option.
If any further clarification is needed, I'll gladly try to help.
Thank you all in advance!
While creating train, test & cross-validation samples in Python, I see the default method as:
1. Reading the dataset, after skipping headers
2. Creating the train, test and cross-validation samples
import csv
with open('C:/Users/Train/Trainl.csv', 'r') as f1:
    next(f1)
    reader = csv.reader(f1, delimiter=',')
    input_set = []
    for row in reader:
        input_set.append(row)

import numpy as np
from numpy import genfromtxt
from sklearn import cross_validation
train, intermediate_set = cross_validation.train_test_split(input_set, train_size=0.6, test_size=0.4)
cv, test = cross_validation.train_test_split(intermediate_set, train_size=0.5, test_size=0.5)
My problem, though, is that I have a field, say "A", in the csv file that I read into the numpy array, and all sampling should respect this field. That is, all entries with the same value of "A" should go in one sample.
Line # | A | B | C | D
1      | 1 |
2      | 1 |
3      | 1 |
4      | 1 |
5      | 2 |
6      | 2 |
7      | 2 |
Required: lines 1, 2, 3, 4 should all go into one sample, and lines 5, 6, 7 should all go into one sample.
The value of column A is a unique id corresponding to one single entity (it could be seen as cross-section data points on one SINGLE user, so it MUST go into one unique sample of train, test, or cv), and there are many such entities, so grouping by entity id is required.
Columns B, C, D may have any values, but grouping preservation is not required on them. (Bonus: can I group the sampling on multiple fields?)
What I tried:
A. Finding all unique values of A and denoting these as my sample, I distribute them among train, intermediate (cv & test), and then put the rest of the rows for each value of "A" into the corresponding file.
That is, if train had the entry for "3", test for "2" and cv for "1", then all rows with A equal to 3 go into train, all with 2 go into test, and all with 1 go into cv.
Of course this approach is not scalable.
And I suspect it may introduce bias into the datasets, since the number of 1's, 2's, etc. in column A is not equal, meaning this approach will not work!
B. I also tried numpy.random.shuffle and numpy.random.permutation as per the thread Numpy: How to split/partition a dataset (array) into training and test datasets for, e.g., cross validation?, but they did not meet my requirement.
C. A third option, of course, is writing a custom function that does this grouping and then balances the training, test and cv data-sets based on the number of data points in each group. But I'm just wondering whether there's already an efficient way to implement this?
Note that my data set is huge, so ideally I would like a deterministic way to partition my datasets, without multiple eye-ball scans to be sure that the partition is correct.
EDIT Part 2:
Since I did not find anything that fit my sampling criteria, I wrote a module to sample with grouping constraints. Here is the GitHub code for it. The code was not written with very large data in mind, so it's not very efficient. Should you fork this code, please point out how I can improve the run-time.
https://github.com/ekta1007/Sampling-techniques/blob/master/sample_expedia.py
By forcing such constraints you will introduce bias into your procedure either way. So an approach based on partitioning the "users" and then collecting their respective "measurements" does not seem bad. And it will scale just fine: this is an O(n) method, and the only reason for it not to scale up would be a bad implementation, not a bad method.
The reason there is no such functionality in existing methods (like the sklearn library) is that it looks highly artificial and runs counter to the idea behind machine learning models. If these rows somehow form one entity, then they should not be treated as separate data points. If you do need this separate representation, then requiring such a division, where a particular entity cannot be partially in the test set and partially in training, will surely bias the whole model.
To sum up: you should really analyze deeply whether your approach is reasonable from the machine learning point of view. If you are sure about it, I think the only possibility is to write the segmentation yourself, as even though I have used many ML libraries in the past, I've never seen such functionality.
In fact, I am not sure whether the problem of segmenting a set of N numbers (the sizes of the entities) into K (=3) subsets with given sum proportions, with uniform distribution when treated as a random process, is not an NP problem in itself. If you cannot guarantee a uniform distribution, then your datasets cannot be used as a statistically correct way of training/testing/validating your model. Even if it has a reasonable polynomial solution, it can still scale badly (much worse than linear methods). This doubt applies if your constraints are "strict"; if they are "weak" you can always take a "generate and reject" approach, which should have amortized linear complexity.
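If, after that analysis, you still decide the grouping is justified, the segmentation you write yourself can stay fairly small: shuffle and split the entity ids rather than the rows, then collect each id's rows. A rough, illustrative sketch (it balances the number of entities, not the number of rows, so group-size skew is not corrected; all names are made up):

import random

def grouped_split(rows, key_index=0, fractions=(0.6, 0.2, 0.2), seed=42):
    # Split rows into train/cv/test so that all rows sharing the same key
    # (e.g. column "A") land in the same partition. Deterministic for a fixed seed.
    keys = sorted({row[key_index] for row in rows})
    random.Random(seed).shuffle(keys)

    n_train = int(fractions[0] * len(keys))
    n_cv = int(fractions[1] * len(keys))
    train_keys = set(keys[:n_train])
    cv_keys = set(keys[n_train:n_train + n_cv])

    train = [r for r in rows if r[key_index] in train_keys]
    cv = [r for r in rows if r[key_index] in cv_keys]
    test = [r for r in rows if r[key_index] not in train_keys and r[key_index] not in cv_keys]
    return train, cv, test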
I was also facing a similar kind of issue. Though my coding is not too good, I came up with the solution given below:
Created a new data frame that only contains the Unique Id of the df and removed duplicates.
new = df[["Unique_Id"]].copy()
New_DF = new.drop_duplicates()
Created training and test sets on the basis of New_DF:
train, test = train_test_split(New_DF, test_size=0.2)
And then merged those training and test sets with the original df.
df_Test = pd.merge(df, test, how='inner', on='Unique_Id')
df_Train = pd.merge(df, train, how='inner', on='Unique_Id')
Similarly, we can create a sample for the validation part too.
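For instance, splitting the unique ids once more gives a validation frame as well; a sketch assuming the same df and New_DF as above (the import path shown is the modern sklearn location):

import pandas as pd
from sklearn.model_selection import train_test_split

# Split the ids: 20% test, then 25% of the remaining 80% (= 20% overall) for validation.
train_val_ids, test_ids = train_test_split(New_DF, test_size=0.2)
train_ids, val_ids = train_test_split(train_val_ids, test_size=0.25)

df_Train = pd.merge(df, train_ids, how='inner', on='Unique_Id')
df_Val = pd.merge(df, val_ids, how='inner', on='Unique_Id')
df_Test = pd.merge(df, test_ids, how='inner', on='Unique_Id')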
Cheers.
I'm trying to turn XML files into SQL statements within Eclipse, to save on the manual work of converting large files.
A line such as:
<TABLE_NAME COL_1="value1" COL_2="value2"/>
should be converted to:
insert into TABLE_NAME (COL_1, COL_2) values ("value1", "value2");
So far, I have managed to match and capture the table name and the first column/value pair with:
<(\w+)( \w+)=(".+?").*/>
The .* near the end is just there to test the first part of the pattern and should be removed once it is complete.
This replace pattern then yields the following result:
insert into $1 ($2) values ($3);
insert into TABLE_NAME ( COL_1) values ("value1");
The problem I'm having is that the number of columns differs between tables, so I would like a generic pattern that matches n column/value pairs and lets me reuse the captured groups in the replace pattern. I haven't managed to work out how to do this yet, although \G seems to be a good candidate.
Ideally, this would be solved in a single regexp statement, though I also wouldn't be against multiple statements that have to be run in sequence (though I really don't want to have to force the developer to execute once for every column/value pair).
Anyone have any ideas?
In the end, I did not solve this with regular expressions as originally envisaged. Considering alternative solutions, it was solved by refactoring test code to make it reusable and then co-opting dbunit, using tests to write data to the different stages of the development databases.
I know that this is a misuse of tests, but it does make the whole thing much simpler, since the user gets green/red feedback as to whether the data was inserted or not, and the same test data can be inserted for manual acceptance tests as is used for the automated component/integration tests (which used dbunit with an in-memory database). As long as these fake test classes are not added to the test suite, they are only executed by hand.
The resulting fake tests ended up being extremely simple and lightweight, for instance:
public class MyComponentTestDataInserter extends DataInserter {

    @Override
    protected String getDeleteSql() {
        return "./input/component/delete.sql";
    }

    @Override
    protected String getInsertXml() {
        return "./input/component/TestData.xml";
    }

    @Test
    public void insertIntoDevStage() {
        insertData(Environment.DEV);
    }

    @Test
    public void insertIntoTestStage() {
        insertData(Environment.TEST);
    }
}
This also had the additional advantage that a developer can insert the data into a single environment by simply executing a single test method via the context menu in the IDE. At the same time, by running the whole test class, the exact same data can be deployed to all environments simultaneously.
Also, a failure at any point during data cleanup or insertion causes the whole transaction to be rolled back, preventing an inconsistent state.