Numerical regression testing - C++

I'm working on a scientific computing code (written in C++), and in addition to performing unit tests for the smaller components, I'd like to do regression testing on some of the numerical output by comparing to a "known-good" answer from previous revisions. There are a few features I'd like:
Allow comparing numbers to a specified tolerance (for both roundoff error and looser expectations)
Ability to distinguish between ints, doubles, etc., and to ignore text if necessary
Well-formatted output to tell what went wrong and where: in a multi-column table of data, only show the column entry that differs
Return EXIT_SUCCESS or EXIT_FAILURE depending on whether the files match
Are there any good scripts or applications out there that do this, or will I have to roll my own in Python to read and compare output files? Surely I'm not the first person with this kind of requirement.
[The following is not strictly relevant, but it may factor into the decision of what to do. I use CMake and its embedded CTest functionality to drive unit tests that use the Google Test framework. I imagine that it shouldn't be hard to add a few add_custom_command statements in my CMakeLists.txt to call whatever regression software I need.]

You should go for PyUnit, which is now part of the standard library under the name unittest. It supports everything you asked for. The tolerance check, e.g., is done with assertAlmostEqual().
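For instance, a tolerance check with unittest might look like the sketch below (the file names, the one-number-per-line format, and the delta scaling are my assumptions, not part of the original answer):

import unittest

class RegressionTest(unittest.TestCase):
    def test_output_matches_reference(self):
        # hypothetical format: one number per line in each file
        with open("expected.txt") as f:
            expected = [float(x) for x in f]
        with open("actual.txt") as f:
            actual = [float(x) for x in f]
        self.assertEqual(len(expected), len(actual))
        for e, a in zip(expected, actual):
            # delta is an absolute tolerance; scaling it by the reference
            # value approximates a relative tolerance
            self.assertAlmostEqual(e, a, delta=max(1e-12 * abs(e), 1e-15))

if __name__ == "__main__":
    unittest.main()

Run with python -m unittest, a failing comparison reports the offending pair of values.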

The ndiff utility may be close to what you're looking for: it's like diff, but it will compare text files of numbers to a desired tolerance.

I ended up writing a Python script to do more or less what I wanted.
#!/usr/bin/env python
import sys
import re
from optparse import OptionParser
from math import fabs

splitPattern = re.compile(r',|\s+|;')

class FailObject(object):
    def __init__(self, options):
        self.options = options
        self.failure = False

    def fail(self, brief, full=""):
        print(">>>> ", brief)
        if self.options.verbose and full != "":
            print("     ", full)
        self.failure = True

    def exit(self):
        if self.failure:
            print("FAILURE")
            sys.exit(1)
        else:
            print("SUCCESS")
            sys.exit(0)

def numSplit(line):
    fields = splitPattern.split(line)
    if fields[-1] == "":
        del fields[-1]
    return [float(a) for a in fields]

def softEquiv(ref, target, tolerance):
    if fabs(target - ref) <= fabs(ref) * tolerance:
        return True
    # if the reference number is zero, fall back to an absolute tolerance
    if ref == 0.0:
        return fabs(target) <= tolerance
    # reference is non-zero and the relative test failed
    return False

def compareStrings(f, options, expLine, actLine, lineNum):
    ### check that they're a bunch of numbers
    try:
        exp = numSplit(expLine)
        act = numSplit(actLine)
    except ValueError:
        # the line is text rather than numbers
        if expLine != actLine and options.checkText:
            f.fail("Text did not match in line %d" % lineNum)
        return

    ### check the ranges
    if len(exp) != len(act):
        f.fail("Wrong number of columns in line %d" % lineNum)
        return

    ### soft equiv on each value
    for col in range(len(exp)):
        if not softEquiv(exp[col], act[col], options.tol):
            f.fail("Non-equivalence in line %d, column %d"
                   % (lineNum, col))

def run(expectedFileName, actualFileName, options):
    # message reporter
    f = FailObject(options)
    expected = open(expectedFileName)
    actual = open(actualFileName)
    lineNum = 0
    while True:
        lineNum += 1
        expLine = expected.readline().rstrip()
        actLine = actual.readline().rstrip()

        ## check that the files haven't ended,
        #  or that they ended at the same time
        #  (note: a blank line is treated the same as end-of-file here)
        if expLine == "":
            if actLine != "":
                f.fail("Tested file ended too late.")
            break
        if actLine == "":
            f.fail("Tested file ended too early.")
            break

        compareStrings(f, options, expLine, actLine, lineNum)
        #print("%3d: %s|%s" % (lineNum, expLine[0:10], actLine[0:10]))
    f.exit()

################################################################################
if __name__ == '__main__':
    parser = OptionParser(usage="%prog [options] ExpectedFile NewFile")
    parser.add_option("-q", "--quiet",
                      action="store_false", dest="verbose", default=True,
                      help="Don't print status messages to stdout")
    parser.add_option("--check-text",
                      action="store_true", dest="checkText", default=False,
                      help="Verify that lines of text match exactly")
    parser.add_option("-t", "--tolerance",
                      action="store", type="float", dest="tol", default=1.e-15,
                      help="Relative error when comparing doubles")
    (options, args) = parser.parse_args()

    if len(args) != 2:
        print("Usage: numdiff.py EXPECTED ACTUAL")
        sys.exit(1)

    run(args[0], args[1], options)
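Typical invocation, per the option parser above: python numdiff.py expected.out actual.out --tolerance 1e-12. The script exits with status 0 on success and 1 on failure, so CTest (which treats a nonzero exit code as a failed test) can run it directly.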

I know I'm quite late to the party, but a few months ago I wrote the nrtest utility in an attempt to make this workflow easier. It sounds like it might help you too.
Here's a quick overview. Each test is defined by its input files and its expected output files. Following execution, output files are stored in a portable benchmark directory. A second step then compares this benchmark to a reference benchmark. A recent update has enabled user extensions, so you can define comparison functions for your custom data.
I hope it helps.

Related

Searching for a string in a file with Python

I need pretty much what 'grep -i str file' gives back, but I have been hitting my head against this issue for ages.
I have a function called 'siteLookup' to which I pass two parameters: a string 's' and a file handle 'f'.
I want to (a) determine whether there is a (single) occurrence of the string (in this example, site="XX001"),
and (b) if it is found, take the line it was found in, extract another field value from that line, and return it to the caller (it is a CSV lookup). I have had this working intermittently, but then it stops working and I cannot see why.
I have tried all of the different 'open' options, including f.readlines etc.
#example line: 'XX001,-1,10.1.1.1/30,By+Location/CC/City Name/'
#example from lookupFile.csv: "XX001","Charleston","United States"
sf = open('lookupFile.csv')

def siteLookup(s, f):
    site = s.split(',')[0].strip().upper()
    if len(site) == 5:
        f.seek(0)
        for line in f:
            if line.find(site) >= 0:
                city = line.split(',')[1].strip('"').upper()
                return city
            # else site not found
            return -1
    else:  # len(site) != 5
        return -1

city = siteLookup(line, sf)
print(city)
sf.close()
I am getting zero matches with this code. (I have simplified the example to a single search.) I am expecting to get back the name of the city that matches a 5-character site code; the site code is the first field in the example "line".
Any assistance much appreciated.
Your return is wrongly indented: if the string you are looking for is not in the first line, the function returns -1 without looking any further.
Use with open(...) as f: to make the file handling safer:
with open("lookupFile.csv","w") as f:
f.write("""#example from lookupFile.csv:
"XX001","Charleston","United States"
""")
def siteLookup(s, f):
site = s.split(',')[0].strip().upper()
if len(site) == 5:
f.seek(0)
for line in f:
if site in line: # if site in line is easier for checking
city = line.split(',')[1].strip('"').upper()
return city
# wrongly indented - will return if site not in line
# return -1
# if too short or not found, return -1 - no need for 2 returns
return -1
line = 'XX001,-1,10.1.1.1/30,By+Location/CC/City Name/'
with open('lookupFile.csv') as sf:
city = siteLookup(line, sf)
print(city)
Output:
CHARLESTON
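Since this is a CSV lookup, the csv module can also handle the quoting for you; here is a minimal sketch (the function name and field positions are my assumptions based on the example data):

import csv

def site_lookup_csv(site_code, path='lookupFile.csv'):
    # csv.reader strips the surrounding quotes, so no manual .strip('"')
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if row and row[0].upper() == site_code.upper():
                return row[1].upper()   # the city field
    return -1

print(site_lookup_csv('XX001'))   # CHARLESTON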

Python v3 inconsistent regex match returns

I'm writing a small Python script that takes a log file, matches strings within it, and saves them, together with a custom marker string "goal ", to another text file. Then I take some values from the second file and add them to a list. The problem is that the list of values varies in length depending on the length of the custom string (e.g. "goalgoalgoal "): the log file I'm currently working with contains 1031 matches of the string "goal ", but the length of the list varies between roughly 980 and 1029.
Here is the code:
for line in inputfile:
    if "Started---" in line:
        startTime = line[11:23]
        testfile.write("\n"+"Start"+"\n"+"goal "+ startTime+"\n")
        counterLines += 1
    elif "done!" in line:
        testfile.write("\n"+find_between(line, "| ", "done!")+"\n")
    elif "Errors:" in line:
        testfile.write("\n"+"Errors:"+line.split("Errors:",1)[1]+"\n")
    elif "Warnings:" in line:
        testfile.write("\n"+"Warnings:"+line.split("Warnings:",1)[1]+"\n")
    elif "Successes:" in line:
        testfile.write("\n"+"Successes:"+line.split("Successes:",1)[1]+"\n")
    elif "END---" in line:
        endTime = line[11:23]
        testfile.write("\n"+"End"+"\n"+"endTime "+ endTime+"\n")
    else:
        print("nothing found")

testfileread = open(filePath+"\\testFile.txt", "r")
startTimesList = []
endTimesList = []
for line in testfileread:
    matchObj = re.match(r'goal', line)
    if matchObj:
        startTimesList.append(line)
print(len(startTimesList))
Do you have ideas why the code doesn't work as expected?
Thank you in advance!
Most probably it's because you don't flush testFile.txt after the writing is completed; as a result, an unpredictable amount of buffered data has actually reached the file when you start reading it. Calling testfile.flush() should fix the problem. Alternatively, wrap the writing logic in a with block, which closes (and flushes) the file automatically.
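A minimal sketch of the with-block variant (the data and paths here are placeholders, not the question's actual log format):

lines = ["goal 12:00:00.000\n", "noise\n", "goal 13:00:00.000\n"]

with open("testFile.txt", "w") as testfile:
    for line in lines:
        if line.startswith("goal"):
            testfile.write(line)
# leaving the with block closes (and flushes) testfile,
# so the read below sees every line that was written

with open("testFile.txt") as testfileread:
    startTimesList = [line for line in testfileread if line.startswith("goal")]
print(len(startTimesList))   # consistently 2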

os.walk set start and end point - Python

I'm trying to find out how to stop an os.walk after it has walked through a particular file.
I have a directory of log files organized by date. I'm trying to replace grep searches allowing a user to find ip addresses stored in a date range they specify.
The program will take the following arguments:
-i ipv4 or ipv6 address with subnet
-s start date ie 2013/12/20 matches file structure
-e end date
I'm assuming that, because of the topdown option, there is some logic that should allow me to declare an endpoint. What is the best way to do this? I'm thinking a while loop.
I apologize in advance if something is off with my question. Just checked blood sugar, it's low 56, gd type one.
Additional information
The file structure will be situated in flows/index_border as such
2013
--01
--02
----01
----...
----29
2014
Hope this is clear: year folders contain month folders, which contain day folders, which contain hourly files. Dates increase downwards.
The end date will need to be inclusive. (I didn't focus too much on that, because I can just add code to move one day up.)
I have been trying to write a date-range function; I was surprised not to see one in the datetime docs, since it seems like it would be useful.
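For what it's worth, a minimal helper of that kind is easy to write; the sketch below is mine, not part of the original post:

import datetime

def date_range(start, end):
    # yield each date from start to end inclusive
    for n in range((end - start).days + 1):
        yield start + datetime.timedelta(days=n)

for d in date_range(datetime.date(2013, 12, 20), datetime.date(2013, 12, 22)):
    print(d)   # 2013-12-20, 2013-12-21, 2013-12-22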
import os, gzip, netaddr, datetime, argparse

startDir = '.'

def sdate_format(s):
    try:
        return datetime.datetime.strptime(s, '%Y/%m/%d').date()
    except ValueError:
        msg = "Bad start date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

def edate_format(e):
    try:
        return datetime.datetime.strptime(e, '%Y/%m/%d').date()
    except ValueError:
        msg = "Bad end date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

parser = argparse.ArgumentParser(description='Locate IP address in log files for a particular date or date range')
parser.add_argument('-s', '--start_date', action='store', type=sdate_format, dest='start_date', help='The first date in range of interest.')
parser.add_argument('-e', '--end_date', action='store', type=edate_format, dest='end_date', help='The last date in range of interest.')
parser.add_argument('-i', action='store', dest='net', help='IP address or address range, IPv4 or IPv6 with optional subnet accepted.', required=True)
results = parser.parse_args()

start = results.start_date
end = results.end_date
target_IP = results.net

startDir = '/flows/index_border/{0}/{1:02d}/{2:02d}'.format(start.year, start.month, start.day)
print('searching...')
for root, dirs, files in os.walk(startDir):
    for contents in files:
        if contents.endswith('.gz'):
            f = gzip.open(os.path.join(root, contents), 'r')
        else:
            f = open(os.path.join(root, contents), 'r')
        text = f.readlines()
        f.close()
        for line in text:
            for address_item in netaddr.IPNetwork(target_IP):
                if str(address_item) in line:
                    print line,
You need to describe what works or does not work. The argparse part of your code looks fine, though I haven't done any testing. The use of type= is refreshingly correct. :) (Posters often misuse that parameter.)
But as for the stopping, I'm guessing you could do:
endDir = '/flows/index_border/{0}/{1:02d}/{2:02d}'.format(end.year, end.month, end.day)
for root, dirs, files in os.walk(startDir):
    for contents in files:
        ....
    if endDir in <something based on dirs and files>:
        break
I don't know enough about your file structure to be more specific, and it's been some time since I worked with os.walk. In any case, I think a conditional break is the way to stop the walk early.
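For instance, a conditional break could look like the following sketch (my illustration, not tested against the poster's tree: the paths are made up and the print stands in for real processing):

import os

stop_dir = os.path.join('flows', 'index_border', '2013', '12', '20')

for root, dirs, files in os.walk('flows/index_border'):
    dirs.sort()   # visit the date-named folders in chronological order
    for name in files:
        print(os.path.join(root, name))   # placeholder for real processing
    if root == stop_dir:
        break     # the conditional break ends the walk early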
#!/usr/bin/env python
import os, gzip, netaddr, datetime, argparse, sys

searchDir = '.'
searchItems = []

def sdate_format(s):
    try:
        return datetime.datetime.strptime(s, '%Y/%m/%d').date()
    except ValueError:
        msg = "Bad start date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

def edate_format(e):
    try:
        return datetime.datetime.strptime(e, '%Y/%m/%d').date()
    except ValueError:
        msg = "Bad end date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

parser = argparse.ArgumentParser(description='Locate IP address in log files for a particular date or date range')
parser.add_argument('-s', '--start_date', action='store', type=sdate_format, dest='start_date',
                    help='The first date in range of interest.', required=True)
parser.add_argument('-e', '--end_date', action='store', type=edate_format, dest='end_date',
                    help='The last date in range of interest.', required=True)
parser.add_argument('-i', action='store', dest='net',
                    help='IP address or address range, IPv4 or IPv6 with optional subnet accepted.', required=True)
results = parser.parse_args()

start = results.start_date
end = results.end_date + datetime.timedelta(days=1)  # make the end date inclusive
target_IP = results.net
dateRange = end - start

for addressOfInterest in netaddr.IPNetwork(target_IP):
    searchItems.append(str(addressOfInterest))

print('searching...')
for eachDay in range(dateRange.days):
    period = start + datetime.timedelta(days=eachDay)
    searchDir = '/flows/index_border/{0}/{1:02d}/{2:02d}'.format(period.year, period.month, period.day)
    for contents in os.listdir(searchDir):
        if contents.endswith('.gz'):
            f = gzip.open(os.path.join(searchDir, contents), 'rb')
        else:
            f = open(os.path.join(searchDir, contents), 'r')
        text = f.readlines()
        f.close()
        for addressOfInterest in searchItems:
            for line in text:
                if addressOfInterest in line:
                    print contents
                    print line,
I was banging my head because I thought I was printing a duplicate; it turns out the file I was given to test contains duplicates. I ended up removing os.walk because of the predictable structure of the file system, but @hpaulj did provide a correct solution. Much appreciated!

MT4 Probability Script

I'm fairly new to Python. I have made a simple script that imports price feeds from MT4.
My idea/project is to turn this into some sort of probability indicator: one that gives, alongside the bid and ask, the probability of direction. For example:
          TIME    BID            ASK
USD/CAD   22:19   1.30451 60%^   1.30439 40%v
The probability changes within a specific period, e.g. a 1-hour period, so every hour it gives a new probability for the direction.
It looks for two patterns, A and B:
Pattern A represents a bullish pattern.
Pattern B represents a bearish pattern.
Basically, it is looking at how strong the probability is of A or B recurring, and which of the two has the higher chance of recurring.
Here is where I am stuck: I have no idea how to put that together...
Here is what I have so far:
import datetime
import numpy as np
import pandas as pd
import sklearn
from pandas.io.data import DataReader
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.lda import LDA
from sklearn.metrics import confusion_matrix
from sklearn.qda import QDA
from sklearn.svm import LinearSVC, SVC
import dde_client as ddec
import time

QUOTE_client = ddec.DDEClient('MT4', 'QUOTE')
symbols = ['USDCAD', 'GOLD', 'EURUSD', 'SILVER', 'US30Cash']

for i in symbols:
    QUOTE_client.advise(i)

def Get_quote():
    while 1:
        time.sleep(1)
        print "Symbol\tDATE\t\tTIME\tBID\tASK"
        for i in symbols:
            current_quote = QUOTE_client.request(i).split(" ")
            day, _time, bid, ask = (current_quote[0], current_quote[1],
                                    current_quote[2], current_quote[3])
            print i + ":\t" + day + "\t" + _time + "\t" + bid + "\t" + ask
        break
    return Get_quote()

def create_lagged_series(current_quote, start_time, end_time, lags=1):
    ts = DataReader(current_quote, symbols,
                    start_time - datetime.timedelta(hours=1),
                    end_time)
    tslag = pd.DataFrame(index=ts.index)
    ts['Today'] = ts['Adj Close']
    tslag["Volume"] = ts["Volume"]
    for i in xrange(0, lags):
        tslag["Lag%s" % str(i+1)] = ts["Adj Close"].shift(i+1)
    tsret = pd.DataFrame(index=tslag.index)
    tsret["Volume"] = tslag["Volume"]
    tsret["Today"] = tslag["Today"].pct_change()*100.0
    # clamp tiny returns so downstream classifiers don't choke on zeros
    for i, x in enumerate(tsret["Today"]):
        if abs(x) < 0.0001:
            tsret["Today"][i] = 0.0001
    for i in xrange(0, lags):
        tsret["Lag%s" % str(i+1)] = \
            tslag["Lag%s" % str(i+1)].pct_change()*100.0
    tsret["Direction"] = np.sign(tsret["Today"])
    tsret = tsret[tsret.index >= start_time]
    return tsret

if __name__ == "__main__":
    snpret = create_lagged_series('GOLD', Get_quote(), 1)
    X = snpret[["Lag1", "Lag2"]]
    y = snpret["Direction"]
    start_test = current_quote
    X_train = X[X.index < start_test]
    X_test = X[X.index >= start_test]
    y_train = y[y.index < start_test]
    y_test = y[y.index >= start_test]
    print "Hit Rates/Confusion Matrices:\n"
    models = [("LR", LogisticRegression()),
              ("LDA", LDA()),
              ("QDA", QDA()),
              ("LSVC", LinearSVC()),
              ("RSVM", SVC(C=1000000.0, cache_size=200, class_weight=None,
                           coef0=0.0, degree=3, gamma=0.0001, kernel='rbf',
                           max_iter=-1, probability=False, random_state=None,
                           shrinking=True, tol=0.001, verbose=False)),
              ("RF", RandomForestClassifier(n_estimators=1000, criterion='gini',
                                            max_depth=None, min_samples_split=2,
                                            min_samples_leaf=1, max_features='auto',
                                            bootstrap=True, oob_score=False,
                                            n_jobs=1, random_state=None, verbose=0))]
    # Iterate through the models
    for m in models:
        # Train each of the models on the training set
        m[1].fit(X_train, y_train)
        pred = m[1].predict(X_test)
        print "%s:\n%0.3f" % (m[0], m[1].score(X_test, y_test))
        print "%s\n" % confusion_matrix(pred, y_test)
Here is just my MT4 price feed script on its own:
import dde_client as ddec
import time

__author__ = 'forex Ticker'
print __author__

QUOTE_client = ddec.DDEClient('MT4', 'QUOTE')
symbols = ['USDCAD', 'GOLD', 'EURUSD', 'SILVER', 'US30Cash']
for i in symbols:
    QUOTE_client.advise(i)

while 1:
    time.sleep(1)
    print "Symbol\tDATE\t\tTIME\tBID\tASK"
    for i in symbols:
        current_quote = QUOTE_client.request(i).split(" ")
        day, _time, bid, ask = (current_quote[0], current_quote[1],
                                current_quote[2], current_quote[3])
        print i + ":\t" + day + "\t" + _time + "\t" + bid + "\t" + ask
    break
As to fixing your algorithm: rather than doing it from scratch and hacking various libraries together, I suggest you start from a working example and modify it to your liking. You don't have to understand it completely, but you need something to start with.
I would even abstract away the MT4 and quote-reading logic at first, and just put some test numbers (fake or sampled) in a CSV or TXT file. Start from a working example that can recognize A and B patterns in this file (see the sketch below). Having that, define how your own algorithm differs and adjust it.
When it's working, the final step is the MT4 integration. The DDE server looks like it is only meant for data export, not for building an indicator. Consider an alternative framework for linking MT4 to Python: not only can you build an on-chart indicator with it, you can also perform automatic trading using your algorithm.
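To make that "working example first" advice concrete, here is an entirely illustrative sketch: it reduces "pattern A/B" to an up or down move between consecutive prices read from a hypothetical prices.csv with one closing price per row, and measures how often each move repeats:

import csv

with open('prices.csv') as f:
    closes = [float(row[0]) for row in csv.reader(f)]

# label each move: A = up (bullish), B = down (bearish)
moves = ['A' if b > a else 'B' for a, b in zip(closes, closes[1:])]

counts = {'A': 0, 'B': 0}
repeats = {'A': 0, 'B': 0}
for prev, cur in zip(moves, moves[1:]):
    counts[prev] += 1
    if prev == cur:
        repeats[prev] += 1

for p in ('A', 'B'):
    if counts[p]:
        print('P(%s recurs) = %.0f%%' % (p, 100.0 * repeats[p] / counts[p]))

Once a toy like this matches your actual definition of the patterns, swapping the CSV for the live feed becomes a separate, smaller problem.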
Q: How to put that together...?
A: Have a realistic plan - best made before one puts money on the table. That can save you from even starting to build nonsense, or from aiming at unrealistic targets.
No one would be harmed if the plan were the first working document, elaborated and agreed by all parties involved, on HOW the "great and cool new disruptive vision" WILL BE CREATED.
Organise your further work in steps, and always add budget controls, be it in [man*weeks] or [k$], that one is willing to spend on each item.
That way one is able to decide on the feasibility and survivability of the initially great & cool idea.
Plan carefully inside the principal phases, on the MQL4/5 side, the Python side, and the other components:
X [man*weeks] on [System Integration Architecture],
Y [man*weeks] on [Integration Model Design],
Z [man*weeks] on [Integration Model Prototype],
U [man*weeks] on [Integration Model Testing],
V [man*weeks] on [Integration Model Release],
W [man*weeks] on [Integration Model Production Ecosystem]
S [man*weeks] on [Design Cycles on finding best Predictions Model]
T [man*weeks] on [Design Cycles on finding good Trading Strategies for Predictions]
Items not to be forgotten, to be overcome in the early architecture decisions:
0) Forget about reusing MQL4/5 examples; you would put yourself at risk in a sub-millisecond domain battle with hundreds of millions of USD in fight and motion.
1) Forget about using a Custom Indicator in the MQL4/5 MetaTrader Terminal (it is blocking).
2) Forget about DDE integration; some operating systems do not support it at all.
3) Forget about pandas (even for AI/ML-model prototyping), as nanoseconds matter a lot in the ML process; pandas is a great toy, but not for the performance that real-world trading needs for ML-model tuning.
4) Forget about start-end logic; the AI/ML engines have to be separate in order to train/validate/test efficiently for their best generalisation abilities in vast hyperparameter state-spaces. A "for m in models:" loop can be in the source code, but not in reality: one instrument may take (and does take) a few tens of [CPU-core*days] of runtime in parameter optimisation on COTS hardware, so count with realistic numbers here, for proper budgeting of each of the [S]+[T] cycles.
Epilogue:
Anyway, it can be a smart programme, if approved as financially feasible. You may also like other posts on low-latency MT4-AI/ML integration for algorithmic trading.

Conditional quit multiprocess in Python

I'm trying to build a python script that runs several processes in parallel. Basically, the processes are independent, work on different folders and leave their output as text files in those folders. But in some special cases, a process might terminate with a special (boolean) status. If so, I want all the other processes to terminate right away. What is the best way to do this?
I've fiddled with multiprocessing.Condition() and multiprocessing.Manager after reading the excellent tutorial by Doug Hellmann:
http://pymotw.com/2/multiprocessing/communication.html
However, I do not seem to understand how to get a multiprocessing process to monitor a status indicator and quit if it takes a special value.
To exemplify this, I've written the small script below. It somewhat does what I want, but it ends in an exception. Suggestions on more elegant ways to proceed are gratefully welcomed:
br,
Gro
import multiprocessing

def input(i):
    """arbitrarily chosen value (8) gives special status = 1"""
    if i == 8:
        value = 1
    else:
        value = 0
    return value

def sum_list(list):
    """accumulative sum of list"""
    sum = 0
    for x in range(len(list)):
        sum = sum + list[x]
    return sum

def worker(list, i):
    value = input(i)
    list[i] = value
    print 'processname', i

if __name__ == '__main__':
    mgr = multiprocessing.Manager()
    list = mgr.list([0]*20)
    jobs = [multiprocessing.Process(target=worker, args=(list, i))
            for i in range(20)]
    for j in jobs:
        j.start()
        sumlist = sum_list(list)
        print sumlist
        if sumlist == 1:
            break
    for j in jobs:
        j.join()
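One standard way to let every process monitor a status indicator is a shared multiprocessing.Event: any worker that hits the special status sets it, and all workers poll it and quit. A minimal sketch along the lines of the question (the value 8 stands in for the special status; the timings and loop counts are arbitrary):

import multiprocessing
import time

def worker(i, stop_event):
    for _ in range(100):
        if stop_event.is_set():
            return                 # another process hit the special status
        time.sleep(0.01)           # placeholder for a unit of real work
        if i == 8:                 # the arbitrary "special" condition
            stop_event.set()       # tell every other worker to quit
            return

if __name__ == '__main__':
    stop_event = multiprocessing.Event()
    jobs = [multiprocessing.Process(target=worker, args=(i, stop_event))
            for i in range(20)]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    print('stopped early' if stop_event.is_set() else 'ran to completion')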