How can I use the XGBoost library in C++? I have found the Python and Java APIs, but I can't find an API for C++.
I ended up using the C API; see the example below:
// create the train data
const int cols=3,rows=5; // const so the array sizes below are compile-time constants
float train[rows][cols];
for (int i=0;i<rows;i++)
for (int j=0;j<cols;j++)
train[i][j] = (i+1) * (j+1);
float train_labels[rows];
for (int i=0;i<rows;i++)
train_labels[i] = 1+i*i*i;
// convert to DMatrix
DMatrixHandle h_train[1];
XGDMatrixCreateFromMat((float *) train, rows, cols, -1, &h_train[0]);
// load the labels
XGDMatrixSetFloatInfo(h_train[0], "label", train_labels, rows);
// read back the labels, just a sanity check
bst_ulong bst_result;
const float *out_floats;
XGDMatrixGetFloatInfo(h_train[0], "label" , &bst_result, &out_floats);
for (unsigned int i=0;i<bst_result;i++)
std::cout << "label[" << i << "]=" << out_floats[i] << std::endl;
// create the booster and load some parameters
BoosterHandle h_booster;
XGBoosterCreate(h_train, 1, &h_booster);
XGBoosterSetParam(h_booster, "booster", "gbtree");
XGBoosterSetParam(h_booster, "objective", "reg:linear");
XGBoosterSetParam(h_booster, "max_depth", "5");
XGBoosterSetParam(h_booster, "eta", "0.1");
XGBoosterSetParam(h_booster, "min_child_weight", "1");
XGBoosterSetParam(h_booster, "subsample", "0.5");
XGBoosterSetParam(h_booster, "colsample_bytree", "1");
XGBoosterSetParam(h_booster, "num_parallel_tree", "1");
// perform 200 learning iterations
for (int iter=0; iter<200; iter++)
XGBoosterUpdateOneIter(h_booster, iter, h_train[0]);
// predict
const int sample_rows = 5;
float test[sample_rows][cols];
for (int i=0;i<sample_rows;i++)
for (int j=0;j<cols;j++)
test[i][j] = (i+1) * (j+1);
DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *) test, sample_rows, cols, -1, &h_test);
bst_ulong out_len;
const float *f;
XGBoosterPredict(h_booster, h_test, 0,0,&out_len,&f);
for (unsigned int i=0;i<out_len;i++)
std::cout << "prediction[" << i << "]=" << f[i] << std::endl;
// free xgboost internal structures
XGDMatrixFree(h_train[0]);
XGDMatrixFree(h_test);
XGBoosterFree(h_booster);
Use XGBoost C API.
BoosterHandle booster;
const char *model_path = "/path/of/model";
// create booster handle first
XGBoosterCreate(NULL, 0, &booster);
// by default, the seed will be set 0
XGBoosterSetParam(booster, "seed", "0");
// load model
XGBoosterLoadModel(booster, model_path);
const int feat_size = 100;
const int num_row = 1;
float feat[num_row][feat_size];
// create some fake data for predicting
for (int i = 0; i < num_row; ++i) {
    for (int j = 0; j < feat_size; ++j) {
        feat[i][j] = (i + 1) * (j + 1);
    }
}
// convert 2d array to DMatrix
DMatrixHandle dtest;
XGDMatrixCreateFromMat(&feat[0][0],
                       num_row, feat_size, NAN, &dtest);
// predict
bst_ulong out_len;
const float *f;
XGBoosterPredict(booster, dtest, 0, 0, &out_len, &f);
assert(out_len == num_row);
std::cout << f[0] << std::endl;
// free memory
XGDMatrixFree(dtest);
XGBoosterFree(booster);
Note that when you load an existing model (as the code above does), you have to ensure that the data format used for prediction is the same as the one used for training. So if you predict on a DMatrix built from a dense matrix, as above, you have to train on a dense matrix as well.
Training with the libsvm format and predicting with a dense matrix may give wrong predictions, as the XGBoost FAQ says:
“Sparse” elements are treated as if they were “missing” by the tree booster, and as zeros by the linear booster. For tree models, it is important to use consistent data formats during training and scoring.
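To make the point about consistent formats concrete, here is a minimal sketch of one way to turn a sparse (libsvm-style) row into a dense row without silently turning absent features into zeros. It assumes the dense matrix is later handed over with NaN as the "missing" value, as the NAN argument in the prediction code above does. The helper name densify_row and the pair-based row layout are made up for this illustration and are not part of XGBoost.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

// Expand one sparse row of (feature index, value) pairs into a dense row of
// num_features floats. Features that do not appear in the sparse row are
// filled with NaN, so a consumer that treats NaN as "missing" sees them as
// missing rather than as zeros.
std::vector<float> densify_row(const std::vector<std::pair<int, float>>& sparse_row,
                               int num_features) {
    std::vector<float> dense(num_features, std::numeric_limits<float>::quiet_NaN());
    for (const auto& entry : sparse_row) {
        assert(entry.first >= 0 && entry.first < num_features);
        dense[entry.first] = entry.second;
    }
    return dense;
}
```

For example, densify_row({{0, 1.5f}, {3, 2.0f}}, 4) yields a row whose entries 1 and 2 are NaN, not 0.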
Here is what you need:
#include "xgboostpp.h"
#include <algorithm>
#include <iostream>
int main(int argc, const char* argv[])
{
    auto nsamples = 2;
    // 4 feature columns and 3 labels (the three flower species in the iris example);
    // for a regression task, nlabel = 1 is enough
    auto xgb = XGBoostPP(argv[1], 3);
    // result = array([[9.9658281e-01, 2.4966884e-03, 9.2058454e-04],
    //                 [9.9608469e-01, 2.4954407e-03, 1.4198524e-03]], dtype=float32)
    XGBoostPP::Matrix features(2, 4);
    features <<
        5.1, 3.5, 1.4, 0.2,
        4.9, 3.0, 1.4, 0.2;
    XGBoostPP::Matrix y;
    auto ret = xgb.predict(features, y);
    if (ret != 0) {
        std::cout << "predict error" << std::endl;
    }
    std::cout << "input : \n" << features << std::endl << "output: \n" << y << std::endl;
}
In case training in Python is okay and you only need to run the prediction in C++, there is a nice tool, xgb2cpp, for generating static if/else code from a trained model.
I ended up using it after spending a day trying, without success, to load and use an XGBoost model in C++. The code generated by xgb2cpp worked immediately and has the added benefit of having no dependencies.
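To give a feel for what "static if/else code" means, here is a hand-written sketch of a tiny boosted ensemble expressed as plain branches. It is not actual xgb2cpp output: the two trees, their thresholds, the leaf values and the base score are all invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical first tree: splits on feature 0, then on feature 1.
static float tree0(const float* x) {
    if (x[0] < 2.5f) {
        return (x[1] < 1.0f) ? 0.10f : 0.25f;
    }
    return 0.50f;
}

// Hypothetical second tree: a single split on feature 2.
static float tree1(const float* x) {
    return (x[2] < 4.0f) ? -0.05f : 0.20f;
}

// A boosted regression ensemble predicts by adding the leaf value of every
// tree to a base score. Expects exactly three features.
float predict(const float* x, std::size_t num_features) {
    assert(num_features == 3);
    (void)num_features;
    const float base_score = 0.5f;
    return base_score + tree0(x) + tree1(x);
}
```

Because everything is ordinary branching on floats, code of this shape compiles without linking against any library, which is the "no dependencies" benefit mentioned above.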
There is no example that I am aware of. There is a c_api.h file that contains a C/C++ API for the package, and you'll have to find your way around it. I just did that; it took me a few hours of reading the code and trying a few things out, but eventually I managed to create a working C++ example of xgboost.
To solve this problem, we run the xgboost program from the C++ source code.
I am running a TensorFlow model that returns a 3D array as output, and I can't get that array of data out of the tensor.
I did print the shape of the output of the model without any problem.
std::vector<tf::Tensor> outputs;
auto start_inference = std::chrono::high_resolution_clock::now();
_status = _session->Run({inputs}, {"k2tfout_0", "k2tfout_1"}, {}, &outputs);
if (!_status.ok())
{
    std::cerr << _status.ToString() << std::endl;
    return 0;
}
unsigned int output_img_n0 = outputs[0].shape().dim_size(0);
unsigned int output_img_h0 = outputs[0].shape().dim_size(1);
unsigned int output_img_w0 = outputs[0].shape().dim_size(2);
unsigned int output_img_c0 = outputs[0].shape().dim_size(3);
That code worked without any error and showed the shape of the array, but I still couldn't get the data out of the outputs Tensor object.
The only function that worked is
float_t *plant_pointer = outputs[1].flat<float_t>().data();
but it destroys the array shape.
The output shape of the tensor is [num, height, width, channel] = [1, 480, 600, 3], so the output is the semantic segmentation image produced by the model. I just want the image part, without the first dimension, which is always 1.
The tensorflow::Tensor class allows you to access its contents through several methods. With .flat you get a flattened version of the array, .tensor gives you a full Eigen tensor, and then there are a few others such as .vec/.matrix (like .tensor with the number of dimensions fixed to 1 or 2) and flat_inner_dims/flat_outer_dims/flat_inner_outer_dims (which give you a tensor with some dimensions collapsed). You can use the one that suits you best. In this case, for example, if you want to print all the values in the tensor, you can use .flat and compute the corresponding offset, or use .tensor if you know that the number of dimensions is 4:
std::vector<tf::Tensor> outputs;
auto start_inference = std::chrono::high_resolution_clock::now();
_status = _session->Run({inputs}, {"k2tfout_0", "k2tfout_1"}, {}, &outputs);
if (!_status.ok())
{
    std::cerr << _status.ToString() << std::endl;
    return 0;
}
unsigned int output_img_n0 = outputs[0].shape().dim_size(0);
unsigned int output_img_h0 = outputs[0].shape().dim_size(1);
unsigned int output_img_w0 = outputs[0].shape().dim_size(2);
unsigned int output_img_c0 = outputs[0].shape().dim_size(3);
for (unsigned int ni = 0; ni < output_img_n0; ni++) {
    for (unsigned int hi = 0; hi < output_img_h0; hi++) {
        for (unsigned int wi = 0; wi < output_img_w0; wi++) {
            for (unsigned int ci = 0; ci < output_img_c0; ci++) {
                float_t value;
                // Get value through .flat()
                unsigned int offset = ni * output_img_h0 * output_img_w0 * output_img_c0 +
                                      hi * output_img_w0 * output_img_c0 +
                                      wi * output_img_c0 +
                                      ci;
                value = outputs[0].flat<float_t>()(offset);
                // Get value through .tensor()
                value = outputs[0].tensor<float_t, 4>()(ni, hi, wi, ci);
                std::cout << "output[0](" << ni << ", " << hi << ", " << wi << ", " << ci << ") = ";
                std::cout << value << std::endl;
            }
        }
    }
}
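The offset arithmetic used with .flat can be checked in isolation. The sketch below uses only the standard library and assumes a contiguous row-major buffer in (N, H, W, C) order; Shape4 and flat_offset are names invented for this example, not TensorFlow API.

```cpp
#include <cassert>
#include <cstddef>

// Shape of a 4-D array stored contiguously in row-major (N, H, W, C) order.
struct Shape4 {
    std::size_t n, h, w, c;
};

// Flat index of element (ni, hi, wi, ci). This equals
// ni*H*W*C + hi*W*C + wi*C + ci, written in nested form to save multiplications.
std::size_t flat_offset(const Shape4& s, std::size_t ni, std::size_t hi,
                        std::size_t wi, std::size_t ci) {
    assert(ni < s.n && hi < s.h && wi < s.w && ci < s.c);
    return ((ni * s.h + hi) * s.w + wi) * s.c + ci;
}
```

With the shape from the question, {1, 480, 600, 3}, moving one pixel to the right advances the offset by 3 and moving one row down advances it by 1800.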
Note that, although these methods create Eigen::TensorMap objects, which are not really expensive, you may prefer to call them only once and then query the tensor object multiple times. For example:
// Make tensor
tf::TTypes<float_t, 4>::Tensor outputTensor0 = outputs[0].tensor<float_t, 4>();
// Query tensor multiple times
for (...)
std::cout << outputTensor0(ni, hi, wi, ci) << std::endl;
If you want to obtain a pointer to the data of the tensor (for example to build another object from the same buffer, avoiding copies or iteration), you can also do that. One option is to use the .tensor_data method, which returns a tensorflow::StringPiece, which is in turn an absl::string_view, which is just a polyfill for std::string_view. So the .data method of this object will give you a pointer to the underlying byte buffer for the tensor (note the warning in the documentation of .tensor_data: "the underlying tensor buffer is refcounted", so do not let the tensor be destroyed while you use the buffer). You can therefore do:
tf::StringPiece output0Str = outputs[0].tensor_data();
const char* output0Ptr = output0Str.data();
This however gives you a pointer to char so you would have to cast it to use it as float. It should be safe, but it looks ugly, so you can let Eigen do that for you. All Eigen objects have a .data method that returns a pointer of its type to the underlying buffer. For example:
const float_t* output0Ptr = outputs[0].flat<float_t>().data();
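Once you have such a typed pointer, dropping the leading batch dimension is only a matter of pointer arithmetic. The following is a rough sketch of a non-owning view over one image of a contiguous (N, H, W, C) buffer; ImageView and image_from_batch are hypothetical names, and the view is only valid for as long as the tensor that owns the buffer stays alive.

```cpp
#include <cassert>
#include <cstddef>

// Non-owning view of a single H x W x C image inside a larger buffer.
struct ImageView {
    const float* data;
    std::size_t height, width, channels;

    // Element access with the batch dimension already removed.
    float at(std::size_t h, std::size_t w, std::size_t c) const {
        assert(h < height && w < width && c < channels);
        return data[(h * width + w) * channels + c];
    }
};

// Select image n_index from a contiguous row-major (N, H, W, C) buffer.
// No data is copied; the view simply points at the start of that image.
ImageView image_from_batch(const float* batch, std::size_t n_index,
                           std::size_t height, std::size_t width,
                           std::size_t channels) {
    return ImageView{batch + n_index * height * width * channels,
                     height, width, channels};
}
```

For the [1, 480, 600, 3] output in the question, this would be called with n_index = 0 and the pointer returned by .flat<float_t>().data().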
I'm currently porting my old OpenCV C code to the C++ interface of OpenCV 2/3 and I'm not quite sure about some equivalents of old functions. Pretty early I ran into an issue with cvZero. The only possibility I found was to set the matrix content via Mat::setTo. Now, having to be able to manage multi-channel scalars and different data types, setTo iterates through all elements of the matrix and sets them one after another while cvZero basically did a memset. I am wondering what would be the recommended way for using the C++ interface, in case I just want to clear my image black.
yourMat = cv::Mat::zeros(yourMat.size(), yourMat.type()) does not seem to allocate new memory but only overwrites the existing Mat object (memory was previously allocated, otherwise .size is 0). Not sure whether memset is used internally, but this sample code gives 50% longer processing time for the version with .setTo compared to the version with cv::Mat::zeros - but I didn't evaluate the offset from the manipulation (which should be quite identical in both versions)!
int main(int argc, char* argv[])
{
    cv::Mat input = cv::imread("C:/StackOverflow/Input/Lenna.png");
    cv::Mat a = input;
    cv::Mat b = input;
    cv::imshow("original", a);
    b = cv::Mat::zeros(a.size(), a.type());
    int n = 500000;
    // size the vectors up front so that the indexed writes below are valid
    std::vector<int> randX(n);
    std::vector<int> randY(n);
    std::vector<cv::Vec3b> randC(n);
    for (int i = 0; i < n; ++i)
    {
        randX[i] = rand() % input.cols;
        randY[i] = rand() % input.rows;
        randC[i] = cv::Vec3b(rand()%255, rand()%255, rand()%255);
    }
    // version 1: clear with cv::Mat::zeros
    clock_t start1 = clock();
    for (unsigned int i = 0; i < randX.size(); ++i)
    {
        b.at<cv::Vec3b>(randY[i], randX[i]) = randC[i];
        b = cv::Mat::zeros(b.size(), b.type());
    }
    clock_t end1 = clock();
    // version 2: clear with setTo
    clock_t start2 = clock();
    for (unsigned int i = 0; i < randX.size(); ++i)
    {
        b.at<cv::Vec3b>(randY[i], randX[i]) = randC[i];
        b.setTo( cv::Scalar(0, 0, 0));
    }
    clock_t end2 = clock();
    std::cout << "time1 = " << ( (end1 - start1) / CLOCKS_PER_SEC ) << " seconds" << std::endl;
    std::cout << "time2 = " << ((end2 - start2) / CLOCKS_PER_SEC) << " seconds" << std::endl;
    cv::imshow("a", a);
    cv::imshow("b", b);
    return 0;
}
gives me output:
time1 = 14 seconds
time2 = 21 seconds
on my machine (Release mode) (no IPP).
and a black image for both a and b, which indicates that no new memory was allocated and the existing Mat memory was reused.
int n = 250000; will produce output
time1 = 6 seconds
time2 = 10 seconds
This does not answer whether memset is used internally or whether it is as fast as cvZero, but at least you now know how to set a matrix to zero faster than with .setTo.
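For reference, this is roughly what a memset-style clear looks like on a raw image buffer. It is a standard-library sketch and makes no claim about how OpenCV implements zeros or setTo; the function name zero_image is invented, and the stride parameter plays the role that the row step plays in an image whose rows may be padded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Zero an image of `rows` rows with `row_bytes` payload bytes each, where
// consecutive rows start `stride` bytes apart. A contiguous image
// (stride == row_bytes) is cleared with a single memset; otherwise each row
// is cleared on its own so that padding bytes between rows stay untouched.
void zero_image(std::uint8_t* data, std::size_t rows,
                std::size_t row_bytes, std::size_t stride) {
    assert(stride >= row_bytes);
    if (rows == 0 || row_bytes == 0) return;
    if (stride == row_bytes) {
        std::memset(data, 0, rows * row_bytes);
        return;
    }
    for (std::size_t r = 0; r < rows; ++r) {
        std::memset(data + r * stride, 0, row_bytes);
    }
}
```

Either path writes whole runs of bytes instead of assigning a multi-channel scalar element by element, which is the difference the question is asking about.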
I want to calculate the mean and standard deviation for a histogram of an HSV image, but I only want to do this for the V channel.
I have been reading examples on how to do this for a set of channels and have tried these approaches, but I am unsure whether my way of creating the histogram is correct for just one channel, because the program keeps crashing when I try to execute it.
Here is what I have at the moment (the variable test is a cv::Mat image; you can use any image to recreate the issue). I have probably missed something obvious, and the for loop might not be correct in terms of the range of values, but I haven't done this in C++ before.
cv::cvtColor(test, test, CV_BGR2HSV);
int v_bins = 50;
int histSize[] = { v_bins };
cv::MatND hist;
float v_ranges[] = { 0, 255};
cv::vector<cv::Mat> channel(3);
split(test, channel);
const float* ranges[] = { v_ranges };
int channels[] = {0};
cv::calcHist(&channel[2], 1, channels, cv::Mat(), hist, 1, histSize, ranges, true, false); //histogram calculation
float mean=0;
float rows= hist.size().height;
float cols = hist.size().width;
for (int v = 0; v < v_bins; v++)
{
    std::cout << hist.at<float>(v, v) << std::endl;
    mean = mean + hist.at<float>(v);
}
mean = mean / (rows*cols);
std::cout << mean << std::endl;
You can simply use cv::meanStdDev, which calculates the mean and standard deviation of the array elements.
Note that both mean and stddev arguments are cv::Scalar, so you need to do mean[0] and stddev[0] to get the double values of your single channel array hist.
This code will clarify its usage:
#include <opencv2/opencv.hpp>
#include <iostream>
int main()
{
    cv::Mat test = cv::imread("path_to_image");
    cv::cvtColor(test, test, CV_BGR2HSV);
    int v_bins = 50;
    int histSize[] = { v_bins };
    cv::MatND hist;
    float v_ranges[] = { 0, 255 };
    cv::vector<cv::Mat> channel(3);
    split(test, channel);
    const float* ranges[] = { v_ranges };
    int channels[] = { 0 };
    cv::calcHist(&channel[2], 1, channels, cv::Mat(), hist, 1, histSize, ranges, true, false); //histogram calculation
    cv::Scalar mean, stddev;
    cv::meanStdDev(hist, mean, stddev);
    std::cout << "Mean: " << mean[0] << " StdDev: " << stddev[0] << std::endl;
    return 0;
}
You can compute the mean and the standard deviation by their definition:
double dmean = 0.0;
double dstddev = 0.0;
// Mean: standard algorithm
for (int i = 0; i < v_bins; ++i)
    dmean += hist.at<float>(i);
dmean /= v_bins;
// Standard deviation: standard algorithm
std::vector<double> var(v_bins);
for (int i = 0; i < v_bins; ++i)
    var[i] = (dmean - hist.at<float>(i)) * (dmean - hist.at<float>(i));
for (int i = 0; i < v_bins; ++i)
    dstddev += var[i];
dstddev = sqrt(dstddev / v_bins);
std::cout << "Mean: " << dmean << " StdDev: " << dstddev << std::endl;
and you'll get the same values as OpenCV meanStdDev.
Be careful about calculating statistics on a histogram. If you just run meanStdDev, you'll get the mean and stdev of the bin values. That doesn't tell you an awful lot.
Probably what you want is the mean and stdev intensity.
So, if you want to derive the image mean and standard deviation from a histogram (or set of histograms), then you can use the following code:
// assume histogram is of type cv::Mat and comes from cv::calcHist
double s = 0;
double total_hist = 0;
for(int i=0; i < histogram.total(); ++i){
    s += histogram.at<float>(i) * (i + 0.5); // bin centre
    total_hist += histogram.at<float>(i);
}
double mean = s / total_hist;
double t = 0;
for(int i=0; i < histogram.total(); ++i){
    double x = (i + 0.5) - mean; // deviation of the bin centre from the mean
    t += histogram.at<float>(i)*x*x;
}
double stdev = std::sqrt(t / total_hist);
From the definitions of the mean:
mean = sum(x * p(x))                    // expectation
std  = sqrt(sum(p(x) * (x - mean)**2))  // sqrt(variance)
The mean is the expectation value for x. So histogram[x]/sum(histogram) gives you p(x). The definition of standard deviation is similar and comes from the variance. The numbers are slightly simpler because pixels can only take integer values and are unit spaced.
Note this is also useful if you want to calculate normalisation statistics for a batch of images using the accumulate option.
Adapted from: How to calculate the standard deviation from a histogram? (Python, Matplotlib)
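The same definitions can be written for bins that are wider than one unit, for example the 50 bins over [0, 255] used in the question, where each bin is 5.1 wide. The sketch below is plain standard C++ rather than OpenCV; the function name, the HistStats struct and the lo/width parameters are assumptions made for this illustration, and every value in a bin is approximated by the bin centre.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct HistStats {
    double mean;
    double stddev;
};

// Estimate the mean and (population) standard deviation of the underlying
// values from bin counts. Bin i is assumed to cover
// [lo + i*width, lo + (i+1)*width) and is represented by its centre.
HistStats stats_from_histogram(const std::vector<double>& counts,
                               double lo, double width) {
    double total = 0.0;
    double weighted = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        total += counts[i];
        weighted += counts[i] * (lo + (i + 0.5) * width);
    }
    assert(total > 0.0);
    const double mean = weighted / total;

    double var = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        const double d = (lo + (i + 0.5) * width) - mean;
        var += counts[i] * d * d;
    }
    return HistStats{mean, std::sqrt(var / total)};
}
```

With counts {1, 0, 1}, lo = 0 and width = 1, the bin centres are 0.5 and 2.5, so the estimated mean is 1.5 and the standard deviation is 1.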
Note: I've also posted this on the Eigen forum.
I want to premultiply 3xN matrices by a 3x3 matrix, i.e., to transform 3D points, like
p_dest = T * p_source
after initializing the matrices:
Eigen::Matrix<double, 3, Eigen::Dynamic> points = Eigen::Matrix<double, 3, Eigen::Dynamic>::Random(3, NUMCOLS);
Eigen::Matrix<double, 3, Eigen::Dynamic> dest = Eigen::Matrix<double, 3, Eigen::Dynamic>(3, NUMCOLS);
int NT = 100;
I have evaluated these two versions:
// eigen direct multiplication
for (int i = 0; i < NT; i++){
    Eigen::Matrix3d T = Eigen::Matrix3d::Random();
    dest.noalias() = T * points;
}
// col multiplication
for (int i = 0; i < NT; i++){
    Eigen::Matrix3d T = Eigen::Matrix3d::Random();
    for (int c = 0; c < points.cols(); c++){
        dest.col(c) = T * points.col(c);
    }
}
The NT repetitions are done just to compute the average time.
I am surprised that the column-by-column multiplication is about 4-5 times faster than the direct multiplication
(and the direct multiplication is even slower if I do not use .noalias(), but this is fine since it makes a temporary copy).
I've tried to change NUMCOLS from 0 to 1000000 and the relation is linear.
I'm using Visual Studio 2013 and compiling in release
The next figure shows on X the number of columns of the matrix and in Y the avg time for a single operation, in blue the col by col multiplication, in red the matrix multiplication
Any suggestion why this happens?
Short answer
You're timing the lazy (and therefore lack of) evaluation in the col multiplication version, vs. the lazy (but evaluated) evaluation in the direct version.
Long answer
Instead of code snippets, let's look at a full MCVE. First, your version:
void ColMult(Matrix3Xd& dest, Matrix3Xd& points)
{
    Eigen::Matrix3d T = Eigen::Matrix3d::Random();
    for (int c = 0; c < points.cols(); c++){
        dest.col(c) = T * points.col(c);
    }
}
void EigenDirect(Matrix3Xd& dest, Matrix3Xd& points)
{
    Eigen::Matrix3d T = Eigen::Matrix3d::Random();
    dest.noalias() = T * points;
}
int main(int argc, char *argv[])
{
    int NUMCOLS = 100000 + rand();
    Matrix3Xd points = Matrix3Xd::Random(3, NUMCOLS);
    Matrix3Xd dest = Matrix3Xd(3, NUMCOLS);
    Matrix3Xd dest2 = Matrix3Xd(3, NUMCOLS);
    int NT = 200;
    // eigen direct multiplication
    auto beg1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < NT; i++)
    {
        EigenDirect(dest, points);
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds = end1-beg1;
    // col multiplication
    auto beg2 = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < NT; i++)
    {
        ColMult(dest2, points);
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds2 = end2-beg2;
    std::cout << "Direct time: " << elapsed_seconds.count() << "\n";
    std::cout << "Col time: " << elapsed_seconds2.count() << "\n";
    std::cout << "Eigen speedup: " << elapsed_seconds2.count() / elapsed_seconds.count() << "\n\n";
    return 0;
}
With this code (and SSE turned on), I get:
Direct time: 0.449301
Col time: 0.10107
Eigen speedup: 0.224949
This is the same 4-5x slowdown you complained of. Why? Before we get to the answer, let's modify the code a bit so that the dest matrix is sent to an ostream. Add std::ostream outPut(0); to the beginning of main() and, before ending the timers, add outPut << dest << "\n\n"; and outPut << dest2 << "\n\n";. The std::ostream outPut(0); doesn't output anything (I'm pretty sure the badbit is set), but it does cause Eigen's operator<< to be called, which forces the evaluation of the matrix.
NOTE: if we used outPut << dest(1,1) then dest would be evaluated only enough to output the single element in the col multiplication method.
We then get
Direct time: 0.447298
Col time: 0.681456
Eigen speedup: 1.52349
which is the expected result. Note that the Eigen direct method took (roughly) the same time, meaning the evaluation took place even without the added ostream, whereas the col method all of a sudden took much longer.
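Streaming the matrix is one way to force the work to happen; another common benchmarking habit is to fold the result into a single number and use that number after the timed region. The sketch below shows the idea with plain arrays instead of Eigen, so the function names and the storage layout (points stored column by column as x, y, z triples) are assumptions of this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply a 3x3 matrix T (row-major, 9 values) with a 3xN point matrix
// stored column by column: x0, y0, z0, x1, y1, z1, ...
std::vector<double> transform_points(const double (&T)[9],
                                     const std::vector<double>& points) {
    assert(points.size() % 3 == 0);
    std::vector<double> dest(points.size());
    for (std::size_t c = 0; c < points.size(); c += 3) {
        for (std::size_t r = 0; r < 3; ++r) {
            dest[c + r] = T[3 * r] * points[c] +
                          T[3 * r + 1] * points[c + 1] +
                          T[3 * r + 2] * points[c + 2];
        }
    }
    return dest;
}

// Fold a result into one value. Printing or otherwise using this value after
// the timed loop makes every element of the result observable.
double checksum(const std::vector<double>& values) {
    double sum = 0.0;
    for (double v : values) sum += v;
    return sum;
}
```

In a benchmark you would accumulate checksum(dest) for both variants and print the totals after stopping the timers, so that neither variant can be skipped as unused work.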
I am using OpenCV to implement camera motion compensation for an application. I know I need to calculate the optical flow and then find the fundamental matrix between two frames to transform the image.
Here is what I have done so far:
void VideoStabilization::stabilize(Image *image) {
    if (image->getWidth() != width || image->getHeight() != height) reset(image->getWidth(), image->getHeight());
    IplImage *currImage = toCVImage(image);
    IplImage *currImageGray = cvCreateImage(cvSize(width, height), IPL_DEPTH_8U, 1);
    cvCvtColor(currImage, currImageGray, CV_BGRA2GRAY);
    if (baseImage) {
        CvPoint2D32f currFeatures[MAX_CORNERS];
        char featuresFound[MAX_CORNERS];
        opticalFlow(currImageGray, currFeatures, featuresFound);
        IplImage *result = transformImage(currImage, currFeatures, featuresFound);
        if (result) {
            updateImage(image, result);
        }
    }
    if (baseImage) cvReleaseImage(&baseImage);
    baseImage = currImageGray;
}
void VideoStabilization::updateGoodFeatures() {
    const double QUALITY_LEVEL = 0.05;
    const double MIN_DISTANCE = 5.0;
    baseFeaturesCount = MAX_CORNERS;
    cvGoodFeaturesToTrack(baseImage, eigImage,
        tempImage, baseFeatures, &baseFeaturesCount, QUALITY_LEVEL, MIN_DISTANCE);
    cvFindCornerSubPix(baseImage, baseFeatures, baseFeaturesCount,
        cvSize(10, 10), cvSize(-1,-1), TERM_CRITERIA);
}
void VideoStabilization::opticalFlow(IplImage *currImage, CvPoint2D32f *currFeatures, char *featuresFound) {
const unsigned int WIN_SIZE = 15;
const unsigned int PYR_LEVEL = 5;
cvCalcOpticalFlowPyrLK(baseImage, currImage,
IplImage *VideoStabilization::transformImage(IplImage *image, CvPoint2D32f *features, char *featuresFound) const {
    unsigned int featuresFoundCount = 0;
    for (unsigned int i = 0; i < MAX_CORNERS; ++i) {
        if (featuresFound[i]) ++featuresFoundCount;
    }
    if (featuresFoundCount < 8) {
        std::cout << "Not enough features found." << std::endl;
        return NULL;
    }
    CvMat *points1 = cvCreateMat(2, featuresFoundCount, CV_32F);
    CvMat *points2 = cvCreateMat(2, featuresFoundCount, CV_32F);
    CvMat *fundamentalMatrix = cvCreateMat(3, 3, CV_32F);
    unsigned int pos = 0;
    for (unsigned int i = 0; i < featuresFoundCount; ++i) {
        while (!featuresFound[pos]) ++pos;
        cvSetReal2D(points1, 0, i, baseFeatures[pos].x);
        cvSetReal2D(points1, 1, i, baseFeatures[pos].y);
        cvSetReal2D(points2, 0, i, features[pos].x);
        cvSetReal2D(points2, 1, i, features[pos].y);
    }
    int fmCount = cvFindFundamentalMat(points1, points2, fundamentalMatrix, CV_FM_RANSAC, 1.0, 0.99);
    if (fmCount < 1) {
        std::cout << "Fundamental matrix not found." << std::endl;
        return NULL;
    }
    std::cout << fundamentalMatrix->data.fl[0] << " " << fundamentalMatrix->data.fl[1] << " " << fundamentalMatrix->data.fl[2] << "\n";
    std::cout << fundamentalMatrix->data.fl[3] << " " << fundamentalMatrix->data.fl[4] << " " << fundamentalMatrix->data.fl[5] << "\n";
    std::cout << fundamentalMatrix->data.fl[6] << " " << fundamentalMatrix->data.fl[7] << " " << fundamentalMatrix->data.fl[8] << "\n";
    IplImage *result = transformImage(image, *fundamentalMatrix);
    return result;
}
MAX_CORNERS is 100 and it usually finds around 70-90 features.
With this code, I get a weird fundamental matrix, like:
-0.000190809 -0.00114947 1.2487
0.00127824 6.57727e-05 0.326055
-1.22443 -0.338243 1
Since I am just holding the camera in my hand and trying not to shake it (and there weren't any objects moving), I expected the matrix to be close to identity. What am I doing wrong?
Also, I'm not sure what to use to transform the image. cvWarpAffine needs a 2x3 matrix; should I discard the last row or use another function?
What you're looking for is not the fundamental matrix but rather an affine or perspective transform.
The fundamental matrix describes the relation between two cameras with significantly different viewpoints. It is calculated such that if you have two points x (on one image) and x' (on the other) that are projections of the same point in space, then the product x'^T F x is zero. If x and x' are nearly identical... then the only solution is to make F nearly zero (and practically useless). That's why you've got what you have.
The matrix that should indeed be near identity is a transformation A that transforms the points x to x'= A x (the old image into the new one). Depending on what types of transformations you want to include (affine or perspective), you could (theoretically) use the functions cvGetAffineTransform or cvGetPerspectiveTransform to calculate the transform. For that, you would need 3 or 4 point pairs, respectively.
However, the best choice (I think) is cvFindHomography. It estimates a perspective transform based on all of the point pairs available, using outlier filtering algorithms (RANSAC, for example), and gives you a 3x3 matrix.
Then you can use cvWarpPerspective to transform the images themselves.
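To see what such a 3x3 matrix does to an individual point, and why a near-identity matrix means almost no motion, here is a small standard-library sketch. It is not an OpenCV function; Point2 and apply_homography are names made up for this example, and H is assumed to be stored row by row.

```cpp
#include <cassert>
#include <cmath>

struct Point2 {
    double x, y;
};

// Apply a 3x3 perspective transform H (row-major) to a 2-D point:
// lift the point to homogeneous coordinates (x, y, 1), multiply by H,
// then divide by the third component w.
Point2 apply_homography(const double (&H)[9], Point2 p) {
    const double xh = H[0] * p.x + H[1] * p.y + H[2];
    const double yh = H[3] * p.x + H[4] * p.y + H[5];
    const double w  = H[6] * p.x + H[7] * p.y + H[8];
    assert(std::fabs(w) > 1e-12);
    return Point2{xh / w, yh / w};
}
```

When the last row of H is (0, 0, 1), w is always 1 and the transform is affine, so only the first two rows matter; that is the case in which a 2x3 matrix is sufficient. Note also that multiplying H by any non-zero constant leaves the mapped point unchanged, because the constant cancels in the division by w.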