I am trying to find the real world distance from camera to ORB feature points using Monocular ORB SLAM2.
I calculated the Euclidean distance between world coordinates of each ORB feature point and the world coordinate of the current key frame's camera location. This process is repeated for all the frames. So a distance is obtained for each ORB point in the current frame.
In Viewer.cc
std::vector<MapPoint*> MP = mpTracker->mCurrentFrame.mvpMapPoints;
cv::Mat CC = mpTracker->mCurrentFrame.GetCameraCenter();
mpFrameDrawer->DrawFrame(MP,CC);
In FrameDrawer.cc
cv::Mat FrameDrawer::DrawFrame(std::vector<MapPoint*> DistMapPt, cv::Mat CameraPt)
{
.
.
.
else if(state==Tracking::OK) //TRACKING
{
mnTracked=0;
mnTrackedVO=0;
const float r = 5;
for(int i=0;i<n;i++)
{
if(vbVO[i] || vbMap[i])
{
cv::Point2f pt1,pt2;
pt1.x=vCurrentKeys[i].pt.x-r;
pt1.y=vCurrentKeys[i].pt.y-r;
pt2.x=vCurrentKeys[i].pt.x+r;
pt2.y=vCurrentKeys[i].pt.y+r;
float MapPointDist;
MapPointDist = sqrt(pow((DistMapPt[i]->GetWorldPos().at<float>(0)-CameraPt.at<float>(0)),2)+pow((DistMapPt[i]->GetWorldPos().at<float>(1)-CameraPt.at<float>(1)),2)+pow((DistMapPt[i]->GetWorldPos().at<float>(2)-CameraPt.at<float>(2)),2));
}
.
.
.
}
}
}
How ever this calculated distance is neither equal to nor can be scaled to the real distance. The same method gives relatively accurate distances in RGBD ORB SLAM2.
Is there any method to scale distances in Monocular ORB SLAM?
Please look at this post: "ORB-SLAM2 arbitrarily define scale at initialization, as the median scene depth. In practice, scale is different every time you initialize orb-slam2." It is impossible to obtain correct scale in monocular SLAM as you cannot estimate real world depth from sequence of images. You need another source of data such as second camera, IMU, Lidar, robot odometry or a marker with known real-world dimensions. In RGBD case the depth is known from the depth sensor so the coordinates are scaled correctly.
Related
I am writing a Eye On Hand Calibration Program.
To do that, I am moving the camera mounted on the robot arm to 20 different positions, looking at a single aruco marker.
The translation vector is very stable, but the rotation axes flicker, introducing an error into the resulting calibration matrix.
Therefore, I would like to average X number of frames' rotation vectors (The aruco library does return rotation vectors and translation vectors separately).
Here is the important part of the code
cv::aruco::detectMarkers(image, dictionary, markerCorners, markerIds, parameters, rejectedCandidates);
outputImage = image.clone();
cv::aruco::drawDetectedMarkers(outputImage, markerCorners, markerIds);
cv::aruco::estimatePoseSingleMarkers(markerCorners, 0.05, camMatrix, distCoeffs, rvecs, tvecs);
rvecs is actualy a vector of rotation vectors, with only one member because there is only one aruco marker.
If a marker is found in the frame then,
if (rvecs.size() == 1) { // There is one marker good frame
framesFound++;
for (int i = 0; i < 3; i++) {
avgRvecs[i] =+ rvecs[0][i];
avgTvecs[i] =+ tvecs[0][i];
}
}
And after all the desired frames to average have been processed,
if (framesFound == 0 ) { // No frames with markers...
} else {
for (int i = 0; i < 3; i++) {
avgRvecs[i] = avgRvecs[i] / framesFound;
avgTvecs[i] = avgTvecs[i] / framesFound;
}
cv::drawFrameAxes(outputImage, camMatrix, distCoeffs, avgRvecs, avgTvecs, 0.1);
With a single frame I get
With 10 averaged frames I get
Because the pose estimation of Aruco markers is usually done with IPPE (the algorithm under the hood of solvePnP), you might have some "singularities" in the rotation results.
Averaging the rotations can be a good solution as it works as a low-pass filter, though you need to remember that if you are seeking precision, this might not be the most appropriate filter to use.
Most of the time, we like to manipulate Euler angles, Direct Cosine Matrix, or Rodriguez angles. Unfortunately, none of those is the ideal solution to average angular values. If you want to manipulate angles without mathematical errors, I definitely recommend having a look at quaternions.
Some existing posts suggest interesting approaches with quaternions :
"Average" of multiple quaternions?
https://math.stackexchange.com/questions/1984608/average-of-3d-rotations
Math can be a little scary but the posts come with well written code examples.
I have been struggling trying to implement the outlining algorithm described here and here.
The general idea of the paper is determining the Hausdorff distance of binary images and using it to find the template image from a test image.
For template matching, it is recommended to construct image pyramids along with sliding windows which you'll use to slide over your test image for detection. I was able to do both of these as well.
I am stuck on how to move forward from here on. Do I slide my template over the test image from different pyramid layers? Or is it the test image over the template? And with regards to the sliding window, is/are they meant to be a ROI of the test or template image?
In a nutshell, I have pieces to the puzzle but no idea of which direction to take to solve the puzzle
int distance(vector<Point>const& image, vector<Point>const& tempImage)
{
int maxDistance = 0;
for(Point imagePoint: image)
{
int minDistance = numeric_limits<int>::max();
for(Point tempPoint: tempImage)
{
Point diff = imagePoint - tempPoint;
int length = (diff.x * diff.x) + (diff.y * diff.y);
if(length < minDistance) minDistance = length;
if(length == 0) break;
}
maxDistance += minDistance;
}
return maxDistance;
}
double hausdorffDistance(vector<Point>const& image, vector<Point>const& tempImage)
{
double maxDistImage = distance(image, tempImage);
double maxDistTemp = distance(tempImage, image);
return sqrt(max(maxDistImage, maxDistTemp));
}
vector<Mat> buildPyramids(Mat& frame)
{
vector<Mat> pyramids;
int count = 6;
Mat prevFrame = frame, nextFrame;
while(count > 0)
{
resize(prevFrame, nextFrame, Size(), .85, .85);
prevFrame = nextFrame;
pyramids.push_back(nextFrame);
--count;
}
return pyramids;
}
vector<Rect> slidingWindows(Mat& image, int stepSize, int width, int height)
{
vector<Rect> windows;
for(size_t row = 0; row < image.rows; row += stepSize)
{
if((row + height) > image.rows) break;
for(size_t col = 0; col < image.cols; col += stepSize)
{
if((col + width) > image.cols) break;
windows.push_back(Rect(col, row, width, height));
}
}
return windows;
}
Edit I: More analysis on my solution can be found here
This is a bi-directional task.
Forward Direction
1. Translation
For each contour, calculate its moment. Then for each point in that contour, translate it about the moment i.e. contour.point[i] = contour.point[i] - contour.moment[i]. This moves all of the contour points to the origin.
PS: You need to keep track of each contour's produced moment because it will be used in the next section
2. Rotation
With the newly translated points, calculate their rotated rect. This will give you the angle of rotation. Depending on this angle, you would want to calculate the new angle which you want to rotate this contour by; this answer would be helpful.
After attaining the new angle, calculate the rotation matrix. Remember that your center here will be the origin i.e. (0, 0). I did not take scaling into account (that's where the pyramids come into play) when calculating the rotation matrix hence I passed 1.
PS: You need to keep track of each contour's produced matrix because it will be used in the next section
Using this matrix, you can go ahead and rotate each point in the contour by it as shown here*.
Once all of this is done, you can go ahead and calculate the Hausdorff distance and find contours which pass your set threshold.
Back Direction
Everything done in the first section, has to be undone in order for us to draw the valid contours onto our camera feed.
1. Rotation
Recall that each detected contour produced a rotation matrix. You want to undo the rotation of the valid contours. Just perform the same rotation but using the inverse matrix.
For each valid contour and corresponding matrix
inverse_matrix = matrix[i].inv(cv2.DECOMP_SVD)
Use * to rotate the points but with inverse_matrix as parameter
PS: When calculating the inverse, if the produced matrix was not a square one, it would fail. cv2.DECOMP_SVD will produce an inverse matrix even if the original matrix was a non-square.
2. Translation
With the valid contours' points rotated back, you just have to undo the previously performed translation. Instead of subtracting, just add the moment to each point.
You can now go ahead and draw these contours to your camera feed.
Scaling
This is were image pyramids come into play.
All you have to do is resize your template image by a fixed size/ratio upto your desired number of times (called layers). The tutorial found here does a good job of explaining how to do this in OpenCV.
It goes without saying that the values you choose to resize your image by and number of layers will and do play a huge role in how robust your program will be.
Put it all together
Template Image Operations
Create a pyramid consisting of n layers
For each layer in n
Find contours
Translate the contour points
Rotate the contour points
This operation should only be performed once and only store the results of the rotated points.
Camera Feed Operations
Assumptions
Let the rotated contours of the template image at each level be stored in templ_contours. So if I say templ_contours[0], this is going to give me the rotated contours at pyramid level 0.
Let the image's translated, rotated contours and moments be stored in transCont, rotCont and moment respectively.
image_contours = Find Contours
for each contour detected in image
moment = calculate moment
for each point in image_contours
transCont.thisPoint = forward_translate(image_contours.thisPoint)
rotCont.thisPoint = forward_rotate(transCont.thisPoint)
for each contour_layer in templ_contours
for each contour in rotCont
calculate Hausdorff Distance
valid_contours = contours_passing_distance_threshold
for each point in valid_contours
valid_point = backward_rotate(valid_point)
for each point in valid_contours
valid_point = backward_translate(valid_point)
drawContours(valid_contours, image)
I succesfully calibrate my camera using opencv. The camera lens i am using.
https://www.baslerweb.com/en/products/vision-components/lenses/basler-lens-c125-0418-5m-f1-8-f4mm/
The internal and external camera parameter is given below.
cv::Mat cameraMatrix(3, 3, cv::DataType<double>::type);
cameraMatrix.at<double>(0) = 1782.80;//fx //432.2 in mm
cameraMatrix.at<double>(1) = 0;
cameraMatrix.at<double>(2) = 3.0587694283633488e+002;//cx
cameraMatrix.at<double>(3) = 0;
cameraMatrix.at<double>(4) = 1782.80;//fy
cameraMatrix.at<double>(5) = 3.0535864258476721e+002;//cy
cameraMatrix.at<double>(6) = 0;
cameraMatrix.at<double>(7) = 0;
cameraMatrix.at<double>(8) = 1;
cv::Mat disCoeffs(1, 5, cv::DataType<double>::type);
disCoeffs.at<double>(0) = -8.1752937039996709e-001;//k1
disCoeffs.at<double>(1) = -2.5660653367749450e+001;//k2
disCoeffs.at<double>(2) = -1.5556922931812768e-002;//p1
disCoeffs.at<double>(3) = -4.4021541217208054e-002;//p2
disCoeffs.at<double>(4) = 1.5042036073609015e+002;//k3
I know this formula is used to calculate the distance of the object. But i am very confuse how to proper use it.
Resolution of my camera is 640x480.
focal length = 1782.80 (px) do not know how to correctly convert to mm
i know focal length is distance from sensor to image plane. So what actually this value represent? Pixel is just a unit represent dot on screen.
Object i am using is circle.
radius = 22. (width and height 44*44)
circle center point: 300,300 (x,y)
sensor height do not know how to get?
Where do i use principle points?
How i get distance from camera to object? How do get real world coordinate of the circle?
I know its too much to ask. I try one month. Did not find any proper solution.
i use function solvePnP to get the camera translation and rotation matrix. But i have problem how to calculate object point?
Your cx and cy seems to be wrong because they should be half the resolution: 640/2 & 480/2.
fx and fy are in pixel unit you get from calibration process. To convert them to mm use that formula:
pixels width = (image width in pixels) * (focal length in mm) / (CCD width in mm)
pixels height = (image height in pixels) * (focal length in mm) / (CCD height in mm)
When you calibrate your camera, you use those formulas to make sure you've the right values. For me cx and cy are wrong because they represent the center of the image (they shouldn't be equal unless your image is square which is not the case). For fx and fy I can't tell because I don't know the CCD of your camera. They can be equal if the CCD is square.
Don't change those parameters manually but let the your calibration software compute them.
Now you've those parameters, how you compute the distance?
The formula you presented is not useful in a sense that if you can measure the real height, you usually can measure the distance (at least in your case).. so why using a camera!?
So to compute the distance in real world, you need two more things: The extrinsic parameters (Your cameraMatrix matrix is the intrinsic parameters) and at least four points (the more points the better) in real world coordinates.
Once you have those things, you can use solvePnP function to find the pose of an object. The pose represents the translation and rotation with respect to the camera frame.
http://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#solvepnp
This is a piece of code can help to do that:
//Four points in real world with `x,y and z` coordinates
vector<Point3f> vec3d;
vec3d.push_back(Point3f(0, 0, 0));
vec3d.push_back(Point3f(0, 211, 0));
vec3d.push_back(Point3f(295, 211, 0));
vec3d.push_back(Point3f(295, 0, 0));
The z=0 because your real points are in a plane.
//The same four points but in your image plan, therefore there is no z and they're in pixel unit
vector<Point2f> vec2d;
vec2d.push_back(Point2f(532, 412)); //(y,x)
vec2d.push_back(Point2f(583, 594));
vec2d.push_back(Point2f(927, 535));
vec2d.push_back(Point2f(817, 364));
//The pose of the object: rvec is your rotation vector, tvec is your translation vector
cv::Mat rvec, tvec;
solvePnP(vec3d, vec2d, cameraMatrix, distCoeffs, rvec, tvec);
Finally, you can compute the real distance from the tvec as euclidean distance: d=std::sqrt(tx*tx+ty*ty+tz*tz).
Your questions:
sensor height do not know how to get?
Look for your camera specification in the internet or in the manual book and you'll find it.
Where do i use principle points?
They're your intrinsic parameters. You're not gonna use them separately.
How i get distance from camera to object? How do get real world coordinate of the circle?
I explained that above. You need four points and with a circle you have only one which not enough to compute the pose.
But i have problem how to calculate object point?
objectPoints in solvePnP are your real world coordinates. For example, a chessboard has corners in which we know the exact position in mm of each one with respect to a world frame that you choose in the chessboard. It can be in the left top corner or something like that and z=0 because the chessboard is printed in a paper just like your circle!
EDIT:
You can find more specifications in the manual page 13 here. It is said 7.4 x 7.4µm:
f (mm)=f(pixel) x pixel_size(mm) => f (mm) = 1782.80x7.2e-6 = 12.83616 (mm)
Which is not 4mm!! then you need to do the calibration again, something is wrong!
3D points:
vector vec3d;
vec3d is where you gonna store your 3D coordinates point. I gave you an example for the first point which the origin:
vec3d.push_back(Point3f(0, 0, 0)); //y,x,z
EDIT3
If you take a pattern like this
Then choose for example the circle in top left or right corner and it will have a coordinate of (0,0,0), that the origin. After that the circle next to it is your second point and it will have (x,0,0) x is the distance in (mm) between the two circles.. You do the same for four points in your pattern. You can choose any pattern you want as long as you can detect it in your image and retrieve their coordinates in pixel.
If you still don't understand, I advise you take a course in projective geometry and camera models.. so as you can understand what every parameter means.
I'm to build a panorama image of the ground covered by a downward facing camera (at a fixed height, around 1 metre above ground). This could potentially run to thousands of frames, so the Stitcher class' built in panorama method isn't really suitable - it's far too slow and memory hungry.
Instead I'm assuming the floor and motion is planar (not unreasonable here) and trying to build up a cumulative homography as I see each frame. That is, for each frame, I calculate the homography from the previous one to the new one. I then get the cumulative homography by multiplying that with the product of all previous homographies.
Let's say I get H01 between frames 0 and 1, then H12 between frames 1 and 2. To get the transformation to place frame 2 onto the mosaic, I need to get H01*H12. This continues as the frame count increases, such that I get H01*H12*H23*H34*H45*....
In code, this is something akin to:
cv::Mat previous, current;
// Init cumulative homography
cv::Mat cumulative_homography = cv::Mat::eye(3);
video_stream >> previous;
for(;;) {
video_stream >> current;
// Here I do some checking of the frame, etc
// Get the homography using my DenseMosaic class (using Farneback to get OF)
cv::Mat tmp_H = DenseMosaic::get_homography(previous,current);
// Now normalise the homography by its bottom right corner
tmp_H /= tmp_H.at<double>(2, 2);
cumulative_homography *= tmp_H;
previous = current.clone( );
}
It works pretty well, except that as the camera moves "up" in the viewpoint, the homography scale decreases. As it moves down, the scale increases again. This gives my panoramas a perspective type effect that I really don't want.
For example, this is taken on a few seconds of video moving forward then backward. The first frame looks ok:
The problem comes as we move forward a few frames:
Then when we come back again, you can see the frame gets bigger again:
I'm at a loss as to where this is coming from.
I'm using Farneback dense optical flow to calculate pixel-pixel correspondences as below (sparse feature matching doesn't work well on this data) and I've checked my flow vectors - they're generally very good, so it's not a tracking problem. I also tried switching the order of the inputs to find homography (in case I'd mixed up the frame numbers), still no better.
cv::calcOpticalFlowFarneback(grey_1, grey_2, flow_mat, 0.5, 6,50, 5, 7, 1.5, flags);
// Using the flow_mat optical flow map, populate grid point correspondences between images
std::vector<cv::Point2f> points_1, points_2;
median_motion = DenseMosaic::dense_flow_to_corresp(flow_mat, points_1, points_2);
cv::Mat H = cv::findHomography(cv::Mat(points_2), cv::Mat(points_1), CV_RANSAC, 1);
Another thing I thought it could be was the translation I include in the transformation to ensure my panorama is centred within the scene:
cv::warpPerspective(init.clone(), warped, translation*homography, init.size());
But having checked the values in the homography before the translation is applied, the scaling issue I mention is still present.
Any hints are gratefully received. There's a lot of code I could put in but it seems irrelevant, please do let me know if there's something missing
UPDATE
I've tried switching out the *= operator for the full multiplication and tried reversing the order the homographies are multiplied in, but no luck. Below is my code for calculating the homography:
/**
\brief Calculates the homography between the current and previous frames
*/
cv::Mat DenseMosaic::get_homography()
{
cv::Mat grey_1, grey_2; // Grayscale versions of frames
cv::cvtColor(prev, grey_1, CV_BGR2GRAY);
cv::cvtColor(cur, grey_2, CV_BGR2GRAY);
// Calculate the dense flow
int flags = cv::OPTFLOW_FARNEBACK_GAUSSIAN;
if (frame_number > 2) {
flags = flags | cv::OPTFLOW_USE_INITIAL_FLOW;
}
cv::calcOpticalFlowFarneback(grey_1, grey_2, flow_mat, 0.5, 6,50, 5, 7, 1.5, flags);
// Convert the flow map to point correspondences
std::vector<cv::Point2f> points_1, points_2;
median_motion = DenseMosaic::dense_flow_to_corresp(flow_mat, points_1, points_2);
// Use the correspondences to get the homography
cv::Mat H = cv::findHomography(cv::Mat(points_2), cv::Mat(points_1), CV_RANSAC, 1);
return H;
}
And this is the function I use to find the correspondences from the flow map:
/**
\brief Calculate pixel->pixel correspondences given a map of the optical flow across the image
\param[in] flow_mat Map of the optical flow across the image
\param[out] points_1 The set of points from #cur
\param[out] points_2 The set of points from #prev
\param[in] step_size The size of spaces between the grid lines
\return The median motion as a point
Uses a dense flow map (such as that created by cv::calcOpticalFlowFarneback) to obtain a set of point correspondences across a grid.
*/
cv::Point2f DenseMosaic::dense_flow_to_corresp(const cv::Mat &flow_mat, std::vector<cv::Point2f> &points_1, std::vector<cv::Point2f> &points_2, int step_size)
{
std::vector<double> tx, ty;
for (int y = 0; y < flow_mat.rows; y += step_size) {
for (int x = 0; x < flow_mat.cols; x += step_size) {
/* Flow is basically the delta between left and right points */
cv::Point2f flow = flow_mat.at<cv::Point2f>(y, x);
tx.push_back(flow.x);
ty.push_back(flow.y);
/* There's no need to calculate for every single point,
if there's not much change, just ignore it
*/
if (fabs(flow.x) < 0.1 && fabs(flow.y) < 0.1)
continue;
points_1.push_back(cv::Point2f(x, y));
points_2.push_back(cv::Point2f(x + flow.x, y + flow.y));
}
}
// I know this should be median, not mean, but it's only used for plotting the
// general motion direction so it's unimportant.
cv::Point2f t_median;
cv::Scalar mtx = cv::mean(tx);
t_median.x = mtx[0];
cv::Scalar mty = cv::mean(ty);
t_median.y = mty[0];
return t_median;
}
It turns out this was because my viewpoint was close to the features, meaning that the non-planarity of the tracked features was causing skew to the homography. I managed to prevent this (it's more of a hack than a method...) by using estimateRigidTransform instead of findHomography, as this does not estimate for perspective variations.
In this particular case, it makes sense to do so, as the view does only ever undergo rigid transformations.
I am having a lot of trouble applying RVEC and TVEC (the estimated camera pose) from OpenCV's cv::solvePnP to a vtkCamera that I have in a virtual 3D scene. I am hoping someone can show me the errors I am making.
I am trying to take this vtkActor (3D DICOM Rendering of my chest with fiduciary markers placed on my torso):
and use cv::solvePnP to align the fiduciary markers with the red circles shown in the following image (note: the red circles are hard-coded coordinates from a picture of the fiduciary markers with a certain camera perspective):
As you can see, the super-imposed volume rendering is mis-aligned after applying the following vtkTransform to the vtkCamera.
cv::Mat op(model_points);
cv::Mat rvec;
cv::Mat tvec;
// op = the 3D coordinates of the markers in the scene
// ip = the 2D coordinates of the markers in the image
// camera = the intrinsic camera parameters (for calibration)
// dists = the camera distortion coefficients
cv::solvePnP(op, *ip, camera, dists, rvec, tvec, false, CV_ITERATIVE);
cv::Mat rotM;
cv::Rodrigues(rvec, rotM);
rotM = rotM.t();
cv::Mat rtvec = -(rotM*tvec);
std::cout << "rotM: \n" << rotM << std::endl;
std::cout << "tvec: \n" << tvec << std::endl;
std::cout << "rtvec: \n" << rtvec << std::endl;
double cam[16] = {
rotM.at<double>(0), rotM.at<double>(1), rotM.at<double>(2), rtvec.at<double>(0),
rotM.at<double>(3), rotM.at<double>(4), rotM.at<double>(5), rtvec.at<double>(1),
rotM.at<double>(6), rotM.at<double>(7), rotM.at<double>(8), rtvec.at<double>(2),
0, 0, 0, 1
};
vtkSmartPointer<vtkTransform> T = vtkSmartPointer<vtkTransform>::New();
T->SetMatrix(cam);
vtkSmartPointer<vtkRenderer> renderer = v->renderer();
double b_p[3];
double a_p[3];
double *b_o;
double b_o_store[3];
double *a_o;
double b_f[3];
double a_f[3];
vtkSmartPointer<vtkCamera> scene_camera = v->camera();
// Reset Position/Focal/Orientation before applying transformation
// so the transform does not compound
v->ResetCameraPositionOrientation();
// Apply the transformation
scene_camera->ApplyTransform(T);
scene_camera->SetClippingRange(1, 2000);
This is emphasized in the following capture of the scene (the chest is bowed, towards the screen, you can see three of the top fiduciary markers at the bottom-most part of the actor in the scene):
The following screenshot shows the RVEC & TVEC I get, as well as the Position/Orientation/Focal-Point before and after the transformation:
The scene is initialized in the following manner:
this->actor_ = vtkVolume::New();
this->actor_->SetMapper(mapper);
this->actor_->SetProperty(volumeProperty);
this->actor_->SetPosition(0,0,0);
this->actor_->RotateX(90.0);
this->renderer_ = vtkRenderer::New();
this->renderer_->AddViewProp(this->actor_);
this->renderer_->SetBackground(0.3,0.3,0.3);
this->camera_ = this->renderer_->GetActiveCamera();
// Center the scene so that we can grab the position/focal-point for later
// use.
this->renderer_->ResetCamera();
// Get the position/focal-point for later use.
double pos[3];
double orientation[3];
this->camera_->GetPosition(pos);
this->camera_->GetFocalPoint(this->focal_);
double *_o = this->camera_->GetOrientation();
this->orientation_[0] = _o[0];
this->orientation_[1] = _o[1];
this->orientation_[2] = _o[2];
this->position_[0] = pos[0];
this->position_[1] = pos[1];
this->position_[2] = pos[2];
// Set the camera in hopes of it "sticking"
this->camera_->SetPosition(pos);
this->camera_->SetFocalPoint(this->focal_);
this->camera_->SetViewUp(0, 1, 0);
this->camera_->SetFreezeFocalPoint(true);
I apologize for such a long question. I wanted to provide as much information as possible. I have been working on this problem for a few days now and cannot figure it out!
this may be too late but I'm doing nearly the exact thing currently and we just addressed this issue. I'm even using VTK and everything! What SolvePnP returns (Assuming that I am not mistaken that OpenCV's Rodrigues returns a similar matrix) is a global transformation matrix. This matrix represents rotations and translations in the global frame. Due to the way transformations work, globals must be premultiplied whereas local transformations are postmultiplied, so that needs to be done in your code. The way we did this was to use the line
((vtkMRMLLinearTransformNode*)(d->CameraVideoTransformComboBox->currentNode()))->SetMatrixTransformFromParent(staticVideoTransform4x4);
where:
d is a reference to the UI of a 3d Slicer module, which is a program that can basically be treated like a big VTK toolbox
CameraVideoTransformComboBox is just a UI combo box that stores transforms
and staticVideoTransform4x4 is our translational matrix.
We then applied the transform via the UI as opposed to your method of doing it through code, so unfortunately there's some black boxing in there that doesn't let me give you an exact answer as to how you'd have to code it yourself. I would recommend looking at vtkMRMLTransformNode::SetMatrixTransformFromParent() for guidance if you (or more likely someone reading this) is having this issue. If that doesn't work outright, try inverting the matrix!
You can invert matrix returned from solvePnP() and use cv::viz::Viz3d::setViewerPose() sources as example