Why do nearby unknown objects destroy the reconstruction of my Autoencoder? - computer-vision

I built a convolutional Autoencoder and trained it on a large set of synthetic street images that do not contain any vehicles or other objects on the road. The quality of my reconstructions (image 1) is quite sharp, even on unseen data.
Spawning an object on the road (image 2) causes some slight reconstruction errors, but the overall reconstruction remains intact. The AE seems to ignore the spawned object, since the streets in the training data do not contain any objects.
However, when an object appears close to the camera (image 3), the reconstruction breaks down and I get a noisy image.
Why is this the case? Why does the reconstruction of the (same) background not work anymore?

Related

2D object detection with only a single training image

The vision system is given a single training image (e.g. a piece of 2D artwork) and is asked whether that artwork is present in newly captured photos. The newly captured photos can contain many other objects, and when the artwork is present it faces up but may be occluded.
The pose space is x, y, rotation and scale. The artwork may be highly symmetric or not.
What is the latest state of the art for handling this kind of problem?
I have tried/considered the following options, but there are problems with all of them. If my reasoning is invalid, please correct me.
Deep learning (R-CNN/YOLO): a lot of labeled data is needed, which means a lot of human labor for each new piece of artwork.
Traditional machine learning (SVM, random forest): same as above.
SIFT/SURF/ORB + RANSAC or voting: when the artwork is symmetric, the matched features are mostly incorrect, and a lot of time is spent in the RANSAC/voting stage.
Generalized Hough transform: the state space is too large for the voting table. A pyramid can be applied, but it is difficult to choose universal thresholds for different kinds of artwork to proceed down the pyramid.
Chamfer matching: the state space is too large, and too much time is needed to search across it.
Object detection requires a lot of labeled data of the same class to generalize well, and in your setting it would be impossible to train a network with only a single instance.
I assume that in your case online object trackers could work; at least give them a try. There are convolutional object trackers that work very well, such as Siamese CNNs. The code is open source on GitHub, and you can watch this video to see its performance.
Online object tracking: Given the initialized state (e.g., position and size) of a target object in a frame of a video, the goal of tracking is to estimate the states of the target in the subsequent frames. -source-
You can try a traditional feature-based image-processing approach, which might give true-positive template matches with decent accuracy.
Given the template image as in the question:
First dilate the image to join all very closely spaced connected components.
Find the convex hull of the connected object obtained above; this gives you a polygon.
Use the polygon's edge-length information, e.g. the (max length / min length) ratio, as one feature of the template.
Also compute the pixel density inside the polygon as a second feature.
We now have two features.
Scene image feature vector:
Similarly, in the scene image, apply dilation followed by connected-component identification, fit a convex hull (polygon) around each connected object, and compute a feature vector for each object (edge info, pixel density).
Now search for the template feature vector among the scene-image feature vectors using minimum feature distance (also apply an upper distance threshold to avoid false-positive matches); a rough sketch of this pipeline is given below.
This should give the true-positive matches if the template is present in the scene image.
Exception: this method will not work for occluded objects.
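Below is a rough Python/OpenCV sketch of the dilation + convex-hull feature idea above; the kernel size, distance threshold and function names are illustrative assumptions, not part of the original answer.

```python
import cv2
import numpy as np

def _features_for_contour(binary_img, cnt):
    """(max-edge / min-edge ratio of the convex hull, foreground density inside the hull)."""
    hull = cv2.convexHull(cnt)
    pts = hull.reshape(-1, 2).astype(np.float64)
    edges = np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1)  # polygon edge lengths
    edge_ratio = edges.max() / max(edges.min(), 1e-6)
    mask = np.zeros(binary_img.shape, np.uint8)
    cv2.fillConvexPoly(mask, hull.astype(np.int32), 255)
    density = cv2.countNonZero(cv2.bitwise_and(binary_img, mask)) / max(cv2.countNonZero(mask), 1)
    return np.array([edge_ratio, density])

def object_features(binary_img):
    """Dilate to merge nearby components, then return one feature vector per object."""
    dilated = cv2.dilate(binary_img, np.ones((5, 5), np.uint8), iterations=2)
    # OpenCV 4.x: findContours returns (contours, hierarchy)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [_features_for_contour(dilated, c) for c in contours]

def best_match(template_feat, scene_feats, max_dist=0.5):
    """Index of the nearest scene object by feature distance, or None above the threshold."""
    if not scene_feats:
        return None
    dists = [np.linalg.norm(template_feat - f) for f in scene_feats]
    best = int(np.argmin(dists))
    return best if dists[best] < max_dist else None

# Usage sketch: template_feat = object_features(template_img)[0]
#               idx = best_match(template_feat, object_features(scene_img))
```

In practice the two features live on very different scales, so you would normalize them (or use per-feature thresholds) before comparing distances.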

findHomography usage (OpenCV)

I am using OpenCV (C++) and am a new user. I am interested in object-detection problems. So far I have studied and implemented sparse optical flow (the Lucas-Kanade method) on video from a stationary camera. After trying k-means and background subtraction, I have decided to move on to a more difficult problem: the moving camera.
I have studied some documentation and found that I could use cv::findHomography to find the inliers and outliers across the sequence of frames in my video, and then work out from the returned values which motion is caused by the camera and which by objects. In addition, I could use SURF features to track some objects and then decide which of them are good points.
However, I was wondering how to implement this in practice. For example, should I use the first frame as ground truth, detect some features with SURF, and then call findHomography for each subsequent frame? Any ideas/help is welcome!
Detecting moving objects from a moving camera is quite a challenging task and requires a solid understanding of multiple-view geometry; besides, there is less information available on this topic (than, for example, on structure from motion), so be warned!
Anyway, a homography matrix will not be a good choice for detecting moving objects (unless you are 100% sure that your background can be represented accurately enough by a flat surface). You should probably use the fundamental matrix or the trifocal tensor.
The fundamental matrix is computed from point correspondences between two frames. It maps points in one image to lines in the other image (the so-called epipolar lines), and in this way it is independent of the scene structure. After you have obtained the F matrix using a robust estimation method such as RANSAC or LMedS (RANSAC seems to work better for this kind of task), you can calculate the reprojection error for each point. Objects that move independently of the scene will not be described accurately by the F matrix and will have a larger error, so the outliers of an F matrix computed from image matches over two frames can be considered moving objects. One note, though: objects that move along epipolar lines will not be detected by this approach, since their parallax can also be explained by some depth level.
The trifocal tensor does not have the depth/motion ambiguity for objects that move along epipolar lines, but it is harder to estimate and it is not included in OpenCV. It can be computed from correspondences over three frames, and its usage can be described conceptually as triangulating a point from two views and then computing the reprojection error on the third view.
As for the matching: I still think LK tracking will work better than SURF matching on video sequences, since you don't need to consider very distant points as matches, and tracking is usually faster than detection + matching.
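To make the fundamental-matrix approach above concrete, here is a minimal Python sketch with OpenCV (the equivalent functions exist in the C++ API); the pixel threshold and the LK-based way of obtaining correspondences are assumptions, not prescriptions from the answer.

```python
import cv2
import numpy as np

def moving_point_mask(pts_prev, pts_curr, thresh=1.0):
    """Flag points whose distance to their epipolar line exceeds `thresh` pixels;
    these are candidates for independently moving objects."""
    pts_prev = np.float32(pts_prev).reshape(-1, 2)
    pts_curr = np.float32(pts_curr).reshape(-1, 2)
    F, _ = cv2.findFundamentalMat(pts_prev, pts_curr, cv2.FM_RANSAC, thresh, 0.99)
    if F is None:
        return np.zeros(len(pts_prev), dtype=bool)
    # Epipolar line in the current image for each previous point: l = F * x_prev
    lines = cv2.computeCorrespondEpilines(pts_prev.reshape(-1, 1, 2), 1, F).reshape(-1, 3)
    dist = np.abs(np.sum(lines[:, :2] * pts_curr, axis=1) + lines[:, 2])
    dist /= np.maximum(np.linalg.norm(lines[:, :2], axis=1), 1e-9)
    return dist > thresh

def track_and_flag(prev_gray, curr_gray):
    """LK-track corners between two consecutive grayscale frames and flag
    F-matrix outliers as candidate moving-object points."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=7)
    if p0 is None:
        return np.empty((0, 2)), np.empty((0, 2)), np.empty(0, dtype=bool)
    p1, st, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
    ok = st.flatten() == 1
    return (p0[ok].reshape(-1, 2), p1[ok].reshape(-1, 2),
            moving_point_mask(p0[ok], p1[ok]))
```

As noted in the answer, points lying on the background satisfy the epipolar constraint and end up as inliers, while independently moving points tend to show up as outliers (except for motion along the epipolar lines).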

Any fast object-detection method for binary images?

My current task is detecting abnormal spots in a binary image.
I trained a simple convolutional NN model to classify the abnormality of a single spot.
I need to detect abnormal points in a large image.
This requires thousands of classifications to inspect every pixel (I am using this method, but it takes a tremendous amount of time).
An example of such a large image is shown below.
* The white/black patterns need to be inspected (they seem hard to segment).
* The yellow box is the size of one CNN patch.
* The center of the yellow box is the location of the abnormal spot.
Example of a large image
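One hedged sketch of speeding up the patch-by-patch classification described above: extract the patches with a stride and classify them in batches rather than one at a time. The patch size, stride, and the `model.predict` batch call are assumptions (e.g. a Keras-style classifier), not details from the question.

```python
import numpy as np

def scan_image(image, model, patch=32, stride=4, batch_size=1024):
    """Slide a patch x patch window over a 2-D image with the given stride,
    classify the windows in batches, and return a coarse abnormality score map."""
    h, w = image.shape
    ys = list(range(0, h - patch + 1, stride))
    xs = list(range(0, w - patch + 1, stride))
    coords = [(y, x) for y in ys for x in xs]
    scores = np.zeros(len(coords), dtype=np.float32)
    for start in range(0, len(coords), batch_size):
        chunk = coords[start:start + batch_size]
        batch = np.stack([image[y:y + patch, x:x + patch] for (y, x) in chunk])
        batch = batch[..., np.newaxis].astype(np.float32)        # add a channel axis
        scores[start:start + len(chunk)] = model.predict(batch).ravel()  # assumed batch API
    return scores.reshape(len(ys), len(xs))                      # one score per window
```

A larger stride trades localization accuracy for speed, since only every stride-th position is scored.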

OpenCV optical flow tracking: stop condition

I'm currently trying to implement face tracking using optical flow with OpenCV.
To achieve this, I detect faces with the OpenCV face detector, determine features to track on the detected areas by calling goodFeaturesToTrack, and track them by calling calcOpticalFlowPyrLK.
It gives good results.
However, I'd like to know when the face I'm currently tracking is no longer visible (the person leaves the room, is hidden behind an object or another person, ...), but calcOpticalFlowPyrLK tells me nothing about it.
The status parameter of calcOpticalFlowPyrLK rarely reports errors for tracked features (so if the person disappears, I will still have a good number of seemingly valid features to track).
I've tried calculating a direction vector for each feature to determine its motion between the previous and the current frame (for example, determining that some point of the face has moved to the left between the two frames), and then computing the variance of these vectors (if the vectors are mostly different, the variance is high; otherwise it is not). This did not give the expected results (good in some situations, but bad in others).
What could be a good condition for deciding whether the optical flow tracking has to be stopped?
I've thought of some possible solutions, such as:
The variance of the displacement distances of the tracked features (if the motion is uniform, the distances should be nearly the same, but if something happened, they will differ).
Comparing the shape and size of the area containing the original positions of the tracked features with the area containing the current ones. At the beginning we have a square containing the features of the face, but if the person leaves the room, the shape can deform.
You can try a bidirectional confidence measure on your tracked points.
That is, estimate the feature positions forward from img0 to img1 and then track those positions backwards from img1 to img0. If the doubly tracked features end up near their original positions (the distance should be less than 1 or even 0.5 pixels), they have been tracked successfully. This is a bit more reliable than the SSD used by the status flag of OpenCV's PLK. If a certain number of features could not be tracked, raise the stop event.
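A minimal Python/OpenCV sketch of this forward-backward check (the 0.5 px threshold and the "lost" criterion below are illustrative):

```python
import cv2
import numpy as np

def forward_backward_check(img0, img1, pts0, fb_thresh=0.5):
    """Track pts0 from img0 to img1, track the results back to img0, and keep
    only points whose round-trip error is below fb_thresh pixels."""
    pts0 = np.float32(pts0).reshape(-1, 1, 2)
    pts1, st1, _ = cv2.calcOpticalFlowPyrLK(img0, img1, pts0, None)
    pts0_back, st2, _ = cv2.calcOpticalFlowPyrLK(img1, img0, pts1, None)
    fb_err = np.linalg.norm(pts0.reshape(-1, 2) - pts0_back.reshape(-1, 2), axis=1)
    good = (st1.flatten() == 1) & (st2.flatten() == 1) & (fb_err < fb_thresh)
    return pts1.reshape(-1, 2), good

# Possible stop condition: if, say, fewer than half of the features survive the
# check, consider the face lost and re-run the face detector.
```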

Semi-automatic face & eye detection

For analysis we have a sequence of images or a movie. My aim is to create semi-automatic face and eye detection for these sequences. A sequence consists of about 4000 images with a frontal capture of a person moving slightly. I want to process these images semi-automatically or manually to get the two/three ROIs of the face and eyes.
I tried OpenCV's cascade classifiers, but for my sequences they do not turn out to be robust (with manual control we need to reach a rate of 100%). The cascade classifiers do not return positions, e.g. when the person is looking slightly to the side.
Is there any semi-automatic approach out there for ImageJ, MATLAB or OpenCV/C++ to select/correct the ROIs manually if a detection is false, or to select templates for tracking?
If you are processing a movie, it is reasonable to assume that the motion between frames is small. The following is a possible approach:
Initialize the first frame manually (or get user input to confirm/edit the positions detected by the cascade classifiers).
For the next frame, check whether the detected features are too far from the previous positions. You can also check whether the positions of different parts are moving in an illogical manner.
Stop and get the user to correct the points if the processing in step 2 suggests errors.
Note: with OpenCV cascades, face detection is generally accurate, but eye detection is less accurate and you might not detect both eyes in some frames. Some projects use AAMs (Active Appearance Models) to robustly track a face, and this might work for you.
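A rough Python/OpenCV sketch of this semi-automatic loop; the jump threshold, the cascade file (shipped with the opencv-python package under cv2.data.haarcascades), and the "ask the user" convention are assumptions about the workflow, not part of the answer.

```python
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def process_sequence(frames, init_roi, max_jump=40):
    """frames: iterable of BGR images; init_roi: manually confirmed (x, y, w, h)
    for the first frame. Yields (frame_index, roi, needs_manual_check)."""
    prev_center = np.array([init_roi[0] + init_roi[2] / 2, init_roi[1] + init_roi[3] / 2])
    for i, frame in enumerate(frames):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            yield i, None, True            # nothing detected -> ask the user
            continue
        # Pick the detection closest to the previous position
        centers = np.array([[x + w / 2, y + h / 2] for (x, y, w, h) in faces])
        j = int(np.argmin(np.linalg.norm(centers - prev_center, axis=1)))
        jump = np.linalg.norm(centers[j] - prev_center)
        needs_check = jump > max_jump      # implausible motion -> ask the user
        if not needs_check:
            prev_center = centers[j]       # only accept the new position if it is plausible
        yield i, tuple(faces[j]), needs_check
```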
For face detection, try this list of 50+ APIs:
http://blog.mashape.com/post/53379410412/list-of-50-face-detection-recognition-apis
For eye detection you can try the flandmark detector:
http://cmp.felk.cvut.cz/~uricamic/flandmark/
Or STASM:
http://www.milbo.users.sonic.net/stasm/