I'm working on my bachelor's thesis, "Traffic sign detection in image and video", using a neural network called YOLO (You Only Look Once). I think its name is pretty self-explanatory, but the paper may be found here.
This network learns from uncropped annotated images (most networks are trained on cropped images). To train it, I need a dataset of uncropped, annotated European traffic signs. I wasn't able to find one, not even here, so I decided to generate my own dataset.
First, I load many images of the road taken from a static camera mounted on the car.
I have a few TRANSPARENT traffic (stop) sign images like this.
Then I apply a few operations to make the traffic sign look "real" and copy it to random positions (where traffic signs are usually located). The size of the sign is adjusted according to its position in the image: the closer the sign is to the middle, the smaller it is.
The operations I perform on the traffic sign are:
Blur sign with kernel of random size from 1x1 to 31x31.
Rotate image left/right in X axis, random angle 0 to 20.
Rotate image left/right in Z axis, random angle 0 to 30.
Change the brightness by adding or subtracting a random value from 0 to 50.
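For illustration, the parameter ranges listed above could be sampled once per pasted sign, roughly like this. It is only a sketch using the standard library, not the code linked in the question. The function names are hypothetical. Restricting the blur kernel to odd sizes, treating the angles as degrees, and the linear size rule with its `min_scale` and `max_scale` values are all assumptions.

```python
import random


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters for a pasted sign."""
    return {
        # Odd sizes only (1, 3, ..., 31): an assumption, since Gaussian
        # blur kernels are normally required to have odd dimensions.
        "blur_kernel": rng.randrange(1, 32, 2),
        # Signed angles so the sign can tilt either way (degrees assumed).
        "angle_x": rng.uniform(-20.0, 20.0),
        "angle_z": rng.uniform(-30.0, 30.0),
        # Positive values brighten the sign, negative values darken it.
        "brightness_shift": rng.randint(-50, 50),
    }


def shift_brightness(pixel, shift):
    """Add a shift to an 8-bit value and clamp the result to 0..255."""
    return max(0, min(255, pixel + shift))


def scale_for_position(x, image_width, min_scale=0.2, max_scale=1.0):
    """Hypothetical linear rule: smallest at the image centre, largest at the edges."""
    centre = image_width / 2.0
    distance = abs(x - centre) / centre  # 0 at the centre, 1 at either edge
    return min_scale + (max_scale - min_scale) * distance
```

Passing a seeded `random.Random` instance as `rng` makes a generated dataset reproducible, which helps when comparing training runs.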
Here you may see a few result examples (the better ones, I guess): click.
Here is the source code: click.
Question:
Is there anything I could do to make the signs look more real and help the neural network train better?
If the question would be better suited to a different site, please let me know.
I am working on a project to detect certain objects in an aerial image, and as part of this I am trying to utilize elevation data for the image. I am working with Digital Elevation Models (DEMs), basically a matrix of elevation values. When I am trying to detect trees, for example, I want to search for tree-shaped regions that are higher than their surrounding terrain. Here is an example of a tree in a DEM heatmap:
https://i.stack.imgur.com/pIvlv.png
I want to be able to find small regions like that that are higher than their surroundings.
I am using OpenCV and GDAL for my actual image processing. Does either of those already contain techniques for what I'm trying to accomplish? If not, can you point me in the right direction? One idea I've had is to go through each pixel and calculate the rate of change relative to its surrounding pixels; pixels with a high rate of change (steep slopes) would then hopefully mark the edge of a raised area.
Note that the elevations will change from image to image, and this needs to work with any elevation. So the ground might be around 10 meters in one image but 20 meters in another.
Supposing you can put the DEM information into a 2D Mat where each "pixel" holds the elevation value, you can find local maxima by applying dilate and then subtracting the result from the original image. Dilation replaces each pixel with the maximum of its neighbourhood, so the pixels where the difference is zero are the local maxima.
There's a related post with code examples at: http://answers.opencv.org/question/28035/find-local-maximum-in-1d-2d-mat/
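To make the dilate-and-compare idea concrete, here is a small pure-Python sketch on a nested list instead of a Mat. The 3x3 neighbourhood, the function names, and the extra `min_rise` filter are assumptions added for illustration. Without some such filter, every cell of perfectly flat ground also equals its neighbourhood maximum.

```python
def neighbourhood(grid, r, c):
    """Values in the 3x3 window around (r, c), clipped at the grid border."""
    rows, cols = len(grid), len(grid[0])
    return [
        grid[rr][cc]
        for rr in range(max(0, r - 1), min(rows, r + 2))
        for cc in range(max(0, c - 1), min(cols, c + 2))
    ]


def dilate(grid):
    """Grayscale dilation: each cell becomes the maximum of its neighbourhood."""
    return [
        [max(neighbourhood(grid, r, c)) for c in range(len(grid[0]))]
        for r in range(len(grid))
    ]


def local_maxima(grid, min_rise=1.0):
    """Return (row, col) of cells that equal their neighbourhood maximum.

    `min_rise` is a hypothetical filter: the cell must also stand at least
    that far above the lowest cell in its neighbourhood.
    """
    dilated = dilate(grid)
    peaks = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            is_max = dilated[r][c] - value == 0
            rise = value - min(neighbourhood(grid, r, c))
            if is_max and rise >= min_rise:
                peaks.append((r, c))
    return peaks
```

Because only differences between neighbouring cells are used, the result does not depend on whether the ground sits at 10 m or 20 m.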
I am working on a crowd estimation project, focused on estimating crowd levels in canteens. My approach is based on subtracting the background and keeping only the people in the foreground. The people are represented by white pixels and the background is black, so I can estimate the crowd level by counting the white pixels in the foreground. However, this requires people at different distances from the camera to appear the same size. Otherwise, a person sitting near the camera may cover 5 times as many white pixels as a person sitting far away. This is not what I want; I want their pixel counts to be almost the same. Below is a screenshot of the canteen:
The first approach I tried is segmenting the scene into regions and assigning smaller weights to the near regions and larger weights to the far regions. However, the segmentation and the weight assignment are both done manually, and it's hard to decide where to draw the segment boundaries and what weight to give each region. The picture below shows how it is done:
Another approach I tried is the perspective transform: choosing four points on the input image and mapping them to four points on the output image. However, it's hard to choose the four points, and it's hard to tell whether the people are the same size after the transformation. It's shown below:
Can anyone suggest a good way to solve this problem of people appearing at different sizes (due to their different distances from the camera)? Your reply is greatly appreciated.
I was unable to find literature on this.
The question is: given a photograph containing a well-known object (say, something that was printed for this purpose), how well does it work to use that object to infer the lighting conditions, as a method of color profile calibration?
For instance, say we print out the peace flag rainbow and then take a photo of it in various lighting conditions with a consumer-grade flagship smartphone camera (say, iPhone 6 or Nexus 6). The underlying question is whether using known references within the image is a potentially good technique for calibrating the colors throughout the image.
There are of course a number of issues regarding variance of lighting conditions in different regions of the photograph, along with which wavelengths the device is capable of differentiating in even the best circumstances, but let's set them aside.
Has anyone worked with this technique or seen literature regarding it? If so, can you point me in the direction of some findings?
Thanks.
I am not sure if this is a standard technique, however one simple way to calibrate your color channels would be to learn a regression model (for each pixel) between the colors that are present in the region and their actual colors. If you have some shots of known images, you should have sufficient data to learn the transformation model using a neural network (or a simpler model like linear regression if you like, but a NN would be able to capture multi-modal mappings). You can even do a patch based regression using a NN on small patches (say 8x8, or 16x16) if you need to learn some spatial dependencies between intensities.
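As a rough illustration of the simpler linear-regression option, the sketch below fits one gain and one offset per color channel from pairs of (measured value, known reference value). This is ordinary least squares on a single channel, not the neural-network variant, and the function names are hypothetical.

```python
def fit_channel(measured, reference):
    """Least-squares fit of reference ~ gain * measured + offset for one channel."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_r = sum(reference) / n
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_m == 0:
        raise ValueError("need at least two distinct measured values")
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(measured, reference))
    gain = cov / var_m
    offset = mean_r - gain * mean_m
    return gain, offset


def correct(value, gain, offset):
    """Apply the fitted mapping and clamp the result to the 8-bit range."""
    return max(0.0, min(255.0, gain * value + offset))
```

You would call `fit_channel` three times (once each for R, G and B) with samples taken from the known object, then run `correct` over the rest of the image.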
This should be possible, but you should pay attention to the way your known object reacts to light. Ideally it should be non-glossy, have identical colours when pictured from an angle, be totally non-transparent, and reflect all wavelengths outside the visible spectrum to which your sensor is sensitive (IR, UV; no filter is perfect) uniformly across all the differently coloured regions. That last requirement is very important and very hard to get right.
However, the main issue you have with a coloured known object is: what are the actual colours of the different regions in RGB(*)? Without knowing them, you can determine the effect of different lighting conditions relative to each other, but never relative to some ground truth.
The solution: use a uniformly white, non-reflective, opaque surface. A sufficiently thick sheet of white paper should do just fine. Take a non-overexposed photograph of the sheet in your scene, and you know:
R, G and B should be close to equal
R, G and B should be nearly 255.
From those two facts and the R, G and B values you actually get from the sheet, you can determine any shift in colour and brightness in your scene. Assume that black is still black (usually a reasonable assumption) and use linear interpolation to determine the shift experienced by pixels with values somewhere between 0 and 255 on any of the axes.
(*) or other colourspace of your choice.
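A minimal sketch of that white-sheet correction, assuming 8-bit RGB values and a purely linear scale per channel (so black stays black). The function names are made up for the example.

```python
def white_balance_gains(sheet_rgb, target=255.0):
    """Per-channel gains that map the photographed white sheet to pure white.

    Assumes no channel of the sheet reads 0, which would mean the shot
    is badly underexposed.
    """
    return tuple(target / channel for channel in sheet_rgb)


def apply_gains(pixel_rgb, gains):
    """Scale each channel linearly through zero and clamp to 255."""
    return tuple(
        min(255, round(value * gain)) for value, gain in zip(pixel_rgb, gains)
    )
```

In practice `sheet_rgb` would be the mean colour over a patch of the sheet rather than a single pixel, to average out sensor noise.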
I am currently trying to track human heads from a CCTV. I am currently using colour histogram and LBP histogram comparison to check the affinity between bounding boxes. However sometimes these are not enough.
I was reading through a paper at the following link: paper, where a dispersion metric is described. However, I still don't fully understand it. For example, I cannot understand what p_ij refers to in the equation. Can someone kindly and clearly explain how I can find the dispersion between bounding boxes in separate frames, please?
Your assistance is much appreciated :)
This paper tackles the tracking problem using a background model, as most CCTV tracking methods do. The BG model produces a foreground mask, and the aforementioned p_ij relates to this mask after some morphology. Specifically, they try to separate foreground blobs into components, based on thresholds on allowed 'gaps' in FG mask holes. The end result of this procedure is a set of binary masks, one for each hypothesized object. These masks are then used for tracking using spatial and temporal consistency. In my opinion, this is an old fashioned way of processing video sequences, only relevant if you're limited in processing power and the scenes are not crowded.
To answer your question, if O is the mask related to one of the hypothesized objects, then p_ij is the binary pixel at location (i,j) within the mask. Thus, c_x and c_y are the center of mass of the binary shape, and the dispersion is simply the average distance from the center of mass for the shape (it is larger for larger objects). This enforces scale consistency in tracking, but in a very weak manner. You can do much better if you have a calibrated camera.
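Here is a small sketch of that computation on a binary mask stored as a nested list. It follows the verbal description above (mean distance of foreground pixels from the center of mass). Using the Euclidean distance is an assumption, so check the exact form against the equation in the paper.

```python
import math


def centre_of_mass(mask):
    """Return (c_x, c_y) of the non-zero pixels; i indexes rows, j columns."""
    points = [
        (j, i) for i, row in enumerate(mask) for j, p in enumerate(row) if p
    ]
    if not points:
        raise ValueError("mask has no foreground pixels")
    c_x = sum(x for x, _ in points) / len(points)
    c_y = sum(y for _, y in points) / len(points)
    return c_x, c_y


def dispersion(mask):
    """Mean Euclidean distance of the foreground pixels from their centre of mass."""
    c_x, c_y = centre_of_mass(mask)
    distances = [
        math.hypot(j - c_x, i - c_y)
        for i, row in enumerate(mask)
        for j, p in enumerate(row)
        if p
    ]
    return sum(distances) / len(distances)
```

Comparing the dispersion of a candidate mask in one frame with that of a tracked mask in the previous frame gives a crude size-consistency check.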
I am trying to do image detection in C++. I have two images:
Image Scene: 1024x786
Person: 36x49
And I need to identify this particular person from the scene. I've tried to use Correlation but the image is too noisy and therefore doesn't give correct/accurate results.
I've been thinking/researching methods that would best solve this task and these seem the most logical:
Gaussian filters
Convolution
FFT
Basically, I would like to remove the noise from the images so that I can then use correlation to find the person more effectively.
I understand that an FFT will be hard to implement and/or may be slow especially with the size of the image I'm using.
Could anyone offer any pointers to solving this? What would the best technique/algorithm be?
In Andrew Ng's Machine Learning class we did this exact problem using neural networks and a sliding window:
train a neural network to recognize the particular feature you're looking for using data with tags for what the images are, using a 36x49 window (or whatever other size you want).
for recognizing a new image, take the 36x49 rectangle and slide it across the image, testing at each location. When you move to a new location, move the window right by a certain number of pixels, call it the jump_size (say 5 pixels). When you reach the right-hand side of the image, go back to 0 and increment the y of your window by jump_size.
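The scanning loop from the two steps above can be sketched like this. Only the window bookkeeping is shown. `score` is a hypothetical callable standing in for the trained network, and the default sizes are the ones from the question.

```python
def sliding_windows(image_w, image_h, win_w=36, win_h=49, jump_size=5):
    """Yield the top-left (x, y) of every window position, row by row."""
    for y in range(0, image_h - win_h + 1, jump_size):
        for x in range(0, image_w - win_w + 1, jump_size):
            yield x, y


def best_window(image_w, image_h, score, **kwargs):
    """Return the (x, y) whose window gets the highest score.

    `score(x, y)` is a placeholder for running the classifier on the
    window whose top-left corner is at (x, y).
    """
    return max(
        sliding_windows(image_w, image_h, **kwargs),
        key=lambda pos: score(*pos),
    )
```

A smaller `jump_size` gives finer localization at the cost of more classifier evaluations.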
Neural networks are good for this because the noise isn't a huge issue: you don't need to remove it. It's also good because it can recognize images similar to ones it has seen before, but are slightly different (the face is at a different angle, the lighting is slightly different, etc.).
Of course, the downside is that you need the training data to do it. If you don't have a set of pre-tagged images then you might be out of luck - although if you have a Facebook account you can probably write a script to pull all of yours and your friends' tagged photos and use that.
An FFT only makes sense once you have already sorted the image with a k-d tree or a hierarchical tree. I would suggest mapping the image's 2D RGB values to a 1D curve to reduce some complexity before doing a frequency analysis.
I do not have an exact algorithm to propose, because I have found that target detection methods depend greatly on the specific situation. Instead, I have some tips and advice. Here is what I would suggest: find a specific characteristic of your target and design your code around it.
For example, if you have access to the color image, use the fact that Wally doesn't have much green or blue. Subtract the average of the blue and green channels from the red channel and you'll have a much better starting point. (Apply the same operation to both the image and the target.) This will not work, though, if the noise is color-dependent (i.e. differs between color channels).
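As a tiny illustration of that channel trick, the following sketch turns each RGB pixel into a single "redness" value. Clamping negative results to zero is a choice made for this example and not part of the suggestion above; the function names are hypothetical.

```python
def red_emphasis(pixel):
    """Red minus the mean of green and blue, clamped at zero."""
    r, g, b = pixel
    return max(0.0, r - (g + b) / 2.0)


def transform(image):
    """Apply red_emphasis to every pixel of a nested-list RGB image."""
    return [[red_emphasis(p) for p in row] for row in image]
```

Run `transform` on both the scene and the target before correlating them, so that both are in the same single-channel representation.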
You could then use correlation on the transformed images with better results. The downside of correlation is that it only works with an exact cut-out of the first image, which is not very useful if you need to find the target in order to find the target! Instead, I suppose that an averaged version of your target (a combination of many Wally pictures) would work up to a point.
My final advice: in my personal experience of working with noisy images, spectral analysis is usually a good thing because the noise tends to contaminate only one particular scale (which would hopefully be a different scale than Wally's!). In addition, correlation is mathematically equivalent to comparing the spectral characteristics of your image and the target.