Based on a, b, c, d, Action Recognition with Deep Learning, Long-term Recurrent Convolutional Networks, e, Generic Features for Video Analysis, ... there are several methods for analyzing video with Caffe, but what exactly is the input to Caffe?
Can we put videos in different folders for training, as we do with images?
DIGITS doesn't support video data yet. When we do we'll add some sort of video example here:
https://github.com/NVIDIA/DIGITS/tree/master/examples
In my experience, you can't do this directly with DIGITS, because DIGITS has no default settings for analyzing sequences of frames. C3D, a well-known Caffe project for action recognition, can be used to train a new model or to fine-tune an existing model for action or activity recognition.
C3D
I want to detect more objects than the COCO dataset covers (it has only 80 object classes). I also want to detect as many actions as possible, such as hugging, swimming, etc.
I don't care about the size and I do not want to train a model myself. So is there a big enough dataset (with pre-trained weights) already available that I can download and use, or do I have to label data and train YOLO myself?
You can find a very large dataset with bounding boxes here!
The task you are describing is known as action recognition. Here [1] is a good repo that lists a lot of out-of-the-box models for this task.
An explanation: Models (like YOLO) contain two main blocks: feature extraction (CNN stuff) and classification (linear layers). When training from scratch, both feature extraction and classification are trained from scratch. It is easy to train the classification part for what you want, but it is hard to train the feature extraction part (as it takes a lot of time). Hence, we typically use models pre-trained on generalized datasets (YOLO, for example, is trained on COCO), so our feature extraction part starts from a reasonably good, general place. We then replace the classification part with our own, which is trained from scratch for our task.
TL;DR, you can use a pre-trained YOLO model on COCO for your task by replacing the last linear layers to classify what you want. Here are some resources for different frameworks [2, 3].
Here [4] is a simple walkthrough of how to do this.
[1] https://github.com/jinwchoi/awesome-action-recognition/blob/master/README.md#action-recognition-and-video-understanding
[2] TensorFlow: https://www.tensorflow.org/tutorials/images/transfer_learning
[3] PyTorch: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
[4] https://blog.roboflow.com/training-yolov4-on-a-custom-dataset/
Here are my settings for Google Speech to Text AI
Here is the output file of Speech to Text AI : https://justpaste.it/speechtotext2
Here is the output file of YouTube's auto caption: https://justpaste.it/ytautotranslate
This is the video link : https://www.youtube.com/watch?v=IOMO-kcqxJ8&ab_channel=SoftwareEngineeringCourses-SECourses
This is the audio file of the video provided to Google Speech AI : https://storage.googleapis.com/text_speech_furkan/machine_learning_lecture_1.flac
Here I am providing time assigned SRT files
YouTube's SRT : https://drive.google.com/file/d/1yPA1m0hPr9VF7oD7jv5KF7n1QnV3Z82d/view?usp=sharing
Google Speech to Text API's SRT (timing assigned by YouTube) : https://drive.google.com/file/d/1AGzkrxMEQJspYenCbohUM4iuXN7H89wH/view?usp=sharing
I compared some sentences, and YouTube's auto captioning is definitely better
For example
Google Speech to Text : Represent the **doctor** representation is one of the hardest part of computer AI you will learn about more about that in the future lessons.
What does this mean? Do you think this means that we are not just focused on behavior and **into doubt**. It is more about the reasoning when a human takes an action. There is a reasoning behind it.
YouTube's auto captioning : represent the **data** representation is one of the hardest part of computer ai you will we will learn more about that in the future lessons
what does this mean do you think this means that we are not just focused on behavior and **input** it is more about the reasoning when a human takes an action there is a reasoning behind it
I checked many cases, and YouTube is much better at guessing the correct words. How is this even possible?
This is the command I used to extract the audio from the video : ffmpeg -i "input.mkv" -af aformat=s16:48000 output.flac
Both YouTube's automatic captions and the Speech to Text transcription are generated by machine learning algorithms, so the quality of the transcription can vary depending on several factors.
It is important to note that the Speech to Text API's machine learning models are improved over time, and the results can vary according to the input file and the request configuration. One way to help Google's transcription models is to enable data logging. This allows Google to collect data from your audio transcription requests, which helps to improve the machine learning models it uses for recognizing speech audio, including enhanced models.
Additionally, in the request configuration of the Speech to Text API, you can specify the RecognitionConfig settings. This parameter contains encoding, sampleRateHertz, languageCode, maxAlternatives, profanityFilter and speechContexts. Each of these can affect the accuracy of the transcription.
For FLAC audio files specifically, lossless compression preserves the quality of the audio you provide, because the original digital samples are not degraded. FLAC's compression level parameter ranges from 0 (fastest) to 8 (smallest file size) and does not change the decoded audio.
Also, the Speech to Text API offers different ways to improve the accuracy of the transcription, such as:
Speech adaptation : This feature allows you to specify words and/or phrases that STT should recognize more frequently in your audio data.
Speech adaptation boost : This feature allows you to add numerical weights to words and/or phrases according to how frequently they should be recognized in your audio data.
Phrase hints : Send a list of words and phrases that provide hints to the speech recognition task.
These features might help you with the accuracy of the Speech to Text API recognizing your audio files.
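As a rough illustration of how the RecognitionConfig fields and phrase hints fit together, here is a minimal sketch that only builds the JSON body of a recognize request; it does not call the API. The gs:// URI is assumed from the storage link in the question, and the language code, boost value and phrases are illustrative guesses rather than recommended settings.

```python
import json

def build_recognize_request(gcs_uri, phrases, boost=10.0):
    """Build a JSON-serialisable body for a Speech-to-Text recognize request."""
    return {
        "config": {
            "encoding": "FLAC",
            "sampleRateHertz": 48000,  # assumed to match the 48 kHz ffmpeg export
            "languageCode": "en-US",   # assumption: the lecture is in English
            "maxAlternatives": 1,
            "profanityFilter": False,
            # Phrase hints with a boost (illustrative value, tune per audio)
            "speechContexts": [{"phrases": list(phrases), "boost": boost}],
        },
        "audio": {"uri": gcs_uri},
    }

request_body = build_recognize_request(
    "gs://text_speech_furkan/machine_learning_lecture_1.flac",
    ["data representation", "input"],
)
print(json.dumps(request_body, indent=2))
```

The two hinted phrases are the words that were misrecognized in the examples from the question ("doctor" for "data", "into doubt" for "input"), which is the kind of case phrase hints are meant for.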
Finally, please refer to the Speech to Text best practices to improve the transcription of your audio files. These recommendations are designed for greater efficiency and accuracy, as well as reasonable response times from the API.
I'm in the feasibility stage of a project and wanted to know whether the following was doable using Machine Vision:
If I wanted to see if two files were identical, I would use a hashing function of sorts (e.g. sha1 or md5) on the files and store the results in a database.
However, if I have two images where say image 1 is 90% quality and image 2 is 100% quality, then this will not work as they will have different hashes.
Using machine vision, is it possible to "look" at an image and create a signature from it, so that when another image is encountered, we can say "have we already got this image in the system", and if so, disregard the new image, and if not, save the image?
I know that you are able to perform Machine Vision comparison between two known images, e.g.:
https://www.pyimagesearch.com/2014/09/15/python-compare-two-images/
(there's a lot of code in there so I cannot simply paste in here for reference, unfortunately)
but an image by image comparison would be extremely expensive.
Thanks
Python has a module for this called imagehash.
imagehash computes a perceptual hash of an image, as shown in the code below.
from PIL import Image
import imagehash
# use names that do not shadow Python's built-in hash()
first_hash = imagehash.average_hash(Image.open('./image_1.png'))
print(first_hash)
# d879f8f89b1bbf
other_hash = imagehash.average_hash(Image.open('./image_2.png'))
print(other_hash)
# ffff3720200ffff
print(first_hash == other_hash)
# False
The Python code above prints "True" if the two images produce the same average hash (as visually identical images do) and "False" if they do not.
Thanks.
I do not know what you mean by 90% and 100%. Are they JPEG compression quality settings? Regardless, you can match images using many methods, for example pure image processing approaches such as SIFT, SURF, BRISK, ORB and FREAK, or machine learning approaches such as Siamese networks. However, they are heavy for an ordinary computer to run (on my computer, powered by a Core i7 2670QM, a 2-megapixel match takes from 100 to 2000 ms), especially if you run them without parallelism (no GPU, AVX, ...), and especially the last one.
For hashing you may also use perceptual hash functions. They are widely used in finding cases of online copyright infringement as well as in digital forensics, because hashes of similar data are correlated, so similar data can be found (for instance with a differing watermark) [1]. You can also search for copy-move forgery detection and read the papers on it to see how similar images can be found.
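To make the difference between a file hash and a perceptual hash concrete, here is a minimal sketch in plain Python. It assumes the image has already been reduced to an 8x8 grayscale grid (the downscaling step is left out), uses a simple average hash, and simulates recompression with small pixel noise. The MAX_DISTANCE threshold is a hypothetical value, not a recommended one.

```python
import hashlib

def average_hash(pixels):
    """Return a bit list: 1 where a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def to_hex(bits):
    """Pack the bits into a hex string that can be stored in a database."""
    return format(int("".join(str(b) for b in bits), 2), "0{}x".format(len(bits) // 4))

def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

# Synthetic 8x8 grayscale image: dark left half, bright right half.
original = [[40] * 4 + [200] * 4 for _ in range(8)]
# Simulated lower-quality copy: every pixel shifted by +/-3.
recompressed = [
    [p + (3 if (r + c) % 2 else -3) for c, p in enumerate(row)]
    for r, row in enumerate(original)
]

# A cryptographic hash changes completely when any byte changes.
sha_a = hashlib.sha1(bytes(p for row in original for p in row)).hexdigest()
sha_b = hashlib.sha1(bytes(p for row in recompressed for p in row)).hexdigest()

# The perceptual hash only depends on the coarse bright/dark layout.
distance = hamming(average_hash(original), average_hash(recompressed))

MAX_DISTANCE = 5  # hypothetical duplicate threshold, tune on real data
print("same SHA-1:", sha_a == sha_b)
print("perceptual hash:", to_hex(average_hash(original)))
print("hamming distance:", distance, "duplicate:", distance <= MAX_DISTANCE)
```

Comparing by Hamming distance instead of equality is what lets a 90% quality copy match the 100% quality original. With the imagehash package, subtracting one hash from another gives the same kind of bit-difference count.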
The OpenCV forum has been unavailable for a few days, so I am posting this question here. I want to implement a class in C++ that will analyze an image and determine how good that image is for feature tracking.
One approach has been explained by Vuforia.
https://developer.vuforia.com/library/articles/Solution/Natural-Features-and-Ratings
1) Number of Features
Count the number of features returned and require, say, a minimum of 30 features.
2) Local contrast
The variance can be used as a starting point to measure how much variation there is in the image. What sort of preprocessing would this require to get the most out of this metric?
How can we improve this? With a FT or DFT transform, would it be possible to see if there is high contrast at lots of different image frequencies? How would that be achieved?
DFT -> Variance (?)
3) Feature distribution
This can be done with clustering, with a suitable center and mean+s.d. that is comparable to the image dimensions. 95% should be within mean + 2 x s.d. ideally.
4) Avoid organic shapes
Organic shapes will yield no features, so this is the same criterion as the number of features.
5) Avoid repetitive patterns
Match the detected features against themselves and make sure there aren't too many duplicates.
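A few of the criteria above can be prototyped in plain Python before committing to a C++ design. This is only a sketch under assumptions: the keypoints are taken as given (x, y) tuples from some detector, local contrast is measured as the mean variance of 4x4 blocks, and the "mean + 2 x s.d." rule is interpreted as a limit on each keypoint's distance from the centroid. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev, pvariance

def enough_features(points, minimum=30):
    """Criterion 1: at least `minimum` detected features."""
    return len(points) >= minimum

def local_contrast(gray, block=4):
    """Criterion 2: mean of per-block intensity variances (higher = more texture)."""
    h, w = len(gray), len(gray[0])
    variances = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = [gray[y + dy][x + dx] for dy in range(block) for dx in range(block)]
            variances.append(pvariance(patch))
    return mean(variances)

def spread_ratio(points):
    """Criterion 3: fraction of keypoints within mean + 2 s.d. of the centroid distance."""
    cx = mean(p[0] for p in points)
    cy = mean(p[1] for p in points)
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    limit = mean(dists) + 2 * pstdev(dists)
    return sum(d <= limit for d in dists) / len(dists)

# Toy inputs: a flat image, a checkerboard, and a 6x6 grid of keypoints.
flat = [[128] * 8 for _ in range(8)]
checker = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
grid_points = [(x * 10, y * 10) for x in range(6) for y in range(6)]

print("flat contrast:", local_contrast(flat))
print("checker contrast:", local_contrast(checker))
print("enough features:", enough_features(grid_points))
print("spread ratio:", spread_ratio(grid_points))
```

Note that the spread ratio only flags outlying keypoints. To check that features cover the whole image, comparing the standard deviation of the keypoint coordinates with the image dimensions, as the question suggests, would still be needed.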
Vuforia does the same.
But if you want to write your own code to do this, ARToolkit is an open-source SDK that provides the same feature for NFT markers. If you go through the ARToolkit source code, you will find something like "DisplayFeatureSet".
There is also a DisplayfeatureSet.exe file, which shows the features (hotspots) of a selected image.
I managed to get the source code (.c) for this.
Here is my Google Drive link to download the source code. Work on it and share your experience:
Source Code to Display Feature Set
Good luck :)
I'm looking for advice on a personal project.
I'm trying to create software for customized voice commands. The goal is to let the user (me) record some audio data (2-3 seconds) to define commands/macros. Then, when the user speaks (records the same audio data), the command/macro is executed.
The software must be able to detect a command in less than 1 second of processing time on a low-cost computer (a Raspberry Pi, for example).
I have already looked in two directions:
- Speech Recognition (CMU-Sphinx, Julius, simon) : There are good open-source solutions, but they often need large database files, and speech recognition is not really what I'm trying to do. Speech recognition could consume too much power for such a small feature.
- Audio Fingerprinting (Chromaprint -> http://acoustid.org/chromaprint) : It seems to be almost what I'm looking for. The principle is to create a fingerprint from raw audio data, then compare fingerprints to determine whether they are likely to be identical. However, this kind of software/library seems to be designed for song identification (like the well-known smartphone apps) : I'm trying to configure a good "comparator", but I think I'm heading in the wrong direction.
Do you know of any dedicated software or piece of code that does something similar?
Any suggestion would be appreciated.
I had a more or less similar project in which I intended to send voice commands to a robot. Full speech recognition software is too complicated for such a task. I used an FFT implementation in C++ to extract the Fourier components of the sampled voice, and then I created a histogram of major frequencies (the frequencies at which the target voice command has the highest amplitudes). I tried two approaches:
Comparing the histogram of the given voice command with those saved in memory to identify the most probable command.
Using a Support Vector Machine (SVM) to train a classifier to distinguish voice commands. I used LibSVM and the results are considerably better than with the first approach. However, one problem with the SVM method is that you need a rather large data set for training. Another problem is that, when an unknown voice is given, the classifier will output a command anyway (which is obviously a wrong detection). This can be avoided with the first approach, where I had a threshold on the similarity measure.
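The first approach (histogram comparison with a similarity threshold) can be sketched in a few lines of Python. This is not the C++ code described above: it uses a naive DFT instead of an FFT library, sums the spectrum into equal-width bands rather than picking peak frequencies, and uses cosine similarity with a hypothetical threshold of 0.9. The test signals are synthetic sine tones, not real recordings.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Naive DFT magnitudes for the first N/2 bins (use an FFT library in practice)."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples)))
        for k in range(n // 2)
    ]

def frequency_histogram(samples, bands=8):
    """Sum spectral magnitude into equal-width bands and normalise to unit length."""
    spectrum = magnitude_spectrum(samples)
    width = len(spectrum) // bands
    hist = [sum(spectrum[b * width:(b + 1) * width]) for b in range(bands)]
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

def cosine_similarity(a, b):
    """Dot product of two unit-length histograms."""
    return sum(x * y for x, y in zip(a, b))

def recognise(samples, templates, threshold=0.9):
    """Return the best-matching command name, or None if nothing is similar enough."""
    hist = frequency_histogram(samples)
    name, score = max(
        ((n, cosine_similarity(hist, t)) for n, t in templates.items()),
        key=lambda item: item[1],
    )
    return name if score >= threshold else None

def tone(freq_bin, n=64):
    """Synthetic stand-in for a recording: a sine wave at the given DFT bin."""
    return [math.sin(2 * math.pi * freq_bin * t / n) for t in range(n)]

templates = {
    "low": frequency_histogram(tone(3)),
    "high": frequency_histogram(tone(20)),
}
print(recognise(tone(3), templates))
print(recognise(tone(21), templates))
print(recognise(tone(12), templates))
```

The threshold is what allows an unknown sound to be rejected instead of being mapped to the nearest command, which is the advantage over the SVM noted above.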
I hope this helps you to implement your own voice activated software.
Song fingerprinting is not a good idea for this task, because command timings can vary and a fingerprint expects an exact time match. However, it is very easy to implement matching with the DTW algorithm for time series, using features extracted with Sphinxbase, the CMUSphinx library. See the Wikipedia entry on DTW for details.
http://en.wikipedia.org/wiki/Dynamic_time_warping
http://cmusphinx.sourceforge.net/wiki/download
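For reference, the core of DTW is a small dynamic programming table. The sketch below works on plain 1-D number sequences for illustration; in a real system each element would be a feature vector per audio frame (for example from Sphinxbase), and the absolute difference would be replaced by a vector distance. The sequences are made up.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: stretch a, stretch b, or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

template = [0, 1, 3, 4, 3, 1, 0]
slow = [0, 0, 1, 1, 3, 3, 4, 4, 3, 3, 1, 1, 0, 0]  # same shape, spoken at half speed
other = [4, 3, 1, 0, 1, 3, 4]                      # a different shape

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The half-speed copy aligns with the template at zero cost, while the different shape does not, which is the property that makes DTW tolerant of timing variations between recordings of the same command.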