Creating custom voice commands (GNU/Linux) - c++

I'm looking for advices, for a personal project.
I'm attempting to create a software for creating customized voice commands. The goal is to allow user/me to record some audio data (2/3 secs) for defining commands/macros. Then, when the user will speak (record the same audio data), the command/macro will be executed.
The software must be able to detect a command in less than 1 second of processing time in a low-cost computer (RaspberryPi, for example).
I already searched in two ways :
- Speech Recognition (CMU-Sphinx, Julius, simon) : There is good open-source solutions, but they often need large database files, and speech recognition is not really what I'm attempting to do. Speech Recognition could consume too much power for a small feature.
- Audio Fingerprinting (Chromaprint -> http://acoustid.org/chromaprint) : It seems to be almost what I'm looking for. The principle is to create fingerprint from raw audio data, then compare fingerprints to determine if they can be identical. However, this kind of software/library seems to be designed for song identification (like famous softwares on smartphones) : I'm trying to configure a good "comparator", but I think I'm going in a bad way.
Do you know some dedicated software or parcel of code doing something similar ?
Any suggestion would be appreciated.

I had a more or less similar project in which I intended to send voice commands to a robot. A speech recognition software is too complicated for such a task. I used FFT implementation in C++ to extract Fourier components of the sampled voice, and then I created a histogram of major frequencies (frequencies at which the target voice command has the highest amplitudes). I tried two approaches:
Comparing the similarities between histogram of the given voice command with those saved in the memory to identify the most probable command.
Using Support Vector Machine (SVM) to train a classifier to distinguish voice commands. I used LibSVM and the results are considerably better than the first approach. However, one problem with SVM method is that you need a rather large data set for training. Another problem is that, when an unknown voice is given, the classifier will output a command anyway (which is obviously a wrong command detection). This can be avoided by the first approach where I had a threshold for similarity measure.
I hope this helps you to implement your own voice activated software.

Song fingerprint is not a good idea for that task because command timings can vary and fingerprint expects exact time match. However its very easy to implement matching with DTW algorithm for time series and features extracted with CMUSphinx library Sphinxbase. See Wikipedia entry about DTW for details.
http://en.wikipedia.org/wiki/Dynamic_time_warping
http://cmusphinx.sourceforge.net/wiki/download

Related

Text Detection with YOLO on Challenging Images

I have images that look as follows:
My goal is to detect and recognize the number 31197394. I have already fine-tuned a deep neural network on text recognition. It can successfully identify the correct number, if it is provided it in the following format:
The only task that remains is the detection of the corresponding bounding box. For this purpose, I have tried darknet. Unfortunately, it's not recognizing anything. Does anyone have an idea of a network that performs better on these kind of images? I know, that amazon recognition is able to solve this task. But I need a solution that works offline. So my hopes are still high that there exist pre-trained networks that work. Thank's a lot for your help!
Don't say darknet doesn't work. It depends on how you labeled your dataset. It is true that the numbers you want to recognize are too small so if you don't make any changes to the image during the pre-processing phase, it would be complicated for a neural network to recognize them well. So what you can do that will surely work is:
1---> Before labeling, increase the size of all images by 2 times its current size (like 1000*1000)
2---> used this size (1000 * 1000) for the darket trainer instead of the default size proposed by darknet which is 416 * 416. You would then have to change the configuration file
3---> use the latest darknet version (yolo v4)
4---> On the configuration file, always keep a number of subdivisions at 1.
I also specify that this method is too greedy in memory, it is therefore necessary to provide a machine with RAM > 16 GB. The advantage is that it works...
Thanks for your answers guys! You were right, I had to finetune yolo to make it work. So I created a dataset and fine-tuned yolov5. I am surprised how good the results are. Despite only having about 300 images in total, I get an accuracy of 97% to predict the correct number. This is mainly due to strong augmentations. And indeed the memory requirements are large, but I could train on a 32 GM RAM machine. I can really encourage anyone who faces similar problems to give yolo a shot!!
Maybe use an R-CNN to identify the region where the number is and then pass that region to your fine-tuned neural network for the digit classification

YouTube's auto captioning produces better results than Google Speech to Text API (Model: video, UseEnhanced: true). How can be this possible?

Here my settings of Google Speech to Text AI
Here is the output file of Speech to Text AI : https://justpaste.it/speechtotext2
Here is the output file of YouTube's auto caption: https://justpaste.it/ytautotranslate
This is the video link : https://www.youtube.com/watch?v=IOMO-kcqxJ8&ab_channel=SoftwareEngineeringCourses-SECourses
This is the audio file of the video provided to Google Speech AI : https://storage.googleapis.com/text_speech_furkan/machine_learning_lecture_1.flac
Here I am providing time assigned SRT files
YouTube's SRT : https://drive.google.com/file/d/1yPA1m0hPr9VF7oD7jv5KF7n1QnV3Z82d/view?usp=sharing
Google Speech to Text API's SRT (timing assigned by YouTube) : https://drive.google.com/file/d/1AGzkrxMEQJspYenCbohUM4iuXN7H89wH/view?usp=sharing
I made comparison for some sentences and definitely YouTube's auto translation is better
For example
Google Speech to Text : Represent the **doctor** representation is one of the hardest part of computer AI you will learn about more about that in the future lessons.
What does this mean? Do you think this means that we are not just focused on behavior and **into doubt**. It is more about the reasoning when a human takes an action. There is a reasoning behind it.
YouTube's auto captioning : represent the **data** representation is one of the hardest part of computer ai you will we will learn more about that in the future lessons
what does this mean do you think this means that we are not just focused on behavior and **input** it is more about the reasoning when a human takes an action there is a reasoning behind it
I checked many cases and YouTube's guessing correct words is much better. How is this even possible?
This is the command I used to extract audio of the video : ffmpeg -i "input.mkv" -af aformat=s16:48000:output.flac
Both the automatic captions of the Youtube Auto Caption feature and the transcription of the Speech to Text Recognition are generated by machine learning algorithms, in which case the quality of the transcription may vary according to different aspects.
It is important to note that he Speech to Text API utilizes machine learning algorithms for its transcription, the ones that are improved over time and the results can vary according to the input file and the request configuration. One way of helping the models of Google transcription is by enabling data logging, this will allow Google to collect data from your audio transcription requests that will help to improve its machine learning models used for recognizing speech audio, including enhanced models.
Additionally, on the request configuration of the Speech to Text API, you can specify the RecognitionConfig settings. This parameter contains the encoding, sampleRateHertz, languageCode, maxAlternatives, profanityFilter and the speechContext, every parameter plays an important role on the accuracy of the transcription of the file.
Specifically for FLAC audio files, a lossless compression helps in the quality of the audio provided, since there is no degradation in quality of the original digital sample, FLAC uses a compression level parameter from 0 (fastest) to 8 (smallest file size).
Also, the Speech to Text API offers different ways to improve the accuracy of the transcription, such as:
Speech adaptation : This feature allows you to specify words and/or phrases that STT should recognize more frequently in your audio data
Speech adaptation boost : This feature allows allows you to add numerical weights to words and/or phrases according to how frequently they should be recognized in your audio data.
Phrases hints : Send a list of words and phrases that provide hints to the speech recognition task
These features might help you with the accuracy of the Speech to Text API recognizing your audio files.
Finally, please refer to the Speech to Text best practices to improve the transcription of your audio files, these recommendations are designed for greater efficiency and accuracy as well as reasonable response times from the API.

Machine Vision - Hash An Image

I'm in the feasibility stage of a project and wanted to know whether the following was doable using Machine Vision:
If I wanted to see if two files were identical, I would use a hashing function of sorts (e.g. sha1 or md5) on the files and store the results in a database.
However, if I have two images where say image 1 is 90% quality and image 2 is 100% quality, then this will not work as they will have different hashes.
Using machine vision, is it possible to "look" at an image and create a signature from it, so that when another image is encountered, we can say "have we already got this image in the system", and if so, disregard the new image, and if not, save the image?
I know that you are able to perform Machine Vision comparison between two known images, e.g.:
https://www.pyimagesearch.com/2014/09/15/python-compare-two-images/
(there's a lot of code in there so I cannot simply paste in here for reference, unfortunately)
but an image by image comparison would be extremely expensive.
Thanks
python provide the module called : imagehash :
imagehash - encodes the image which is commend bellow.
from PIL import Image
import imagehash
hash = imagehash.average_hash(Image.open('./image_1.png'))
print(hash)
# d879f8f89b1bbf
otherhash = imagehash.average_hash(Image.open('./image_2.png'))
print(otherhash)
# ffff3720200ffff
print(hash == otherhash)
# False
print(hash)
above is the python code which will print "true" if images are identical and "false" if images are not identical.
Thanks.
I do not what you mean by 90% and 100%. Are they image compression quality using JPEG? Regardless of this, you can match images using many methods for example using image processing only approaches such as SIFT, SURF, BRISK, ORB, FREAK or machine learning approaches such as Siamese networks. However, they are heavy for simple computer to run (on my computer powered by core-i7 2670QM, from 100 to 2000 ms for a 2 mega pixel match), specially if you run them without parallelism ( programming without GPU, AVX, ...), specially the last one.
For hashing you may also use perceptual hash functions. They are widely used in finding cases of online copyright infringement as well as in digital forensics because of the ability to have a correlation between hashes so similar data can be found (for instance with a differing watermark) [1]. Also you can search copy move forgery and read papers around it and see how similar images could be found.

Looking for Ideas: How would you start to write a geo-coder?

Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and garguntuan task, so I am wondering how you might get started. What would you read? What algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data from? (unless you are willing to pay thousands of dollars for proprietary sets).
You could build a geocoding-api on top of OpenStreetMap (they publish their data in dumps on a regular basis) I guess, but that one was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, OR they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users that were early adopters of GPS technology who ere more than willing to enter in street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance (as the crow flies). From there, try to figure out how to include roads. (My guess is you would have to have data point for each and every curve, where you hold the geocode location of the curve, and the angle of the road on a north/south and east/west vector. You'd probably need to take incline into account, too to get accurate road measurements.)
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.

How do I write a Perl script to filter out digital pictures that have been doctored?

Last night before going to bed, I browsed through the Scalar Data section of Learning Perl again and came across the following sentence:
the ability to have any character in a string means you can create, scan, and manipulate raw binary data as strings.
An idea immediately hit me that I could actually let Perl scan the pictures that I have stored on my hard disk to check if they contain the string Adobe. It seems by doing so, I can tell which of them have been photoshopped. So I tried to implement the idea and came up with the following code:
#!perl
use autodie;
use strict;
use warnings;
{
local $/="\n\n";
my $dir = 'f:/TestPix/';
my #pix = glob "$dir/*";
foreach my $file (#pix) {
open my $pic,'<', "$file";
while(<$pic>) {
if (/Adobe/) {
print "$file\n";
}
}
}
}
Excitingly, the code seems to be really working and it does the job of filtering out the pictures that have been photoshopped. But problem is many pictures are edited by other utilities. I think I'm kind of stuck there. Do we have some simple but universal method to tell if a digital picture has been edited or not, something like
if (!= /the origianl format/) {...}
Or do we simply have to add more conditions? like
if (/Adobe/|/ACDSee/|/some other picture editors/)
Any ideas on this? Or am I oversimplifying due to my miserably limited programming knowledge?
Thanks, as always, for any guidance.
Your best bet in Perl is probably ExifTool. This gives you access to whatever non-image information is embedded into the image. However, as other people said, it's possible to strip this information out, of course.
I'm not going to say there is absolutely no way to detect alterations in an image, but the problem is extremely difficult.
The only person I know of who claims to have an answer is Dr. Neal Krawetz, who claims that digitally altered parts of an image will have different compression error rates from the original portions. He claims that re-saving a JPEG at different quality levels will highlight these differences.
I have not found this to be the case, in my investigations, but perhaps you might have better results.
No. There is no functional distinction between a perfectly edited image, and one which was the way it is from the start - it's all just a bag of pixels in the end, after all, and any other metadata you can remove or forge all you want.
The name of the graphics program used to edit the image is not part of the image data itself but of something called meta data - which may be stored in the image file but, as others have noted, is neither required (so some programs may not store it, some may allow you an option of not storing it) nor reliable - if you forged an image, you might have forged the meta data as well.
So the answer to your question is "no, there's no way to universally tell if the pic was edited or not, although some image editing software may write its signature into the image file and it'll be left there by carelessness of the editing person.
If you're inclined to learn more about image processing in Perl, you could take a look at some of the excellent modules CPAN has to offer:
Image::Magick - read, manipulate and write of a large number of image file formats
GD - create colour drawings using a large number of graphics primitives, and emit the drawings in various formats.
GD::Graph - create charts
GD::Graph3d - create 3D Graphs with GD and GD::Graph
However, there are other utilities available for identifying various image formats. It's more of a question for Super User, but for various unix distros you can use file to identify many different types of files, and for MacOSX, Graphic Converter has never let me down. (It was even able to open the bizarre multi-file X-ray of my cat's shattered pelvis that I got on a disc from the vet.)
How would you know what the original format was? I'm pretty sure there's no guaranteed way to tell if an image has been modified.
I can just open the file (with my favourite programming language and filesystem API) and just write whatever I want into that file willy-nilly. As long as I don't screw something up with the file format, you'd never know it happened.
Heck, I could print the image out and then scan it back in; how would you tell it from an original?
As other's have stated, there is no way to know if the image was doctored. I'm guessing what you basically want to know is the difference between a realistic photograph and one that has been enhanced or modified.
There's always the option of running some extremely complex image recognition algorithm that would analyze every pixel in your image and do some very complicated stuff to determine if the image was doctored or not. This solution would probably involve AI which would examine millions of photos that are both doctored and those that are not and learn from them. However, this is more of a theoretical solution and isn't very practical... you would probably only see it in movies. It would be extremely complex to develop and probably take years. And even if you did get something like this to work, it probably still wouldn't be 100% correct all the time. I'm guessing AI technology still isn't at that level and could take a while until it is.
A not-commonly-known feature of exiftool allows you to recognize the originating software through an analysis of the JPEG quantization tables (not relying on image metadata). It recognizes tables written by many applications. Note that some cameras may use the same quantization tables as some applications, so this isn't a 100% solution, but it is worth looking into. Here is an example of exiftool run on two images, the first was edited by photoshop.
> exiftool -jpegdigest a.jpg b.jpg
======== a.jpg
JPEG Digest : Adobe Photoshop, Quality 10
======== b.jpg
JPEG Digest : Canon EOS 30D/40D/50D/300D, Normal
2 image files read
This will work even if the metadata has been removed.
There is existing software out there which uses various techniques (compression artifacting, comparison to signature profiles in a database of cameras, etc.) to analyze the actual image data for evidence of alteration. If you have access to such software and the software available to you provides an API for external access to these analysis functions, then there's a decent chance that a Perl module exists which will interface with that API and, if no such module exists, it could probably be created rather quickly.
In theory, it would also be possible to implement the image analysis code directly in native Perl, but I'm not aware of anyone having done so and I expect that you'd be better off writing something that low-level and processor-intensive in a fully-compiled language (e.g., C/C++) rather than in Perl.
http://www.impulseadventure.com/photo/jpeg-snoop.html
is a tool that does the job almost good
If there has been any cloning , there is a variation in the pixel density..or concentration which sometimes shows up.. upon manual inspection
a Photoshop cloned area will have even pixel density(my meaning is variation of Pixels wrt a scanned image)