How to detect when a person holds the bottom part of a rope in a video, using computer vision methods? - computer-vision

There is a video in which a person holds a rope at a specific moment.
What I would like to do is automatically determine when this person is actually holding the bottom of the rope.
Can we do this with computer vision models?

Yes, it is possible.
You should check out the YOLO algorithm.
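A minimal sketch of that idea, assuming the ultralytics YOLO package is available and the bottom section of the rope occupies a roughly fixed region of the frame; the video file name and the ROI coordinates below are placeholders you would adjust for your own footage:

    # Sketch: flag frames where a detected person overlaps the rope's bottom region.
    # Assumes: pip install ultralytics opencv-python; ROI coordinates are placeholders.
    import cv2
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")              # pretrained COCO model; class 0 = "person"
    ROPE_BOTTOM_ROI = (400, 600, 520, 720)  # (x1, y1, x2, y2) -- tune to your video

    def overlaps(box, roi):
        x1, y1, x2, y2 = box
        rx1, ry1, rx2, ry2 = roi
        return x1 < rx2 and x2 > rx1 and y1 < ry2 and y2 > ry1

    cap = cv2.VideoCapture("climb.mp4")     # placeholder file name
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        results = model(frame, verbose=False)[0]
        for box, cls in zip(results.boxes.xyxy.tolist(), results.boxes.cls.tolist()):
            if int(cls) == 0 and overlaps(box, ROPE_BOTTOM_ROI):
                print(f"frame {frame_idx}: person overlaps the bottom-rope region")
        frame_idx += 1
    cap.release()

If the rope moves around in the frame, you would need to detect the rope itself as well (e.g. by training a custom class), but the overlap test stays the same.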

Related

Can I / How to implement object recognition in app?

Our aim is to develop an automotive app which automates and standardizes photo shooting of newly arrived cars at a dealership. Basically, our Mavic 2 Pro takes off, orbits the vehicle, shoots a photo every 90 degrees and then lands at its original position. The orbiting radius is approximately 4.5 m.
Since the shooting scene is small (or, when shooting indoors, GPS might not be available), we would like to rely more on the built-in object recognition as implemented in ActiveTrack missions. We currently have an app based on a waypoint mission, but it is not accurate. So my questions are:
1. Can anyone point us in the direction of how to implement object recognition in our app?
2. If recognition is not available, how can we ensure consistency of the output? While testing, the EXIF info of the output photos sometimes shows a compass deviation of up to 4 degrees, which results in the object being out of view.
Thanks for advice,
Mirek
What you need is object recognition and tracking, e.g. YOLO.
The resources are here:
https://pjreddie.com/darknet/yolo/
Then you need to train the model to recognize cars, so you need a set of labeled photos where the car's bounding box has already been annotated correctly.
Downstream, you then want to have the camera connect to a computer and stream the video, have the tracking algorithm recognize the bounding box of the car and, depending on its center and size, maneuver the drone accordingly.
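A rough sketch of that last step, assuming a pretrained COCO detector is good enough for "car" and using hypothetical placeholder functions (send_yaw_correction, send_range_correction) for whatever drone-SDK calls you actually have; the mappings to degrees and metres are made-up tuning values, not a tested controller:

    # Sketch: turn a detected car bounding box into yaw/range corrections for orbiting.
    import cv2
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")       # COCO class 2 = "car"

    def send_yaw_correction(deg):    # placeholder for the real SDK call
        print("yaw correction:", deg)

    def send_range_correction(m):    # placeholder for the real SDK call
        print("range correction:", m)

    TARGET_BOX_FRACTION = 0.35       # desired car width as a fraction of frame width (tuning value)

    cap = cv2.VideoCapture(0)        # or the drone's video stream
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        results = model(frame, verbose=False)[0]
        cars = [b for b, c in zip(results.boxes.xyxy.tolist(), results.boxes.cls.tolist())
                if int(c) == 2]
        if not cars:
            continue
        x1, y1, x2, y2 = max(cars, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))  # biggest car
        cx = (x1 + x2) / 2
        # Horizontal offset of the box center -> yaw to re-center the car in the frame.
        send_yaw_correction((cx - w / 2) / w * 60)                   # rough mapping to degrees
        # Box width relative to the frame -> move in/out to keep the orbit radius constant.
        send_range_correction(((x2 - x1) / w - TARGET_BOX_FRACTION) * 5)  # rough mapping to metres
    cap.release()

This only addresses keeping the car centered and at a constant apparent size; the actual flight control and safety logic is a separate problem.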

How do ARCore or ARKit produce real-time augmentations of live video?

So a while back about a year ago I was interested in building my own barebones augmented reality (AR) library. My goal was to be able to take a video of something (anything really) and then be able to place augmentations (3D objects that weren't really there) in the video. So for example I might take a video of my living room and then, through this AR library/tool, I'd be able to add in maybe a 3D avatar of a monster sitting on top of my coffee table. So, knowing absolutely nothing about the subject or computer vision in general, I had settled for the following strategy:
Use 3D reconstruction tools/techniques (Structure from Motion, or SfM) to build up a 3D model of everything in the video (e.g. a 3D model of my living room)
Analyze that 3D model (really a 3D pointcloud to be exact) for flat surfaces
Add my own logic to determine what objects (3D models such as Blender files, etc.) to place in what area of the video's 3D model (e.g. monster standing on top of the coffee table)
The hardest part: inferring the camera orientation in each frame of the video, and then figuring out how to orient the augmentation (e.g. monster) correctly based on what the camera is pointed at, and then "merging" the augmentation's 3D model into the main video 3D model. This means that as the camera moves around my living room, the monster appears to remain standing in the same place on my coffee table. I never figured out a good solution for this but figured if I could get to this 4th step that I'd find some solution.
After several difficult weeks (computer vision is hard!) I got the following pipeline of tools to work with mixed success:
I was able to feed sample frames of a video (e.g. a video taken while walking around my living room) into OpenMVG and produce a sparse pointcloud PLY file/model of it
Then I was able to feed that PLY file into MVE and produce a dense pointcloud (again PLY file) of it
Then I fed the dense pointcloud and the original frames into mvs-texturing to produce a textured 3D model of my video
About 30% of the time, this pipeline worked amazingly! Here's the model of the front of my house. You can see my 3D front yard, my son's 3D playhouse and even kinda/sorta make out windows and doors!
About 70% of the time the pipeline failed with indecipherable errors, or produced something that looked like an abstract painting. Additionally, even with automated scripting involved, it took the tooling about 30 mins to produce the final 3D textured model...so pretty slow.
Well, it looks like Google ARCore and Apple ARKit beat me to the punch! These frameworks can take live video feeds from your smartphone and accomplish exactly what I had been trying to accomplish about a year ago: real-time 3D AR. Very, very similar to (but much more advanced and interactive than) Pokemon Go. Take a video of your living room, and voila, an animated monster is sitting on your coffee table, and you can interact with it. Very, very, very cool stuff.
My question
I'm jealous! Of course, Google and Apple can hire some best-in-show CV/3D recon folks, but I'm still jealous!!! I'm curious if there are any hardcore AR/CV/3D recon gurus out there that either have insider knowledge or just know the AR landscape so well that they can speak to what kind of tooling/pipeline/technology is going on behind the scenes here with ARCore or ARKit. Because I practically broke my brain trying to figure this out on my own, and I failed spectacularly.
Was my strategy (explained above) ballpark-accurate, or way off base? (Again: 3D recon of video -> surface analysis -> frame-by-frame camera analysis, model merge)?
What kind of tooling/libraries/techniques are at play here?
How do they accomplish this in real-time whereas, if my 3D recon even worked, it took 30+ mins to be processed & generated?
Thanks in advance!
I understand your jealousy; as a computer vision engineer I have experienced it many times before :-).
The key for AR on mobile devices is the fusion of computer vision and inertial tracking (the phone's gyroscope).
Quote from Apple's ARKit documentation:
ARKit uses a technique called visual-inertial odometry. This process combines information from the iOS device’s motion sensing hardware with computer vision analysis of the scene visible to the device’s camera.
Quote from Google's ARCore documentation:
The visual information is combined with inertial measurements from the device's IMU to estimate the pose (position and orientation) of the camera relative to the world over time.
The problem with this approach is that you have to know every single detail about your camera and IMU sensor. They have to be calibrated and synced together. No wonder it is easier for Apple than for the common developer. And this is also the reason why Google only supports a couple of phones for the ARCore preview.
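Apple and Google do not publish their pipelines, so the snippet below is only a toy illustration of the fusion idea, not their actual method: a complementary filter that blends a fast-but-drifting IMU-integrated yaw with a slower, noise-corrupted visual yaw estimate. All numbers are synthetic.

    # Toy illustration of visual-inertial fusion (NOT ARKit/ARCore's real algorithm):
    # a complementary filter blending IMU-integrated yaw with a visual yaw estimate.
    import numpy as np

    def fuse_yaw(gyro_rates, visual_yaws, dt=0.01, alpha=0.98):
        """gyro_rates: yaw rate per IMU sample (rad/s); visual_yaws: absolute yaw per
        sample from the vision side (rad, noisy, or None when unavailable).
        alpha weights the IMU prediction against the visual correction."""
        yaw = 0.0
        fused = []
        for rate, vis in zip(gyro_rates, visual_yaws):
            yaw += rate * dt                     # fast IMU prediction (drifts over time)
            if vis is not None:                  # slow visual correction (no drift)
                yaw = alpha * yaw + (1 - alpha) * vis
            fused.append(yaw)
        return np.array(fused)

    # Tiny synthetic example: constant 0.1 rad/s turn, vision available every 10th sample.
    n = 500
    rates = np.full(n, 0.1) + np.random.normal(0, 0.02, n)    # noisy gyro
    truth = 0.1 * 0.01 * np.arange(1, n + 1)
    vision = [truth[i] + np.random.normal(0, 0.01) if i % 10 == 0 else None for i in range(n)]
    print("final fused yaw:", fuse_yaw(rates, vision)[-1], "truth:", truth[-1])

Real visual-inertial odometry fuses full 6-DoF pose with a proper estimator (e.g. an extended Kalman filter or bundle adjustment), which is exactly where the careful camera/IMU calibration and time synchronization mentioned above become critical.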

Using Bullet for collision detection

I'm currently working on a simple project to detect the collision between two specific objects in a surgery scene. The problem is that I don't have a background in such problems, so I'm really a newbie to these things and don't yet know what to do. After a little research, I found the Bullet library, which can be used as a collision detection tool, but I'm not yet sure whether it suits my case. I already checked some examples where the developer created the objects of interest manually, which led me to think that I should first detect the objects of interest and then launch the collision detection process.
In my case, I have two types of data:
A video of the operating room
A point cloud representing the room in 3D
I need to detect the collision between two objects in the scene. Is there any way to use Bullet to achieve this? Is it common to use a video as input for a collision detection problem (I'm wondering because I couldn't find many resources on it)?
I'm just starting, so this might be a fuzzy question; sorry in advance for any inconvenience.
EDITED:
I already checked it, but my point was to understand what options can be used before digging into the details. For me, a collision detection problem has two parts: the objects of interest (the two or more objects whose collision we are trying to detect) and the scene in which we are trying to detect that collision. For the scene, the data I have comes in the two forms mentioned above. So I was asking which type of data should be used as input for the Bullet collision process. Should it be an image taken from the video, or a list of 3D points? Or something else?
I used Bullet half a year ago. I remember that you need to register objects with Bullet using a collision shape. In the simplest case, your points could probably be represented as small spheres. In the case of your video, you need a 3D representation; I don't understand 100% what you mean by detecting collisions in a "video". However, to use Bullet, you need a collision shape associated with each object.
Further, you register a collision callback. This is a function called for each detected collision. All callbacks are listed here: http://www.bulletphysics.org/mediawiki-1.5.8/index.php?title=Collision_Callbacks_and_Triggers
As the wiki says - and I implemented it this way - to detect a specific collision, you need to iterate over all resulting manifolds from Bullet manually. This is a slightly painful and performance-wise strange approach. So you cannot register a specific callback for one specific object colliding with another specific object!
Once the objects are registered, you run the algorithm and then you can check all manifolds in the callback.
To get started with Bullet, I used Bullet Physics Simplest Collision Example with the answers at that time.
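The answer above refers to the C++ API (manifolds, collision callbacks). If you just want to see the overall flow, here is a rough sketch using the pybullet Python bindings instead, with the two objects of interest reduced to placeholder sphere shapes; in practice their positions would come from wherever you localise the objects (e.g. your point cloud of the operating room):

    # Rough sketch with the pybullet bindings (not the C++ API described above):
    # register two collision shapes, step the world, query contacts between them.
    import pybullet as p

    p.connect(p.DIRECT)                   # headless physics/collision world

    def make_sphere_body(position, radius=0.05):
        shape = p.createCollisionShape(p.GEOM_SPHERE, radius=radius)
        # small dynamic body; we only use it for collision queries
        return p.createMultiBody(baseMass=1, baseCollisionShapeIndex=shape,
                                 basePosition=position)

    # Placeholder positions -- replace with the localised positions of your two objects.
    tool_a = make_sphere_body([0.0, 0.0, 0.0])
    tool_b = make_sphere_body([0.04, 0.0, 0.0])   # overlapping on purpose

    p.stepSimulation()                    # runs collision detection and fills contact data
    contacts = p.getContactPoints(bodyA=tool_a, bodyB=tool_b)
    print("collision detected" if contacts else "no collision")
    p.disconnect()

The same idea carries over to C++: register shapes, run the collision/step phase, then inspect the resulting contacts for the pair of bodies you care about.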

OpenCV object tracking and counting objects which pass an ROI in a video frame

I am working on an OpenCV application that needs to count any object whose motion can be detected by the camera. The camera is static, and I did the object tracking with OpenCV and cvBlob by referring to many tutorials.
I found a similar question:
Object counting
And I found this was similar:
http://labs.globant.com/uncategorized/peopletracker-people-and-object-tracking/
I am new to OpenCV and have gone through the OpenCV documentation, but I couldn't find anything related to counting moving objects in a video.
Can anyone please give me an idea of how to do this, especially the counting part? As I read in the article above, they count people who cross a virtual line. Is there a special algorithm to detect an object crossing the line?
Your question might be too broad if you are asking about general techniques for counting moving objects in video sequences. Here are some hints that might help you:
As usual in computer vision, there is no single specific way to solve your problem. Try to do some research on people detection, background extraction and motion detection to get a wider point of view.
State the user requirements of your system more clearly, namely: how many people can occur in the image frame? Things get complicated when you would like to track more than one person. Furthermore, can other moving objects appear in the image (e.g. animals)? If not, and only one person is supposed to be tracked, the answer to your problem is pretty easy; see the explanation below. If yes, you will have to do more research.
Usually you cannot find a direct solution to a computer vision problem in the OpenCV API; that is, there is no method that directly solves the problem of people counting. But there surely exist papers and references (usually scientific work) which can be adapted to solve your problem. So there is no method that "counts people crossing a vertical line"; you have to solve the problem by merging several algorithms together.
In the link you provided, one can see that they use an algorithm for background extraction which determines what is the non-moving background and what is the moving foreground (in our case, a walking person). We are not sure whether they use something more (or more sophisticated), but background extraction is a sufficient starting point for solving the problem.
And here is my contribution to the solution. Assuming only one person walks in front of the stably placed camera and no other object motion can be observed, do the following:
Save a frame when no person is moving in front of the camera; it will be used later as the background reference.
In a loop, apply a background detector to extract the parts of the image representing motion (MOG, or you can even just calculate the difference between the background and the current frame, followed by a binary threshold and blob counting; see my answer here).
From the assumption, only one blob should be detected (if not, use some metric that chooses "the best one", for example the one with the maximum area). That blob is the person we would like to track. Knowing its position in the image, compare it to the position of the "vertical line". Objects moving from left to right are exiting, and those moving from right to left are entering.
Remember that this solution will only work under the assumptions we stated.
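A rough sketch of that recipe with OpenCV's built-in MOG2 background subtractor; the video file name, line position and area threshold are placeholders to tune for your scene, and the single-person assumption from above still applies:

    # Sketch: background subtraction -> largest blob -> count crossings of a vertical line.
    import cv2

    cap = cv2.VideoCapture("corridor.mp4")        # placeholder video
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
    LINE_X = 320                                  # x position of the virtual vertical line
    MIN_AREA = 1500                               # ignore small blobs (noise)

    prev_cx = None
    entering = exiting = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)[1]
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        blobs = [c for c in contours if cv2.contourArea(c) > MIN_AREA]
        if blobs:
            biggest = max(blobs, key=cv2.contourArea)   # assumption: one person -> one blob
            x, y, w, h = cv2.boundingRect(biggest)
            cx = x + w // 2
            if prev_cx is not None:
                if prev_cx < LINE_X <= cx:              # crossed left -> right
                    exiting += 1
                elif prev_cx > LINE_X >= cx:            # crossed right -> left
                    entering += 1
            prev_cx = cx
    cap.release()
    print("entering:", entering, "exiting:", exiting)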

OpenTLD, how is it different from other object detection methods?

For those that have heard of OpenTLD, how does it alternate between tracking different objects? It can only track one object at a time, but if I had two or more objects trained in the same video feed, how does OpenTLD decide what to track? In all the sample videos, the user manually drew a bounding box around the object to be tracked, and afterwards it was tracked automatically.
Is this considered an object tracker only? And not an object recognition system? I'm slightly confused about this.
For my applications, I'm fine with tracking/detecting one object at a time, but only if I have the option of switching over to track another object.
For example, in a Haar-like feature setup:
1. I have a cup and a book trained using several positives and negatives
2. Starting up my Haar recognition software, the software picks up both the cup and the book and highlights them with the correct labels
Ideally what I think/hope/wish that OpenTLD does is:
1. Using the compiled exe, draw a bounding box around the cup in the video, track and learn
2. Next, draw a bounding box around the book in the video, track and learn
3. In a video feed with both the book and the cup, I tell the program to report all the objects it can detect in the live video feed.
4. The program tells me that it detects the cup and the book and gives me the option to track one of them
Is this feasible?