Django ORM: merging data

I came across a problem that I could not find an elegant way to solve.
We have an application that monitors audio input and tries to assign matches based on acoustic fingerprints.
The application gets a sample every few seconds, does a lookup and stores the timestamped result in the database.
The fingerprinting is not always accurate, so it happens that "wrong" items get assigned. The data looks something like this:
timestamp   foreign_id   comment
--------------------------------------------------
12:00:00 17
12:00:10 17
12:00:20 17
12:00:30 17
12:00:40 723 wrong match
12:00:50 17
12:01:00 17
12:01:10 17
12:01:20 None no match
12:01:30 17
12:01:40 18
12:01:50 18
12:02:00 18
12:02:10 18
12:02:20 18
12:02:30 992 wrong match
12:02:40 18
12:02:50 18
So I'm looking for a way to "clean up" the data periodically.
Can anyone suggest a nice way to achieve this? In the given example, the entry with the foreign id 723 should be corrected to 17, and so on. And, if possible, with a configurable threshold for how many entries before and after should be taken into account.
Not sure if my question is clear enough this way, but any input is welcome!
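For what it's worth, here is a rough sketch of the kind of periodic clean-up I have in mind: a simple majority vote over a configurable number of neighbouring samples. It works on plain (timestamp, foreign_id) tuples, so the names are only placeholders; in practice I would feed it something like Sample.objects.order_by('timestamp').values_list('timestamp', 'foreign_id') and write the corrections back with a bulk update.

from collections import Counter

def smooth_matches(rows, window=3):
    """Majority-vote smoothing over ordered (timestamp, foreign_id) rows.

    A row whose foreign_id disagrees with the clear majority of its
    `window` neighbours on each side is rewritten to that majority value.
    Returns (timestamp, corrected_foreign_id, changed) tuples.
    """
    ids = [fid for _, fid in rows]
    cleaned = []
    for i, (ts, fid) in enumerate(rows):
        neighbours = ids[max(0, i - window):i] + ids[i + 1:i + 1 + window]
        votes = Counter(n for n in neighbours if n is not None)
        if votes:
            majority, count = votes.most_common(1)[0]
            # Only overwrite when more than half of the neighbours agree.
            if fid != majority and count > len(neighbours) // 2:
                cleaned.append((ts, majority, True))
                continue
        cleaned.append((ts, fid, False))
    return cleaned

if __name__ == "__main__":
    rows = [("12:00:30", 17), ("12:00:40", 723), ("12:00:50", 17),
            ("12:01:00", 17), ("12:01:10", 17), ("12:01:20", None),
            ("12:01:30", 17), ("12:01:40", 18)]
    for ts, fid, changed in smooth_matches(rows, window=2):
        print(ts, fid, "(corrected)" if changed else "")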

Could you check how many times a foreign id appears in the database, and then check whether those occurrences are close together in time?
Why not just disregard the 'bad' data when using the data?

Related

Conditional data transformation in Power BI

My question is about transforming data in Power BI.
I have a text file with spaces as separators. Some rows (where the day of the month is less than 10) contain a double space before one field; it is always the third field.
Tue May 4 13:57:50 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=12 ttl=119 time=9.22 ms
Tue May 4 13:58:05 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=13 ttl=119 time=10.2 ms
Tue May 4 13:58:20 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=14 ttl=119 time=8.77 ms
Tue May 4 13:58:35 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=15 ttl=119 time=9.69 ms
Tue May 4 13:58:50 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=16 ttl=119 time=9.22 ms
So I split this file by spaces, and some rows end up split into 15 columns and some into 16.
I then do a lot of transformations with this file, so I need to be able to make the transformation conditional. I couldn't find a solution myself, so I would appreciate any advice.
I found a couple of solutions.
In short, the first one uses the standard Split Column function in a few steps:
Split the first 3 fields by position
Trim the spaces from the resulting columns
Split the rest by space
The second way to do something similar is to use Python (a sketch of such a script follows these steps). To insert a Python script step into your current transformation you should:
Go to Transform data -> Advanced editor and copy your current steps, because after inserting the Python step you have to do a Navigation step, which replaces all existing steps.
Find the step where you want to insert the new Python transformation and click Transform -> Run Python script.
Write the code and save the result.
Click on the dataset name and agree to replacing all steps.
Copy your previous steps and paste them back into the Advanced editor.
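For reference, the Python step itself can stay small. Below is a rough sketch of the kind of script I mean. Inside Power BI the input arrives as a pandas DataFrame named dataset; here I build a small demo frame instead (the column name Column1 is just a placeholder) so the snippet can be run standalone. The idea is simply to collapse repeated spaces before splitting, so every row yields the same number of columns.

import re
import pandas as pd

def split_ping_line(line):
    # Collapse runs of spaces so single-digit days ("May  4") do not
    # produce an extra empty column, then split on single spaces.
    return re.sub(r" +", " ", line.strip()).split(" ")

# Standalone demo; inside Power BI's "Run Python script" step the input
# would already be available as a DataFrame named `dataset`.
raw = pd.DataFrame({"Column1": [
    "Tue May  4 13:57:50 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=12 ttl=119 time=9.22 ms",
    "Tue May 14 13:58:05 BST 2021: 64 bytes from 8.8.8.8: icmp_seq=13 ttl=119 time=10.2 ms",
]})

parts = raw["Column1"].apply(split_ping_line)
result = pd.DataFrame(parts.tolist())  # every row now has the same number of columns
print(result)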
I described the solutions here
https://koftaylov.blogspot.com/2021/05/another-day-building-power-bi-dashboard.html

AWS Cloudwatch event rule: Start and stop time

Is it possible to set start and stop time of cloudwatch event rules?
Use case: I want to create a rule which triggers a Lambda, but I want it to start running at a specific date (every 2 minutes) and be disabled at another date (the date interval can span several months).
As far as I know, when we create a rule with rate(2 minutes), it starts running immediately. I could use this approach and, in the Lambda, check whether the current date matches the target date before proceeding with execution, and disable the rule once the current date is past the end date. Although that might work, it does not seem like the right approach, since the Lambda would be executing unnecessarily until the target date.
Is there any way I can achieve the same thing without this hack?
Yes, you can restrict it to specific dates. For instance, the cron expression 0/2 * 28 9 ? 2020 would execute every 2 minutes on 28 Sep 2020 only.
Update
To span months, I think you need separate rules. For example, you could define two rules to cover the date range 28 Sep to 5 Oct: 0/2 * 28-30 9 ? 2020 and 0/2 * 1-5 10 ? 2020.
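If you prefer to create the rules programmatically rather than in the console, roughly the same thing can be done with boto3's CloudWatch Events / EventBridge client. This is only a sketch: the rule names, the Lambda ARN and the exact cron windows are placeholders you would adapt.

import boto3

events = boto3.client("events")

# One rule per calendar month of the window; the cron fields are
# minutes, hours, day-of-month, month, day-of-week, year.
schedules = {
    "my-job-sep": "cron(0/2 * 28-30 9 ? 2020)",  # 28-30 Sep 2020
    "my-job-oct": "cron(0/2 * 1-5 10 ? 2020)",   # 1-5 Oct 2020
}

lambda_arn = "arn:aws:lambda:eu-west-1:123456789012:function:my-function"  # placeholder

for name, expression in schedules.items():
    events.put_rule(Name=name, ScheduleExpression=expression, State="ENABLED")
    events.put_targets(Rule=name, Targets=[{"Id": "target-1", "Arn": lambda_arn}])

# Note: the Lambda also needs a resource-based permission allowing
# events.amazonaws.com to invoke it (lambda add-permission), omitted here.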

Repeated events in calendar

First of all, please excuse my imperfect English; I am French.
I'm coming to ask for some advice on a database design problem.
I have to design a calendar with events. Briefly, an event includes a start date/time, an end date/time, and a description.
The problem is that I have to handle repetitions: when creating an event, it is possible to indicate that it starts next week and repeats until a specific date, or indefinitely.
I see two possible designs:
create an events table with id, start_datetime, end_datetime and description fields.
When adding a new event, we generate as many rows as there are repeated occurrences.
Advantages: we can do a SELECT * to retrieve all the events, without any particular algorithm. In addition, it is possible to modify the description of each occurrence of an event, since they are all stored as separate rows.
Disadvantage (MAJOR!): if we do not set an end date, i.e. the repetition is infinite, we cannot store an infinite number of rows...
take inspiration from the method described in this thread, that is to say two tables:
events table
id description
1 Single event on 2018-11-23 08:00-09:30
2 Repeated event:
* every Monday from 10:00 to 12:00, starting Monday 2018-11-26
* every Wednesday from 14:00 to 14:45, starting 2018-11-28, until 2019-02-27
event_repetitions table
id event_id start_datetime end_datetime interval end_date
1 1 2018-11-23 08:00:00 2018-11-23 09:30:00 NULL NULL
2 2 2018-11-26 10:00:00 2018-11-26 12:00:00 604800 NULL
3 2 2018-11-28 14:00:00 2018-11-28 14:45:00 604800 2019-02-27
Note: interval is the number of seconds between two occurrences; 604800 = 7 (days) * 24 (hours) * 3600 (seconds).
Advantage: in the case of infinite repetitions (as for the event with id 2), we have very few rows to write and performance is better.
Disadvantages: if we want to modify the description of the event (or other fields) for one specific occurrence and not the others, we cannot do so without creating a third table, for example event_descriptions:
id event_id user_id datetime description
1 2 1 2018-11-26 10:00:00 Comment from 2018-11-26
2 2 2 2018-12-03 10:00:00 Comment of the second occurrence, i.e. from 2018-12-03
Note: user_id is the logged-in user who wrote the comment.
Another disadvantage is that to get the list of events for a given day, week or month, the selection query will be more complex and will use joins. The event_descriptions table may also become very big when there are hundreds of thousands of events.
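For illustration, this is roughly the expansion logic that the application (or a more involved SQL query) would have to implement with the second design: take an event_repetitions row and enumerate its occurrences inside a requested window. The sketch below is plain Python, independent of MySQL/Doctrine, just to make the idea concrete; the example values mirror row 3 of the table above.

from datetime import datetime, timedelta

def expand_occurrences(start, end, interval_s, end_date, window_start, window_end):
    # Yield (occurrence_start, occurrence_end) pairs of one repetition row
    # that fall inside [window_start, window_end].
    duration = end - start
    step = timedelta(seconds=interval_s) if interval_s else None
    current = start
    while current <= window_end:
        if end_date and current.date() > end_date.date():
            break
        if current >= window_start:
            yield current, current + duration
        if step is None:  # single (non-repeated) event
            break
        current += step

if __name__ == "__main__":
    # Row id 3: every Wednesday 14:00-14:45 from 2018-11-28 until 2019-02-27,
    # listed for the window December 2018.
    for occurrence in expand_occurrences(
        datetime(2018, 11, 28, 14, 0), datetime(2018, 11, 28, 14, 45),
        604800, datetime(2019, 2, 27),
        datetime(2018, 12, 1), datetime(2018, 12, 31, 23, 59),
    ):
        print(occurrence)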
My question is: what would you recommend as a more effective alternative? Maybe the second solution is good enough? What do you think?
In terms of technologies, I intend to go with MySQL, the DBMS I know best. Nevertheless, if you think that, for example, MongoDB would be better for very large numbers of rows, do not hesitate to say so.
For information, my application is an API developed with API Platform, so Symfony 4 with Doctrine ORM.
Thank you in advance for your answers.
Allow me to give this a small bump, hoping for other answers.

Error with filemanager in controller

I want to show 40 images in the image upload dialog.
I am also trying to change the pagination in controller/filemanager.php from 16 to 36, and when I run it I get the error
unexpected Syntex class commonController in line 1
What may be the problem?
Please modify filemanager.php under admin\controller: replace 16 with 40 and 14 with 38 (40 - 2).
You can increase the number of images per page as much as you like; I have gone up to 144 and it works like a charm.
Do not forget to refresh your Modifications after every change, otherwise you will not see the changes.
Raj

Loading first few observations of data set without reading entire data set (Stata 13.1)?

(Stata/MP 13.1)
I am working with a set of massive data sets that take an extremely long time to load. I am currently looping through all the data sets to load each of them.
Is it possible to just tell Stata to load the first 5 observations of each data set (or, in general, the first n observations in each use command) without actually having to read the entire data set? If I load the entire data set and then just keep the first 5 observations, the process takes an extremely long time.
Here are two work-arounds I have already tried:
use in 1/5 using mydata: I think this is more efficient than loading the data and then keeping the observations you want in a separate line, but I think it still reads in the entire data set.
First load all the data sets, then save copies containing just the first 5 observations, and then use only the copies: this is cumbersome because I have a lot of different files; I would much prefer a direct way to read in the first 5 observations without resorting to this and without reading the entire data set.
I'd say using in is the natural way to do this in Stata, but testing shows you are correct: it really makes no "big" difference, given the size of the data set. An example (with 148,000,000 observations):
sysuse auto, clear
expand 2000000               // blow the 74-observation auto data up to 148,000,000 rows
tempfile bigfile
save "`bigfile'", replace
clear
timer on 1
use "`bigfile'"              // full read
timer off 1
clear
timer on 2
use "`bigfile'" in 1/5       // read only the first 5 observations
timer off 2
timer list
timer clear
Resulting in
. timer list
1: 6.44 / 1 = 6.4400
2: 4.85 / 1 = 4.8480
I find that surprising, since in seems really efficient in other contexts.
I would contact Stata tech support (and/or search around, including www.statalist.com) just to ask why in isn't much faster
(independently of you finding some other strategy to handle this problem).
It's worth using, of course, but not fast enough for many applications.
In terms of workflow, your second option might be the best: leave the computer running while the smaller data sets are created (using a loop), and get back to your regular coding/debugging once that's finished. This really depends on what you're doing, so it may or may not work for you.
Actually, I found the solution. If you run
use mybigdata if runiform() <= 0.0001
Stata will take a random sample of 0.0001 of the data set without reading the entire data set.
Thanks!
Vincent
Edit: 4/28/2015 (1:58 PM EST)
My apologies. It turns out the above was actually not a solution to the original question. It seems that on my system there was high variability in the speed of
use mybigdata if runiform() <= 0.0001
each time I ran it. When I posted that it was a solution, the run I happened to time was just a faster instance. However, as I now repeatedly run
use mybigdata if runiform() <= 0.0001
vs.
use in 1/5 using mydata
I am actually finding that
use in 1/5 using mydata
is on average faster.
In general, my question is simply how to read in a portion of a Stata data set without having to read in the entire data set for computational purposes especially when the data set is really large.
Edit: 4/28/2015 (2:50 PM EST)
In total, I have 20 data sets, each with between 5 and 15 million observations. I only need to keep 8 of the variables (there are 58-65 variables in each data set). Below is the output from the first four describe, short statements.
2004 action1
Contains data from 2004action1.dta
obs: 15,039,576
vars: 64 30 Oct 2014 17:09
size: 2,827,440,288
Sorted by:
2004 action2578
Contains data from 2004action2578.dta
obs: 13,449,087
vars: 59 30 Oct 2014 17:16
size: 2,098,057,572
Sorted by:
2005 action1
Contains data from 2005action1.dta
obs: 15,638,296
vars: 65 30 Oct 2014 16:47
size: 3,143,297,496
Sorted by:
2005 action2578
Contains data from 2005action2578.dta
obs: 14,951,428
vars: 59 30 Oct 2014 17:03
size: 2,362,325,624
Sorted by:
Thanks!
Vincent