I want to get a GROUP BY-style result from AWS CloudSearch.
user | expense | status
1 | 1000 | 1
1 | 300 | 1
1 | 700 | 2
1 | 500 | 2
2 | 1000 | 1
2 | 1200 | 3
3 | 200 | 1
3 | 600 | 1
3 | 1000 | 2
Above is my table structure. I want the total sum of expenses for each user. The expected answer is:
{ user: 1, expense_count: 2500 }, { user: 2, expense_count: 2200 }, { user: 3, expense_count: 1800 }
I want to GROUP BY the user column and sum the total expenses of the respective user.
There is no (easy) way to do this in CloudSearch, which is understandable when you consider that your use case is more like a SQL query and is not really what I would consider a search. If what you want to do is look up users by userId and sum their expenses, then a search engine is the wrong tool to use.
CloudSearch isn't meant to be used as a datastore; it should return minimal information (ideally just IDs), which you then use to retrieve data. Here is a blurb about it from the docs:
You should only store document data in the search index by making fields return enabled when it's difficult or costly to retrieve the data using other means. Because it can take some time to apply document updates across the domain, you should retrieve critical data such as pricing information by using the returned document IDs instead of returned from the index.
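For what it's worth, once the expense rows live in (or are copied to) a relational datastore, the aggregation in the question is a plain GROUP BY. A minimal sketch, assuming a table named expenses with the columns shown above:

-- Sums expenses per user, matching the expected answer in the question.
-- "user" is quoted because USER is a reserved word in many SQL dialects.
SELECT "user", SUM(expense) AS expense_count
FROM expenses
GROUP BY "user";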
For reporting I want to group some counts together but still have access to the row data. I feel like I do this a lot but can't find any of the code today. This is in a B2B video delivery platform. A simplified version of my modelling:
Video(name)
Company(name)
View(video, company, when)
I want to show which companies viewed the most videos, but also show which Videos were the most popular.
+-----------+-----------+-------+
| Company | Video | Views |
+-----------+-----------+-------+
| Company E | Video 1 | 1215 |
| | Video 2 | 5 |
| | Video 3 | 1 |
| Company B | Video 2 | 203 |
| | Video 4 | 3 |
+-----------+-----------+-------+
I can do that by annotating view counts (to get the companies in the right order, most views first) and then looping over the companies and performing a View query for each one, but how can I do this in O(1-3) queries rather than O(n)?
So I think I got there with .values(). I always seem to trip over this, and it does mean I'll have to do a couple of additional lookups to pull out the video and company names (sketched after the query below)... But 3 feels healthier than n queries.
from django.db.models import Count, Q

# q_last_month isn't shown here; it is assumed to be a Q filter on the View
# timestamp, e.g. Q(views__when__gte=<first day of last month>).
results = (
    Video.objects
    .values('views__company_id', 'id')  # one row per (company, video) pair
    .annotate(
        view_count=Count('views', filter=q_last_month)  # filter count based on last month
    )
)
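For the additional name lookups mentioned above, two in_bulk() calls keep the whole thing at three queries. A sketch assuming the Video and Company models from the question:

company_ids = {r['views__company_id'] for r in results}
video_ids = {r['id'] for r in results}

companies = Company.objects.in_bulk(company_ids)  # {pk: Company}, one query
videos = Video.objects.in_bulk(video_ids)         # {pk: Video}, one query

rows = [
    (companies[r['views__company_id']].name, videos[r['id']].name, r['view_count'])
    for r in results
]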
I have two tables (subject & category) that are both related to the same parent table (main). Because of the foreign key constraints, it looks like Power BI automatically created the links.
Simple mock-up of table links
I need to count the subjects by type for each possible distance range. I tried a simple calculation shown below for each distance category.
less than 2m =
CALCULATE(
COUNTA('Category'[Descr]),
'Subject'[Distance] IN { "less than 2m" }
)
However, the filter doesn't seem to apply properly.
I want...
+------+--------------+--------------+--+
| Descr| less than 2m | more than 2m | |
+------+--------------+--------------+--+
| Car | 2 | 1 | |
| Sign | 4 | 2 | |
+------+--------------+--------------+--+
but I'm getting...
+------+--------------+--------------+--+
| Descr| less than 2m | more than 2m | |
+------+--------------+--------------+--+
| Car | 3 | 3 | |
| Sign | 6 | 6 | |
+------+--------------+--------------+--+
It's just giving me the total count by type, which is correct, but it isn't applying the distance filter that would let me break it down.
I'm sure this is probably really simple but I'm pretty new with DAX and I can't figure this one out.
I wish I could mark Kosuke's comment as an answer. The issue was indeed with having to enable cross-filtering. This can be done either by clicking the relationship link in your model and changing its cross-filter direction, or by using a function to temporarily enable the cross filter.
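For the function route, DAX's CROSSFILTER can be passed as an extra filter argument to CALCULATE. This is only a sketch: the 'Subject'/'Main' key columns below are assumptions based on the mock-up, so substitute the columns your relationship actually uses.

less than 2m =
CALCULATE(
    COUNTA('Category'[Descr]),
    'Subject'[Distance] IN { "less than 2m" },
    -- hypothetical key columns; use the ones that define the Subject-Main relationship
    CROSSFILTER('Subject'[MainID], 'Main'[MainID], BOTH)
)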
I'd like to use cloudwatch insights to visualize a multiline graph of average latency by host over time. One line for each host.
This stats query extracts the latency and aggregates it in 10 minute buckets by host, but it doesn't generate any visualization.
stats avg(latencyMS) by bin(10m), host
bin(10m) | host | avg(latencyMS)
0m | 1 | 120
0m | 2 | 220
10m | 1 | 130
10m | 2 | 230
The docs call this out as a common mistake but don't offer any alternative.
The following query does not generate a visualization, because it contains more than one grouping field.
stats avg(myfield1) by bin(5m), myfield4
(AWS docs)
Experimentally, CloudWatch will generate a multi-line graph if each record has multiple keys. A query that generates a line graph must return results like this:
bin(10m) | host-1 avg(latencyMS) | host-2 avg(latencyMS)
0m | 120 | 220
10m | 130 | 230
I don't know how to write a query that would output that.
Parse the individual messages for each host, then compute their stats.
For example, to get the average latency of responses from processes with PID=11 and PID=13:
parse #message /\[PID:11\].* duration=(?<pid_11_latency>\S+)/
| parse #message /\[PID:13\].* duration=(?<pid_13_latency>\S+)/
| display #timestamp, pid_11_latency, pid_13_latency
| stats avg(pid_11_latency), avg(pid_13_latency) by bin(10m)
| sort #timestamp desc
| limit 20
The regular expressions extract the duration for processes with IDs 11 and 13 into the parameters pid_11_latency and pid_13_latency respectively, and fill in null where there is no match.
You can build on this example by writing regular expressions that extract the metrics from the messages for the hosts you care about.
Situation: I have about 40 million rows, 3 columns of unorganised data in a table in my SQLite DB (~300MB). An example of my data is as follows:
| filehash | filename | filesize |
|------------|------------|------------|
| hash111 | fileA | 100 |
| hash222 | fileB | 250 |
| hash333 | fileC | 380 |
| hash111 | fileD | 250 | #Hash collision with fileA
| hash444 | fileE | 520 |
| ... | ... | ... |
Problem: A single SELECT statement could take between 3 to 5 seconds. The application I am running needs to be fast. A single query taking 3 to 5 seconds is too long.
#calculates hash
md5hash = hasher(filename)
#I need all 3 columns so that I do not need to parse through the DB a second time
cursor.execute('SELECT * FROM hashtable WHERE filehash = ?', (md5hash,))
returned = cursor.fetchall()
Question: How can I make the SELECT statement run faster (I know this sounds crazy but I am hoping for speeds of below 0.5s)?
Additional information 1: I am running a Python 2.7 program on a Raspberry Pi 3B (1 GB RAM, default 100 MB swap). I am asking mainly because I am afraid that it will crash the RPi because of 'not enough RAM'.
For reference, when reading from the DB normally with my app running, we are looking at a maximum of 55 MB of free RAM, with a few hundred MB of cached data - I am unsure if this is the SQLite cache (swap has not been touched).
Additional information 2: I am open to using other databases to store the table (I was looking at either PyTables or ZODB as a replacement - let's just say that I got a little desperate).
Additional information 3: There are NO unique keys, as the SELECT statement looks for a match in a column that contains just hash values, which apparently have collisions.
Currently, the database has to scan the entire table to find all matches. To speed up searches, use an index:
CREATE INDEX my_little_hash_index ON hashtable(filehash);
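A sketch of doing that from the Python side (the database path and the example hash are placeholders; the table, column and index names come from the question and the statement above):

import sqlite3

conn = sqlite3.connect('files.db')  # hypothetical path
cursor = conn.cursor()

# One-time index creation; IF NOT EXISTS makes it safe to run on every start-up.
cursor.execute('CREATE INDEX IF NOT EXISTS my_little_hash_index ON hashtable(filehash)')
conn.commit()

# The original lookup now does an index search instead of a full table scan.
md5hash = 'hash111'  # example value from the table above
cursor.execute('SELECT * FROM hashtable WHERE filehash = ?', (md5hash,))
returned = cursor.fetchall()

# Optional sanity check: the plan should report SEARCH ... USING INDEX.
cursor.execute('EXPLAIN QUERY PLAN SELECT * FROM hashtable WHERE filehash = ?', (md5hash,))
print(cursor.fetchall())

Building the index over 40 million rows takes a while and grows the database file, but it is a one-time cost; after that, each lookup touches only a handful of pages, which also keeps memory use low on the Pi.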
I am parsing the USDA's food database and storing it in SQLite for query purposes. Each food has associated with it the quantities of the same 162 nutrients. It appears that the list of nutrients (name and units) has not changed in quite a while, and since this is a hobby project I don't expect to follow any sudden changes anyway. But each food does have a unique quantity associated with each nutrient.
So, how does one go about storing this kind of information sanely? My priorities are being multi-programming-language friendly (Python and C++ having preference), sanity for me as the coder, and ease of retrieving nutrient sets to sum or plot over time.
The two things that I had thought of so far were 162 columns (which I'm not particularly fond of, but it does make the queries simpler), or a food table that links to a nutrient_list table that then links to a static table with the nutrient name and units. The second seems more flexible in case my expectations are wrong, but I wouldn't even know where to begin on writing the queries for sums and time series.
Thanks
You should read up a bit on database normalization. Most of the normalization material is quite intuitive, but really going through the definitions of the steps and seeing an example helps you understand the concepts and will help you greatly if you want to design a database in the future.
As for this problem, I would suggest you use 3 tables: one for the foods (let's call it foods), one for the nutrients (nutrients), and one for the specific nutrients of each food (foods_nutrients).
The foods table should have a unique index for referencing and the food's name. If the food has other data associated to it (maybe a link to a picture or a description), this data should also go here. Each separate food will get a row in this table.
The nutrients table should also have a unique index for referencing and the nutrient's name. Each of your 162 nutrients will get a row in this table.
Then you have the crossover table containing the nutrient values for each food. This table has three columns: food_id, nutrient_id and value. Each food gets 162 rows in this table, one for each nutrient.
This way, you can add or delete nutrients and foods as you like and query everything independent of programming language (well, using SQL, but you'll have to use that anyway :) ).
Let's try an example. We have 2 foods in the foods table and 3 nutrients in the nutrients table:
+------------------+
| foods |
+---------+--------+
| food_id | name |
+---------+--------+
| 1 | Banana |
| 2 | Apple |
+---------+--------+
+-------------------------+
| nutrients |
+-------------+-----------+
| nutrient_id | name |
+-------------+-----------+
| 1 | Potassium |
| 2 | Vitamin C |
| 3 | Sugar |
+-------------+-----------+
+-------------------------------+
| foods_nutrients |
+---------+-------------+-------+
| food_id | nutrient_id | value |
+---------+-------------+-------+
| 1 | 1 | 1000 |
| 1 | 2 | 12 |
| 1 | 3 | 1 |
| 2 | 1 | 3 |
| 2 | 2 | 7 |
| 2 | 3 | 98 |
+---------+-------------+-------+
Now, to get the potassium content of a banana, you'd query:
SELECT foods_nutrients.value
FROM foods_nutrients, foods, nutrients
WHERE foods_nutrients.food_id = foods.food_id
AND foods_nutrients.nutrient_id = nutrients.nutrient_id
AND foods.name = 'Banana'
AND nutrients.name = 'Potassium';
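The sums the question mentions follow the same join pattern. For example, a sketch (using the sample tables above) of totalling one nutrient over a set of foods:

-- Total sugar across a set of foods, e.g. a meal of a banana and an apple.
SELECT SUM(fn.value) AS total_sugar
FROM foods_nutrients AS fn
JOIN foods AS f ON fn.food_id = f.food_id
JOIN nutrients AS n ON fn.nutrient_id = n.nutrient_id
WHERE n.name = 'Sugar'
AND f.name IN ('Banana', 'Apple');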
Use the second (more normalized) approach.
You could even get away with fewer tables than you mentioned:
tblNutrients
-- NutrientID
-- NutrientName
-- NutrientUOM (unit of measure)
-- Otherstuff
tblFood
-- FoodId
-- FoodName
-- Otherstuff
tblFoodNutrients
-- FoodID (FK)
-- NutrientID (FK)
-- UOMCount
It will be a nightmare to maintain a table with 160+ columns.
If there is a time element involved too (can measurements change?), then you could add a date field to the nutrient and/or the food-nutrient table, depending on what could change.
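A minimal sketch of that schema as SQLite DDL (the column types and constraints are assumptions, and the "Otherstuff" columns are omitted; only the table and column names come from the answer):

CREATE TABLE tblNutrients (
    NutrientID   INTEGER PRIMARY KEY,
    NutrientName TEXT NOT NULL,
    NutrientUOM  TEXT NOT NULL               -- unit of measure
);

CREATE TABLE tblFood (
    FoodID   INTEGER PRIMARY KEY,
    FoodName TEXT NOT NULL
);

CREATE TABLE tblFoodNutrients (
    FoodID     INTEGER NOT NULL REFERENCES tblFood(FoodID),
    NutrientID INTEGER NOT NULL REFERENCES tblNutrients(NutrientID),
    UOMCount   REAL NOT NULL,                -- quantity of the nutrient in this food
    PRIMARY KEY (FoodID, NutrientID)
);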