Sort Redis hash maps based on a string field

I'm trying to implement query functionality for my AWS Redis cluster. I have stored all my data as hash maps and also created a SortedSet for each of the indexed fields.
Whenever a query is received, we query the SortedSets to find matching ids. A query may involve multiple indexes as well, which get merged based on AND/OR conditions. Once we have the final set of ids, we need to sort the data based on some fields, so essentially I'm fetching the list of hash maps that match the ids. The hash maps look like this:
HSET employees::1 name Arivu salary 100000 age 30
HSET employees::2 name Uma salary 300000 age 31
HSET employees::3 name Jane salary 100000 age 25
HSET employees::4 name Zakir salary 150000 age 28
Now I'm adding all the hash keys to a set so that I can use the SORT command:
SADD collection employees::1 employees::2 employees::3 employees::4
Now when I try to sort based on a string field, the sort doesn't seem to work:
127.0.0.1:6379> SORT collection by name
1) "employees::2"
2) "employees::4"
3) "employees::3"
4) "employees::1"
127.0.0.1:6379> SORT collection by name desc
1) "employees::2"
2) "employees::4"
3) "employees::3"
4) "employees::1"
I assume this is because the hash maps are stored as byte data, but is there any way I can sort these alphabetically?
I have also tried the ALPHA parameter which the SORT command provides, but it doesn't seem to work either:
SORT collection by name desc ALPHA

Your usage seems to be incorrect: when the BY pattern contains no asterisk, Redis treats it as a constant and skips the sort entirely, which is why DESC and ALPHA had no visible effect.
Set up your hashes like this (as you are already doing):
HSET employees::1 name Arivu salary 100000 age 30
HSET employees::2 name Uma salary 300000 age 31
HSET employees::3 name Jane salary 100000 age 25
HSET employees::4 name Zakir salary 150000 age 28
Store your ids in the set like this:
SADD collection 1 2 3 4
Please note that the set stores just the ids of the employees (1, 2, 3, 4).
Now it is time to sort:
SORT collection BY employees::*->name ALPHA
It will sort as you expected:
1) "1"
2) "3"
3) "2"
4) "4"
In case you need the fields, do it like this:
SORT collection BY employees::*->name ALPHA GET employees::*->name
1) "Arivu"
2) "Jane"
3) "Uma"
4) "Zakir"
In case you need age as well as name:
SORT collection BY employees::*->name ALPHA GET employees::*->name GET employees::*->age
1) "Arivu"
2) "30"
3) "Jane"
4) "25"
5) "Uma"
6) "31"
7) "Zakir"
8) "28"

Related

Why can't BigQuery handle a query processing 4 TB of data?

I'm trying to run this query
SELECT
id AS id,
ARRAY_AGG(DISTINCT users_ids) AS users_ids,
MAX(date) AS date
FROM
users,
UNNEST(users_ids) AS users_ids
WHERE
users_ids != " 1111"
AND users_ids != " 2222"
GROUP BY
id;
where the users table is a split (sharded) table with an id column, a users_ids column (comma separated), and a date column, totalling over 4 TB. It gives me this resources error:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations.
Any idea why?
Sample data:
id  users_ids  date
1   2,3,4      1-10-20
2   4,5,6      1-10-20
1   7,8,4      2-10-20
So the final result I'm trying to reach is:
id  users_ids  date
1   2,3,4,7,8  2-10-20
2   4,5,6      1-10-20
Execution details:
It's constantly repartitioning - I would guess that you're trying to cram too much into the aggregation part. Just remove the aggregation part; I don't even think you have to cross join here. Use a subquery instead of this cross join + aggregation combo.
Edit: I just realized that you want to aggregate the arrays, but with distinct values:
WITH t AS (
  SELECT
    id,
    ARRAY_CONCAT_AGG(ARRAY(
      SELECT DISTINCT uids FROM UNNEST(users_ids) AS uids
      WHERE uids != " 1111" AND uids != " 2222"
    )) AS users_ids,
    MAX(date) AS date
  FROM
    users
  GROUP BY id
)
SELECT
  id,
  ARRAY(SELECT DISTINCT * FROM UNNEST(users_ids)) AS users_ids,
  date
FROM t
This is just a draft (I assume id is the right key), but it should be something along those lines. Grouping by arrays is not possible, and ARRAY_CONCAT_AGG() has no DISTINCT option, so the deduplication comes in a second step.

Creating a column with lookup from another table

I have a table of sales from multiple stores, with the value of sales in dollars, the date, and the corresponding store.
In another table I have the store name and the expected sales amount for each store.
I want to create a column in the main table that evaluates the efficiency of sales based on the other table.
In other words, if store B made 500 sales today, I want to check the lookup table for its target, divide by it to obtain the efficiency, and then graph the efficiency of each store.
I tried creating some measures and columns but got stuck with circular dependencies.
I expect to add one column to my main table holding an integer from 0 to 100 showing the efficiency.
You can merge the two tables. In the Query Editor go to Merge Queries > Merge Queries as New, choose your relationship (match by the StoreName column), and merge the two tables. You will get something like this (using just a few rows of your sample data):
StoreName  ActualSaleAmount  ExpectedAmount
a          500               3000
a          450               3000
b          370               3500
c          400               5000
Now you can add a calculated column with your efficiency:
StoreName  ActualSaleAmount  ExpectedAmount  Efficiency
a          500               3000            500/3000
a          450               3000            450/3000
b          370               3500            370/3500
c          400               5000            400/5000
This would be:
Efficiency = [ActualSaleAmount] / [ExpectedAmount]
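For illustration only, here is the same join-and-divide logic expressed in pandas rather than Power BI (a sketch; the table and column names are taken from the sample above):

import pandas as pd

# Sales table: one row per sale.
sales = pd.DataFrame({
    "StoreName": ["a", "a", "b", "c"],
    "ActualSaleAmount": [500, 450, 370, 400],
})

# Lookup table: one expected amount per store.
targets = pd.DataFrame({
    "StoreName": ["a", "b", "c"],
    "ExpectedAmount": [3000, 3500, 5000],
})

# Equivalent of Merge Queries: a left join on StoreName.
merged = sales.merge(targets, on="StoreName", how="left")

# Equivalent of the calculated column.
merged["Efficiency"] = merged["ActualSaleAmount"] / merged["ExpectedAmount"]
print(merged)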

Room update based on selection

I am using Room in my application, and I wish to update rows based on a selection with ORDER BY.
The goal is to update the five top-scoring rows.
Inside my Dao class I have this query:
// Setting new five words
#Query("with data as( SELECT score FROM words_table ORDER BY score DESC LIMIT 5 )" +
" UPDATE data SET isInTodayWords= 1")
LiveData<List<Word>> setNewWords();
I get an error from the UPDATE statement: "can't resolve symbol data".
What am I missing here?
I believe the issue is that you need to UPDATE the actual table, not the CTE. Try:
#Query("with data as( SELECT score FROM words_table ORDER BY score DESC LIMIT 5 )" +
" UPDATE words_table SET isInTodayWords= 1")
LiveData<List<Word>> setNewWords();
As per the SQLite documentation on qualified-table-name, the target of an UPDATE must be an actual table; a CTE cannot be updated.
That said, I believe you need to utilise the CTE (data) in the WHERE clause, otherwise all rows will be updated.
As such you likely want to add WHERE score IN data (see the note below, as this could still flag extra rows; if the table has a unique column you may prefer to extract that rather than the score and use it in the WHERE clause):
#Query("WITH data AS (SELECT score FROM words_table ORDER BY score DESC LIMIT 5) UPDATE words_table SET isInTodaysWords = 1 WHERE score IN data")
An alternative approach could be:
@Query("UPDATE words_table SET isInTodayWords = 1 WHERE score >= (SELECT min(score) FROM (SELECT * FROM words_table ORDER BY score DESC LIMIT 5))")
However, if there were, say, 10 rows all sharing the same top score, then all 10 rows would be flagged.
If there were a unique column named _id, then the following would restrict the flagging to the 5 selected rows:
@Query("WITH data AS (SELECT _id FROM words_table ORDER BY score DESC LIMIT 5) UPDATE words_table SET isInTodayWords = 1 WHERE _id IN data")

Pivoting/reshaping a dataframe to have dates as columns

Here is my dataframe:
ID AMT DATE
0 1496846 54.76 2015-02-11
1 1496846 195.00 2015-01-09
2 1571558 11350.00 2015-04-30
3 1498812 135.00 2014-07-11
4 1498812 157.00 2014-08-04
5 1498812 110.00 2014-09-23
6 1498812 1428.00 2015-01-28
7 1558450 4355.00 2015-01-26
8 1858606 321.52 2015-03-27
9 1849431 1046.81 2015-03-19
I would like to turn this into a dataframe of time-series data for each ID. That is, each column name is a date (sorted), the index is the ID, and the values are the AMT values corresponding to each date. I can get as far as something like
df.set_index("DATE").T
but from here I'm stuck.
I also tried
df.pivot(index='ID', columns='DATE', values='AMT')
but this gave me an error about duplicate entries (the IDs).
I envision it as transposing DATE, and then grouping by unique ID and melting AMT underneath.
You want to use pivot_table, which has an aggfunc parameter for handling duplicate index entries:
df.pivot_table(values='AMT', index='ID', columns='DATE', aggfunc='sum')
You'll want to choose how to handle the duplicates; I put 'sum' in there. It defaults to 'mean'.
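Here is a runnable sketch of the above on the sample data (reconstructed from the question; pivot_table sorts the date columns automatically):

import pandas as pd

df = pd.DataFrame({
    "ID": [1496846, 1496846, 1571558, 1498812, 1498812,
           1498812, 1498812, 1558450, 1858606, 1849431],
    "AMT": [54.76, 195.00, 11350.00, 135.00, 157.00,
            110.00, 1428.00, 4355.00, 321.52, 1046.81],
    "DATE": pd.to_datetime([
        "2015-02-11", "2015-01-09", "2015-04-30", "2014-07-11",
        "2014-08-04", "2014-09-23", "2015-01-28", "2015-01-26",
        "2015-03-27", "2015-03-19",
    ]),
})

# One row per ID, one column per date; duplicate (ID, DATE) pairs are summed.
wide = df.pivot_table(values="AMT", index="ID", columns="DATE", aggfunc="sum")
print(wide)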

How to number households?

I have a data set with more than 20,000 records covering 4,200 households. There is no column for household ID; all the records are per household member. There is a column for the person's serial number, and the household changes at each value of "1" (i.e. if we start numbering households, at the very first record where the person's serial number equals 1 the corresponding HH_ID should be 1; at the next record where the person's serial number is 1, the HH_ID should become 2). So I want to add a column named HH_ID and number it from 1 to 4200. How could I write this in Stata?
What you want is (assuming a variable personid for the person identifier):
. gen hhid = sum(personid == 1)
That's it. The explanation is longer than the code. The expression personid == 1 evaluates as 1 when true and 0 when false. For the first household, first person, this will be 1, and for the other persons in the same household 0. For the second household, first person, this will be 1, and so on. The function sum() gives the cumulative or running sum, so that you should end with something that goes 1,1,1,2,2,2,2,3,3,3,... Clearly the actual numbers of 1s, 2s, 3s etc. will depend on the numbers of persons in the households.
On true and false in Stata see
http://www.stata.com/support/faqs/data-management/true-and-false/index.html
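For comparison, the same cumulative-sum trick in pandas (a sketch, assuming the person identifier column is named personid as above):

import pandas as pd

# Toy data: three households with 3, 4, and 2 members.
df = pd.DataFrame({"personid": [1, 2, 3, 1, 2, 3, 4, 1, 2]})

# (personid == 1) is True at the start of each household;
# the running sum turns those starts into household numbers.
df["HH_ID"] = (df["personid"] == 1).cumsum()
print(df["HH_ID"].tolist())  # [1, 1, 1, 2, 2, 2, 2, 3, 3]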