I have this Django model with three CharFields. I want to run a query that returns the existing values of two of them and, for each combination, the existing values of the third field.
a = models.CharField(null=False, max_length=8000)
b = models.CharField(null=False, max_length=8000)
c = models.CharField(null=False, max_length=8000)
If we assume these values are in the database:
a | b | c |
---------------
a1 | b2 | c3 |
a1 | b2 | c1 |
a2 | b2 | c3 |
a1 | b3 | c3 |
a1 | b2 | c2 |
I want some result in this form :
{"a1-b2" : [c3, c1, c2], "a2-b2" : [c3], "a1-b3" : [c3]}
or
{"a1" : {"b2":[c3, c1, c2], "b3": [c3]}, "a2": {"b2" : [c3]}}
TLDR:
from django.db.models import Value
from django.db.models.functions import Concat

items = MyModel.objects.annotate(custom_field=Concat('a', Value('-'), 'b')).values('custom_field', 'c')
Explanation
With the .annotate(custom_field=Concat('a', Value('-'), 'b')) part, you are creating a temporary new column named custom_field in your queryset which will hold the value a-b.
This gives you the following structure:
a | b | c | custom_field
a1 b1 c1 a1-b1
a2 b2 c2 a2-b2
a1 b1 c3 a1-b1
The .values('custom_field', 'c') portion fetches only the custom_field and c columns from this queryset. Now all you have to do is serialize your data.
EDIT
If you want your data in that specific format, you can concatenate column c. Please read the accepted answer on this SO post: Django making a list of a field grouping by another field in model. You can then add a field during serialization that split()s the concatenated c field into a list.
I couldn't think of a good pure-SQL solution, but here's a Pythonic one using itertools.groupby:
from itertools import groupby
# Order by key fields so it will be easier to group later
items = YOUR_MODEL.objects.order_by('a', 'b')
# Group items by 'a' and 'b' fields as key
groups = groupby(items, lambda item: (item.a, item.b))
# Create dictionary with values as 'c' field from each item
res = {
'-'.join(key): list(map(lambda item: item.c, group))
for key, group in groups
}
# {'a1-b2': ['c3', 'c1', 'c2'], 'a1-b3': ['c3'], 'a2-b2': ['c3']}
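If you prefer the second, nested format from the question, the same grouping can be done with a plain-Python sketch (tuples standing in for model instances here):

```python
from collections import defaultdict

# Rows as (a, b, c) tuples; in practice these come from the queryset.
rows = [
    ('a1', 'b2', 'c3'), ('a1', 'b2', 'c1'), ('a2', 'b2', 'c3'),
    ('a1', 'b3', 'c3'), ('a1', 'b2', 'c2'),
]

nested = defaultdict(lambda: defaultdict(list))
for a, b, c in rows:
    nested[a][b].append(c)

# {'a1': {'b2': ['c3', 'c1', 'c2'], 'b3': ['c3']}, 'a2': {'b2': ['c3']}}
```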
Related
I have some table like this:
+------+------+------+
| Lvl1 | Lvl2 | Lvl3 |
+------+------+------+
| A1 | B1 | C1 |
| A1 | B1 | C2 |
| A1 | B2 | C3 |
| A2 | B3 | C4 |
| A2 | B3 | C5 |
| A2 | B4 | C6 |
| A3 | B5 | C7 |
+------+------+------+
It represents something like a hierarchy.
When a user selects A1, they actually select the first 3 rows; B1 selects the first 2 rows; and C1 selects only the first row.
That is, A is the highest level and C is the lowest. Note that IDs from different levels are unique, since they carry a distinguishing prefix (A, B, C).
The problem is when filtering in more than one level, I may have empty result set.
e.g. filtering on Lvl1=A1 & Lvl2=B3 returns nothing (no intersection). What I need is to get the first 5 rows (Lvl1=A1 or Lvl2=B3):
const lvl1Filter: IBasicFilter = {
$schema: "http://powerbi.com/product/schema#basic",
target: {
table: "Hierarchy",
column: "Lvl1"
},
operator: "In",
values: ['A1'],
filterType: FilterType.BasicFilter
}
const lvl2Filter: IBasicFilter = {
$schema: "http://powerbi.com/product/schema#basic",
target: {
table: "Hierarchy",
column: "Lvl2"
},
operator: "In",
values: ['B3'],
filterType: FilterType.BasicFilter
}
report.setFilters([lvl1Filter, lvl2Filter]);
The problem is that the filters are independent from each other, and they will both be applied, that is with AND operation between them.
So, is there a way to send the filters with OR operation between them, or is there a way to simulate it?
PS: I tried putting all the data in a single column (like the following table), and it worked, but the data was very large (millions of records) and therefore very slow, so I need something more efficient.
All data in single column:
+--------------+
| AllHierarchy |
+--------------+
| A1 |
| A2 |
| A3 |
| B1 |
| B2 |
| B3 |
| B4 |
| B5 |
| C1 |
| C2 |
| C3 |
| C4 |
| C5 |
| C6 |
| C7 |
+--------------+
Set Filter:
const allHierarchyFilter: IBasicFilter = {
$schema: "http://powerbi.com/product/schema#basic",
target: {
table: "Hierarchy",
column: "AllHierarchy"
},
operator: "In",
values: ['A1', 'B3'],
filterType: FilterType.BasicFilter
}
report.setFilters([allHierarchyFilter]);
It isn't directly possible to make an "or" filter between multiple columns in Power BI, so you were right to try to combine all values in a single column. But instead of appending all possible values by unioning them into one long column, you can combine their values "per row": concatenate all values in the current row, with a unique separator (which one depends on your actual values, not shown here). If all columns always have values, keep it simple and create a new DAX column (not a measure!):
All Levels = 'Table'[Lvl1] & "-" & 'Table'[Lvl2] & "-" & 'Table'[Lvl3]
If some of the levels can be blank, you can handle that:
All Levels = 'Table'[Lvl1] &
IF('Table'[Lvl2] = BLANK(); ""; "-" & 'Table'[Lvl2]) &
IF('Table'[Lvl3] = BLANK(); ""; "-" & 'Table'[Lvl3])
Note that depending on your regional settings, you may have to replace semicolons in the above code with commas.
This will give you a new column which contains all values from the current row, e.g. A1-B2-C3. Now you can make a filter "All Levels contains A1 or All Levels contains B3", which is a filter on a single column, where "or" is easy to express.
When embedding, your JavaScript code should create an advanced filter, like this:
const allLevelsFilter: IAdvancedFilter = {
$schema: "http://powerbi.com/product/schema#advanced",
target: {
table: "Hierarchy",
column: "All Levels"
},
logicalOperator: "Or",
conditions: [
{
operator: "Contains",
value: "A1"
},
{
operator: "Contains",
value: "B3"
}
],
filterType: FilterType.AdvancedFilter
}
report.setFilters([allLevelsFilter]);
If you need an exact match (the above code would also return rows containing A11 or B35), then add the separator at the start and end of the column too (i.e. -A1-B2-C3-) and prepend/append it to your search strings in JavaScript (i.e. search for -A1- and -B3-).
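The effect of that separator trick can be sketched in plain Python (illustrative only, not Power BI code):

```python
# Why "Contains" needs separator padding: bare substring matching over-matches.
def contains_any(all_levels: str, terms: list) -> bool:
    padded = f"-{all_levels}-"          # e.g. "-A1-B2-C3-"
    return any(f"-{t}-" in padded for t in terms)

assert contains_any("A1-B2-C3", ["A1", "B3"])        # row matched via A1
assert contains_any("A2-B3-C4", ["A1", "B3"])        # row matched via B3
assert not contains_any("A11-B35-C9", ["A1", "B3"])  # no false positive
assert "A1" in "A11-B35-C9"                          # bare Contains would over-match
```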
Hope this helps!
I have a dataset that consists of three columns: subject, predicate, and object.
subject predicate object
c1 B V3
c1 A V3
c1 T V2
c2 A V2
c2 A V3
c2 T V1
c2 B V3
c3 B V3
c3 A V3
c3 T V1
c4 A V3
c4 T V1
c5 B V3
c5 T V2
c6 B V3
c6 T V1
I want to apply association rule mining to this data using SQL queries.
I took the idea from the paper "Association Rule Mining on Semantic Data by Using SPARQL (SAG algorithm)".
First, the user has to specify T (the target predicate) and a minimum support, then query whether T is frequent:
SELECT ?pt ?ot (COUNT(*) AS ?Yent)
WHERE { ?s ?pt ?ot .
  FILTER (regex(str(?pt), 'T', 'i')). }
GROUP BY ?pt ?ot
HAVING (?Yent >= 2)
I tried the following PySpark code and got the same result:
q=mtcars1.select('s','p','o').where(mtcars1['p']=='T')
q1=q.groupBy('p','o').count()
q1.filter(q1['count']>=2).show()
result
+---+---+-----+
| p| o|count|
+---+---+-----+
| T| V2| 2|
| T| V1| 4|
+---+---+-----+
Second query: check whether the other predicate/object pairs are frequent:
q2=mtcars1.select('s','p','o').where(mtcars1['p']!='T')
q3=q2.groupBy('p','o').count()
q3.filter(q3['count']>=2).show()
result
+---+---+-----+
| p| o|count|
+---+---+-----+
| A| V3| 4|
| B| V3| 5|
+---+---+-----+
In order to find rules between the two queries above, we scan the dataset again and check whether the pairs occur together with a count greater than or equal to the minimum support:
SELECT ?pe ?oe ?pt ?ot (COUNT(*) AS ?supCNT)
WHERE { ?s ?pt ?ot .
  FILTER (regex(str(?pt), 'T', 'i')).
  ?s ?pe ?oe .
  FILTER (!regex(str(?pe), 'T', 'i')). }
GROUP BY ?pe ?oe ?pt ?ot
HAVING (?supCNT >= 2)
ORDER BY ?pt ?ot
I tried storing the subjects in a list and then joining between items, but this took a long time, and it will take far longer when the data is very large.
from pyspark.sql.functions import collect_list

w = mtcars1.select('s', 'p', 'o').where(mtcars1['p'] == 'T')
w1 = w.groupBy('p', 'o').agg(collect_list('s'))  # keep the DataFrame; .show() returns None
w1.show()
result
+---+---+----------------+
| p| o| collect_list(s)|
+---+---+----------------+
| T| V2| [c1, c5]|
| T| V1|[c2, c3, c4, c6]|
+---+---+----------------+
w2 = mtcars1.select('s', 'p', 'o').where(mtcars1['p'] != 'T')
w3 = w2.groupBy('p', 'o').agg(collect_list('s'))  # keep the DataFrame; .show() returns None
w3.show()
result
+---+---+--------------------+
| p| o| collect_list(s)|
+---+---+--------------------+
| A| V3| [c1, c2, c3, c4]|
| B| V3|[c1, c2, c3, c5, c6]|
| A| V2| [c2]|
+---+---+--------------------+
join code
import pyspark.sql.functions as f  # the code below uses the f. prefix

# intersection_udf / intersection_length_udf are custom UDFs (defined elsewhere)
# returning the intersection of two lists and its length, respectively.
w44 = w1.alias("l")\
    .crossJoin(w3.alias("r"))\
.select(
f.col('l.p').alias('lp'),
f.col('l.o').alias('lo'),
f.col('r.p').alias('rp'),
f.col('r.o').alias('ro'),
intersection_udf(f.col('l.collect_list(s)'), f.col('r.collect_list(s)')).alias('TID'),
intersection_length_udf(f.col('l.collect_list(s)'), f.col('r.collect_list(s)')).alias('len')
)\
.where(f.col('len') > 1)\
.select(
f.struct(f.struct('lp', 'lo'), f.struct('rp', 'ro')).alias('2-Itemset'),
'TID'
)\
.show()
result
+---------------+------------+
| 2-Itemset| TID|
+---------------+------------+
|[[T,V2],[B,V3]]| [c1, c5]|
|[[T,V1],[A,V3]]|[c3, c2, c4]|
|[[T,V1],[B,V3]]|[c3, c2, c6]|
+---------------+------------+
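The 2-itemset step above can be sketched in plain Python, with small sets standing in for the collected subject lists (values taken from the tables above):

```python
# Intersect the subject sets of the target groups and the other groups, and
# keep pairs whose support (intersection size) meets the minimum support.
t_groups = {('T', 'V2'): {'c1', 'c5'},
            ('T', 'V1'): {'c2', 'c3', 'c4', 'c6'}}
other_groups = {('A', 'V3'): {'c1', 'c2', 'c3', 'c4'},
                ('B', 'V3'): {'c1', 'c2', 'c3', 'c5', 'c6'},
                ('A', 'V2'): {'c2'}}

min_support = 2
itemsets = {(t, o): t_groups[t] & other_groups[o]
            for t in t_groups for o in other_groups
            if len(t_groups[t] & other_groups[o]) >= min_support}
# {(('T','V2'),('B','V3')): {'c1','c5'},
#  (('T','V1'),('A','V3')): {'c2','c3','c4'},
#  (('T','V1'),('B','V3')): {'c2','c3','c6'}}
```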
So I have to rescan the dataset to find association rules between items, and then rescan again for further rules.
The following query is used to construct the 3-itemsets:
SELECT ?pe1 ?oe1 ?pe2 ?oe2 ?pt ?ot (COUNT(*) AS ?supCNT)
WHERE { ?s ?pt ?ot .
  FILTER (regex(str(?pt), 'T', 'i')).
  ?s ?pe1 ?oe1 .
  FILTER (!regex(str(?pe1), 'T', 'i')).
  ?s ?pe2 ?oe2 .
  FILTER (!regex(str(?pe2), 'T', 'i') && !regex(str(?pe2), str(?pe1), 'i')). }
GROUP BY ?pe1 ?oe1 ?pe2 ?oe2 ?pt ?ot
HAVING (?supCNT >= 2)
ORDER BY ?pt ?ot
result for this query should be
{[(A, V3) (B, V3) (T, V1), 2]}
and we will repeat the queries until no further rules between items are found.
Can anyone help me derive association rules with SQL queries, where the subject is used as the ID and predicate + object form the items?
Is it possible to perform calculations between records in a Django query?
I know how to perform calculations across records (e.g. data_a + data_b). Is there a way to compute, say, the percent change between data_a in row 0 and row 4 (i.e. 09-30-17 and 09-30-16)?
+-----------+--------+--------+
| date | data_a | data_b |
+-----------+--------+--------+
| 09-30-17 | 100 | 200 |
| 06-30-17 | 95 | 220 |
| 03-31-17 | 85 | 205 |
| 12-31-16 | 80 | 215 |
| 09-30-16 | 75 | 195 |
+-----------+--------+--------+
I am currently using Pandas to perform these types of calculations, but would like to eliminate this additional step if possible.
I would go with a database cursor and raw SQL
(see https://docs.djangoproject.com/en/2.0/topics/db/sql/)
combined with a LAG() window function, like so:
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("""
        select date,
               data_a - lag(data_a) over (order by date) as data_change
        from foo;""")
    result = cursor.fetchall()
This is the general idea, you might need to change it according to your needs.
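What LAG() computes can be illustrated in plain Python (values from the question's table, ordered by date ascending):

```python
# LAG(data_a) OVER (ORDER BY date): each row sees the previous row's value.
data_a = [75, 80, 85, 95, 100]  # data_a ordered by date ascending
data_change = [None] + [cur - prev for prev, cur in zip(data_a, data_a[1:])]
# [None, 5, 5, 10, 5] -- the first row has no previous value
```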
There is no row 0 in a Django database, so we'll assume rows 1 and 5.
The general formula for calculation of percentage as expressed in Python is:
((b - a) / a) * 100
where a is the starting number and b is the ending number. So in your example:
a = 100
b = 75
((b - a) / a) * 100
-25.0
If your model is called Foo, the queries you want are:
(a, b) = Foo.objects.filter(id__in=[id_1, id_2]).values_list('data_a', flat=True)
values_list says "get just these fields" and flat=True means you want a simple list of values, not key/value pairs. By assigning it to the (a, b) tuple and using the __in= clause, you get to do this as a single query rather than as two. (Note that the database does not guarantee the result order for id__in, so add an order_by if the two values must come back in a known order.)
I would wrap it all up into a standalone function or model method:
def pct_change(id_1, id_2):
# Get a single column from two rows and return percentage of change
(a, b) = Foo.objects.filter(id__in=[id_1, id_2]).values_list('data_a', flat=True)
return ((b - a) / a) * 100
And then if you know the row IDs in the db for the two rows you want to compare, it's just:
print(pct_change(233, 8343))
If you'd like to calculate the change progressively (row 1 to row 2, then row 2 to row 3, and so on), you'd just run this function sequentially over a queryset. Because row IDs might have gaps, we can't just use n+1 to find the next row. Instead, start by collecting all the row IDs in a queryset:
rows = [r.id for r in Foo.objects.all().order_by('date')]
Which evaluates to something like
rows = [1,2,3,5,6,9,13]
Now for each elem in list and the next elem in list, run our function:
for (index, row) in enumerate(rows):
    if index < len(rows) - 1:  # stop before the last row: it has no successor
        current, next_ = row, rows[index + 1]
        print(current, next_)
        print(pct_change(current, next_))
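As an aside, consecutive IDs can also be paired without index arithmetic, e.g. with zip (a small illustrative sketch using the sample IDs above):

```python
# Pair each row ID with the next one; zip stops at the shorter sequence,
# so the last ID is naturally excluded as a starting point.
rows = [1, 2, 3, 5, 6, 9, 13]
pairs = list(zip(rows, rows[1:]))
# [(1, 2), (2, 3), (3, 5), (5, 6), (6, 9), (9, 13)]
```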
I have this DB2 table
A | B | C
aaaa |123 |
bbbb |1 |
cccc |123456 |
All columns are varchars. I would like column C filled with the contents of B concatenated with the contents of A.
BUT the max length of C is 8, so if the concatenated string exceeds 8 characters, I would like only the first 5 characters plus "...".
Basically:
if (length(A) + length(B) > maximum(C)) {
    // display only the first (maximum(C) - 3) characters, then add "..."
} else {
    // display B + A
}
How can I do this in DB2?
One good option would be to define column C as generated column so you do not have to handle anything.
create table t3 (A varchar(10),
B varchar(10),
C varchar(8) generated always as (case when length(concat(A, B)) > 8 then substr(concat(A,B),1,5) || '...' else concat(A, B) end)
)
insert into t3 (A,B) values ('This', ' is a test');
insert into t3 (A,B) values ('ABCD', 'EFGH');
select * from t3
will return
A B C
----------------------------------
This is a test This ...
ABCD EFGH ABCDEFGH
Alternatives could be triggers, procedures, explicit code etc.
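For reference, the truncation rule implemented by the generated column can be written out in Python like this (max length 8, keep 5 characters plus "..."):

```python
def fit_c(a: str, b: str, max_len: int = 8) -> str:
    """Concatenate a and b; truncate to (max_len - 3) chars + '...' if too long."""
    combined = a + b
    if len(combined) > max_len:
        return combined[:max_len - 3] + '...'
    return combined

# Mirrors the two inserted rows:
# fit_c('This', ' is a test') -> 'This ...'
# fit_c('ABCD', 'EFGH')       -> 'ABCDEFGH'
```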
I am very new to Splunk and need your help resolving the issue below.
I have two CSV files uploaded to a Splunk instance. Below are the files and their fields.
Apple.csv
  fields: A1, A2, A3
Orange.csv
  fields: O1 (may have values matching values of A3), O2
My requirement is as below:
Select the set of values of A1, A2, A3, and O2 from Apple.csv and Orange.csv
where A1="X" and A2="Y" and A3 = O1
and display the values in a table:
A1   A2    A3
X    Y     123
LP   HJK   222
X    Y     999

O1     O2
999    open
123    closed
65432  open
Output
A1  A2  A3   O2
X   Y   123  closed
X   Y   999  open
Very much appreciate your help.
You could do this
source="apple.csv" OR source="orange.csv"
| eval grouping=coalesce(A3,O1)
| stats first(A1) as A1 first(A2) as A2 first(A3) as A3 first(O2) as O2 by grouping
| fields - grouping
Although I would think that considering the timestamp of the events might also be important...
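The coalesce-then-stats trick effectively performs an outer join on A3/O1. In plain-Python terms (dicts standing in for Splunk events):

```python
# Merge rows from both sources under a single grouping key:
# A3 for apple rows, O1 for orange rows (what coalesce(A3, O1) produces).
apple = [{'A1': 'X', 'A2': 'Y', 'A3': '123'},
         {'A1': 'LP', 'A2': 'HJK', 'A3': '222'},
         {'A1': 'X', 'A2': 'Y', 'A3': '999'}]
orange = [{'O1': '999', 'O2': 'open'},
          {'O1': '123', 'O2': 'closed'},
          {'O1': '65432', 'O2': 'open'}]

merged = {}
for row in apple:
    merged.setdefault(row['A3'], {}).update(row)
for row in orange:
    merged.setdefault(row['O1'], {}).update(row)

# Keep only groups that have both sides and match the A1/A2 criteria.
output = [m for m in merged.values()
          if m.get('A1') == 'X' and m.get('A2') == 'Y' and 'O2' in m]
# [{'A1': 'X', 'A2': 'Y', 'A3': '123', 'O1': '123', 'O2': 'closed'},
#  {'A1': 'X', 'A2': 'Y', 'A3': '999', 'O1': '999', 'O2': 'open'}]
```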