I have a table of files that are duplicated across multiple countries; duplicated files share the same filename.
I am trying to create a calculated column that looks up the "Tag" value where
File = current filename in row context AND Geo = "us"
This would return the US tag value for each duplicate of a file that exists in other countries. See the desired result in the table below.
File      | Geo | Tag | Tag Lookup
FileName1 | us  | a   | a
FileName1 | jp  | b   | a
FileName1 | cn  | c   | a
FileName2 | us  | e   | e
FileName2 | jp  | f   | e
FileName2 | cn  | g   | e
I tried using a LOOKUPVALUE formula:
Tag Lookup = LOOKUPVALUE( Table1[Tag] , Table1[Geo] , "us" , Table1[File] , Table1[File] )
My assumption was that the search_columnName2 instance of Table1[File] would act as the lookup column, and that the search_value instance of Table1[File] would be interpreted as the scalar value in the current row context... but when I try the formula I get the error: "A table of multiple values was supplied where a single value was expected".
I then tried
CALCULATE(
    VALUES( Table1[Tag] ),
    FILTER(
        Table1,
        Table1[Geo] = "us" &&
        Table1[File] = Table1[File]
    )
)
But in this case I get the error: "A circular dependency was detected"
How can I perform a lookup / retrieve a value from within the same table, using a column's current context as part of the search?
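One pattern that should avoid both errors is to capture the current row's filename in a variable first, so that LOOKUPVALUE receives a true scalar rather than a column reference. A minimal sketch, assuming the table is named Table1 as above:

Tag Lookup =
VAR CurrentFile = Table1[File]    // evaluated in row context, so this is a scalar
RETURN
    LOOKUPVALUE(
        Table1[Tag],
        Table1[Geo], "us",
        Table1[File], CurrentFile
    )

Because the variable is resolved before the lookup runs, the filter no longer references itself, so no circular dependency arises.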
I have an import query (table a) and an imported Excel file (table b) with records I am trying to match up.
I am looking for a method to replicate this type of SQL in M:
SELECT a.loc_id, a.other_data, b.stk
FROM a INNER JOIN b on a.loc_id BETWEEN b.from_loc AND b.to_loc
Table A
| loc_id | other data |
-------------------------
| 34A032B1 | ... |
| 34A3Z011 | ... |
| 3DD23A41 | ... |
Table B
| stk | from_loc | to_loc |
--------------------------------
| STKA01 | 34A01 | 34A30ZZZ |
| STKA02 | 34A31 | 34A50ZZZ |
| ... | ... | ... |
Goal
| loc_id | other data | stk |
----------------------------------
| 34A032B1 | ... | STKA01 |
| 34A3Z011 | ... | STKA02 |
| 3DD23A41 | ... | STKD01 |
All of the other queries I can find along these lines use numbers, dates, or times in the BETWEEN clause, and seem to work by exploding the (from, to) range into all possible values and then filtering out the extra rows. However, I need to use string comparisons, and exploding those into all possible values would be infeasible.
Among the various solutions I could find, the closest I've come is to add a custom column on table a:
Table.SelectRows(
    table_b,
    (a) => Value.Compare([loc_id], table_b[from_loc]) = 1
        and Value.Compare([loc_id], table_b[to_loc]) = -1
)
This does return all the columns from table_b; however, when expanding the column, the values are all null.
Your requirement is not very specific ("After 34A01 could be any string...") when it comes to figuring out how your series progresses.
But maybe you can just test for how a value "sorts" using the native sorting function in PQ.
Add a custom column with Table.SelectRows:
= try Table.SelectRows(TableB, (t)=> t[from_loc]<=[loc_id] and t[to_loc] >= [loc_id])[stk]{0} otherwise null
To reproduce with your examples:
let
    TableB = Table.FromColumns(
        {{"STKA01", "STKA02"},
         {"34A01", "34A31"},
         {"34A30ZZZ", "34A50ZZZ"}},
        type table [stk = text, from_loc = text, to_loc = text]),
    TableA = Table.FromColumns(
        {{"34A032B1", "34A3Z011", "3DD23A41"},
         {"...", "...", "..."}},
        type table [loc_id = text, #"other data" = text]),
    //determine where each loc_id sorts within TableB and return the matching stk
    #"Added Custom" = Table.AddColumn(TableA, "stk", each
        try Table.SelectRows(TableB, (t) => t[from_loc] <= [loc_id] and t[to_loc] >= [loc_id])[stk]{0} otherwise null)
in
    #"Added Custom"
Note: if the above algorithm is too slow, there may be faster methods of obtaining these results
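One cheap speed-up to try first is buffering TableB into memory once, so the per-row Table.SelectRows scans an in-memory copy instead of re-evaluating its source each time. A sketch of the amended steps (BufferedB is a step name of my own choosing):

    BufferedB = Table.Buffer(TableB),
    #"Added Custom" = Table.AddColumn(TableA, "stk", each
        try Table.SelectRows(BufferedB, (t) => t[from_loc] <= [loc_id] and t[to_loc] >= [loc_id])[stk]{0} otherwise null)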
I am trying to use Django StrIndex to find all rows whose value is a substring of a given string.
E.g., my table contains:
+----------+------------------+
| user | domain |
+----------+------------------+
| spam1 | spam.com |
| badguy+ | |
| | protonmail.com |
| spammer | |
| | spamdomain.co.uk |
+----------+------------------+
but the query
SpamWord.objects.annotate(idx=StrIndex(models.Value('xxxx'), 'user')).filter(models.Q(idx__gt=0) | models.Q(domain='spamdomain.co.uk')).first()
matches <SpamWord: *#protonmail.com>
The generated query is: SELECT `spamwords`.`id`, `spamwords`.`user`, `spamwords`.`domain`, INSTR('xxxx', `spamwords`.`user`) AS `idx` FROM `spamwords` WHERE (INSTR('xxxx', `spamwords`.`user`) > 0 OR `spamwords`.`domain` = 'spamdomain.co.uk')
It should be <SpamWord: *#spamdomain.co.uk>
This is happening because
INSTR('xxxx', '') => 1
(and also INSTR('xxxxasd', 'xxxx') => 1, which is correct)
How can I write this query in order to get entry #5 (spamdomain.co.uk)?
The order of the parameters of StrIndex [Django-doc] is swapped. The first parameter is the haystack, the string in which you search, and the second one is the needle, the substring you are looking for.
You thus can annotate with:
from django.db.models import Q, Value
from django.db.models.functions import StrIndex

SpamWord.objects.annotate(
    idx=StrIndex('user', Value('xxxx'))
).filter(
    Q(idx__gt=0) | Q(domain='spamdomain.co.uk')
).first()
Just filter out the rows where user is empty:
(~models.Q(user='') & models.Q(idx__gt=0)) | models.Q(domain='spamdomain.co.uk')
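Combining the two suggestions, the whole lookup might read as follows (a sketch, assuming the SpamWord model from the question):

from django.db.models import Q, Value
from django.db.models.functions import StrIndex

SpamWord.objects.annotate(
    idx=StrIndex('user', Value('xxxx'))  # haystack first, then the needle
).filter(
    (~Q(user='') & Q(idx__gt=0)) | Q(domain='spamdomain.co.uk')
).first()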
I have a table with numeric values and blank records. I'm trying to count the number of rows that are not blank and greater than 20.
+--------+
| VALUES |
+--------+
| 2 |
| 0 |
| 13 |
| 40 |
| |
| 1 |
| 200 |
| 4 |
| 135 |
| |
| 35 |
+--------+
I've tried different options but keep getting the following error: "Cannot convert value '' of type Text to type Number". I understand that blank cells are treated as text and thus my filter (>20) doesn't work. Converting blanks to "0" is not an option, as I need to use the same values later to calculate the average and median.
CALCULATE(
    COUNTROWS(Table3),
    VALUE(Table3[VALUES]) > 20
)
Or this, which returns "10" as a result:
=CALCULATE(
    COUNTROWS(ALLNOBLANKROW(Table3[VALUES])),
    VALUE(Table3[VALUES]) > 20
)
The final result in the example table should be: 4
Would be grateful for any help!
First, the VALUE function expects a string. It converts strings like "123" into the integer 123, so let's not use that.
The easiest approach is with an iterator function like COUNTX.
CountNonBlank = COUNTX(Table3, IF(Table3[Values] > 20, 1, BLANK()))
Note that we don't need a separate case for BLANK() (null) here since BLANK() > 20 evaluates as False.
There are tons of other ways to do this. Another iterator solution would be:
CountNonBlank = COUNTROWS(FILTER(Table3, Table3[Values] > 20))
You can use the same FILTER inside of a CALCULATE, but that's a bit less elegant.
CountNonBlank = CALCULATE(COUNT(Table3[Values]), FILTER(Table3, Table3[Values] > 20))
Edit
I don't recommend the CALCULATE version. If you have more columns with more conditions, just add them to your FILTER. E.g.
CountNonBlank =
COUNTROWS(
    FILTER(Table3,
        Table3[Values] > 20
        && Table3[Text] = "xyz"
        && Table3[Number] <> 0
        && Table3[Date] <= DATE(2018, 12, 31)
    )
)
You can also do OR logic with || instead of the && for AND.
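For instance, reusing the illustrative columns from the snippet above (the second condition is purely for illustration, not from the question):

CountOver20OrXyz =
COUNTROWS(
    FILTER(Table3,
        Table3[Values] > 20
        || Table3[Text] = "xyz"
    )
)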
I have a dataframe which has duplicate rows, and I would like to merge them into a single record with all distinct column values.
My code sample is as follows:
df1 = sqlContext.createDataFrame(
    [("81A01", "TERR NAME 01", "NJ", "", ""),
     ("81A01", "TERR NAME 01", "", "NY", ""),
     ("81A01", "TERR NAME 01", "", "", "LA"),
     ("81A02", "TERR NAME 01", "CA", "", ""),
     ("81A02", "TERR NAME 01", "", "", "NY")],
    ["zip_code", "territory_name", "state", "state1", "state2"])
The resulting dataframe is as follows:
df1.show()
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
| 81A01| TERR NAME 01| NJ| | |
| 81A01| TERR NAME 01| | NY| |
| 81A01| TERR NAME 01| | | LA|
| 81A02| TERR NAME 01| CA| | |
| 81A02| TERR NAME 01| | | NY|
+--------+--------------+-----+------+------+
I need to merge/consolidate duplicate records based on the zip_code, and get all different state values in one row.
Expected result:
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
| 81A01| TERR NAME 01| NJ| NY| LA|
| 81A02| TERR NAME 01| CA| | NY|
+--------+--------------+-----+------+------+
I am new to PySpark and not sure how to use grouping / joins. Can someone please help with the code?
If you are sure that there is only one state, state1, and state2 value for each zip_code/territory_name combination, you can use the following code. The max function picks the non-empty string in the grouped data, since a non-empty string has a higher value (lexicographically) than the empty string "".
from pyspark.sql.types import *
from pyspark.sql.functions import *

df1 = sqlContext.createDataFrame(
    [("81A01", "TERR NAME 01", "NJ", "", ""),
     ("81A01", "TERR NAME 01", "", "NY", ""),
     ("81A01", "TERR NAME 01", "", "", "LA"),
     ("81A02", "TERR NAME 01", "CA", "", ""),
     ("81A02", "TERR NAME 01", "", "", "NY")],
    ["zip_code", "territory_name", "state", "state1", "state2"])

df1.groupBy("zip_code", "territory_name").agg(
    max("state").alias("state"),
    max("state1").alias("state1"),
    max("state2").alias("state2")).show()
Result:
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
| 81A02| TERR NAME 01| CA| | NY|
| 81A01| TERR NAME 01| NJ| NY| LA|
+--------+--------------+-----+------+------+
Note: for any unique combination of zip_code and territory_name, if any of the state columns has multiple entries, the following solution concatenates them.
Some explanation: in this code I use RDDs. I first map each record into a pair of tuples, with tuple1 = (zip_code, territory_name) as the key and tuple2 = (state, state1, state2) as the value. Then I reduce by key: in reduceByKey, x and y are two values sharing the same key, each holding the 3 state columns. tuple1 is taken as the key because we want to group by the distinct values of zip_code and territory_name, so every distinct pair like (81A01, TERR NAME 01) or (81A02, TERR NAME 01) is a key on which we reduce. Reducing means taking two values at a time, applying an operation to them, then repeating that operation on the result and the next element until the values for the key are exhausted.
So, a reduce of (1,2,3,4,5) with the + operation goes: 1+2=3, then 3+3=6, then 6+4=10, and finally 10+5=15. Since the tuple ended at 5, the result is 15. That is how reduce works with +. Since here we have strings rather than numbers, concatenation happens instead: A+B=AB.
df1 = (df1.rdd
    # key = (zip_code, territory_name), value = (state, state1, state2)
    .map(lambda r: ((r.zip_code, r.territory_name), (r.state, r.state1, r.state2)))
    # concatenate the three state fields pairwise for rows sharing a key
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
    # flatten (key, value) back into a single 5-tuple
    .map(lambda r: (r[0][0], r[0][1], r[1][0], r[1][1], r[1][2]))
    .toDF(["zip_code", "territory_name", "state", "state1", "state2"]))
df1.show()
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
| 81A01| TERR NAME 01| NJ| NY| LA|
| 81A02| TERR NAME 01| CA| | NY|
+--------+--------------+-----+------+------+
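For completeness, the same concatenation behaviour can also be sketched with the DataFrame API instead of RDDs (note that collect_list makes no ordering guarantee, and the empty strings simply contribute nothing to the concatenation):

from pyspark.sql import functions as F

df1.groupBy("zip_code", "territory_name").agg(
    F.concat_ws("", F.collect_list("state")).alias("state"),
    F.concat_ws("", F.collect_list("state1")).alias("state1"),
    F.concat_ws("", F.collect_list("state2")).alias("state2")).show()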
I have this table
With a SQL query I can get aggregated information for every car: the total "amount", the number of "checked" and "unchecked" rows, and a flag marking whether checking is finished for all of a car's rows:
SELECT
    car_id
    , SUM(amount) as total_amount
    , SUM(IF(checked=1,1,0)) as already_checked
    , SUM(IF(checked=0,1,0)) as not_checked
    , IF(SUM(IF(checked=0,1,0))=0,1,0) as check_finished
FROM
    refuels_flow
GROUP BY car_id
Result:
+--------+--------------+-----------------+-------------+----------------+
| car_id | total_amount | already_checked | not_checked | check_finished |
+--------+--------------+-----------------+-------------+----------------+
| 1 | 1300 | 1 | 12 | 0 |
| 2 | 300 | 3 | 0 | 1 |
+--------+--------------+-----------------+-------------+----------------+
The question is: how can I do this with the Django ORM (without using a raw query)?
To obtain the same SQL output, you may use the following queryset:
from django.db.models import Func, Sum

already_checked = Sum(Func('checked', function='IF', template='%(function)s(%(expressions)s=0, 0, 1)'))
not_checked = Sum(Func('checked', function='IF', template='%(function)s(%(expressions)s=0, 1, 0)'))
check_finished = Func(
    not_checked,
    function='IF', template='%(function)s(%(expressions)s=0, 1, 0)'
)

Refuels.objects.values('car_id').annotate(
    total_amount=Sum('amount'),
    already_checked=already_checked,
    not_checked=not_checked,
    check_finished=check_finished
)
Check the docs on query expressions for more information.
Now, already_checked could be simplified with:
already_checked = Sum('checked')
And instead of having the not_checked and check_finished annotations, you could annotate the count and easily compute them in Python, for example:
from django.db.models import Count, Sum

qs = Refuels.objects.values('car_id').annotate(
    count_for_car=Count('car_id'),
    total_amount=Sum('amount'),
    already_checked=Sum('checked'),
)
for entry in qs:
    not_checked = entry['count_for_car'] - entry['already_checked']
    check_finished = not_checked == 0
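As an aside, if you want to avoid the MySQL-specific IF() function, the same conditional sums can be sketched with Django's database-agnostic Case/When (same assumed Refuels model as above):

from django.db.models import Case, IntegerField, Sum, When

qs = Refuels.objects.values('car_id').annotate(
    total_amount=Sum('amount'),
    # count rows with checked=1 / checked=0 without raw SQL IF()
    already_checked=Sum(Case(When(checked=1, then=1), default=0,
                             output_field=IntegerField())),
    not_checked=Sum(Case(When(checked=0, then=1), default=0,
                         output_field=IntegerField())),
)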