Google Sheets: find a value multiple times in a column - regex

I have a list of city names (Col 2). If a city name appears in the text of another list (Col 1), I would like to match that city name and write out its URL (Col 3).
I have tried INDEX/MATCH, REGEXMATCH, and SEARCH with wildcards, but have not found a solution.
How do I do it?
regexmatch(to_text(A2:A), textjoin("|", 1, to_text(B2:B))),C2:C,"no city"
This only returns TRUE or FALSE; it does not look up the URL (Col 3).
Col 1                   Col 2          Col 3 (URL)        Col 4 (result)
Philadelphia clothing   Chicago        Chicago URL        Philadelphia URL
Chicago                 Philadelphia   Philadelphia URL   Chicago URL
outerwear
Philadelphia shoes                                        Philadelphia URL

Can you try:
=BYROW(A2:A,LAMBDA(z,IF(z="",,XLOOKUP(IFNA(REGEXEXTRACT(z,"(?i)"&TEXTJOIN("|",1,B2:B))),B:B,C:C,))))
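For anyone prototyping this outside Sheets, the same extract-then-look-up logic can be sketched in pandas; the variable names are illustrative, not from the sheet:

```python
import pandas as pd

# Col 1: free text that may contain a city, Col 2: city list, Col 3: URLs
texts = pd.Series(["Philadelphia clothing", "Chicago", "outerwear", "Philadelphia shoes"])
cities = pd.DataFrame({"City": ["Chicago", "Philadelphia"],
                       "URL": ["Chicago URL", "Philadelphia URL"]})

# TEXTJOIN("|", 1, B2:B) equivalent: one alternation pattern of all city names
pattern = "(" + "|".join(cities["City"]) + ")"

# REGEXEXTRACT + XLOOKUP equivalent: pull the first matching city, then map it to its URL
matched = texts.str.extract(pattern, expand=False)
result = matched.map(cities.set_index("City")["URL"])
```

Rows with no matching city come back as NaN, which plays the role of the "no city" fallback.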

Power BI DAX Measure works for one column but not another

Using the information below I need to create a new table in DAX called Table (Download a demo file here).
I need to find the location of each employee (column "Name") at the time of the sale date in column "Sale Date" based on their contract details in table DbEmployees. If there is more than one valid contract for a given employee that the sale date fits in, use the shortest contract length.
My problem is that the below measure isn't working to generate column "Location", but it works just fine for column "new value".
Why is this happening and how can it be fixed?
Expected result:
SaleID  EmployeeID  Sale Date   new value  Name         Location
1       45643213    2021-02-04  89067445   Sally Shore  4
2       57647868    2020-04-15  57647868   Paul Bunyon  3
3       89067445    2019-09-24  57647868   Paul Bunyon  6
DbEmployees:
ID         Name           StartDate   EndDate     Location  Position
546465546  Sandra Newman  2021/01/01  2021/12/31  1         Manager
546465546  Sandra Newman  2020/01/01  2020/12/31  2         Clerk
546465546  Sandra Newman  2019/01/01  2019/12/31  3         Clerk
545365743  Paul Bunyon    2021/01/01  2021/12/31  6         Manager
545365743  Paul Bunyon    2020/04/01  2020/05/01  3         Clerk
545365743  Paul Bunyon    2019/04/01  2021/01/01  6         Manager
796423504  Sally Shore    2020/01/01  2020/12/31  4         Clerk
783546053  Jack Tomson    2019/01/01  2019/12/31  2         Manager
DynamicsSales:
SaleID  EmployeeID  Sale Date
1       45643213    2021/02/04
2       57647868    2020/04/15
3       89067445    2019/09/24
DynamicsContacts:
EmployeeID  Name           Email
45643213    Sandra Newman  sandra.newman@hotmail.com
65437658    Jack Tomson    jack.tomson@hotmail.com
57647868    Paul Bunyon    paul.bunyon@hotmail.com
89067445    Sally Shore    sally.shore@hotmail.com
DynamicsAudit:
SaleID  Changed Date  old value  new value  AuditID  Valid Until
1       2019/06/08    65437658   57647868   1        2020-06-07
1       2020/06/07    57647868   89067445   2        2021-05-07
1       2021/05/07    89067445   45643213   3        2021-05-07
2       2019/06/08    65437658   57647868   4        2020-06-07
2       2020/06/07    57647868   89067445   5        2021-05-07
2       2021/05/07    89067445   45643213   6        2021-05-07
3       2019/06/08    65437658   57647868   7        2020-06-07
3       2020/06/07    57647868   89067445   8        2021-05-07
3       2021/05/07    89067445   45643213   9        2021-05-07
From what I can see, there are a couple of issues with your formula.
First of all, there is no relationship between Table and DbEmployees, so you are filtering exclusively on the dates, which might get you the wrong Location. This can be fixed by changing the formula to:
Location =
VAR CurrentContractDate = [Sale Date]
VAR empName = [Name]
RETURN
    VAR RespLocation =
        TOPN (
            1,
            FILTER ( DbEmployees, DbEmployees[Name] = empName ),
            IF (
                .....
Secondly, you need to remember that the TOPN function can return multiple rows. From the documentation:
If there is a tie in order_by values at the N-th row of the table, then all tied rows are returned. So, when there are ties at the N-th row, the function might return more than n rows.
This can be fixed by picking the Max/Min of the result in the table:
RETURN MAXX(SELECTCOLUMNS( RespLocation,"Location", [Location] ), [Location])
Finally, I don't understand why the last row on the expected result should be a 3, given that the sale date is within a record with location 6.
Full expression:
Location =
VAR CurrentContractDate = [Sale Date]
VAR empName = [Name]
RETURN
    VAR RespLocation =
        TOPN (
            1,
            FILTER ( DbEmployees, DbEmployees[Name] = empName ),
            IF (
                CurrentContractDate <= DbEmployees[EndDate]
                    && CurrentContractDate >= DbEmployees[StartDate], // check whether the sale date falls within this contract
                DATEDIFF ( DbEmployees[StartDate], DbEmployees[EndDate], DAY ), // if so, rank matching contracts by length (you may want to employ a different formula)
                MIN ( // if the contract does not match, calculate how close it is (from both start and end date)
                    ABS ( DATEDIFF ( CurrentContractDate, DbEmployees[StartDate], DAY ) ),
                    ABS ( DATEDIFF ( CurrentContractDate, DbEmployees[EndDate], DAY ) )
                ) + 1000000 // add a discriminating factor so matching rows are favoured over non-matching ones
            ),
            1
        )
    RETURN
        MAXX ( SELECTCOLUMNS ( RespLocation, "Location", [Location] ), [Location] )
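For comparison, here is the same "take the shortest contract that covers the sale date" rule sketched in pandas. This is a hypothetical port of the DAX logic, not part of the original model, using Paul Bunyon's rows from DbEmployees above:

```python
import pandas as pd

db_employees = pd.DataFrame({
    "Name": ["Paul Bunyon", "Paul Bunyon", "Paul Bunyon"],
    "StartDate": pd.to_datetime(["2021/01/01", "2020/04/01", "2019/04/01"]),
    "EndDate": pd.to_datetime(["2021/12/31", "2020/05/01", "2021/01/01"]),
    "Location": [6, 3, 6],
})

def location_at(name, sale_date):
    """Location of the shortest contract covering sale_date for this employee."""
    sale_date = pd.Timestamp(sale_date)
    valid = db_employees[(db_employees["Name"] == name)
                         & (db_employees["StartDate"] <= sale_date)
                         & (db_employees["EndDate"] >= sale_date)]
    if valid.empty:
        return None  # the DAX version falls back to the nearest contract instead
    lengths = (valid["EndDate"] - valid["StartDate"]).dt.days
    return valid.loc[lengths.idxmin(), "Location"]
```

For the sale on 2020-04-15 this returns 3 (the one-month contract beats the multi-year one); for 2019-09-24 it returns 6.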

How to filter distinct counts of text with a greater than indicator in Power BI?

I am working on a report that counts stores with different types of beverages. I am trying to get a distinct count of stores that are selling four or more Powerade flavors and two or more Coca-Cola flavors, while maintaining a count of stores that are purchasing other products (Sprite, Dr. Pepper, etc.).
My data table is BEVSALES and the data looks like:
CustomerNo Brand Flavor
43 PWD Fruit Punch
37 Coca-Cola Vanilla
43 PWD Mixed Bry
37 Coca-Cola Cherry
44 Sprite Tropical Mix
43 PWD Strawberry
43 PWD Grape
44 Coca-Cola Cherry
17 Dr. Pepper Cherry
I am trying to make the data give me a distinct count of customers, filtered so that PWD >= 4 and Coca-Cola >= 2, while keeping the customer count of Dr. Pepper and Sprite at 1 each (1 customer purchasing PWD, 1 customer purchasing Coca-Cola, etc.).
The best measure that I have been able to find is
= SUMX(BEVSALES, 1*(FIND("PWD",BEVSALES[Brand],,0)))
but I don't know how to put it together so the formula counts the stores that have four or more PWD flavors and two or more Coca-Cola flavors. Any ideas?
The easiest way would be to do this in a separate query. Go to the query designer and click Edit. Then choose your table, group by the column Brand, and take a distinct count of the column Flavor. The result should look like this (maybe as a new table):
GroupedBrand DistinctCountFlavor
PWD 4
Coca-Cola 2
Sprite 1
Dr. Pepper 1
Now you can access the distinct count of flavors by brand. With an IF() statement you can then check for >= 4 on PWD, and so on.
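The same group-by-and-threshold idea can be prototyped in pandas outside Power BI. The data is the sample from the question; the thresholds dict is the only assumption:

```python
import pandas as pd

bevsales = pd.DataFrame({
    "CustomerNo": [43, 37, 43, 37, 44, 43, 43, 44, 17],
    "Brand": ["PWD", "Coca-Cola", "PWD", "Coca-Cola", "Sprite",
              "PWD", "PWD", "Coca-Cola", "Dr. Pepper"],
    "Flavor": ["Fruit Punch", "Vanilla", "Mixed Bry", "Cherry", "Tropical Mix",
               "Strawberry", "Grape", "Cherry", "Cherry"],
})

# Distinct flavor count per customer and brand
counts = bevsales.groupby(["CustomerNo", "Brand"])["Flavor"].nunique().reset_index()

# Per-brand minimums; any brand not listed just needs one flavor
thresholds = {"PWD": 4, "Coca-Cola": 2}
qualifying = counts[counts["Flavor"] >= counts["Brand"].map(thresholds).fillna(1)]

# Distinct count of qualifying customers per brand
result = qualifying.groupby("Brand")["CustomerNo"].nunique()
```

With the sample data every brand ends up with exactly one qualifying customer: customer 43 has four PWD flavors, customer 37 has two Coca-Cola flavors, and customer 44 drops out of Coca-Cola for having only one flavor.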

Isolate the country name from Location column

I have data like this, along with other columns, in a pandas df.
(Apologies, I haven't figured out how to present the question with code for the dataframe - first post.)
Location:
- Tokyo, Japan
- Sacramento, USA
- Mexico City, Mexico
- Mexico City, Mexico
- Colorado Springs, USA
- New York, USA
- Chicago, USA
Does anyone know how I could isolate the country name from the location and create a new column with just the Country Name?
Try this:
In [29]: pd.DataFrame(df.Location.str.split(', ', n=1).tolist(), columns=['City', 'Country'])
Out[29]:
               City Country
0             Tokyo   Japan
1        Sacramento     USA
2       Mexico City  Mexico
3       Mexico City  Mexico
4  Colorado Springs     USA
5          New York     USA
6           Chicago     USA
You can also do this without any regular expressions: find the position of the separator with something like indexOf(", "), then use substring to cut the string down to just the country part. A regular expression can do this easily too, but would likely be slower.
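Since the thread mentions regular expressions: pandas can also do the whole thing in one `str.extract` call, treating everything after the last comma as the country:

```python
import pandas as pd

df = pd.DataFrame({"Location": ["Tokyo, Japan", "Sacramento, USA",
                                "Mexico City, Mexico", "Colorado Springs, USA"]})

# Capture whatever follows the final ", " as the country
df["Country"] = df["Location"].str.extract(r",\s*([^,]+)$", expand=False)
```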

city population difference

I have an input file:
Chicago 500
NewYork 200
California 100
I need the difference of the second column as output for each city paired with every other city:
Chicago NewYork 300
Chicago California 400
NewYork Chicago -300
NewYork California 100
California Chicago -400
California NewYork -100
I have tried a lot but am not able to figure out the exact and correct way to implement this in MapReduce. Please give me a solution.
Here is pseudocode. I use Python often, so it looks a lot like Python. For this to work, you must know the total number of lines (i.e., cities here) and use that for N prior to running the job.

map(dummy, line):
    city, pop = line.split()
    for idx in 1..N:
        emit(idx, (city, pop))

reduce(idx, city_data):
    city_data.sort()              # sort by city so every reducer sees the same ordering
    city, pop = city_data[idx]    # the city this reducer is responsible for
    for i in 1..N:
        if idx != i:
            c, p = city_data[i]
            diff = pop - p
            emit(city, (c, diff))
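The pseudocode above can be simulated in plain Python to check the logic before wiring it into an actual MapReduce job (the shuffle phase is emulated with a dict):

```python
from collections import defaultdict

def mapper(line, n):
    """Replicate each (city, pop) record to every reducer index 0..n-1."""
    city, pop = line.split()
    for idx in range(n):
        yield idx, (city, int(pop))

def reducer(idx, city_data):
    """Reducer idx 'owns' one city and emits its difference with every other city."""
    city_data = sorted(city_data)  # sort by city name so all reducers agree on indices
    city, pop = city_data[idx]
    for i, (other, other_pop) in enumerate(city_data):
        if i != idx:
            yield city, other, pop - other_pop

lines = ["Chicago 500", "NewYork 200", "California 100"]
n = len(lines)

# Shuffle: group mapper output by key
groups = defaultdict(list)
for line in lines:
    for idx, pair in mapper(line, n):
        groups[idx].append(pair)

results = [row for idx in groups for row in reducer(idx, groups[idx])]
```

This yields all six pairs, e.g. ('Chicago', 'NewYork', 300) and ('California', 'Chicago', -400).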

Self Join in Pandas: Merge all rows with the equivalent multi-index

I have one dataframe in the following form:
df = pd.read_csv('data/original.csv', sep = ',', names=["Date", "Gran", "Country", "Region", "Commodity", "Type", "Price"], header=0)
I'm trying to do a self join on the index Date, Gran, Country, Region, producing rows of the form
Date, Gran, Country, Region, Commodity X, Type X, Price X, Commodity Y, Type Y, Price Y, Commodity Z, Type Z, Price Z
Every row should have all the different commodities and prices of a specific region.
Is there a simple way of doing this?
Any help is much appreciated!
Note: I simplified the example by ignoring a few attributes
Input example:
   Date        Country  Region          Commodity  Price
1  03/01/2014  India    Vishakhapatnam  Rice       25
2  03/01/2014  India    Vishakhapatnam  Tomato     30
3  03/01/2014  India    Vishakhapatnam  Oil        50
4  03/01/2014  India    Delhi           Wheat      10
5  03/01/2014  India    Delhi           Jowar      60
6  03/01/2014  India    Delhi           Bajra      10
Output example:
   Date        Country  Region          Commodity1  Price1  Commodity2  Price2  Commodity3  Price3
1  03/01/2014  India    Vishakhapatnam  Rice        25      Tomato      30      Oil         50
2  03/01/2014  India    Delhi           Wheat       10      Jowar       60      Bajra       10
What you want to do is called a reshape (specifically, from long to wide). See this answer for more information.
Unfortunately as far as I can tell pandas doesn't have a simple way to do that. I adapted the answer in the other thread to your problem:
df['idx'] = df.groupby(['Date','Country','Region']).cumcount()
df.pivot(index= ['Date','Country','Region'], columns='idx')[['Commodity','Price']]
Does that solve your problem?
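A runnable version of that answer, using the sample data from the question (a list-valued `pivot` index requires pandas >= 1.1). Here the numbering starts at 1 and the MultiIndex columns are flattened to match the desired Commodity1/Price1 headers:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["03/01/2014"] * 6,
    "Country": ["India"] * 6,
    "Region": ["Vishakhapatnam"] * 3 + ["Delhi"] * 3,
    "Commodity": ["Rice", "Tomato", "Oil", "Wheat", "Jowar", "Bajra"],
    "Price": [25, 30, 50, 10, 60, 10],
})

# Number the commodities within each (Date, Country, Region) group
df["idx"] = df.groupby(["Date", "Country", "Region"]).cumcount() + 1

# Long-to-wide reshape; columns become a (field, idx) MultiIndex
wide = df.pivot(index=["Date", "Country", "Region"], columns="idx")[["Commodity", "Price"]]

# Flatten the MultiIndex into Commodity1, Commodity2, ..., Price3
wide.columns = [f"{field}{i}" for field, i in wide.columns]
wide = wide.reset_index()
```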