Find duplicate across sheet based on multiple columns google sheet

I want to match a row in one sheet with a row in another sheet (for use in conditional formatting). The match must be based on multiple columns.

https://docs.google.com/spreadsheets/d/18Cr13bQZ2ZZnb1Y2Nq6aMFhHXhFTZsioJ3M4S1fzURQ/edit?usp=sharing

Sheet 1
|Country|Year|Location|
|India |2001|D1 |
|Russia |1999|D3 |
|Kenya |1001|D4 |
|India |1999|D2 |

Sheet 2
|Country |Year|Destination|
|India |2000|DA1 |
|Bulgaria |1999|DA3 |
|Wakanda |1001|DA4 |
|India |1999|DA2 |
Only India-1999 should be highlighted

try:
=ARRAYFORMULA(REGEXMATCH($A2&$B2, TEXTJOIN("|", 1,
INDIRECT("Sheet1!A2:A")&INDIRECT("Sheet1!B2:B"))))
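
To see what the formula does, here is a minimal Python sketch of the same idea using the sample rows from the question. It concatenates Country and Year into one key per row, joins the Sheet1 keys with "|" into a single regex, and tests each Sheet2 key against it. The names sheet1, sheet2, key and is_duplicate are illustrative only, and the re.escape call is an addition that the formula itself does not have.

```python
import re

# Sample rows from the question: (Country, Year, third column)
sheet1 = [("India", 2001, "D1"), ("Russia", 1999, "D3"),
          ("Kenya", 1001, "D4"), ("India", 1999, "D2")]
sheet2 = [("India", 2000, "DA1"), ("Bulgaria", 1999, "DA3"),
          ("Wakanda", 1001, "DA4"), ("India", 1999, "DA2")]

def key(row):
    """Concatenate Country and Year, like $A2&$B2 in the formula."""
    return f"{row[0]}{row[1]}"

# Rough equivalent of TEXTJOIN("|", 1, Sheet1!A2:A & Sheet1!B2:B).
# re.escape is an addition (not in the formula) that guards against
# regex metacharacters inside a country name.
pattern = "|".join(re.escape(key(r)) for r in sheet1)

def is_duplicate(row):
    """True if the row's key matches any Sheet1 key (search, not full match)."""
    return re.search(pattern, key(row)) is not None

highlighted = [key(r) for r in sheet2 if is_duplicate(r)]
print(highlighted)
```

Because the test is a search rather than a full match, a key that merely contains a Sheet1 key would also be flagged; wrapping the joined pattern in "^(" and ")$" is one way to tighten it if that matters for your data.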

Related

How to use regexp_replace in spark.sql() to extract hashtags from string

I need to write a regexp_replace query in spark.sql() and I'm not sure how to handle it. For readability purposes, I have to use SQL for it. I am trying to pull the hashtags out of the table. I know how to do this with the Python API, but most of my team are SQL users.
My dataframe example looks like so:
Insta_post
Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…
RT @NALCABPolicy: Meeting with @RepDarrenSoto . Thanks for taking the time to meet with @LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…
RT @Tharryry: I am delighted that @RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…
My code:
I create a tempview:
post_df.createOrReplaceTempView("post_tempview")
post_df = spark.sql("""
select
regexp_replace(Insta_post, '.*?(.|'')(#)(\w+)', '$1') as a
from post_tempview
where Insta_post like '%#%'
""")
My end result:
+--------------------------------------------------------------------------------------------------------------------------------------------+
|a |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… |
|RT @NALCABPolicy: Meeting with @RepDarrenSoto . Thanks for taking the time to meet with @LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
|RT @Tharryry: I am delighted that @RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…|
+--------------------------------------------------------------------------------------------------------------------------------------------+
desired result:
+---------------------------------+
|a |
+---------------------------------+
| #SaveTheInternet, #NetNeutrality|
| #NALCABPolicy2018 |
| #NetNeutrality |
+---------------------------------+
I haven't really used regexp_replace too much so this is new to me. Any help would be appreciated as well as an explanation of how to structure the subsets!

For Spark 3.1+, you can use the regexp_extract_all function to extract multiple matches:
post_df = spark.sql("""
select regexp_extract_all(Insta_post, '(#\\\\w+)', 1) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+----------------------------------+
#|a |
#+----------------------------------+
#|[#SaveTheInternet, #NetNeutrality]|
#|[#NALCABPolicy2018] |
#|[#NetNeutrality] |
#+----------------------------------+
For Spark <3.1, you can use regexp_replace to remove everything that doesn't match the hashtag pattern:
post_df = spark.sql("""
select trim(trailing ',' from regexp_replace(Insta_post, '.*?(#\\\\w+)|.*', '$1,')) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+-------------------------------+
#|a |
#+-------------------------------+
#|#SaveTheInternet,#NetNeutrality|
#|#NALCABPolicy2018 |
#|#NetNeutrality |
#+-------------------------------+
Note the use of trim to remove the unnecessary trailing commas produced by the '$1,' replacement.
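
To make the replace-based trick easier to follow, here is a minimal sketch of the same pattern in plain Python, outside Spark. It assumes Python 3.7+ and uses the standard re module; the function name hashtags_via_replace is hypothetical. Python's re.sub and Spark's Java regex engine can differ in how they treat empty matches, so this illustrates the idea rather than reproducing Spark's exact intermediate string.

```python
import re

def hashtags_via_replace(text):
    """Keep only the hashtags by replacing everything else.

    Each match of the first alternative (lazy text followed by a hashtag)
    is replaced by the hashtag plus a comma. Leftover text, matched by the
    second alternative, becomes a bare comma because group 1 is empty.
    """
    replaced = re.sub(r".*?(#\w+)|.*", r"\1,", text)
    # Remove the trailing commas left by the non-hashtag tail,
    # like trim(trailing ',' from ...) in the SQL version.
    return replaced.rstrip(",")

post = ("Today, Senate Dems vote to #SaveTheInternet. Proud to support "
        "similar #NetNeutrality legislation here in the House…")
print(hashtags_via_replace(post))
```

In plain Python, re.findall(r"#\w+", text) is the more direct route; the replace form is shown only because it mirrors what is available in Spark SQL before 3.1.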

Do you really need a view? Because the following code might do it:
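from pyspark.sql import functions as F
df = df.filter(F.col('Insta_post').like('%#%'))
# Replace everything up to each hashtag with the hashtag plus a space; drop the rest
col_trimmed = F.trim(F.regexp_replace('Insta_post', r'.*?(#\w+)|.+', '$1 '))
# Turn the separating spaces into ", "
df = df.select(F.regexp_replace(col_trimmed, r'\s', ', ').alias('a'))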
df.show(truncate=False)
# +--------------------------------+
# |a |
# +--------------------------------+
# |#SaveTheInternet, #NetNeutrality|
# |#NALCABPolicy2018 |
# |#NetNeutrality |
# +--------------------------------+
I ended up using two regexp_replace calls, so there may well be a better alternative; I just couldn't think of one.

Insert cell's logic into another cell's logic in Google Sheets

I have a column in Google Sheets where each cell contains pre-defined logic. For example, something like the second column in this table:
| 1 | =A1*-1 |
| 2 | =B2*-1 |
| -3 | =C2*-1 |
Let's say later I want to add the same logic to each cell in column B. For example, make it such that it looks like:
| 1 | =MAX(A1*-1,0) |
| 2 | =MAX(B2*-1,0) |
| -3 | =MAX(C2*-1,0) |
What is the fastest way to do this, besides manually typing MAX(...,0) in each cell? Normal Sheets functions act on the value of the cell, not the logic, so I'm a bit lost.
To my knowledge there isn't a function that pipes in the logic from one cell to another ...

try:
=ARRAYFORMULA(IF(A1:A="",,IF(SIGN(A1:A)<0, A1:A*-1, 0)))
=ARRAYFORMULA(IF(A1:A="",,IF(SIGN(A1:A)>0, A1:A, 0)))
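
As a quick sanity check, the sketch below restates the first array formula in Python and compares it, for the sample values in the question, with the MAX(A*-1, 0) logic the question wants. The function names are illustrative, and None stands in for a blank cell; this is only a restatement of the formulas above, not something Sheets itself runs.

```python
def answer_formula(a):
    """One cell of IF(A="",,IF(SIGN(A)<0, A*-1, 0)); None models a blank cell."""
    if a is None:
        return None
    return a * -1 if a < 0 else 0

def desired_formula(a):
    """The per-cell logic from the question: MAX(A*-1, 0)."""
    return max(a * -1, 0)

values = [1, 2, -3]  # column A from the question's example
results = [answer_formula(v) for v in values]
print(results)
```

The second formula follows the same pattern with the sign test flipped, which corresponds to MAX(A, 0).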

Keep words starting with character/letter in Pandas | Python

I'm not sure how to do this in a dataframe context.
I have the table below with text information:
TEXT |
-------------------------------------------|
"Get some new #turbo #stacks today!" |
"Is it one or three? #phone" |
"Mayhaps it be three afterall..." |
"So many new issues with phone... #iphone" |
And I want to edit it down to where only the words with a '#' symbol are kept, like in the result below.
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"" |
"#iphone" |
In some cases, I'd also like to know if it's possible to eliminate the rows that are empty by checking for NaN as true or if you run a different kind of condition to get this result:
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"#iphone" |
I'm using Python 2.7 and pandas for this.

You could try using a regex with extractall (rows without any match are dropped from the result):
df.TEXT.str.extractall(r'(#\w+)').groupby(level=0)[0].apply(' '.join)
Output:
0 #turbo #stacks
1 #phone
3 #iphone
Name: 0, dtype: object
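
If you also want the first variant from the question, where rows without hashtags are kept as empty strings, one possible sketch is to use str.findall followed by str.join, and then filter out the empty strings when needed. The variable names tags and non_empty are illustrative, and the DataFrame is rebuilt here from the question's sample text.

```python
import pandas as pd

df = pd.DataFrame({"TEXT": [
    "Get some new #turbo #stacks today!",
    "Is it one or three? #phone",
    "Mayhaps it be three afterall...",
    "So many new issues with phone... #iphone",
]})

# One string per row; a row with no hashtags yields an empty list,
# which joins to the empty string "".
tags = df["TEXT"].str.findall(r"#\w+").str.join(" ")

# Optional second step: drop the rows that ended up empty.
non_empty = tags[tags != ""]

print(tags.tolist())
print(non_empty.tolist())
```

Because the empty rows are plain empty strings rather than NaN, the filter compares against "" instead of using dropna.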

how to create a subcolumn inside a column in Gtk+ using C++

I am creating a list view with 5 columns in Gtk+ using C++, and I was able to do that. The problem is that I need subcolumns under the 2nd column, and I'm not sure how to proceed.
| first column | second column   | third |
|              | SC1 | SC2 | SC3 |       |
|              |     |     |     |       |
Is this possible? Can you suggest how to go about it?

How to capture only part of an id?

I'm trying to capture the id of an element that will be randomly generated. I can successfully capture the value of my element id like this...
| storeAttribute | //div[1]@id | variableName |
Now my variable will be something like...
divElement-12345
I want to remove 'divElement-' so that the variable I am left with is '12345' so that I can use it later to select the 'form-12345' element associated with it...something like this:
| type | //tr[@id='form-${variableName}']/td/form/fieldset/p[1]/input | Type this |
How might I be able to accomplish this?

You have two options in Selenium: XPath and CSS selectors. I have read that CSS selectors are better for running tests in both Firefox and IE.
Using the latest version of Selenium IDE (3/5/2009), I've had success with storeEval, which evaluates JavaScript expressions and so gives you access to JavaScript string functions.
XPath:
storeAttribute | //div[1]@id | divID
storeEval | '${divID}'.replace("divElement-", "") | number
type | //tr[@id='form-${number}']/td/form/fieldset/p[1]/input | Type this
CSS Selector:
storeAttribute | css=div[1]@id | divID
storeEval | '${divID}'.replace("divElement-", "") | number
type | css=tr[id='form-${number}'] > td > form > fieldset > p[1] > input | Type this

There are many functions in XPath that should solve your problem. Assuming "divElement-" is a constant that will not change and that you are using XPath 2.0, I would suggest:
substring-after(div[1]/@id, "divElement-")
