Match a string pattern from other data.frame - regex

Suppose I have two dataframes a and b,
a has one column called 'detail':
pure water
wood fire
mineral water
water
fire work
and b has one column called 'type':
water
fire
Many R functions require input text to get match, grep('fire',a), but my question is if there is a way to match a using b? I tried loop but failed. Following SQLDF got all false result for match.
ab <- sqldf(select *,case when detail in (select distinct types from b) then 1 else 0 end as match) from a)
Ideally, one can using something like c <- grep(a$detail,b$types). not sure if it is allowed in R though.
Thanks in advance!

Create a type column in a and then merge on it:
merge(transform(a, type = sub(".* ", "", a$detail)), b, all = TRUE)

Related

How to countif 56 exists in 156/56/2567 and only return true once? Google sheets

I have one sheet with data on my facebook ads. I have another sheet with data on the products in my store. I'm having trouble with some countifs where I'm counting how many times my product ID exists in a row where multiple numbers are. They are formatted like this: /2032/2034/2040/1/
It's easy on the rows where only one product ID exists but some rows have multiple ID's separated by a /. And I need to see if the ID exists as a exact match alone or somewhere between the /'s.
Rows with facebook ads data:
A1: /2032/2034/2040/1/
A2: /1548/84/2154/2001/
A3: /2032/1689/1840/2548/
Row with product data:
B1: 2034
C1: I need a countifs here that checks how many times B1 exists in column A. Lets say I have thousands of rows with different variations of A1 where B1 could standalone. How do I count this? I always need exact matches.
You can compare the number you want (56) with the REGEX #MonkeyZeus commented whith a little change -> "(?:^|/)"&B1&"(?:/|$)" so the end result is:
=IF(REGEXMATCH(A1, "(?:^|/)"&B1&"(?:/|$)"), true, false)
Example:
UPDATE
If you need to count the total of 56 in X rows you can change the "True / False" of the condition for "1 / 0" and then do a =SUM(C1:C5) on the last row:
=IF(REGEXMATCH(A1, "(?:^|/)"&B1&"(?:/|$)"), 1, 0)
UPDATE 2
Thanks for contributing. Unfortunately I'm not able to do it this way
since I have loads of data to do this on. Is there a way to do it with
a countif in a single cell without adding a extra step with "sum"?
In that case you can do:
=COUNTA(FILTER(A:A, REGEXMATCH(A:A, "(?:^|/)"&B2&"(?:/|$)")))
Example:
UPDATE 3
With the following condition you check every single possibility just by adding another COUNTIF:
=COUNTIF(A:A,B1) + COUNTIF(A:A, "*/"&B1) + COUNTIF(A:A, B1&"/*") + COUNTIF(A:A, "*/"&B1&"/*")
Hope this helps!
try:
=COUNTIF(SPLIT(A1, "/"), B1)
UPDATE:
=ARRAYFORMULA(IF(A2<>"", {
SUM(IF((REGEXMATCH(""&DATA!C:C, ""&A2))*(DATA!B:B="carousel"), 1, )),
SUM(IF((REGEXMATCH(""&DATA!C:C, ""&A2))*(DATA!B:B="imagepost"), 1, ))}, ))

Join strings from the same column in ´pandas´ using a placeholder condition

I have a series of data that I need to filter.
The df consists of one col. of information that is separated by a row with with value NaN.
I would like to join all of the rows that occur until each NaN in a new column.
For example my data looks something like:
the
car
is
red
NaN
the
house
is
big
NaN
the
room
is
small
My desired result is
B
the car is red
the house is big
the room is small
Thus far, I am approaching this problema by building a function and applying it to each row in my dataframe. See below for my working code example so far.
def joinNan(row):
newRow = []
placeholder = 'NaN'
if row is not placeholder:
newRow.append(row)
if row == placeholder:
return newRow
df['B'] = df.loc[0].apply(joinNan)
For some reason, the first row of my data is being used as the index or column title, hence why I am using 'loc[0]' here instead of a specific column name.
If there is a more straight forward way to approach this directly iterating in the column, I am open for that suggestion too.
For now, I am trying to reach my desired solution and have not found any other similiar case in Stack overflow or the web in general to help me.
I think for test NaNs is necessary use isna, then greate helper Series by cumsum and aggregate join with groupby:
df=df.groupby(df[0].isna().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
#for oldier version of pandas
df=df.groupby(df[0].isnull().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
Another solution is filter out all NaNs before groupby:
mask = df[0].isna()
#mask = df[0].isnull()
df['g'] = mask.cumsum()
df = df[~mask].groupby('g')[0].apply(' '.join).to_frame('B')

How to remove an part of values in R

I have got a data frame like this:
ID A B
1 x5.11 2,34
2 x5.57 5,36
3 x6,13 0,45
I would like to remove the 'x' of all values of the column A. How might I best accomplish this in R.
Thanks!
I have found a very easy way:
data.frama$A <- gsub("x", "", data.frame$A)

Searching column with string for wildcard match

I can't seem to figure out the VLOOKUP magic needed to make this work as I want it to.
See, what I've got is a column B containing filenames, like this:
[COLUMN B]
./11001 Boogie Oogie Oogie (A Taste Of Honey).wav
./11001 Rescue Me (A Taste Of Honey).wav
./11001 Sukiyaki (A Taste Of Honey).wav
./11002 Memory (Acker Bilk).wav
./11002 Stuck On You (Acker Bilk).wav
./11002 Could I Have This Dance (Acker Bilk).wav
./11002 Do That To Me One More Time (Acker Bilk).wav
./11002 This Masquerade (Acker Bilk).wav
./11002 Just Once (Acker Bilk).wav
And so on for 6220 entries.
I have another column, Column E, which contains a TRACK NAME which is present within the filename. Looks like this:
American Patrol
Artistry In Rhythm
Begin The Beguine
Big John's Special
Cherokee
For example. So what I want to do is, in another column I want to search through Column B using the strings from Column E and then returning the matched string from Column B.
So if we imagine I put this formula in the C Column starting in the same row as the American Patrol track name, it would search through the range in Column B and return this:
./11249 American Patrol (BBC Big Band).wav
./11249 Artistry In Rhythm (BBC Big Band).wav
./11249 Begin The Beguine (BBC Big Band).wav
And so on.
I tried doing this formula
=VLOOKUP(E2;B2:B6235;2;TRUE)
So, this returns a file name, but it seems to have matched all the filenames and are just returning whichever result I specify in the col_index variable, so now it returns the second match (basically, just the second row in Column B) and if I put a 3 instead, it would just return the third hit, again having matched all the file names, it seems..
I'm not that familiar with Excel functions, so I'm not sure where to look for the solution beyond this.
You should not be using TRUE as a VLOOKUP function's range_lookup parameter on unsorted data. You can, however, wrap your track title in wildcards to achieve the search you are looking for.
      
The formula in C1 is,
=INDEX(B:B, MATCH("*"&E1&"*",B:B, 0))
... or,
=VLOOKUP("*"&E1&"*",B:B, 1, FALSE)
They accomplish the same thing.

Piglatin to find if a column contains the contents of another column

I have a seemingly very simple problem but I just can't seem to figure it out.
I have data that looks like this :
A (B, C, A)
B (X, Y, Z)
C (F, C, D)
I am using Pig latin to check if the text in the first column is present in the second column.
This is my code for now:
Labels = LOAD 'example.txt' USING PigStorage('\t');
Projected = FOREACH Labels GENERATE $0 AS id, $1 AS group;
X = FILTER Projected BY (group matches '.*(chararray)id.*');
STORE X INTO '/test' USING PigStorage(',');
The output I am expecting is:
A (B, C, A)
C (F, C, D)
I also tried concatenating the ".*" to the id but it was of no avail.
I've been stuck with this for quite sometime and any help would be greatly appreciated. Thanks!
There's two problems, one you can't name your variable group because that's a reserved word, two you're matching the string "(chararray)id", not the id.
Also IMO I think it's cleaner never to assign variables by index, and just to define them in your load statement, you can remove the Projected alias if you do this.
Labels = LOAD 'example.txt' USING PigStorage('\t') AS
(id:chararray, stringvalue:chararray);
X = FILTER Labels BY (stringvalue matches CONCAT(CONCAT('.*',id),'.*'));
STORE X INTO '/test' USING PigStorage(',');
Tested this, it worked.