matching two partial strings in a cell in R - regex

I've read other articles, such as:
Selecting rows where a column has a string like 'hsa..' (partial string match)
How do I select variables in an R dataframe whose names contain a particular string?
Subset data to contain only columns whose names match a condition
but most of them cover a simpler case:
they only have one string to match
they only have one partial string to match
So I'm here to ask for help.
Let's say we have a sample data table like this:
library(data.table)
sample = data.table('Feb FY2016', 50)
sample = rbind(sample, list('Mar FY2017', 30))
sample = rbind(sample, list('Feb FY2017', 40))
sample = rbind(sample, list('Mar FY2016', 10))
colnames(sample) = c('month', 'unit')
How can I subset the data so that it contains only the rows whose "month" column satisfies the following requirements:
has a year of 2016
starts with either 'Mar' or 'Feb'
Thanks!

Since grep returns indices of items it matches, it will return the rows that match the pattern, and can be used for subsetting.
sample[grep('^(Feb|Mar).*2016$', sample$month),]
# month unit
# 1: Feb FY2016 50
# 2: Mar FY2016 10
The regex looks for
the start of the line ^;
followed by Feb or Mar with (Feb|Mar);
any character . repeated 0 to many times *;
2016 exactly;
followed by the end of the string $.
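As a rough cross-check of the pattern outside R, here is a minimal Python sketch. Python is an assumption here (the question is about R), and the list `months` just repeats the sample values from the question.

```python
import re

# The four sample values from the question.
months = ["Feb FY2016", "Mar FY2017", "Feb FY2017", "Mar FY2016"]

# Same pattern as in the grep call above.
pattern = re.compile(r"^(Feb|Mar).*2016$")

# Keep only the values where the pattern matches.
matched = [m for m in months if pattern.search(m)]
print(matched)  # ['Feb FY2016', 'Mar FY2016']
```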

Related

How do I regextract the second date in a string?

I am trying to extract the second date displayed in this string, however my code keeps extracting just the first date in gsheet:
String: BOT +1 1/1 CUSTOM IWM 100 12 SEP 22/7 SEP 22 184/184 PUT/CALL #6.13
This is my code: =REGEXEXTRACT(A3,"(\d{1,2}\s+[A-Za-z]+\s\d{2,4})")
my result: 12 SEP 22
Desired result should be: 7 SEP 22
Appreciate the help, thanks in advance!
Since you already have a working pattern for detecting dates, you can repeat the same pattern outside the parentheses, in front of the capture group. The regex then matches the first date, .+ allows for some characters in between, and your original pattern in parentheses matches the later date. Only that parenthesized part is extracted:
=REGEXEXTRACT(A3,"\d{1,2}\s+[A-Za-z]+\s\d{2,4}.+(\d{1,2}\s+[A-Za-z]+\s\d{2,4})")
Here's one approach to dynamically extract every date in the string, or only the 2nd or 3rd date, as required.
=index(if(len(A:A),lambda(y,regexextract(y,lambda(z,regexreplace(y,"(?i)("&z&")","($1)"))("\d{1,2}\s"&JOIN("\s\d{2}|\d{1,2}\s",INDEX(TEXT(SEQUENCE(12,1,DATE(2022,1,1),31),"MMM")))&"\s\d{2}")))(regexreplace(A:A,"[\(\)/+]","")),))
To pick one specific match, wrap the formula in INDEX with the match number:
=index(formula,,pattern number)
To extract just the second date, you can modify the code as follows:
=REGEXEXTRACT(A3,"\d{1,2}\s+[A-Za-z]+\s\d{2,4}.*(\d{1,2}\s+[A-Za-z]+\s\d{2,4})")
This regular expression \d{1,2}\s+[A-Za-z]+\s\d{2,4}.*(\d{1,2}\s+[A-Za-z]+\s\d{2,4}) will match the first date and the second date in the string, and then extract just the second date.
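To see how the pattern behaves, here is a small Python sketch (Python is an assumption; the question itself is about Google Sheets). It also shows an alternative that collects every date and picks one by position. The names `date`, `via_group`, and `all_dates` are made up for illustration.

```python
import re

text = "BOT +1 1/1 CUSTOM IWM 100 12 SEP 22/7 SEP 22 184/184 PUT/CALL #6.13"
date = r"\d{1,2}\s+[A-Za-z]+\s\d{2,4}"

# Approach from the answers: match the first date, skip ahead,
# then capture a later date in the group.
via_group = re.search(date + r".*(" + date + r")", text).group(1)

# Alternative: collect every date and index into the list.
all_dates = re.findall(date, text)

print(via_group)   # 7 SEP 22
print(all_dates)   # ['12 SEP 22', '7 SEP 22']
```

Because .* is greedy, the capture group ends up on the last date in the string. That is the second date here only because the string contains exactly two dates.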

How to use a Regular Expression to find a Specific word and return following 10 characters?

I need to find a regular expression that finds "Order #" and then return the following 10 characters.
For example, I have the following rows (ignore the row numbers; I'm just using them to mark each new line in the original data):
Row 1 Order #100013661 By John DOE
Row 2 REFUND for CHARGE(Order #100013667 By Lara Croft
Row 3 Order #100013668 By Sammy
Row 4 Blah Blah Blah Order #10013664 By Fluffy fluff
I want the expression to return:
ROW 1 100013661
ROW 2 100013667
Row 3 100013668
Row 4 100013664
Use capturing groups for that:
Order #(.{9})
Use the tools in your hosting language to harvest the capturing group.
The regex you need is
(?<=Order #).{10}
Detailed explanation:
(?<=Order #) is a positive lookbehind: it matches if the literal string Order # occurs before current position;
.{10} matches any 10 characters.
Note that this won't match if fewer than 10 characters follow the search string on the line. If you need to match up to 10 characters, not exactly 10 characters, replace {10} with {1,10}.
Order #(.{10}) or Order #(.{1,10}) if it could be up to 10 characters.
Order #(\d{1,10}) if they are always numbers.
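A minimal sketch of harvesting the capturing group, assuming Python as the hosting language (the question does not name one). The `rows` list copies the sample lines from the question without the row labels.

```python
import re

rows = [
    "Order #100013661 By John DOE",
    "REFUND for CHARGE(Order #100013667 By Lara Croft",
    "Order #100013668 By Sammy",
    "Blah Blah Blah Order #10013664 By Fluffy fluff",
]

# Digits-only variant from the last answer: 1 to 10 digits after "Order #".
pattern = re.compile(r"Order #(\d{1,10})")

order_ids = []
for row in rows:
    match = pattern.search(row)
    # group(1) is the text captured by the parentheses.
    order_ids.append(match.group(1) if match else None)

print(order_ids)
```

With \d the match stops at the first non-digit, so the trailing space that .{10} would pick up after a 9-digit number is not included.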

Separating column using separate (tidyr) via dplyr on a first encountered digit

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator = c("someindicator2001", "someindicator2011",
                                "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))
Desired results
The desired results should look like this:
          indicator   period    values
1     someindicator     2001 0.2655087
2     someindicator     2011 0.3721239
3         some text 20022008 0.5728534
4 another indicator     2003 0.9082078
Characteristics
Indicator descriptions are in one column
Numeric values (counting from the first digit, inclusive) are in the second column
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
  separate(col = indicator, into = c("indicator", "period"),
           sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work:
> head(dta, 2)
  indicator period    values
1              001 0.2655087
2              011 0.3721239
Other attempts
I have also tried the default separation method sep = "[^[:alnum:]]" but it breaks down the column into too many columns as it appears to be matching all of the available digits.
The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
Identifying the first digit in the string
Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
" ?" (a space followed by ?) matches the space character literally, between zero and one time, as many times as possible, giving back as needed
(?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
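The same lookbehind/lookahead boundary can be tried outside tidyr. Below is a rough Python sketch (Python is an assumption, and splitting on a pattern that can match an empty string needs Python 3.7 or later). The names `boundary` and `parts` are made up for illustration.

```python
import re

indicators = ["someindicator2001", "someindicator2011",
              "some text 20022008", "another indicator 2003"]

# A lowercase letter behind, an optional space, a digit ahead.
boundary = re.compile(r"(?<=[a-z]) ?(?=[0-9])")

# Split each string once at that boundary; the lookarounds consume nothing,
# so only the optional space is dropped.
parts = [boundary.split(s, maxsplit=1) for s in indicators]

for description, period in parts:
    print(repr(description), repr(period))
```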
You could also use unglue::unglue_unnest():
dta <- data.frame(indicator = c("someindicator2001", "someindicator2011",
                                "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#> values indicator period
#> 1 0.43234262 someindicator 2001
#> 2 0.65890900 someindicator 2011
#> 3 0.93576805 some text 20022008
#> 4 0.01934736 another indicator 2003
Created on 2019-09-14 by the reprex package (v0.3.0)

R - Need to subset a data frame using matches from a regex expression

I'm looking to subset a data frame based on matches from a regular expression that scans a single column, and returns the data in all the rows where column 2 has a match from the regular expression.
I'm using R 3.01 and I'm a relatively inexperienced R programmer.
My data frame looks like this:
data:
       Column 1  Column2  Column 3
Row 1  data      string   data
Row 2  data      string   data
Row 3  data      string   data
Row 4  data      string   data
I'm using the following to scan column 2:
data$Column2[grep("word1", data$Column2, perl=TRUE)]
So far, I get all the strings returned from column2 that contain word1, but I'm looking to subset the entire row(s) where those matches are found.
new.data.frame <- data[grep("word1", data$Column2, perl=TRUE), ]

Regex: How to remove repetition of the same string?

I'm trying to find the year from the date.
The dates are in the following formats:
"Nov.-Dec. 2010"
"Aug. 30 2011-Sept. 3 2011"
"21-21 Oct. 1997"
My regular expression is:
import re
q = re.compile(r"\d\d\d\d")
a = q.findall(date)
So, obviously, the list has two items for a string like "Aug. 30 2011-Sept. 3 2011":
["2011","2011"]
I don't want the repetition. How do I do that?
You could use a backreference in the regex (see the syntax here):
(\d{4}).*\1
Or you could use the current regex and put this logic in the python code:
if a[0] == a[1]:
    ...
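A quick sketch of how the backreference pattern behaves on the sample dates from the question. The names `repeated_year` and `results` are made up for illustration.

```python
import re

# \1 must repeat whatever the first group captured.
repeated_year = re.compile(r"(\d{4}).*\1")

dates = ["Aug. 30 2011-Sept. 3 2011", "Nov.-Dec. 2010", "21-21 Oct. 1997"]

results = {}
for date in dates:
    match = repeated_year.search(date)
    # No match means the year was not repeated in the string.
    results[date] = match.group(1) if match else None

print(results)
```

The pattern only matches when the year actually appears twice, so strings with a single year come back as None and need a fallback such as the plain \d{4} search.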
Use the following function:
import re

def getUnique(date):
    q = re.compile(r"\d\d\d\d")
    output = []
    for x in q.findall(date):
        if x not in output:
            output.append(x)
    return output
It's O(n^2) though, because of the repeated use of "not in" for each element of the input list.
See How to remove duplicates from Python list and keep order?
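One way to avoid the repeated list scan is to let a dict do the deduplication. This is a minimal sketch, assuming Python 3.7 or later, where dicts keep insertion order. The function name `unique_years` is made up for illustration.

```python
import re

YEAR = re.compile(r"\d{4}")

def unique_years(date):
    """Return the four-digit years in `date`, each once, in order of first appearance."""
    # dict keys are unique and keep insertion order, so this drops repeats
    # without scanning the output list for every element.
    return list(dict.fromkeys(YEAR.findall(date)))

print(unique_years("Aug. 30 2011-Sept. 3 2011"))  # ['2011']
print(unique_years("Dec. 2010-Jan. 2011"))        # ['2010', '2011']
```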