Regex: How to remove repetition of the same string? - regex

I'm trying to extract the year from a date string.
The dates come in formats like these:
"Nov.-Dec. 2010"
"Aug. 30 2011-Sept. 3 2011"
"21-21 Oct. 1997"
My regular expression is:
import re
q = re.compile(r"\d\d\d\d")
a = q.findall(date)
So for a string like "Aug. 30 2011-Sept. 3 2011" the list obviously contains two items:
["2011", "2011"]
I don't want the repetition. How do I avoid it?

You could use a backreference in the regex, which matches only when the same four digits occur twice in the string:
(\d{4}).*\1
Or you could keep the current regex and put this logic in the Python code:
if len(a) > 1 and a[0] == a[1]:
    ...
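One caveat with the backreference is that it matches only when the year appears twice, so a string like "Nov.-Dec. 2010" needs a fallback. Here is a minimal Python sketch, assuming a single year per date string is wanted; the helper name first_year is hypothetical and not part of the original answer:

```python
import re

# Matches a four-digit year that occurs again later in the string.
REPEATED_YEAR = re.compile(r"(\d{4}).*\1")
# Fallback for strings that contain the year only once.
ANY_YEAR = re.compile(r"\d{4}")


def first_year(date):
    """Return the year as a string, or None if no four-digit run is found."""
    repeated = REPEATED_YEAR.search(date)
    if repeated:
        return repeated.group(1)
    single = ANY_YEAR.search(date)
    return single.group(0) if single else None


for sample in ["Nov.-Dec. 2010", "Aug. 30 2011-Sept. 3 2011", "21-21 Oct. 1997"]:
    print(sample, "->", first_year(sample))
```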

Use the following function:
import re
def getUnique(date):
    q = re.compile(r"\d\d\d\d")
    output = []
    for x in q.findall(date):
        # keep only the first occurrence of each year, preserving order
        if x not in output:
            output.append(x)
    return output
It's O(n^2) though, because "not in" scans the output list once for every element of the input list.
See "How to remove duplicates from Python list and keep order?"
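The quadratic scan can be avoided by letting a dict do the deduplication. A minimal sketch, assuming Python 3.7 or later (where dicts preserve insertion order); the name unique_years is hypothetical:

```python
import re

YEAR = re.compile(r"\d{4}")


def unique_years(date):
    """Return the distinct four-digit runs in `date`, in order of first appearance."""
    # dict.fromkeys keeps the first occurrence of each key and drops repeats.
    return list(dict.fromkeys(YEAR.findall(date)))


print(unique_years("Aug. 30 2011-Sept. 3 2011"))
print(unique_years("Dec. 30 2011-Jan. 3 2012"))
```

For date strings this short the difference is negligible; the dict version mostly pays off when the same idea is applied to long lists.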

Related

How do I regextract the second date in a string?

I am trying to extract the second date displayed in this string, but my formula in Google Sheets keeps extracting just the first date:
String: BOT +1 1/1 CUSTOM IWM 100 12 SEP 22/7 SEP 22 184/184 PUT/CALL #6.13
This is my code: =REGEXEXTRACT(A3,"(\d{1,2}\s+[A-Za-z]+\s\d{2,4})")
my result: 12 SEP 22
Desired result should be: 7 SEP 22
Appreciate the help, thanks in advance!
Since you already have a working pattern for detecting a date, you can repeat the same pattern outside the parentheses, in front of the capture group. The regex then matches the first date, .+ consumes the characters in between, and only the date matched by the parenthesized part is extracted:
=REGEXEXTRACT(A3,"\d{1,2}\s+[A-Za-z]+\s\d{2,4}.+(\d{1,2}\s+[A-Za-z]+\s\d{2,4})")
Here's one approach that dynamically extracts every date in the string, or just the 2nd or 3rd date, as required.
=index(if(len(A:A),lambda(y,regexextract(y,lambda(z,regexreplace(y,"(?i)("&z&")","($1)"))("\d{1,2}\s"&JOIN("\s\d{2}|\d{1,2}\s",INDEX(TEXT(SEQUENCE(12,1,DATE(2022,1,1),31),"MMM")))&"\s\d{2}")))(regexreplace(A:A,"[\(\)/+]","")),))
To pick one specific match, wrap the formula in INDEX together with the position of the match you want:
=index(formula,,pattern number)
To extract just the second date, you can modify the code as follows:
=REGEXEXTRACT(A3,"\d{1,2}\s+[A-Za-z]+\s\d{2,4}.*(\d{1,2}\s+[A-Za-z]+\s\d{2,4})")
This regular expression matches the first date, then any characters in between, then the second date. Only the second date sits inside the capture group, so only the second date is extracted.
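For comparison, here is the same idea as a Python sketch rather than a Sheets formula. It assumes the date pattern from the question and uses the hypothetical helper nth_date. Collecting all matches and indexing into them avoids one pitfall of the greedy .+ or .* approach: in Python's re, at least, the greedy part can swallow the first digit of a two-digit day in the second date.

```python
import re

# Same date pattern as in the question: day, month abbreviation, year.
DATE = re.compile(r"\d{1,2}\s+[A-Za-z]+\s\d{2,4}")


def nth_date(text, n):
    """Return the n-th date (1-based) found in `text`, or None if there are fewer."""
    matches = DATE.findall(text)
    return matches[n - 1] if len(matches) >= n else None


row = "BOT +1 1/1 CUSTOM IWM 100 12 SEP 22/7 SEP 22 184/184 PUT/CALL #6.13"
print(nth_date(row, 1))
print(nth_date(row, 2))
```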

matching two partial strings in a cell in R

I've read other questions, such as:
Selecting rows where a column has a string like 'hsa..' (partial string match)
How do I select variables in an R dataframe whose names contain a particular string?
Subset data to contain only columns whose names match a condition
but most of them cover a simpler case:
they only have one string to match
they only have one partial string to match
So I'm here to ask for help.
Let's say we have a sample data table like this:
sample = data.table('Feb FY2016', 50)
sample = rbind(sample, list('Mar FY2017', 30))
sample = rbind(sample, list('Feb FY2017', 40))
sample = rbind(sample, list('Mar FY2016', 10))
colnames(sample) = c('month', 'unit')
How can I subset the data so that it contains only the rows whose "month" column satisfies the following requirements:
has a year of 2016
starts with either 'Mar' or 'Feb'
Thanks!
Since grep returns indices of items it matches, it will return the rows that match the pattern, and can be used for subsetting.
sample[grep('^(Feb|Mar).*2016$', sample$month),]
# month unit
# 1: Feb FY2016 50
# 2: Mar FY2016 10
The regex looks for
the start of the line ^;
followed by Feb or Mar with (Feb|Mar);
any character . repeated 0 to many times *;
2016 exactly;
followed by the end of the string $.
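The same pattern can be tried outside R. Here is a minimal Python sketch, assuming the sample rows are held as plain (month, unit) tuples; the names rows and matching are illustrative only:

```python
import re

# The sample data from the question as (month, unit) tuples.
rows = [
    ("Feb FY2016", 50),
    ("Mar FY2017", 30),
    ("Feb FY2017", 40),
    ("Mar FY2016", 10),
]

# Starts with Feb or Mar, then anything, then ends with 2016.
PATTERN = re.compile(r"^(Feb|Mar).*2016$")

# Keep only the rows whose month column matches both conditions.
matching = [row for row in rows if PATTERN.search(row[0])]

for month, unit in matching:
    print(month, unit)
```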

Delete numbers not dates in R (regex)

I want to remove numbers (integers and floats) from a character vector, preserving dates:
"I'd like to delete numbers like 84 and 0.5 but not dates like 2015"
I would like to get:
"I'd like to delete numbers like and but not dates like 2015"
In English a quick and dirty rule could be: if the number starts with 18, 19, or 20 and has length 4, don't delete.
I asked the same question in Python and the answer was very satisfying (\b(?!(?:18|19|20)\d{2}\b(?!\.\d))\d*\.?).
However, when I pass the same regex to gsub in R:
gsub("[\b(?!(?:18|19|20)\d{2}\b(?!\.\d))\d*\.?]"," ", "I'd like to delete numbers like 84 and 0.5 but not dates like 2015")
I get:
Error: '\d' is an unrecognized escape in character string starting ""\b(?!(?:18|19|20)\d"
As I mentioned in my comments, the main points here are:
the regex pattern must not be wrapped in [...]: inside a character class it is treated as a set of separate symbols, not as a sequence of subpatterns
the backslashes must be doubled in R regex patterns, since R string literals use \ to introduce escapes such as \n and \r
you need perl=TRUE for patterns featuring lookarounds (yours uses lookaheads)
Use
gsub("\\b(?!(?:18|19|20)\\d{2}\\b(?!\\.\\d))\\d*\\.?\\d+\\b"," ", "I'd like to delete numbers like 84 and 0.5 but not dates like 2015", perl=TRUE)
See IDEONE demo.
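For reference, the corrected pattern can also be checked in Python, where a raw string makes the doubled backslashes unnecessary. This is a minimal sketch; the helper strip_numbers and its whitespace collapsing are assumptions added here to produce the single-spaced output shown in the question:

```python
import re

# A number that is NOT a standalone four-digit year starting with 18, 19 or 20.
NUMBER_NOT_YEAR = re.compile(r"\b(?!(?:18|19|20)\d{2}\b(?!\.\d))\d*\.?\d+\b")


def strip_numbers(text):
    """Remove integers and floats from `text`, keeping year-like values."""
    without_numbers = NUMBER_NOT_YEAR.sub(" ", text)
    # Collapse the runs of spaces left behind by the removed numbers.
    return re.sub(r"\s+", " ", without_numbers).strip()


print(strip_numbers("I'd like to delete numbers like 84 and 0.5 but not dates like 2015"))
```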
To search and replace in R you can use:
gsub("\\b(?!(?:18|19|20)\\p{Nd}{2}\\b(?!\\.\\p{Nd}))\\p{Nd}*\\.?", "replacement_text_here", subject, perl=TRUE);

Regexp matched values subpattern as subarray

My regular expression: https://regex101.com/r/oF7pM8/1
I get http://joxi.ru/J2b54KaI40bbwm
But I need to get all the "num" values (all the digits), collected in an array "num".
This is what I need to get:
name = house
num = [3 4 5 6 7 8 9]
What am I doing wrong?
P.S.: this is a Python regular expression.
The pattern must find all the numbers separately (as an array).
Does (?P<name>house)(?:\s(?P<num>(\d\s+)+)\d?)+? do the job?
With my additions to your original: (?P<name>house)(?:\s(?P<num>(\d\s+)+)\d?)+?
With that, only the last digit is found, not all of them. I need all of them.
re.match does match all of them, but a repeated capture group keeps only its last repetition. Since you have to post-process the matches anyway in order to assign them to the Python variables name and num, make the pattern simple:
import re
test_string = 'house 3 44 555 6666 777 88 9'
m = re.match(r'(house)((\s\d+)+)', test_string)
name = m.group(1)
num = [int(s) for s in m.group(2).split()]
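To see why the named group alone is not enough, here is a small sketch of the behaviour described above, assuming the input "house 3 4 5 6 7 8 9" from the question. The variable names repeated and numbers are illustrative. A repeated group reports only its final repetition, so the digits are collected with a separate re.findall over the rest of the string:

```python
import re

test_string = "house 3 4 5 6 7 8 9"

# The (?P<num>...) group is repeated, so it retains only its final repetition.
repeated = re.match(r"(?P<name>house)(?:\s(?P<num>\d+))+", test_string)
print(repeated.group("num"))

# Collect every number by scanning the part of the string after the name.
name = repeated.group("name")
numbers = [int(n) for n in re.findall(r"\d+", test_string[repeated.end("name"):])]
print(name, numbers)
```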

R - split string before two last digits in each column cell

I have a csv with usernames in a column, followed by each user's feedback rating, out of 100.
E.g. James89
I hope to find a way to split the name and the rating, e.g. by inserting a comma before the two last digits using regex. Is this possible? And/or is there a better way to do this?
df1 = data.frame(Product = c(rep("ARCH78"), rep("AUSFUNGUY91"), rep("AddiesAndXans96"), rep("AfroBro79")))
The code above is a tiny excerpt of the data I'm dealing with. I hope to get this output:
ARCH 78
AUSFUNGUY 91
AddiesAndXans 96
AfroBro 79
I've tried this code (inspired by this answer):
df1$P2 <- gsub("(.*?)(..)", "\\1", df1$Product)
It seems to be working, but there's something wrong with the output:
ARCH78 AR
AUSFUNGUY91 AUUNY
AddiesAndXans96 AdesdXs
AfroBro79 AfBr9
As for the following:
I hope to find a way to split the name and the rating, e.g. by inserting a comma before the two last digits using regex.
You can achieve that with a single gsub call:
df1 = data.frame(Product = c(rep("ARCH78"), rep("AUSFUNGUY91"), rep("AddiesAndXans96"), rep("AfroBro79")))
gsub("(\\d{2})$",",\\1",df1$Product)
## => [1] "ARCH,78" "AUSFUNGUY,91" "AddiesAndXans,96" "AfroBro,79"
See IDEONE demo
You can further adjust the replacement ",\\1", which features a backreference \1 to the last 2 digits.
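If the goal is two separate values rather than a comma-joined string, the same anchoring idea works with two capture groups. Here is a minimal Python sketch, assuming every value ends in exactly two rating digits; the helper split_rating is hypothetical:

```python
import re

# Lazily capture the name, then exactly two digits anchored at the end.
TRAILING_RATING = re.compile(r"^(.*?)(\d{2})$")


def split_rating(value):
    """Split 'ARCH78' into ('ARCH', 78); return None if there is no trailing rating."""
    m = TRAILING_RATING.match(value)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


for product in ["ARCH78", "AUSFUNGUY91", "AddiesAndXans96", "AfroBro79"]:
    print(split_rating(product))

# The comma-inserting substitution from the answer, written with Python's re.sub.
print(re.sub(r"(\d{2})$", r",\1", "ARCH78"))
```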