Matching until first occurrence of a term followed by a character

I would like to match only the string that has a single search term (in my example, that is only the first string). Multiple search terms within a string are separated by a + sign:
jobs?search=term1&location=&distance=10+page=2
jobs?search=term1+term2&location=ca&distance=30
jobs?search=term1+term2+term3&location=nyc&distance=25
My idea was to match any word (preceded by search=) that is not followed by + but is followed by &:
search=.*?[^+]&
But it doesn't quite work: it also captures strings with multiple terms.

You need to use
[&?]search=([^&+]+)(?=&|$)
See the regex demo
It will match:
[&?] - a ? or & (to make sure search is the whole key name)
search= - a literal substring
([^&+]+) - Group 1 capturing 1+ symbols other than + and &
(?=&|$) - a lookahead requiring a & or the end of string to appear immediately after the last symbol captured with the preceding subpattern (note it can be replaced with a non-capturing group, (?:&|$); the value will still be in Group 1).
Python demo:
import re
ss = ['jobs?search=term1&location=&distance=10+page=2','jobs?search=term1+term2&location=ca&distance=30','jobs?search=term1+term2+term3&location=nyc&distance=25']
rx = re.compile(r'[&?]search=([^&+]+)(?=&|$)')
for s in ss:
    m = rx.search(s)
    if m:
        print("{}: {}".format(s, m.group(1)))
Base R:
ss <- c('jobs?search=term1&location=&distance=10+page=2','jobs?search=term1+term2&location=ca&distance=30','jobs?search=term1+term2+term3&location=nyc&distance=25')
results <- regmatches(ss, regexec("[&?]search=([^&+]+)(?:&|$)",ss))
unlist(results)[2]
... or with R stringr:
> library(stringr)
> ss <- c('jobs?search=term1&location=&distance=10+page=2','jobs?search=term1+term2&location=ca&distance=30','jobs?search=term1+term2+term3&location=nyc&distance=25')
> results <- str_match(ss, "[&?]search=([^&+]+)(?:&|$)")
> results[,2]
[1] "term1" NA NA

If you want to only capture the term and not the preceding search=:
(?<=search=)[^+]*?(?=&|$)
(?<=search=) - Positive Lookbehind to ensure the search= precedes the term
[^+]*? - matches the term (and makes sure it doesn't include any +). This is a non-greedy match (using *?), so the match stops at the first & that follows
(?=&|$) - Positive Lookahead to ensure the term is followed by either a & or end of string ($)
Regex101 Demo
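For comparison, here is a minimal Python sketch of the lookbehind variant, run against the three sample strings from the question. The names samples, rx_lookbehind and results are illustrative only and do not come from the answer.

```python
import re

samples = [
    'jobs?search=term1&location=&distance=10+page=2',
    'jobs?search=term1+term2&location=ca&distance=30',
    'jobs?search=term1+term2+term3&location=nyc&distance=25',
]

# The lookbehind keeps "search=" out of the match itself;
# the lazy [^+]*? stops at the first & (or at the end of the string).
rx_lookbehind = re.compile(r'(?<=search=)[^+]*?(?=&|$)')

results = []
for s in samples:
    m = rx_lookbehind.search(s)
    results.append(m.group(0) if m else None)

print(results)  # ['term1', None, None]
```

Note that (?<=search=) alone would also accept a longer key that merely ends in search=, such as research=. If that matters, the [&?] check from the first answer is probably the safer choice.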

Related

How to capture group no of every group in a repeated capturing group

My regex is something like (A)(([+-]\d{1,2}[YMD])*), which matches strings like A+3M, A-3Y+5M+3D, etc. as expected.
But I want to capture every repetition of the subpattern ([+-]\d{1,2}[YMD])*.
For the example A-3M+2D, I can see only 4 groups: A-3M+2D (group 0), A (group 1), -3M+2D (group 2), +2D (group 3).
Is there a way I can get -3M as a separate group?

Repeated capturing groups usually capture only the last iteration. This is true for Kotlin, as well as Java, as the languages do not have any method that would keep track of each capturing group stack.
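The same last-iteration behavior can be observed in Python's re module. The sketch below is only an illustration in a different language (the answer itself is about Kotlin and Java), using the pattern from the question.

```python
import re

# Group 3 sits inside the repeated part, so it is overwritten on every
# iteration and only the last iteration ("+2D") survives.
m = re.fullmatch(r'(A)(([+-]\d{1,2}[YMD])*)', 'A-3M+2D')
groups = m.groups()

print(groups)  # ('A', '-3M+2D', '+2D')
```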
What you may do as a workaround, is to first validate the whole string against a certain pattern the string should match, and then either extract or split the string into parts.
For the current scenario, you may use
val text = "A-3M+2D"
if (text.matches("""A(?:[+-]\d{1,2}[YMD])*""".toRegex())) {
    val results = text.split("(?=[-+])".toRegex())
    println(results)
}
// => [A, -3M, +2D]
See the Kotlin demo
Here,
text.matches("""A(?:[+-]\d{1,2}[YMD])*""".toRegex()) makes sure the whole string matches A and then 0 or more occurrences of + or -, 1 or 2 digits followed with Y, M or D
.split("(?=[-+])".toRegex()) splits the text with an empty string right before a - or +.
Pattern details
^ - implicit in .matches() - start of string
A - an A substring
(?: - start of a non-capturing group:
[+-] - a character class matching + or -
\d{1,2} - one to two digits
[YMD] - a character class that matches Y or M or D
)* - end of the non-capturing group, repeat 0 or more times (due to * quantifier)
\z - implicit in matches() - end of string.
When splitting, we just need to find locations before - or +, hence we use a positive lookahead, (?=[-+]), that matches a position that is immediately followed with + or -. It is a non-consuming pattern, the + or - matched are not added to the match value.
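The validate-then-extract idea carries over to other regex flavors. Below is a rough Python sketch under the assumption that extracting with findall is an acceptable substitute for the split used above. The helper name split_offsets is hypothetical.

```python
import re

def split_offsets(text):
    """Validate the whole string first, then pull out each signed part."""
    # Step 1: the whole string must be A followed by 0+ signed offsets.
    if not re.fullmatch(r'A(?:[+-]\d{1,2}[YMD])*', text):
        return None  # the string does not have the expected shape
    # Step 2: validation passed, so a simple extraction is safe.
    return ['A'] + re.findall(r'[+-]\d{1,2}[YMD]', text)

print(split_offsets('A-3M+2D'))  # ['A', '-3M', '+2D']
print(split_offsets('A-3X'))     # None
```

Because the format is checked up front, the extraction pattern does not need to guard against malformed input on its own.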
Another approach with a single regex
You may also use a \G based regex to check the string format first at the start of the string, and only start matching consecutive substrings if that check is a success:
val regex = """(?:\G(?!^)[+-]|^(?=A(?:[+-]\d{1,2}[YMD])*$))[^-+]+""".toRegex()
println(regex.findAll("A-3M+2D").map{it.value}.toList())
// => [A, -3M, +2D]
See another Kotlin demo and the regex demo.
Details
(?:\G(?!^)[+-]|^(?=A(?:[+-]\d{1,2}[YMD])*$)) - either the end of the previous successful match and then + or - (see \G(?!^)[+-]) or (|) start of string that is followed with A and then 0 or more occurrences of +/-, 1 or 2 digits and then Y, M or D till the end of the string (see ^(?=A(?:[+-]\d{1,2}[YMD])*$))
[^-+]+ - 1 or more chars other than - and +. We need not be too careful here since the lookahead did the heavy lifting at the start of string.

Relevant Regular Expression in scala

I want to keep only the last term of a string separated by dots
Example:
My string is:
abc"val1.val2.val3.val4"zzz
Expected string after I apply the regex:
abc"val4"zzz
In other words, I want to keep only the part after the last dot (.).
The most relevant I tried was
val json="""abc"val1.val2.val3.val4"zzz"""
val sortie="""(([A-Za-z0-9]*)\.([A-Za-z0-9]*){2,10})\.([A-Za-z0-9]*)""".r.replaceAllIn(json, a=> a.group(3))
the result was:
abc".val4"zzz
Can you suggest a different regex solution, please?
Thanks

You may use
val s = """abc"val1.val2.val3.val4"zzz"""
val res = "(\\w+\")[^\"]*\\.([^\"]*\")".r replaceAllIn (s, "$1$2")
println(res)
// => abc"val4"zzz
See the Scala demo
Pattern details:
(\\w+\") - Group 1 capturing 1+ word chars and a "
[^\"]* - 0+ chars other than "
\\. - a dot
([^\"]*\") - Group 2 capturing 0+ chars other than " and then a ".
The $1 is the backreference to the first group and $2 inserts the text inside Group 2.
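As a cross-check, the same pattern can be tried in Python, where backreferences in the replacement are written \1 and \2 instead of $1 and $2. This is only a sketch of the same idea in another language, not part of the Scala answer.

```python
import re

s = 'abc"val1.val2.val3.val4"zzz'

# Group 1: word chars plus the opening quote.
# [^"]*\. greedily consumes everything up to the last dot inside the quotes.
# Group 2: the final token plus the closing quote.
res = re.sub(r'(\w+")[^"]*\.([^"]*")', r'\1\2', s)

print(res)  # abc"val4"zzz
```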

Maybe without Regex at all:
scala> json.split("\"").map(_.split("\\.").last).mkString("\"")
res4: String = abc"val4"zzz
This assumes you want each "token" (separated by ") to become the last dot-separated inner token.
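The split-based idea can be sketched in Python as well. It is shown here only to illustrate the technique, with the illustrative variable name res_split.

```python
s = 'abc"val1.val2.val3.val4"zzz'

# Split on the double quote, keep the last dot-separated piece of each
# token, then glue the tokens back together with double quotes.
res_split = '"'.join(token.split('.')[-1] for token in s.split('"'))

print(res_split)  # abc"val4"zzz
```

As with the Scala one-liner, tokens outside the quotes are reduced in the same way, so a dot in abc or zzz would also be trimmed.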

capture repetition of letters in a word with regex

I'm trying to detect words that contain repeated letters, and I would like to replace each run with a single instance of the repeated letter. The text is in Hebrew. For instance, שללללוווווםםםם should just become שלום.
Basically, when a letter repeats itself 3 times or more, it should be detected and replaced.
I want to use the regular expression with R's gsub.
df$text <- gsub("?", "?", df$text)

You can use
> x = "שללללוווווםםםם"
> gsub("(.)\\1{2,}", "\\1", x)
#[1] "שלום"
NOTE: it will replace any character (not just Hebrew) that is repeated three or more times.
Or use the following to collapse only word characters (letters, digits or _) from any language:
> gsub("(\\w)\\1{2,}", "\\1", x)

If you plan to only remove repeating characters from the Hebrew script (keeping others), I'd suggest:
s <- "שללללוווווםםםם ......... שללללוווווםםםם"
gsub("(\\p{Hebrew})\\1{2,}", "\\1", s, perl=TRUE)
See the regex demo in R
Details:
(\\p{Hebrew}) - Group 1 capturing a character from Hebrew script (as \p{Hebrew} is a Unicode property/category class)
\\1{2,} - 2 or more (due to {2,} limiting quantifier) same characters stored in Group 1 buffer (as \\1 is a backreference to Group 1 contents).
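The difference between the generic and the Hebrew-only pattern can be sketched in Python. Python's built-in re module has no \p{Hebrew} class, so the sketch assumes the Unicode Hebrew block U+0590-U+05FF as an approximation of it. The variable names are illustrative.

```python
import re

# Build the sample text: the stretched word, nine dots, the stretched word.
word = "ש" + "ל" * 4 + "ו" * 5 + "ם" * 4
s = word + " " + "." * 9 + " " + word

# Any character repeated 3+ times collapses to one (the dots too).
collapse_any = re.sub(r'(.)\1{2,}', r'\1', s)

# Assumption: U+0590-U+05FF stands in for \p{Hebrew}.
# Only Hebrew characters collapse; the run of dots is kept as-is.
collapse_hebrew = re.sub(r'([\u0590-\u05FF])\1{2,}', r'\1', s)

print(collapse_any)
print(collapse_hebrew)
```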

regular expression in R, match substring only if things after

my_string = "2011, this year I made 750,000 dollars"
Is there an elegant way to match "2011" and "750,000" in the string above? The idea is to extract numeric values when they look like numeric values, i.e. \d+ or \d+[\.,]?\d*, depending on the presence of a comma after the digits.
I tried this, but it doesn't match exactly what I wanted: I got "2011," which is no good.
library(stringr)
str_match_all(my_string, "(\\d+[\\.,]?\\d*)")
Here is my expected result:
"2011" "750,000"

You can do:
[0-9]+(?:[,.][0-9]+)*
It's very elegant, I tried it in front of a mirror.
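A quick way to check this pattern against the question's string is shown below, using Python purely as a test bench. The pattern itself is flavor-neutral and should behave the same in R.

```python
import re

my_string = "2011, this year I made 750,000 dollars"

# A comma or dot is only consumed when digits follow it, so the comma
# after 2011 (followed by a space) is left out of the match.
numbers = re.findall(r'[0-9]+(?:[,.][0-9]+)*', my_string)

print(numbers)  # ['2011', '750,000']
```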

Here is a single-regex, pure base R approach to extract integer or float values that are not part of a hyphen-separated string of digits:
> str <- "2011, this year I made 750,000 dollars and 750,000-589 here"
> regmatches(str, gregexpr('(?<!\\d-)\\b\\d+(?:[,.]\\d+)?+(?!-)', str, perl=T))[[1]]
[1] "2011" "750,000"
See the IDEONE demo and a regex demo.
Since the regex contains lookarounds, you need to specify the perl=TRUE argument.
Pattern explanation:
(?<!\d-) - a negative lookbehind failing the match when a digit followed with a hyphen precedes the current location
\b\d+ - a word boundary (there cannot be a word char, i.e. a letter, digit or _, right before the first digit) followed with 1 or more digits
(?:[,.]\d+)?+ - a non-capturing group ((?:...)) matching 1 or 0 sequences of a comma or dot ([,.]) followed with 1 or more digits (this sequence is matched possessively (see ?+), so the regex engine cannot give it back and retry the hyphen check right after \b\d+)
(?!-) - a negative lookahead that fails the match if there is a hyphen after the digits detected.

Extracting part of string using regular expressions

I’m struggling to get a bit of regular expression code to work. I have a long list of strings that I need to partially extract. I need only the strings that start with “WER”, and from those I need only the last part of the string, starting from (and including) the letter.
test <- c("abc00012Z345678","WER0004H987654","WER12400G789456","WERF12","0-0Y123")
Here is the line of code that works, but only for one letter. However, the strings in my list can contain any letter.
ifelse(substr(test,1,3)=="WER",gsub("^.*H.*?","H",test),"")
What I’m hoping to achieve is the following:
H987654
G789456
F12

You can use the following pattern with gsub:
> gsub("^(?:WER.*([a-zA-Z]\\d*)|.*)$", "\\1", test)
[1] "" "H987654" "G789456" "F12" ""
See the regex demo
This pattern matches:
^ - start of a string
(?: - start of an alternation group with 2 alternatives:
WER.*([a-zA-Z]\\d*) - WER char sequence followed with 0+ any characters (.*) as many as possible up to the last letter ([a-zA-Z]) followed by 0+ digits (\\d*) (replace with \\d+ to require at least 1 digit)
| - or
.* - any 0+ characters
)$ - closing the alternation group and match the end of string with $.
With str_match from stringr, it is even tidier:
> library(stringr)
> res <- str_match(test, "^WER.*([a-zA-Z]\\d*)$")
> res[,2]
[1] NA "H987654" "G789456" "F12" NA
See another regex demo
If there are newlines in the input, add (?s) at the beginning of the pattern: res <- str_match(test, "(?s)^WER.*([a-zA-Z]\\d*)$").
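The capturing-group approach can be sketched outside R too. The Python version below mirrors the str_match idea, with fullmatch standing in for the ^ and $ anchors and an empty string used where R returns NA. The names rx and tails are illustrative.

```python
import re

test = ["abc00012Z345678", "WER0004H987654", "WER12400G789456", "WERF12", "0-0Y123"]

# The greedy .* backtracks to the last letter that is followed only by digits.
rx = re.compile(r'WER.*([a-zA-Z]\d*)')

tails = []
for s in test:
    m = rx.fullmatch(s)  # the whole string must match, like ^...$
    tails.append(m.group(1) if m else "")

print(tails)  # ['', 'H987654', 'G789456', 'F12', '']
```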

If you don't want empty strings or NA for strings that don't start with "WER", you could try the following approach:
sub(".*([A-Z].*)$", "\\1", test[grepl("^WER", test)])
#[1] "H987654" "G789456" "F12"
