Regex to remove trailing optional garbage - regex

I want to clean strings that may contain garbage at the end, always separated by a forward slash / and if there is no garbage, there is no separator.
Example > expected output
Foo/Bar > Foo
Foobar > Foobar
I tried several versions like these to extract only the payload, but none of them worked:
(.*)\/.*
(.*)?\/.*
(.*)?\/*.*
And so on. The problem is that I only ever get the first or the second line to match.
What would be the correct expression to extract the wanted information?

Your first and second patterns require a / to be present, so they cannot match a line such as Foobar that has none. Because .* is greedy, the capture group also runs up to the last / rather than the first.
The third pattern matches the whole line: \/* matches zero or more forward slashes, so the capture group can consume the entire line, and the trailing .* then has nothing left to match.
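To see this behaviour, the three attempted patterns can be run against both sample lines. This is only a small sketch using Python's re.match; the names patterns, lines and results are illustrative and not from the question.

```python
import re

# The three patterns from the question, tried on both sample lines.
patterns = [r"(.*)\/.*", r"(.*)?\/.*", r"(.*)?\/*.*"]
lines = ["Foo/Bar", "Foobar"]

results = {}
for pattern in patterns:
    for line in lines:
        m = re.match(pattern, line)
        # Store group 1 when there is a match, otherwise None.
        results[(pattern, line)] = m.group(1) if m else None
        print(f"{pattern!r:15} {line!r:10} -> {results[(pattern, line)]!r}")
```

The first two patterns return no match for Foobar, while the third captures each line in full, slash included.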
You could write the pattern with a capture group for 1 or more word characters as the first part, and an optional second part starting the match from / till the end of the string.
In the replacement you can use the first capture group.
^(\w+)(?:\/.*)?$
^ Start of string
(\w+) Capture 1+ word characters in group 1
(?:\/.*)? Optionally match / and the rest of the line (to be removed after the replacement)
$ End of string
See a regex demo.
There is no language listed, but an example using JavaScript:
const regex = /^(\w+)(?:\/.*)?$/m;
const str = `Foo/Bar
Foobar`;
const result = str.replace(regex, "$1");
console.log(result);
Example using Python
import re
regex = r"^(\w+)(?:\/.*)?$"
test_str = ("Foo/Bar\n"
            "Foobar")
result = re.sub(regex, r'\1', test_str, flags=re.MULTILINE)
if result:
    print(result)
Output
Foo
Foobar
Python demo

You can use replace here as:
const cleanString = (str) => str.replace(/\/.*/, "");
console.log(cleanString("Foo/Bar"));
console.log(cleanString("Foobar"));

This task doesn't need the power of regex; you only need to split on the first slash, e.g. in Python:
test_string.split('/', 1)[0]
I think the reason your regex doesn't work is that Foobar has no / to match on. So for regex you need to handle none, one, or many slashes. Again, in Python:
>>> import re
>>> test = ['foobar', 'foo/bar', 'foo/bar/baz']
>>> for s in test:
...     print(re.findall('^(.*?)(?=/|$)', s))
...
['foobar']
['foo']
['foo']
The regex says: from the start of the string, group all characters (non-greedy) until either a slash or the end of the string.

You can try doing a re.split on / and selecting the first element from the list. For example in Python:
import re
new_string = re.split('/', string)[0]

Related

Regex To Match String With All Words Contains Certain Format

I want to validate a field of string so that it only accept string that contains words with certain format.
Example accepted string:
#key;
#key1; #key2;#key3;
Example rejected string:
key;
%key1X #key2X$key3X
My regex:
\B(\#[a-zA-Z0-9_; ]+\b)(\;)
It seems my regex still accepts a string as long as it contains one word in a valid format, while I only want it to be accepted if all of the words are in the correct format.
Current example:
%key1; %key2 #keysz;#key3; #key4;
The current example above is still accepted because it contains #keysz; and #key3;, while I want it to be rejected because it also contains %key1;, %key2 and #key4;.
I've done some searching and the closest I could find is this question, but it returns a similar result to my current regex.
What did I do wrong in my regex? What is the right regex?
Sorry if this is a dumb question, but I'm a newbie in regex.
The main thing needed are start ^ and end $ anchors. The rest can be simplified too:
^( *#\w+;)+$
See live demo.
Breaking it down:
^ = start
" *" = zero or more spaces
# = a literal hash (these don't need escaping in regex)
\w+ = one or more word characters (letters, digits and the underscore)
; = a literal semicolon
( ... )+ = the whole group repeated one or more times
$ = end
If underscores may appear in the input but must not be accepted, use:
^( *#[A-Za-z0-9]+;)+$
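As a quick check, the anchored pattern can be run against the accepted and rejected strings from the question. This is a minimal sketch in Python; VALID, samples and is_valid are names made up for the illustration.

```python
import re

# Anchored pattern from the answer: every token must look like "#word;".
VALID = re.compile(r"^( *#\w+;)+$")

samples = [
    "#key;",
    "#key1; #key2;#key3;",
    "key;",
    "%key1X #key2X$key3X",
    "%key1; %key2 #keysz;#key3; #key4;",
]

def is_valid(text):
    """Return True only if the whole string consists of #word; tokens."""
    return VALID.match(text) is not None

for sample in samples:
    print(f"{sample!r}: {is_valid(sample)}")
```

Only the first two samples pass; the mixed string is rejected because the anchors force the whole input to fit the pattern.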
Your regex accepts the string because the pattern \B(\#[a-zA-Z0-9_; ]+\b)(\;) does not specify where a match must start and end, so the regex engine tries every position in the string and succeeds as soon as any part of it fits.
You specify where a match must start and end by adding anchors (^ for the beginning and $ for the end) to the pattern.
You can edit your pattern to look like this: /(?:\s|^)(#[a-zA-Z0-9_; ]+?);(?:\s|$)/gm
Explanation:
/(?:\s|^)
- (?: starts a non-capturing group, meaning that whatever is matched between these parentheses is not included in the result. \s|^ means the match must begin at a whitespace character or at the start of the string.
(#[a-zA-Z0-9_; ]+);
- () is a regular capture group, which means that things captured in this group are included in the result.
You don't need to insert a '\' before every symbol
(?:\s|$)/
- another non-capturing group, which matches a whitespace character or the end of the string.
gm
- the global and multiline flags of a JavaScript regex
Here is an example:
let regex_pattern = /(?:\s|^)(#[a-zA-Z0-9_; ]+);(?=\s|$)/gm
let input1 = " #key;" // string with just one word
let input2 = "#key1; #key2;#key3;" // string with one whole word and another word which will match your pattern
let input3 = "soemthing random #key;andjointstring" // a string with a word that will match the pattern but its not a whole word
console.log(input1.match(regex_pattern)) // matches
console.log(input2.match(regex_pattern)) // matches
console.log(input3.match(regex_pattern)) // does not match (null)

Regex match the unknown characters with dash between

I'm struggling with the following combination of characters that I'm trying to parse:
I have two types of text:
1. AF-B-W23F4-USLAMC-X99-JLK
2. LS-V-A23DF-SDLL--X22-LSM
I want to get the last two groups of characters, separated by a dash.
From the first, X99-JLK, and from the second, X22-LSM.
I managed the second one with the regex '--(.*-.*)'.
How can I parse the first sample, and is there a way to handle both in one go, with something like an OR operator?
Thanks for any help!
The pattern --(.*-.*) that you tried matches the second example because it contains -- and it matches the first occurrence.
Then it matches until the end of the string and backtracks to find another hyphen.
As .* can match any character (also -) and there are no anchors or boundaries set, this is a very broad match.
If there have to be 2 dashes, you can match the first one, and use a capture group for the part with the second one using a negated character class [^-].
The negated character class can also match a newline. If you don't want to match a newline you can use [^-\r\n], or [^-\s] to exclude all whitespace (as there is none in the example data).
-([^-]+-[^-]+)$
Explanation
- Match -
( Capture group 1
[^-]+-[^-]+ Match the second dash between chars other than -
) Close group 1
$ End of string
See a regex demo
For example using Javascript:
const regex = /-([^-]+-[^-]+)$/;
[
  "AF-B-W23F4-USLAMC-X99-JLK",
  "LS-V-A23DF-SDLL--X22-LSM"
].forEach(s => {
  const m = s.match(regex);
  if (m) {
    console.log(m[1]);
  }
});
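For comparison, the same pattern can be used in Python, alongside a regex-free alternative that splits from the right. This is only a sketch; last_two_regex and last_two_split are hypothetical helper names, and the split variant is a swapped-in technique rather than part of the answer above.

```python
import re

samples = ["AF-B-W23F4-USLAMC-X99-JLK", "LS-V-A23DF-SDLL--X22-LSM"]

def last_two_regex(text):
    """Capture the last two dash-separated parts with the pattern from the answer."""
    m = re.search(r"-([^-]+-[^-]+)$", text)
    return m.group(1) if m else None

def last_two_split(text):
    """Split at most twice from the right and rejoin the last two parts."""
    return "-".join(text.rsplit("-", 2)[-2:])

for sample in samples:
    print(last_two_regex(sample), last_two_split(sample))
```

Unlike the regex, the split variant does not require the two parts to be non-empty, so it is less strict about malformed input.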
You can try lookahead to match the last pair before the new line. JavaScript example:
const str = `
AF-B-W23F4-USLAMC-X99-JLK
LS-V-A23DF-SDLL--X22-LSM
`;
const re = /[^-]*-[^-]*(?=\n)/g;
console.log(str.match(re));

Select everything before & or everything if there is no &

I want to use regex to split some text.
my text:
Hello&World
Hello
0011&World
0011
Using (.*)(\&.*) only matches 'Hello&World' and '0011&World', and (.*)(\&.*)? ignores the last part.
For the first two I want to get 'Hello', and for the last two I want to get '0011'.
Thank you
It seems you need to fetch 0+ chars other than & at the beginning of a string.
Use the following regex:
^[^&]*
See the regex demo.
Details:
^ - start of string
[^&]* - a negated character class matching zero or more (*) chars other than & (to match 1 or more replace * with +).
See the Python demo:
import re
ss = ['Hello&World','Hello','0011&World','0011']
for s in ss:
    print(re.match('[^&]*', s).group())
    # print(re.search('^[^&]*', s).group())
Note that re.match looks for a match only at the start of the string, thus making ^ redundant in the pattern.
Else, if you use re.search, the ^ anchor is necessary to anchor the search at the start of the string.
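The difference between the two functions only shows up when the text does not fit at the start of the string. The sketch below uses the + variant of the pattern and a made-up input "&World" (not from the question) to make that visible.

```python
import re

s = "&World"  # hypothetical input that starts with the separator

at_start = re.match(r"[^&]+", s)       # anchored at index 0 by re.match
anchored = re.search(r"^[^&]+", s)     # anchored at index 0 by ^
unanchored = re.search(r"[^&]+", s)    # free to start anywhere

print(at_start, anchored, unanchored.group())
```

Without the anchor, re.search skips the leading & and returns World, which is not the text before the separator.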

Added some regex into existing regular pattern

I am not good at regex and need to update the following pattern without impacting the other patterns. Any suggestions? There can be 1 to 4 $ signs, and they are always at the beginning of the line (a leading space may or may not be present).
import re
data = " $$$AKL_M0_90_2K: Two line end vias (VIAG, VIAT and/or"
patt = '^ (?:ABC *)?([A-Za-z0-9/\._\:]+)\s*: ? '
match = re.findall( patt, data, re.M )
print(match)
Note: data is a multi-line string.
match should contain this result: "$$$AKL_M0_90_2K"
I suggest the following solution (see IDEONE demo):
import re
data = r" $$$AKL_M0_90_2K: Two line end vias (VIAG, VIAT and/or"
patt = r'^\s*([$]{1,4}[^:]+)'
match = re.findall( patt, data, re.M )
print(match)
The re.findall will return the list with just one match. The ^\s*([$]{1,4}[^:]+) regex matches:
^ - start of a line (you use re.M)
\s* - zero or more whitespaces
([$]{1,4}[^:]+) - Group 1 capturing 1 to 4 $ symbols, and then one or more characters other than :.
See the regex demo
If you need to keep your own regex, just do one of the following:
Add $ to the character class (demo): ^ (?:ABC *)?([$A-Za-z0-9/._:]+)\s*: ?
Add an alternative to the first non-capturing group and place it at the start of the capturing one (demo): ^ ((?:ABC *|[$]{1,4})?[A-Za-z0-9/._:]+)\s*: ?
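Both modified patterns can be checked against the sample line from the question. This is a minimal sketch; the names with_class and with_alternative are made up for the illustration.

```python
import re

data = r" $$$AKL_M0_90_2K: Two line end vias (VIAG, VIAT and/or"

# Variant 1: $ added to the character class.
with_class = r"^ (?:ABC *)?([$A-Za-z0-9/._:]+)\s*: ?"
# Variant 2: 1 to 4 $ signs as an alternative inside the capturing group.
with_alternative = r"^ ((?:ABC *|[$]{1,4})?[A-Za-z0-9/._:]+)\s*: ?"

for patt in (with_class, with_alternative):
    print(re.findall(patt, data, re.M))
```

Note that in the second variant an ABC prefix, if present, would become part of the captured text, since the alternation sits inside the capturing group.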

Regular expression for duplicate words

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
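A short Python sketch of this pattern on the three sentences from the question; DOUBLED and sentences are illustrative names only.

```python
import re

# \1 must repeat exactly what group 1 captured; \b keeps partial words out.
DOUBLED = re.compile(r"\b(\w+)\s+\1\b")

sentences = [
    "Paris in the the spring.",
    "Not that that is related.",
    "Why are you laughing? Are my my regular expressions THAT bad??",
]

for sentence in sentences:
    # findall returns the captured word; sub collapses the pair to one word.
    print(DOUBLED.findall(sentence), "->", DOUBLED.sub(r"\1", sentence))
```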
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
    input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : One or more word characters
(\s+\1\b)+ : One or more whitespace characters followed by a word that matches the previous word, ending at a word boundary. The + on the whole group lets it match more than one repetition.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
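The same idea can be sketched in Python with a single case-insensitive substitution instead of the find-and-replace loop. This is a different technique from the Java code above, shown only for comparison; collapse_repeats is a hypothetical helper name.

```python
import re

def collapse_repeats(text):
    """Replace each run of a repeated word with its first occurrence."""
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)

print(collapse_repeats("Goodbye goodbye GooDbYe"))
print(collapse_repeats("I love Love to To tO code"))
```

Because the substitution keeps group 1, the casing of the first occurrence is the one that survives.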
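Try the regular expression below:
\b start of word (word boundary)
\w+ one or more word characters (captured in group 1)
\W+ one or more non-word characters
\1 the same word that was already matched
\b end of word
()* repeat the group zero or more times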
public static void main(String[] args) {
    String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";// "/* Write a RegEx matching repeated words here. */";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE/* Insert the correct Pattern flag here.*/);
    Scanner in = new Scanner(System.in);
    int numSentences = Integer.parseInt(in.nextLine());
    while (numSentences-- > 0) {
        String input = in.nextLine();
        Matcher m = p.matcher(input);
        // Check for subsequences of input that match the compiled pattern
        while (m.find()) {
            // replace() substitutes the matched text literally; replaceAll() would reinterpret it as a regex
            input = input.replace(m.group(0), m.group(1));
        }
        // Prints the modified sentence.
        System.out.println(input);
    }
    in.close();
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
Example
Source
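A small Python sketch of the lookahead approach, under the assumption that the leftover whitespace should be collapsed afterwards (the pattern itself leaves gaps where words were removed). drop_earlier_duplicates is a hypothetical helper name.

```python
import re

def drop_earlier_duplicates(text):
    """Remove every word that occurs again later on; the last occurrence is kept."""
    stripped = re.sub(r"\b(\w+)\b(?=.*?\b\1\b)", "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(stripped.split())

print(drop_earlier_duplicates("one two one three two"))
```

Since . does not match a newline by default, the lookahead only sees duplicates on the same line.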
The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any run of non-whitespace characters, followed by optional whitespace.
\1{2,} then requires two or more further repetitions of that phrase. In other words, it matches when the same phrase appears three or more times in a row.
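The threshold can be checked with a short Python sketch; SPAM and the sample strings are made up for the illustration.

```python
import re

# Group 1 is one phrase (plus trailing whitespace); \1{2,} needs two more copies.
SPAM = re.compile(r"(\S+\s*)\1{2,}")

three_copies = SPAM.search("spam spam spam eggs")
two_copies = SPAM.search("spam spam eggs")

print(bool(three_copies), bool(two_copies))
```

Note that the pattern also fires on three identical characters in a row inside one word, since a single character is a valid value for group 1.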
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample utility function written in Java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
    var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
    var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    var matcher = pattern.matcher(input);
    while (matcher.find()) {
        // replace() substitutes the matched text literally; replaceAll() would treat it as a regex
        input = input.replace(matcher.group(), matcher.group(1));
    }
    return input;
}
As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
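The whitespace-boundary pattern can be tried in Python as follows. This is a sketch only; DUP and find_dup are hypothetical names, and the two-line input is an added example showing that \s+ also spans a line break.

```python
import re

# Whitespace boundaries: no non-whitespace char directly before or after the match.
DUP = re.compile(r"(?<!\S)(\w+)(?:\s+\1)+(?!\S)")

def find_dup(text):
    """Return the first run of duplicated words, or None if there is none."""
    m = DUP.search(text)
    return m.group() if m else None

print(repr(find_dup("Paris in the the spring.")))
print(repr(find_dup("This is $word word")))
print(repr(find_dup("London in the\nthe winter")))
```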
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this if you want case-insensitive checking for duplicate words (the backslashes are doubled for use inside a string literal):
(?i)\\b(\\w+)\\s+\\1\\b