Regex to catch commands recursively - regex

I would like to capture my commands with dynamic group matching.
I'm not familiar with regex, but it may require a recursive group. Still, I don't understand the syntax.
some command content;some other command;another one
My groups would be whatever text sits between the ; characters.
Here's my attempt, but it only works for the first two groups:
(.+)[;]*(.*)

You may be overthinking this: [^;]+ will give you all parts of the string between semicolons. No need for recursion.
Test it live on regex101.com.
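As a rough sketch of that suggestion in Python (the variable names are only for illustration), re.findall with [^;]+ returns each run of non-semicolon characters:

```python
import re

text = "some command content;some other command;another one"

# [^;]+ matches one or more characters that are not a semicolon,
# so every match is one command.
commands = re.findall(r"[^;]+", text)
print(commands)
```

Because findall collects every match, the number of commands does not need to be known in advance.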

Try using this
(.+);*
or
(.+)(;*)(.*)

My approach would be like this:
const text = 'some command content;some other command; another one;'
// Use a regex literal; a quoted string would not be parsed as a pattern with flags.
const regex = /[\s\S]*;/gi
console.log(text.match(regex))
The output would be:
[ 'some command content;some other command; another one;' ]
[\s\S]* -> \s matches whitespace (spaces, tabs and new lines). \S is the negation of \s. The * means the class must occur zero or more times.
; -> a literal ';' must follow.

Related

Regex: matching up to the first occurrence before special characters (|,-,/...)

I have product IDs on a sheet, each in two parts separated by a special character.
There are several patterns, and I can't find a solution that works for all of them. I would like to keep only the text before the "-" or "|"; spaces can appear anywhere.
aaa23-rerez3
dfds12|gdflk 132
ds123 fdsf-123 gad
sa 123,fdsg 123
I found this regex:
.*\w
It works for some patterns but not for the pipe | or the -.
Many thanks for your help.
To match only the text before the | or -, you can use the anchor ^ to assert the start of the string and a negated character class to match any character except those listed in the class.
^[^|-]+
If the spaces can be anywhere and you also want to match those along with only word characters:
^\s*(?:\w+\s*)+
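To see how the two patterns differ, here is a small Python sketch that runs both against the sample IDs from the question. The variable names are made up for the example.

```python
import re

samples = ["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123"]

# Everything from the start of the string up to the first | or -.
before_separator = re.compile(r"^[^|-]+")
# Only word characters and whitespace from the start of the string.
words_and_spaces = re.compile(r"^\s*(?:\w+\s*)+")

first = [before_separator.match(s).group() for s in samples]
second = [words_and_spaces.match(s).group() for s in samples]

for sample, a, b in zip(samples, first, second):
    print(repr(sample), repr(a), repr(b))
```

The last sample shows the difference: it contains neither | nor -, so the first pattern keeps the whole string, while the second stops at the comma.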
I hope the following regular expression works for you. I tested it and it worked for all your patterns.
^([^-\|\s]+)(?=[-\|\s].*$)
Allow spaces, but separate if special character found.
["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123"].forEach(x => console.log(x, x.split(/[^\d\w\s]/g)))
Separates space also.
["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123"].forEach(x => console.log(x, x.split(/\W/g)))

Regex to match(extract) string between dot(.)

I want to select a dot-joined combination of words from a very long string (SQL). The full string could be a single line or several lines separated by newlines, and the combination could appear on the first line, on a later line, or both.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example 2 (real use case):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add prefix or suffix words (or a space) that must be present wherever this logic is checked? In the case above, if the statement is an update, the regex should pick up test.table, but if the statement is like select test.table, the regex should not pick up the combination. The same applies to the suffix.
Example 3: This illustrates the idea above.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected output: test.table, test.table (the keywords are INTO and FROM)
Things not needed in the selection: as_t.numb, as_t.old, as_t.date
Once I have the regex, I can use it in a program to extract these words.
Note: The words before and after the combination could be anything, like update, select, { or (, so we have to find every occurrence of words joined together with a . (dot).
I tried something like this:
(?<=.)(.?)(?=.)(.?) - This only selected the word between two dots, not all of them.
.(?<=.)(.?)(?=.)(.?). - This selected everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
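A minimal Python sketch of this pattern, run against the first example sentence from the question (the variable names are illustrative):

```python
import re

sentence = "I am testing something like test.test.test in sentence."

# One or more non-whitespace characters, a literal dot,
# then one or more non-whitespace characters.
dotted = re.findall(r"[^\s]+\.[^\s]+", sentence)
print(dotted)
```

The trailing "sentence." is not matched because nothing follows its dot, and the final [^\s]+ needs at least one character.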
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
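As a sketch of the lookbehind version in Python, here is the pattern applied to Example 3 from the question. It uses the re.IGNORECASE flag in place of the inline (?i); the variable names are assumptions for the example. Python requires a fixed-width lookbehind, which works here only because INTO and FROM have the same length.

```python
import re

sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# Match a dotted name only when it directly follows INTO or FROM
# plus one whitespace character.
pattern = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

tables = pattern.findall(sql)
print(tables)
```

Dotted names such as wu_id.Item_Nbr or as_t.old are skipped because no keyword precedes them.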
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that matches any word character \w between 0 and unlimited times, followed by a dot, followed by another series of word characters that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings). If you have something like '.word' or 'word..word', it will still match, but I assume your strings don't contain anything like that.

Match between two strings + concatenate

I have this text:
2015-10-01 15:15:30 subject: Announcement: [Word To Find] Some other thext
My Goal is to match the date with the time:
(?s)(?<=^)(.+?)(?= subject\: Announcement\: )
And also the text within [ ]
(?s)(?<=\[)(.+?)(?=\])
How to get those two results in a single regex?
I'm going to chime in with a working regex, which although similar to other answers, has all redundancies removed:
^(?s)(.*?) subject: Announcement: \[(.*?)]
Which yields groups:
1. "2015-10-01 15:15:30"
2. "Word To Find"
Redundancies:
It is not necessary to escape ] except within a character class
It is never necessary to escape a colon :
The look behind (?<=^) is identical to simply ^, since both are zero-width assertions
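For illustration, a Python sketch of the simplified pattern. It passes re.DOTALL as a flag in place of the inline (?s), and the variable names are made up for the example.

```python
import re

text = "2015-10-01 15:15:30 subject: Announcement: [Word To Find] Some other thext"

# Group 1: everything before " subject: Announcement: ".
# Group 2: the text inside the square brackets.
# The closing ] needs no escape outside a character class.
match = re.search(r"^(.*?) subject: Announcement: \[(.*?)]", text, re.DOTALL)

timestamp, word = match.groups()
print(timestamp)
print(word)
```

Both values come out of a single match, so there is no need to concatenate two separate searches.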
Use regex alternation operator.
^(?s).*?(?= subject\: Announcement\: )|(?<=\[)[^\]]*(?=\])
You can use a simple regex for that:
(.*)\s+subject.*\[(.*?)\]
Or
(.*)\s+subject.*\[([^]]+)\]
The first group contains the date, the second contains the text within the [ ].
You can use following regex to get both match :
(?<=^|\[)(.*?)(?=subject|\])
see demo https://regex101.com/r/hU2iZ3/2
Note that all you need is a logical OR (|) between the two preceding tokens and another between the two following tokens.
Also note that if you have other brackets within your text, you should use a negated character class instead of .*:
(?<=^|\[)([^[\]]*?)(?=subject|\])

How to distinguish between saved segment and alternative?

From the following text...
Acme Inc.<SPACE>12345<SPACE or TAB>bla bla<CRLF>
... I need to extract company name + zip code + rest of the line.
Since either a TAB or a SPACE character can separate the second from the third tokens, I tried using the following regex:
FIND:^(.+) (\d{5})(\t| )(.+)$
REPLACE:\1\t\2\t\3
However, the contents of the alternative part is put in the \3 part, so the result is this:
Acme Inc.<TAB>12345<TAB><TAB or SPACE here>$
How can I tell the (Perl) regex engine that (\t| ) is an alternative instead of a token to be saved in RAM?
Thank you.
You want:
^(.+?) (\d{5})[\t ](.+)$
Since you are matching one character or the other, you can use a character class instead. Also, I made your first quantifier non-greedy (+? instead of +) to reduce the amount of backtracking the engine has to do to find the match.
In general, if you want to make capture groups not capture anything, you can add ?: to it, like:
^(.+?) (\d{5})(?:\t| )(.+)$
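A short Python sketch of the difference (the question uses Perl, so this is only an analogy, and the helper name to_tabs is hypothetical). With the capturing group, the separator itself lands in group 3. With the non-capturing group, group 3 is the rest of the line, so the original replacement string works.

```python
import re

line = "Acme Inc. 12345 bla bla"

# Capturing alternative: the separator becomes group 3.
capturing = re.match(r"^(.+) (\d{5})(\t| )(.+)$", line)
separator = capturing.group(3)

def to_tabs(text):
    # (?:\t| ) groups the alternatives without saving them,
    # so \3 refers to the rest of the line.
    return re.sub(r"^(.+?) (\d{5})(?:\t| )(.+)$", r"\1\t\2\t\3", text)

converted = to_tabs(line)
print(repr(separator))
print(repr(converted))
```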
Use non-capturing parentheses:
^(.+) (\d{5})(?:\t| )(.+)$
One way is to use \s instead of ( |\t) which will match any whitespace char.
See Backslash-sequences for how Perl defines "whitespace".

Regex - Multiline Problem

I think I'm burnt out, and that's why I can't see an obvious mistake. Anyway, I want the following regex:
#BIZ[.\s]*#ENDBIZ
to grab me the #BIZ tag, #ENDBIZ tag and all the text in between the tags. For example, if given some text, I want the expression to match:
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
At the moment, the regex matches nothing. What did I do wrong?
ADDITIONAL DETAILS
I'm doing the following in PHP
preg_replace('/#BIZ[.\s]*#ENDBIZ/', 'my new text', $strMultiplelines);
The dot loses its special meaning inside a character class — in other words, [.\s] means "match period or whitespace". I believe what you want is [\s\S], "match whitespace or non-whitespace".
preg_replace('/#BIZ[\s\S]*#ENDBIZ/', 'my new text', $strMultiplelines);
Edit: A bit about the dot and character classes:
By default, the dot does not match newlines. Most (all?) regex implementations have a way to specify that it match newlines as well, but it differs by implementation. The only way to match (really) any character in a compatible way is to pair a shorthand class with its negation — [\s\S], [\w\W], or [\d\D]. In my personal experience, the first seems to be most common, probably because this is used when you need to match newlines, and including \s makes it clear that you're doing so.
Also, the dot isn't the only special character which loses its meaning in character classes. In fact, the only characters which are special in character classes are ^, -, \, and ]. Check out the "Metacharacters Inside Character Classes" section of the character classes page on Regular-Expressions.info.
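The same behaviour can be checked in Python, shown here only as a sketch, since the question itself uses PHP. The variable names are illustrative.

```python
import re

text = "#BIZ\nsome text some test\nmore text\nmaybe some code\n#ENDBIZ"

# Inside a character class the dot is literal, so [.\s] only allows
# periods and whitespace between the tags; the letters break the match.
broken = re.search(r"#BIZ[.\s]*#ENDBIZ", text)

# [\s\S] matches any character at all, newlines included.
replaced = re.sub(r"#BIZ[\s\S]*#ENDBIZ", "my new text", text)

print(broken)
print(replaced)
```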
// Replaces all of your code with "my new text", but I do not think
// this is actually what you want based on your description.
preg_replace('/#BIZ(.+?)#ENDBIZ/s', 'my new text', $contents);
// Actually "gets" the text, which is what I think you might be looking for.
preg_match('/(#BIZ)(.+?)(#ENDBIZ)/s', $contents, $matches);
list($dummy, $startTag, $data, $endTag) = $matches;
This should work
#BIZ[\s\S]*#ENDBIZ
You can try this online Regular Expression Testing Tool
The mistake is the character group [.\s] that will match a dot (not any character) or white space. You probably tried to get .* with . matching newline characters, too. You achieve this by enabling the single line option ((?s:) does this in .NET regex).
(?s:#BIZ.*?#ENDBIZ)
Depending on the environment you're using your regex in, it may need special care to properly parse multiline text, eg re.DOTALL in Python. So what environment is that?
You can use
preg_replace('/#BIZ.*?#ENDBIZ/s', 'my new text', $strMultiplelines);
The 's' modifier says "let the dot match anything, even the newline character". The '?' says don't be greedy, which matters in a case such as:
foo
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
bar
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
hello world
With the non-greedy quantifier, each block is replaced on its own, so the "bar" in the middle is kept.
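A Python sketch of the greedy versus non-greedy difference, using re.DOTALL as the counterpart of the PHP 's' modifier. The shortened sample text and the variable names are assumptions for the example.

```python
import re

text = "foo\n#BIZ\nsome text\n#ENDBIZ\nbar\n#BIZ\nmore text\n#ENDBIZ\nhello world"

# Greedy: .* runs to the last #ENDBIZ, swallowing "bar" along the way.
greedy = re.sub(r"#BIZ.*#ENDBIZ", "my new text", text, flags=re.DOTALL)

# Non-greedy: .*? stops at the first #ENDBIZ, so each block is replaced separately.
lazy = re.sub(r"#BIZ.*?#ENDBIZ", "my new text", text, flags=re.DOTALL)

print(repr(greedy))
print(repr(lazy))
```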
Unless I am missing something, you handle this the same way that you would in Perl, with either the /m or /s modifier at the end? Oddly enough the other answers that rather correctly pointed this out got down voted?!
It looks like you're doing a javascript regex, you'll need to enable multiline by specifying the m flag at the end of the expression:
var re = /^deal$/mg