Match between two strings + concatenate - regex

I have this text:
2015-10-01 15:15:30 subject: Announcement: [Word To Find] Some other thext
My goal is to match the date and time:
(?s)(?<=^)(.+?)(?= subject\: Announcement\: )
And also the text within [ ]
(?s)(?<=\[)(.+?)(?=\])
How can I get those two results with a single regex?

I'm going to chime in with a working regex, which although similar to other answers, has all redundancies removed:
^(?s)(.*?) subject: Announcement: \[(.*?)]
Which yields groups:
1. "2015-10-01 15:15:30"
2. "Word To Find"
See live demo.
Redundancies:
It is not necessary to escape ] except within a character class
It is never necessary to escape a colon :
The look behind (?<=^) is identical to simply ^, since both are zero-width assertions
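A minimal sketch of the two-group approach, assuming Python (the question does not name a language). Python 3.11+ rejects a global inline flag such as (?s) anywhere but the start of the pattern, so the sketch passes re.DOTALL instead; the variable names are illustrative only.

```python
import re

line = "2015-10-01 15:15:30 subject: Announcement: [Word To Find] Some other thext"

# Same pattern as above, with the inline (?s) replaced by the re.DOTALL flag.
pattern = re.compile(r"^(.*?) subject: Announcement: \[(.*?)]", re.DOTALL)

match = pattern.search(line)
timestamp, bracket_text = match.groups()

# Concatenate the two captured pieces, as the question title asks.
combined = f"{timestamp} {bracket_text}"
print(combined)
```

Capturing groups keep the two pieces separate, so joining them is left to the calling code.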

Use regex alternation operator.
^(?s).*?(?= subject\: Announcement\: )|(?<=\[)[^\]]*(?=\])
DEMO
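As a rough illustration of the alternation approach, assuming Python: each alternative produces its own match, so the two pieces come back as separate results that can then be joined. The inline (?s) is replaced by re.DOTALL here, and the colons are left unescaped.

```python
import re

line = "2015-10-01 15:15:30 subject: Announcement: [Word To Find] Some other thext"

# First alternative: everything from the start up to " subject: Announcement: ".
# Second alternative: the text between [ and ].
alternation = re.compile(
    r"^.*?(?= subject: Announcement: )|(?<=\[)[^\]]*(?=\])",
    re.DOTALL,
)

parts = alternation.findall(line)  # one list entry per alternative that matched
joined = " ".join(parts)
print(parts, joined)
```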

You can use a simple regex for that:
(.*)\s+subject.*\[(.*?)\]
Or
(.*)\s+subject.*\[([^]]+)\]
The first group contains the date, the second contains the text within the [ ].

You can use the following regex to get both matches:
(?<=^|\[)(.*?)(?=subject|\])
see demo https://regex101.com/r/hU2iZ3/2
Note that all you need is a logical OR (|) between the preceding tokens and between the following tokens.
Also note that if you have other brackets within your text, you should use a negated character class instead of .*:
(?<=^|\[)([^[\]]*?)(?=subject|\])


Regex: matching up to the first occurrence before special characters (|,-,/...)

I have product IDs on a sheet, in two parts separated by special characters.
I have several patterns and can't find a solution that works for all of them. I would like to keep only the text before the "-" or "|"; spaces can appear anywhere.
aaa23-rerez3
dfds12|gdflk 132
ds123 fdsf-123 gad
sa 123,fdsg 123
I found this regex :
.*\w
It works for some patterns but not for the pipe | and -.
Many thanks for your help.
To match only the text before the | or - you can use an anchor ^ to assert the start of the string and use a negated character class to match any character except those listed in the character class.
^[^|-]+
Regex demo
If the spaces can be anywhere and you also want to match those along with only word characters:
^\s*(?:\w+\s*)+
Regex demo
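A short sketch of both patterns run against the sample IDs from the question, assuming Python; the variable names are made up for the illustration.

```python
import re

samples = ["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123"]

# Everything from the start of the string up to the first | or -.
before_separator = re.compile(r"^[^|-]+")
# Only word characters and whitespace from the start of the string.
words_and_spaces = re.compile(r"^\s*(?:\w+\s*)+")

by_separator = [before_separator.match(s).group() for s in samples]
by_words = [words_and_spaces.match(s).group() for s in samples]

print(by_separator)
print(by_words)
```

The two patterns differ on the last sample: the negated class keeps the comma because only | and - are excluded, while the word-and-space pattern stops at it.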
I hope the following regular expression works for you. I tested it and it worked for all your patterns.
^([^-\|\s]+)(?=[-\|\s].*$)
To keep spaces and split only where a special character is found:
["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123"].forEach(x => console.log(x, x.split(/[^\d\w\s]/g)))
To split on spaces as well:
["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123"].forEach(x => console.log(x, x.split(/\W/g)))

Regex to match(extract) string between dot(.)

I want to select dot-separated string combinations from a very long string (SQL). The full string could be a single line or multiple lines separated by newlines, and the combination could appear on the first line, on a later line, or in both places.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add prefix or suffix words (or a space) that must be present wherever this logic is checked? In the case above, if the keyword is update, the regex should pick up test.table, but if the statement is like select test.table, the regex should not pick up this combination. The same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected output: test.table, test.table (the keywords are INTO and FROM)
Things not needed in the selection: as_t.numb, as_t.old, as_t.date
Once I have the regex, I can use it in a program to extract these words.
Note: The words before and after the combination could be anything, like update, select, { or (, so we have to find every occurrence of words joined together with a dot (.).
I tried something like this:
(?<=.)(.?)(?=.)(.?) - This only selected the word between two dots, not the whole combination.
.(?<=.)(.?)(?=.)(.?). - This selected everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
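A minimal sketch of this pattern applied to the UPDATE statement from the question, assuming Python; sql and dotted are illustrative names.

```python
import re

sql = """UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )"""

# Any run of non-whitespace characters that contains at least one literal dot.
dotted = re.findall(r"[^\s]+\.[^\s]+", sql)
print(dotted)
```

Because the pattern only excludes whitespace, it would also pick up punctuation attached to a dotted name, such as a trailing comma or parenthesis.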
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
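The lookbehind version can be tried on the INS INTO example from the question. This sketch assumes Python, whose re module only accepts fixed-width lookbehinds; the pattern works there because INTO and FROM are both four letters long. Keywords of different lengths would need a different approach, such as matching the keyword and capturing the name in a group.

```python
import re

sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# Only match dotted names that directly follow INTO or FROM plus one whitespace character.
after_keyword = re.compile(r"(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+")

tables = after_keyword.findall(sql)
print(tables)
```

The dotted names that do not follow a keyword (wu_id.Item_Nbr, as_t.old, as_t.date, as_t.numb) are skipped.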
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): it will also match something like '.word' or 'word..word', which I assume does not occur in your input.
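If inputs like '.word' or 'word..word' can occur, a stricter variant is one option. The sketch below is not the pattern from this answer; it is a swapped-in alternative, assuming Python, that requires at least one word character on each side of every dot.

```python
import re

sentence = "I am testing something like test.test.test in sentence."

# One or more word characters, followed by one or more ".word" segments.
strict_pattern = re.compile(r"\w+(?:\.\w+)+")

strict = strict_pattern.findall(sentence)
stray = strict_pattern.findall(".word word..word")  # malformed inputs

print(strict, stray)
```

The trailing period of the sentence is not matched because no word character follows it.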

Regex to catch commands recursively

I would like to capture my commands with dynamic group matching...
I'm not familiar with regex but it maybe requires a recursive group. Still, I don't understand the syntax.
some command content;some other command;another one
my groups would be any character between ;.
Here's my attempt at the code, but it only works for the first two groups:
(.+)[;]*(.*)
You may be overthinking this: [^;]+ will give you all parts of the string between semicolons. No need for recursion.
Test it live on regex101.com.
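A quick sketch of this in Python (the question does not say which language is used, so that is an assumption):

```python
import re

commands = "some command content;some other command;another one"

# Each match is a maximal run of characters that are not semicolons.
parts = re.findall(r"[^;]+", commands)
print(parts)
```

For this input, plain string splitting with commands.split(";") gives the same list; the two differ only around empty segments, which the regex skips.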
Try using this
(.+);*
or
(.+)(;*)(.*)
My approach would be like this:
const text = 'some command content;some other command; another one;'
const regex = /[\s\S]*;/gi
console.log(text.match(regex))
The output would be:
[ 'some command content;some other command; another one;' ]
[\s\S]* -> \s matches whitespace (spaces, tabs and new lines) and \S is the negation of \s, so together they match any character. The * means zero or more occurrences.
; -> the ';' symbol must be included.

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work from is the linked image. The creators can't even confirm which flavour of regex it is. They believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the patterns below have worked so far. I tested each one in the live system (rather than in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature and, as mentioned before, I'm not sure whether it is Python. I appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
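The two steps can be sketched as follows, assuming Python and that a replace operation is available (the question's update says it is not in the asker's tool). The sample lines are taken from the question.

```python
import re

lines = ["T123?214567", "T?211234567", "T1234567"]

positioner = re.compile(r"\?[0-9]{2}")  # step 1: the ?NN positioner
target = re.compile(r"T[0-9]{7}")       # step 2: T followed by seven digits

found = []
for line in lines:
    cleaned = positioner.sub("", line)  # strip the positioner first
    match = target.search(cleaned)
    if match:
        found.append(match.group())

print(found)
```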
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
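A minimal sketch of the concatenation step, assuming Python and exactly one positioner per line:

```python
import re

around = re.compile(r"(T.*?)\?\d{2}(.*)")

rebuilt = []
for line in ["T123?214567", "T?211234567"]:
    match = around.search(line)
    # Join the text before and after the positioner.
    rebuilt.append("".join(match.groups()))

print(rebuilt)
```

A line without a positioner would not match this pattern at all, so such lines need separate handling.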
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Demo
Then you can try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
Demo, in which the target string is captured in group 1 (or \1).
Finally, replace # with ?21 or anything else you like.
A Python script might look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# Print each match with the positioner replaced by '#'.
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Print each match again with '#' restored to '?21'.
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group1group2 \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
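A minimal sketch of this approach, assuming Python's re.sub as the replace functionality; the sample lines come from the question.

```python
import re

# Group 1: T plus leading digits; optional positioner; group 2: remaining digits.
optional_positioner = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")

lines = ["T123?214567", "T?211234567", "T1234567"]

# Keep only the two captured groups, dropping the positioner if present.
cleaned = [optional_positioner.sub(r"\1\2", line) for line in lines]
print(cleaned)
```

Lines without a positioner pass through unchanged, because the optional group simply matches nothing.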
Is this what you want?
Use RegExReplace with the multiline flag (m) and enable replacing all occurrences.
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

How to distinguish between saved segment and alternative?

From the following text...
Acme Inc.<SPACE>12345<SPACE or TAB>bla bla<CRLF>
... I need to extract company name + zip code + rest of the line.
Since either a TAB or a SPACE character can separate the second from the third tokens, I tried using the following regex:
FIND:^(.+) (\d{5})(\t| )(.+)$
REPLACE:\1\t\2\t\3
However, the contents of the alternative part are put in \3, so the result is this:
Acme Inc.<TAB>12345<TAB><TAB or SPACE here>$
How can I tell the (Perl) regex engine that (\t| ) is an alternative instead of a token to be saved in RAM?
Thank you.
You want:
^(.+?) (\d{5})[\t ](.+)$
Since you are matching one character or the other, you can use a character class instead. Also, I made your first quantifier non-greedy (+? instead of +) to reduce the amount of backtracking the engine has to do to find the match.
In general, if you want to make capture groups not capture anything, you can add ?: to it, like:
^(.+?) (\d{5})(?:\t| )(.+)$
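To see the effect of the non-capturing group on the numbering, here is a small sketch. It assumes Python rather than the asker's Perl setup; the group syntax is the same in both, and the sample lines are invented from the question's description.

```python
import re

pattern = re.compile(r"^(.+?) (\d{5})(?:\t| )(.+)$")

space_line = "Acme Inc. 12345 bla bla"
tab_line = "Acme Inc. 12345\tbla bla"

# (?:\t| ) matches the separator without taking a group number,
# so \3 now refers to the rest of the line.
space_result = pattern.sub(r"\1\t\2\t\3", space_line)
tab_result = pattern.sub(r"\1\t\2\t\3", tab_line)

print(repr(space_result))
print(repr(tab_result))
```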
Use non-capturing parentheses:
^(.+) (\d{5})(?:\t| )(.+)$
One way is to use \s instead of ( |\t) which will match any whitespace char.
See Backslash-sequences for how Perl defines "whitespace".