Regex: matching up to the first occurrence of a special character (|, -, /, ...) - regex

I have product IDs on a sheet, each in two parts separated by special characters.
I have several patterns and can't find a solution that works for all of them. I would like to keep only the text before the "-" or "|"; spaces can appear anywhere.
aaa23-rerez3
dfds12|gdflk 132
ds123 fdsf-123 gad
sa 123,fdsg 123
I found this regex:
.*\w
It works for some patterns but not for the pipe | or the -.
Many thanks for your help.

To match only the text before the | or -, you can use the anchor ^ to assert the start of the string and a negated character class to match any character except those listed in the class.
^[^|-]+
If the spaces can be anywhere and you also want to match those along with only word characters:
^\s*(?:\w+\s*)+
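As a quick check, the two patterns from this answer can be run against the sample IDs from the question. This is a minimal Python sketch; the helper name `first_part` is made up for illustration and is not part of the answer.

```python
import re

samples = ["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123"]

# Pattern 1: everything from the start up to the first | or -
before_separator = re.compile(r"^[^|-]+")
# Pattern 2: only word characters and whitespace, starting at the beginning
words_and_spaces = re.compile(r"^\s*(?:\w+\s*)+")

def first_part(pattern, text):
    """Return the leading match of `pattern` in `text`, or None (hypothetical helper)."""
    m = pattern.search(text)
    return m.group(0) if m else None

for s in samples:
    print(repr(s), "->",
          repr(first_part(before_separator, s)), "|",
          repr(first_part(words_and_spaces, s)))
```

The two patterns differ on the last sample: the first one returns the whole string because it contains neither | nor -, while the second stops at the comma.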

I hope the following regular expression works for you. I tested it and it worked for all your patterns.
^([^-\|\s]+)(?=[-\|\s].*$)
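A short Python sketch of how this pattern behaves on the question's samples. Note that whitespace counts as a separator here, and that the lookahead requires a separator to be present, so a string without one does not match at all. The function name `leading_token` is hypothetical.

```python
import re

# Capture the leading run of characters that are not -, | or whitespace,
# but only if a separator actually follows (that is what the lookahead checks).
pattern = re.compile(r"^([^-\|\s]+)(?=[-\|\s].*$)")

def leading_token(text):
    """Return the part before the first -, | or whitespace, or None (hypothetical helper)."""
    m = pattern.search(text)
    return m.group(1) if m else None

for s in ["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123", "abc"]:
    print(repr(s), "->", repr(leading_token(s)))
```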

Allow spaces, but split when a special character is found:
["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123"].forEach(x => console.log(x, x.split(/[^\d\w\s]/g)))
Split on spaces as well:
["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123"].forEach(x => console.log(x, x.split(/\W/g)))
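For comparison, roughly the same two splits in Python using `re.split`. This is a sketch, not a translation the answer itself provides; the function names are made up. The first element of each result is the part before the first separator.

```python
import re

samples = ["aaa23-rerez3", "dfds12|gdflk 132", "ds123 fdsf-123 gad", "sa 123,fdsg 123"]

def split_on_special(text):
    """Split on anything that is neither a word character nor whitespace."""
    return re.split(r"[^\w\s]", text)

def split_on_non_word(text):
    """Split on every non-word character, including spaces."""
    return re.split(r"\W", text)

for s in samples:
    print(s, split_on_special(s), split_on_non_word(s))
```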

Related

Regex to match(extract) string between dot(.)

I want to select some string combinations (words joined with dots) from a very long string (SQL). The full string could be a single line or multiple lines separated by newlines, and a combination could be at the start (on the first line), on a later line, or in both places.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add some prefix or suffix words or spaces that must be present wherever this logic is checked? In the case above, if the statement is an update, the regex should pick test.table, but if the statement is like select test.table, the regex should not pick up this combination. The same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things not needed in the selection: as_t.numb, as_t.old, as_t.date
Once I have the regex, I can use it in a program to extract these words.
Note: The words before and after the combination could be anything, such as update, select, { or (, so we have to find every occurrence of words joined together with a . (dot).
I tried something like this:
(?<=.)(.?)(?=.)(.?) - This only selected the word between two dots, not all of them.
.(?<=.)(.?)(?=.)(.?). - This selected everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
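A minimal Python sketch of this pattern applied to the second example from the question. Only the variable names are new; the SQL text is copied from the question.

```python
import re

sql = """UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )"""

# One or more non-whitespace characters, a literal dot, then more non-whitespace characters
dotted = re.compile(r"[^\s]+\.[^\s]+")

print(dotted.findall(sql))  # ['test.table', 'project.dataset.tablename']
```

Because the pattern accepts any non-whitespace character, punctuation that touches the name is included in the match: for example, `(a.b)` is returned with its parentheses.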
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
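Here is a small Python sketch of the lookbehind version run on the third example from the question. It passes `re.IGNORECASE` instead of the inline `(?i)`; the variable names are illustrative.

```python
import re

sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# Match a dotted name only when it directly follows INTO or FROM plus one whitespace character
after_keyword = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

print(after_keyword.findall(sql))  # ['test.table', 'test.table']
```

One caveat specific to Python's `re` module: a lookbehind must have a fixed width. INTO and FROM are both four letters, so this works, but adding a keyword of a different length (such as UPDATE) to the same alternation would raise an error there; separate lookbehinds would be needed in that case.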
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that matches any word character \w between zero and unlimited times, followed by a dot, followed by another run of word characters repeated between zero and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note: this solution is written for your use case (SQL strings). If the input contains something like '.word' or 'word..word', the regex will still match it, but I assume you don't have strings like that.
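If inputs such as '.word' or 'word..word' can occur, a stricter variant is possible. The sketch below is not the pattern from this answer: it is an assumed alternative that requires at least one word character on each side of every dot.

```python
import re

# Assumed stricter variant: word characters, then one or more ".word" groups
dotted_words = re.compile(r"\w+(?:\.\w+)+")

print(dotted_words.findall("a .word b word..word c x.y.z"))  # ['x.y.z']
print(dotted_words.findall("SELECT name FROM project.dataset.tablename"))
```

Unlike a pattern built on non-whitespace characters, this one leaves out surrounding punctuation such as parentheses, because \w only covers letters, digits and the underscore.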

How to find multiple dot with space character before first dots using REGEX

I have used this regex to find strings with multiple dots, but it does not work if my string contains whitespace before the first dot:
^.[\S]+\.[\S]+\.(.*)$
I expect the regex to find this value:
adajda9a b0a09.haa.ajada
teast.php.tasd
madnadak.ajada.a.jjhjhh
adjahdja.dfajha.ada.adjahdaj..jajjjjjhjha....dahhhhhbbja...
madkaja.adhakjda.sjjj
sadada.asdaa.jadfajk jadajda ajdhajda ada- 0(i09d0a9 )_) aciai
aadhadka.adad.akdjajdka0sd009999a.o999
adajda9a b0a09.haa.ajada
If you just want to match strings that have at least two dots, then why not just use this:
^.*\..*\..*$
You could also write this using a lookahead:
^(?=.*\..*\.).*$
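A quick Python sketch showing that the two forms accept the same lines. The last two test strings ("one.dot" and "nodots") are made up for contrast and do not come from the question.

```python
import re

two_dots = re.compile(r"^.*\..*\..*$")
two_dots_lookahead = re.compile(r"^(?=.*\..*\.).*$")

lines = ["teast.php.tasd", "adajda9a b0a09.haa.ajada", "one.dot", "nodots"]

for line in lines:
    # Both patterns should agree on whether the line has at least two dots
    print(repr(line), bool(two_dots.search(line)), bool(two_dots_lookahead.search(line)))
```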
I have created a regex that will match strings that have multiple dots in them and where there is only one space before the dots appear.
^[^.\s]* [^\s]*(?:\..*\..*)+$
Demo: https://regex101.com/r/UQksQK/4/
If you want to allow several spaces before the dots, use
^[^\.\s]* +.*(?:\..*\..*)+$
This will also match:
adajda9a b0a09.haa.ajada.123
If you want to forbid the space character between the dots, change the regex into:
^[^.\s]* +[^\s]*(?:\.[^\s]*\.[^\s]*)+$
It will not match strings like (where you have spaces between the dots):
adajda9a b0a09.ha a.ajada.123
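The difference between the variants can be checked with a short Python sketch. The two test strings are the ones quoted in this answer; the variable names are illustrative.

```python
import re

one_space = re.compile(r"^[^.\s]* [^\s]*(?:\..*\..*)+$")
several_spaces = re.compile(r"^[^\.\s]* +.*(?:\..*\..*)+$")
no_space_between_dots = re.compile(r"^[^.\s]* +[^\s]*(?:\.[^\s]*\.[^\s]*)+$")

plain = "adajda9a b0a09.haa.ajada.123"
space_between_dots = "adajda9a b0a09.ha a.ajada.123"

for name, pattern in [("one_space", one_space),
                      ("several_spaces", several_spaces),
                      ("no_space_between_dots", no_space_between_dots)]:
    print(name, bool(pattern.search(plain)), bool(pattern.search(space_between_dots)))
```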
Per the comment, to match a line with a space preceding the first of multiple dots:
^[^\.]* .*\..*\..*$
Test:
$ cat test.regexp
teast.php.tasd
madnadak.ajada.a.jjhjhh
adjahdja.dfajha.ada.adjahdaj..jajjjjjhjha....dahhhhhbbja...
madkaja.adhakjda.sjjj
sadada.asdaa.jadfajk jadajda ajdhajda ada- 0(i09d0a9 )_) aciai
aadhadka.adad.akdjajdka0sd009999a.o999
adajda9a b0a09.haa.ajada
$ egrep "^[^\.]* .*\..*\..*$" test.regexp
adajda9a b0a09.haa.ajada
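The same check can be reproduced without egrep. This is a Python sketch under the assumption that the file holds exactly the seven lines shown in the test above.

```python
import re

lines = [
    "teast.php.tasd",
    "madnadak.ajada.a.jjhjhh",
    "adjahdja.dfajha.ada.adjahdaj..jajjjjjhjha....dahhhhhbbja...",
    "madkaja.adhakjda.sjjj",
    "sadada.asdaa.jadfajk jadajda ajdhajda ada- 0(i09d0a9 )_) aciai",
    "aadhadka.adad.akdjajdka0sd009999a.o999",
    "adajda9a b0a09.haa.ajada",
]

# A space must appear before the first dot, followed by at least two dots
pattern = re.compile(r"^[^\.]* .*\..*\..*$")

matches = [line for line in lines if pattern.search(line)]
print(matches)  # ['adajda9a b0a09.haa.ajada']
```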

Regex to MATCH number string (with optional text) in a sentence

I am trying to write a regex that matches only strings like this:
89-72
10-123
109-12
122-311(a)
22-311(a)(1)(d)(4)
These strings are embedded in sentences and sometimes there are 2 potential matches in the sentence like this:
In section 10-123 which references section 122-311(a) there is a phone number 456-234-2222
I do not want to match the phone. Here is my current working regex
\d{2,3}\-\d{2,3}(\([a-zA-Z0-9]\))*
I've been looking on Stack and have not found anything yet. Any help would be appreciated. Will be using this in a google sheet and potentially postgres.
Based on the regex suggested by @Wiktor Stribiżew:
=REGEXEXTRACT(A1,REPT("\b(\d{2,3}-\d{2,3}\b(?:\([A-Za-z0-9]\))*)(?:[^-]|$)(?:.*)",LEN(REGEXREPLACE(REGEXREPLACE(A1,"\b(\d{2,3}-\d{2,3}\b(?:\([A-Za-z0-9]\))*)(?:[^-]|$)", char (9)),"[^"&char(9)&"]",""))))
The formula will return all matches.
String:
A
In 22-311(a)(1)(d)(4) section 10-123 which ... 122-311(a) ... number 456-234-2222
Output:
B C D
22-311(a)(1)(d)(4) 10-123 122-311(a)
Solution
To extract all matches from a string, use this pattern:
=REGEXEXTRACT(A1,
REPT(basic_regex & "(?:.*)",
LEN(REGEXREPLACE(REGEXREPLACE(A1,basic_regex, char (9)),"[^"&char(9)&"]",""))))
The tail of the formula:
LEN(REGEXREPLACE(REGEXREPLACE(A1,basic_regex, char (9)),"[^"&char(9)&"]",""))
just counts how many times the pattern occurs in the string (3 in this example).
To avoid matching the phone number, you have to indicate that the match must be neither preceded nor followed by \d or -. Google Sheets uses RE2, which does not support lookaround assertions (see the list of supported features), so as far as I can tell the only solution is to match one extra character, or the string boundary, before and after the match:
(?:^|[^-\d])\d{2,3}\-\d{2,3}(\([a-zA-Z0-9]\))*(?:$|[^-\d])
(?:^|[^-\d]) means either the start of a line (^) or a character that is not - or \d (you might want to change that and forbid all letters as well). $ is the end of a line. Note that ^ and $ only match at line boundaries with the /m flag.
This finds the correct strings, but with additional spaces around some of the matches.
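One way to avoid the extra surrounding characters is to wrap the reference itself in a capture group and read only that group. The sketch below uses Python's `re` purely for illustration; since the pattern contains no lookarounds, the same idea should carry over to RE2, though that is an assumption and not something tested here.

```python
import re

section_ref = re.compile(
    r"(?:^|[^-\d])"                           # start of string, or a char that is not - or a digit
    r"(\d{2,3}-\d{2,3}(?:\([a-zA-Z0-9]\))*)"  # the reference itself (captured)
    r"(?:$|[^-\d])"                           # end of string, or a char that is not - or a digit
)

text = ("In section 10-123 which references section 122-311(a) "
        "there is a phone number 456-234-2222")

# findall returns only the captured group, so the boundary characters are dropped
print(section_ref.findall(text))  # ['10-123', '122-311(a)']
```

A limitation of this approach: the boundary character is consumed by the match, so two references separated by exactly one character would yield only the first.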

Match between two strings + concatenate

I have this text:
2015-10-01 15:15:30 subject: Announcement: [Word To Find] Some other thext
My goal is to match the date with the time:
(?s)(?<=^)(.+?)(?= subject\: Announcement\: )
And also the text within [ ]
(?s)(?<=\[)(.+?)(?=\])
How to get those two results in a single regex?
I'm going to chime in with a working regex, which although similar to other answers, has all redundancies removed:
^(?s)(.*?) subject: Announcement: \[(.*?)]
Which yields groups:
1. "2015-10-01 15:15:30"
2. "Word To Find"
Redundancies:
It is not necessary to escape ] except within a character class
It is never necessary to escape a colon :
The look behind (?<=^) is identical to simply ^, since both are zero-width assertions
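A minimal Python sketch of the single-regex approach, followed by the concatenation the title asks for. The dot-all flag is passed as `re.DOTALL` because recent Python versions reject an inline `(?s)` that is not at the very start of the pattern; joining the parts with a space is an arbitrary choice.

```python
import re

text = "2015-10-01 15:15:30 subject: Announcement: [Word To Find] Some other thext"

# Group 1: everything before " subject: Announcement: "; group 2: the text inside [ ]
pattern = re.compile(r"^(.*?) subject: Announcement: \[(.*?)]", re.DOTALL)

m = pattern.search(text)
if m:
    timestamp, word = m.groups()
    combined = f"{timestamp} {word}"  # concatenate the two captures
    print(combined)  # 2015-10-01 15:15:30 Word To Find
```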
Use regex alternation operator.
^(?s).*?(?= subject\: Announcement\: )|(?<=\[)[^\]]*(?=\])
You can use a simple regex for that:
(.*)\s+subject.*\[(.*?)\]
Or
(.*)\s+subject.*\[([^]]+)\]
The first group contains the date, the second contains the text within the [ ].
You can use following regex to get both match :
(?<=^|\[)(.*?)(?=subject|\])
see demo https://regex101.com/r/hU2iZ3/2
Note that all you need is a logical OR (|) between the preceding tokens and between the following tokens.
Also note that if you have other brackets within your text, you should use a negated character class instead of .*:
(?<=^|\[)([^[\]]*?)(?=subject|\])

Regex validation of filename failing

I'm trying to validate a filename having letters "CAT" or "DOG" followed by 8 numerics, and ending in ".TXT".
Examples:
CAT20000101.TXT
DOG20031212.TXT
This would NOT match:
ATA12330000.TXT
CAT200T0101.TXT
DOG20031212.TX1
Here's the regex I am trying to make work:
(([A-Z]{3})([0-9]{8})([\.TXT]))\w+
Why is the last section (.TXT) failing against non-matching file extensions?
See example: http://regexr.com/3a7fo
Inside a character class there is no grouping, so [\.TXT] matches a single character (., T or X) rather than the literal text .TXT.
You can use this regex:
^[A-Z]{3}[0-9]{8}\.TXT$
For only matching CAT and DOG use:
^(CAT|DOG)[0-9]{8}\.TXT$
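The anchored pattern can be checked against the examples from the question with a short Python sketch. The function name `is_valid` is hypothetical.

```python
import re

filename_re = re.compile(r"^(CAT|DOG)[0-9]{8}\.TXT$")

def is_valid(name):
    """Return True if `name` is CAT or DOG, eight digits, then .TXT (hypothetical helper)."""
    return filename_re.match(name) is not None

for name in ["CAT20000101.TXT", "DOG20031212.TXT",
             "ATA12330000.TXT", "CAT200T0101.TXT", "DOG20031212.TX1"]:
    print(name, is_valid(name))
```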
lose the unnecessary parentheses
[A-Z]{3}[0-9]{8}[\.TXT]\w+
lose the unnecessary/pattern-breaking character class [] around \.TXT
[A-Z]{3}[0-9]{8}\.TXT\w+
lose the \w+ at the end
[A-Z]{3}[0-9]{8}\.TXT
change [A-Z]{3} to (?:CAT|DOG).
(?:CAT|DOG)[0-9]{8}\.TXT
voilà.
It's failing because \.TXT is in square brackets, which matches only one of those four characters. Just use (\.TXT).
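To see the effect concretely, here is a small Python sketch that runs the original pattern from the question against one of the filenames that should have been rejected.

```python
import re

# The original pattern: [\.TXT] is a character class, so it matches ONE character
original = re.compile(r"(([A-Z]{3})([0-9]{8})([\.TXT]))\w+")

m = original.search("DOG20031212.TX1")
print(m.group(0))  # DOG20031212.TX1  (the bad extension is accepted)
print(m.group(4))  # .                (the class consumed only the dot; \w+ took "TX1")

# The class on its own matches a single character, never the text ".TXT"
print(re.fullmatch(r"[\.TXT]", "X") is not None)     # True
print(re.fullmatch(r"[\.TXT]", ".TXT") is not None)  # False
```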
Remove the square brackets around [\.TXT] so that it becomes \.TXT
Your example modified http://regexr.com/3a7fu