Regex - select all text that does not start with a specific number - regex

I want to get all text that does not start with 1,2,12,34.
I wrote
^((?!1|2|12|34).)*$
(^ asserts position at start of a line)
as in:
https://regex101.com/r/gI6sN8/14
Problems
It also doesn't select text that has 1 or 2 in the middle ("AB 1 CD").
It also doesn't select 13 (because it starts with 1)
How can I restrict it

Looks like you want this:
^(?!(1|2|12|34)\s).*
https://regex101.com/r/gI6sN8/16

As mentioned in comment, you need word boundary and correct parenthesis position
^(?!(?:1|2|12|34)\b)(.*)$
Regex Demo
You can also use \D
^(?!(?:1|2|12|34)\D)(.*)$
In your regex
^((?!1|2|12|34).)*$
you are finding whether any of the above alternative 1|2|12|34 is correct at every position. That's why it's not matching AB 1 CD

This works
^(?!(?:12?|2|34)(?!\d)).+$
https://regex101.com/r/gI6sN8/19
A valid boundary between the numbers you don't want it to
start with and the character after it appears to be any non-digit.

Related

Extract numbers from line not starting with comment symbol using regex

I'm try to replace all numbers not in the comment section. Here is a sample of the file to fix:
/* 2018-01-01 06:00:55 : realtime(0.002) --status(10)-- ++numretLines(0)++ --IP(192.168.1.5) PORT(22)-- queryNo(2) comment[TO: Too much time] TYPE[QUERY 4.2] */
select count(*) from table where id1 = 41111 and id2 = 221144
GO
Basically, I would like to replace numbers in strings not beginning with "/*".
I came up with the following regex: /^(?!\/\*)(?:.+\K(\d+?))/gmU
But I only manage to extract the first number of each line not starting with "/*". How could I extend this to get all the numbers of those rows?
Thanks!
Assuming your regex engine (which you haven't told) supports look behind and look ahead, you can use this regex:
(?<!^\/\*.*)(?:(?<=\s)\d+(?=\s))+
The regex starts by using a negative look behind, looking for the start of line, followed by a slash and a star.
Then it creates a new negative look behind for a White Space, then any number of digits, followed by a negative look ahead for a White Space. This Group is repeated any number of times.
You need to set the global and 'multiline' flag.
The regex skips numbers not surrounded by White Space (for instance 'id1')
Based on Wiktor Stribiżew comment, I used \/\*.*?\*\/(*SKIP)(*F)|-?\b\d+(\.\d+)? to extract the numbers, including decimals and negative values.

Regular Expression Extracting Text from a group

I have a filename like this:
0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
I needed to break down the name into groups which are separated by a underscore. Which I did like this:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
So far so go.
Now I need to extract characters from one of the group for example in group 2 I need the first 3 and 8 decimal ( keep mind they could be characters too ).
So I had try something like this :
(.*?)_([38]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It didn’t work but if I do this:
(.*?)_([PH]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It will pull the PH into a group but not the 38 ? So I’m lost at this point.
Any help would be great
Try the below Regex to match any first 3 char/decimal and one decimal
(.?)_([A-Z0-9]{3}[0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
Try the below Regex to match any first 3 char/decimal and one decimal/char
(.?)_([A-Z0-9]{3}[A-Z0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
It will match any 3 letters/digits followed by 1 letter/digit.
If your first two letter is a constant like "PH" then try the below
(.?)_([PH]+[0-9A-Z]{2})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
I am assuming that you are trying to match group2 starting with numbers. If that is the case then you have change the source string such as
0296005_383843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
It works, check it out at https://regex101.com/r/zem3vt/1
Using [^_]* performs much better in your case than .*? since it doesn't backtrack. So changing your original regex from:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
to:
([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
reduces the number of steps from 114 to 42 for your given string.
The best method might be to actually split your string on _ and then test the second element to see if it contains 38. Since you haven't specified a language, I can't help to show how in your language, but most languages employ a contains or indexOf method that can be used to determine whether or not a substring exists in a string.
Using regex alone, however, this can be accomplished using the following regular expression.
See regex in use here
Ensuring 38 exists in the second part:
([^_]*)_([^_]*38[^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
Capturing the 38 in the second part:
([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)

How Can I Create a RegEx Pattern that will Get N Words Using Custom Word Boundary?

I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_
EDIT #1: Thanks for all your comments.
To be clear:
I'd like to set the characters that would be the word delimiters
Lets call this the "Delimiter Set", or strDelimiters
strDelimiters = ".,;:!?-*_"
nNumWordsToFind = 5
A word is defined as any contiguous text that does NOT contain any character in strDelimiters
The RegEx word boundary is any contiguous text that contains one or more of the characters in strDelimiters
I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.
EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT
#maraca definitely answered my question as originally stated.
But what I actually need is to return the number of words ≤ nNumWordsToFind.
So if the source text has only 3 words, but my RegEx asks for 4 words, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind > number of actual words in the source text.
For example:
one,two;three-four_five.six:seven eight nine! ten
It would see this as 10 words.
If I want the first 5 words, it would return:
one,two;three-four_five.
I have this pattern using the normal \s whitespace, which works, but NOT exactly what I need:
([\w]+\s+){<NumWordsOut>}
where <NumWordsOut> is the number of words to return.
I have also found this word boundary pattern, but I don't know how to use it:
a "real word boundary" that detects the edge between an ASCII letter
and a non-letter.
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
However, I would want my words to allow numbers as well.
IAC, I have not been able how to use the above custom word boundary pattern to return the first N words of my text.
BTW, I will be using this in a Keyboard Maestro macro.
Can anyone help?
TIA.
All you have to do is to adapt your pattern ([\w]+\s+){<NumWordsOut>} to, including some special cases:
^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
1. 2. 3. 4. 5.
Match any amount of delimiters before the first word
Match a word (= at least one non-delimiter)
The word has to be followed by at least one delimiter
Or it can be at the end of the string (in case no delimiter follows at the end)
Repeat 2. to 4. <NumWordsOut> times
Note how I changed the order of the -, it has to be at the start or end, otherwise it needs to be escaped: \-.
Thanks to #maraca for providing the complete answer to my question.
I just wanted to post the Keyboard Maestro macro that I have built using #maraca's RegEx pattern for anyone interested in the complete solution.
See KM Forum Macro: Get a Max of N Words in String Using RegEx

Regular Expressions in R

I found somewhat similar questions
R - Select string text between two values, regex for n characters or at least m characters,
but I'm still having trouble
say I have a string in r
testing_String <- "AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7"
And I need to be able to pull anything between the first element in the string that contains 2 characters (AK) and PADK,ADK. PADK and ADK will change in character but will always be 4 and 3 characters in length respectively.
So I would need to pull
ADAK NAS
I came up with this but its picking up everything from AK to ADK
^[A-Za-z0_9_]{2}(.*?) +[A-Za-z0_9_]{4}|[A-Za-z0_9_]{3,}
If I understood your question correctly, this should do the trick:
\b[A-Z]{2}\s+(.+?)\s+[A-Z]{4}\s+[A-Z]{3}\b
Demo
You'll have to switch the perl = TRUE option (to use a decent regex engine).
\b means word boundary. So this pattern looks for a match starting with a 2-letter word and ending with a 4 letter word followed by a 3 letter word. Your value will be in the first group.
Alternatively, you can write the following to avoid using the capturing group:
\b[A-Z]{2}\s+\K.+?(?=\s+[A-Z]{4}\s+[A-Z]{3}\b)
But I'd prefer the first method because it's easier to read.
Lookbehind is supported for perl=TRUE, so this regex will do what you want:
(?<=\w{2}\s).*?(?=\s+[^\s]{4}\s[^\s]{2})

best approach for my pattern match

So, I've built a regex which follows this:
4!a2!a2!c[3!c]
which is translated to
4 alpha character followed by
2 alpha characters followed by
2 characters followed by
3 optional character
this is a standard format for SWIFT BIC code HSBCGB2LXXX
my regex to pull this out of string is:
(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})
Now this is targeting a specific tag (32) and works, however, I'm not sure if it's the cleanest, plus if there are any characters before H then it fails.
the string being matched against is:
:32B:HsBfGB4LXXXHELLO
the following returns HSBCGB4LXXX, but this:
:32B:2HsBfGB4LXXXHELLO
returns nothing.
EDIT
For clarity. I have a string which contains multiple lines all starting with :2xnumber:optional letter (eg, :58A:) i want to specify a line to start matching in and return a BIC from anywhere in the line.
EDIT
Some more example data to help:
:20:ABCDERF Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333
Mr Bank of Dad
Dads house
England
:52D:/DBEL02010987654321
address 1
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more
Ok, so I need to pass in as a variable the tag 53 or 59 and return the BIC HSBCGB2LXXX only!
Your regex can be simplified, and corrected to allow a character before the H, to:
:32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)
The changes made were:
Lost the look behind - just make it part of the match
Inserting .? meaning "optional character"
([a-zA-Z]{4}[a-zA-Z]{2}) ==> [a-zA-Z]{6} (4+2=6)
[0-9] ==> \d (\d means "any digit")
[X]{3} ==> XXX (just easier to read and less characters)
Group 1 of the match contains your target
I'm not quite sure if I understand your question completely, as your regular expression does not completely match what you have described above it. For example, you mentioned 3 optional characters, but in the regexp you use 3 mandatory X-es.
However, the actual regular expression can be further cleaned:
instead of [a-zA-Z]{4}[a-zA-Z]{2}, you can simply use [a-zA-Z]{6}, and the grouping parentheses around this might be unnecessary;
the {1} can be left out without any change in the result;
the X does not need surrounding brackets.
All in all
(?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3})
is shorter and matches in the very same cases.
If you give a better description of the domain, probably further improvements are also possible.