Search regex expression whose line begins with - regex

Take for example these two lines:
# ManyRandomCharacters 1 2 3
ManyRandomDifferentCharacters 4 5 6
I'd like a regex such that it finds the numbers at the end but only for the line that doesn't begin with #. I just want to match the numbers, not the whole line (i.e., I just want "4", "5" and "6", not "1", "2" or "3"). That's the tricky part, because everything I tried selects all the line up to the numbers. Is there a way to do this? Thanks!

^ matches the start of the string, and if the multiline flag is set (depends on implementation, usually m), it detects also the start of a line.
So, something like /^(?:[^#].*)(\d+ \d+ \d+)/gm would match any line whose first character isn't #.

Related

RegEx matching odd number of characters at the beginning and the end of string

I have grammar is Lezer where I need to match a "custom string" which can start with any odd number of " and end with the same corresponding number. It can span multiple lines as well and anything inside needs to be skipped as far as the parser is concerned. I am struggling a little with the regEx part of the matching.
str"test" // valid
str"""test""" // valid
str""test"" // not valid
str""""test"""" // not valid
I am trying to match the beginning and end of that string.
I tried among other things "("")*[^"] but it matches the first letter after the odd double quotes (due to the [^"] which is something I would like to avoid.
For matching the end I have a similar issue.
So with the given input of:
1 str"test"
2 str"""
3 a
4 b
5 c
6 """
7 str""nope""
I am trying to match only str" for line 1 and str""" for line 2 and not match on line 7.
Need to match the ends as well (not in the same regex). So the match should be on " for line 1 and """ for line 6.
I have this so far: start: ^str"("")*[^"] end: [^"]"("")*$ but it is not optimal.
FYI I need start and end since the expectation is when you start writing and we hit a match on the beginning the highlighting in the editor should highlight all remaining text as a string until you have a matching odd number of ".
Any advice is appreciated.

extract text data using regexp in MATLAB

I'm dealing with extracting visibility data in METAR(airport weather observation data).
Visibility is a 4 digit(0~9) data, and can also be expressed as'CAVOK' when visibility is good.
but it's quite tricky to use regexp. (METAR data have many variations.)
Data sample(MET_VIS) below:
201903072300 METAR RKPC 072300Z 17003KT 110V210 CAVOK 05/02 Q1026 NOSIG=
201903062000 METAR RKPC 062000Z 33018G29KT 4000 BR FEW012 SCT025 08/04 Q1018 WS R13 R31 NOSIG=
201903062200 METAR RKPC 062200Z 33015KT 290V350 9999 SCT030 07/03 Q1019 NOSIG=
201903080000 METAR RKPC 080000Z 29002KT CAVOK 08/02 Q1027 NOSIG=
I want to extract CAVOK, 4000, 9999, CAVOK on each line.
I tried but this code doesn't work with line 3 :( It returns blank.
regexp(MET_VIS(i),'((?<=KT\s)\d{4})|CAVOK','match')
The third value does not end on KT. What you might do is use another positive lookbehind to check if the string before it ends on KT and match a range of matching 7 times A-Z0-9 followed by a whitespace char after it.
Then you either match 4 digits or CAVOK using an alternation (?:\d{4}|CAVOK) or else you could match CAVOK anywhere in the string.
Add a word boundary after it to prevent the match being part of a larger word.
(?:(?<=KT\s)|(?<=KT [A-Z0-9]{7}\s))(?:\d{4}|CAVOK)\b
Regex demo
You could also make an assumption about the range of "words" from the end your target should be allowed to occur in. For example:
/\b(?:\d{4}|CAVOK)\b(?=(?: \S+){3,9}$)/gm
See regex demo.
Here we're looking for a four-digit number or the phrase CAVOK only, if it is followed by 3 to 9 non-space substrings of variable length until the end of the line.

Regex - Match n occurences of substring within any m-lettered window

I am facing some issues forming a regex that matches at least n times a given pattern within m characters of the input string.
For example imagine that my input string is:
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
I want to detect all cases where an 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.
So far I have built this expression:
(1[^1]*?){7,}
which detects all cases where an 1 appears at least 7 times in the input string, but this now matches both the:
11000000011101111
and the
1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
parts whereas I want only the first one to be kept, as it is within a substring composed of less than 20 characters.
It tried to combine the aforementioned regex with:
(?=(^[01]{0,20}))
to also match only parts of the string containing either an '1' or a '0' of length up to 20 characters but when I do that it stops working.
Does anyone have an idea gow to accomplish this?
I have put this example in regex101 as a quick reference.
Thank you very much!
This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.
You could also iterate over the matches. Example in Python:
import re
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]
The next Python snippet is an attempt to get the desired sequences using only the regular expression.
import re
r = r'''
(?mx)
( # the 1st capturing group will contain the desired sequence
1 # this sequence should begin with 1
(?=(?:[01]{6,19}) # let's see that there are enough 0s and 1s in a line
(.*$)) # the 2nd capturing group will contain all characters to the end of a line
(?:0*1){6}) # there must be six more 1s in the sequence
(?=.{0,13} # complement the 1st capturing group to 20 characters
\2) # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])
Output:
['1010101010101', '11111100000000000001', '110000000111011', '1111111']
You can find an exhaustive explanation of this regular expression on RegEx101.

Matlab Regexp for nested groups and captured tokens

Is there a way to capture tokens inside a non-captured group in Matlab regular expressions? Here is the specific problem:
InputString = 'Identifiers: 10 12 1 3 8 6 4 2'
Expression = 'Identifiers:\s(?:(\d*)\t?)+'
regexp(InputString, Expression, 'tokens')
I need to find the numbers after 'Identifier'. The string InputString is part of a big character array with lines before and after this line, separated by \r\n characters. The character after the colon is a whitespace, the numbers are seperated by tabs. The last number has no trailing tab. The number of numbers can vary, but it's always at least one and only integers with 1 or n digits.
I had the following idea in my Expression: Identify line by Identifiers:\s, find numbers with n>1 digits and captured token and possible trailing tab by (\d*)\t and repeat this 1 or more times by +. To repeat the digit part expression, I need to put it in a group. But I don't want to capture the token of the outer group (?:(\d*)\t?), but of course the token of the inner grouping (\d*). Thats why I used ?:. When I remove ?: only one token containing 1012138642 is returned.
Isn't it possible to capture tokens inside a non-capturing group? Do you have any solution to return the numbers in a single statement?
In my current solution I find the line by
Expression = 'Identifiers:.+?\r\n'
Line = regexp(InputString, Expression, 'match')
and get the digits with
regexp(Line, '(\d+)\t+', 'tokens')
I spend so much time finding a single statement solution, I now really need to know if it's possible or not! I am not sure if I am thinking wrong, my head is not working as intended or it's just impossible.
MATLAB doesn't support nested tokens, even if you you mark them as non capturing.
Starting in 16b there are some new text manipulations that make this easier:
str = "Identifiers: 10 12 1 3 8 6 4 2" + newline + "Blah";
str = str.extractBetween("Identifiers: ",newline).split
str =
8×1 string array
"10"
"12"
"1"
"3"
"8"
"6"
"4"
"2"
If your goal is one statement with regexp, using split might get you closer.
str = regexp(str,'(?<=Identifiers[^\n]*)\s*(?=[^\n]*)','split')
str =
1×10 string array
"Identifiers:" "10" "12" "1" "3" "8" "6" "4" "2" "Blah"

c# split text file by changing the line number

I'm trying to split text file by line numbers,
for example, if I have text file like:
1 ljhgk uygk uygghl \r\n
1 ljhg kjhg kjhg kjh gkj \r\n
1 kjhl kjhl kjhlkjhkjhlkjhlkjhl \r\n
2 ljkih lkjhl kjhlkjhlkjhlkjhl \r\n
2 lkjh lkjh lkjhljkhl \r\n
3 asdfghjkl \r\n
3 qweryuiop \r\n
I want to split it to 3 parts (1,2,3),
How can I do this? the size of the text is very large (~20,000,000 characters) and I need an efficient way (like regex).
Another idea, you can use linq to get the groups you're after, by splitting by each first word. Note that this will take each first word, so make sure you only have numbers there. This is using the split/join antipattern, but it seems to work nice here.
var lines = from line in s.Split("\r\n".ToCharArray(),
StringSplitOptions.RemoveEmptyEntries)
let lineNumber = line.Split(" ".ToCharArray(), 2).FirstOrDefault()
group line by lineNumber
into g
select String.Join("\n", g);
Notes:
GroupBy is gurenteed to return lines in the order they appeared.
If a block appears more than once (e.g. "1 1 2 2 3 3 1"), all blocks with the same number will be merged.
You can use a regex, but Split will not work too well. You can Match for the following pattern:
^(\d).*$ # Match first line, capture number
([\r\n]+^\1.*$)* # Match additional lines that begin with the same number
Example: here
I did try to split by$(?<=^(\d+).*)[\r\n]+^(?!\1), but it adds the line numbers as additional elementnt in the array.