Vim regex that combines all items

I have a file in the following format:
1 2472
1 664
2 2600
10 4135
10 5606
...
and I want to convert it to:
1 2472 664
2 2600
10 4135 5606
...

You can combine items by executing this command:
:%s/\v(\d+\s)(.*\n\1.*)+/\=substitute(submatch(0),'\n'.submatch(1),' ','g')/

With multiline support, you can go with:
^(\d+) (\d+)$[.\s]+?^\1 (\d+)
See it on regex101.
The idea is to use backreferences to match two lines.
In vim's syntax, you'd write it as follows:
:%s/\_^\(\d*\) \(\d*\)\_$\_.*\_^\1 \(\d*\)/\1 \2 \3/
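For comparison, the same grouping can be sketched outside the editor. This is a minimal Python sketch, not part of either answer; the function name combine is made up for illustration, and it assumes that lines sharing the same first column are adjacent, as in the sample.

```python
from itertools import groupby


def combine(lines):
    """Merge consecutive lines that share the same first column.

    Hypothetical helper: assumes each non-empty line is "key value"
    and that equal keys are adjacent, as in the question's sample.
    """
    rows = [line.split() for line in lines if line.strip()]
    merged = []
    for key, group in groupby(rows, key=lambda row: row[0]):
        # Keep the key once, then append every value seen for that key.
        merged.append(" ".join([key] + [row[1] for row in group]))
    return merged


sample = ["1 2472", "1 664", "2 2600", "10 4135", "10 5606"]
combined = combine(sample)
```

groupby only merges runs of equal keys, so a file where the same key reappears later would produce two separate output lines for it.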

Related

Regex Stop at the First Occurrence of a Word

How would I change my Regex to stop after the first match of the word?
My text is:
-rwxr--r-- 1 bob123 bob123 0 Nov 10 22:48 /path/for/bob123/dir/to/file.txt
There is a variable called owner, the first arg from cmd:
owner=$1
My regex is: ^.*${owner}
My match ends up being:
-rwxr--r-- 1 bob123 bob123 0 Nov 10 22:48 /path/for/bob123
But I only want it to be: -rwxr--r-- 1 bob123.
Add a question mark: ^.*?${owner}. This makes the * quantifier non-greedy, so the match stops at the first occurrence of ${owner}. Lazy quantifiers need Perl-compatible regular expressions, so run grep with the -P option.
https://regex101.com/r/ThGpcq/1.
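The difference between the greedy and the lazy quantifier can be checked with a small Python sketch. It is only an illustration: the answer targets grep -P, and re.escape is added here on the assumption that the owner name should be matched literally.

```python
import re

text = "-rwxr--r-- 1 bob123 bob123 0 Nov 10 22:48 /path/for/bob123/dir/to/file.txt"
owner = "bob123"

# Greedy: .* grabs as much as possible, so the match ends at the LAST owner.
greedy = re.match(r"^.*" + re.escape(owner), text).group(0)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST owner.
lazy = re.match(r"^.*?" + re.escape(owner), text).group(0)
```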
You do not need a regex here, use string manipulation and concatenation:
text='-rwxr--r-- 1 bob123 bob123 0 Nov 10 22:48 /path/for/bob123/dir/to/file.txt'
owner='bob123'
echo "${text%%$owner*}$owner"
# => -rwxr--r-- 1 bob123
See the online Bash demo.
The ${text%%$owner*} expansion removes the longest suffix that matches $owner* (longest because of %%), that is, everything from the first occurrence of $owner to the end of the string. Since this also removes $owner itself, "...$owner" appends it back.
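The same "cut at the first occurrence" idea can be sketched without a regex in Python. The helper name up_to_first is hypothetical. Unlike the Bash one-liner, this sketch returns the text unchanged when the owner does not occur at all.

```python
def up_to_first(text: str, owner: str) -> str:
    """Return text up to and including the first occurrence of owner."""
    # partition splits at the FIRST occurrence; sep is "" if owner is absent.
    head, sep, _tail = text.partition(owner)
    return head + sep


text = "-rwxr--r-- 1 bob123 bob123 0 Nov 10 22:48 /path/for/bob123/dir/to/file.txt"
result = up_to_first(text, "bob123")
```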

Regex to match blocks of text with key phrases in the middle

VB2010: I have text that consists of blocks of text that start with day and time DD HHMM and end only at the next day/time.
Here is my sample text:
18 2131 Z50000 ZZ-AAA
PR
PR
AGM TPS P773QQ 1500 DCA 22FEB
21,77,23,M10,F,26,3100,2
OK
18 2134 Z50000 ZZ-AAA
PR
QU HMKKDBB
.DDVZAZC 182134
ARR
FI US1500/AN P773QQ/DA KDCA/AD KMIA/IN 2026/FB 152/LA /LR
DT DDL DCAV 182134 M33A
- OS KMIA /GNO6541/R200RR
18 2134 Z50000 ZZ-AAA
PR
PR
ARR OPN P773QQ 1500 DCA 22FEB
0757
OK
18 2135 Z50000 ZZ-AAA
PR
PR
ARR M58 P773QQ 1500 DCA 22FEB
212
UNKNOWN POL/SPOL
QU HMKKDBB
.DDVZAZC 182134
ARR
FI US1500/AN P773QQ/DA KDCA/AD KMIA/IN 2026/FB 152/LA /LR
DT DDL DCAV 182134 M33A
- OS KMIA /GNO6541/R200RR
18 2136 Z50000 ZZ-AAA
PRF 1500/18 MIA IN 0152 333
18 2137 Z50000 ZZ-AAA
PR
PRZ 1500/18 MIA IN 2026 N/A 333
My goal is to get only the blocks of text that have the key phrases ^FI and ^DT in the middle. The matches should contain only two blocks: the one that starts at 18 2134 and ends at M33A, and the one from 18 2135 to M33A.
I have tried:
This works for the most part except it starts the match at the prior block.
RegexOptions.Singleline Or RegexOptions.Multiline Or RegexOptions.IgnoreCase
^\d\d \d{4}(.*?)^FI US(.*?)^DT DDL(.*?)\r
This one I took from another post, but I can't seem to wrap my head around it. It matches only the first part of every block.
RegexOptions.Multiline Or RegexOptions.IgnoreCase
^\d\d \d{4}.*\r[\s\S]*?(?=(?:^\d\d \d{4}|$))
Haven't used regex in a while so any help appreciated.
You may use
(?ms)^\d\d +\d{4}\b(?:(?!^(?:\d\d +\d{4}\b|FI|DT)).)*?^(?:FI|DT).*?(?=^\d\d +\d{4}\b|\Z)
See the regex demo (though it is a PCRE regex test, it will work the same in .NET).
Pattern details
(?ms) - multiline and singleline options
^ - start of a line
\d\d +\d{4}\b - 2 digits, 1 or more spaces and 4 digits as a whole word
(?:(?!^(?:\d\d +\d{4}\b|FI|DT)).)*? - any char, 0+ repetitions, as few as possible, that does not start the sequence: start of a line, 2 digits, 1 or more spaces and 4 digits as a whole word, or FI or DT
^(?:FI|DT) - FI or DT at the start of a line
.*? - any 0+ chars, as few as possible
(?=^\d\d +\d{4}\b|\Z) - a positive lookahead that requires ^\d\d +\d{4}\b (start of a line, 2 digits, 1 or more spaces and 4 digits as a whole word) or \Z (end of string) to match immediately to the right of the current location.
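The pattern above can be tried out with a short Python sketch. This is an illustration under assumptions: the answer targets .NET, the sample is an abbreviated version of the question's text, and the name BLOCK is made up. Python's re module accepts the same inline flags and lookarounds for this pattern.

```python
import re

# Same pattern as above, split over several string literals for readability.
BLOCK = re.compile(
    r"(?ms)^\d\d +\d{4}\b"                  # block header: DD HHMM
    r"(?:(?!^(?:\d\d +\d{4}\b|FI|DT)).)*?"  # body, but never cross a header/FI/DT line start
    r"^(?:FI|DT).*?"                        # the key line, then the rest of the block
    r"(?=^\d\d +\d{4}\b|\Z)"                # stop right before the next header or at the end
)

# Abbreviated sample: only the middle block contains FI/DT lines.
sample = "\n".join([
    "18 2131 Z50000 ZZ-AAA",
    "PR",
    "OK",
    "18 2134 Z50000 ZZ-AAA",
    "PR",
    "FI US1500/AN P773QQ",
    "DT DDL DCAV 182134 M33A",
    "- OS KMIA /GNO6541/R200RR",
    "18 2136 Z50000 ZZ-AAA",
    "PRF 1500/18 MIA IN 0152 333",
]) + "\n"

blocks = [m.group(0) for m in BLOCK.finditer(sample)]
```

Each match runs from the block header up to the next header, so it includes any lines that follow the DT line inside the same block.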
This regex should find what you need if the single-line option is enabled:
[0-3]\d\s+[0-2]\d[0-5]\d.*?(FI.*?)\n(DT.*?)\n
Explanation:
[0-3]\d\s+[0-2]\d[0-5]\d day hour and minute check
.*? lazy match (not captured); with the single-line option, . also matches newlines
(FI.*?)\n first group, FI line, until line break
(DT.*?)\n second group, same deal

regex to match lines with coordinates ending in zero

given the following:
1803 1004 -4.2
1807 1005 3.3
1809 1006 -8.9
1800 1007 -3.7
1805 1008 9.1
1808 1009 -4.3
1800 1000 3.2
I'd like a regex that matches only lines whose first two coordinates both end in zero, so we'd only return:
1800 1000 3.2
I only want lines where both of the first two numbers end in zero. The lines may have large amounts of whitespace at the start or between the numbers.
I've tried various combinations of '\s*\d+0\z*\d+0*' and '\d+0\s\d+0*' with no result.
I'm using this in combination with grep.
I recommend grep's -E option:
$ grep -E '^ *([0-9]*0) +([0-9]*0) +.*$' dataFile
Result:
1800 1000 3.2
In action: https://regex101.com/r/h4on2q/1
Additionally, about -E, from man grep:
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
Basic vs Extended Regular Expressions:
https://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html
Give this a try: ^\s*\d+0\s+\d+0\s+.*$ (with GNU grep this needs the -P option, since \d is not supported in basic or extended syntax)
In action: https://regex101.com/r/t0hhDL/2
It's not clear from your question whether the data you're working with is all one big string, or these are multiple lines being returned. I assumed the latter with the answer above, but the pattern will need to be slightly different if that's not the case.
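Assuming the data arrives as separate lines, the same check can be sketched in Python. The names BOTH_END_IN_ZERO and matching_lines are hypothetical, and the extra leading spaces in the sample were added to exercise the whitespace handling the question mentions.

```python
import re

# Optional leading whitespace, then two numbers that each end in 0,
# separated by whitespace and followed by at least one more whitespace.
BOTH_END_IN_ZERO = re.compile(r"^\s*\d*0\s+\d*0\s")


def matching_lines(lines):
    """Keep only lines whose first two columns both end in zero."""
    return [line for line in lines if BOTH_END_IN_ZERO.match(line)]


data = [
    "1803 1004 -4.2",
    "1807 1005 3.3",
    "1809 1006 -8.9",
    "1800 1007 -3.7",
    "1805 1008 9.1",
    "1808 1009 -4.3",
    "   1800    1000 3.2",
]
kept = matching_lines(data)
```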

Regex expression if number string contains specific numbers

I need some help with creating a regex string. I have this long list of numbers:
7001 7002 7003 7004 7005 7006 7007 7008 7009 7010 7011 7012 7013 7014
7015 7016 7017 7018 7019 7020 7021 7022 7023 7024 7025 7026 7027 7028
7029 7030 7031 7032 7033 7034 7035 7036 7037 7038 7039 7040 7041 7042
7043 7044 7045 7046 7047 7048 7049 7050 7051 7052 7053 7054 7055 7056
7057 7058 7059 7060 7061 7062 7063 7064 7065 7066 7067 7068 7069 7070
7071 7072 7073 7074 7075 7076 7077 7078 7079 7080 7081 7082 7083 7084
7085 7086 7087 7088 7089 7090 7091 7092 7093 7094 7095 7096 7097 7098
7099 7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112
7113 7114 7115 7116 7117 7118 7119 7120 7121 7122 7123 7124 7125 7126
7127 7128 7129 7130 7131 7132 7133 7134 7135 7136 7137 7138 7139 7140
7141 7142 7143 7144 7145 7146 7147 7148 7149 7150 7151 7152 7153 7154
7155 7156 7157 7158 7159 7160 7161 7162 7163 7164 7165 7166 7167 7168
7169 7170 7171 7172 7173 7174 7175 7176 7177
Basically, I need to find the numbers that contain numbers 8 and 9 so I can remove them from the list.
I tried this regex: ([0-7][0-7][8-9]{2}) but that will only match numbers that strictly have both numbers 8 & 9.
How about you just write some simple code rather than trying to cram everything into a regex?
#!/usr/bin/perl -i -p
# -i -p: process the file in place, one line at a time
@n = split / /; # Split on single spaces into array @n
@n = grep { !/[89]/ } @n; # @n now contains only those numbers NOT containing 8 or 9
$_ = join( ' ', @n ); # Rebuild the line
Dalorzo's answer would work, but I suggest a different approach:
/\b(?=\d{4}\b)(\d*[89]\d*)\b/g
Assuming you are only looking for 4-digit numbers, this uses a positive lookahead to ensure the number has exactly four digits (so it won't match, say, 3- or 5-digit numbers) and then checks that at least one of the digits is 8 or 9.
http://regex101.com/r/hW4vQ3
If you need to catch all numbers, not just four-digit ones, then use:
/\b(?=\d+\b)(\d*[89]\d*)\b/g
See it in action:
http://regex101.com/r/bW2gH3
As a bonus, the regex also captures the numbers, so you can do a replace afterwards if you wish.
This is a bit long-winded, but easier to decipher:
/\b([89]\d{3}|\d[89]\d{2}|\d{2}[89]\d|\d{3}[89])\b/g
It also restricts the search to 4-digit groups.
How about:
/\b((?:[\d]+)?[89](?:[\d]+)?)\b/g
Online Demo
\b matches at the beginning and the end of each number.
(?:[\d]+)? is an optional non-capturing group of digits; it is optional so that [89] can appear at the beginning, at the end, or in the middle of the number.
?: makes the group non-capturing. This is optional in this expression, but there was no need to capture the sub-groups.
You can use this pattern:
[0-7]*(?:8[0-8]*9|9[0-9]*8)[0-9]*
or with a backreference:
(?:[0-9]*(?!\1)([89])){2}[0-9]*
re.findall(r"(\d\d[0-7][89])|(\d\d[89][0-7])|(\d\d[89][89])",x)
Works for the input given.
Slightly simpler regex with lookahead:
(?=\d*[89])\d+
Demo
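Putting the "contains 8 or 9" reading of the question into Python, a minimal sketch could look like this. The names HAS_8_OR_9 and drop_8_or_9 are hypothetical, and the input is a short excerpt rather than the full list.

```python
import re

HAS_8_OR_9 = re.compile(r"[89]")


def drop_8_or_9(numbers: str) -> str:
    """Remove every whitespace-separated number that contains an 8 or a 9."""
    return " ".join(n for n in numbers.split() if not HAS_8_OR_9.search(n))


excerpt = "7001 7008 7009 7010 7018 7077 7080 7100"
cleaned = drop_8_or_9(excerpt)

# The regex-only variant finds the numbers that would be removed.
removed = re.findall(r"\b\d*[89]\d*\b", excerpt)
```

Splitting first keeps the pattern trivial; the regex-only variant is closer to the answers above and is useful when the numbers have to be replaced inside a larger text.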

How to match lines not ending (-1)\r\n

I'm trying to match the first 5 lines, and the last line, in this sample:
-- 2012-09-20 rep +6 = 184
1 12532070 (2)
2 12531806 (5)
2 12531806 (5)
-- 2012-09-21 rep +12 = 196
3 125xxxxx (-1)
3 125xxxxx (-1)
16 12557052 (2)
Leaving the following unmatched:
3 125xxxxx (-1)
3 125xxxxx (-1)
I've tried the following regular expressions:
^.*[^(-1)\r\n].*
^.*[^(-1)].*\r\n
^.*[^\(-1\)\r\n].*
^.*[^\(\-1\)\r\n].*
^.*[?!\(-1)\r\n].*
^(!?.*-1.*\r\n)
But none of them do what I want (mostly matching all lines).
My RegEx skills are not brilliant - can anybody point me in the right direction?
You can use a negative lookahead:
^(?!.*\(-1\)$).*$\r\n
Rather than trying to create a regular expression for this, I would just use the surrounding language to negate the sense of the match, and use a regex that only matches lines that end in '(-1)\r\n'. For instance:
Shell: grep -v '(-1)^M$'
Perl: !/\(-1\)\r\n/
Ed/Vi: v/(-1)^M$
etc.
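The "negate the match in the surrounding language" approach might look like this in Python. It is a sketch with made-up names (ENDS_WITH_MINUS_ONE, keep_lines) and it assumes CRLF line endings as in the question; str.splitlines strips them before the regex runs.

```python
import re

# Matches only lines that END in "(-1)"; the caller negates the result.
ENDS_WITH_MINUS_ONE = re.compile(r"\(-1\)$")


def keep_lines(text: str):
    """Return the lines of text that do not end in (-1)."""
    # splitlines() handles both \n and \r\n, so no \r reaches the regex.
    return [line for line in text.splitlines()
            if not ENDS_WITH_MINUS_ONE.search(line)]


sample = "\r\n".join([
    "-- 2012-09-20 rep +6 = 184",
    "1 12532070 (2)",
    "2 12531806 (5)",
    "2 12531806 (5)",
    "-- 2012-09-21 rep +12 = 196",
    "3 125xxxxx (-1)",
    "3 125xxxxx (-1)",
    "16 12557052 (2)",
]) + "\r\n"

kept = keep_lines(sample)
```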