Regex to match all lines after a specific string - regex

Possible duplicate of Regex - find all lines after a match: although my need is a little different.
I want to parse a plain text file with multiple date/value data separated by specific strings. I want to skip the first half of the file until a specific line where I want to match the results.
Here is an example of the file in question (including the mess with tabulations and spaces):
I dont want to capture the following measures. This text is on a single line and contains tabs and spaces is also ends with this token : Token1
05/01/1969 0.01846
15/01/1969 0.16730
25/01/1969 0.33988
05/04/1969 0.81319
15/04/1969 0.76973
25/11/2011 0.24210
05/12/2011 0.25220
15/12/2011 0.31160
25/12/2011 0.36845
End : bla bla bla
This text is also on a single line and marks the beginning of a new series of results. These are the results that I want. it also ends with the following token : Token2
05/01/1969 109.46333
15/01/1969 110.06998 118.18000
25/01/1969 110.82954
05/02/1969 111.51394 118.83000
25/02/1969 112.36483
05/10/2011 114.38798 114.31000
05/10/2011 114.31000 114.38798 114.38798 114.38798 114.38798 114.38798 114.38798
25/12/2011 112.64000 112.41261 112.86301 113.25494 114.06421 115.93219 116.38780
05/01/2012 112.22834 112.92301 113.40561 114.78823 116.62931 117.43421
05/09/2012 110.01410 112.16391 112.88199 115.23640 117.04756 118.04632
15/09/2012 109.97572 112.00809 112.70266 114.91247 116.65256 117.57412
25/09/2012 109.93967 111.87272 112.53305 114.60381 116.26935 117.12756
End : Marks the end of the file
What I wish to do is to match every line after the line which ends with Token2. I have tried different solutions from the other similar questions but none work. I ended up matching all the results of the file and considered splitting it before applying the following pattern. Is there a pure regex solution to this ?
Here is the pattern that works for the whole file. With named capture groups :
(?P<date>\d\d\/\d\d\/\d\d\d\d)\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*){0,1}[\t ]*(?P<prev_no_rain>\d+\.*\d*){0,1}[\t ]*(?P<prev_10_dry>\d+\.*\d*){0,1}[\t ]*(?P<prev_20_dry>\d+\.*\d*){0,1}[\t ]*(?P<prev_50>\d+\.*\d*){0,1}[\t ]*(?P<prev_20_wet>\d+\.*\d*){0,1}[\t ]*(?P<prev_10_wet>\d+\.*\d*){0,1}
Regex101 link : https://regex101.com/r/a0mCZ2/3

You can use the \G anchor, which matches at the start of the string (this can be excluded with a negative lookahead, (?!\A)) and at the position where the previous successful match ended. With (?:\G(?!\A)|\bToken2[\r\n]+), the regex engine first finds the whole word Token2 at the end of a line (together with the line break characters) and then matches the following subpatterns only if they come in immediate succession.
A regex that can be used:
(?:\G(?!\A)[\r\n]*|Token2[\r\n]+)\K(?P<date>\d\d\/\d\d\/\d{4})\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*)?[\t ]*(?P<prev_no_rain>\d+(?:\.\d+)*)?[\t ]*(?P<prev_10_dry>\d+\.*\d*)?[\t ]*(?P<prev_20_dry>\d+\.*\d*)?[\t ]*(?P<prev_50>\d+\.*\d*)?[\t ]*(?P<prev_20_wet>\d+\.*\d*)?[\t ]*(?P<prev_10_wet>\d+\.*\d*)?
See the regex demo. Note I replaced {0,1} with ? to shorten it a bit.
The part you are interested in is (?:\G(?!\A)[\r\n]*|Token2[\r\n]+)\K.
(?:\G(?!\A)[\r\n]*|Token2[\r\n]+) - 1 of two alternatives:
\G(?!\A)[\r\n]* - end of the previous successful match and 0+ linebreak symbols
| - or
Token2[\r\n]+ - Token2 followed with 1+ CR or LFs. (If you need to match Token2 as a whole word, you might add \b before it).
\K - omit the text matched so far.
The (?P<date>\d\d\/\d\d\/\d{4})\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*)?[\t ]*(?P<prev_no_rain>\d+(?:\.\d+)*)?[\t ]*(?P<prev_10_dry>\d+\.*\d*)?[\t ]*(?P<prev_20_dry>\d+\.*\d*)?[\t ]*(?P<prev_50>\d+\.*\d*)?[\t ]*(?P<prev_20_wet>\d+\.*\d*)?[\t ]*(?P<prev_10_wet>\d+\.*\d*)? part is your pattern, which I did not modify much and which matches a line with the specific data (note that the fact that it matches a whole line justifies the use of [\r\n]* after \G(?!\A)).
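Engines without \G and \K (Python's built-in re module is one) can get a similar result in two steps: locate the line that ends with Token2, then scan for data rows only from that position onward. The following is a minimal sketch of that swapped-in technique, not the \G approach itself. The sample text, the simplified row pattern and the function name rows_after_token are assumptions made for illustration. Unlike the \G version, it does not require the rows to be contiguous.

```python
import re

# Hypothetical sample shaped like the file in the question.
SAMPLE = (
    "Skip this series. Token1\n"
    "05/01/1969 0.01846\n"
    "End : bla bla bla\n"
    "Keep this series. Token2\n"
    "05/01/1969 109.46333\n"
    "15/01/1969 110.06998 \t118.18000\n"
    "End : Marks the end of the file\n"
)

# A line that ends with Token2 (trailing spaces or tabs are tolerated).
HEADER = re.compile(r"^.*\bToken2[ \t]*$", re.MULTILINE)

# Simplified row pattern (an assumption): a date followed by one or more numbers.
ROW = re.compile(
    r"^(?P<date>\d\d/\d\d/\d{4})[ \t]+"
    r"(?P<values>\d+(?:\.\d+)?(?:[ \t]+\d+(?:\.\d+)?)*)[ \t]*$",
    re.MULTILINE,
)


def rows_after_token(text):
    """Return (date, [values]) for every data row after the Token2 line."""
    header = HEADER.search(text)
    if header is None:
        return []
    # Start scanning at the end of the header line; earlier rows are never seen.
    return [
        (m.group("date"), [float(v) for v in m.group("values").split()])
        for m in ROW.finditer(text, header.end())
    ]


for date, values in rows_after_token(SAMPLE):
    print(date, values)
```

The named groups of the original pattern could be substituted for the simplified values group if each column needs its own name.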

Related

Remove duplicate lines containing same starting text

So I have a massive list of numbers where all lines contain the same format.
#976B4B|B|0|0
#970000|B|0|1
#974B00|B|0|2
#979700|B|0|3
#4B9700|B|0|4
#009700|B|0|5
#00974B|B|0|6
#009797|B|0|7
#004B97|B|0|8
#000097|B|0|9
#4B0097|B|0|10
#970097|B|0|11
#97004B|B|0|12
#970000|B|0|13
#974B00|B|0|14
#979700|B|0|15
#4B9700|B|0|16
#009700|B|0|17
#00974B|B|0|18
#009797|B|0|19
#004B97|B|0|20
#000097|B|0|21
#4B0097|B|0|22
#970097|B|0|23
#97004B|B|0|24
#2C2C2C|B|0|25
#979797|B|0|26
#676767|B|0|27
#97694A|B|0|28
#020202|B|0|29
#6894B4|B|0|30
#976B4B|B|0|31
#808080|B|1|0
#800000|B|1|1
#803F00|B|1|2
#808000|B|1|3
What I am trying to do is remove all duplicate lines that contain the same hex codes, regardless of the text after it.
Example, in the first line #976B4B|B|0|0 the hex #976B4B shows up in line 32 as #976B4B|B|0|31. I want all lines EXCEPT the first occurrence to be removed.
I have been attempting to use regex to solve this, and found that replacing ^(.*)(\r?\n\1)+$ with $1 can remove duplicate lines, but that is obviously not what I need. I am looking for some guidance and maybe a chance to learn from this.
You can use the following regex replacement. Make sure you click Replace All as many times as necessary, until no match is found:
Find What: ^((#[[:xdigit:]]+)\|.*(?:\R.+)*?)\R\2\|.*
Replace With: $1
See the regex demo.
Details:
^ - start of a line
((#[[:xdigit:]]+)\|.*(?:\R.+)*?) - Group 1 ($1, it will be kept):
(#[[:xdigit:]]+) - Group 2: # and one or more hex chars
\| - a | char
.* - the rest of the line
(?:\R.+)*? - any zero or more non-empty lines (if they can be empty, replace .+ with .*)
\R\2\|.* - a line break, Group 2 value, | and the rest of the line.
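If running a script is an option, the repeated Replace All passes can be avoided. The sketch below is a different technique from the regex replacement above: it makes a single pass and remembers which hex codes it has already seen. The function name drop_repeated_hex and the shortened sample list are assumptions for illustration.

```python
def drop_repeated_hex(lines):
    """Keep only the first line seen for each hex code (the text before the first '|')."""
    seen = set()
    kept = []
    for line in lines:
        key = line.split("|", 1)[0]  # e.g. "#976B4B"
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Shortened sample taken from the list in the question.
sample = [
    "#976B4B|B|0|0",
    "#970000|B|0|1",
    "#970000|B|0|13",
    "#976B4B|B|0|31",
    "#808080|B|1|0",
]

print(drop_repeated_hex(sample))
# → ['#976B4B|B|0|0', '#970000|B|0|1', '#808080|B|1|0']
```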

Regex to replace block comment with line comment

There are tons of examples to do the conversion from C-style line comment to 1-line block comment. But I need to do the opposite: find a regex to replace multi-line block comment with line comments.
From:
This text must not be touched
/*
This
is
random
text
*/
This text must not be touched
To
This text must not be touched
// This
// is
// random
// text
This text must not be touched
I was thinking if there's a way to represent "each line" concept in regex, then just add // in front of each line. Something like
\/\*\n(?:(.+)\n)+\*\/ -> // $1
But the greediness nature of the regex engine makes $1 just match the last line before */. I know Perl and other languages have some advanced regex features like recursion, but I need to do this in a standard engine. Is there any trick to accomplish this?
EDIT: To clarify, I'm looking for pure regex solution, not involving any programming language. Should be testable on sites like https://regex101.com/.
If you are interested in a single regex pass in the modern JavaScript engine (and other regex engines supporting infinite length patterns in lookbehinds), you can use
/(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n)(?=[\s\S]*?^\*\/)|(?:\r?\n)?(?:^\/\*|^\*\/)/gm
Replace with $1$1, see the regex demo.
Details
(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n) - a positive lookbehind that matches a location that is immediately preceded with
^(\/)\* - /* substring at the start of a line (with / captured into Group 1)
(?:(?!^\/\*)[\s\S])*? - any char, zero or more occurrences, as few as possible, not starting a /* char sequence that appears at the start of a line
\r?\n - a CRLF or LF ending
(?=[\s\S]*?^\*\/) - a positive lookahead that requires any 0 or more chars as few as possible followed with */ at the start of a line, immediately to the right of the current location
| - or
(?:\r?\n)? - an optional CRLF or LF linebreak
(?:^\/\*|^\*\/) - and then either /* or */ at the start of a line.
As usual in such cases, two regular expressions, the second applied to the matches of the first, can do what a single one cannot.
const txt = `This text must not be touched
/*
This
is
random
text
*/
This text must not be touched`;
const to1line = str => str.replace(
  /\/\*\s*(.*?)\s*\*\//gs,
  // Prefix every line of the captured comment body with "// "
  (_, comment) => comment.replace(/^/mg, '// ')
);
console.log( to1line( txt ));
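The same two-pass idea can be sketched in Python, where re.sub accepts a function as the replacement. This is a hedged port for illustration only; the name to_line_comments is an assumption, and blank lines inside a comment would end up as "// " with a trailing space.

```python
import re

# First pass: find each /* ... */ block and capture its body without the
# surrounding whitespace.
BLOCK = re.compile(r"/\*\s*(.*?)\s*\*/", re.DOTALL)


def to_line_comments(text):
    """Rewrite each /* ... */ block as one // comment per line."""
    def convert(match):
        # Second pass: prefix every line of the captured body with "// ".
        return re.sub(r"^", "// ", match.group(1), flags=re.MULTILINE)
    return BLOCK.sub(convert, text)


txt = "This text must not be touched\n/*\nThis\nis\nrandom\ntext\n*/\nThis text must not be touched"
print(to_line_comments(txt))
```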

Regex which grabs everything between two characters at the end of a line

I'm looking to create a regex which grabs the text between two ":"s but only if it is the "last set", for example:
\--- org.codehaus.groovy.modules.http-builder:http-builder:0.7.1
should return:
http-builder
It should be noted that it's possible to get something like:
\--- org::codehaus::groovy::modules::http-builder:http-builder:0.7.1
because the input does not necessarily follow conventions (based on the problem at hand) but the required information is ALWAYS in the last two ":"s.
I've tried some of the following (minus the end of line):
1) (?<=\:).*(?=\:)
2) [^(.*:)].*[^(:.*)]
3) :.*: (this was the most successful, although I got the ":"s with the result but there are issues when there is more than one set of ":"s)
Further information:
I need to use Groovy for this
I can read it using a stream or a file (in case that matters)
Thanks for reading and any help!
:([^:]*):[^:]*$
That means:
Sequence must start with a :
Then start capturing (
Capture all characters that are not colons [^:]*
End capturing ) ...
... at the next colon :
Then there's another sequence of chars [^:]*
And after that sequence the line must end $ (no more sequence)
Or if you can use non-greedy matches, you can also use
:(.*?):[^:]*$
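.* means capture as many characters as possible, while .*? means capture as few characters as possible. Not all regex implementations support that, though. Also note that a search still starts at the first colon in the line, so the non-greedy version only returns the right text when the line contains exactly two colons; the [^:]* version does not have this limitation.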
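A quick way to compare the two patterns is to run both against the two sample lines from the question. The sketch below uses Python's re module purely for illustration (the question itself targets Groovy); the helper name between_last_colons is an assumption.

```python
import re

LAST_PAIR = re.compile(r":([^:]*):[^:]*$")  # character-class version
LAZY = re.compile(r":(.*?):[^:]*$")         # non-greedy version

lines = [
    r"\--- org.codehaus.groovy.modules.http-builder:http-builder:0.7.1",
    r"\--- org::codehaus::groovy::modules::http-builder:http-builder:0.7.1",
]


def between_last_colons(line):
    """Return the text between the last two colons, or None if there is no match."""
    m = LAST_PAIR.search(line)
    return m.group(1) if m else None


for line in lines:
    print(between_last_colons(line), "|", LAZY.search(line).group(1))
# → http-builder | http-builder
# → http-builder | :codehaus::groovy::modules::http-builder:http-builder
```

On the second line the non-greedy pattern begins at the first colon and therefore captures far more than intended.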
How about splitting on the : and grabbing the next-to-last segment?
['org.codehaus.groovy.modules.http-builder:http-builder:0.7.1',
/\--- org::codehaus::groovy::modules::http-builder:http-builder:0.7.1/].each { line ->
assert 'http-builder' == line.split(':')[-2]
}

How remove 1st ":" word from line in txt file?

Please see my textfile data below
roydwk27:teenaibuchytilibu5762sumonkhan:IJQRiq&76:8801627574057
deonnarsi15:latashajcclaypoolejcv5946sumonkhan:JKVWjv&20:8801627573929
ernaalo68:lindaohschletteoha1797sumonkhan:OPYZoy&84:8801628302709
dorathyshi56:fredrickaslperkinsonsle8932sumonkhan:STJKsj&30:8801621846709
londassg15:nataliaunmcredmondung5478sumonkhan:UVDEud&61:8801624792536
xiaoexu39:miriamfyboatwrightfyr3810sumonkhan:IJZAiz&47:8801626854856
I want to delete the first word of each line, up to and including the first :
like
roydwk27:
deonnarsi15:
ernaalo68:
dorathyshi56:
To be precise: if the part containing sumonkhan is already at the start of the line, there is no problem, but if something followed by : comes before it, that part needs to be removed.
For example, lines like these in my .txt file are fine:
nataliaunmcredmondung5478sumonkhan:UVDEud&61:8801624792536
miriamfyboatwrightfyr3810sumonkhan:IJZAiz&47:8801626854856
Every line contains sumonkhan. If the sumonkhan part is in the starting position like this, the line is good; otherwise delete the leading word with its : (only that word, not the full line).
I hope this regex helps you. It deletes everything up to and including the first colon (:).
If you are reading a file, read it line by line and run the following regex on each line.
$str = 'roydwk27:teenaibuchytilibu5762sumonkhan:IJQRiq&76:8801627574057';
$str =~ s/^(?:.*?):(.*)/$1/g;
This code is in Perl; you can rewrite it as equivalent code in any other language.
See this demo at regex101.com.
^[\w\d]+:(.*)
^ // match the beginning of a line
[\w\d]+ // match one or more word characters (letters, digits, underscore); \d is redundant here because \w already includes digits
: // match ":" literally
( // start of the capturing group
.* // match any characters
) // end of capturing group
The first capturing group of each match now holds the text you want. Note the g (global) and m (multiline) modifiers: m lets ^ match at the start of every line, and g finds all matches rather than only the first.
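Neither pattern above checks whether the first field already contains sumonkhan. If the rule is read as "strip the leading word only when the first field does not contain sumonkhan" (an interpretation of the question, not something it states precisely), a negative lookahead can express it. A minimal Python sketch, with the hypothetical name strip_first_field:

```python
import re

# Remove the leading "word:" unless that first field already contains the marker.
# Treating "sumonkhan" as the marker is an assumption based on the sample data.
FIRST_FIELD = re.compile(r"^(?![^:]*sumonkhan)[^:]*:")


def strip_first_field(line):
    """Drop the first colon-terminated field when it lacks the marker."""
    return FIRST_FIELD.sub("", line, count=1)


print(strip_first_field("roydwk27:teenaibuchytilibu5762sumonkhan:IJQRiq&76:8801627574057"))
# → teenaibuchytilibu5762sumonkhan:IJQRiq&76:8801627574057
print(strip_first_field("nataliaunmcredmondung5478sumonkhan:UVDEud&61:8801624792536"))
# → nataliaunmcredmondung5478sumonkhan:UVDEud&61:8801624792536
```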

Regular expression to get only the first word from each line

I have a text file
#sp_id int,
#sp_name varchar(120),
#sp_gender varchar(10),
#sp_date_of_birth varchar(10),
#sp_address varchar(120),
#sp_is_active int,
#sp_role int
Here, I want to get only the first word from each line. How can I do this? The spaces between the words may be space or tab etc.
Here is what I suggest:
Find what: ^([^ \t]+).*
Replace with: $1
Explanation: ^ matches the start of line, ([^ \t]+) matches 1 or more (due to +) characters other than space and tab (due to [^ \t]), and then any number of characters up to the end of the line with .*.
In case you might have leading whitespace, you might want to use
^\s*([^ \t]+).*
I did something similar with this:
import re

with open('handles.txt', 'r') as handles:
    handlelist = [line.rstrip('\n') for line in handles]
# Assumes every line has at least one word; an empty line would raise IndexError
newlist = [re.findall(r"\w+", line)[0] for line in handlelist]
This reads all the lines of the document into a list,
then uses a regex on each line to extract the first word (ignoring whitespace).
My file (handles.txt) contained info like this:
JoIyke - personal twitter link;
newMan - another twitter handle;
yourlink - yet another one.
The code will return this list:
['JoIyke', 'newMan', 'yourlink']
Find What: ^(\S+).*$
Replace by : \1
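You can simply use this to get the first word. Here we capture the first word in a group and replace the whole line with the captured group.
Find the first word of each line with /^\w+/gm. Note that \w does not match #, so for lines that start with #, as in the sample above, use /^\S+/gm instead.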
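The difference between \S+ and \w+ matters for the sample in the question, because every line there starts with #. A short Python sketch to illustrate; the three-line sample is abbreviated from the question and the leading spaces on the last line are added as an assumption to show the whitespace handling:

```python
import re

text = "#sp_id int,\n#sp_name\tvarchar(120),\n  #sp_role int"

# First run of non-whitespace on each line, allowing leading spaces or tabs.
first_words = re.findall(r"^[ \t]*(\S+)", text, flags=re.MULTILINE)
print(first_words)
# → ['#sp_id', '#sp_name', '#sp_role']

# \w does not match '#', so this pattern finds nothing in the sample.
word_only = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(word_only)
# → []
```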