Regex across double line break


I have the following text, and I need to extract parts out of it:
[Firstname LastName 21/06/2018 - 17:27]
Lorem Ipsum
[Foo Bar 25/01/2017 - 12:10]
Lorem Ipsum - First line
Lorem ipsum Second line
Lorem ipsum third line
Some other random text
I need to extract parts of this text, which I have almost managed to do using the following regex:
\[(?<name>\w+? \w+?) (?<date>\d{2}\/\d{2}\/\d{4}) - (?<time>\d{2}:\d{2})\]\n*(?<note>.+)
Everything works correctly except for the group labelled <note>, which only picks up the first line of the note. If there is a line break in the note, anything after the line break is not picked up.
How can I get it to match all text in the note section, until the regex finds a double line break?

Instead of looking for . (which does not match newlines by default), you can look for [^[], that is, any character other than an opening square bracket, followed by two line breaks:
\[(?<name>\w+? \w+?) (?<date>\d{2}\/\d{2}\/\d{4}) - (?<time>\d{2}:\d{2})\]\n*(?<note>[^[]+\n\n)
https://regex101.com/r/12S3ZQ/3

I have modified your original regex to give you the expected output.
\[(?<name>\w+? \w+?) (?<date>\d{2}\/\d{2}\/\d{4}) - (?<time>\d{2}:\d{2})\]\n*(?<note>(?:.+\n?)+)
It should match everything until the double line break. Notice that the only change is at the end.
Instead of...
(?<note>.+)
It is now...
(?<note>(?:.+\n?)+)
Edit: Changed the regex so it will include lines separated by ONE line break, but not two.

You may use
\[(?<name>\w+? \w+?) (?<date>\d{2}\/\d{2}\/\d{4}) - (?<time>\d{2}:\d{2})\]\s*(?<note>[\s\S]+?)(?=\n{2}|$)
The (?<note>[\s\S]+?)(?=\n{2}|$) part matches one or more characters, as few as possible, up to the first two consecutive newline characters or the end of the string.
If your regex engine supports \R construct to match any line break sequence, you can use (?=\R{2}|$).
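As a rough illustration, here is how the lookahead-based pattern could be tried out in Python. This is only a sketch: Python's re module spells named groups as (?P<name>...) instead of (?<name>...), and the sample string assumes that entries are separated by a blank line, which is a guess about the original data.

```python
import re

# Python uses (?P<name>...) for named groups; the rest of the pattern is unchanged.
ENTRY = re.compile(
    r"\[(?P<name>\w+? \w+?) (?P<date>\d{2}/\d{2}/\d{4}) - (?P<time>\d{2}:\d{2})\]"
    r"\s*(?P<note>[\s\S]+?)(?=\n{2}|$)"
)

# Hypothetical sample: entries are assumed to be separated by a blank line.
sample = (
    "[Firstname LastName 21/06/2018 - 17:27]\n"
    "Lorem Ipsum\n"
    "\n"
    "[Foo Bar 25/01/2017 - 12:10]\n"
    "Lorem Ipsum - First line\n"
    "Lorem ipsum Second line\n"
    "Lorem ipsum third line\n"
    "\n"
    "Some other random text"
)

# One dict per entry, keyed by group name.
entries = [m.groupdict() for m in ENTRY.finditer(sample)]
for entry in entries:
    print(entry["name"], entry["date"], entry["time"], repr(entry["note"]))
```

The lazy [\s\S]+? stops at the first blank line, so a multi-line note stays in one group and the trailing "Some other random text" is not attached to it.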

Related

Remove duplicate lines containing same starting text

So I have a massive list of numbers where all lines have the same format.
#976B4B|B|0|0
#970000|B|0|1
#974B00|B|0|2
#979700|B|0|3
#4B9700|B|0|4
#009700|B|0|5
#00974B|B|0|6
#009797|B|0|7
#004B97|B|0|8
#000097|B|0|9
#4B0097|B|0|10
#970097|B|0|11
#97004B|B|0|12
#970000|B|0|13
#974B00|B|0|14
#979700|B|0|15
#4B9700|B|0|16
#009700|B|0|17
#00974B|B|0|18
#009797|B|0|19
#004B97|B|0|20
#000097|B|0|21
#4B0097|B|0|22
#970097|B|0|23
#97004B|B|0|24
#2C2C2C|B|0|25
#979797|B|0|26
#676767|B|0|27
#97694A|B|0|28
#020202|B|0|29
#6894B4|B|0|30
#976B4B|B|0|31
#808080|B|1|0
#800000|B|1|1
#803F00|B|1|2
#808000|B|1|3
What I am trying to do is remove all duplicate lines that contain the same hex code, regardless of the text after it.
For example, the hex #976B4B from the first line, #976B4B|B|0|0, shows up again in line 32 as #976B4B|B|0|31. I want all lines EXCEPT the first occurrence to be removed.
I have been attempting to use regex to solve this, and found that replacing ^(.*)(\r?\n\1)+$ with $1 removes duplicate adjacent lines, but that is obviously not what I need. I am looking for some guidance and maybe a chance to learn from this.

You can use the following regex replacement. Make sure you click Replace All as many times as necessary, until no match is found:
Find What: ^((#[[:xdigit:]]+)\|.*(?:\R.+)*?)\R\2\|.*
Replace With: $1
Details:
^ - start of a line
((#[[:xdigit:]]+)\|.*(?:\R.+)*?) - Group 1 ($1, it will be kept):
(#[[:xdigit:]]+) - Group 2: # and one or more hex chars
\| - a | char
.* - the rest of the line
(?:\R.+)*? - any zero or more non-empty lines (if they can be empty, replace .+ with .*)
\R\2\|.* - a line break, Group 2 value, | and the rest of the line.
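If the list can be processed outside the editor, the same "keep the first line per hex code" rule can be sketched in a few lines of Python. This is not the regex approach above but a plain set-based alternative; the function name dedupe_by_key and the shortened sample list are made up for illustration.

```python
def dedupe_by_key(lines, sep="|"):
    """Keep only the first line seen for each key (the text before `sep`)."""
    seen = set()
    kept = []
    for line in lines:
        key = line.split(sep, 1)[0]  # e.g. "#976B4B"
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

# Shortened, hypothetical sample in the same format as the question.
sample = [
    "#976B4B|B|0|0",
    "#970000|B|0|1",
    "#970000|B|0|13",
    "#976B4B|B|0|31",
    "#808080|B|1|0",
]
result = dedupe_by_key(sample)
print("\n".join(result))
```

Unlike the repeated Replace All, this makes a single pass and preserves the original order of the surviving lines.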

Regular expression to get only the first word from each line

I have a text file
#sp_id int,
#sp_name varchar(120),
#sp_gender varchar(10),
#sp_date_of_birth varchar(10),
#sp_address varchar(120),
#sp_is_active int,
#sp_role int
Here, I want to get only the first word from each line. How can I do this? The whitespace between the words may be spaces, tabs, etc.

Here is what I suggest:
Find what: ^([^ \t]+).*
Replace with: $1
Explanation: ^ matches the start of the line, ([^ \t]+) captures 1 or more (due to +) characters other than space and tab (due to [^ \t]), and .* then matches the rest of the line so that it is dropped by the replacement.
In case you might have leading whitespace, you might want to use
^\s*([^ \t]+).*

I did something similar with this:
import re

with open('handles.txt', 'r') as handles:
    handlelist = [line.rstrip('\n') for line in handles]
# First run of word characters on each line (lines without one are skipped)
newlist = [m.group() for m in (re.search(r"\w+", line) for line in handlelist) if m]
This reads all the lines of the document into a list, then uses a regex to extract the first word of each line (ignoring leading whitespace).
My file (handles.txt) contained info like this:
JoIyke - personal twitter link;
newMan - another twitter handle;
yourlink - yet another one.
The code will return this list:
['JoIyke', 'newMan', 'yourlink']

Find What: ^(\S+).*$
Replace With: \1
You can simply use this to get the first word. Here we capture the first word in a group and replace the whole line with the captured group.
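For comparison, the same find-and-replace can be sketched in Python with re.sub in multiline mode. The sample text is adapted from the question (with one tab and one indented line added as assumptions), and [ \t]* is used instead of \s* so that leading whitespace never swallows a line break.

```python
import re

# Hypothetical sample based on the question's data.
text = "#sp_id int,\n#sp_name\tvarchar(120),\n  #sp_role int"

# Replace each line by its first run of non-whitespace characters.
first_words_text = re.sub(r"^[ \t]*(\S+).*$", r"\1", text, flags=re.MULTILINE)

# Or collect the first words as a list instead of rewriting the text.
first_words = re.findall(r"^[ \t]*(\S+)", text, flags=re.MULTILINE)

print(first_words_text)
print(first_words)
```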

Find the first word of each line with /^\w+/gm.

Add text at the end of specific lines

I know how to add something to the end of every line, but how do I add text only at the end of the lines containing specific words?
Some line of text here
Tomatoes Oranges
Mili Deci Centi
Some line of text there
Fire Flame
Dog Cat
Tall Small
Some line of text with more text
Mother farher
-------
I want to add characters at the end of the lines containing "Some line", something like this:
Some line of text here EXTRATEXT
Tomatoes Oranges
Mili Deci Centi
Some line of text there EXTRATEXT
Fire Flame
Dog Cat
Tall Small
Some line of text with more text EXTRATEXT
Mother farher
-------
The lines end in different characters, so I need to search for a pattern that is inside the line, and add text at the end of those lines.

Replace the following pattern:
Some line.*
With:
$0 EXTRATEXT
This matches from Some line up to the end of the line (.*, as . matches any character but a newline).
You can then replace the whole match ($0) with itself followed by the extra text you want.
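The same replacement can be sketched in Python, where the whole match is written \g<0> in the replacement string rather than $0. The sample text is a shortened version of the question's input.

```python
import re

# Shortened sample from the question.
text = "Some line of text here\nTomatoes Oranges\nSome line of text there\nDog Cat"

# "." does not match a newline, so ".*" stops at the end of each line.
result = re.sub(r"Some line.*", r"\g<0> EXTRATEXT", text)
print(result)
```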

Use [a-zA-Z]+\n or \w+\n, or add multiple \n+ at the end if you want to clean up empty lines too. Finally, if it's important that the first letter of the word is a capital: [A-Z][a-zA-Z]+\n

Why don't you try delimiting the regex pattern with a line break or a carriage return?
I think it might be achieved with \r\n at the end of the regex in Notepad++.

How to match the end of a line but not a paragraph with a regex in Vim?

I am trying to join all lines in a paragraph, but not join one paragraph with the next.
In my text file, the paragraph is not defined by blank lines in between them, but with a period at the end of the line. There could be white spaces after the period but it still defines the end of the paragraph.
So, I wanted to do a macro that jumps to the next end of line, not stopping on those lines that have a period at the end.
I used this regex:
[^\.\s][\s]*$
Meaning: find any character that is not a period nor a whitespace, optionally followed by whitespaces to the end of the line.
I would then apply the J command to join the matched line with the next one, and then repeat.
It works fine on RegexPal, but in Vim it stops at lines that have a period and two spaces.
What am I doing wrong?

Instead of using the regex in a macro in conjunction with the J command, how about using a regex substitution to remove linebreaks? This seems to work for me:
:%s/[^\.]\s*\zs$\n\(^\s*$\n\)*/ /
Explanation:
[^\.]\s*\zs$\n -- lines not ending with a period; start the replacement before the linebreak.
\(^\s*$\n\)* -- include any further lines containing only whitespace
This regex is then replaced with a space.

If the cursor is located at the first line of a paragraph,
one can join its lines with
:,/\.\s*$/j
To do the same for all paragraphs in a buffer, use the command
:g/^/,/\.\s*$/j

This should get you part way there: first search for shime's regexp (\.\s*$), which matches the lines that end a paragraph, then use :v//j! to join each non-matching line to the next line.
Then repeat the :v//j! command until done. (Define a macro to do it: :map v :v//j!<cr> then just hit v repeatedly.)
A better solution, if you're on a *NIX-like machine, is:
awk '/\.[ \t]*$/ { printf("%s\n", $0); next } { printf("%s ", $0) } END { printf("\n") }' <your_file >your_other_file
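The line-joining logic can also be sketched in Python, for cases where the text is processed outside Vim. This is a minimal sketch that assumes, as in the question, that a paragraph ends on a line whose last non-whitespace character is a period and that there are no blank lines; the function name join_paragraphs and the sample lines are made up.

```python
def join_paragraphs(lines):
    """Join lines until one ends with a period (trailing whitespace allowed)."""
    paragraphs, current = [], []
    for line in lines:
        current.append(line.strip())
        if line.rstrip().endswith("."):
            # This line closes the paragraph: emit it as one joined line.
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing lines with no closing period
        paragraphs.append(" ".join(current))
    return paragraphs

# Hypothetical sample; note the two spaces after the first period.
sample = [
    "First paragraph starts here",
    "and ends here.  ",
    "Second paragraph",
    "spans three",
    "lines.",
]
result = join_paragraphs(sample)
print("\n".join(result))
```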

How to find the 3rd occurrence of a pattern on a line

Today I had to align a table at only the first run of multiple spaces on each line.
For example:
<ScrollWheelDown> move window three lines down
<S-ScrollWheelDown> move window one page down
<ScrollWheelUp> move window three lines up
<S-ScrollWheelUp> move window one page up
I use the Tabular plugin to align tables, but I could not find a way to match only the first occurrence of multiple spaces and align only there.
I don't know how to do it in plain Vim either:
What would the regex be if I only want to find the 3rd occurrence of a pattern on a line?
Is the regex the same when using Tabular?

The regex would be:
/\(.\{-}\zsPATTERN\)\{3}
So if, for example, you want to change the 3rd 'foo' to 'bar' on the following line:
lorem ifoopsum foo lor foor ipsum foo dolor foo
       ^1      ^2      ^3         ^4        ^5
run:
s/\(.\{-}\zsfoo\)\{3}/bar/
to get:
lorem ifoopsum foo lor barr ipsum foo dolor foo
       ^1      ^2      ^3=bar     ^4        ^5
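Python's re module has no equivalent of Vim's \zs, so a comparable effect outside Vim needs a different technique. The sketch below walks the matches with re.finditer and splices the replacement in at the n-th one; the helper name replace_nth is hypothetical, and the replacement is inserted literally (no backreferences).

```python
import re
from itertools import islice

def replace_nth(text, pattern, repl, n):
    """Replace only the n-th (1-based) match of `pattern` with the literal `repl`."""
    match = next(islice(re.finditer(pattern, text), n - 1, None), None)
    if match is None:  # fewer than n matches: leave the text untouched
        return text
    return text[:match.start()] + repl + text[match.end():]

line = "lorem ifoopsum foo lor foor ipsum foo dolor foo"
result = replace_nth(line, "foo", "bar", 3)
print(result)
```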

I don't know if it fits your needs, but you can search this way:
Place your cursor at the beginning of the line
Type 3 / pattern Return
It places the cursor on the 3rd occurrence, counting from the cursor (highlighting all occurrences)
You can also record a macro:
qa+3nq
then @a to go to the 3rd occurrence on the next line

For Google users (like me) who search just for "regex nth occurrence": this returns the position of the last character of the third 'foo' (change {3} to your n and foo to your text):
length(regexp_replace('lorem ifoopsum foo lor foor1 ipsum foo dolor foo', '((?:.*?foo){3}).*$', '\1'))
Here (?:.*?foo) matches anything followed by 'foo', and (?:.*?foo){3} repeats that 3 times. The string from the start up to and including the 3rd repetition is captured, and the rest of the string is matched by .*$. The whole string is then replaced by the captured part, and its length is the position of the last character of the 3rd 'foo'.
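The same idea carries over to Python with little change. In this sketch the end offset of a match on (?:.*?foo){n} plays the role of the length of the captured prefix; the helper name end_of_nth is made up, and the input is assumed to be a single line, since . does not cross newlines.

```python
import re

def end_of_nth(text, word, n):
    """Return the 1-based position of the last character of the n-th `word`, or None."""
    pattern = r"(?:.*?" + re.escape(word) + r"){" + str(n) + r"}"
    m = re.match(pattern, text)  # anchored at the start of the string
    return m.end() if m else None

s = "lorem ifoopsum foo lor foor1 ipsum foo dolor foo"
position = end_of_nth(s, "foo", 3)
print(position)
```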

Try this:
:Tabularize /^.\{-}\S\s\{2,}
Yes, Tabularize uses Vim's regex, so the example in Eelvex's answer should work.
