Vimgrep before any empty line - regex

I have a lot of files which start with some tags I defined.
Example:
=Title
#context
!todo
#topic
#subject
#etc

And some text (notice the blank line just before this text).
Foo
Bar
I'd like to write a Vim search command (with vimgrep) that matches something only before an empty line.
How do I grep only in the lines before the first blank line? Would that make the grep faster? Please, no need to mention :grep or external binaries like Ag (The Silver Searcher).
I know \_.* matches everything including EOL. I know the negation [^foo]. I managed to match everything but empty lines with /[^^$]. But I didn't manage to compose my :vimgrep command. Thank you for your help!

If you want a general solution that works for any file content, then AFAIK you can't get one with that form of text. You may ask why.
Explanation:
vimgrep requires a pattern argument and searches line by line, behaving exactly like the :global command.
In your case we need to get the first part, the one preceding the first blank line. (This can be extended to: get the first non-blank text.)
Let's call:
A: every block of text that contains no blank line
x: blank lines
Only with the following 5 forms of file content can you get the first A block with vimgrep (cases 2, 4 and 5 are obvious):
1 | 2 | 3 | 4 | 5
x | x | A | x | A
A | A | x | A | x
x | x | A |   |
A |   |   |   |
Looking at your file, it has this form:
A
x
A
x
A
The middle block causes a problem, which is why you cannot isolate the first A unless you delimit it with some known UNIQUE text.
So the only solution I can come up with, covering only those 5 cases, is:
:vimgrep /\_.\{-}\(\(\n\s*\n\)\+\)\@=/ %

AFAIK the most you can do with :vimgrep is use the \%<XXl atom to restrict the search to lines above a specific line number:
:vim /\%<20lfunction/ *.vim
That command will find all instances of function above line 20 in the given files.
See :help \%l.

[...] always matches a single character. [^^$] matches a character that is not ^ or $. This is not what you want.
One of the things you can do is:
/\%^\%(.\+\n\)\{-}.\{-}\zsfoo/
This matches
\%^ - the beginning of the file
\%( \) - a non-capturing group
\{-} - ... repeated 0 or more times (as few as possible)
.\+ - 1 or more non-newline characters
\n - a newline
.\{-} - 0 or more non-newline characters (as few as possible)
\zs - the official start of the match
This will find the first occurrence of foo, starting from the beginning of the file, searching only non-empty lines. But that's all it does: You can't use it to find multiple matches.
Alternatively:
/\%(^\n\_.*\)\@<!foo/
\%( \) - a non-capturing group
\@<! - not-preceded-by modifier
^ - beginning of line
\n - newline
\_.* - 0 or more of any character
This matches every occurrence of foo that is not preceded anywhere by an empty line (i.e. a beginning-of-line / newline combo).
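To check what such a pattern is supposed to find, the same restriction can be emulated outside Vim. This is only a sketch in Python, not a replacement for :vimgrep. The function name matches_before_blank and the sample text are made up for illustration, and "empty line" is assumed to mean a line with no characters at all, as in the ^\n pattern above.

```python
import re


def matches_before_blank(text, pattern):
    """Return (line_number, line) pairs that match `pattern` and sit
    above the first empty line of `text`."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line == "":  # first empty line: the tag block is over
            break
        if re.search(pattern, line):
            hits.append((number, line))
    return hits


# Hypothetical file content: three tag lines, an empty line, then body text.
sample = "=Title\n#context\n!todo\n\nSome text mentioning #context\n"
print(matches_before_blank(sample, r"#context"))
```

Only the tag on line 2 is reported; the mention of #context in the body is skipped because it comes after the empty line.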

Related

Parse SWIFT(Financial) message string with REGEX in Powershell

I am working on a PowerShell script to parse SWIFT messages (text based) into a database. I am using regex to find the appropriate strings in the file and extract them. I have now run into the issue that one of the data fields can contain CR/LF characters; in the example below I would need to extract the second line as well.
:61:2111261126D12000,00NTRF11000004217657P//03MT211124101166
JANE DOE 1232
I tested the regex pattern (:61:.*[\r\n].*) in RegExr, and it treats the [\r\n] characters as required for a match. My plan was therefore to use two expressions, one with and one without the CR/LF characters, to identify messages with and without a line break. However, the code below returns all matches regardless of whether a line break is included. It seems that PowerShell stops evaluating strings after CR/LF.
$transaction = $swift | select-string ':61:.*[\r\n].*' -AllMatches | % { $_.Matches } | % { $_.Value }
Can I use REGEX for this task or do I have to create a function to read the entire string and check for the next line tag to determine the end of this string?
Describe the first line more accurately, then whatever is left is necessarily the message:
$swift = @'
:61:2111261126D12000,00NTRF11000004217657P//03MT211124101166
JANE DOE 1232
'@
$swift |Select-String -Pattern '(?m):\d+:[^,]+,[^/]+//\d+MT\d+[\s\r\n]+.*$'
The regex pattern breaks down as follows:
(?m) # Multi-line mode, this will make `$` match end-of-line positions as well as end-of-string
:\d+: # 1 or more digits, surrounded by colons, matches `:61:`
[^,]+, # 1 or more non-commas followed by a comma, matches `2111261126D12000,`
[^/]+// # 1 or more non-slashes, followed by 2 slashes, matches `00NTRF11000004217657P//`
\d+MT\d+ # 1 or more digits followed by `MT` and more digits, matches `03MT211124101166`
[\s\r\n]+ # 1 or more white-space/CR/LF characters
.*$ # everything until the end of the current line, matches `JANE DOE 1232`
Since we're using [\s\r\n]+ to describe the potential line break, it'll still work when the linebreak is replaced with other whitespace characters.
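The question also raises the idea of checking whether the next line starts with a tag. A minimal Python sketch of that idea follows. It assumes that every tag line starts with a colon and that a :61: field has at most one continuation line; neither assumption is confirmed by the question. The names FIELD_61 and extract_61 and the sample message are made up for illustration.

```python
import re

# ":61:" at the start of a line, the rest of that line, and optionally one
# following non-empty line that does not begin with ":" (i.e. is not a new tag).
FIELD_61 = re.compile(
    r"^:61:([^\r\n]*)(?:\r?\n(?!:)([^\r\n]+))?",
    re.MULTILINE,
)


def extract_61(message):
    """Return each :61: value, joined with its continuation line if present."""
    values = []
    for match in FIELD_61.finditer(message):
        first, second = match.group(1), match.group(2)
        values.append(first if second is None else first + "\n" + second)
    return values


# Hypothetical message: one :61: field with a second line, one without.
message = (
    ":20:REF\r\n"
    ":61:2111261126D12000,00NTRF11000004217657P//03MT211124101166\r\n"
    "JANE DOE 1232\r\n"
    ":61:2111261126D5,00NTRFX//Y\r\n"
    ":62F:END\r\n"
)
for value in extract_61(message):
    print(repr(value))
```

The negative lookahead (?!:) is what stops the second :61: entry from swallowing the :62F: line that follows it.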

Regex POSIX - How can i find if the start of a line contains a word from a word that appears later in line

I have a UNIX passwd file and I need to find, using egrep, whether the first 7 characters of the GECOS field appear inside the username. For example, I want to check whether the username (jkennedy) contains the word "kennedy" from the GECOS.
I was planning to use back-references, but the username comes before the GECOS field, so I don't know how to implement it.
For example the passwd file contains this line:
jkennedy:x:2473:1067:kennedy john:/root:/bin/bash
As per my original comment, the regex below works for me.
See it in use here - note this regex differs slightly as it's more used for display purposes. The regex below is the POSIX version of this and removes non-capture groups and the unneeded capture group around the backreference.
^[^:]*([^:]{7})([^:]*:){4}\1.*$
^ assert position at the start of the line
[^:]* match any character except : any number of times
([^:]{7}) capture exactly seven of any character except :
([^:]*:){4} match the following exactly four times
[^:]*: match any character except : any number of times, followed by : literally
\1 match the backreference; matches what was previously matched by the first capture group
.* match any character (except newline characters) any number of times
$ assert position at the end of the line
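The same pattern can be tried outside egrep. The sketch below uses Python's re module, which accepts this expression unchanged; the constant and function names are made up for illustration, and the second test line is a hypothetical counterexample.

```python
import re

# Same expression as above: capture 7 non-colon characters inside the
# username, skip four colon-terminated fields, then require the captured
# text at the very start of the GECOS field.
GECOS_IN_USER = re.compile(r"^[^:]*([^:]{7})([^:]*:){4}\1.*$")


def username_contains_gecos_prefix(passwd_line):
    """True if the first 7 GECOS characters also occur in the username."""
    return GECOS_IN_USER.search(passwd_line) is not None


print(username_contains_gecos_prefix(
    "jkennedy:x:2473:1067:kennedy john:/root:/bin/bash"))
print(username_contains_gecos_prefix(
    "jkennedy:x:2473:1067:smith john:/root:/bin/bash"))
```

The first call prints True and the second prints False, because no 7-character slice of jkennedy equals "smith j".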
Assuming you do NOT want case sensitivity to foul your matching -
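declare -l tmpUsr tmpName   # -l lowercases values on assignment (bash 4+)
while IFS=: read -r usr x x x name x
do  tmpUsr="$usr"; tmpName="$name"
    # skip empty GECOS fields; the quoted right-hand side makes =~ match literally
    (( ${#name} )) && [[ "$tmpUsr" =~ "${tmpName:0:7}" ]] &&
      printf '%s (%s<%s>)\n' "$usr" "$name" "${tmpName:0:7}"
done </etc/passwd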

Notepad++ regex to insert character every nth character from a starting position

How do you use regex to insert | every two characters from a starting position to the end of the line?
Using regex on the following sample (tshark output of packet data), the regex inserts | after the first two characters and the next two characters, but does not apply the pattern to the rest of the line. I think the issue is with a repeated pattern on the 2nd grouping (or lack thereof).
Sample:
1478646603.255173000 10.10.10.1 0000000000000000000000
^(.{34})(..) replace with \1|\2| OR ^(.{34})(.*?(..)) replace with \1|\2
Produces this:
1478646603.255173000 10.10.10.1 00|00|000000000000000000
What I want is:
1478646603.255173000 10.10.10.1 00|00|00|00|00|00|00|00|00|00|00
You may use
(?:\G(?!^)|^.{36})\K..(?!$)
and replace with $&|.
Details:
(?:\G(?!^)|^.{36}) - matches the location at the end of the previous successful match (with \G(?!^)) or (|) the start of a line (^) and the first 36 characters other than linebreak chars (.{36})
\K - the match reset operator that discards the whole text matched so far
.. - any 2 chars other than linebreak chars
(?!$) - that are not at the end of the string.
The replacement pattern only contains the backreference to the whole match ($&) and a | pipe symbol (a literal symbol in the replacement pattern).
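For comparison outside Notepad++: Python's standard re module has neither \G nor \K, so the sketch below swaps in plain string slicing to get the same result. The function name, the start offset and the shortened sample line are assumptions made for illustration; start plays the role of the 36 in the pattern above.

```python
def insert_separator(line, start, width=2, sep="|"):
    """Leave the first `start` characters alone, then put `sep` between
    every `width`-character chunk of the rest (none after the last chunk)."""
    head, tail = line[:start], line[start:]
    chunks = [tail[i:i + width] for i in range(0, len(tail), width)]
    return head + sep.join(chunks)


# Hypothetical shortened line: a 6-character prefix followed by hex data.
print(insert_separator("ts ip 0000000000", 6))
```

Joining the chunks with the separator mirrors the (?!$) check: no pipe is added after the final pair.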

Match the line after delimiter and shorter than 9

Regex and I have never got along.
Every day I get an email from my supervisor.
It contains 1000+ lines that need to be sorted.
Each line looks like this:
name|code
The goal is to separate them into 2 files.
Example:
Garry Cooper|abc123h1n1
Andy Morray|abcd
John Travolta|123567
Simon Person | abcd1
What I do:
I look at what comes after the | character.
I remove the whole line:
if the code contains numbers only
and/or contains letters only
and/or is shorter than 9 characters
The example list becomes:
Garry Cooper|abc123h1n1
I do these steps daily, and sometimes I get 2000 lines, which is a real pain.
I used to work with regex in Notepad++,
but I can't find the pattern for this one.
I'm also not too bad at PHP.
Please help me.
UPDATE 01:
regex found (?i)^[^|]*\|\h*[a-z\d]{0,8}$\R?
Current question:
writing a small PHP script or maybe reusable classes
Interface:
submit the data from a text box (HTML form) or from a txt file
Processing:
lines that match the regex go into a downloadable txt file,
the others into a second file
Output:
2 links to the files
Thank you all in advance for your help.
If you just use a greedy dot matching with .* you do not check the length. It can be checked with the limiting quantifier. To match just 0 to 8 symbols, you can use {0,8}. All but | can be matched with [^|]* negated character class.
Use
(?i)^[^|]*\|\h*[a-z\d]{0,8}$\R?
See regex demo (note that gm flags are used by default in Notepad++ regex-based search and replace).
Explanation:
^ - start of a line
[^|]* - zero or more symbols other than a pipe
\| - a literal pipe symbol
\h* - zero or more horizontal whitespace
[a-z\d]{0,8} - letters a to z and A to Z (due to (?i) case insensitive modifier) or digits, zero to 8 occurrences
$ - end of line and
\R? - one or zero (optional) line break.
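The two-file split asked for in the question can be sketched as follows. This is an illustration in Python rather than the PHP script requested, and it covers only the partitioning step, not the HTML form or the download links. Python's re module does not support \h, so [ \t] stands in for it; the names SHORT_CODE and partition are made up.

```python
import re

# Same idea as the Notepad++ pattern; the trailing \R? is dropped because
# the input is already split into individual lines.
SHORT_CODE = re.compile(r"^[^|]*\|[ \t]*[a-z\d]{0,8}$", re.IGNORECASE)


def partition(lines):
    """Split lines into (kept, rejected); rejected lines match SHORT_CODE."""
    kept, rejected = [], []
    for line in lines:
        (rejected if SHORT_CODE.match(line) else kept).append(line)
    return kept, rejected


lines = [
    "Garry Cooper|abc123h1n1",
    "Andy Morray|abcd",
    "John Travolta|123567",
    "Simon Person | abcd1",
]
kept, rejected = partition(lines)
print(kept)
print(rejected)
```

Each list can then be written to its own file. Like the pattern above, this only checks the code length (0 to 8 letters or digits), not the digits-only or letters-only conditions.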

Can't get a specific regex to work in Perl

I have a string formatted like:
project-version-project_test-type-other_info-other_info.file_type
I can strip most of the information I need out of this string in most cases. My trouble arises when my version has an extra qualifying character in it (i.e. normally 5 characters but sometimes a 6th is added).
Previously, I was using substrings to remove the excess information and get the 'project_test-type' however, now I need to switch to a regex (mostly to handle that extra version character). I could keep using substrings and change the length depending on whether I have that extra version character or not but a regex seems more appropriate here.
I tried using patterns like:
my ($type) = $_ =~ /.*-.*-(.*)-.*/;
But the extra '-' in the 'project_test-type' means I can't simply space my regex using that character.
What regex can I use to get the 'project_test-type' out of my string?
More information:
As a more human readable example, the information is grouped in the following way:
project - version - project_test-type - other_info - other_info . file_type
'project' is a simple string of chars
'version' is normally a string of 5 integers, but is sometimes followed by a char, i.e. 11111 is normal and 11111A is the rarer occurrence.
'project_test-type' is a specific test associated with a project that can have both '_' and '-' in its otherwise char name
Both cases of 'other_info' are additional bits of information for the system like an IP address or another version number. The first has no fixed length while the second is always 10 characters long
Since no field other than the desired one can contain -, any extra - belongs to the desired field.
      +--------------------------- project
      |     +--------------------- version
      |     |   +----------------- project_test-type
      |     |   |      +---------- other_info
      |     |   |      |     +---- other_info.file_type
      |     |   |      |     |
  ____| ____|  _|  ____| ____|
/^[^-]*-[^-]*-(.*)-[^-]*-[^-]*\z/
[^-] matches a character that's not a -.
[^-]* matches zero or more characters that aren't -.
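To try the pattern outside Perl, here is a small Python sketch. Python spells the end-of-string anchor \Z where Perl uses \z. The file names are hypothetical examples built to follow the layout described in the question, and the names TYPE_FIELD and extract_test_type are made up.

```python
import re

# Two dash-free fields, the wanted field (which may contain dashes),
# then two more dash-free fields up to the end of the string.
TYPE_FIELD = re.compile(r"^[^-]*-[^-]*-(.*)-[^-]*-[^-]*\Z")


def extract_test_type(name):
    """Return the project_test-type field, or None if the name does not fit."""
    match = TYPE_FIELD.match(name)
    return match.group(1) if match else None


print(extract_test_type("proj-11111A-proj_unit-smoke-10.0.0.1-0123456789.log"))
print(extract_test_type("proj-11111-simple-10.0.0.1-0123456789.log"))
```

The first call returns proj_unit-smoke and the second returns simple, so the optional sixth version character and the extra dash are both handled.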
To match everything:
/^([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)\.([a-zA-Z0-9]+)$/
[] defines character sets and ^ at the beginning of a set means "NOT". Also a - in a set usually means a range, unless it is at the beginning or end. So [^-]+ consumes as many non-dash characters as possible (at least one).
You can use
/\w+\s*-\s*\d{5}[a-zA-Z]?\s*-\s*(.*?)(?=\s*-\s*\d)/
Explanation:
\w+\s*- ==> match character sequence followed by any number of spaces and a -
\d{5}[a-zA-Z]? ==> always 5 digits with one or zero character
(.*?) => match everything in a non greedy way
(?=\s*-\s*\d) => look forward for a digit and stop (since IP starts with a digit)
Greedy/non-greedy approach
($type) = /.*?-.*?-(.*)-.*-.*/;
.*? is a non-greedy match, meaning match any number of any character, but no more than necessary to match the regular expression. Using .* between the second and third dashes is a greedy match, matching as many characters as possible while still matching the regular expression, and using this will capture words with any extra dashes in them.
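The difference is easy to see by running both patterns on the same input. The sketch below uses Python's re module, whose greedy and non-greedy quantifiers behave the same way for these patterns; the file name is a hypothetical example following the layout in the question.

```python
import re

name = "proj-11111A-proj_unit-smoke-10.0.0.1-0123456789.log"

# Pattern from the question: every .* is greedy, so the leading ones
# swallow as many dashes as they can and the capture lands too far right.
greedy = re.match(r".*-.*-(.*)-.*", name)

# Pattern from this answer: the first two fields are matched lazily,
# which leaves the extra dash to the capturing group.
lazy = re.match(r".*?-.*?-(.*)-.*-.*", name)

print(greedy.group(1))
print(lazy.group(1))
```

The first line printed is 10.0.0.1 and the second is proj_unit-smoke.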