Add text to beginning and end of a line in vim - regex

I have a file with over 5000 lines. I need to add the following characters before and after the string, JAMES, for example:
'[^-_.*]JAMES[#-\.]'
Is this correct if I am trying to say that JAMES can be in the beginning of a line, or have dashes, underscores, any character 0 or more times before JAMES, and the #, dash, or period can be followed by JAMES?
I am using a list as a whitelist to print the emails that have a certain string (e.g. JAMES) to another file. So, it should take:
cat file.txt | egrep -e -i < whitelist | sort -u > newfile.txt
file.txt has email addresses separated by new line.
So far, I used the following command in vim:
'%s/^$/[^-_.*]'
However, all this did was add [^-_.*] to the end of the file.

You probably want something like:
:g/.*/s//^[-_.]*&[-#.]*/
The .* looks for the whole line, and the :g/.*/ looks for that on every line in the file. The s//X&Y/ replaces what the g// found (the whole line) with X, what was found, and Y, except that the X is:
^[-_.]*
which means start of line followed by zero or more dashes, underscores or dots, and Y is:
[-#.]*
which means zero or more dashes, hash signs, or dots. If you need that anchored to the end of line, add a $ after the *.
Note that a literal - must appear first in a character class (after a leading ^ if the class is negated, or after a ] that has to come first to match a close square bracket). POSIX bracket expressions also accept it as the last character.
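As a rough cross-check outside vim, the same wrapping can be sketched in Python. The helper name wrap_word and the sample addresses are made up for illustration, and the sketch assumes each whitelist line is a plain word with no regex metacharacters of its own.

```python
import re

def wrap_word(word):
    # Mirror the vim substitution: put ^[-_.]* before and [-#.]* after the line.
    return f"^[-_.]*{word}[-#.]*"

pattern = wrap_word("JAMES")
print(pattern)  # ^[-_.]*JAMES[-#.]*

# Hypothetical sample addresses, matched case-insensitively like egrep -i.
emails = ["_james.smith@example.com", "bob@example.com", "JAMES#1@example.com"]
matches = [e for e in emails if re.search(pattern, e, re.IGNORECASE)]
print(matches)
```

Because the pattern starts with ^, only addresses that begin with the word (after optional dashes, underscores or dots) are kept.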

Related

How to delete leading characters until 1-9 are met

I have time spans in the format of
00h29m37s
01h31m24s
How to delete all leading characters until [1-9] is hit? For the above examples, the desired output is
29m37s
1h31m24s
We don't need to worry about 00h00m00s since it doesn't occur
$ cat input
00h29m37s
01h31m24s
$ sed 's/^[^1-9]*//' input
29m37s
1h31m24s
The [^1-9] matches all characters that are not 1 thru 9. The * makes that match the longest such string of characters. s only operates on the first match of the regex, so it effectively deletes the leading string of characters that do not match 1-9.
The ^ anchors the match to the beginning of the string, but note that it is not necessary. Without the ^, the [^1-9]* will match the leading string of length zero, so a string like 31m04s will not be edited. This is a bit confusing and worthy of mention. If you did want to edit 31m04s to remove the first longest string of characters that do not match 1 thru 9, you could do sed 's/[^1-9]\+//' or (if using a sed that does not support +) sed 's/[^1-9][^1-9]*//'
Using awk:
awk -F 'h' '+$1 > 0 { print +$1FS$2;next } { print $2 }' file
Set the field separator to "h". When the first "h"-separated field is greater than 0, print that field with its leading zero stripped (+$1), followed by the field separator and the second field, then skip to the next record. If the first field is equal to 0, print only the second field.
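The same field-splitting logic can be sketched in Python for comparison. The function name is hypothetical, and the sketch assumes every input contains exactly one "h", as in the examples.

```python
def strip_zero_hours(span):
    # Split on the first "h", like awk -F 'h' does for these inputs.
    hours, rest = span.split("h", 1)
    if int(hours) > 0:
        # int() drops the leading zero, like +$1 in awk.
        return f"{int(hours)}h{rest}"
    # Zero hours: keep only what follows the "h".
    return rest

print(strip_zero_hours("00h29m37s"))  # 29m37s
print(strip_zero_hours("01h31m24s"))  # 1h31m24s
```

Unlike the sed approach, this only looks at the hours field, so an input such as 00h05m10s keeps the leading zero of the minutes and becomes 05m10s.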

sed insert spaces between digits for long numbers

How can one use sed to insert spaces between every three digits, but only if a number is longer than 10 digits, i.e.:
blahaaaaaa goog sdd 234 3242423
ala el 213123123123
1231231313123 i 14124124141411
should turn into:
blahaaaaaa goog sdd 234 3242423
ala el 213 123 123 123
123 123 131 312 3 i 141 241 241 414 11
I can easily separate 3-digits numbers using sed 's/[0-9]\{3\}/& /g' but cannot combine that with a number length.
A single (GNU) sed command could be enough:
sed -E 's/([0-9]{10,})/\n&\n/g; :a; s/([ \n])([0-9]{3})([0-9]+\n)/\1\2 \3/; ta; s/\n//g' file
Update:
Walter A suggested a bit more concise sed expression which works fine if I haven't overlooked something:
sed -E 's/([0-9]{10,})/&\n/g; :a; s/([0-9]{3})([0-9]+\n)/\1 \2/; ta; s/\n//g' file
Explanation:
The -E flag tells sed to use extended regular expression syntax (so that the ( ) { } + characters need no backslashes).
s/([0-9]{10,})/&\n/g appends a new-line (\n) character to all digit sequences with 10 or more digits. This is in order to differentiate the digit sequences we are dealing with. The \n is a safe choice here because it cannot occur in the pattern space as read from the input line since it is the delimiter terminating the line. Notice that we are processing a single line per cycle (ie, since no multiline techniques are used, \n can be used as an anchor without interfering with other characters in the line).
:a; s/([0-9]{3})([0-9]+\n)/\1 \2/; ta This is a loop. :a is a label and could be any word (the : marks it as a label). ta means jump to the label a if the last substitution (s command) succeeded. The s command, being the body of the loop, repeatedly replaces, from left to right, a 3-digit sequence with the same 3 digits followed by a space, but only if that sequence is immediately followed by one or more digits and then a \n, until no substitution is possible.
s/\n//g removes all \n instances from the resulting pattern space. They were used as an anchor, or marker, to delimit the end of the digit sequences of 10 or more digits. Their job is done at this point.
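The marker-and-loop technique explained above can be mimicked step by step in Python. This is only a sketch for following the logic, not a replacement for the sed command; the function name is made up, and re.subn with count=1 plays the role of one s command inside the :a ... ta loop.

```python
import re

def space_long_numbers(line):
    # Step 1: mark the end of every run of 10 or more digits with a newline.
    line = re.sub(r"([0-9]{10,})", r"\1\n", line)
    # Step 2: loop like ":a; s/.../; ta" until no substitution succeeds.
    while True:
        line, count = re.subn(r"([0-9]{3})([0-9]+\n)", r"\1 \2", line, count=1)
        if count == 0:
            break
    # Step 3: remove the markers again.
    return line.replace("\n", "")

print(space_long_numbers("ala el 213123123123"))
print(space_long_numbers("1231231313123 i 14124124141411"))
print(space_long_numbers("blahaaaaaa goog sdd 234 3242423"))
```

Short numbers never receive a marker, so the loop's pattern cannot match them and they pass through unchanged.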
When you need to meet a complex set of requirements like this, it is more convenient to use perl:
perl -i -pe 's/\d{10,}/$&=~s|\d{3}|$& |gr/ge' file
Here,
\d{10,} - matches 10 or more consecutive digits
$&=~s|\d{3}|$& |gr - takes the whole match (the 10+ digit substring) and replaces every 3-digit chunk (matched with \d{3}) with that chunk (inside the inner substitution, $& is the placeholder for its own whole match) followed by a space. g performs as many replacements as there are matches, and r returns the result of the substitution while leaving the original string untouched.
ge - this flag combination means that all matches will be replaced (g), and e is necessary since the replacement here is Perl code to be evaluated, not a literal string.
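For readers more used to Python, the nested substitution can be sketched with a callback, which is roughly what the e flag does in Perl. The function name is hypothetical.

```python
import re

def space_with_callback(line):
    # Outer substitution finds runs of 10+ digits; the callback rewrites
    # each run by adding a space after every 3-digit chunk.
    return re.sub(
        r"\d{10,}",
        lambda m: re.sub(r"\d{3}", r"\g<0> ", m.group()),
        line,
    )

print(repr(space_with_callback("ala el 213123123123")))
print(repr(space_with_callback("1231231313123 i 14124124141411")))
```

Note that a number whose length is a multiple of three ends up with a trailing space, because the last chunk also gets a space appended. The Perl one-liner does the same.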
Preprocess and postprocess the file with tr, so that sed sees one word per line:
tr "\n " "\r\n" < "${file}" | sed -r '/[0-9]{10}/ s/[0-9]{3}/& /g' | tr '\r\n' '\n '
This might work for you (GNU sed):
cat <<\!|sed -Ef - file
/[[:digit:]]{10,}/{
s//\n&\n/
h
s/.*\n(.*)\n.*/\1/
s/.{3}\B/& /g
G
s/(.*)(\n.*)\n.*\n/\2\1/
D
}
!
Determine if the current line has any 10 or more digit numbers and if so process them.
Surround the first such number by newlines.
Copy the whole line to the hold space (HS).
Remove everything except the number from the current line.
Space the number every 3 digits (only do so if there is a following digit).
Append the original line from HS to the current line.
Replace the original number by the spaced number and remove all introduced newlines except the first.
Delete the introduced newline and thus repeat the process.
N.B. The D command removes up to and including the first newline in the current line, i.e. the pattern space. If there is no newline, it acts the same as the d command. However, if there is a newline, once it has removed the text before and including the newline, and there is further text, it begins a new cycle but does not read in another line from the input. Thus it treats whatever remains in the pattern space as if it had read in another line of input and starts the sed cycle again. Inserting a newline and then using the D command therefore behaves like :a;...;ba.
Or if you prefer:
sed -E '/[[:digit:]]{10,}/{s//\n&\n/;h;s/.*\n(.*)\n.*/\1/;s/.{3}\B/& /g;G;s/(.*)(\n.*)\n.*\n/\2\1/;D}' file
An alternative that just uses the pattern space:
sed -E '/[[:digit:]]{10,}/{s//\n&\n/;s/(.*)(\n.*)(\n.*)/\1\3\2/;:a;s/^(.*\n.*\n([[:digit:]]{3} )*[[:digit:]]{3}\B)/\1 /;ta;s/(.*)\n(.*)\n(.*)/\n\1\3\2/;D}' file
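The spacing step in these scripts, s/.{3}\B/& /g, relies on \B to avoid adding a space after the final group. A small Python sketch shows the effect on an isolated number; the function name is illustrative, and Python's \B is assumed to behave like GNU sed's for plain digit strings.

```python
import re

def group_digits(number):
    # .{3}\B matches three characters that are NOT followed by a word
    # boundary, i.e. another digit follows, so the last group gets no space.
    return re.sub(r".{3}\B", r"\g<0> ", number)

print(group_digits("213123123123"))   # 213 123 123 123
print(group_digits("1231231313123"))  # 123 123 131 312 3
```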

Regex POSIX - How can i find if the start of a line contains a word from a word that appears later in line

I have a UNIX passwd file and I need to find, using egrep, whether the first 7 characters of the GECOS field are inside the username. I want to check if the username (jkennedy) contains the word "kennedy" from the GECOS.
I was planning to use back-references, but the username comes before the GECOS, so I don't know how to implement it.
For example the passwd file contains this line:
jkennedy:x:2473:1067:kennedy john:/root:/bin/bash
As per my original comment, the regex below works for me.
The demo linked from the original answer uses a slightly different regex, written more for display purposes. The regex below is the POSIX version of it, with the non-capturing groups and the unneeded capture group around the backreference removed.
^[^:]*([^:]{7})([^:]*:){4}\1.*$
^ assert position at the start of the line
[^:]* match any character except : any number of times
([^:]{7}) capture exactly seven of any character except :
([^:]*:){4} match the following exactly four times
[^:]*: match any character except : any number of times, followed by : literally
\1 match the backreference; matches what was previously matched by the first capture group
.* match any character (except newline characters) any number of times
$ assert position at the end of the line
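The breakdown above can be tried out in Python, whose re module accepts the same pattern unchanged. The second passwd entry is a made-up counterexample.

```python
import re

PATTERN = re.compile(r"^[^:]*([^:]{7})([^:]*:){4}\1.*$")

line = "jkennedy:x:2473:1067:kennedy john:/root:/bin/bash"
m = PATTERN.search(line)
print(m.group(1) if m else None)  # kennedy

# Hypothetical entry whose username does not contain the GECOS prefix.
other = "bobsmith:x:2474:1067:kennedy john:/root:/bin/bash"
print(PATTERN.search(other))  # None
```

Each repetition of ([^:]*:) consumes exactly one colon, so after four repetitions the backreference is tested at the start of the fifth field, the GECOS.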
Assuming you do NOT want case sensitivity to foul your matching -
declare -l tmpUsr tmpName
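while IFS=: read -r usr x x x name x
do tmpUsr="$usr"; tmpName="$name"
# quote the right-hand side so the GECOS prefix is matched literally, not as a regex
(( ${#name} )) && [[ "$tmpUsr" =~ "${tmpName:0:7}" ]] &&
printf '%s (%s<%s>)\n' "$usr" "$name" "${tmpName:0:7}"
done </etc/passwd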
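The case-insensitive check in this loop can also be sketched in Python. The function name and the sample lines are hypothetical, and the sketch works on a passwd-style line passed in as a string instead of reading /etc/passwd.

```python
def gecos_prefix_in_username(passwd_line):
    # Fields: name:password:UID:GID:GECOS:home:shell
    fields = passwd_line.split(":")
    if len(fields) < 5:
        return False
    user, gecos = fields[0], fields[4]
    # Lowercase both sides, like declare -l does for tmpUsr and tmpName.
    prefix = gecos[:7].lower()
    # An empty GECOS never counts as a match.
    return bool(prefix) and prefix in user.lower()

print(gecos_prefix_in_username("JKennedy:x:2473:1067:Kennedy John:/root:/bin/bash"))
print(gecos_prefix_in_username("bobsmith:x:2474:1067:kennedy john:/root:/bin/bash"))
```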

Swap minus sign from after the number to in front of the number using SED (and Regex)

I've got a text-file with the following line:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-
I want to change this line with SED into:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST -12,90
So I want to swap the minus sign so that I get -12,90 instead of 12,90- with SED. I tried:
try 1:
sed 's/\([0-9.]\+\)-/-\1/g' file.txt > file1.txt
try 2:
sed 's/\([0-9].\+\)-$/-\1/g' file.txt > file1.txt
So there must be something wrong with the REGEX but I do not really understand it. Please help.
You may use
sed 's/\([0-9][0-9,.]\+\)-\($\|[^0-9]\)/-\1\2/g'
The point is that after matching a number and a - (see \([0-9][0-9,.]\+\)-), there should come either end of string or non-digit (\($\|[^0-9]\)). Thus, we have 2 capturing groups now, and that is why we need a second backreference in the replacement pattern (\2).
I added a dot . to the bracket expression just in case you have mixed number formats, you may remove it if you always have a comma as the decimal separator.
Pattern details:
\([0-9][0-9,.]\+\) - Group 1 capturing
[0-9] - a digit
[0-9,.]\+ - one or more digits, commas or dots
- - a literal hyphen
\($\|[^0-9]\) - Group 2 capturing the end of string $ or a non-digit ([^0-9])
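The pattern details above translate almost directly to Python, where groups, + and | are written without backslashes. The following sketch (function name made up) can be used to check the behaviour on fragments of the sample lines.

```python
import re

def fix_trailing_minus(line):
    # Group 1: the number; group 2: end of string or a non-digit after the dash.
    return re.sub(r"([0-9][0-9,.]+)-($|[^0-9])", r"-\1\2", line)

print(fix_trailing_minus("30-06 14956 1 180ST 12,90-"))  # 30-06 14956 1 180ST -12,90
print(fix_trailing_minus("11-01-1911 ST Student-Student"))  # unchanged
```

Dates such as 30-06 are left alone because the dash is followed by a digit, which the second group rejects.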
In your example, both files are identical, but I think I know what you mean.
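For this particular file, you want to match a space, followed by zero or more digits, followed by a comma, followed by at least one digit, followed by a dash, followed by zero or more spaces to the end of the line. Then you want to replace the space in front of the matched number with a dash and drop the trailing dash. Note that the space is consumed, so if only a single space separates the amount from the previous field, put a space before the dash in the replacement. This will do the trick: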
sed -e 's/ \([0-9]*,[0-9][0-9]*\)- *$/-\1/' <file.txt >file1.txt
Your first regular expression attempts to match against a string of digits and dots, but the text contains a comma, not a dot. It does the substitution you want if you replace [0-9.] with [0-9,], giving:
sed 's/\([0-9,]\+\)-/-\1/g' file.txt > file1.txt
However, it also replaces 25-07 in that case with -2507. I suggest you explicitly match against the end of the line:
sed 's/\([0-9,]\+\)-$/-\1/g'
or alternatively, you can demand that the match contains exactly one comma:
sed 's/\([0-9]\+,[0-9]\+\)-$/-\1/g'
I also find these things easier to read if you use the -r option to sed, which enables "extended regular expressions":
sed -r 's/([0-9]+,[0-9]+)-$/-\1/g'
Fewer special characters need to be escaped (on the other hand, more literal characters need to be escaped, but I find that tends to be a rarer occurrence).
(Aside: note that . usually means "any character", but inside a character class [.] it means "literally a .", since after all having it mean "any character" in there would be pretty useless.)
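The difference between the unanchored and the anchored variants can be seen in a short Python sketch. The function names and the shortened sample line are only for illustration.

```python
import re

def swap_unanchored(line):
    # Rewrites every run of digits/commas that is followed by a dash.
    return re.sub(r"([0-9,]+)-", r"-\1", line)

def swap_anchored(line):
    # Requires exactly one comma and the dash at the very end of the line.
    return re.sub(r"([0-9]+,[0-9]+)-$", r"-\1", line)

sample = "Student 25-07 11585 180ST 12,90-"
print(swap_unanchored(sample))  # Student -2507 11585 180ST -12,90
print(swap_anchored(sample))    # Student 25-07 11585 180ST -12,90
```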

Vim: Match spaces at end of line but not lines consisting of a single space

I realise that in vim, I can highlight trailing spaces at the end of a line using
match /\s\+$/
Now I would like to exclude those lines that contain exactly one space from being matched. How do I go about doing this? (It does not need to be a single line/regex.)
match /\(\S\zs\s\+$\)\|\(^\s\{2,}$\)/
This should work - breaking it down into 2 sections
Part 1 - search for spaces at the end of a line that has other stuff on the line: \(\S\zs\s\+$\)
not a space \S,
then start matching \zs,
1 or more spaces at the end of the line \s\+$
OR match \|
Part 2 - Search for more than one space which is the entire line: \(^\s\{2,}$\)
start at the beginning of the line ^
search for at least 2 spaces \s\{2,}
at the end of the line $
This matches any run of two or more whitespace characters at the end of a line, so a line consisting of a single space is left out. It also leaves out a single trailing space after other text.
match /\s\s\+$/
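The two patterns above behave differently for a single trailing space after text, which a quick Python comparison makes visible. Python has no \zs, so this sketch assumes a lookbehind (?<=\S) is an acceptable stand-in; the sample strings are made up.

```python
import re

# First answer: trailing whitespace after text, or a line of 2+ whitespace.
FIRST = re.compile(r"(?<=\S)\s+$|^\s{2,}$")
# Second answer: any run of 2+ whitespace characters at the end of the line.
SECOND = re.compile(r"\s\s+$")

samples = ["foo ", "foo  ", " ", "  ", "foo"]
first_hits = [bool(FIRST.search(s)) for s in samples]
second_hits = [bool(SECOND.search(s)) for s in samples]

for s, a, b in zip(samples, first_hits, second_hits):
    print(repr(s), a, b)
```

Only the first pattern flags "foo ", while both skip the line that consists of a single space.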