sed insert spaces between digits for long numbers - regex

How can I use sed to insert a space after every three digits, but only when a number is longer than 10 digits? For example:
blahaaaaaa goog sdd 234 3242423
ala el 213123123123
1231231313123 i 14124124141411
should turn into:
blahaaaaaa goog sdd 234 3242423
ala el 213 123 123 123
123 123 131 312 3 i 141 241 241 414 11
I can easily split digits into groups of three using sed 's/[0-9]\{3\}/& /g', but I cannot combine that with a condition on the number's length.

A single (GNU) sed command could be enough:
sed -E 's/([0-9]{10,})/\n&\n/g; :a; s/([ \n])([0-9]{3})([0-9]+\n)/\1\2 \3/; ta; s/\n//g' file
Update:
Walter A suggested a slightly more concise sed expression, which works fine as far as I can tell:
sed -E 's/([0-9]{10,})/&\n/g; :a; s/([0-9]{3})([0-9]+\n)/\1 \2/; ta; s/\n//g' file
Explanation:
The -E flag tells sed to use extended regular expression syntax, which removes the need for backslashes before the (, ), {, } and + characters.
s/([0-9]{10,})/&\n/g appends a newline (\n) character to every digit sequence of 10 or more digits. This marks the digit sequences we want to process. The \n is a safe choice here because it cannot occur in the pattern space as read from the input line, since it is the delimiter terminating the line. Note that we process a single line per cycle (i.e., since no multiline techniques are used, \n can serve as an anchor without interfering with other characters in the line).
:a; s/([0-9]{3})([0-9]+\n)/\1 \2/; ta is a loop. :a is a label and could be any word (the : indicates the label). ta means jump to the label a if the last substitution (s command) succeeded. The s command, being the body of the loop, repeatedly replaces, from left to right, a 3-digit sequence with the same 3 digits followed by a space, but only if that 3-digit sequence is immediately followed by one or more digits terminated by a \n character. The loop ends when no substitution is possible.
s/\n//g removes all \n instances from the resulting pattern space. They served as markers delimiting the end of the digit sequences of 10 or more digits, and are no longer needed.
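To see the marker technique at work, here is a small sketch that runs the expression on the sample input and also prints the pattern space right after the marking step. It assumes GNU sed (for -E, \n in the replacement and ; after a label); the function name space_long_numbers is only illustrative.

```shell
# Assumes GNU sed. space_long_numbers is a hypothetical wrapper name.
space_long_numbers() {
    sed -E 's/([0-9]{10,})/&\n/g; :a; s/([0-9]{3})([0-9]+\n)/\1 \2/; ta; s/\n//g'
}

# Stage 1 only: the l command prints the pattern space with the \n marker made visible.
printf '%s\n' 'ala el 213123123123' | sed -nE 's/([0-9]{10,})/&\n/g; l'

# All three stages on the sample lines from the question.
printf '%s\n' \
    'blahaaaaaa goog sdd 234 3242423' \
    'ala el 213123123123' \
    '1231231313123 i 14124124141411' | space_long_numbers
```

Numbers shorter than 10 digits never receive a marker, so the loop has nothing to anchor on and leaves them alone.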

When you need to meet a complex set of requirements like this, it is more convenient to use perl:
perl -i -pe 's/\d{10,}/$&=~s|\d{3}(?=\d)|$& |gr/ge' file
Here,
\d{10,} - matches 10 or more consecutive digits
$&=~s|\d{3}(?=\d)|$& |gr - takes the whole match (the 10+ digit substring) and replaces every 3-digit chunk that is followed by another digit (matched with \d{3}(?=\d)) with that chunk (since $& is the placeholder for the whole match) and a space. The (?=\d) lookahead keeps a trailing space from being added after the last chunk when the number's length is a multiple of three. g performs as many replacements as there are matches in the input, and r returns the result of the substitution and leaves the original string untouched.
ge - this flag combination means that all matches will be replaced (g), and e is necessary because the replacement here is Perl code to be evaluated rather than a literal string.

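Preprocess and postprocess the file with tr, temporarily turning spaces into line breaks (and line breaks into carriage returns) so that sed works on space-separated chunks instead of whole lines: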
tr "\n " "\r\n" < "${file}" | sed -r '/[0-9]{10}/ s/[0-9]{3}/& /g' | tr '\r\n' '\n '

This might work for you (GNU sed):
cat <<\!|sed -Ef - file
/[[:digit:]]{10,}/{
s//\n&\n/
h
s/.*\n(.*)\n.*/\1/
s/.{3}\B/& /g
G
s/(.*)(\n.*)\n.*\n/\2\1/
D
}
!
Determine if the current line has any 10 or more digit numbers and if so process them.
Surround the first such number by newlines.
Copy the whole line to the hold space (HS).
Remove everything except the number from the current line.
Space the number every 3 digits (only do so if there is a following digit).
Append the original line from HS to the current line.
Replace the original number by the spaced number and remove all introduced newlines except the first.
Delete the introduced newline and thus repeat the process.
N.B. The D command removes everything up to and including the first newline in the current line, i.e. the pattern space. If there is no newline, it acts the same as the d command. However, if there is a newline, once it has removed the text before and including the newline, and if there is further text, it begins a new cycle but does not read in another line from the input. Thus it treats whatever remains in the pattern space as if it had read in another line of input and starts the sed cycle again. Inserting a newline and then using the D command is therefore equivalent to :a;...;ba.
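The restart behaviour of D described above can be observed in isolation with a much smaller script. This is only a sketch, assuming GNU sed (for \n in the replacement); split_words is a made-up name.

```shell
# Assumes GNU sed. split_words is a hypothetical helper name.
# Each pass turns the first space into a newline, prints the text before it (P),
# then D deletes that text and reruns the script on the remainder
# without reading a new input line.
split_words() {
    sed 's/ /\n/; P; D'
}

printf '%s\n' 'one two three' | split_words
```

On the last pass there is no space left, so no newline is inserted and D behaves like d, ending the cycle for that input line.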
Or if you prefer:
sed -E '/[[:digit:]]{10,}/{s//\n&\n/;h;s/.*\n(.*)\n.*/\1/;s/.{3}\B/& /g;G;s/(.*)(\n.*)\n.*\n/\2\1/;D}' file
An alternative that just uses the pattern space:
sed -E '/[[:digit:]]{10,}/{s//\n&\n/;s/(.*)(\n.*)(\n.*)/\1\3\2/;:a;s/^(.*\n.*\n([[:digit:]]{3} )*[[:digit:]]{3}\B)/\1 /;ta;s/(.*)\n(.*)\n(.*)/\n\1\3\2/;D}' file

Related

How do I filter lines in a text file that start with a capital letter and end with a positive integer with regex on the command line in linux?

I am attempting to use a regex with the grep command in the Linux terminal to filter lines in a text file that start with a capital letter and end with a positive integer. Is there a way to modify my command so that it does this in one line with a single call of grep instead of two? I am using Windows Subsystem for Linux and the Microsoft Store Ubuntu.
Text File:
C line 1
c line 2
B line 3
d line 4
E line five
The command that I have gotten to work:
grep ^[A-Z] cap*| grep [0-9]$ cap*
The Output
C line 1
B line 3
This works, but I feel the two regex statements could be combined somehow. However,
grep ^[A-Z][0-9]$
does not yield the same result as the command above.
You need to use
grep '^[A-Z].*[0-9]$'
grep '^[[:upper:]].*[0-9]$'
The regex matches:
^ - start of string
[A-Z] / [[:upper:]] - an uppercase letter
.* - any zero or more chars ([^0-9]* matches zero or more non-digit chars)
[0-9] - a digit.
$ - end of string.
Also, if you want to make sure there is no - before the number at the end of string, you need to use a negated bracket expression, like
grep -E '^[[:upper:]](.*[^-0-9])?[1-9][0-9]*$'
Here, the POSIX ERE regex (due to the -E option) matches
^[[:upper:]] - an uppercase letter at the start and then
(.*[^-0-9])? - an optional occurrence of any text and then any char other than a digit and -
[1-9] - a non-zero digit
[0-9]* - zero or more digits
$ - end of string.
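As a quick check of the stricter pattern, the sketch below feeds it the lines from the question plus a few extra lines (a negative number, a zero and a two-digit number) that are assumptions added for illustration. The helper name ends_with_positive_int is hypothetical.

```shell
# ends_with_positive_int is a hypothetical helper name.
ends_with_positive_int() {
    grep -E '^[[:upper:]](.*[^-0-9])?[1-9][0-9]*$'
}

# The last three sample lines are not in the question; they probe the edge cases.
printf '%s\n' \
    'C line 1' 'c line 2' 'B line 3' 'd line 4' 'E line five' \
    'F line -7' 'G line 0' 'H line 10' | ends_with_positive_int
```

Only the lines starting with an uppercase letter and ending in a number that is neither zero nor preceded by a minus sign should come through.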
When you use a pipeline, you want the second grep to act on standard input, not on the file you originally grepped from.
grep '^[A-Z]' cap* | grep '[0-9]$'
However, you need to expand the second regex if you want to exclude negative numbers. Anyway, a better solution altogether might be to switch to Awk:
awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0' cap*
The output format will be slightly different than from grep; if you want to include the name of the matching file, you have to specify that separately:
awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0 { print FILENAME ":" $0 }' cap*
The regex ^[A-Z][0-9]$ matches exactly two characters, the first of which must be an uppercase letter and the second a digit. If you want to permit arbitrary text between them, that would be ^[A-Z].*[0-9]$ (and for less arbitrary text, use something a bit more specific than .*, like (.*[^-0-9])? perhaps, where you need grep -E for the parentheses and the question mark for optional, or backslashes before each of these for the BRE regex dialect you get out of the box with POSIX grep).
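The awk filter from this answer can be tried on standard input instead of the cap* files. The sketch below does that; the function name positive_last_field and the extra sample lines (a negative number and a zero) are assumptions added for illustration.

```shell
# positive_last_field is a hypothetical helper name; it reads standard input.
positive_last_field() {
    awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0'
}

# 'F line -7' and 'G line 0' are not in the question; they exercise the $NF > 0 test.
printf '%s\n' \
    'C line 1' 'c line 2' 'B line 3' 'd line 4' 'E line five' \
    'F line -7' 'G line 0' | positive_last_field
```

Here the numeric comparison on the last field does the work that the negated bracket expression does in the grep -E variant.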

How to delete leading characters until 1-9 are met

I have time spans in the format of
00h29m37s
01h31m24s
How to delete all leading characters until [1-9] is hit? For the above examples, the desired output is
29m37s
1h31m24s
We don't need to worry about 00h00m00s since it doesn't occur
$ cat input
00h29m37s
01h31m24s
$ sed 's/^[^1-9]*//' input
29m37s
1h31m24s
The [^1-9] matches all characters that are not 1 thru 9. The * makes that match the longest such string of characters. s only operates on the first match of the regex, so it effectively deletes the leading string of characters that do not match 1-9.
The ^ anchors the match to the beginning of the string, but note that it is not necessary. Without the ^, the [^1-9]* will match the leading string of length zero, so a string like 31m04s will not be edited. This is a bit confusing and worthy of mention. If you did want to edit 31m04s to remove the first longest string of characters that do not match 1 thru 9, you could do sed 's/[^1-9]\+//' or (if using a sed that does not support +) sed 's/[^1-9][^1-9]*//'
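The difference between the two forms is easy to check on a string that already starts with a digit from 1 to 9. A minimal sketch, using the portable [^1-9][^1-9]* spelling; the function names are made up for the example.

```shell
# Hypothetical helper names; both read standard input.
strip_leading() {
    sed 's/^[^1-9]*//'          # removes only a leading run (possibly empty)
}
strip_first_run() {
    sed 's/[^1-9][^1-9]*//'     # removes the first non-empty run, wherever it is
}

printf '%s\n' '00h29m37s' '01h31m24s' | strip_leading
printf '%s\n' '31m04s' | strip_leading     # unchanged: the leading run is empty
printf '%s\n' '31m04s' | strip_first_run   # removes "m0" from the middle
```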
Using awk:
awk -F 'h' '+$1 > 0 { print +$1FS$2;next } { print $2 }' file
Set the field separator to "h". When the first "h"-separated field is greater than 0, print that field with the leading 0 stripped (+$1), followed by the field separator and the second field, then skip to the next record. If the first field is equal to 0, print the second field only.
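The awk and sed approaches agree on the two sample values but differ when the minutes also start with a zero, a case the question does not show. The sketch below compares them; the function names are hypothetical, and the awk program uses printf "%d" as an equivalent way of dropping the leading zero from the hours.

```shell
# Hypothetical helper names. with_awk mirrors the awk answer, using printf "%d"
# to drop the leading zero of the hours field.
with_awk() {
    awk -F 'h' '$1 + 0 > 0 { printf "%d%s%s\n", $1, FS, $2; next } { print $2 }'
}
with_sed() {
    sed 's/^[^1-9]*//'
}

printf '%s\n' '00h29m37s' '01h31m24s' | with_awk
# 00h05m37s is an assumed input, not taken from the question.
printf '%s\n' '00h05m37s' | with_awk   # keeps the zero in "05m37s"
printf '%s\n' '00h05m37s' | with_sed   # strips it as well
```

Which result is wanted depends on whether a zero-padded minutes field should survive once the hours are dropped.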

How do i replace in Linux blanks with underscore between letters only, ignoring numbers

Using Linux, I need a way to replace blanks in a string with underscores. The special requirement is to do this only between two letters (regardless of upper or lower case), not between two numbers or between a number and a letter.
Example:
"This is a test File of 100 MB Size - 45 of 50 files processed"
Output should be:
"This_is_a_test_File_of 100 MB_Size - 45 of 50 files_processed"
Thanks in advance for your help.
I tried a lot of sed regex combinations, but none of them did the job.
Seems a bit tricky.
sed 's/\([a-z]\)[[:space:]]\([A-Z]\)/_/g'
sed 's/\([a-z]\) \([A-Z]\)/_/g'
A way that puts hyphens around digits and that plays with word boundaries:
sed -E 's/([0-9_])/-\1-/g;s/\b \b/_/g;s/-([0-9_])-/\1/g' file
Or more direct with perl:
perl -pe's/\pL\K (?=\pL)/_/g' file
You may use
sed ':A;s/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/;tA' file
Or
sed ':A;s/\([[:alpha:]]\)[[:space:]]\([[:alpha:]]\)/\1_\2/;tA' file
The point is that you match and capture a letter into Group 1 with the first \([[:alpha:]]\), then match a space (or any whitespace with [[:space:]]), and then match and capture a letter into Group 2 (with the second \([[:alpha:]]\)). The match is replaced with the contents of Group 1 (\1), _ and the contents of Group 2 (\2). The tA command then jumps back to the :A label whenever a substitution was made, so the s command runs again until no match is left.
Note that your approach would partly work if you added the \1 and \2 placeholders at the right places in the RHS, but one-letter words would still prevent it from working: with the g flag matches cannot overlap, so a letter consumed as the end of one match cannot start the next one. However, if you pipe the output into a second, identical sed command, you get the expected output:
sed 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/g' file | sed 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/g'
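The effect of the one-letter word can be reproduced on the sentence from the question. In this sketch the function names are made up, and the loop is written with separate -e options so that the label does not depend on how a given sed treats ; after :A.

```shell
# Hypothetical helper names; both read standard input.
one_pass() {
    sed 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/g'
}
looped() {
    sed -e ':A' -e 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/' -e 'tA'
}

s='This is a test File of 100 MB Size - 45 of 50 files processed'
printf '%s\n' "$s" | one_pass   # leaves the blank after the one-letter word "a"
printf '%s\n' "$s" | looped     # replaces every blank between two letters
```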

Swap minus sign from after the number to in front of the number using SED (and Regex)

I've got a text-file with the following line:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-
I want to change this line with SED into:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST -12,90
So I want to swap the minus sign so that I get -12,90 instead of 12,90- with SED. I tried:
try 1:
sed 's/\([0-9.]\+\)-/-\1/g' file.txt > file1.txt
try 2:
sed 's/\([0-9].\+\)-$/-\1/g' file.txt > file1.txt
So there must be something wrong with the regex, but I do not really understand it. Please help.
You may use
sed 's/\([0-9][0-9,.]\+\)-\($\|[^0-9]\)/-\1\2/g'
The point is that after matching a number and a - (see \([0-9][0-9,.]\+\)-), there should come either end of string or non-digit (\($\|[^0-9]\)). Thus, we have 2 capturing groups now, and that is why we need a second backreference in the replacement pattern (\2).
I added a dot . to the bracket expression just in case you have mixed number formats, you may remove it if you always have a comma as the decimal separator.
Pattern details:
\([0-9][0-9,.]\+\) - Group 1 capturing
[0-9] - a digit
[0-9,.]\+ - one or more digits, commas or dots
- - a literal hyphen
\($\|[^0-9]\) - Group 2 capturing the end of string $ or a non-digit ([^0-9])
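The pattern broken down above can be tried on the lines from the question. The sketch below uses the same pattern in extended syntax (-E), which is an assumption about the sed in use; swap_trailing_minus is a hypothetical name.

```shell
# swap_trailing_minus is a hypothetical helper name; -E spelling of the same pattern.
swap_trailing_minus() {
    sed -E 's/([0-9][0-9,.]+)-($|[^0-9])/-\1\2/g'
}

printf '%s\n' \
    '201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00' \
    '706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-' \
    | swap_trailing_minus
```

Dates such as 11-01-1911 and 25-07 are left alone because the hyphen there is followed by a digit, so Group 2 cannot match.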
In your example, both files are identical, but I think I know what you mean.
For this particular file, you want to match a space, followed by zero or more digits, followed by a comma, followed by at least one digit, followed by a dash, followed by zero or more spaces to the end of the line.
Then you want to move the dash in front of the matched digits and comma, keeping the space that separates them from the previous field. This will do the trick:
sed -e 's/ \([0-9]*,[0-9][0-9]*\)- *$/ -\1/' <file.txt >file1.txt
Your first regular expression attempts to match against a string of numbers and .s, but the text contains a comma, not a .. It does the substitution you want if you replace [0-9.] with [0-9,], giving:
sed 's/\([0-9,]\+\)-/-\1/g' file.txt > file1.txt
However, it also replaces 25-07 in that case with -2507. I suggest you explicitly match against the end of the line:
sed 's/\([0-9,]\+\)-$/-\1/g'
or alternatively, you can demand that the match contains exactly one comma:
sed 's/\([0-9]\+,[0-9]\+\)-$/-\1/g'
I also find these things easier to read if you use the -r option to sed, which enables "extended regular expressions":
sed -r 's/([0-9]+,[0-9]+)-$/-\1/g'
Fewer special characters need to be escaped (on the other hand, more literal characters need to be escaped, but I find that tends to be a rarer occurrence).
(Aside: note that . usually means "any character", but inside a character class [.] it means "literally a .", since after all having it mean "any character" in there would be pretty useless.)

Add text to beginning and end of a line in vim

I have a file with over 5000 lines. I need to add the following characters before and after the string, JAMES, for example:
'[^-_.*]JAMES[#-\.]'
Is this correct if I am trying to say that JAMES can be in the beginning of a line, or have dashes, underscores, any character 0 or more times before JAMES, and the #, dash, or period can be followed by JAMES?
I am using a list as a whitelist to print the emails that have a certain string (e.g. JAMES) to another file. So, it should take:
cat file.txt | egrep -e -i < whitelist | sort -u > newfile.txt
file.txt has email addresses separated by new line.
So far, I used the following command in vim:
'%s/^$/[^-_.*]'
However, all this did was add [^-_.*] to the end of the file.
You probably want something like:
:g/.*/s//^[-_.]*&[-#.]*/
The .* looks for the whole line, and the :g/.*/ looks for that on every line in the file. The s//X&Y/ replaces what the g// found (the whole line) with X, what was found, and Y, except that the X is:
^[-_.]*
which means start of line followed by zero or more dashes, underscores or dots, and Y is:
[-#.]*
which means zero or more dashes, hash signs, or dots. If you need that anchored to the end of line, add a $ after the *.
Note that the - must appear first or last in a character class to be taken literally (when it goes first, it may still be preceded by a character that itself must come first, such as ^ to negate the character class, or ] to match a close square bracket).
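The same wrapping can be done outside vim, since & also stands for the whole match in a sed replacement. The sketch below is an illustration only: sed stands in for the vim command, and the function name and sample addresses are made up.

```shell
# wrap_patterns is a hypothetical helper name; sed stands in for the vim command.
# In the replacement, & is the matched line; ^, [, ], * and . are literal there.
wrap_patterns() {
    sed 's/.*/^[-_.]*&[-#.]*/'
}

# Build the pattern for one whitelist entry and show it.
pattern=$(printf '%s\n' 'JAMES' | wrap_patterns)
printf '%s\n' "$pattern"

# Use it as a case-insensitive grep pattern on some made-up addresses.
printf '%s\n' 'james.smith@example.com' '_-JAMES#1@example.com' 'bob@example.com' \
    | grep -i -e "$pattern"
```

An empty line in the whitelist would turn into a pattern that matches every input line, so it is worth removing blank lines first.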