Insert Decimal After Character Match in Text File - regex

I have a CSV file that has some data values. I need to insert a decimal point after the second character when the value has 3 digits and after the third character when the value has 4 digits.
CSV File:
956,938,987,964,1004,934,1018,912
Attempted Code:
sed -e "s/\([0-9]\{2\}\)/\1./g"
Current Result:
95.6,93.8,98.7,96.4,10.04.,93.4,10.18.,91.2
Expected Result:
95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2
My current code (using sed) appears to work for 3-digit values but fails on 4-digit values.

You may capture 2 or more digits into 1 group, and then capture a trailing digit into another group:
s='956,938,987,964,1004,934,1018,912'
echo "$s" | sed 's/\([0-9]\{2,\}\)\([0-9]\)/\1.\2/g'
See the online demo, output: 95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2.
Details:
\([0-9]\{2,\}\) - Group 1: two or more (\{2,\}) digits ([0-9])
\([0-9]\) - Group 2: a single digit.
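A quick check of how the two groups behave on short values may help. This is only a sketch: the values 12 and 7 are added here for illustration and are not in the original data. Since the pattern needs at least three digits in a row, shorter numbers are left alone.

```shell
# Part of the sample row from the question plus two hypothetical short values.
s='956,938,987,964,1004,12,7'

# Group 1 greedily takes two or more digits, then gives one back so that
# group 2 can take the final digit of each number.
result=$(printf '%s\n' "$s" | sed 's/\([0-9]\{2,\}\)\([0-9]\)/\1.\2/g')

printf '%s\n' "$result"
# prints: 95.6,93.8,98.7,96.4,100.4,12,7
```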

In awk:
$ awk '{gsub(/.(,|$)/,".&")}1' file
95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2
In case there are spaces or other stuff before the separator, you could use:
$ awk '{gsub(/[0-9] *(,|$)/,".&")}1' file
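To see what the space-tolerant variant does, here is a small sketch with a made-up row (the stray spaces are an assumption, the original file has none). The dot is inserted before the last digit and the spaces stay where they were.

```shell
# Hypothetical input with a space before some of the commas.
s='956 ,938,1004 ,912'

# "&" in the replacement stands for the whole match: the last digit,
# any spaces after it, and the comma (or nothing at end of line).
result=$(printf '%s\n' "$s" | awk '{gsub(/[0-9] *(,|$)/,".&")}1')

printf '%s\n' "$result"
# prints: 95.6 ,93.8,100.4 ,91.2
```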

How about simply replacing
\B([0-9])\b
with
.\1
like
sed 's/\B\([0-9]\)\b/.\1/g'
Explanation:
\B Matches if the position being matched is inside a word/number sequence (not a word boundary)
([0-9]) Matches and captures a digit
\b Matches if the position being matched is on a word/number boundary
From your examples I gather you simply want all numbers to have one decimal. What this regex does is match, and capture, the last digit in a multi-digit number. Replacing it with itself preceded by a . gives you the desired output.
Online demo and here at regex101 for a more visual illustration.
Edit
If Wiktor's concerns are an issue, change it to
\B([0-9])([0-9])\b
replaced by
\1.\2
like
sed 's/\B\([0-9]\)\([0-9]\)\b/\1.\2/g'
Here at regex101.
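One visible difference between the two variants is how they treat two-digit values. The sketch below assumes GNU sed (\b and \B are GNU extensions) and uses a made-up row containing 12:

```shell
# Hypothetical input containing a two-digit value.
s='12,956,1004'

# Variant 1: a dot goes before the last digit of any number with 2+ digits.
v1=$(printf '%s\n' "$s" | sed 's/\B\([0-9]\)\b/.\1/g')

# Variant 2: the extra captured digit means at least three digits are needed.
v2=$(printf '%s\n' "$s" | sed 's/\B\([0-9]\)\([0-9]\)\b/\1.\2/g')

printf '%s\n%s\n' "$v1" "$v2"
# prints:
# 1.2,95.6,100.4
# 12,95.6,100.4
```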

Looks like you are just dividing all numbers by 10, hence you can use this non-regex approach:
awk 'BEGIN{FS=OFS=","} {for (i=1; i<=NF; i++) $i/=10} 1' file
95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2
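One thing to keep in mind with plain division is that awk prints whole results without a decimal part, so 1000 would come out as 100. If every value should show exactly one decimal (an assumption, the question does not say so), a sketch using sprintf could look like this; the input row is made up:

```shell
# Hypothetical row: 1000 and 5 are not in the original data.
s='956,1000,5'

# Format each field with exactly one digit after the decimal point.
result=$(printf '%s\n' "$s" |
  awk 'BEGIN{FS=OFS=","} {for (i=1; i<=NF; i++) $i=sprintf("%.1f", $i/10)} 1')

printf '%s\n' "$result"
# prints: 95.6,100.0,0.5
```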

Related

How do I filter lines in a text file that start with a capital letter and end with a positive integer with regex on the command line in linux?

I am attempting to use regex with the grep command in the Linux terminal to filter lines in a text file that start with a capital letter and end with a positive integer. Is there a way to modify my command so that it does this all in one line with one call of grep instead of two? I am using Windows Subsystem for Linux with Ubuntu from the Microsoft Store.
Text File:
C line 1
c line 2
B line 3
d line 4
E line five
The command that I have gotten to work:
grep ^[A-Z] cap*| grep [0-9]$ cap*
The Output
C line 1
B line 3
This works, but I feel like the regex statements could be combined somehow. However,
grep ^[A-Z][0-9]$
does not yield the same result as the command above.
You need to use
grep '^[A-Z].*[0-9]$'
grep '^[[:upper:]].*[0-9]$'
See the online demo. The regex matches:
^ - start of string
[A-Z] / [[:upper:]] - an uppercase letter
.* - any zero or more chars ([^0-9]* matches zero or more non-digit chars)
[0-9] - a digit.
$ - end of string.
Also, if you want to make sure there is no - before the number at the end of string, you need to use a negated bracket expression, like
grep -E '^[[:upper:]](.*[^-0-9])?[1-9][0-9]*$'
Here, the POSIX ERE regex (due to the -E option) matches
^[[:upper:]] - an uppercase letter at the start and then
(.*[^-0-9])? - an optional occurrence of any text and then any char other than a digit and -
[1-9] - a non-zero digit
[0-9]* - zero or more digits
$ - end of string.
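To compare the two patterns side by side, here is a sketch that feeds them the sample lines plus three made-up ones (F, G and H are assumptions, added to exercise a negative number, a zero and a multi-digit number):

```shell
# Sample lines from the question plus three hypothetical ones (F, G, H).
lines='C line 1
c line 2
B line 3
d line 4
E line five
F line -7
G line 0
H line 10'

# Uppercase first character, any text, a digit at the end.
loose=$(printf '%s\n' "$lines" | grep '^[[:upper:]].*[0-9]$')

# Stricter ERE: the trailing number must not be preceded by "-" and
# must not start with 0.
strict=$(printf '%s\n' "$lines" |
  grep -E '^[[:upper:]](.*[^-0-9])?[1-9][0-9]*$')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern keeps the C, B, F, G and H lines; the strict one keeps only C, B and H.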
When you use a pipeline, you want the second grep to act on standard input, not on the file you originally grepped from.
grep '^[A-Z]' cap* | grep '[0-9]$'
However, you need to expand the second regex if you want to exclude negative numbers. Anyway, a better solution altogether might be to switch to Awk:
awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0' cap*
The output format will be slightly different than from grep; if you want to include the name of the matching file, you have to specify that separately:
awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0 { print FILENAME ":" $0 }' cap*
The regex ^[A-Z][0-9]$ matches exactly two characters, the first of which must be an uppercase letter, and the second one has to be a digit. If you want to permit arbitrary text between them, that would be ^[A-Z].*[0-9]$ (and for less arbitrary, use something a bit more specific than .*, like (.*[^-0-9])? perhaps, where you need grep -E for the parentheses and the question mark for optional, or backslashes before each of these for the BRE regex dialect you get out of the box with POSIX grep).
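As a rough check of the Awk condition, the sketch below reads from standard input instead of cap* and adds a few made-up lines (F, G, H) with a negative number, a zero and a larger number. The numeric test $NF > 0 is what drops the negative and zero cases.

```shell
# Some sample lines from the question plus hypothetical F, G, H lines.
lines='C line 1
c line 2
E line five
F line -7
G line 0
H line 10'

# Keep lines that start with A-Z, end in a digit, and whose last
# whitespace-separated field is a number greater than zero.
result=$(printf '%s\n' "$lines" | awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0')

printf '%s\n' "$result"
# prints:
# C line 1
# H line 10
```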

How do I replace blanks with underscores in Linux between letters only, ignoring numbers

Using Linux, I need a way to replace blanks in a string with underscores. The special point is to do this only between two letters (regardless of whether they are upper- or lowercase), not between two numbers or between a number and a letter.
Example:
"This is a test File of 100 MB Size - 45 of 50 files processed"
Output should be:
"This_is_a_test_File_of 100 MB_Size - 45 of 50 files_processed"
Thanks in advance for your help.
I tried a lot of sed regex combinations, but none of them did the job.
Seems a bit tricky.
sed 's/\([a-z]\)[[:space:]]\([A-Z]\)/_/g'
sed 's/\([a-z]\) \([A-Z]\)/_/g'
A way that puts hyphens around digits and that plays with word boundaries:
sed -E 's/([0-9_])/-\1-/g;s/\b \b/_/g;s/-([0-9_])-/\1/g' file
Or more direct with perl:
perl -pe's/\pL\K (?=\pL)/_/g' file
You may use
sed ':A;s/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/;tA' file
Or
sed ':A;s/\([[:alpha:]]\)[[:space:]]\([[:alpha:]]\)/\1_\2/;tA' file
The point is that you match and capture a letter into Group 1 with the first \([[:alpha:]]\), then match a space (or any whitespace with [[:space:]]), and then match and capture a letter into Group 2 (with the second \([[:alpha:]]\)). The match is replaced with the contents of Group 1 (\1), _ and the contents of Group 2 (\2). Because there is no g flag, only one replacement is made per pass; tA then jumps back to the :A label whenever a substitution succeeded, so the command repeats until no match is left.
Note that your approach would partly work if you added the \1 and \2 placeholders to your RHS in the right places, but one-letter words would prevent it from working fully. However, if you pipe the output into a second, identical sed command, you get the expected output:
sed 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/g' file | sed 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/g'
See this online demo.
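To make the one-letter-word problem concrete, this sketch runs a single pass with the g flag and the looping version on the string from the question. The loop is written with separate -e options, which should behave the same as the ':A;...;tA' one-liner but does not rely on semicolons after a label.

```shell
s='This is a test File of 100 MB Size - 45 of 50 files processed'

# Single pass: matches cannot overlap, so the "a" consumed by "is a"
# is not available as the left-hand letter for "a test".
single=$(printf '%s\n' "$s" |
  sed 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/g')

# Loop: replace one occurrence, jump back to :A while a replacement was made.
looped=$(printf '%s\n' "$s" |
  sed -e ':A' -e 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/' -e 'tA')

printf '%s\n%s\n' "$single" "$looped"
# prints:
# This_is_a test_File_of 100 MB_Size - 45 of 50 files_processed
# This_is_a_test_File_of 100 MB_Size - 45 of 50 files_processed
```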

Swap minus sign from after the number to in front of the number using SED (and Regex)

I've got a text-file with the following line:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-
I want to change this line with SED into:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST -12,90
So I want to swap the minus sign so that I get -12,90 instead of 12,90- with SED. I tried:
try 1:
sed 's/\([0-9.]\+\)-/-\1/g' file.txt > file1.txt
try 2:
sed 's/\([0-9].\+\)-$/-\1/g' file.txt > file1.txt
So there must be something wrong with the regex, but I do not really understand it. Please help.
You may use
sed 's/\([0-9][0-9,.]\+\)-\($\|[^0-9]\)/-\1\2/g'
See the online demo
The point is that after matching a number and a - (see \([0-9][0-9,.]\+\)-), there should come either end of string or non-digit (\($\|[^0-9]\)). Thus, we have 2 capturing groups now, and that is why we need a second backreference in the replacement pattern (\2).
I added a dot . to the bracket expression just in case you have mixed number formats, you may remove it if you always have a comma as the decimal separator.
Pattern details:
\([0-9][0-9,.]\+\) - Group 1 capturing
[0-9] - a digit
[0-9,.]\+ - one or more digits, commas or dots
- - a literal hyphen
\($\|[^0-9]\) - Group 2 capturing the end of string $ or a non-digit ([^0-9])
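A sketch of the pattern applied to the third line from the question, assuming GNU sed (\+ and \| are GNU extensions in basic regex syntax). The dates such as 11-01-1911 and 30-06 are left alone because the hyphen there is followed by a digit.

```shell
line='706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-'

# Group 1 is the number, group 2 is whatever follows the hyphen
# (end of line or a non-digit); the hyphen is moved in front of group 1.
result=$(printf '%s\n' "$line" |
  sed 's/\([0-9][0-9,.]\+\)-\($\|[^0-9]\)/-\1\2/g')

printf '%s\n' "$result"
# prints: 706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST -12,90
```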
In your example, both files are identical, but I think I know what you mean.
For this particular file, you want to match a space, followed by zero or more digits, followed by a comma, followed by at least one digit, followed by a dash, followed by zero or more spaces to the end of the line.
Then you want to move the dash in front of the matched digits and comma, keeping the space before it. This will do the trick:
sed -e 's/ \([0-9]*,[0-9][0-9]*\)- *$/ -\1/' <file.txt >file1.txt
Your first regular expression attempts to match against a string of numbers and .s, but the text contains a comma, not a .. It does the substitution you want if you replace [0-9.] with [0-9,], giving:
sed 's/\([0-9,]\+\)-/-\1/g' file.txt > file1.txt
However, it also replaces 25-07 in that case with -2507. I suggest you explicitly match against the end of the line:
sed 's/\([0-9,]\+\)-$/-\1/g'
or alternatively, you can demand that the match contains exactly one comma:
sed 's/\([0-9]\+,[0-9]\+\)-$/-\1/g'
I also find these things easier to read if you use the -r option to sed, which enables "extended regular expressions":
sed -r 's/([0-9]+,[0-9]+)-$/-\1/g'
Fewer special characters need to be escaped (on the other hand, more literal characters need to be escaped, but I find that tends to be a rarer occurrence).
(Aside: note that . usually means "any character", but inside a character class [.] it means "literally a .", since after all having it mean "any character" in there would be pretty useless.)
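The difference between the unanchored and the anchored versions shows up on a shortened, made-up line that contains both a 25-07 style date and a trailing 12,90-. This sketch assumes GNU sed and uses -E, which GNU sed accepts as a synonym for -r.

```shell
# Hypothetical shortened line with a date-like field and a trailing amount.
s='25-07 11585 12,90-'

# Unanchored: every "digits followed by a hyphen" is rewritten.
loose=$(printf '%s\n' "$s" | sed 's/\([0-9,]\+\)-/-\1/g')

# Anchored at end of line, with exactly one comma in the number.
anchored=$(printf '%s\n' "$s" | sed -E 's/([0-9]+,[0-9]+)-$/-\1/')

printf '%s\n%s\n' "$loose" "$anchored"
# prints:
# -2507 11585 -12,90
# 25-07 11585 -12,90
```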

Regular Expression - Want to find a match for a word but only if it does not appear after certain characters

Please help me find a regular expression that will do the following.....
I have a file that could contain lines such as...
with replace into currency;
--if at least one client currency is not found
#id_base int, -- base currency Id
from currency -- need this table as it stores the fx currency
I need to search this file and return any lines that contain the word currency, but only if there is no -- before it. The -- starts a comment, and anything after it can be ignored. However, as the last line above shows, the word may appear both in the non-comment part and in the comment part of a line; that case should count as a match. So from the above file I would expect only the following lines to be returned:
with replace into currency;
from currency -- need this table as it stores the fx currency
Can anyone help?
Perl One-Liner
You can use this regex:
(?m)^(?:(?!--).)*?currency.*
With a perl one-liner:
perl -ne 'print "$&\n" if m/(?m)^(?:(?!--).)*?currency.*/m' yourfile
Output:
with replace into currency;
from currency -- need this table as it stores the fx currency
Match with the following regex:
/^(?:currency.*(*ACCEPT)|(?!--).)+(*F)$/m
Explanation:
^ Asserts position at start of line. (m modifier)
(?: Non-capturing group: Matches character before "--".
currency.*(*ACCEPT) Matches the word "currency" and the rest of the line, then accepts the match.
| or
(?!--). Something that's not "--".
)+ One or more times.
(*F) Fail the match. This will only be reached if the match isn't already accepted by (*ACCEPT).
$ Asserts position at end of the line. (m modifier)
Regex Demo
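If Perl or PCRE is not available, a different technique gives the same lines for this sample: cut off the comment first, then test what is left. This is not the regex from the answers above, just an awk sketch of the same idea, and it assumes -- always starts a comment (for example, never appears inside a quoted string).

```shell
lines='with replace into currency;
--if at least one client currency is not found
#id_base int, -- base currency Id
from currency -- need this table as it stores the fx currency'

# Copy the line, drop everything from the first "--" onward, and print
# the original line only if the remaining code part mentions currency.
result=$(printf '%s\n' "$lines" |
  awk '{ code = $0; sub(/--.*/, "", code); if (code ~ /currency/) print }')

printf '%s\n' "$result"
# prints:
# with replace into currency;
# from currency -- need this table as it stores the fx currency
```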
Taking the above slightly further, if I wanted to search for two words, "currency" or "market..currency". Is it possible to do this in 1 regex?
For this new requirement, see the following regex:
/^(?:(?!--)(?:(happyness)|(money)|(health)|.))+(?(1)(?(2)(?(3).*(*ACCEPT))))(*F)$/m

Regex to match ZIP code without punctuation

I have a file with a bunch of different ZIP codes:
12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678
I want to only match on codes that have the format 12345 or 12345-6789, but ignore all other forms.
I have my regex as:
grep -E '\<[0-9]{5}\>[^[:punct:]]|\<[0-9]{5}\>-[0-9]{4}' samplefile
It matches on the 12345-6789 because the "or" clause matches on that particular one. I am confused as to why it won't match on the first 12345 since my expression should say "match on 5 numbers but ignore any punctuation."
An expression that matches your desired output is:
egrep "^[0-9]{5}([-][0-9]{4})?$" samplefile
The expression breakdown:
^[0-9]{5} - Find a line that starts with 5 digits. ^ means start of line and [0-9]{5} means exactly five digits between zero and nine.
([-][0-9]{4})?$ - May end with a dash and four digits or nothing at all. () groups the expressions together, [-] represents the dash character, [0-9]{4} represents exactly four digits between zero and nine, ? indicates the grouped expression either exists entirely or does not exist and $ marks the end of the line.
test.dat
12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678
Running the expression on the test data:
mike@test:~$ egrep "^[0-9]{5}([-][0-9]{4})?$" test.dat
12345
12345-6789
12345-7890
Additional info: grep -E can alternatively be written as egrep. This also works for grep -F which is the same as fgrep and grep -r which is the same as rgrep.
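It won't match "12345" on a line by itself. The way you wrote it, the first clause requires one more character after the five digits, and that character must not be punctuation. Because of the \> word boundary it also cannot be a letter, digit or underscore, so in practice it has to be something like a space (for example "12345 x").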
Consider Mike's answer; it's clearer.
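To see the first alternative from the question in isolation, here is a sketch assuming GNU grep (\< and \> are GNU extensions). The line "12345 x" is made up to show an input that does satisfy the trailing [^[:punct:]].

```shell
# Two ZIP codes from the question plus one hypothetical line with a space.
lines='12345
12345 x
12345-6789'

# Only the first alternative of the original expression.
first_alt=$(printf '%s\n' "$lines" | grep -E '\<[0-9]{5}\>[^[:punct:]]')

printf '%s\n' "$first_alt"
# prints: 12345 x
```

A bare 12345 has no character left for [^[:punct:]] to consume, which is why only the made-up line is printed.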