Want to use Bash & Regex to replace comma in file

I need to replace a specific character, such as a comma, in a CSV file.
I have files with text and numeric values separated by ';' (French-style CSV).
Example:
value;x;y;comment;
abc;123,45;987,65;abc;
abc;123.45;987.65;abc;
abc;123,45;987,65;abc, blabla;
The decimal separator is mixed: both ',' and '.' are used.
I want to replace ',' with '.', but ONLY in decimal values, not in text such as comments.
I tried sed with regex
sed -i '/;[0-9]\+,[0-9]\+;/s/,/./g' file.csv
But that replaces all the commas. I can't find how to replace only what I want.
I want to do this in bash only.

One sed idea using extended regex and capture groups:
sed -E 's/([0-9]),([0-9])/\1.\2/g' file.csv
Where:
-E - enable extended regex support
([0-9]),([0-9]) - match a single digit + , + single digit
([0-9]) - define a capture group (there are 2 capture groups in this case)
\1.\2 - print capture group #1 + . + capture group #2
This generates:
value;x;y;comment;
abc;123.45;987.65;abc;
abc;123.45;987.65;abc;
abc;123.45;987.65;abc, blabla;
NOTES:
Once OP is satisfied that the code performs the desired operation, the -i flag can be added to have sed update the file in place.
This will erroneously replace commas in a string such as ;3,2,4 five 6,7 eight ; (this can be addressed, but requires a more complex regex).
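One way the second note could be addressed, as a sketch rather than a definitive fix: anchor the match to the field delimiters, so that only a field made up entirely of digits, one comma, and digits is rewritten. Two adjacent fields share a single ';', so a plain g flag would skip every other field; this version instead loops with a label and the t command until no substitution happens. The function name fix_decimals is made up for illustration.

```shell
# Hypothetical helper: convert the decimal comma only in fields that are
# purely numeric (digits, one comma, digits). Reads stdin, writes stdout.
fix_decimals() {
  sed -E \
    -e ':again' \
    -e 's/(^|;)([0-9]+),([0-9]+)(;|$)/\1\2.\3\4/' \
    -e 't again'
}

printf '%s\n' 'abc;123,45;987,65;abc, blabla;' ';3,2,4 five 6,7 eight ;' | fix_decimals
# prints:
# abc;123.45;987.65;abc, blabla;
# ;3,2,4 five 6,7 eight ;
```

The trade-off is that a value such as 1,5 embedded in free text is left alone, which matches the stated goal of changing decimal fields only.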

You may use this simpler sed:
sed -i.bak -E 's/([0-9]),([0-9])/\1.\2/g' file
value;x;y;comment;
abc;123.45;987.65;abc;
abc;123.45;987.65;abc;
abc;123.45;987.65;abc, blabla;
Details:
([0-9]),([0-9]): Match a digit, a comma, and a digit. The digits before and after the comma are captured in groups #1 and #2
\1.\2: Replace with back-reference #1 followed by dot followed by back-reference #2
Alternatively, you may use this more robust awk solution:
awk 'BEGIN {FS=OFS=";"} {for (i=1; i<=NF; ++i)
if ($i ~ /^[0-9]+,[0-9]+$/) sub(/,/, ".", $i)} 1' file
value;x;y;comment;
abc;123.45;987.65;abc;
abc;123.45;987.65;abc;
abc;123.45;987.65;abc, blabla;

You can try:
sed -i 's/;\([0-9]\+\),\([0-9]\+\)/;\1.\2/g' file.csv
Note: if you use the -i option, don't forget to make a backup of your original data, just in case.

Related

How can I replace a string containing special characters?

I have a text file that contains a line with brackets, characters, integers, and : symbols.
I want to replace [0:1] with [2:4]
$ cat input.txt
str(tr.dx)[0:1]
Expected output:
str(tr.dx)[2:4]
I tried
sed -i 's/str(tr.dx)[0:1]/str(tr.dx)[2:4]/g' input.txt
but it does not work. How can I fix this?

You may use this sed:
sed 's/\(str(tr\.dx)\)\[0:1]/\1[2:4]/' file
str(tr.dx)[2:4]
Here:
\(str(tr\.dx)\) matches str(tr.dx) and captures it in group #1
We need to escape the dot in regex
\[0:1] matches [0:1]. Here we need to escape [
\1 is back-reference for capture group #1
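When the search text is not known in advance, escaping by hand gets error-prone. A minimal sketch, assuming the text is meant to match literally in sed's default (basic) regex mode: a small helper that backslash-escapes the characters that are special there, plus the / delimiter. The name escape_bre is hypothetical, and the replacement side is assumed to contain no &, \ or /.

```shell
# Hypothetical helper: backslash-escape the characters that are special
# in a basic regular expression (and the / delimiter) so the string
# matches literally inside sed's s/.../.../ command.
escape_bre() {
  printf '%s\n' "$1" | sed 's|[[\.*^$/]|\\&|g'
}

pattern=$(escape_bre 'str(tr.dx)[0:1]')   # becomes: str(tr\.dx)\[0:1]
printf '%s\n' 'str(tr.dx)[0:1]' | sed "s/$pattern/str(tr.dx)[2:4]/"
# prints: str(tr.dx)[2:4]
```

The escaped pattern is the same one used in the answer above: only the dot and the opening bracket need a backslash.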

Capture word after pattern with slash

I want to extract word1 from:
something /CLIENT_LOGIN:word1 something else
I would like to extract the first word after matching pattern /CLIENT_LOGIN:.
Without the slash, something like this works:
A="something /CLIENT_LOGIN:word1 something else"
B=$(echo "$A" | awk '$1 == "CLIENT_LOGIN" { print $2 }' FS=":")
With the slash though, I can't get it working (I tried putting / and \/ in front of CLIENT_LOGIN). I don't mind whether it is done with awk, grep, sed, ...

Using sed:
s='something /CLIENT_LOGIN:word1 something else'
sed -E 's~.* /CLIENT_LOGIN:([^[:blank:]]+).*~\1~' <<< "$s"
word1
Details:
We use ~ as regex delimiter in sed
/CLIENT_LOGIN:([^[:blank:]]+) matches /CLIENT_LOGIN: followed by one or more non-blank characters (anything except space and tab), which are captured in group #1
.* on both sides matches text before and after our match
\1 is used in substitution to put 1st group's captured value back in output

1st solution: With your shown samples, please try following GNU grep solution.
grep -oP '^.*? /CLIENT_LOGIN:\K(\S+)' Input_file
Explanation: GNU grep's -o option prints only the matched part of the line, and -P enables PCRE regex. The regex ^.*? /CLIENT_LOGIN:\K(\S+) uses a lazy match from the start of the line up to the first occurrence of /CLIENT_LOGIN:. \K then discards everything matched so far, so that only the required value is printed, and \S+ matches the run of non-whitespace characters that follows.
2nd solution: Using awk's match function along with its split function to print the required value.
awk '
match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){
split(substr($0,RSTART,RLENGTH),arr,":")
print arr[2]
}
' Input_file
3rd solution: Using GNU awk's FPAT variable. FPAT is set to /CLIENT_LOGIN: followed by one or more non-whitespace characters, so that text becomes the first field. The main block uses sub to delete everything up to and including the : in the first field, then prints that field.
awk -v FPAT='/CLIENT_LOGIN:[^[:space:]]+' '{sub(/.*:/,"",$1);print $1}' Input_file

Performing a regex match and capturing the resulting string in BASH_REMATCH[]:
$ regex='.*/CLIENT_LOGIN:([^[:space:]]*).*'
$ A='something /CLIENT_LOGIN:word1 something else'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
word1
Verifying B remains undefined if we don't find our match:
$ A='something without the desired string'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
<<<=== nothing output

Fixing your awk command, you can use
A="/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C"
B=$(echo "$A" | awk 'match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){print substr($0,RSTART+14,RLENGTH-14)}')
This yields xdfmb1d. Details:
\/CLIENT_LOGIN: - a /CLIENT_LOGIN: string
[^[:space:]]+ - one or more non-whitespace chars
The pattern above is what awk searches for, and once matched, the part of this match value after /CLIENT_LOGIN: is "extracted" using substr($0,RSTART+14,RLENGTH-14) (where 14 is the length of the /CLIENT_LOGIN: string).
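The hard-coded 14 ties the command to one particular key. As a sketch of the same idea without the magic number, the hypothetical helper below (value_after is not from the original answer) takes the key as an argument, finds it with awk's index(), which is a plain string search and so needs no escaping of the slash, and uses length(key) as the offset. It assumes values are delimited by spaces or tabs.

```shell
# Hypothetical helper: print the word that follows a literal key on each
# input line that contains it.
value_after() {
  awk -v key="$1" '{
    pos = index($0, key)                     # 0 when the key is absent
    if (pos) {
      rest = substr($0, pos + length(key))   # text right after the key
      sub(/[ \t].*/, "", rest)               # cut at the first space or tab
      print rest
    }
  }'
}

A="/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C"
B=$(printf '%s\n' "$A" | value_after '/CLIENT_LOGIN:')
echo "$B"   # prints: xdfmb1d
```

Lines that do not contain the key produce no output, so B stays empty in that case.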

add dot before first integer in all lines

I have lines like
Input:
abcd1234
bdfghks4506
agfdch6985
I would like to add "." before the first integer in line, how do I do it?
Output:
abcd.1234
bdfghks.4506
agfdch.6985

This might work for you (GNU sed):
sed -i 's/[[:digit:]]/.&/' file
If there is a digit in a line, put a . before the first one.
N.B. To put a . before every digit in a file use:
sed -i 's/[[:digit:]]/.&/g' file

$ cat > input.txt
abcd1234
bdfghks4506
agfdch6985
$ sed -e 's/^\([^0-9]*\)\([0-9]\)\(.*\)$/\1.\2\3/' input.txt
abcd.1234
bdfghks.4506
agfdch.6985
Use sed string replacement with regular expression capture groups.
Match the beginning of the line.
Start a capture group that matches any number of non-numeric characters.
Start a second capture group that matches a single digit.
Start a third capture group that matches the remainder of the line.
Match the end of the line.
Replace the entire line with the contents of the first capture group, ".", the second capture group and, finally, the third capture group.

The function sub of awk may help,
$ awk 'sub(/[0-9]/,".&",$0)1' file
abcd.1234
bdfghks.4506
agfdch.6985
Brief explanation,
sub: replace only the first matching substring in each line
&: is replaced with the text that was actually matched (here, the first digit matched by [0-9])
Appended 1: to print the result.

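The strictest command for your case would be
sed -E -i 's/([a-z])([1-9])/\1.\2/' file.txt
-E Use extended regex
-i Replace in file (instead of writing to output); BSD/macOS sed needs -i '' instead
This matches every example you provided, but only where a lowercase letter is immediately followed by a digit from 1 to 9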

Not familiar with awk / sed specifically, but a regex replace using this regex should be all you need:
Search: (\d+.*?)$ (match everything from the first found number to the end of the line)
Replace by: .$1 (captured group #1 prefixed by a literal .)
The notation of the capture group in the replace command may differ depending on the implementation. I used $1 here, but some implementations may use \1.

Extract QueryString value using sed

I have the following lines in an apache access log
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah
and I want to extract the MSISDN value only, so the expected output would be
647930229655
647930229656
647930229657
647930229658
I'm using the following sed command, but I can't get it to stop capturing at &
sed 's/.*MSISDN=\(.*\)/\1/'

sed solution:
sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
& - is key/value pair separator in URL syntax, so you should rely on it
([^&]+) - 1st captured group containing any character sequence except &
\1 - backreference to the 1st captured group
The output:
647930229655
647930229656
647930229657
647930229658
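The same pattern generalizes to any parameter name. The following is only a sketch: query_param is a hypothetical helper, and it assumes the name contains no regex metacharacters and that the value ends at the next & or at the end of the line. It accepts either ? or & before the name, so the first parameter in the query string works too, and -n with the p flag suppresses lines where the parameter is missing.

```shell
# Hypothetical helper: print the value of one query-string parameter
# for every input line that contains it.
query_param() {
  sed -n -E "s/.*[?&]$1=([^&]*).*/\1/p"
}

printf '%s\n' '/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah' |
  query_param MSISDN
# prints: 647930229655
```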

-o : means print only matching string not the whole line.
-P: To enable pcre regex.
\K: means discard everything matched so far from the reported match. The text to its left must still be present in the input string.
\d: means digit, + means one or more digit.
grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658

The following simple sed may also help.
sed 's/.*MSISDN=//;s/&.*//' Input_file
Explanation:
s/.*MSISDN=//: substitutes everything up to and including MSISDN= with nothing (the empty replacement between //) in the current line.
; the semicolon separates this from the next sed command.
s/&.*//: substitutes everything from the first remaining & to the end of the line with nothing.

$ grep -oP '(?<=&MSISDN=)\d+' file
647930229655
647930229656
647930229657
647930229658
-o option is meant to show only matched output
-P option is meant to enable PCRE (Perl Compatible Regex)
(?<=regex) is a positive lookbehind assertion. Lookarounds don't consume any characters while matching, unlike normal regex. Hence the only matched output you get is \d+, which is 1 or more digits.
or using sed:
$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file
647930229655
647930229656
647930229657
647930229658

You can also pipe cut to cut (this relies on MSISDN being the third &-separated field):
cut -d '&' -f3 Input_file |cut -d '=' -f2

Replace the match itself, not the wild card, using awk gsub or generic regular expressions

I have a tab-delimited file of the following appearance:
12-38070040-39070040 13-92416321-93446176 14-47539055-48560868 14-89244697-90244697 14-90046821-91047886 14-98556636-99556636 15-47718221-48718221
I want to replace all instances of:
tab, then any two digits, then a hyphen \t[0-9][0-9]-
with:
tab, then the same two digits, then a colon \t SAME TWO DIGITS :
12:38070040-39070040 13:92416321-93446176 14:47539055-48560868 14:89244697-90244697 14:90046821-91047886 14:98556636-99556636 15:47718221-48718221
How can I match using a wildcard, but then replace the match, instead of replacing the wildcard?
One last note, I have asked about awk '{gsub()}' because I use it the most, however if there is generic "pseudo regex" that would work in most environments, most text editors, etc., I would be just as happy to learn about that.

It sounds like what you're referring to is a capture group. Capture groups enable you to use part of the matched pattern in the replacement string.
Normal gsub doesn't allow you to use capture groups but if you're using GNU awk, you can use gensub instead:
awk '{print gensub(/\y([0-9][0-9])-/, "\\1:", "g")}' file
This captures the two digits preceded by a word boundary \y and followed by a hyphen. The digits are then used in the replacement (that's what the \\1 is for), followed by a colon. The "g" argument means that a global substitution is performed. If multiple capture groups were specified, they would be \\2, \\3 etc.
Testing it out on your file:
$ awk '{print gensub(/\y([0-9][0-9])-/, "\\1:", "g")}' file
12:38070040-39070040 13:92416321-93446176 14:47539055-48560868 14:89244697-90244697 14:90046821-91047886 14:98556636-99556636 15:47718221-48718221
You can use sed to do the same job:
sed -r 's/(^|[[:space:]])([0-9]{2})-/\1\2:/g' file
This matches any two digits preceded by either a character in the space class (tabs and spaces are included) or the start of the line ^ and followed by a hyphen. Now there are two capture groups, so the replacement contains them both as well as the colon. Using BSD sed (e.g. on a Mac), use -E instead of -r to enable extended regex mode.
Since we're dealing with regexes, it seems unreasonable not to mention Perl:
perl -pe 's/\b(\d{2})-/\1:/g' file
This uses the word boundary \b which matches the gap between the beginning of the number and either the start of the line or whitespace. \d is the digit class, a shorthand for [0-9]. The replacement is similar to the one used in awk, except that we don't need to escape the backslash. (Perl also accepts $1 in the replacement, which is its preferred spelling.)
Output in all cases:
12:38070040-39070040 13:92416321-93446176 14:47539055-48560868 14:89244697-90244697 14:90046821-91047886 14:98556636-99556636 15:47718221-48718221

One awk and one sed solution are included here. Let us start with the sed solution:
$ sed -r 's/^([0-9]{2})-/\1:/; s/\t([0-9]{2})-/\t\1:/g' file
12:38070040-39070040 13:92416321-93446176 14:47539055-48560868 14:89244697-90244697 14:90046821-91047886 14:98556636-99556636 15:47718221-48718221
This uses two sed commands:
s/^([0-9]{2})-/\1:/
If a line begins with two numbers followed by a dash, this matches and substitutes in the same numbers (\1) and a colon.
s/\t([0-9]{2})-/\t\1:/g
Anytime a tab is followed by two numbers and a dash, this substitutes in a tab, the same two numbers (\1), and a colon.
The -r option on GNU sed (-E on OSX) tells sed to use extended regex so that fewer backslashes are needed.
For Mac OSX and other non-GNU platforms, try:
$ sed -E -e 's/^([0-9]{2})-/\1:/' -e 's/\t([0-9]{2})-/\t\1:/g' file
Awk solution
If we restrict ourselves to standard parts of the awk language, then we lose the elegance of regexes but we can still assemble the right answer by using substr:
$ awk -v 'OFS=\t' '{for (i=1;i<=NF;i++) {if ($i ~ /^[0-9][0-9]-/) {$i=substr($i,1,2)":"substr($i,4)}}} 1' file
12:38070040-39070040 13:92416321-93446176 14:47539055-48560868 14:89244697-90244697 14:90046821-91047886 14:98556636-99556636 15:47718221-48718221
Taking each part, a piece at a time:
-v 'OFS=\t'
This sets the output field separator to tab.
{for (i=1;i<=NF;i++) {if ($i ~ /^[0-9][0-9]-/) {$i=substr($i,1,2)":"substr($i,4)}}}
This loops over every field and, if the field starts with two digits followed by a dash, then a new value is assigned to the field consisting of the first two digits (awk string positions start at 1) followed by a colon followed by the rest of the field.
1
This is cryptic shorthand for print the whole line.
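The sed commands above rely on the \t escape, which not every sed implementation is guaranteed to understand. A hedged alternative, assuming a POSIX shell: put a literal tab character into a variable with printf and splice it into the script. The names tab and colon_after_prefix are made up for this sketch.

```shell
# A literal tab character; command substitution only strips newlines.
tab=$(printf '\t')

# Hypothetical helper: the same two substitutions as the sed solution,
# with a literal tab in place of the \t escape.
colon_after_prefix() {
  sed -E \
    -e 's/^([0-9]{2})-/\1:/' \
    -e "s/${tab}([0-9]{2})-/${tab}\\1:/g"
}

printf '12-38070040-39070040\t13-92416321-93446176\n' | colon_after_prefix
# prints (fields still tab-separated):
# 12:38070040-39070040	13:92416321-93446176
```

Inside the double-quoted script, \\1 reaches sed as \1, while the first script can stay single-quoted because it needs no shell expansion.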
