Extract fixed-position substrings from file [duplicate] - regex

This question already has answers here:
bash: shortest way to get n-th column of output
(8 answers)
Closed 1 year ago.
I need to extract substrings from a file into a new file. Mac or Linux.
The data is between the 4th and 5th "|" symbol.
HD|262339|9400530374||K7UKD|A|HA|12/15/2009|03/13/2020
The actual columnar position varies, sometimes by a lot, but the data is always between the 4th and 5th pipe symbol.
Sample data is as above, expected output would be K7UKD.
I've tried various hacks at a regex:
grep "/\|(\w+)\|/" input.txt > output.txt

Converting my comment to an answer so that the solution is easy to find for future visitors.
There are two ways to do it with awk:
Any awk version:
awk -F'|' '{print $5}' file
K7UKD
or using gnu-awk:
awk -v RS='|' 'NR == 5' file
Here is a bash solution using read:
IFS='|' read -ra arr <<< 'HD|262339|9400530374||K7UKD|A|HA|12/15/2009|03/13/2020' &&
echo "${arr[4]}"
K7UKD
Or using cut:
cut -d'|' -f5 file
Or using sed:
sed -E 's/^([^|]*\|){4}([^|]*).*/\2/' file
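For the original requirement of writing the result to a new file, any of these can simply be redirected; for example, with cut and the file names from the question:
cut -d'|' -f5 input.txt > output.txt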


Add delimiters at specific indexes

I want to add a delimiter at certain indexes in each line of a file.
I have a file with data:
10100100010000
20200200020000
And I know the offset of each column (2, 5 and 9).
With this sed command: sed 's/\(.\{2\}\)/&,/;s/\(.\{6\}\)/&,/;s/\(.\{11\}\)/&,/' myFile
I get the expected output:
10,100,1000,10000
20,200,2000,20000
but with a large number of columns (~200) and rows (300k) it is really slow.
Is there an efficient alternative?
1st solution: with GNU awk, try the following:
awk -v OFS="," '{$1=$1}1' FIELDWIDTHS="2 3 4 5" Input_file
2nd solution: using sed, try the following:
sed 's/\(..\)\(...\)\(....\)\(.....\)/\1,\2,\3,\4/' Input_file
3rd solution: an awk solution using substr:
awk 'BEGIN{OFS=","} {print substr($0,1,2) OFS substr($0,3,3) OFS substr($0,6,4) OFS substr($0,10,5)}' Input_file
In the substr solution above, I have taken 5 characters in substr($0,10,5). If you want everything from the 10th position to the end of the line, use substr($0,10) instead, which prints the rest of the line.
Output will be as follows.
10,100,1000,10000
20,200,2000,20000
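For reference, the substr variant described above, taking everything from the 10th character to the end of the line, is the same one-liner with only the last substr changed:
awk 'BEGIN{OFS=","} {print substr($0,1,2) OFS substr($0,3,3) OFS substr($0,6,4) OFS substr($0,10)}' Input_file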
Modifying your sed command to make it add all the separators in one shot would likely make it perform better:
sed 's/^\(.\{2\}\)\(.\{3\}\)\(.\{4\}\)/\1,\2,\3,/' myFile
Or with extended regular expression:
sed -E 's/(.{2})(.{3})(.{4})/\1,\2,\3,/' myFile
Output:
10,100,1000,10000
20,200,2000,20000
With GNU awk for FIELDWIDTHS:
$ awk -v FIELDWIDTHS='2 3 4 *' -v OFS=',' '{$1=$1}1' file
10,100,1000,10000
20,200,2000,20000
You'll need a newer version of gawk for * at the end of FIELDWIDTHS to mean "whatever's left"; with older versions, just choose a large number like 999.
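For older gawk, that fallback would look like this (a sketch, assuming no line is longer than 999 characters):
awk -v FIELDWIDTHS='2 3 4 999' -v OFS=',' '{$1=$1}1' file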
If you start the substitutions from the back, you can use the number flag to s to specify which occurrence of any character you'd like to append a comma to:
$ sed 's/./&,/9;s/./&,/5;s/./&,/2' myFile
10,100,1000,10000
20,200,2000,20000
You could automate that a bit further by building the command with a printf statement:
printf -v cmd 's/./&,/%d;' 9 5 2
sed "$cmd" myFile
or even wrap that in a little shell function so we don't have to care about listing the columns in reverse order:
gencmd() {
    local arr IFS
    # Sort the offsets in descending order so the substitutions run back to front
    IFS=$'\n' arr=($(sort -nr <<< "$*"))
    printf 's/./&,/%d;' "${arr[@]}"
}
sed "$(gencmd 2 5 9)" myFile

Use Variable in SED [duplicate]

This question already has answers here:
Escape a string for a sed replace pattern
(17 answers)
Closed 4 years ago.
I cannot expand this variable in sed. I've tried everything I can think of.
I am trying to put the md5sum of file1 in line 10 of file2
I can take $x out of the regex, put in some literal text, and it works; it just will not accept the variable. printf shows the variable is fine.
#!/bin/bash
x=$(md5sum /etc/file1)
printf "$x \n"
sed -i 10"s/.*/$x/g" /usr/bin/file2
You may use this command, which uses ~ as the regex delimiter instead of /, since the output of md5sum contains /:
sed -i "10s~.*~$x~" /usr/bin/file2
After I reduced the variable from the md5sum output (which includes the filename and directory) by running $x through:
x=$(echo $x | head -n1 | awk '{print $1;}')
leaving only the MD5, it worked and stopped erroring.
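The two steps can also be combined by taking only the first field of md5sum's output up front; a sketch using the same commands as above:
x=$(md5sum /etc/file1 | awk '{print $1}')
sed -i "10s~.*~$x~" /usr/bin/file2
Since $x then contains only the hex digest, the choice of sed delimiter no longer matters.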

Dynamically substitute pattern with env variable with bash [duplicate]

This question already has answers here:
Bash Search File for Pattern, Replace Pattern With Code that Includes Git Branch Name
(1 answer)
Replace a string in shell script using a variable
(12 answers)
Closed 6 years ago.
I have a file file.txt with this content: Hi {YOU}, it's {ME}
I would like to dynamically create a new file file1.txt like this
YOU=John
ME=Leonardo
cat ./file.txt | sed 'SED_COMMAND_HERE' > file1.txt
which content would be: Hi John, it's Leonardo
The sed command I have tried so far is s#{\([A-Z]*\)}#'"$\1"'#g, but the substitution part doesn't work correctly; it prints out Hi $YOU, it's $ME.
The sed utility can do multiple things to each input line:
$ sed -e "s/{YOU}/$YOU/" -e "s/{ME}/$ME/" inputfile.txt >outputfile.txt
This assumes that {YOU} and {ME} occur only once each on the line; otherwise, just add g ("s/{YOU}/$YOU/g" etc.).
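Written out with the g flag, that would be:
sed -e "s/{YOU}/$YOU/g" -e "s/{ME}/$ME/g" inputfile.txt >outputfile.txt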
You can use awk with 2 files.
$> cat file.txt
Hi {YOU}, it's {ME}
$> cat repl.txt
YOU=John
ME=Leonardo
$> awk -F= 'FNR==NR{a["{" $1 "}"]=$2; next} {for (i in a) gsub(i,a[i])}1' repl.txt file.txt
Hi John, it's Leonardo
The first block of the awk command goes through the replacement file and stores each key-value pair in array a, wrapping the keys with { and }.
In the second pass we just replace each key with its value in the actual file.
Update:
To do this without creating repl.txt, you can use process substitution: ( set -o posix ; set ) prints the shell's variables in name=value form, and grep keeps only the YOU and ME lines, which awk reads in place of repl.txt:
awk -F= 'FNR==NR{a["{" $1 "}"]=$2; next} {
for (i in a) gsub(i,a[i])} 1' <(( set -o posix ; set ) | grep -E '^(YOU|ME)=') file.txt

Using sed, awk or grep to split a string with numbers [duplicate]

This question already has an answer here:
Learning Regular Expressions [closed]
(1 answer)
Closed 7 years ago.
I have this file:
mmD_154Lbb_e_dxk_83233.orc
154L_bbe_Bddxk_3259.txt
14Lbe_3233.orc
m2_154Lbbe_dxk_67233.op
mZZ_1A4Lbbe_dxk_32823.op
mmD_154Lbbe_dxk_99333.orc
mmD_oS154be_dxk_12338.txt
I'm trying to use sed or awk to pull out the numbers, but I don't have a solution.
I need this output:
83233
3259
3233
67233
32823
99333
12338
How can I get it to split on each delimiter?
Thanks
This awk can get you that:
awk -F '[_.]' '{print $(NF-1)}' file
83233
3259
3233
67233
32823
99333
12338
sed 's/\..*$//;s/^.*_//' file
The first substitution (s/\..*$//) removes the '.' and everything after it; the second (s/^.*_//) removes everything from the start of the line up to and including the last '_' character.
output
83233
3259
3233
67233
32823
99333
12338
IHTH
With GNU sed:
sed 's/^.*_\([[:digit:]]\+\)\..*$/\1/' file
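Since the question also mentions grep, a GNU grep alternative using -P might look like this (a sketch, assuming the number always sits immediately before the extension):
grep -oP '[0-9]+(?=\.[a-z]+$)' file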

Extract word after a known pattern in UNIX [duplicate]

This question already has answers here:
get the next word after grep matching [duplicate]
(3 answers)
Closed 7 years ago.
I have a file called in.txt which contains a whole bunch of code; however, I need to extract a user ID which is guaranteed to be of the form 'EID:nmb685', potentially with content before and/or after the guaranteed format. I want to extract the 'nmb685' using a bash script. I've tried some combinations of grep and sed but nothing has worked.
If your grep doesn't support -P but supports -o, you can combine grep and awk:
grep -o 'EID:\w\+' file | awk -F':' '{print $2}'
Though it can be done with awk alone, this is more straightforward.
If your grep supports -P, the perl-regexp option, you may use this:
grep -oP 'EID:\K\w+' file
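To capture the ID in a shell variable, the same command can be wrapped in command substitution (a sketch, assuming a single match in the file):
eid=$(grep -oP 'EID:\K\w+' in.txt)
echo "$eid"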
What is being output after the ID? Is there anything consistent that you can match against?
If you know the length of the user ID, you can use:
grep "EID:......" in.txt > out.txt
or if you don't, maybe something like this (matches a run of letters/digits preceded by EID: and followed by a space):
grep "EID:[A-Za-z0-9]* " in.txt > out.txt
Not very elegant, but this works:
grep "EID:" in.txt | sed 's/\(.*\EID:......\).*/\1/g' | sed 's/^.*EID://'
Select all lines with the substring "EID:"
Remove everything after "EID:" plus 6 characters
Remove everything before (and including) "EID:"
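A shorter pipeline in the same spirit, combining the grep -o idea from above with a single sed pass (a sketch, assuming the ID contains only letters and digits):
grep -o 'EID:[A-Za-z0-9]*' in.txt | sed 's/^EID://' > out.txt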