Add delimiters at specific indexes - regex

I want to add a delimiter at certain indexes on each line of a file.
I have a file with data:
10100100010000
20200200020000
I know the offsets of the columns (2, 5 and 9).
With this sed command: sed 's/\(.\{2\}\)/&,/;s/\(.\{6\}\)/&,/;s/\(.\{11\}\)/&,/' myFile
I get the expected output:
10,100,1000,10000
20,200,2000,20000
but with a large number of columns (~200) and rows (300k) it is really slow.
Is there an efficient alternative?

1st solution: with GNU awk, you could try the following:
awk -v OFS="," '{$1=$1}1' FIELDWIDTHS="2 3 4 5" Input_file
2nd solution: using sed, try the following.
sed 's/\(..\)\(...\)\(....\)\(.....\)/\1,\2,\3,\4/' Input_file
3rd solution: an awk solution using substr.
awk 'BEGIN{OFS=","} {print substr($0,1,2) OFS substr($0,3,3) OFS substr($0,6,4) OFS substr($0,10,5)}' Input_file
In the substr solution above, I have taken 5 characters in substr($0,10,5); in case you want to take all characters starting from the 10th position, use substr($0,10), which will print the rest of the line.
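For example, the open-ended variant would be:
awk 'BEGIN{OFS=","} {print substr($0,1,2) OFS substr($0,3,3) OFS substr($0,6,4) OFS substr($0,10)}' Input_file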
Output will be as follows.
10,100,1000,10000
20,200,2000,20000

Modifying your sed command to make it add all the separators in one shot would likely make it perform better:
sed 's/^\(.\{2\}\)\(.\{3\}\)\(.\{4\}\)/\1,\2,\3,/' myFile
Or with extended regular expression:
sed -E 's/(.{2})(.{3})(.{4})/\1,\2,\3,/' myFile
Output:
10,100,1000,10000
20,200,2000,20000

With GNU awk for FIELDWIDTHS:
$ awk -v FIELDWIDTHS='2 3 4 *' -v OFS=',' '{$1=$1}1' file
10,100,1000,10000
20,200,2000,20000
You'll need a newer version of gawk for * at the end of FIELDWIDTHS to mean "whatever's left"; with older versions, just choose a large number like 999.
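With an older gawk, that would look like:
awk -v FIELDWIDTHS='2 3 4 999' -v OFS=',' '{$1=$1}1' file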

If you start the substitutions from the back, you can use the number flag to s to specify which occurrence of any character you'd like to append a comma to:
$ sed 's/./&,/9;s/./&,/5;s/./&,/2' myFile
10,100,1000,10000
20,200,2000,20000
You could automate that a bit further by building the command with a printf statement:
printf -v cmd 's/./&,/%d;' 9 5 2
sed "$cmd" myFile
or even wrap that in a little shell function so we don't have to care about listing the columns in reverse order:
gencmd() {
local arr
# Sort arguments in descending order
IFS=$'\n' arr=($(sort -nr <<< "$*"))
printf 's/./&,/%d;' "${arr[@]}"
}
sed "$(gencmd 2 5 9)" myFile

Related

Bash - numbers of multiple lines matching regex (possible oneliner?)

I'm not very fluent in bash but actively trying to improve, so I'd like to ask some experts here for a little suggestion :)
Let's say I've got a following text file:
Some
spam
about which I don't care.
I want following letters:
X1
X2
X3
I do not want these:
X4
X5
Nor this:
X6
But I'd like these, too:
I want following letters:
X7
And so on...
And I'd like to get the numbers of the lines with these letters, so my desired output should look like:
5 6 7 15
To clarify: I want all lines matching some regex /\s*X./ that occur right after a match of another regex /\sI want following letters:/
Right now I've got a working solution, which I don't really like:
cat data.txt | grep -oPz "\sI want following letters:((\s*X.)*)" | grep -oPz "\s*X." > tmp.txt
for entry in $(cat tmp.txt); do
grep -n $entry data.txt | cut -d ":" -f1
done
My question is: is there any smart way, any tool I don't know of, with the functionality to do this in one line? (I especially don't like having to use a temp file and a loop here.)
You can use awk:
awk '/I want following/{p=1;next}!/^X/{p=0;next}p{print NR}' file
Explanation in multiline version:
#!/usr/bin/awk -f
/I want following/{
# Just set a flag and move on with the next line
p=1
next
}
!/^X/ {
# On all other lines that don't start with an X
# reset the flag and continue to process the next line
p=0
next
}
p {
# If the flag p is set it must be a line with X+number.
# print the line number NR
print NR
}
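Running this against the sample data (saved as file) prints the desired line numbers:
$ awk '/I want following/{p=1;next}!/^X/{p=0;next}p{print NR}' file
5
6
7
15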
The following may help you here.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1} flag' Input_file
The above will also print the lines containing I want following letters:; in case you don't want those, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag' Input_file
To add line numbers to the output, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag{print FNR}' Input_file
First, let's optimize your current script a little bit:
#!/bin/bash
FILE="data.txt"
while read -r entry; do
[[ $entry ]] && grep -n "$entry" "$FILE" | cut -d ":" -f1
done < <(grep -oPz "\sI want following letters:((\s*X.)*)" "$FILE"| grep -oPz "\s*X.")
And here are some comments:
No need to use cat file|grep ... => grep ... file
Do not use the syntax for i in $(command); it's often the cause of multiple bugs and there's always a smarter solution.
No need to use a tmp file either
And then, there are a lot of shorter possible solutions. Here's one using awk:
$ awk '{ if($0 ~ "I want following letters:") {s=1} else if(!($0 ~ "^X[0-9]*$")) {s=0}; if (s && $0 ~ "^X[0-9]*$") {gsub("X", ""); print}}' data.txt
1
2
3
7

Combining sed and awk commands

I am trying to add a few columns to a file with about 500 rows, but for now let's say I am using one file with 500 lines.
I have two commands. One sed command and one awk command
The sed command is used to place a string at the front of every line. (It works perfectly)
Example of the script:
sed -e "2,$s#^#https://confidential/index.pl?Action=AgentTicketZoom;TicketID=#" C:\Users\hd\Desktop\action.txt > C:\Users\hd\Desktop\test.txt
The awk command is meant to place a string at the beginning of every line, before the sed string, and increment the two numbers (example below). So technically speaking, the sed string will be in column 2 and the awk string in column 1.
I would use another sed command but sed doesn't increment values as easily. Please help!
Example of the script:
awk '{
for (i=0; i<=10; i++)
{
printf "=HYPERLINK(B%d, C%d), \n, i"
}exit1
}'
The awk code is supposed to show something like
=HYPERLINK(B2,C2), https://confidential/index.pl?Action=AgentTicketZoom;TicketID=
=HYPERLINK(B3,C3), https://confidential/index.pl?Action=AgentTicketZoom;TicketID=
=HYPERLINK(B4,C4), https://confidential/index.pl?Action=AgentTicketZoom;TicketID=
=HYPERLINK(B5,C5), https://confidential/index.pl?Action=AgentTicketZoom;TicketID=
=HYPERLINK(B6,C6), https://confidential/index.pl?Action=AgentTicketZoom;TicketID=
You never need sed if you're using awk, and you should never use sed for anything other than simple substitutions on a single line. Just use this awk script:
awk 'NR>1{printf "=HYPERLINK(B%d,C%d), https://confidential/index.pl?Action=AgentTicketZoom;TicketID=%s\n", NR, NR, $0}' file
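For illustration, assuming a hypothetical action.txt with a header line followed by ticket IDs:
$ cat action.txt
TicketID
1001
1002
$ awk 'NR>1{printf "=HYPERLINK(B%d,C%d), https://confidential/index.pl?Action=AgentTicketZoom;TicketID=%s\n", NR, NR, $0}' action.txt
=HYPERLINK(B2,C2), https://confidential/index.pl?Action=AgentTicketZoom;TicketID=1001
=HYPERLINK(B3,C3), https://confidential/index.pl?Action=AgentTicketZoom;TicketID=1002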

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular pattern in a string. What is the best way to do it in Unix? What is the most elegant and simple method to achieve this: sed, awk, or just Unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also handle zero occurrences of the pattern, so with zero or one occurrence the string should be left alone, and from the 2nd occurrence on, everything should be removed.
So if the input is as follows
After
Output should be
After
Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
This might work for you (GNU sed):
sed 's/-[^-]*//2g' file
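For the sample input:
$ echo "After-u-math-how-however" | sed 's/-[^-]*//2g'
After-u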
You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u
@EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one, but since you asked about sed and awk:
With GNU sed for -r:
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1","")}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -), accessible as the special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
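The same command handles the zero-occurrence case from the question:
$ awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After'
After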
This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
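Since assigning IFS at the prompt changes word splitting for everything that follows, one possible refinement is to keep the change local to a function (first2 is a hypothetical name):
first2() {
    # local IFS=- affects both the read splitting and the "${val[*]}" join
    local IFS=-
    local -a val
    read -ra val <<< "$1"
    echo "${val[*]:0:2}"
}
first2 After-u-math-how-however   # -> After-u
first2 After                      # -> After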
awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After
This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2; i++) printf"-%s",$i}'

Awk to skip the blank lines

The output of my script is tab-delimited, using awk as:
awk -v variable=$bashvariable '{print variable"\t single\t" $0"\t double"}' myinfile.c
The awk command is run in a while loop which updates the variable value and the file myinfile.c for every cycle.
I am getting the expected results with this command.
But if myinfile.c contains a blank line (which it can), it prints irrelevant information. Can I tell awk to ignore the blank lines?
I know it can be done by removing the blank lines from myinfile.c before passing it on to awk.
I know of the sed and tr ways, but I want awk to handle it in the above command itself, not as a separate or piped solution like the ones below.
sed '/^$/d' myinfile.c
tr -s "\n" < myinfile.c
Thanks in advance for your suggestions and replies.
There are two approaches you can try to filter out blank lines:
awk 'NF' data.txt
and
awk 'length' data.txt
Just put these at the start of your command, i.e.,
awk -v variable=$bashvariable 'NF { print variable ... }' myinfile
or
awk -v variable=$bashvariable 'length { print variable ... }' myinfile
Both of these act as gatekeepers/if-statements.
The first approach works by only printing lines where the number of fields (NF) is not zero (i.e., greater than zero).
The second method looks at the line length and acts only if the length is not zero (i.e., greater than zero).
You can pick the approach that is most suitable for your data/needs.
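The two differ on whitespace-only lines, which have a length but no fields. A quick illustration:
$ printf 'a\n\n   \nb\n' | awk 'NF'
a
b
$ printf 'a\n\n   \nb\n' | awk 'length'
a
   
b
(The middle line kept by length is the three spaces, since its length is 3, while NF sees no fields there.)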
You could just add
/^\s*$/ {next;}
to the front of your script. That will match the blank lines and skip the rest of the awk matching rules. Putting it all together:
awk -v variable=$bashvariable '/^\s*$/ {next;} {print variable"\t single\t" $0"\t double"}' myinfile.c
Maybe you could try this out:
awk -v variable=$bashvariable '$0{print variable"\t single\t" $0"\t double"}' myinfile.c
Try this:
awk -v variable=$bashvariable '/^.+$/{print variable"\t single\t" $0"\t double"}' myinfile.c
I haven't seen this solution, so: awk '!/^\s*$/{print $1}' will run the block for all non-empty lines.
The \s metacharacter is not available in all awk implementations, but you can also write !/^[ \t]*$/.
https://www.gnu.org/software/gawk/manual/gawk.html
\s Matches any space character as defined by the current locale. Think of it as shorthand for ‘[[:space:]]’.
Based on Levon's answer, you may just add | awk 'length' to the end of the command.
So change
awk -v variable=$bashvariable '{ whatever }' myinfile.c
to
awk -v variable=$bashvariable '{ whatever }' myinfile.c | awk 'length'
In case this doesn't work, use | awk 'NF' instead.
Another awk way to trim out only the truly zero-length lines, while keeping ones that contain only spaces or tabs, is this:
awk 8 RS=
Just doing awk NF trims out both line 3 (zero length) and line 5 (spaces and tabs). Input:
1 abc
2 def
3
4 3591952
5
6 93253
awk NF output:
1 abc
2 def
3 3591952
4 93253
but the RS= approach keeps line 5 for you:
1 abc
2 def
3 3591952
4
5 93253
Note: lines containing only \013 (\v, VT), \014 (\f, FF) or \015 (\r, CR) characters aren't skipped by NF with the default FS=" ", despite these characters also belonging to POSIX [[:space:]].

Passing the value of NR to a variable in AWK

Can we pass NR to a variable in awk?
I have a script which goes like this:
awk -v { blah blah..
..........
count--
print count
}
if (count==0)
{print "The end of function"
print NR
exit
}
This is the awk part of the code. I want to pass NR to var2, to be used as:
sed -n ''"$var1"','"$var2"'p'
which has to be reused several times!
Thanks for your replies.
If you only want to print a certain subset of lines you're almost there. The -v flag is the way to go.
awk -v var1=15 -v var2=25 'NR>=var1 && NR<=var2 {blah blah ...}'
Of course you have to change 15 and 25 to what you need. Observe that the variables shouldn't be enclosed in quotes.
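For example, printing just a range of lines (seq here merely generates sample input):
$ seq 30 | awk -v var1=15 -v var2=18 'NR>=var1 && NR<=var2 {print}'
15
16
17
18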
As others have suggested, there are better ways to accomplish the overall goal.
However, in order to answer your specific question:
var2=$(awk 'END {print NR}' inputfile)
and add anything else you may need within the AWK script.
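var2 can then be reused in the sed command from the question; double quotes keep the quoting simpler than in the original (var1=5 is just a placeholder value here):
var1=5
var2=$(awk 'END {print NR}' inputfile)
sed -n "${var1},${var2}p" inputfile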
I don't know what you want to achieve with awk, sed and the NR variable. Do you mean the number of lines of the file?
This command gets it:
wc -l infile | sed -e 's/ .*$//'
So, use it with the -v switch to awk and use it as you want. The next command will print 10 because infile has ten lines on my computer.
awk -v num_lines=$(wc -l infile | sed -e 's/ .*$//') 'BEGIN { print num_lines }'