Sed replace string, based on input file - regex

Wanting to see if there is a better/quicker way to do this.
Basically, I have a file and I need to add some more information to it, based on one of its fields, e.g.
File to edit:
USER|ROLE
user1|role1
user1|role2
user2|role1
user2|role11
Input File:
Role|Application
role1|applicationabc
role2|application_qwerty
role3|application_new_app_new
role4|qwerty_abc_123
role11|applicationabc123
By the end, I want to be left with something like this:
USER|ROLE|Application
user1|role1|applicationabc
user1|role2|application_qwerty
user2|role1|applicationabc
user2|role11|applicationabc123
My idea:
cat inputfile | while IFS='|' read src rep
do
sed -i "s#\<$src\>#$src\|$rep#" /path/to/file/filename.csv
done
What I've written works to an extent, but it is very slow. Also, if it finds a match anywhere in the line, it will replace it. For example, for user2 and role11, the script would match role1 before it matches role11.
So my questions are:
Is there a quicker way to do this?
Is there a way to match against the exact expression/string? Putting quotes in my input file doesn't seem to work.

With join:
join -i -t "|" -1 2 -2 1 <(sort -t '|' -k2b,2 file) <(sort -t '|' -k 1b,1 input)
From the join manpage:
Important: FILE1 and FILE2 must be sorted on the join fields.
That's why we need to sort the two files first: file on its second field and input on its first.
Then join joins the two files on those fields (-1 2 -2 1). The output would then be:
ROLE|USER|Application
role1|user1|applicationabc
role1|user2|applicationabc
role11|user2|applicationabc123
role2|user1|application_qwerty
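If you want to restore the question's USER|ROLE|Application column order, join's -o option selects and orders the output fields using FILE.FIELD specifiers; a sketch of the same command (untested):
join -i -t '|' -1 2 -2 1 -o 1.1,1.2,2.2 <(sort -t '|' -k2b,2 file) <(sort -t '|' -k1b,1 input)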

Piece of cake with awk:
$ cat file1
USER|ROLE
user1|role1
user1|role2
user2|role1
user2|role11
$ cat file2
ROLE|Application
role1|applicationabc
role2|application_qwerty
role3|application_new_app_new
role4|qwerty_abc_123
role11|applicationabc123
$ awk -F'\\|' 'NR==FNR{a[$1]=$2; next}; {print $0 "|" a[$2]}' file2 file1
USER|ROLE|Application
user1|role1|applicationabc
user1|role2|application_qwerty
user2|role1|applicationabc
user2|role11|applicationabc123
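With the sample data every role has a mapping, but a role missing from file2 would get an empty trailing field. If you'd rather drop such rows entirely, a guarded variant (a sketch, untested) would be:
$ awk -F'\\|' 'NR==FNR{a[$1]=$2; next}; $2 in a {print $0 "|" a[$2]}' file2 file1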

Please try the following:
awk 'FNR==NR{A[$1]=$2;next}s=$2 in A{ $3=A[$2] }s' FS='|' OFS='|' file2 file1
or:
awk 'FNR==NR{A[$1]=$2;next} $3 = $2 in A ? A[$2] : 0' FS='|' OFS='|' file2 file1
Explanation
awk '
# FNR==NR is true only while awk is reading the first file
FNR==NR{
    # build array A where index = field1 ($1) and value = field2 ($2)
    A[$1]=$2
    # stop processing this line and go to the next one
    next
}
# From here on we are reading the second file (file1 in your case).
# "var in array" returns either 1 (true) or 0 (false):
# if array A has an index equal to field2 ($2), s is set to 1, otherwise 0.
# Whenever s is 1 (true), we create a new field $3 whose value is the
# array element indexed by field2.
s=$2 in A{
    $3=A[$2]
}s
# An awk program is a series of condition-action pairs,
# conditions being outside of curly braces and actions being enclosed in them.
# A condition is considered false if it evaluates to zero or the empty string;
# anything else is true (uninitialized variables are zero or the empty string,
# depending on context, so they are false).
# Either the condition or the action can be omitted:
# braces without a condition are always executed when they are hit,
# and a condition without an action prints the line
# if and only if the condition is met.
# So the trailing s at the end of the script
# executes the default action, printing the line
# (which may have been modified by the action in braces above)
# whenever s is 1, i.e. true.
# FS = input field separator
# OFS = output field separator
' FS='|' OFS='|' file2 file1
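On the sample files both variants print only the rows whose role exists in file2; since every role here matches, the expected output (a sketch, untested) is:
USER|ROLE|Application
user1|role1|applicationabc
user1|role2|application_qwerty
user2|role1|applicationabc
user2|role11|applicationabc123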

Related

Replace a block of text

I have a file in this pattern:
Some text
---
## [Unreleased]
More text here
I need to replace the text between '---' and '## [Unreleased]' with something else in a shell script.
How can it be achieved using sed or awk?
Perl to the rescue!
perl -lne 'my @replacement = ("First line", "Second line");
    if ($p = (/^---$/ .. /^## \[Unreleased\]/)) {
        print $replacement[$p-1];
    } else { print }'
The flip-flop operator .. tells you whether you're between the two strings, moreover, it returns the line number relative to the range.
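On the question's sample file the range covers exactly the two marker lines, so $p is 1 on the --- line and 2 on the ## [Unreleased] line; the expected output (a sketch) is:
Some text
First line
Second line
More text here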
This might work for you (GNU sed):
sed '/^---/,/^## \[Unreleased\]/c\something else' file
Change the lines between the two regexps to the required string.
This example may help you.
$ cat f
Some text
---
## [Unreleased]
More text here
$ seq 1 5 >mydata.txt
$ cat mydata.txt
1
2
3
4
5
$ awk '/^---/{f=1; while(getline < c)print;close(c);next}/^## \[Unreleased\]/{f=0;next}!f' c="mydata.txt" f
Some text
1
2
3
4
5
More text here
awk -v RS="\0" 'gsub(/---\n\n## \[Unreleased\]\n/,"something")+1' file
Give this line a try.
An awk solution that:
is portable (POSIX-compliant).
can deal with any number of lines between the start line and the end line of the block, and potentially with multiple blocks (although they'd all be replaced with the same text).
reads the file line by line (as opposed to reading the entire file at once).
awk -v new='something else' '
/^---$/ { f=1; next } # Block start: set flag, skip line
f && /^## \[Unreleased\]$/ { f=0; print new; next } # Block end: unset flag, print new txt
! f # Print line, if before or after block
' file
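Applied to the question's sample file, this should produce (a sketch, untested):
Some text
something else
More text here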

How to replace a text sequence that includes "\n" in a text file

This may sound like a duplicate, but I can't make this work.
Consider:
_ = space
- = minus sign
particle_little.csv is a file of this form:
waste line to be deleted
__data__data__data
_-data__data_-data
__data_-data__data
I need to get a standard csv format in particle_std.csv, like this:
data,data,data
-data,data,-data
data,-data,data
I am trying to use tail and tr to do that conversion; here I split the command:
tail -n +2 particle_little.csv to delete the first line
| tr -s ' ' to remove duplicated spaces
| tr '/\b\n \b/' '\n' to delete the very beginning space
| tr ' ' ',' to change spaces for commas
> particle_std.csv to put it in a output file
But I get this (without the 4th step):
data
data
data
-data
...
Finally, the file is huge, so it is almost impossible to open in editors (I know there are super editors that maybe can).
I would suggest that you used awk:
$ cat file
waste line to be deleted
data data data
-data data -data
data -data data
$ awk -v OFS=, '{ $1 = $1 } NR > 1' file
data,data,data
-data,data,-data
data,-data,data
The script sets the output field separator OFS to , and reassigns the first field to itself $1 = $1, causing awk to touch each line (and replace the spaces with commas). Lines after the first, where NR > 1, are printed (the default action is to print the line).
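Using the question's file names, the whole step could then be written as (same idea, untested):
awk -v OFS=, '{ $1 = $1 } NR > 1' particle_little.csv > particle_std.csv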
So if I'm reading you right - ignore lines that don't start with whitespace. Comma separate everything else.
I'd suggest perl:
perl -lane 'next unless /^\s/; print join ",", @F'
This, when given:
waste line to be deleted
data data data
-data data -data
data -data data
On STDIN (Or specified in a filename) outputs:
data,data,data
-data,data,-data
data,-data,data
This is because:
-l strips linefeeds (and replaces them after each print);
-a autosplits on any whitespace
-n wraps it in a while (<>) { ... } loop which iterates line by line; functionally it means it works just like sed/grep/tr and reads STDIN or files specified as args.
-e allows specifying a perl snippet.
In this case:
skip any lines that don't start with \s, i.e. any whitespace character.
for any other line, join the fields (@F, generated by -a) with , as the delimiter. (This auto-appends a linefeed because of -l.)
Then you can either redirect the output to a file (>output.csv) or use -i.bak to edit inplace.
You should probably use sed or awk for this:
sed -e 1d -e 's/^ *//' -e 's/  */,/g'
One way to do it in Awk is:
awk 'NR == 1 { next }
{ pad=""; for (i = 1; i <= NF; i++) { printf "%s%s", pad, $i; pad="," } print "" }'
but there's a better way to do it in Awk:
awk 'BEGIN { OFS=","} NR == 1 { next } { $1 = $1; print }' data
The BEGIN block sets the output field separator; the assignment $1 = $1; forces Awk to rework the output line; the print prints it.
I've left the first Awk version around because it shows there's more than one way to do it, and in some circumstances, such methods can be useful. But for this task, the second Awk version is better — simpler, more compact (and isomorphic with Tom Fenech's answer).

get the last word in body of text

Given a body of text that can span a varying number of lines, I need to use a grep, sed or awk solution to search through many files for the same pattern and get the last word in the body.
A file can include formats such as these, where the word I want can be named anything:
call function1(input1,
input2, #comment
input3) #comment
returning randomname1,
randomname2,
success3
call function1(input1,
input2,
input3)
returning randomname3,
randomname2,
randomname3
call function1(input1,
input2,
input3)
returning anothername3,
randomname2, anothername3
I need to print out results as
success3
randomname3
anothername3
I also need the filename and line number information for each match.
I've tried
pcregrep -M 'function1.*(\s*.*){6}(\w+)$' filename.txt
which is too greedy, and I still need to print out just the specific grouped value and not the whole pattern. The words function1 and returning in my sample code will always be named like this and can be hard-coded within my expression.
Last word of code blocks
Split the file into blocks using awk's record separator RS. A record will be defined as a block of text; records are separated by double newlines.
A record consists of fields; two consecutive fields are separated by whitespace or a single newline.
Now all we have to do is print the last field of each record, resulting in the following code:
awk 'BEGIN{ FS="[\n\t ]"; RS="\n\n"} { print $NF }' file
Explanation:
FS is the field separator and is set to either a newline, a tab or a space: [\n\t ].
RS is the record separator and is set to a double newline: \n\n.
print $NF prints the field with index NF, a variable containing the number of fields; hence this prints the last field.
Note: To capture all paragraphs, the file should end in a double newline; this can easily be achieved by preprocessing the file using: $ echo -e '\n\n' >> file.
Alternate solution based on comments
A more elegant and simpler solution is as follows:
awk -v RS='' '{ print $NF }' file
How about the following awk solution:
awk 'NF == 0 {if(last) print last; last=""} NF > 0 {last=$NF} END {print last}' file
Here $NF gets the value of the last "word", where NF stands for the number of fields. The last variable always stores the last word on a line, and it is printed whenever an empty line is encountered, representing the end of a paragraph.
A new version that adds the function1 matching condition:
awk 'NF == 0 {if(last && hasF) print last; last=hasF=""}
NF > 0 {last=$NF; if(/function1/)hasF=1}
END {if(hasF) print last}' filename.txt
This will produce the output you show from the input file you posted:
$ awk -v RS= '{print $NF}' file
success3
randomname3
anothername3
If you want to print FILENAME and line number like you mention then this may be what you want:
$ cat tst.awk
NF { nr=NR; last=$NF; next }
{ prt() }
END { prt() }
function prt() { if (nr) print FILENAME, nr, last; nr=0 }
$ awk -f tst.awk file
file 6 success3
file 13 randomname3
file 20 anothername3
If that doesn't do what you want, edit your question to provide clearer, more truly representative and accurate sample input and expected output.
This is the perl version of Shellfish's awk solution (plus the keywords):
perl -00 -nE '/function1/ and /returning/ and say ((split)[-1])' file
or, with one regex:
perl -00 -nE '/^(?=.*function1)(?=.*returning).*?(\S+)\s*$/s and say $1' file
But the key is the -00 option which reads the file a paragraph at a time.
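On the sample input, either perl command should print the same three words as the awk answers above (a sketch):
success3
randomname3
anothername3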

Remove matching and previous line

I need to remove a line containing "not a dynamic executable" and the previous line from a stream using grep, awk, sed or something else. My current working solution is to tr the entire stream to strip off newlines, then replace the newline preceding my match with something else using sed, then use tr to add the newlines back in, and then use grep -v. I'm somewhat wary of artifacts with this approach, but I don't see how else I can do it at the moment:
tr '\n' '|' | sed 's/|\tnot a dynamic executable/__MY_REMOVE/g' | tr '|' '\n'
EDIT:
Input is a list of mixed files piped to xargs ldd; basically I want to ignore all output about non-library files, since that has nothing to do with what I'm doing next. I didn't want to use a lib*.so mask, since that could conceivably be different.
Most simply with pcregrep in multi-line mode:
pcregrep -vM '\n\tnot a dynamic executable' filename
If pcregrep is not available to you, then awk or sed can also do this by reading one line ahead and skipping the printing of previous lines when a marker line appears.
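Note that pcregrep also reads standard input, so it can sit directly in the question's pipeline; a sketch, where file1 and file2 stand in for the real file list:
printf '%s\n' file1 file2 | xargs ldd | pcregrep -vM '\n\tnot a dynamic executable'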
You could be boring (and sensible) with awk:
awk '/^\tnot a dynamic executable/ { flag = 1; next } !flag && NR > 1 { print lastline; } { flag = 0; lastline = $0 } END { if(!flag) print }' filename
That is:
/^\tnot a dynamic executable/ { # in lines that start with the marker
flag = 1 # set a flag
next # and do nothing (do not print the last line)
}
!flag && NR > 1 { # if the last line was not flagged and
# is not the first line
print lastline # print it
}
{ # and if you got this far,
flag = 0 # unset the flag
lastline = $0 # and remember the line to be possibly
# printed.
}
END { # in the end
if(!flag) print # print the last line if it was not flagged
}
But sed is fun:
sed ':a; $! { N; /\n\tnot a dynamic executable/ d; P; s/.*\n//; ba }' filename
Explanation:
:a # jump label
$! { # unless we reached the end of the input:
N # fetch the next line, append it
/\n\tnot a dynamic executable/ d # if the result contains a newline followed
# by "\tnot a dynamic executable", discard
# the pattern space and start at the top
# with the next line. This effectively
# removes the matching line and the one
# before it from the output.
# Otherwise:
P # print the pattern space up to the newline
s/.*\n// # remove the stuff we just printed from
# the pattern space, so that only the
# second line is in it
ba # and go to a
}
# and at the end, drop off here to print
# the last line (unless it was discarded).
Or, if the file is small enough to be completely stored in memory:
sed ':a $!{N;ba}; s/[^\n]*\n\tnot a dynamic executable[^\n]*\n//g' filename
Where
:a $!{ N; ba } # read the whole file into
# the pattern space
s/[^\n]*\n\tnot a dynamic executable[^\n]*\n//g # and cut out the offending bit.
This might work for you (GNU sed):
sed 'N;/\n.*not a dynamic executable/d;P;D' file
This keeps a moving window of two lines and deletes them both if the desired string is found in the second. If not, the first line is printed and deleted, the next line is appended, and the process is repeated.
Always keep in mind that while grep and sed are line-oriented, awk is record-oriented and so can easily handle problems that span multiple lines.
It's a guess, since you didn't post any sample input and expected output, but it sounds like all you need is (using GNU awk for multi-char RS):
awk -v RS='^$' -v ORS= '{gsub(/[^\n]+\n\tnot a dynamic executable/,"")}1' file
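As a concrete illustration, here is a simplified, hypothetical ldd-style listing (real ldd output also prints load addresses; the indented lines start with a tab) and what any of the above should leave behind (a sketch, untested):
$ cat sample
/usr/lib/libfoo.so:
	libc.so.6 => /lib/libc.so.6
/usr/bin/script.sh:
	not a dynamic executable
$ sed 'N;/\n.*not a dynamic executable/d;P;D' sample
/usr/lib/libfoo.so:
	libc.so.6 => /lib/libc.so.6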

Skipping line and regex in awk [duplicate]

I'm working with awk and I need to skip lines that are blank or comments. I've been trying, inside the block, to check whether the line matches the regex for this and then using next:
{if ($0 ~"/^($|#)/" ) {next;}}
but the if statement is never getting hit and I can't figure out why. (My input has blank lines and comments.)
I need to add this line inside the block of an awk script, not as a command-line argument.
Assuming you're inside a block of awk code that doesn't benefit from the default print of matching patterns, and you need to use an if test, here is the basis for a solution:
$ echo "a
b
c
d
#
#e
f
" | awk '{if ($0 ~ /^(#|$)/ ) {next;} ;print}'
produces output of
a
b
c
d
f
If you want to skip blank lines that contain spaces/tabs, you can use:
awk '{if ($0 ~ /^(#|[ \t]*$)/ ) {next;} ;print}'
#-------------------^^^^^^
# [ \t] is a character class matching a space or a tab character
# * means zero or more of the preceding
IHTH
In awk, a regular expression is marked by the beginning and ending slashes. If you place it inside quotes, it ceases to be a regex and becomes a string. Thus, replace:
{if ($0 ~"/^($|#)/" ) {next;}}
With:
{if ($0 ~ /^($|#)/ ) {next;}}
Example
Consider the input file:
$ cat input
one
#comment
two
three
four
Now observe the awk script:
$ awk '{if ($0 ~ /^($|#)/ ) {next;}} 1' input
one
two
three
four
You can use the following one:
awk '! /^($|#)/' infile
It relies on the default action of print for each line that doesn't begin with # and isn't blank.
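Run against the same input file as above, it produces the same result:
$ awk '! /^($|#)/' input
one
two
three
four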
awk '/^($|#)/{next} {print $0}'
would do the job
or
more simply
awk '/^[^$#]/ '
What does it do?
/^[^$#]/ matches each line to the regex and if a match is found, the default action to print the entire record is done.
^ anchors the regex at the beginning of the line.
[^$#] is a negated character class: the first character of the line must be something other than $ or #. A line starting with # (a comment) therefore doesn't match, and an empty line has no first character at all, so it doesn't match either; in both cases the line is skipped.
e.g.:
$ cat input
hello
#world
this is a
test
$ awk '/^[^$#]/ ' input
hello
this is a
test