Grep pattern to output substring if line contains string - regex

I'd like grep (or awk or sed) to output word2 on its own line if word3 is 'nn1'. Each line in my tab-delimited source text file is
<number> <word1> <word2> <word3> <lots of junk>
Or do I need to do this in two passes - one to isolate the line, and one to pull out word2?
Any help gratefully received!

You can use awk:
awk '$4 == "nn1"{print $3}' file
Note: for a tab-delimited file, the above awk command works as-is, since awk treats both spaces and tabs as delimiters by default.
However if you want fields to be split only on tabs and NOT on spaces then use:
awk -F'\t' '$4 == "nn1"{print $3}' file

Awk is the tool for the job:
awk '$4 == "nn1" { print $3 }' file
If the fourth column is nn1, print the third.
By default, awk splits the line on any number of whitespace characters (tabs or spaces). Since you describe the fields as "word1", "word2", etc., I assume there are no spaces within a field, so the default behaviour should be fine. However, if you want to be explicit, you can specify the field separator yourself:
awk -F'\t' '$4 == "nn1" { print $3 }' file
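To see why the field separator can matter, here is a made-up tab-delimited sample where word2 itself contains a space (the file name sample.txt is just for illustration):

```shell
# Hypothetical sample: word2 is "big beta" (contains a space), columns are tab-separated
printf '1\talpha\tbig beta\tnn1\tjunk\n' > sample.txt

# Default whitespace splitting: the space inside "big beta" shifts the fields,
# so $4 becomes "beta" rather than "nn1" and nothing is printed
awk '$4 == "nn1" {print $3}' sample.txt

# Tab-only splitting keeps "big beta" together as $3 and prints it
awk -F'\t' '$4 == "nn1" {print $3}' sample.txt
```

With genuinely space-free fields both commands behave identically, which is why the default is usually fine.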


print the last letter of each word to make a string using `awk` command

I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
I am trying to print the last letter of each word to make a string, using this awk command:
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
In case I don't know how many characters a word contains, what is the correct command to print the last character of a column? And instead of repeating the substr command, how can I use it only once to print specific characters from different columns?
If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} - iterate over all fields in the current record, where i is the field ID and $i is the field value; the last char of each field (retrieved with substr($i,length($i))) is appended to the r variable
END{print r} prints the r variable once the awk script finishes processing.
In the second solution, r is cleared at the start of each line, and its value is printed after processing all fields in the current record.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s"
Output:
GMUCHOS
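The per-line variant (the second solution above) can be checked the same way on a made-up two-line sample:

```shell
# Each output line holds the last letters of that input line's words
printf 'ABC DE\nXY Z\n' |
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}'
# CE
# YZ
```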
Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS
1st solution: With GNU awk you could try the following awk program, written and tested with the shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: set the record separator to any character followed by spaces OR the end of the line. Then, as per the OP's requirement, remove the unnecessary newlines/spaces from the fetched value (RT); keep appending each match to val, and finally, when the awk program is done reading the whole Input_file, print the value of the variable.
2nd solution: Set the record separator to null (paragraph mode) and use the match function to repeatedly match the regex (.[[:space:]]+)|(.$), which picks out the last letter of each word; with each match found, keep adding the matched value to a variable, and at last print the variable's value in the END block of the awk program.
awk -v RS= '
{
while(match($0,/(.[[:space:]]+)|(.$)/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}
END{
gsub(/[[:space:]]+/,"",val)
print val
}
' Input_file
Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS
Using many tools:
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
separate the words onto their own lines, reverse each so that we can pick the first char easily, and finally paste them back together without a delimiter. Not the shortest solution, but I think the most straightforward one...
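To see what each stage contributes, here is the pipeline broken down on a shortened sample:

```shell
s='UDACBG UYAZAM DJSUBU'
printf '%s\n' "$s" | tr -s ' ' '\n'                           # one word per line
printf '%s\n' "$s" | tr -s ' ' '\n' | rev                     # each word reversed
printf '%s\n' "$s" | tr -s ' ' '\n' | rev | cut -c1           # first char = original last char
printf '%s\n' "$s" | tr -s ' ' '\n' | rev | cut -c1 | paste -sd'\0'   # GMU
```

Note that rev comes from util-linux and paste's '\0' delimiter means "no delimiter", so this relies on reasonably standard Linux userland tools.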
I would harness GNU AWK for this as follows, let file.txt content be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: Inform AWK to treat any single alphabetic character at the end of a word as a field, and use the empty string as the output field separator. $1=$1 is used to trigger line rebuilding with the specified OFS. If you want to know more about start/end of word, read GNU Regexp Operators.
(tested in gawk 4.2.1)
Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
gensub() gets here the characters and gsub() removes the spaces between them.
or using patsplit():
awk 'n=patsplit($0, a, /[[:alpha:]]\>/) { for (i in a) printf "%s", a[i]} i==n {print ""}' file
GMUCHOS
An alternate approach with GNU awk is to use FPAT to split by and keep the content:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or more tersely and idiomatic:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage here of both is that single letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
Where the accepted answer and most here do not:
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG: length($1) uses the first field's length for every field
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG: the pattern needs at least two characters per word, so "X" and "Z" pass through untouched

Regexp to catch the string between the first and second commas, where there's an alphabetical character in the number

First, I must mention my native language is French, so I may make English mistakes!
I am trying to use sed to catch and delete the lines where the second item in a CSV file contains characters other than numbers.
Here is an example of an OK line:
2323421,9781550431209,,2012-07-24 13:30:57,False,2012-07-01 00:00:00,False,118,,1,246501
A line that must be deleted:
1901461,3002CAN,,2010-09-29 13:46:59,True,,True,,,,
or
2977837,9782/76132396,,2015-04-27 10:14:47,True,2015-04-26 00:00:00,True,,,,
etc...
I'm not sure this is possible to be honest!
Thank you !
Here it is using sed
sed -e '/^[^,]*,[^,]*[^0-9,]/d'
A breakdown of the pattern:
^ Start of line
[^,]*, Everything up to the first comma inclusive
[^,]* Any number of characters which aren't commas (the start of the second field)
[^0-9,] A character which is neither a digit nor a comma
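Running the pattern over the three sample lines from the question confirms that only the all-digit line survives:

```shell
# Only the line whose second field is purely numeric is kept
printf '%s\n' \
  '2323421,9781550431209,,2012-07-24 13:30:57,False,2012-07-01 00:00:00,False,118,,1,246501' \
  '1901461,3002CAN,,2010-09-29 13:46:59,True,,True,,,,' \
  '2977837,9782/76132396,,2015-04-27 10:14:47,True,2015-04-26 00:00:00,True,,,,' |
sed -e '/^[^,]*,[^,]*[^0-9,]/d'
```

The "C" in 3002CAN and the "/" in 9782/76132396 each match [^0-9,], so those two lines are deleted.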
Using awk you can do this:
awk -F, '$2 ~ /^[[:digit:]]+$/' file
Or (thanks to #ghoti):
awk -F, '$2 !~ /[^[:digit:]]/' file
to get only those lines where the 2nd column is an integer.
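One subtle difference between the two awk forms, worth knowing: on an empty second field, $2 ~ /^[[:digit:]]+$/ is false (it requires at least one digit), while $2 !~ /[^[:digit:]]/ is true (an empty string contains no non-digit). A made-up sample (nums.csv is hypothetical):

```shell
printf '1,123,a\n2,4x5,b\n3,,c\n' > nums.csv

awk -F, '$2 ~ /^[[:digit:]]+$/' nums.csv   # prints only "1,123,a"
awk -F, '$2 !~ /[^[:digit:]]/' nums.csv    # prints "1,123,a" and "3,,c"
```

Pick whichever behaviour you want for empty fields.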
Or using sed you can do:
sed -i.bak '/^[^,]*,[[:digit:]]*[^,[:digit:]]/d' file
Perl:
perl -F, -lane 'print if $F[1] =~ /^\d+$/' file
-a autosplits the line into the array @F; field indices start at 0
-F, splits the line using commas
print the line only if field 1 contains only digits: /^\d+$/

Including a line break in a sed replacement

I've got the following sed replacement, which replaces an entire line with different text, if a certain string is found in the line.
sed "s/.*FOUNDSTRING.*/replacement \"text\" for line"/ test.txt
This works fine - but, for example I want to add a new line after 'for'. My initial thought was to try this:
sed "s/.*FOUNDSTRING.*/replacement \"text\" for \n line"/ test.txt
But this ends up replacing with the following:
replacement "text" for n line
Desired outcome:
replacement "text" for
line
It can be painful to work with newlines in sed. There are also some differences in the behaviour depending on which version you're using. For simplicity and portability, I'd recommend using awk:
awk '{print /FOUNDSTRING/ ? "replacement \"text\" for\nline" : $0}' file
This prints the replacement string, newline included, if the pattern matches, otherwise prints the line $0.
Testing it out:
$ cat file
blah
blah FOUNDSTRING blah
blah
$ awk '{print /FOUNDSTRING/ ? "replacement \"text\" for\nline" : $0}' file
blah
replacement "text" for
line
blah
Of course there is more than one way to skin the cat, as shown in the comments:
awk '/FOUNDSTRING/ { $0 = "replacement \"text\" for \n line" } 1' file
This replaces the line $0 with the new string when the pattern matches and uses the common shorthand 1 to print each line.
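Testing this variant the same way; note that in an awk string "\n" really is a newline, and the spaces around it in the replacement are preserved in the output:

```shell
printf 'blah\nblah FOUNDSTRING blah\nblah\n' |
awk '/FOUNDSTRING/ { $0 = "replacement \"text\" for \n line" } 1'
```

This produces the same four lines as the first test, except "for" carries a trailing space and "line" a leading one, coming from the spaces around \n in the string.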

sed replace character only between two known strings

Is it possible to replace a character between two known strings only? I have a number of files in the format
title.header.index.subtitle.goes.here.footer
I can pick out the "subtitle.goes.here" with pattern matching between the index (which I need to backreference) and a footer (which is constant), but I then want to replace the period/dot character with an underscore, to give me
title.header.index.subtitle_goes_here.footer
So from input such as
title.header.01.the.first.subtitle.is.here.footer
I want to end up with
title.header.01.the_first_subtitle_is_here.footer
What I have so far is useless, but a start:
sed -r 's/([0-9][0-9]\.)([a-z]*\.*)*footer/\1footer/g'
But this is removing the entire subtitle and footer before manually adding it back in and has plenty of other flaws I'm sure. Any help would be much appreciated.
This might work for you:
echo "title.header.01.the.first.subtitle.is.here.footer" |
sed 's/\./_/4g;s/.\(footer\)/.\1/'
title.header.01.the_first_subtitle_is_here.footer
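The 4g flag here is a GNU sed extension: substitute every match from the fourth one onward. In isolation:

```shell
# GNU sed only: "4g" replaces from the 4th match onward, leaving the first 3 dots alone
echo 'a.b.c.d.e.f' | sed 's/\./_/4g'    # a.b.c.d_e_f
```

In the full command, 4g also turns the dot before footer into an underscore, which is why the second substitution s/.\(footer\)/.\1/ is needed to put that last dot back.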
An ugly alternative:
sed 'h;s/\([0-9][0-9]\.\).*\(\.footer\)/\1\n\2/;x;s/.*[0-9][0-9]\.\(.*\).footer/\1/;s/\./_/g;x;G;s/\(\n\)\(.*\)\1\(.*\)/\3\2/' file
If you are open to an awk solution then this might help -
awk '
{for (i=1;i<=NF;i++) if (i!=NF) {printf (3<i && i<(NF-1))?$i"_":$i"."} print $NF}
' FS='.' OFS='.' file
Input File:
[jaypal:~/Temp] cat file
title.header.index.subtitle.goes.here.footer
title.header.01.the.first.subtitle.is.here.footer
Test:
[jaypal:~/Temp] awk '
{for (i=1;i<=NF;i++) if (i!=NF) {printf (3<i && i<(NF-1))?$i"_":$i"."} print $NF}
' FS='.' OFS='.' file
title.header.index.subtitle_goes_here.footer
title.header.01.the_first_subtitle_is_here.footer

AWK Matching Positive and Negative Numbers

I have a data that looks like this:
-1033
-
222
100
-30
-
10
What I want to do is to capture all the numbers, excluding the "dash only" entries.
Why does my awk below fail?
awk '$4 != "-" {print $4}'
Your awk script says
If the fourth field is not a dash, print it out
However, you want to print it out if the line is not a dash
awk '$0 != "-"'
The default action is to print, so no body is needed.
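A quick check with the sample data (trailing spaces removed): the dash-only lines are dropped while negative numbers survive, because only a line that is exactly "-" fails the condition:

```shell
printf -- '-1033\n-\n222\n100\n-30\n-\n10\n' | awk '$0 != "-"'
# -1033
# 222
# 100
# -30
# 10
```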
If you want to print groups of numbers, you can use a GNU awk extension if you use gawk. It allows splitting records using regular expressions:
gawk 'BEGIN { RS="(^|\n)-($|\n)" } { print "Numbers:\n" $0 }'
Now, instead of lines, it takes a group of numbers separated by a line containing only -. Setting the field separator (FS) to a newline allows you to iterate over the numbers within such a group:
gawk 'BEGIN { FS="\n"; RS="(^|\n)-($|\n)" }
{ print "Numbers:"; for(i=1;i<=NF;i++) print " *: " $i }'
However I agree with other answers. If you just want to filter out lines matching some text, grep is the better tool for that.
Why are you checking $4? It appears you should check $1 or $0 as litb said.
But awk is a heavyweight tool for this job. Try
grep -v '^-$'
To remove lines containing only a dash or
grep -v '^ *- *$'
To remove lines containing only a dash and possibly some space characters.
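For example, on a made-up variant of the data where one dash carries a trailing space:

```shell
# "- " (dash plus space) and "-" are both removed; numbers pass through
printf -- '-1033\n- \n222\n-\n10\n' | grep -v '^ *- *$'
# -1033
# 222
# 10
```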
Assuming that your data file is actually multi-column, and that the values are in column 4, the following will work:
awk '$4 != "-" {print $4}'
It prints the value only where it isn't "-". Note that because the rule has an explicit action, awk's default print never fires in addition to it. On your one-column data, though, $4 is empty on every line, so the comparison is always true and the command prints an empty line for each input line - which is why your version failed.
If the data is actually as shown (one column only), you should be using $1 rather than $4 - I wouldn't use $0 since that's the whole line and it appears you have spaces at the end of your first two lines which would cause $0 to be "-1033 " and "- ".
But, if it were a single column, I wouldn't use awk at all but rather:
grep -v '^-$'
grep -v '^ *- *$'
the second allowing for spaces on either side of the "-" character.