extract all values for specific key from space delimited text file - regex

have a text file in the format
1=23 2=44 15=17:31:37.640 5=abc 15=17:31:37.641 4=23 15=17:31:37.643 15=17:31:37.643
I need a regex to extract all the values for key 15 for a multiline text file
output should be
17:31:37.640 17:31:37.641 17:31:37.643 17:31:37.643
Sorry, I should have stated that the values I'm trying to extract are timestamps in the form 17:31:37.643

You can use GNU grep to extract the substrings.
grep -Po '\b15=\K\S+' | tr '\n' ' '
-P option interprets the pattern as a Perl regular expression.
-o option shows only the matching part that matches the pattern.
\K throws away everything that it has matched up to that point.
Output
17:31:37.640 17:31:37.641 17:31:37.643 17:31:37.643

You can use sed:
sed 's/15=\([^ ]*\)/\1/g;s/[0-9]\+[^ ]\+ //g' input.file
Gave that answer before OP added the expected output, it will work too, but adds a new line after every value:
If you have GNU grep, you can use a lookbehind assertion that comes with perl compatible regex mode:
grep -oP '(?<=15=)[^ ]*' <<< '1=23 2=44 15=xyz 5=abc 15=yyy 4=23 15=omnet 15=that'
Output:
xyz
yyy
omnet
that

Using awk:
awk -F'=' -v RS=' ' -v ORS=' ' '$1==15 { print $2 }' file
xyz yyy omnet that
Set the Input and Output Record Separator to space and Input Field Separator to =. Test the condition of column1 to be 15. If that is true, print the second column.
As suggested by Ed Morton in the comments, this would leave a trailing blank char or even an absent newline. If thats a concern, you can use the following using GNU awk for multi-char RS.
gawk -F'=' -v RS='[[:space:]]+' '$1==15{ printf "%s%s", (c++?OFS:""), $2 } END{print ""}' file

Related

print the last letter of each word to make a string using `awk` command

I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
i am trying to print the last letter of each word to make a string using awk command
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
In case I don't know how many characters a word contains, what is the correct command to print the last character of $column, and instead of the repeding substr command, how can I use it only once to print specific characters in different columns
If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i)) - iterate over all fields in the current record, i is the field ID, $i is the field value, and all last chars of each field (retrieved with substr($i,length($i))) are appended to r variable
END{print r} prints the r variable once awk script finishes processing.
In the second solution, r value is cleared upon each line processing start, and its value is printed after processing all fields in the current record.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s"
Output:
GMUCHOS
Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS
1st solution: With GNU awk you could try following awk program, written and tested eith shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: Set record separator as any character followed by space OR end of value/line. Then as per OP's requirement remove unnecessary newline/spaces from fetched value; keep on creating val which has matched value of RS, finally when awk program is done with reading whole Input_file print the value of variable then.
2nd solution: Using record separator as null and using match function on values to match regex (.[[:space:]]+)|(.$) to get last letter values only with each match found, keep adding matched values into a variable and at last in END block of awk program print variable's value.
awk -v RS= '
{
while(match($0,/(.[[:space:]]+)|(.$)/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}
END{
gsub(/[[:space:]]+/,"",val)
print val
}
' Input_file
Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS
using many tools
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
separate the words to lines, reverse so that we can pick the first char easily, and finally paste them back together without a delimiter. Not the shortest solution but I think the most trivial one...
I would harness GNU AWK for this as follows, let file.txt content be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: Inform AWK to treat any alphabetic character at end of word and use empty string as output field seperator. $1=$1 is used to trigger line rebuilding with usage of specified OFS. If you want to know more about start/end of word read GNU Regexp Operators.
(tested in gawk 4.2.1)
Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
gensub() gets here the characters and gsub() removes the spaces between them.
or using patsplit():
awk 'n=patsplit($0, a, /[[:alpha:]]\>/) { for (i in a) printf "%s", a[i]} i==n {print ""}' file
GMUCHOS
An alternate approach with GNU awk is to use FPAT to split by and keep the content:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or more tersely and idiomatic:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage here of both is that single letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
Where the accepted answer and most here do not:
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG

Using protected wildcard character in awk field separator doesn't work

I have a file that contains paragraphs separated by lines of *(any amount). When I use egrep with the regex of '^\*+$' it works as intended, only displaying the lines that contain only stars.
However, when I use the same expression in awk -F or awk FS, it doesn't work and just prints out the whole document, excluding the lines of stars.
Commands that I tried so far:
awk -F'^\*+$' '{print $1, $2}' msgs
awk -F'/^\*+$/' '{print $1, $2}' msgs
awk 'BEGIN{ FS="/^\*+$/" } ; { print $1,$2 }' msgs
Printing the first field always prints out the whole document, using the first version it excludes the lines with the stars, other versions include everything from the file.
Example input:
Par1 test teststsdsfsfdsf
fdsfdsfdsftesyt
fdsfdsfdsf fddsteste345sdfs
***
Par2 dsadawe232343a5edsfe
43s4esfsd s45s45e4t rfgsd45
***
Par3 dsadasd
fasfasf53sdf sfdsf s45 sdfs
dfsf dsf
***
Par4 dasdasda r3ar d afa fs
ds fgdsfgsdfaser ar53d f
***
Par 5 dasdawr3r35a
fsada35awfds46 s46 sdfsds5 34sdf
***
Expected output for print $1:
Par1 test teststsdsfsfdsf fdsfdsfdsftesyt fdsfdsfdsf fddsteste345sdfs
EDIT: Added example input and expected output
Strings used as regexps in awk are parsed twice:
to turn them into a regexp, and
to use them as a regexp.
So if you want to use a string as a regexp (including any time you assign a Field Separator or Record Separator as a regexp) then you need to double any escapes as each iteration of parsing will consume one of them. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps for details.
Good (a literal/constant regexp):
$ echo 'a(b)c' | awk '$0 ~ /\(b)/'
a(b)c
Bad (a poorly-written dynamic/computed regexp):
$ echo 'a(b)c' | awk '$0 ~ "\(b)"'
awk: cmd. line:1: warning: escape sequence `\(' treated as plain `('
a(b)c
Good (a well-written dynamic/computed regexp):
$ echo 'a(b)c' | awk '$0 ~ "\\(b)"'
a(b)c
but IMHO if you're having to double escapes to make a char literal then it's clearer to use a bracket expression instead:
$ echo 'a(b)c' | awk '$0 ~ "[(]b)"'
a(b)c
Also, ^ in a regexp means "start of string" which is only matched at the start of all the input, just like $ would only be matched at the end of all of the output. ^ does not mean "start of line" as some documents/scripts may lead you to believe. It only appears to mean that in grep and sed because they are line-oriented and so usually the script is being compared to 1 line at a time, but awk isnt line-oriented, it's record-oriented and so the input being compared to the regexp isn't necessarily just a line (the same is true in sed if you read multiple lines into its hold space).
So to match a line of *s as a Record Separator (RS) assuming you're using gawk or some other awk that can treat a multi-char RS as a regexp, you'd have to write this regexp:
(^|\n)[*]+(\n|$)
but be aware that also matches the newlines before the first and after the last *s on the target lines so you need to handle that appropriately in your code.
It seems like this is what you're really trying to do:
$ awk -v RS='(^|\n)[*]+(\n|$)' 'NR==1{$1=$1; print}' file
Par1 test teststsdsfsfdsf fdsfdsfdsftesyt fdsfdsfdsf fddsteste345sdfs

Print the line matching 'pattern' string, excluding the 'pattern'

I have the following lines in a text file 'file.txt'
String1 ABCDEFGHIJKL
String2 DCEGIJKLQMAB
I want to print the characters corresponding to 'String1' in another text file 'text.txt' like this
ABCDEFGHIJKL
Here, I don't want to use any line numbers. Any suggestions using 'sed' command?. I tried with between 'string 1' and 'string 2', but couldn't obtain command excluding 'string1'. This following code for excluding only 'string2'.
sed -n '/^string1/,/^string2/{p;/^string2/q}' file.txt | sed '$d' > text.txt
awk '$1=="String1" { print $2 }' file.txt > text.txt
Where the first space delimited field equals "String1", print the second field. Redirect the output to text.txt.
Use GNU grep:
grep -Po 'String1\s+\K.*' in_file
Here, grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
\K : Cause the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.
SEE ALSO:
grep manual
perlre - Perl regular expressions

Grep value between strings with regex

$ acpi
Battery 0: Charging, 18%, 01:37:09 until charged
How to grep the battery level value without percentage character (18)?
This should do it but I'm getting an empty result:
acpi | grep -e '(?<=, )(.*)(?=%)'
Your regex is correct but will work with experimental -P or perl mode regex option in gnu grep. You will also need -o to show only matching text.
Correct command would be:
grep -oP '(?<=, )\d+(?=%)'
However, if you don't have gnu grep then you can also use sed like this:
sed -nE 's/.*, ([0-9]+)%.*/\1/p' file
18
Could you please try following, written and tested in link https://ideone.com/nzSGKs
your_command | awk 'match($0,/Charging, [0-9]+%/){print substr($0,RSTART+10,RLENGTH-11)}'
Explanation: Adding detailed explanation for above only for explanation purposes.
your_command | ##Running OP command and passing its output to awk as standrd input here.
awk ' ##Starting awk program from here.
match($0,/Charging, [0-9]+%/){ ##Using match function to match regex Charging, [0-9]+% in line here.
print substr($0,RSTART+10,RLENGTH-11) ##Printing sub string and printing from 11th character from starting and leaving last 11 chars here in matched regex of current line.
}'
Using awk:
awk -F"," '{print $2+0}'
Using GNU sed:
sed -rn 's/.*\, *([0-9]+)\%\,.*/\1/p'
You can use sed:
$ acpi | sed -nE 's/.*Charging, ([[:digit:]]*)%.*/\1/p'
18
Or, if Charging is not always in the string, you can look for the ,:
$ acpi | sed -nE 's/[^,]*, ([[:digit:]]*)%.*/\1/p'
Using bash:
s='Battery 0: Charging, 18%, 01:37:09 until charged'
res="${s#*, }"
res="${res%%%*}"
echo "$res"
Result: 18.
res="${s#*, }" removes text from the beginning to the first comma+space and "${res%%%*}" removes all text from end till (and including) the last occurrence of %.

Print everything before relevant symbol and keep 1 character after relevant symbol

I'm trying to find a one-liner to print every before relevant symbol and keep just 1 character after relevant symbol:
Input:
thisis#atest
thisisjust#anothertest
just#testing
Desired output:
thisis#a
thisjust#a
just#t
awk -F"#" '{print $1 "#" }' will almost give me what I want but I need to find a way to print the second character as well. Any ideas?
You can substitute what's after the first character after # with nothing with sed:
sed 's/\(#.\).*/\1/'
You could use grep:
$ grep -o '[^#]*#.' infile
thisis#a
thisisjust#a
just#t
This matches a sequence of characters other than #, followed by # and any character. The -o option retains only the match itself.
With the special RT variable in GNU's awk, you can do:
awk 'BEGIN{RS="#.|\n"}RT!="\n"{print $0 RT}'
Get the index of the '#', then pull out the substring.
$ awk '{print substr($0,1,index($0,"#")+1);}' in.txt
thisis#a
thisisjust#a
just#t
1st Solution: Could you please try following.
awk 'match($0,/[^#]*#./){print substr($0,RSTART,RLENGTH)}' Input_file
Above will print lines as per your ask which have # in them and leave lines which does not have it, in case you want to completely print those lines use following then.
awk 'match($0,/[^#]*#./){print substr($0,RSTART,RLENGTH);next} 1' Input_file
2nd solution:
awk 'BEGIN{FS=OFS="#"} {print $1,substr($2,1,1)}' Input_file
Some small variation of Ravindes 2nd example
awk -F# '{print $1"#"substr($2,1,1)}' file
awk -F# '{print $1FS substr($2,1,1)}' file
Another grep variation (shortest posted so far):
grep -oP '.+?#.' file
o print only matching
P Perl regex (due to +?)
. any character
+ and more
? but stop with:
#
. pluss one more character
If we do not add ?. This line test#one#two becomes test#one#t instead of test#o do to the greedy +
If you want to use awk, the cleanest way to do this with is using index which finds the position of a character:
awk 'n=index($0,'#') { print substr($0,1,n+1) }' file
There are, however, shorter and more dedicated tools for this. See the other answers.