using sed or grep to extract word - regex

I get only whitespace when I try to extract the ID with this command:
grep "latest" sometextfile | sed 's/.*latest\([\s*]*\).*/\1/'
Contents of sometextfile:
REPO SAL ID CREATED SIZE
asdasfdg.dshgs.asd:54000/my-thing latest c5521d9803e7 asdfa days ago asdfafd.ad
The output of the command above should be the ID: c5521d9803e7.
What is missing in the sed command above?

A simple awk command should work:
awk '/latest/{ print $3 }' file
See this online awk demo. It finds a line with latest in it and prints Field 3 contents.
However, following your original logic, you may use sed alone to extract that piece of string after latest:
sed -n '/[[:space:]]latest[[:space:]]/s/.*latest[[:space:]]*\([^[:space:]]*\).*/\1/p' file
See the online demo
Details
/[[:space:]]latest[[:space:]]/ - finds the line with whitespace+latest+whitespace
s/.*latest[[:space:]]*\([^[:space:]]*\).*/\1/p - finds and replaces with Group 1 contents:
.*latest - any 0+ chars up to the last occurrence of latest
[[:space:]]* - 0 or more whitespaces
\([^[:space:]]*\) - Group 1: any 0 or more non-whitespace chars
.* - any 0+ chars to the end of the line
The -n option suppresses line output and p only shows the substitution result.
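To see why the original attempt returns nothing useful, here is a minimal sketch that runs both commands against a temporary copy of the sample listing. The variable names and the temp file are illustrative assumptions. The group `\([\s*]*\)` sits directly after `latest`, where the next character is a space, so it cannot reach the ID; depending on the sed implementation, the result is empty or blank.

```shell
# Hypothetical sample file reproducing the listing from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
REPO SAL ID CREATED SIZE
asdasfdg.dshgs.asd:54000/my-thing latest c5521d9803e7 asdfa days ago asdfafd.ad
EOF

# Original attempt: the group never reaches the ID, so \1 is empty or blank.
broken=$(grep "latest" "$sample" | sed 's/.*latest\([\s*]*\).*/\1/')

# Corrected: skip whitespace outside the group, capture non-whitespace inside it.
fixed=$(sed -n '/[[:space:]]latest[[:space:]]/s/.*latest[[:space:]]*\([^[:space:]]*\).*/\1/p' "$sample")

printf 'broken=[%s]\nfixed=[%s]\n' "$broken" "$fixed"
rm -f "$sample"
```

The key change is moving the whitespace match out of the capture group and capturing `[^[:space:]]*` instead.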

Related

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?
You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
See the online demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
1st solution: awk makes this simple. The following was written and tested with your shown samples.
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: Set the field separator to / or -, then print the 2nd field, a hyphen, and the 3rd field of the current line.
2nd solution: Using awk's match function.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep. The -o option prints only the matched text and the -P option enables PCRE (Perl-compatible regular expressions). The pattern uses .*/ followed by \K to discard the part matched so far, and then [^-]*-[^-]* to get the text just before the 2nd occurrence of - in the matched line.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'
Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991
You could match until the first occurrence of the /, then clear the match buffer with \K, and then repeat the character class 1+ times with a hyphen in between, so that at least one character is required on each side of the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using GNU grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If GNU awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991
Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
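The parameter substitutions above can be folded into a small function. This is only a sketch: `first_two_fields` is a hypothetical helper name, and it takes both fields from the part after the last slash, which is an added assumption so that hyphens in directory names do not interfere.

```shell
# first_two_fields is a hypothetical helper, not part of the original answer.
# Note: it sets plain (global) variables to stay portable across POSIX shells.
first_two_fields() {
    base=${1##*/}        # drop everything up to the last slash
    first=${base%%-*}    # text before the first hyphen
    rest=${base#*-}      # text after the first hyphen
    second=${rest%%-*}   # text before the next hyphen
    printf '%s-%s\n' "$first" "$second"
}

first_two_fields 'home/dir/JOHNSMITH-4991-common-task-list'
```

No sub-process is spawned inside the function, which is the performance point made in the notes.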

How to do multiple grep pattern to find value in grepped string

I am trying to do multiple grep pattern to find a number within a grepped string.
I have a text file like this:
This is the first sample line 1
this is the second sample line
another line
total lines: 3 tot
I am trying to find a way to get just the number of total lines. So the output here should be "3"
Here are the things I've tried:
grep "total lines: [0-9]" myfile.txt
grep "total lines" myfile.txt | grep "[0-9]"
You could use sed:
sed -En 's/^total lines: ([0-9]+).*/\1/p' myfile.txt
-E extended regular expressions
-n suppress automatic printing
Match ^total lines: ([0-9]+).* (capture the number)
\1 replace the whole line with the captured number
p print the result
1st solution: Using GNU grep, try the following. The -o option prints only the matched text and -P enables PCRE. The regex matches ^total lines: at the start of a line, \K discards what has been matched so far (removing it from the output), [0-9]+ then matches one or more digits, and the positive lookahead (?=\s+tot) makes sure they are followed by whitespace and tot.
grep -oP '^total lines: \K[0-9]+(?=\s+tot)' Input_file
2nd solution: With your shown samples, this can be done in a single awk call. It finds the line containing total lines: and prints the second-to-last field of that line.
awk '/total lines: /{print $(NF-1)}' Input_file
3rd solution: Using awk's match function. It matches total lines: [0-9]+ tot and then removes everything except digits from the matched text.
awk 'match($0,/total lines: [0-9]+ tot/){val=substr($0,RSTART,RLENGTH);gsub(/[^0-9]+/,"",val);print val}' Input_file
Do you have to use grep?
$ wc -l < myfile.txt
If you mean that the file has a line in it formatted as
total lines: 3 tot
Then refer to https://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match and use something like:
grep -Po 'total lines: \K\d+' myfile.txt
Notes:
Perl regex is not my forte, so the \K\d+ part might not work.
This may be doable without -P, but I cannot test from this windows computer.
regex101.com helped me test the above line, so it may work.
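For systems where grep lacks -P, one possible sketch chains two grep -o calls instead. It assumes a grep that supports -o (GNU, BSD, and BusyBox builds commonly do, though -o is not required by POSIX); the temp file is a hypothetical copy of the sample.

```shell
# Hypothetical copy of the sample file from the question.
f=$(mktemp)
cat > "$f" <<'EOF'
This is the first sample line 1
this is the second sample line
another line
total lines: 3 tot
EOF

# Stage 1 keeps "total lines: 3"; stage 2 keeps only the trailing digits.
total=$(grep -o '^total lines: [0-9][0-9]*' "$f" | grep -o '[0-9][0-9]*$')
echo "$total"
rm -f "$f"
```

The first stage anchors on the label so that the digit at the end of the first sample line is not picked up.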
The problem with relying on the pattern of the last line and using grep/sed to find it is that if any other line in the file contains the same pattern, you need additional logic to filter it out.
For example, consider the input file below.
line001
total lines: 883 tot
This is the first sample line 1
this is the second sample line
another line
total lines: 883 tot
Assuming your file format is constant (i.e. the second-to-last line is blank and the last line contains the total count), you can skip pattern matching and count the rows directly with the awk command below.
awk 'END { print NR - 2 }' myfile.txt
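If pattern matching is still preferred, a possible sketch is to keep only the last matching line. The file below is a hypothetical variation of the example: the footer value is changed to 5 so the two matches can be told apart, and a blank line is placed before the footer as the answer assumes.

```shell
# Hypothetical input: a body line that mimics the footer, a blank line, then the real footer.
f=$(mktemp)
cat > "$f" <<'EOF'
line001
total lines: 883 tot
This is the first sample line 1
this is the second sample line
another line

total lines: 5 tot
EOF

# Remember field 3 of every matching line; only the last one survives to END.
last=$(awk '/^total lines:/ { n = $3 } END { print n }' "$f")

# Row count as in the answer above: all lines minus the blank line and the footer.
count=$(awk 'END { print NR - 2 }' "$f")

printf 'last match: %s\nrow count: %s\n' "$last" "$count"
rm -f "$f"
```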
You can use the following awk to get the third field on a line that starts with total lines: and stop processing the file further:
awk '/^total lines:/{print $3; exit}' file
See this online demo.
You can use the following GNU grep:
# Extract a non-whitespace chunk after a certain pattern
grep -oP '^total lines:\s*\K\S+' file
# Extract a number after a pattern
grep -oP '^total lines:\s*\K\d+(?:\.\d+)?' file
See an online demo. Details:
^ - start of string
total lines: - a literal string
\s* - any zero or more whitespace chars
\K - match reset operator discarding all text matched so far
\S+ - one or more non-whitespace chars
\d+(?:\.\d+)? - one or more digits and then an optional sequence of . and one or more digits.
See the regex demo.

Match from beginning to word as long as there are no . in between: Convert grep -Po command to sed

I have made the following command to be able to match the string from the beginning of the line until the first occurrence of ".enabled" as long as there are no "." in between.
grep -Po '^\K[\w-]*?(?=\.enabled)'
input:
a-b-c.a.enabled.xxx.xx
a-b-c.a.b.enabled.xxx.xx
a-b-c.enabled.xxx.xx
output:
a-b-c
It runs properly on my local env with grep v3.1 but on Busybox v1.28.4 it says "grep: unrecognized option: P"
For that reason, I would like to convert this command to sed. Any input would be really helpful.
You can use
awk -F'.' '$2 == "enabled"{print $1}' file
sed -n 's/^\([^.]*\)\.enabled.*/\1/p' file
See the online demo.
Details:
awk:
-F'.' - the field separator is set to a .
$2 == "enabled" - if Field 2 value is enabled, then
{print $1} - print Field 1 value
sed:
-n - suppresses default line output in the sed command
s/^\([^.]*\)\.enabled.*/\1/p - finds any zero or more chars other than . at the start of string (placing them into Group 1, \1), then a .enabled and then the rest of the string and replaces with the Group 1 value, and prints the resulting value.
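As a quick check, the sketch below feeds the three sample lines from the question to both commands. The variable names are illustrative only. Only the third line has .enabled directly after the first dot-free chunk, so each command should print a single result.

```shell
# The three sample input lines from the question.
input='a-b-c.a.enabled.xxx.xx
a-b-c.a.b.enabled.xxx.xx
a-b-c.enabled.xxx.xx'

# sed: [^.]* cannot cross a dot, so lines 1 and 2 do not match.
sed_out=$(printf '%s\n' "$input" | sed -n 's/^\([^.]*\)\.enabled.*/\1/p')

# awk: with "." as the separator, field 2 must be exactly "enabled".
awk_out=$(printf '%s\n' "$input" | awk -F'.' '$2 == "enabled" { print $1 }')

printf 'sed: %s\nawk: %s\n' "$sed_out" "$awk_out"
```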
You may use this equivalent sed of your grep -P command:
sed -nE 's/^([-_[:alnum:]]+)\.enabled.*/\1/p' file
a-b-c
Details:
-n: Suppresses normal output
-E: Enables extended regex mode
([-_[:alnum:]]+): [-_[:alnum:]] is equivalent to [-\w] or [-_a-zA-Z0-9]. It matches 1+ of these characters and captures them in group #1
\.enabled.*: matches .enabled followed by 0 or more of any character
\1: the replacement string, which puts the value captured in group #1 back
With your shown samples, you could try following.
awk -F'\\.enabled' '$1~/^[-_[:alnum:]]+$/{print $1}' Input_file
Explanation: Set the field separator to .enabled for all lines. Then the main program checks whether the 1st field consists only of -, _, or alphanumeric characters and, if so, prints it.

SED invalid command code for JSON response

I am trying to get a value from a JSON from my local server (https://regex101.com/r/qeGcGu/1) on a headless Mac mini (Catalina), via sed. However, the sed commands I expected to work all fail:
usr#mcMini ~/Documents/qBitTorrent cat /tmp/json.out | sed -i.bak '"hash":"(.*?)"'
sed: 1: ""hash":"(.*?)"": invalid command code "
usr#mcMini ~/Documents/qBitTorrent cat /tmp/json.out | sed -i.bak '\"hash\":\"(.*?)\"'
sed: 1: "\"hash\":\"(.*?)\"": unterminated regular expression
usr#mcMini ~/Documents/qBitTorrent cat /tmp/json.out | sed -i '' '\"hash\":\"(.*?)\"'
sed: 1: "\"hash\":\"(.*?)\"": unterminated regular expression
usr#mcMini ~/Documents/qBitTorrent cat /tmp/json.out | sed -i '' '"hash":"(.*?)"'
sed: 1: ""hash":"(.*?)"": invalid command code "
The file that I am trying to get the string from is a raw json.
[{"added_on":1587102956,"amount_left":0,"auto_tmm":false,"availability":-1,"category":"radarr","completed":1218638934,"completion_on":1587108704,"dl_limit":-1,"dlspeed":0,"downloaded":1220894674,"downloaded_session":0,"eta":8640000,"f_l_piece_prio":false,"force_start":true,"hash":"87802183fc647548ec6efe18feb16149522f6aa0","last_activity":1587119220,"magnet_uri":"magnet:?xt=urn:btih:87802183fc647548ec6efe18feb16149522f6aa0&dn=Fantasia%202000%20(1999)%20%5b1080p%5d%20%5bYTS.AG%5d&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969%2fannounce&tr=udp%3a%2f%2f9.rarbg.com%3a2710%2fannounce&tr=udp%3a%2f%2fp4p.arenabg.com%3a1337&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969&tr=udp%3a%2f%2ftracker.internetwarriors.net%3a1337&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce&tr=udp%3a%2f%2ftracker.zer0day.to%3a1337%2fannounce&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969%2fannounce&tr=udp%3a%2f%2fcoppersurfer.tk%3a6969%2fannounce","max_ratio":-1,"max_seeding_time":-1,"name":"Fantasia 2000 (1999) [1080p] [YTS.AG]","num_complete":22,"num_incomplete":4,"num_leechs":0,"num_seeds":0,"priority":0,"progress":1,"ratio":0.1782183661159947,"ratio_limit":-2,"save_path":"/Volumes/1049/Media/","seeding_time_limit":-2,"seen_complete":1587118087,"seq_dl":false,"size":1218638934,"state":"forcedUP","super_seeding":false,"tags":"","time_active":13224,"total_size":1218638934,"tracker":"udp://tracker.coppersurfer.tk:6969/announce","up_limit":-1,"uploaded":217585854,"uploaded_session":128831791,"upspeed":0}]
Actually what I want to accomplish is to get the first 6 chars from hash:
"hash":"87802183fc647548ec6efe18feb16149522f6aa0"
In this case my desired value is 878021
Could you please guide me in the correct direction?
You may use
sed -n 's/.*"hash":"\([^"]*\).*/\1/p' /tmp/json.out
Note that the file can be passed directly to the sed command; there is no need to pipe it through cat.
How it works
-n - option that suppresses the default line output (by default, sed prints every line, including non-matching ones)
s/ - substitute command (we are replacing)
.*"hash":"\([^"]*\).* - matches
.* - 0+ chars
"hash":" - "hash":" substring
\([^"]*\) - Group 1 (capturing group, \1 is used in the replacement part to refer to this value) - any 0+ chars other than "
.* - 0+ chars
\1 - the replacement is Group 1 value (it is all that remains on the matching line)
p - if there was a valid replacement print the result after replacement only.
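The command above prints the whole hash, while the question asks for its first 6 characters. A minimal sketch of two ways to get there follows; the JSON is a trimmed, hypothetical stand-in for /tmp/json.out that keeps the same hash value. The original attempts failed because sed expects a command such as s/…/…/ rather than a bare pattern, and POSIX regular expressions have no lazy .*? quantifier, which is why [^"] is used instead.

```shell
# Trimmed, hypothetical stand-in for /tmp/json.out (same "hash" value as in the question).
json=$(mktemp)
printf '%s\n' '[{"force_start":true,"hash":"87802183fc647548ec6efe18feb16149522f6aa0","last_activity":1587119220}]' > "$json"

# \{6\} limits the capture to exactly six non-quote characters.
short=$(sed -n 's/.*"hash":"\([^"]\{6\}\).*/\1/p' "$json")

# Alternative: capture the full hash, then trim it with cut.
short2=$(sed -n 's/.*"hash":"\([^"]*\).*/\1/p' "$json" | cut -c1-6)

printf '%s\n%s\n' "$short" "$short2"
rm -f "$json"
```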

Problem to change an occurrence in a file with sed

I have a file with several lines :
OTU3055 UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
OTU0856 OTU53699 UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
OTU0125 UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
I want to remove all OTUXXXX occurrences (there are always 4 digits after the "OTU") that appear in the file. I used sed but it didn't work. The OTUXXXX always appears at the beginning of the lines.
sed 's/OTU[0-9]{4} //g' my_file.txt
I put a space after OTU[0-9]{4} because I want the UniRef90 IDs to be at the beginning of each line.
Edit :
sed -r 's/OTU[0-9]{4} //g' my_file.txt works. But I get another problem,
UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
Some lines still begin with a white space. I tried sed 's/^ *//' my_file.txt and it does not work. I want the second line of my file to start like the two other lines, without any space.
You may use
sed -r 's/[[:space:]]*\bOTU[0-9]{4,}\b[[:space:]]*//g' file > newfile
Or, if the matches can be found anywhere, not only at the string start:
sed -r 's/[[:space:]]*\bOTU[0-9]{4,}\b//g' file | sed 's/[[:space:]]*$//' > newfile
The whitespaces after the OTU<digits> won't get matched with the second snippet, so a piped sed command is necessary.
See the online demo.
Details
[[:space:]]* - 0+ whitespace chars
\b a word boundary
OTU[0-9]{4,} - OTU and 4 or more digits
\b - a word boundary
[[:space:]]* - 0+ whitespace chars.
There is no explanation for your posted actual output given your posted input and the command you ran. But if you want to match 4 or more digits, and the space after the OTU* strings could be a tab or some other whitespace that's not a blank char, then this is what you need, using GNU or OSX/BSD sed for -E:
$ sed -E 's/(OTU[0-9]{4,}[[:space:]]+)+//' file
UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
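The difference between {4} and {4,} shows up on the second input line, where OTU53699 has five digits. The sketch below compares the two on a temporary copy of the sample file. The ^ anchor is an added assumption, based on the statement that the OTU tokens always appear at the start of a line.

```shell
# Hypothetical copy of the sample file from the question.
f=$(mktemp)
cat > "$f" <<'EOF'
OTU3055 UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
OTU0856 OTU53699 UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
OTU0125 UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
EOF

# {4} followed by a space leaves the five-digit OTU53699 in place ...
exact=$(sed -E 's/OTU[0-9]{4} //g' "$f" | sed -n '2p')

# ... while {4,} inside a repeated group strips every leading OTU token.
atleast=$(sed -E 's/^(OTU[0-9]{4,}[[:space:]]+)+//' "$f" | sed -n '2p')

printf '%s\n%s\n' "$exact" "$atleast"
rm -f "$f"
```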