awk strings for git - regex

I'm trying to use awk to retrieve the directory path for certain git repos.
Current
git@ssh.gitlab.org:repo1/dir/dir/file1.git
git@ssh.gitlab.org:repo1/dir/dir/file2.git
git@ssh.gitlab.org:repo1/dir/dir/file3.git
git@ssh.gitlab.org:repo1/dir/dir/file4.git
I have this below using a field separator, but I'm unsure how to remove .git
awk -F':' '{print $2}' file
repo1/dir/dir/file1.git
repo1/dir/dir/file2.git
repo1/dir/dir/file3.git
repo1/dir/dir/file4.git
Desired result
repo1/dir/dir/file1
repo1/dir/dir/file2
repo1/dir/dir/file3
repo1/dir/dir/file4

You may use
awk -F':' '{sub(/\.[^.\/]*$/, "", $2); print $2;}' file
Using -F':' you will split all records (lines) into colon-separated fields. You access the second item only using $2, but before printing it, you need to remove the final . and any 0 or more chars other than . and / up to the end of the field value, which is done with sub(/\.[^.\/]*$/, "", $2).
With this solution, you may handle files and folders that may have any amount of dots in their names.
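As a quick check of that claim, here is a minimal sketch. The sample remotes are hypothetical (the second one has dots in a directory name and in the file name); only the final extension should be stripped.

```shell
# Hypothetical sample remotes; the second contains extra dots on purpose.
remotes='git@ssh.gitlab.org:repo1/dir/dir/file1.git
git@ssh.gitlab.org:repo1/dir.v2/my.file.git'

# Strip the last ".xxx" from field 2, where xxx contains neither "." nor "/".
result=$(printf '%s\n' "$remotes" |
  awk -F':' '{ sub(/\.[^.\/]*$/, "", $2); print $2 }')

printf '%s\n' "$result"
```

Because the pattern is anchored with $ and excludes . and / after the dot, only the trailing .git is removed; dots earlier in the path are left alone.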

You could try the following.
awk -F'[:.]' '{print $(NF-1)}' input_file
2nd solution: if you don't want to hard-code the field number, try the following.
awk 'match($0,/:[^.]*/){print substr($0,RSTART+1,RLENGTH-1)}' Input_file

With sed
$ sed 's/^[^:]*://; s/\.git$//' file
repo1/dir/dir/file1
repo1/dir/dir/file2
repo1/dir/dir/file3
repo1/dir/dir/file4
s/^[^:]*:// remove up to first : from start of line
s/\.git$// remove .git from end of line
You can also use sed -E 's/^[^:]*:|\.git$//g' to do it with a single substitution.

With regex, you can use:
(?<=:)[a-z0-9\/]*
Match anything composed of lowercase letters, digits and slashes after the colon, so it will stop at the dot. This needs a regex engine with lookbehind support, such as PCRE.
Or directly match everything between : and . with
(?<=:).*(?=\.)
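Neither awk nor sed understands lookarounds, so these patterns need a PCRE-capable tool. A minimal sketch, assuming GNU grep built with -P support (this is an assumption about your system, not something the answer states):

```shell
# Assumes GNU grep with PCRE (-P); -o prints only the matched part.
url='git@ssh.gitlab.org:repo1/dir/dir/file1.git'

first=$(printf '%s\n' "$url" | grep -oP '(?<=:)[a-z0-9\/]*')
second=$(printf '%s\n' "$url" | grep -oP '(?<=:).*(?=\.)')

printf '%s\n' "$first" "$second"
```

The second pattern is greedy, so it runs up to the last dot on the line rather than the first one after the colon.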

Related

rename specific lines in a text file with sed

I have a file that looks like this:
>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj
I would like to edit just the lines starting with >, ideally in-place, to get a file:
>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj
I know that in principle this is achievable with various combinations of sed/awk/cut, but I haven't been able to figure out the right combination. Ideally it should be fast - the file has many millions of lines, and many of the lines are also very long.
Key things about the lines I want to edit:
Always start with >
The bit I want to keep is always between the first and second pipe symbol | (hence thinking cut is going to help)
The bit I want to keep has alphanumeric symbols and sometimes underscores. The rest of the string on the same line can have any symbols
What I've tried that seems helpful
(Most of my sed attempts are pure garbage)
cut -d '|' -f 2 test.txt
Gets me the bit of the string that I want, and it keeps the other lines too. So it's close, but (of course) it doesn't preserve the initial > on the lines where cut applies, so it's missing a crucial part of the solution.
With sed:
sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
/^>/ to select lines starting with >, not strictly necessary for given sample but sometimes this provides faster result than using s alone
^[^|]+\| this will match the non-| characters from the start of the line, up to and including the first |
([^|]+) capture the second field
.* rest of the line
>\1 replacement string where \1 will have the contents of ([^|]+)
If your input has only ASCII characters, this may give you much faster results:
LC_ALL=C sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
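The question also asks for an in-place edit. A minimal sketch of one portable way to do that, writing to a temporary file and replacing the original only if sed succeeds. The file name and the temporary directory are hypothetical, created here just for the demonstration:

```shell
# Hypothetical sample file built from the question's data.
tmpdir=$(mktemp -d)
input="$tmpdir/test.txt"
printf '%s\n' '>alks|keep1|aoiuor|lskdjf' 'ldkfj' \
              '>jhoj_kl|keep2|kjghoij|adfjl' 'aldskj' > "$input"

# Rewrite header lines, then move the result over the original.
LC_ALL=C sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/' "$input" > "$input.tmp" &&
  mv "$input.tmp" "$input"

result=$(cat "$input")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The && keeps the original file intact if sed fails partway through, which matters for a file with many millions of lines.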
Timing
In timing tests on a huge file created from the given input sample, awk was much faster than sed, and mawk was faster still.
However, the OP reports that the sed solution is faster on the actual data.
With your shown samples, you could try the following. It sets the field separator to | for all lines of Input_file; in the main program, if a line starts with >, it prints > followed by the 2nd field, otherwise it prints the complete line.
awk -F'|' '/^>/{print ">"$2;next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk -F'|' ' ##Starting awk program from here and setting field separator as | here.
/^>/{ ##Checking condition if line starts from > then do following.
print ">"$2 ##Printing 2nd field of current line here.
next ##next will skip all further statements from here.
}
1 ##Will print current line.
' Input_file ##mentioning Input_file name here.
You can also use the following awk command:
awk -F\| '/^>/{print ">"$2} !/^>/{print}' file
# Inplace replacement with gawk (GNU awk)
gawk -i inplace -F\| '/^>/{print ">"$2} !/^>/{print}' file
# "Inline-like" replacement with any awk
awk -F\| '/^>/{print ">"$2} !/^>/{print}' file > tmp && mv tmp file
Here,
-F\| - sets the field separator to a | char
/^>/ is the condition: if the line starts with > (and !/^>/ means the opposite)
{print ">"$2} prints the Field 2 value with a > char prepended to it
{print} simply prints the full line.
Note that !/^>/{print} can be reduced to !/^>/, since print is the default action.
See an online demo:
s='>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj'
awk -F\| '/^>/{print ">"$2} !/^>/{print}' <<< "$s"
Output:
>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj

Remove hostnames from a single line that follow a pattern in bash script

I need to read a file and edit a single line containing multiple domain names, removing any domain name that contains a certain 4-letter pattern, e.g. ozar.
This will be used in a bash script, so the number of domain names can vary. I will save this to a csv later on, but right now returning a string is fine.
I tried multiple commands, loops, and if statements, but sending the output to a variable I can use further in the script proved to be another difficult task.
Example file
$ cat file.txt
ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org ozarkz.com website.com
What I attempted (that was close)
domains_1=$(cat /tmp/file.txt | sed 's/ozar*//g')
domains_2=$( cat /tmp/file.txt | printf '%s' "${string##*ozar}")
Goal
echo domain_x
win.ad.win.edu ap.allk.org allk.org website.com
If all the domains are on a single line separated by spaces, this might work:
awk '/ozar/ {next} 1' RS=" " file.txt
This sets RS, your record separator, then skips any record that matches the keyword. If you wanted to be able to skip a substring provided in a shell variable, you could do something like this:
$ s=ozar
$ awk -v re="$s" '$0 ~ re {next} 1' RS=" " file.txt
Note that the ~ operator is comparing a regular expression, not precisely a substring. You could leverage the index() function if you really want to check a substring:
$ awk -v s="$s" 'index($0,s) {next} 1' RS=" " file.txt
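To see the difference between the regex match and the substring match, here is a small sketch with a hypothetical keyword that contains a dot. As a regex the dot matches any character; with index() it is literal.

```shell
s='win.ad'   # hypothetical keyword containing a regex metacharacter
line='win.ad.win.edu winXad.example.org allk.org'

# Regex match: "." matches any character, so winXad... is dropped too.
regex_kept=$(printf '%s' "$line" | awk -v RS=' ' -v re="$s" '$0 ~ re {next} 1')

# Substring match: only the literal text "win.ad" is dropped.
index_kept=$(printf '%s' "$line" | awk -v RS=' ' -v s="$s" 'index($0,s) {next} 1')

printf 'regex: %s\n' "$regex_kept"
printf 'index: %s\n' "$index_kept"
```

RS is set with -v here rather than as a trailing assignment because the input comes from a pipe instead of a file operand.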
Note that all of the above is awk, which isn't what you asked for. If you'd like to do this with bash alone, the following might be for you:
while read -r -a a; do
  for i in "${a[@]}"; do
    [[ "$i" = *"$s"* ]] || echo "$i"
  done
done < file.txt
This assigns each line of input to the array $a[], then steps through that array testing for a substring match and printing if there is none. Text processing in bash is MUCH less efficient than in a more specialized tool like awk or sed. YMMV.
If you want to delete everything from ozar up to the next space delimiter, note that any part of the word before ozar (win_fl. below) is left behind:
$ sed 's/ozar[^ ]*//g' file
win.ad.win.edu win_fl. ap.allk.org allk.org website.com
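To end up with the single space-separated string from the Goal in a shell variable, one option is to filter whole fields in awk and rejoin them. This is a sketch, not taken from the answers above; the variable names are hypothetical and the input line is the sample from the question:

```shell
s=ozar
line='ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org ozarkz.com website.com'

# Keep only fields that do not contain the keyword, joined by single spaces.
domains=$(printf '%s\n' "$line" | awk -v s="$s" '{
  out = ""
  for (i = 1; i <= NF; i++)
    if (index($i, s) == 0)
      out = out (out == "" ? "" : " ") $i
  print out
}')

printf '%s\n' "$domains"
```

Working on whole fields avoids leaving fragments such as win_fl. behind, and the command substitution puts the result straight into a variable for later use in the script.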

sed replace AFTER match and retain

I've been racking my brains for hours on this, but it seems simple enough. I have a large list of strings similar to the ones below and would like to replace the hyphens only after the comma, to commas:
abc-d-ef,1-2-3-4
gh-ij,1-2-3-4
to this
abc-def,1,2,3,4
gh-ij,1,2,3,4
I can't use s/-/,/2g to replace from the second occurrence as the data differs, and I also thought about using cut, but there must be a way to use sed with something like:
"s/\(,\).*-/\1,&/g"
Thank you
This is more suitable for awk as we can break all lines using comma as field separator:
awk 'BEGIN{FS=OFS=","} {gsub(/-/, OFS, $2)} 1' file
abc-d-ef,1,2,3,4
gh-ij,1,2,3,4
If you want sed solution only then use:
sed -E -e ':a' -e 's/([^,]+,[^-]+)-/\1,/g;ta' file
abc-d-ef,1,2,3,4
gh-ij,1,2,3,4
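The awk version above only rewrites the second comma-separated field. If the part after the first comma could itself contain commas, a sketch that splits at the first comma with index() and substr() may be safer. This is an assumption about data beyond the two sample lines, not something stated in the question:

```shell
result=$(printf '%s\n' 'abc-d-ef,1-2-3-4' 'gh-ij,1-2-3-4' |
awk '{
  p = index($0, ",")              # position of the first comma, 0 if none
  if (p) {
    head = substr($0, 1, p)       # everything up to and including the comma
    tail = substr($0, p + 1)      # everything after it
    gsub(/-/, ",", tail)          # hyphens become commas only in the tail
    $0 = head tail
  }
  print
}')

printf '%s\n' "$result"
```

Lines without a comma are printed unchanged.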
An awk proposal.
awk -F, '{sub(/d-ef/,"def"); gsub(/-/,",",$2)}1' OFS=, file
abc-def,1,2,3,4
gh-ij,1,2,3,4

Get specific Text between Specific Tags

At the top of my HTML files, I have...
<H2>City</H2>
<P>Liverpool</P>
or
<H2>City</H2>
<P>Dublin</P>
I want to output the text between the tags straight after <H2>City</H2> instances. So in the examples above which are separate files, I want to print out Liverpool and in the second example, Dublin.
Looking at this thread, I try:
sed -e 's/City\(.*\)\/P/\1/'
which I hope would get me half way there... but that just prints out the entire file. Any ideas?
awk to the rescue! You need multi-char RS support though (gawk has it)
$ awk -F'[<>]' -v RS='<H2>City</H2>' 'NF{print $3}' file
another approach can be
$ awk 'c&&c--{gsub(/<[^>]*>/,""); print} /<H2>City<\/H2>/{c=1}' file
find the next record after City and strip the tags (gsub removes both the opening and the closing tag)...
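The c&&c-- pattern is a small counter idiom: matching the City heading sets c to 1, and the first rule then fires for exactly that many following lines. A minimal sketch on hypothetical sample HTML, using gsub so that both tags are removed:

```shell
# Hypothetical sample; only the line right after the City heading is wanted.
html='<H2>City</H2>
<P>Liverpool</P>
<H2>Country</H2>
<P>England</P>'

city=$(printf '%s\n' "$html" | awk '
  c && c-- { gsub(/<[^>]*>/, ""); print }   # fires while the counter is positive
  /<H2>City<\/H2>/ { c = 1 }                # arm the counter for the next line
')

printf '%s\n' "$city"
```

Setting c to a larger number would print that many lines after the heading instead of one.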
Try using the following regex :
(?s)(?<=City<\/H2>\n<P>).*?(?=<\/P>)
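sed
sed does not support lookarounds or the (?s) flag, so the pattern above cannot be passed to sed as is. I checked, and \s does not seem to work for the whitespace between the tags; use the newline character \n instead. Because sed reads one line at a time, the next line has to be appended with N before \n can match:
sed -n '/<H2>City<\/H2>/{N;s/.*<H2>City<\/H2>\n<P>\(.*\)<\/P>.*/\1/p;}' file
There is no need for a lookbehind (as above); that is overkill.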
With sed, you can use the n command to read the next line after your pattern, then remove the tags to output your content. The braces restrict the substitution to the line that follows the heading:
sed -n '/<H2>City<\/H2>/{n;s/ *<\/*P> *//gp;}' file
I think this should work on your Mac:
echo -e "<H2>City</H2>\n<P>Dublin</P>" |awk -F"[<>]" '/City/{getline;print $3}'
Dublin

Using regex to match a pattern in the middle of a string with awk, sed, grep ... something linux-y

I have a file with ID numbers and a bunch of patterns that represent gene trees
ex:
021557 (sfra,(pdep,snud),((spal,sint),(sdro,(hpul,(sprp,afra)))));
005852 (snud,sfra,(pdep,(hpul,((afra,sprp),(sint,(spal,sdro))))));
023685 (sfra,snud,(pdep,(hpul,((sprp,(afra,spal)),(sdro,sint)))));
022020 (sfra,snud,(pdep,(hpul,(afra,(sprp,(sdro,(sint,spal)))))));
028284 (sfra,snud,(pdep,(hpul,(sprp,((sdro,sint),(spal,afra))))));
I am interested in a certain sister taxon grouping of (spal,afra). I want to print the IDs from the first column if the tree contains (spal,afra).
Output if it was only run on the data above should be:
023685
028284
I was going to do something like:
awk '{if ($2 == "(spal,afra)") { print $1 } }'
but I realize that the part that I'm trying to match is within a bunch of other characters, and at no predictable location...
So I need to search for
any number of lowercase letters or parentheses or commas
(spal,afra)
any number of lowercase letters or parentheses or commas or ;
Also, I guess I want to know of occurrences in the other order (afra,spal). But I was going to run separate matches, combine the output and do something with sort and uniq -c if I remember right... I can probably figure that out by myself later.
I'm kind of new to this and I've already spent a couple of hours trying to figure something out. Thank you!
You seem to have this as an input file
$ cat file
021557 (sfra,(pdep,snud),((spal,sint),(sdro,(hpul,(sprp,afra)))));
005852 (snud,sfra,(pdep,(hpul,((afra,sprp),(sint,(spal,sdro))))));
023685 (sfra,snud,(pdep,(hpul,((sprp,(afra,spal)),(sdro,sint)))));
022020 (sfra,snud,(pdep,(hpul,(afra,(sprp,(sdro,(sint,spal)))))));
028284 (sfra,snud,(pdep,(hpul,(sprp,((sdro,sint),(spal,afra))))));
Using awk
To print the first column for any line that contains (spal,afra):
$ awk '/[(]spal,afra[)]/{print $1}' file
028284
The condition /[(]spal,afra[)]/ selects lines that contain (spal,afra) and print $1 prints the first field on those lines.
In awk regular expressions, parens are active characters. Since we want to match literal parens, we put them in square brackets like [(] and [)].
Using sed
$ sed -n '/(spal,afra)/ s/\t.*//p' file
028284
sed -n will not print anything unless we explicitly ask it to. /(spal,afra)/ selects lines containing (spal,afra). s/\t.*//p removes the first tab and everything after it, then prints what remains (this assumes the columns are tab-separated).
By default, sed uses basic regular expressions. This means that ( and ) are not active. Consequently, we do not need to escape them.
Using grep and cut
$ grep '(spal,afra)' file | cut -f1
028284
grep '(spal,afra)' file selects lines that contain (spal,afra) and cut -f1 selects the first field from those lines.
Like sed, grep defaults to using basic regular expressions. This means that ( and ) are both treated as literal characters and there is no need to escape them.
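The three ways of treating the parentheses literally can be compared side by side. This is a minimal sketch on two of the sample lines; the data here is space-separated (an assumption of this example), so cut is given -d' ' instead of relying on its default tab delimiter:

```shell
trees='023685 (sfra,snud,(pdep,(hpul,((sprp,(afra,spal)),(sdro,sint)))));
028284 (sfra,snud,(pdep,(hpul,(sprp,((sdro,sint),(spal,afra))))));'

# Basic regex: ( and ) are literal by default.
bre=$(printf '%s\n' "$trees" | grep '(spal,afra)' | cut -d' ' -f1)

# Extended regex: ( and ) group, so they are escaped to match literally.
ere=$(printf '%s\n' "$trees" | grep -E '\(spal,afra\)' | cut -d' ' -f1)

# Fixed string: no regex interpretation at all.
fixed=$(printf '%s\n' "$trees" | grep -F '(spal,afra)' | cut -d' ' -f1)

printf '%s\n' "$bre" "$ere" "$fixed"
```

grep -F is often the simplest choice when the search text is a plain string, since nothing in it needs escaping.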
Alternative: Looking for either (spal,afra) or (afra,spal)
If we want to look for (afra,spal) in addition to (spal,afra), then we need to update the regular expressions. Taking awk for example:
awk '/[(](spal,afra|afra,spal)[)]/{print $1}' file
023685
028284
Here, the vertical bar, |, separates choices. The regex accepts either what is before or after the bar.
You can use this non-regex search in awk:
awk 'index($0, "(spal,afra)") || index($0, "(afra,spal)") {print $1}' file
023685
028284
This should work (sed with extended regex):
sed -nr 's/([^[:space:]]*)[^;]*(\(spal,afra\)|\(afra,spal\)).*/\1/p' file
Output:
023685
028284