Extract the string matched in a regex, not the line, with awk

This should not be too difficult, but I could not find a solution.
I have an HTML file, and I want to extract all URLs matching a specific pattern.
The pattern is /users/<USERNAME>/ and I actually only need the USERNAME.
This is as far as I got:
awk '/users\/.*\//{print $0}' file
But this prints the complete matching line, which I don't want.
Even the whole URL would be fine (e.g. /users/USERNAME/), but I really only need the USERNAME.

If you want to do this in a single awk command, use the match function:
awk -v s="/users/" 'match($0, s "[^/[:blank:]]+") {
print substr($0, RSTART+length(s), RLENGTH-length(s))
}' file
Otherwise, this grep + cut pipeline will do the job:
grep -Eo '/users/[^/[:blank:]]+' file | cut -d/ -f3
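An HTML file often has several links on one line, and match() only finds the first occurrence. A minimal sketch of looping match() over the rest of the line, assuming a made-up sample line and a plain space/tab bracket expression in place of [:blank:] for older awks:

```shell
# Hypothetical sample line containing two profile links.
html='<a href="/users/alice/">A</a> <a href="/users/bob/posts">B</a>'

result=$(printf '%s\n' "$html" | awk -v s="/users/" '{
    line = $0
    # Match, print the text after the prefix, then continue after the match.
    while (match(line, s "[^/ \t]+")) {
        print substr(line, RSTART + length(s), RLENGTH - length(s))
        line = substr(line, RSTART + RLENGTH)
    }
}')
printf '%s\n' "$result"
```

Each iteration shortens line, so the loop ends once no /users/ prefix remains.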

Set the field delimiter to /, match the second field literally, and print the third. This assumes that users sits between the first two slashes on the line.
$ awk -F/ '$2=="users"{print $3}' file
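If the /users/ segment can appear anywhere on the line, the same field-splitting idea can scan every field instead of only the second. A small sketch, assuming a hypothetical input line:

```shell
# Hypothetical line with two /users/ paths at different positions.
line='see /users/carol/ and <a href="/users/dave/x">d</a>'

result=$(printf '%s\n' "$line" | awk -F/ '{
    # Print whatever follows each field that is exactly "users".
    for (i = 1; i < NF; i++)
        if ($i == "users") print $(i + 1)
}')
printf '%s\n' "$result"
```

The loop stops at NF-1 so that $(i + 1) always refers to an existing field.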

Assuming your statement gives you the entire line of something like
/users/USERNAME/garbage/otherStuff/
You could pipe this result through head assuming you always know that it will be
/users/USERNAME/....
After piping through head, you can also use cut commands to remove more of the end text until you have only the piece you want.
The command will look something like this
awk '/users\/.*\//{print $0}' file | head (options) | cut (options)

Related

awk strings for git

I'm trying to do an awk to retrieve the directory for certain git repos.
Current
git@ssh.gitlab.org:repo1/dir/dir/file1.git
git@ssh.gitlab.org:repo1/dir/dir/file2.git
git@ssh.gitlab.org:repo1/dir/dir/file3.git
git@ssh.gitlab.org:repo1/dir/dir/file4.git
I have the command below, which uses a field separator, but I'm unsure how to remove the .git suffix.
awk -F':' '{print $2}' file
repo1/dir/dir/file1.git
repo1/dir/dir/file2.git
repo1/dir/dir/file3.git
repo1/dir/dir/file4.git
Desired result
repo1/dir/dir/file1
repo1/dir/dir/file2
repo1/dir/dir/file3
repo1/dir/dir/file4

You may use
awk -F':' '{sub(/\.[^.\/]*$/, "", $2); print $2;}' file
Using -F':' you will split all records (lines) into colon-separated fields. You access the second item only using $2, but before printing it, you need to remove the final . and any 0 or more chars other than . and / up to the end of the field value, which is done with sub(/\.[^.\/]*$/, "", $2).
With this solution, you may handle files and folders that may have any amount of dots in their names.

Could you please try the following.
awk -F'[:.]' '{print $(NF-1)}' Input_file
2nd solution: in case you don't want to hard-code the field number, try the following.
awk 'match($0,/:[^.]*/){print substr($0,RSTART+1,RLENGTH-1)}' Input_file
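The approaches so far agree on the sample data but differ once directory or file names contain extra dots. A quick comparison sketch, assuming a hypothetical URL that is not in the question:

```shell
# Hypothetical remote whose path contains extra dots.
url='git@ssh.gitlab.org:repo1/my.dir/file.v2.git'

# Strip only the final ".suffix" from the second colon-separated field.
with_sub=$(printf '%s\n' "$url" | awk -F':' '{ sub(/\.[^.\/]*$/, "", $2); print $2 }')
# Split on both ":" and "." and take the next-to-last field.
with_fs=$(printf '%s\n' "$url" | awk -F'[:.]' '{ print $(NF-1) }')
# Take everything after the colon up to the first dot.
with_match=$(printf '%s\n' "$url" | awk 'match($0,/:[^.]*/){print substr($0,RSTART+1,RLENGTH-1)}')

printf '%s\n' "$with_sub" "$with_fs" "$with_match"
```

Only the sub() version keeps the full path here; the other two cut at a dot inside the path, which is fine for the question's data but worth knowing about.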

With sed
$ sed 's/^[^:]*://; s/\.git$//' file
repo1/dir/dir/file1
repo1/dir/dir/file2
repo1/dir/dir/file3
repo1/dir/dir/file4
s/^[^:]*:// remove up to first : from start of line
s/\.git$// remove .git from end of line
You can also use sed -E 's/^[^:]*:|\.git$//g' to do it with a single substitution.

With a regex engine that supports lookarounds (PCRE; awk and sed do not support them), you can use:
(?<=:)[a-z0-9\/]*
This matches any run of lowercase letters, digits, and slashes after the colon, so it stops at the dot.
Or directly match everything between the : and the last . with
(?<=:).*(?=\.)

AWK: get file name from LS

I have a list of file names (name plus extension) and I want to extract the name only without the extension.
I'm using
ls -l | awk '{print $9}'
to list the file names and then
ls -l | awk '{print $9}' | awk /(.+?)(\.[^.]*$|$)/'{print $1}'
But I get an error on escaping the (:
-bash: syntax error near unexpected token `('
The regex (.+?)(\.[^.]*$|$) to isolate the name has a capture group and I think it is correct, but I don't understand why it is not working within awk syntax.
My list of files is like this ABCDEF.ext in the root folder.

Your specific error is caused by the fact that your awk command is incorrectly quoted. The single quotes should go around the whole command, not just the { action } block.
However, you cannot use capture groups like that in awk. $1 refers to the first field, as defined by the input field separator (which in this case is the default: one or more "blank" characters). It has nothing to do with the parentheses in your regex.
Furthermore, you shouldn't start from ls -l to process your files. I think that in this case your best bet would be to use a shell loop:
for file in *; do
printf '%s\n' "${file%.*}"
done
This uses the shell's built-in capability to expand * to the list of everything in the current directory and removes the .* from the end of each name using a standard parameter expansion.
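To make the parameter expansion concrete, here is a small sketch using made-up file names (not files from the question). ${file%.*} removes the shortest trailing .suffix, while ${file%%.*} removes the longest one:

```shell
# Hypothetical names: one extension, two extensions, and no extension.
result=$(for file in report.txt archive.tar.gz README; do
    # Columns: original name, shortest-suffix strip, longest-suffix strip.
    printf '%s %s %s\n' "$file" "${file%.*}" "${file%%.*}"
done)
printf '%s\n' "$result"
```

A name without a dot is left unchanged by both forms.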
If you really really want to use awk for some reason, and all your files have the same extension .ext, then I guess you could do something like this:
printf '%s\0' * | awk -v RS='\0' '{ sub(/\.ext$/, "") } 1'
This prints all the paths in the current directory, and uses awk to remove the suffix. Each path is followed by a null byte \0 - this is the safe way to pass lists of paths, which in principle could contain any other character.
Slightly less robust but probably fine in most cases would be to trust that no filenames contain a newline, and use \n to separate the list:
printf '%s\n' * | awk '{ sub(/\.ext$/, "") } 1'
Note that the standard tool for simple substitutions like this one would be sed:
printf '%s\n' * | sed 's/\.ext$//'

(.+?) is a PCRE construct. awk uses EREs, not PCREs. Also you have the opening script delimiter ' in the middle of the script AFTER the condition instead of where it belongs, before the start of the script.
The syntax for any command (awk, sed, grep, whatever) is command 'script', so this should be awk 'condition{action}', not awk condition'{action}'.
But, in any case, as mentioned by @Aaron in the comments, don't parse the output of ls; see http://mywiki.wooledge.org/ParsingLs
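Putting the quoting rule and the ERE limitation together, one way to get the intended effect is to quote the whole script and use sub() to delete the last extension instead of relying on a capture group. A sketch, assuming hypothetical names supplied by printf rather than ls:

```shell
# Hypothetical file names; the whole awk script sits inside one pair of single quotes.
result=$(printf '%s\n' ABCDEF.ext notes.tar.gz Makefile |
    awk '{ sub(/\.[^.]*$/, ""); print }')   # delete the final ".something", if any
printf '%s\n' "$result"
```

sub() edits $0 in place, so no capture group is needed; names without a dot pass through untouched.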

Try this.
ls -l | awk 'NF>=9 { s=$9; for (i=10;i<=NF;i++) { s = s" "$i }; sub(/\.[^.]+$/,"",s); print s}'
Notes:
Parsing the output of ls -l is fragile.
It doesn't check what the items are (files? directories?) and strips extensions from everything.
Read the other answers :D

If the extension is always the same pattern try a sed replacement:
ls -l | awk '{print $9}' | sed 's/\.ext$//'

Finding and replacing the last space at or before nth character works with sed but not awk, what am I doing wrong?

I have a string in a test.csv file like this:
here is my string
when I use sed it works just as I expect:
cat test.csv | sed -r 's/^(.{1,9}) /\1,/g'
here is,my string
Then when I use awk it doesn't work and I'm not sure why:
cat test.csv | awk '{gsub(/^(.{1,9}) /,","); print}'
,my string
I need to use awk because once I get this figured out I will be selecting only one column to split into two columns with the added comma. I'm using extended regex with sed, "-r" and was wondering how or if it's supported with awk, but I don't know if that really is the problem or not.

awk does not support back references in gsub. If you are on GNU awk, then gensub can be used to do what you need.
echo "here is my string" | awk '{print gensub(/^(.{1,9}) /,"\\1,","G")}'
here is,my string
Note the doubled \ inside the quoted replacement string. The GNU awk manual has more details on gensub.
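Since gensub is specific to GNU awk, a version without back references may be useful on other awks. A minimal sketch, assuming a hypothetical variable n for the cut-off column and using substr() instead of a regex to find the last space within the first n characters:

```shell
result=$(printf '%s\n' 'here is my string' 'abcdefghijkl mno' | awk -v n=10 '{
    head = substr($0, 1, n)
    pos = 0
    # Walk backwards to the last space within the first n characters.
    for (i = length(head); i > 1; i--)
        if (substr(head, i, 1) == " ") { pos = i; break }
    # Replace that space with a comma; lines without one are left alone.
    if (pos) $0 = substr($0, 1, pos - 1) "," substr($0, pos + 1)
    print
}')
printf '%s\n' "$result"
```

With n=10 this mirrors ^(.{1,9}) followed by a space: the space must fall at column 10 or earlier and have at least one character before it. The same logic can be applied to a single field instead of $0.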

Print matched pattern with AWK

For example, I have this data:
/home/test/dat1.txt
/home/test/dat2.txt
/home/test/test1/dat3.txt
/home/test/test2/dat4.txt
/home/test/test3/test4/dat5.txt
I need to print only the name and extension, that output should be:
dat1.txt
dat2.txt
dat3.txt
dat4.txt
dat5.txt
I need to use the awk command. Can anyone help?
I use this regular expression: '/\/*\.txt/{print ???}

If you are going to use awk, you do not need a regex for this purpose.
You can just tell awk to print the last field, using a field separator of /.
awk -F'/' '{print $NF}' Input.txt
As hd1's comment already noted, NF is the number of fields on the current input record (in this case line). Since awk starts indexing fields at $1, $NF gives you the last field.
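To see how NF and $NF behave with / as the separator, this small sketch prints both for two of the sample paths:

```shell
result=$(printf '%s\n' /home/test/dat1.txt /home/test/test3/test4/dat5.txt |
    awk -F'/' '{ print NF ": " $NF }')   # field count, then the last field
printf '%s\n' "$result"
```

The leading / produces an empty first field, which is why a path with three components reports four fields.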

You could use this short awk
awk -F/ '$0=$NF' Input.txt
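The first form skips any line whose last field is empty (for example a blank line or a path ending in /), because the assignment is used as the condition. If you need those lines printed as empty lines, use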
awk -F/ '{$0=$NF}1' Input.txt
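The difference between the two short forms shows up when the last field is empty. A sketch with a hypothetical input that includes a path ending in a slash:

```shell
# Hypothetical input: the middle path ends in "/", so its last field is empty.
input=$(printf '%s\n' /home/test/dat1.txt /home/test/ /home/test/dat2.txt)

# The assignment doubles as the pattern: an empty result is false, so the line is skipped.
skipping=$(printf '%s\n' "$input" | awk -F/ '$0=$NF')
# Assign in an action, then let the always-true pattern 1 print every line.
keeping=$(printf '%s\n' "$input" | awk -F/ '{$0=$NF}1')

printf '[%s]\n' "$skipping" "$keeping"
```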

Replacing Part of Text Using Sed

I have the following text file
Eif2ak1.aSep07
Eif2ak1.aSep07
LOC100042862.aSep07-unspliced
NADH5_C.0.aSep07-unspliced
LOC100042862.aSep07-unspliced
NADH5_C.0.aSep07-unspliced
What I want to do is remove all the text from the first period (.) to the end of the line.
Why doesn't this command do it?
sed 's/\.*//g' myfile.txt
What's the right way to do it?

You're missing a period there. You want:
s/\..*$//g
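The reason the original attempt fails is that \.* means "zero or more literal periods", so it only deletes the periods themselves. A short side-by-side sketch on one of the sample lines:

```shell
line='NADH5_C.0.aSep07-unspliced'

# \.* matches runs of literal dots (or nothing), so only the dots disappear.
broken=$(printf '%s\n' "$line" | sed 's/\.*//g')
# \. matches the first dot, and .* then consumes the rest of the line.
fixed=$(printf '%s\n' "$line" | sed 's/\..*$//')

printf '%s\n' "$broken" "$fixed"
```

Because the corrected pattern is anchored to the end of the line, it can match at most once per line, so the g flag is optional.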

You can use awk or cut, since dots are your delimiters.
$ awk -F"." '{print $1}' file
Eif2ak1
Eif2ak1
LOC100042862
NADH5_C
LOC100042862
NADH5_C
$ cut -d"." -f1 file
Eif2ak1
Eif2ak1
LOC100042862
NADH5_C
LOC100042862
NADH5_C
This is easier than using a regular expression.
