I have a list of file names (name plus extension) and I want to extract the name only without the extension.
I'm using
ls -l | awk '{print $9}'
to list the file names and then
ls -l | awk '{print $9}' | awk /(.+?)(\.[^.]*$|$)/'{print $1}'
But I get an error complaining about the (:
-bash: syntax error near unexpected token `('
The regex (.+?)(\.[^.]*$|$) to isolate the name has a capture group and I think it is correct; what I don't get is why it is not working within awk's syntax.
My list of files looks like this: ABCDEF.ext, in the root folder.
Your specific error is caused by the fact that your awk command is incorrectly quoted. The single quotes should go around the whole command, not just the { action } block.
However, you cannot use capture groups like that in awk. $1 refers to the first field, as defined by the input field separator (which in this case is the default: one or more "blank" characters). It has nothing to do with the parentheses in your regex.
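If you do want regex-driven extraction inside awk, the portable substitute for a capture group is match() together with substr() and the RSTART built-in. A minimal sketch, using the example name from the question:
echo 'ABCDEF.ext' | awk '{ if (match($0, /\.[^.]*$/)) $0 = substr($0, 1, RSTART-1) } 1'
This prints ABCDEF; a name containing no dot passes through unchanged.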
Furthermore, you shouldn't start from ls -l to process your files. I think that in this case your best bet would be to use a shell loop:
for file in *; do
printf '%s\n' "${file%.*}"
done
This uses the shell's built-in capability to expand * to the list of everything in the current directory and removes the .* from the end of each name using a standard parameter expansion.
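To see that expansion in isolation, with the example name from the question:
$ file=ABCDEF.ext
$ printf '%s\n' "${file%.*}"
ABCDEF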
If you really really want to use awk for some reason, and all your files have the same extension .ext, then I guess you could do something like this:
printf '%s\0' * | awk -v RS='\0' '{ sub(/\.ext$/, "") } 1'
This prints all the paths in the current directory, and uses awk to remove the suffix. Each path is followed by a null byte \0 - this is the safe way to pass lists of paths, which in principle could contain any other character.
Slightly less robust but probably fine in most cases would be to trust that no filenames contain a newline, and use \n to separate the list:
printf '%s\n' * | awk '{ sub(/\.ext$/, "") } 1'
Note that the standard tool for simple substitutions like this one would be sed:
printf '%s\n' * | sed 's/\.ext$//'
(.+?) is a PCRE construct; awk uses EREs, not PCREs, so there is no non-greedy quantifier. Also, you have the opening script delimiter ' in the middle of the script, AFTER the condition, instead of where it belongs: before the start of the script.
The syntax for any such command (awk, sed, grep, whatever) is command 'script', so this should be awk 'condition{action}', not awk condition'{action}'.
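For illustration, a correctly quoted version of that stage, with the PCRE idiom swapped for an ERE-friendly sub() (a sketch only - the advice below about not parsing ls still applies):
awk '{ sub(/\.[^.]*$/, "") } 1'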
But, in any case, as mentioned by @Aaron in the comments - don't parse the output of ls; see http://mywiki.wooledge.org/ParsingLs
Try this.
ls -l | awk '{ s=""; for (i=9;i<=NF;i++) s = (s=="" ? $i : s " " $i); sub(/\.[^.]+$/,"",s); print s }'
Notes:
reading the ls -l output is weird
it doesn't check the items (are they files? directories? ...) and strips extensions everywhere
Read the other answers :D
If the extension is always the same pattern try a sed replacement:
ls -l | awk '{print $9}' | sed 's/\.ext$//'
Related
I have a collection of plain text files which are named as yymmdd_nnnnnnnnnn.txt, and I want to append another number sequence to the filenames, so that each becomes named as yymmdd_nnnnnnnnnn_iiiiiiiii.txt instead, where the iiiiiiiii is taken from the one line in each file which contains the text "GST: 123456789⏎" (or similar) at the end of the line. While I am sure that there will only be one such matching line within each file, I don't know exactly which line it will be on.
I need an elegant one-liner solution that I can run over the collection of files in a folder, from a bash script file, to rename each file in the collection by appending the specific GST number for each filename, as found within the files themselves.
Before even getting to the renaming stage, I have encountered a problem with this. Here is what I tried, which didn't work...
# awk '/\d+$/' | grep -E 'GST: ' 150101_2224567890.txt
The grep command alone works perfectly to find the relevant line within the file, but the awk doesn't return just the final digits group. It fails with the warning "regexp escape sequence \d is not a known regexp operator". I had assumed that this regex would return any number of digits at the end of the line. The text file in question contains a line which ends with "GST: 112060340⏎". Can someone please show me how to make this work, and maybe also help with the appropriate code to move the collection of files to the new filenames? Thanks.
Thanks to a comment from @Renaud, I now have the following code working to obtain just the GST registration number from within a text file, which puts me a step closer towards a workable solution.
awk '/GST: / {printf $NF}' 150101_2224567890.txt
I still need to loop this over the collection instead of just specifying one filename. I also need to be able to use the output from @Renaud's contribution, to rename the files. I'm getting closer to a working solution, thanks!
This awk should work for you:
awk '$1=="GST:" {fn=FILENAME; sub(/\.txt$/, "", fn); print "mv", FILENAME, fn "_" $2 ".txt"; nextfile}' *_*.txt | sh
To make it more readable:
awk '$1 == "GST:" {
fn = FILENAME
sub(/\.txt$/, "", fn)
print "mv", FILENAME, fn "_" $2 ".txt"
nextfile
}' *_*.txt | sh
Remove | sh from above to see all mv commands together.
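Given the file from the earlier example (which contains a line ending in GST: 112060340), the dry run would print something like:
mv 150101_2224567890.txt 150101_2224567890_112060340.txt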
You may try
for f in *_*.txt; do echo mv "$f" "${f%.txt}_$(sed '/.*GST: /!d; s///; q' "$f").txt"; done
The sed expression uses the empty-pattern trick: s/// reuses the last regular expression (here /.*GST: /), stripping everything up to and including GST: , and q quits after printing the first matching line. Drop the echo once you're satisfied with the printed mv commands.
As you are sure there is only one matching line, you can try:
$ n=$(awk '/GST:/ {print $NF}' 150101_2224567890.txt)
$ mv 150101_2224567890.txt "150101_2224567890_$n.txt"
Or, for all .txt files:
for f in *.txt; do
n=$(awk '/GST:/ {print $NF}' "$f")
if [[ -z "$n" ]]; then
printf '%s: GST not found\n' "$f"
continue
fi
mv "$f" "$f{%.txt}_$n.txt"
done
Another one-line solution to consider, although perhaps not so elegant.
for original_filename in *_*.txt; do \
new_filename=${original_filename%'.txt'}_$(
grep -E 'GST: ' "$original_filename" | \
sed -E 's/.*GST//g; s/[^0-9]//g'
)'.txt' && \
mv "$original_filename" "$new_filename"; \
done
Output:
150101_2224567890_123456789.txt
If you are open to a multi-line script:
#!/bin/sh
for f in *.txt; do
prefix=$(echo "${f}" | sed 's#\.txt##')
cp "${f}" f1
sed -i 's#GST#%GST#' "./f1"
cat "./f1" | tr '%' '\n' > f2
number=$(cat "./f2" | sed -n '/GST/p' | cut -d':' -f2 | tr -d ' ')
newname="${prefix}_${number}.txt"
mv -v "${f}" "${newname}"
rm -v "./f1"
rm -v "./f2"
done
In general, if you want to make your files easy to work with, leave as many potential places for them to be split with newlines as possible. It is much easier to alter files by simply putting what you want to delete or print on its own line than it is to search for things horizontally with regular expressions.
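A minimal illustration of that vertical approach, on a made-up one-line input:
echo 'prefix GST: 112060340 suffix' | tr ' ' '\n' | grep -A1 '^GST:$' | tail -n 1
Each token lands on its own line, so the number following GST: can be picked out by position rather than by a horizontal regex.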
This should not be too difficult but I could not find a solution.
I have a HTML file, and I want to extract all URLs with a specific pattern.
The pattern is /users/<USERNAME>/ - I actually only need the USERNAME.
I got only to this:
awk '/users\/.*\//{print $0}' file
But this gives me the complete line, and I don't want the line.
Even just the whole URL is fine (e.g. get /users/USERNAME/), but I really only need the USERNAME....
If you want to do this in a single awk command, then use the match function:
awk -v s="/users/" 'match($0, s "[^/[:blank:]]+") {
print substr($0, RSTART+length(s), RLENGTH-length(s))
}' file
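For example, with a hypothetical HTML line containing /users/john/:
$ echo '<a href="/users/john/">profile</a>' | awk -v s="/users/" 'match($0, s "[^/[:blank:]]+") { print substr($0, RSTART+length(s), RLENGTH-length(s)) }'
john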
Or else this grep + cut will do the job:
grep -Eo '/users/[^/[:blank:]]+' file | cut -d/ -f3
Set the delimiter, do a literal match against the second field, and print the third:
$ awk -F/ '$2=="users"{print $3}' file
Assuming your statement gives you the entire line of something like
/users/USERNAME/garbage/otherStuff/
You could pipe this result through head assuming you always know that it will be
/users/USERNAME/....
After piping through head, you can also use cut commands to remove more of the end text until you have only the piece you want.
The command will look something like this
awk '/users\/.*\//{print $0}' file | head (options) | cut (options)
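For instance, assuming the matched URL starts the line as shown above, one concrete instantiation (options chosen for illustration) might be:
awk '/users\/.*\//{print $0}' file | head -n 1 | cut -d/ -f3
head -n 1 keeps the first matching line, and cut -d/ -f3 isolates the USERNAME field.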
I need to cat a file and edit a single line containing multiple domain names, removing any domain name that contains a certain 4-letter pattern, e.g. ozar.
This will be used in a bash script so the number of domain names can range, I will save this to a csv later on but right now returning a string is fine.
I tried multiple commands, loops, and if statements, but sending the output to a variable I can use further in the script proved to be another difficult task.
Example file
$ cat file.txt
ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org ozarkz.com website.com
What I attempted (that was close)
domains_1=$(cat /tmp/file.txt | sed 's/ozar*//g')
domains_2=$( cat /tmp/file.txt | printf '%s' "${string##*ozar}")
Goal
echo "$domain_x"
win.ad.win.edu ap.allk.org allk.org website.com
If all the domains are on a single line separated by spaces, this might work:
awk '/ozar/ {next} 1' RS=" " file.txt
This sets RS, your record separator, then skips any record that matches the keyword. If you wanted to be able to skip a substring provided in a shell variable, you could do something like this:
$ s=ozar
$ awk -v re="$s" '$0 ~ re {next} 1' RS=" " file.txt
Note that the ~ operator performs a regular expression match, not precisely a substring match. You could leverage the index() function if you really want to check a substring:
$ awk -v s="$s" 'index($0,s) {next} 1' RS=" " file.txt
Note that all of the above is awk, which isn't what you asked for. If you'd like to do this with bash alone, the following might be for you:
while read -r -a a; do
for i in "${a[#]}"; do
[[ "$i" = *"$s"* ]] || echo "$i"
done
done < file.txt
This assigns each line of input to the array $a[], then steps through that array testing for a substring match and printing if there is none. Text processing in bash is MUCH less efficient than in a more specialized tool like awk or sed. YMMV.
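With the sample file above and s=ozar, that loop prints:
win.ad.win.edu
ap.allk.org
allk.org
website.com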
You want to delete the words up to a space delimiter:
$ sed 's/ozar[^ ]*//g' file
win.ad.win.edu win_fl. ap.allk.org allk.org website.com
I'm writing a shell script that builds and edits an html file whose main content is basically clamscan's (ClamAV) output.
So the script's mission is: translating the output, removing unhelpful stuff, adding html tags and so on.
Though, I'm stuck on the last modification I want.
One part of the edited output from clamscan looks like this:
/path/to/infected-file: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
/path/to/infected-zipfile!(1)ZIP:eicar.com: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
/path/to/infected-zipfilewithinzipfile!ZIP:eicar_com.zip!(2)ZIP:eicar.com: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
I want to shrink those long lines. Something like this would be best:
infected-file: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
infected-zipfile: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
infected-zipfilewithinzipfile: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
But I'd already be happy to just remove the path to the infected file.
Since it seemed easy to get some results with awk, and I used sed for all the previous editing, I thought my best option was going with something like:
sed -i 's/<awk command 1>/<awk command 2>/' myHtmlFile
Unfortunately I spent a few hours turning this every which way with no luck.
There seem to be syntax issues with things like:
sed "s#$(awk -F': ' '{print $1}' testfile)#$(awk -F': ' '{print $1}' testfile | awk -F'\' '{print $NF}')#" testfile
whether I use single or double quotes, whether I try to concatenate sed strings or try to escape various chars depending on the chosen syntax.
I also thought about loops (so that I could make sed work with variables containing awk results) but I'm unsure how to manage a loop for this kind of in-place editing.
It could probably be solved with a powerful regex, but I'm quite bad at those ^^
$ sed -E 's#[^:]+/([^:!]+).*: #\1: #' file
infected-file: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
infected-zipfile: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
infected-zipfilewithinzipfile: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
The above regexp does this:
[^:]+/ - consumes a leading string that contains no colons and ends in /, e.g. /path/to/
([^:!]+) - saves the subsequent string that contains no colons or exclamation marks in a capture group, e.g. infected-zipfile
.*: - consumes the subsequent string leading up to a colon followed by a blank char, e.g. !(1)ZIP:eicar.com:.
and then the replacement does this:
\1 - print the string saved into capture group 1 in step 2 above
: - print a colon followed by a blank char (I could have used a capture group for this too)
Ed Morton has already explained how to do this with a single regex substitution (i.e. the right way); I'll explain what's wrong with the original approach, and show how to do it with a shell loop (i.e. the wrong way).
The problem with the combined sed+awk+awk approach is that you need them to operate line-by-line in lockstep. That is, when sed processes line N of the file, it needs to replace the Nth line of output from the first awk command with the Nth line of output from the second awk pipeline. But the commands don't interrelate that way; the shell runs all of the awk commands, collects their entire output, then feeds that to sed as a single huge (and malformed) substitute expression. Given your sample data (and assuming the last awk command should have -F'/' instead of -F'\'), it would do essentially this:
sed 's#/path/to/infected-file
/path/to/infected-zipfile!(1)ZIP:eicar.com
/path/to/infected-zipfilewithinzipfile!ZIP:eicar_com.zip!(2)ZIP:eicar.com#infected-file
infected-zipfile!(1)ZIP:eicar.com
infected-zipfilewithinzipfile!ZIP:eicar_com.zip!(2)ZIP:eicar.com#' testfile
sed will reject this because of the newlines in the pattern (and replacement string as well). If it weren't for it being rejected, sed would go ahead and try to apply the whole mess to every line separately, but since it's not actually what you wanted that wouldn't work either.
In order to get all of those commands to work line-by-line in lockstep, you'd have to do something like use a shell loop to read and process each line through each of the commands (&pipeline) individually, like this:
while read -r line; do
fullpath=$(echo "$line" | awk -F': ' '{print $1}')
trimmedpath=$(echo "$line" | awk -F': ' '{print $1}' | awk -F'/' '{print $NF}')
echo "$line" | sed "s#$fullpath#$trimmedpath#"
done < testfile
You could actually skip the fullpath and trimmedpath variables, and use the two $(echo "$line" | awk...) substitutions directly in the sed command if you wanted. But really, you shouldn't do it this way at all; use Ed's single-regex solution.
This might work for you (GNU sed):
sed -r 's#^([^/]*/)*([[:alpha:]-]*)([^:]*:)* #\2: #' file
This removes any directories, keeps the file name and removes any superfluous characters up to a : followed by a space.
I have a number of files and I want to filter out the ones that contain 2 patterns. However, these patterns are on different lines. I've tried grep and awk, but in both cases they only seem to match patterns on the same line. I know grep is line-based, but I'm less familiar with awk. Here's what I came up with, but it only prints lines that match both strings:
awk '/string1/ && /string2/' file
Grep will easily handle this using xargs:
grep -l string1 * | xargs grep -l string2
Use this command in the directory where the files are located, and resulting matches will be displayed.
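If filenames may contain spaces or other unusual characters, GNU grep and xargs support a null-delimited handoff:
grep -lZ string1 * | xargs -0 grep -l string2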
Depending on whether you really want to search for regexps:
gawk -v RS='^$' '/regexp1/ && /regexp2/ {print FILENAME}' file
or for strings:
gawk -v RS='^$' 'index($0,"string1") && index($0,"string2") {print FILENAME}' file
The above uses GNU awk for multi-char RS to read the whole file as a single record.
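Since each input file is read as its own single record, the same command extends to multiple files via a glob, for example:
gawk -v RS='^$' '/regexp1/ && /regexp2/ {print FILENAME}' *.txt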
You can do it with find
find . -type f -exec bash -c "grep -q string1 {} && grep -q string2 {} && echo {}" ";"
You could do it like this with GNU awk:
awk '/foo/{seenFoo++} /bar/{seenBar++} seenFoo&&seenBar{print FILENAME;seenFoo=seenBar=0;nextfile}' file*
That says... if you see foo, increment variable seenFoo, likewise if you see bar, increment variable seenBar. If, at any point, you have seen both foo and bar, print the name of the current file and skip to the next input file ignoring all remaining lines in current file, and, before you start the next file, clear the flags to say we have seen neither foo nor bar in the new file.