AWK print based on FILENAME pattern - regex

I have a directory of files with filenames of the form file000.txt to filennn.txt. I would like to be able to specify a range of file names and print the content of those files based on a match. I have achieved it with a single file pattern:
$ gawk 'FILENAME ~/file038.txt/ {print FILENAME, $0}' file*.txt
file038.txt Some 038 text here
But I cannot get a pattern that would allow me to specify a range of file names, for instance
gawk 'FILENAME ~/file[038-040].txt/ {print FILENAME, $0}' file*.txt
I'm sure I'm missing something simple here, I'm an AWK newbie. Any suggestions?

you can do some substitution on the filename, for example:
awk '{x=FILENAME;gsub(/[^0-9]/,"",x);x+=0}x>10&&x<50{your logic}' file*.txt
This way, files file011.txt through file049.txt would be handled by "your logic".
You can adjust the x>10&&x<50 part; for example, to handle only files whose number is odd or even, just write the appropriate boolean expression there.
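Putting a concrete action in place of "your logic" and using the range from the question, the whole command would look something like this (a sketch; adjust the bounds to taste):
awk '{x=FILENAME; gsub(/[^0-9]/,"",x); x+=0} x>=38 && x<=40 {print FILENAME, $0}' file*.txt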

An odd way, but something along these lines:
awk '{ if (match(FILENAME,/file0(3[89]|40)\.txt/)) { print FILENAME, $0}}' file*.txt

Solution using gawk and a recent version of bash
There is a bash primitive to handle file[038-040].txt. It makes the code quite simple:
gawk 'FNR==1 {print FILENAME, $0} {nextfile}' file{038..040}.txt
Key points:
FNR==1 {print FILENAME, $0}
This prints the filename and the first line of each file
{nextfile}
This saves time by skipping directly to the next file.
file{038..040}.txt
The construct {038..040} is a bash feature called brace expansion. bash will replace this with the file names that you want. If you want to test out brace expansion to see how it works, try it on the command line with this simple statement:
echo file{038..040}.txt
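With a bash new enough to zero-pad the numbers (4.0 or later), that echo prints:
file038.txt file039.txt file040.txt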
UPDATE 1: Mac OS X currently ships bash v3.2, which does not support leading zeros in brace expansion.
UPDATE 2: If there are missing files and you have a modern gawk (v4.0 or better), use this instead:
gawk 'BEGINFILE{ if (ERRNO) nextfile} FNR==1 {print FILENAME, $0} {nextfile}' file{038..040}.txt
Solution using gawk with a plain POSIX shell
gawk '{n=0+substr(FILENAME,5,3)} FNR==1 && n>=38 && n<=40 {print FILENAME, $0} {nextfile}' file*.txt
Explanation:
n=0+substr(FILENAME,5,3)
Extract the number from the filename. 0+ is a trick to force awk to treat n as numeric.
n>=38 && n<=40 {print FILENAME, $0}
This selects the file based on its number and prints the filename and first line.
{nextfile}
As before, this saves time by stopping awk from reading the rest of each file.
file*.txt
This is expanded by any POSIX shell to the list of matching file names.
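If you prefer the range to come from shell variables instead of being hard-coded, a sketch along the same lines (lo and hi are illustrative names):
gawk -v lo=38 -v hi=40 '{n=0+substr(FILENAME,5,3)} FNR==1 && n>=lo && n<=hi {print FILENAME, $0} {nextfile}' file*.txt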

This should work:
awk '(x=FILENAME)~/(3[89]|40)\.txt$/{print x,$0}' file*.txt
If you only want the first line of each matching file, here is another way (the match expression evaluates to 1 for matching file names, so FNR==1 selects just their first lines):
awk 'FNR==((x=FILENAME)~/(3[89]|40)\.txt$/){print x,$0}' file*.txt


rename specific lines in a text file with sed

I have a file that looks like this:
>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj
I would like to edit just the lines starting with >, ideally in-place, to get a file:
>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj
I know that in principle this is achievable with various combinations of sed/awk/cut, but I haven't been able to figure out the right combination. Ideally it should be fast - the file has many millions of lines, and many of the lines are also very long.
Key things about the lines I want to edit:
Always start with >
The bit I want to keep is always between the first and second pipe symbol | (hence thinking cut is going to help)
The bit I want to keep has alphanumeric symbols and sometimes underscores. The rest of the string on the same line can have any symbols
What I've tried that seems helpful
(Most of my sed attempts are pure garbage)
cut -d '|' -f 2 test.txt
Gets me the bit of the string that I want, and it keeps the other lines too. So it's close, but (of course) it doesn't preserve the initial > on the lines where cut applies, so it's missing a crucial part of the solution.
With sed:
sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
/^>/ selects lines starting with >; not strictly necessary for the given sample, but sometimes this gives faster results than using s alone
^[^|]+\| matches the non-| characters at the start of the line, plus the first |
([^|]+) captures the second field
.* matches the rest of the line
>\1 is the replacement string, where \1 holds what ([^|]+) captured
If your input has only ASCII characters, this would give you much faster results:
LC_ALL=C sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
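The question also asks for an in-place edit; GNU sed supports -i for that (BSD/macOS sed needs -i ''), so something along these lines should work (file is a placeholder name):
LC_ALL=C sed -E -i '/^>/ s/^[^|]+\|([^|]+).*/>\1/' file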
Timing
Checking timings on a huge file built from the given input sample, awk is much faster, and mawk is faster still.
However, the OP reports that the sed solution is faster on the actual data.
With your shown samples, you could simply try the following. It sets the field separator to | for all lines of Input_file, then in the main program checks whether a line starts with >: if it does, it prints the 2nd field (with > prepended), otherwise it prints the complete line.
awk -F'|' '/^>/{print ">"$2;next} 1' Input_file
Explanation: a detailed explanation of the above follows.
awk -F'|' ' ##Start the awk program and set the field separator to | here.
/^>/{ ##If the line starts with > then do the following.
print ">"$2 ##Print the 2nd field of the current line with > prepended.
next ##next skips all further statements for this line.
}
1 ##Print the current line (default action).
' Input_file ##Mention the Input_file name here.
You can also use the following awk command:
awk -F\| '/^>/{print ">"$2} !/^>/{print}' file
# Inplace replacement with gawk (GNU awk)
gawk -i inplace -F\| '/^>/{print ">"$2} !/^>/{print}' file
# "Inline-like" replacement with any awk
awk -F\| '/^>/{print ">"$2} !/^>/{print}' file > tmp && mv tmp file
Here,
-F\| - sets the field separator to a | char
/^>/ is the condition: if the line starts with > (and !/^>/ means the opposite)
{print ">"$2} prints the Field 2 value with a > char prepended to it
{print} simply prints the full line.
Note that !/^>/{print} can be reduced to !/^>/, since print is the default action.
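In other words, the command can be shortened to:
awk -F\| '/^>/{print ">"$2} !/^>/' file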
See an online demo:
s='>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj'
awk -F\| '/^>/{print ">"$2} !/^>/{print}' <<< "$s"
Output:
>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj

Extract the string matched in a regex, not the line, with awk

This should not be too difficult but I could not find a solution.
I have a HTML file, and I want to extract all URLs with a specific pattern.
The pattern is /users/<USERNAME>/ - I actually only need the USERNAME.
I got only to this:
awk '/users\/.*\//{print $0}' file
But this filters me the complete line. I don't want the line.
Even just the whole URL is fine (e.g. get /users/USERNAME/), but I really only need the USERNAME....
If you want to do this in a single awk command, then use the match function:
awk -v s="/users/" 'match($0, s "[^/[:blank:]]+") {
print substr($0, RSTART+length(s), RLENGTH-length(s))
}' file
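With gawk specifically, match() also accepts a third array argument that receives the capture groups, which avoids the substr arithmetic (a sketch, assuming the same /users/ pattern):
gawk 'match($0, "/users/([^/[:blank:]]+)", m) {print m[1]}' file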
Or else this grep + cut will do the job:
grep -Eo '/users/[^/[:blank:]]+' file | cut -d/ -f3
Set the delimiter, do a literal match against the second field and print the third.
$ awk -F/ '$2=="users"{print $3}'
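For example, feeding it a URL on its own line (alice is just an illustrative name):
$ echo '/users/alice/' | awk -F/ '$2=="users"{print $3}'
alice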
Assuming your statement gives you the entire line of something like
/users/USERNAME/garbage/otherStuff/
You could pipe this result through head, assuming you always know that it will be
/users/USERNAME/....
After piping through head, you can also use cut to remove more of the trailing text until you have only the piece you want.
The command will look something like this
awk '/users\/.*\//{print $0}' file | head (options) | cut (options)

Remove hostnames from a single line that follow a pattern in bash script

I need to cat a file and edit a single line that contains multiple domain names, removing any domain name that contains a certain 4-letter pattern, e.g. ozar.
This will be used in a bash script, so the number of domain names can vary. I will save this to a csv later on, but right now returning a string is fine.
I tried multiple commands, loops, and if statements, but sending the output to a variable I can use further in the script proved to be another difficult task.
Example file
$ cat file.txt
ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org ozarkz.com website.com
What I attempted (that was close)
domains_1=$(cat /tmp/file.txt | sed 's/ozar*//g')
domains_2=$( cat /tmp/file.txt | printf '%s' "${string##*ozar}")
Goal
echo domain_x
win.ad.win.edu ap.allk.org allk.org website.com
If all the domains are on a single line separated by spaces, this might work:
awk '/ozar/ {next} 1' RS=" " file.txt
This sets RS, your record separator, then skips any record that matches the keyword. If you wanted to be able to skip a substring provided in a shell variable, you could do something like this:
$ s=ozar
$ awk -v re="$s" '$0 ~ re {next} 1' RS=" " file.txt
Note that the ~ operator is comparing a regular expression, not precisely a substring. You could leverage the index() function if you really want to check a substring:
$ awk -v s="$s" 'index($0,s) {next} 1' RS=" " file.txt
Note that all of the above is awk, which isn't what you asked for. If you'd like to do this with bash alone, the following might be for you:
while read -r -a a; do
for i in "${a[@]}"; do
[[ "$i" = *"$s"* ]] || echo "$i"
done
done < file.txt
This assigns each line of input to the array $a[], then steps through that array testing for a substring match and printing if there is none. Text processing in bash is MUCH less efficient than in a more specialized tool like awk or sed. YMMV.
You want to delete each matching word up to the space delimiter:
$ sed 's/ozar[^ ]*//g' file
win.ad.win.edu win_fl. ap.allk.org allk.org website.com

AWK: get file name from LS

I have a list of file names (name plus extension) and I want to extract the name only without the extension.
I'm using
ls -l | awk '{print $9}'
to list the file names and then
ls -l | awk '{print $9}' | awk /(.+?)(\.[^.]*$|$)/'{print $1}'
But I get an error on escaping the (:
-bash: syntax error near unexpected token `('
The regex (.+?)(\.[^.]*$|$) to isolate the name has a capture group and I think it is correct; what I don't get is why it is not working within awk syntax.
My list of files is like this ABCDEF.ext in the root folder.
Your specific error is caused by the fact that your awk command is incorrectly quoted. The single quotes should go around the whole command, not just the { action } block.
However, you cannot use capture groups like that in awk. $1 refers to the first field, as defined by the input field separator (which in this case is the default: one or more "blank" characters). It has nothing to do with the parentheses in your regex.
Furthermore, you shouldn't start from ls -l to process your files. I think that in this case your best bet would be to use a shell loop:
for file in *; do
printf '%s\n' "${file%.*}"
done
This uses the shell's built-in capability to expand * to the list of everything in the current directory and removes the .* from the end of each name using a standard parameter expansion.
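If you only care about a single known extension, the same loop can use a more specific glob (a sketch; .ext is an assumption):
for file in *.ext; do
    [ -e "$file" ] || continue   # nothing matched: the glob stays literal, so skip it
    printf '%s\n' "${file%.ext}"
done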
If you really really want to use awk for some reason, and all your files have the same extension .ext, then I guess you could do something like this:
printf '%s\0' * | awk -v RS='\0' '{ sub(/\.ext$/, "") } 1'
This prints all the paths in the current directory, and uses awk to remove the suffix. Each path is followed by a null byte \0 - this is the safe way to pass lists of paths, which in principle could contain any other character.
Slightly less robust but probably fine in most cases would be to trust that no filenames contain a newline, and use \n to separate the list:
printf '%s\n' * | awk '{ sub(/\.ext$/, "") } 1'
Note that the standard tool for simple substitutions like this one would be sed:
printf '%s\n' * | sed 's/\.ext$//'
(.+?) is a PCRE construct. awk uses EREs, not PCREs. Also you have the opening script delimiter ' in the middle of the script AFTER the condition instead of where it belongs, before the start of the script.
The syntax for any command (awk, sed, grep, whatever) is command 'script', so this should be awk 'condition{action}', not awk condition'{action}'.
But, in any case, as mentioned by @Aaron in the comments - don't parse the output of ls, see http://mywiki.wooledge.org/ParsingLs
Try this.
ls -l | awk '{ s=""; for (i=9;i<=NF;i++) { s = s" "$i }; sub(/\.[^.]+$/,"",s); print s}'
Notes:
reading the ls -l output is weird
It doesn't check the items (are they files? directories? ... it strips extensions everywhere)
Read the other answers :D
If the extension is always the same pattern, try a sed replacement:
ls -l | awk '{print $9}' | sed 's/\.ext$//'

awk regex pattern does not match beginning of the line

I'm using GNU awk version 3.1.7 on Windows 10, MinGW installation.
File to test this has this contents but same behaviour is with other files as well.
test.txt
line one
second line
another line
end this one should match
double test
yet another
I want to print only the first words beginning with e.
awk command I'm using is:
awk '{ if ($1 ~ /^e/) {print $1} }' test.txt
But this prints every first word that has the character e anywhere in it.
output
line
second
another
end
double
yet
When I want to match the end of the word, it works fine.
Match every first word ending with d.
awk '{ if ($1 ~ /d$/) {print $1} }' test.txt
output
second
end
Any idea why the first example, matching the beginning of the word, does not work?
What am I doing wrong there?
That's got nothing to do with gawk, it's Windows quoting rules. gawk doesn't even see the quotes - it just runs on whatever script Windows passes to it (i.e. the part between the quotes), and it's entirely Windows that interprets the quotes to isolate the script that it then passes to gawk. The standard advice for avoiding the problem is to put the awk script in a file and run it as awk -f script instead of trying to deal with the Windows quoting nightmare. The best advice, though, is to run cygwin on top of Windows.
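For example (a sketch; script.awk is an arbitrary name), save this one line as script.awk:
$1 ~ /^e/ { print $1 }
and then run it without any shell quoting involved:
gawk -f script.awk test.txt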
I just tried it with gawk 3.1.6-1 on Windows 10.
When I try with single quotes it gives syntax error:
awk '{ if ($1 ~ /^e/) {print $1} }' test.txt
// Error
awk: '{
awk: ^ invalid char ''' in expression
With double quotes works fine, prints only end.
awk "{ if ($1 ~ /^e/) {print $1} }" test.txt
So I have tried this line with double quotes on gawk 3.1.7 as well.
It works.
Prints only end.
gawk 3.1.7 does not give any error when I use the example line with single quotes, but the /^e/ regex in it does not match as it should, for some reason.
So, at least from my perspective: if you are using gawk on Windows, always use double quotes for awk code on the command line.
awk "{ if ($1 ~ /^^e/) {print $1} }" test.txt
On the Windows platform:
1- exchange " with ' and vice versa
2- for ^ use ^^