I'm trying to find the files with extensions sh, xls, etc., as listed in the FILTER variable below.
Following is the output of ls -ltr. The output of the script below is hourly_space_update.sh and kent.ksh, but I don't want the .ksh file. Could you please tell me where I'm going wrong with my regex?
[root@SVRVSVN ~]# ls -ltr
total 20
-rw-r--r-- 1 root sqaadmin 44 Oct 9 18:24 hourly_space_update.sh
-rw-r--r-- 1 root sqaadmin 0 Oct 30 12:34 kent.ksh
-rw-r--r-- 1 root sqaadmin 0 Oct 30 12:34 a.abc
-rw-r--r-- 1 root sqaadmin 0 Oct 30 13:02 hh.h
#!/bin/sh
ls -ltr | awk '
BEGIN {
    FILTER=".(sh|xls|xlsx|pdf)$"
}
{
    for (i = 1; i < 9; i++) $i = ""; sub(/^ */, "");
    if(match(tolower($1),FILTER))
    {
        print $1
    }
}'
Try this regexp:
\.(sh|xls|xlsx|pdf)$
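To see the difference the escaped dot makes, here is a quick check against the kent.ksh name from the question (only awk's match() function and the question's own filter are used):
echo "kent.ksh" | awk '{ print match($1, ".(sh|xls|xlsx|pdf)$") }'    # prints 6: the unescaped dot matches the k of ksh
echo "kent.ksh" | awk '{ print match($1, "\\.(sh|xls|xlsx|pdf)$") }'  # prints 0: a literal dot followed by sh is required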
See the comments I made in the answers you got so far, but more importantly - your approach of testing one of the fields will fail for file names that contain spaces, and any piped solution will fail if one of those white spaces is a newline. You should just use shell as:
ls -tr *.sh *.xls *.xlsx *.pdf
and get rid of the need for a filter at all.
If you MUST keep an awk script, though, then the way to write it is this if you can guarantee your file names don't contain any spaces:
ls -ltr | awk 'BEGIN{FILTER="\\.(sh|xlsx?|pdf)$"} tolower($NF) ~ FILTER { print $NF }'
Note that I abbreviated your RE since "xlsx?" will match "xls" or "xlsx".
Before I give you a solution for file names that contain spaces or newlines, though - why are you using "ls -ltr" instead of simply "ls -tr" if you only want to process the file name?
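For names that may contain spaces or newlines, a minimal sketch (not part of this answer; it assumes GNU find and xargs) avoids parsing ls for the matching step entirely:
find . -maxdepth 1 -type f \( -iname '*.sh' -o -iname '*.xls' -o -iname '*.xlsx' -o -iname '*.pdf' \) -print0 |
    xargs -0 -r ls -tr --
Here find does the extension matching and hands the names to ls NUL-delimited, so embedded whitespace cannot split them; ls then only has to sort by time.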
In bash/ksh/zsh, you can use brace expansion:
ls *.{sh,xls,xlsx,pdf}
Also don't parse ls.
Try with the filter (\bsh\b|\bxls\b|\bxlsx\b|\bpdf\b).
With your filter you get the .ksh file because it contains the sh sequence.
Your code actually works in my gawk 4.0.1 running under cygwin.
But how come you don't want to do:
awk 'BEGIN {FILTER=".(sh|xls|xlsx|pdf)$"}{if(match(tolower($9),FILTER)){print $9}}'
This would make the for loop redundant and clean up the code a bit. I guess the output of ls -ltr uses the same format each time you execute it. :)
Unfortunately I do not have access to a clean awk command for testing, but you could also try to double-escape the dot as \\. if that is the problem in your awk. A tip is to print $1 before the if statement to make sure it contains what you expect.
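For example, a throwaway debug print along those lines (a sketch; like the one-liner above it assumes the file name is field 9 of the ls -ltr output):
ls -ltr | awk '{ print "field 9 is [" $9 "]" }'    # the "total" line will show an empty field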
I would like to list all files in a linux directory then apply a regular expression on them to format the file name, and print these formatted files names.
Example:
ls -ltrh
.
.
-rwxrwxrwx. 1 root root 633 Oct 31 2016 Oracle_Schedule_ARC-Oracle_ARCH-1477938600005-1002-Oracleorcl-rman1.txt
-rwxrwxrwx. 1 root root 610 Nov 7 2016 MOD-1478512353102-1002-Oracleorcl-rman1.txt
After applying my regex '.+?(?=-)' I would get everything before the first '-', i.e.:
Oracle_Schedule_ARC
MOD
I've tried using awk, but I couldn't pass a regex to it. I will later pipe through | sort | uniq to get a unique list from the regex output.
In any POSIX shell (bash, pdksh, ksh93, zsh, dash):
for name in *; do
printf '%s\n' "${name%%-*}"
done
This would go through all the names in the current directory and output the bit before the first - character. It does this by removing the longest suffix string matching -* from the filename using a standard parameter substitution.
Note that -* is a shell globbing pattern, not a regular expression. Regular expressions are useful for working on text, but globbing patterns are fast and efficient for working with filenames and pathnames in general, as you don't have to start another process with a regex engine, such as awk or sed.
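For example, with one of the names from the question (illustrative, showing just the substitution):
name='Oracle_Schedule_ARC-Oracle_ARCH-1477938600005-1002-Oracleorcl-rman1.txt'
printf '%s\n' "${name%%-*}"    # prints Oracle_Schedule_ARC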
In bash, you could also get away from using a loop at all:
set -- *
printf '%s\n' "${@%%-*}"
This first sets the positional parameters to the names in the current directory. printf is then invoked on the set of names, each individually transformed with the same parameter substitution as in the first part of this answer.
The same thing, but using an array variable other than the array of positional parameters:
names=( * )
printf '%s\n' "${names[@]%%-*}"
I'm trying to remove whitespace in file names and replace them.
Input:
echo "File Name1.xml File Name3 report.xml" | sed 's/[[:space:]]/__/g'
However, the output is:
File__Name1.xml__File__Name3__report.xml
Desired output
File__Name1.xml File__Name3__report.xml
You named awk in the title of the question, didn't you?
$ echo "File Name1.xml File Name3 report.xml" | \
> awk -F'.xml *' '{for(i=1;i<=NF;i++){gsub(" ","_",$i); printf i<NF?$i ".xml ":"\n" }}'
File_Name1.xml File_Name3_report.xml
$
-F'.xml *' instructs awk to split on a regex, the requested extension plus 0 or more spaces
the loop for(i=1;i<=NF;i++) is executed for all the fields into which the input line(s) is (are) split. Note that the last field is empty (it is what follows the last extension), but we are going to take that into account...
the body of the loop
gsub(" ","_", $i) substitutes all the occurrences of space to underscores in the current field, as indexed by the loop variable i
printf i<NF?$i ".xml ":"\n" outputs different things: if i<NF it's a regular field, so we append the extension and a space; otherwise i equals NF and we just terminate the output line with a newline.
It's not perfect, it appends a space after the last filename. I hope that's good enough...
▶ A D D E N D U M ◀
I'd like to address:
the little buglet of the last space...
some of the issues reported by Ed Morton
generalize the extension provided to awk
To reach these goals, I've decided to wrap the scriptlet in a shell function that, since it changes spaces into underscores, is named s2u.
$ s2u () { awk -F'\.'$1' *' -v ext=".$1" '{
> NF--;for(i=1;i<=NF;i++){gsub(" ","_",$i);printf "%s",$i ext (i<NF?" ":"\n")}}'
> }
$ echo "File Name1.xml File Name3 report.xml" | s2u xml
File_Name1.xml File_Name3_report.xml
$
It's a bit different (better?) because it does not special-case printing the last field but instead special-cases the delimiter appended to each field; the idea of splitting on the extension remains.
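To illustrate the generalization, here is the same function applied to a different extension (hypothetical file names, not from the question):
$ echo "My Report 1.pdf Final Draft 2.pdf" | s2u pdf
My_Report_1.pdf Final_Draft_2.pdf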
This seems a good start if the filenames aren't delineated:
((?:\S.*?)?\.\w{1,})\b
( // start of captured group
(?: // non-captured group
\S.*? // a non-white-space character, then 0 or more any character
)? // 0 or 1 times
\. // a dot
\w{1,} // 1 or more word characters
) // end of captured group
\b // a word boundary
You'll have to look up how a PCRE pattern converts to a shell pattern. Alternatively it can be run from a Python/Perl/PHP script.
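For instance, GNU grep's PCRE mode can apply the pattern directly (grep -P is a GNU extension and may not be available on every system):
$ echo "File Name1.xml File Name3 report.xml" | grep -Po '((?:\S.*?)?\.\w{1,})\b'
File Name1.xml
File Name3 report.xml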
Assuming you are asking how to rename file names, and not remove spaces in a list of file names that are being used for some other reason, this is the long and short way. The long way uses sed. The short way uses rename. If you are not trying to rename files, your question is quite unclear and should be revised.
If the goal is to simply get a list of xml file names and change them with sed, the bottom example is how to do that.
directory contents:
ls -w 2
bob is over there.xml
fred is here.xml
greg is there.xml
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
echo "${a_glob[i]}";
done
shopt -u nullglob
# output
bob is over there.xml
fred is here.xml
greg is there.xml
# then rename them
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
# I prefer 'rename' for such things
# rename 's/[[:space:]]/_/g' "${a_glob[i]}";
# but sed works, can't see any reason to use it for this purpose though
mv "${a_glob[i]}" $(sed 's/[[:space:]]/_/g' <<< "${a_glob[i]}");
done
shopt -u nullglob
result:
ls -w 2
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
globbing is what you want here because of the spaces in the names.
However, this is really a complicated solution, when actually all you need to do is:
cd [your space containing directory]
rename 's/[[:space:]]/_/g' *.xml
and that's it, you're done.
If, on the other hand, you are trying to create a list of file names, you'd certainly want the globbing method; if you just modify the echo statement to pass each name through sed, it will do what you want there too.
If your goal is to change the filenames for output purposes, and not rename the actual files:
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
echo "${a_glob[i]}" | sed 's/[[:space:]]/_/g';
done
shopt -u nullglob
# output:
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
You could use rename:
rename --nows *.xml
This will replace all the spaces of the xml files in the current folder with _.
Sometimes it comes without the --nows option, so you can then use a search and replace:
rename 's/[[:space:]]/__/g' *.xml
You can also use --dry-run if you just want to print the new filenames without actually renaming anything.
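For example (the exact switch depends on the rename variant installed; with the Perl-based rename, -n is the usual dry-run flag):
rename -n 's/[[:space:]]/__/g' *.xml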
I am trying to use a .txt file with around 5000 patterns (one per line) to search through another file of 18000 lines for any matches. So far I've tried every form of grep and awk I can find on the internet and it's still not working, so I am completely stumped.
Here's some text from each file.
Pattern.txt
rs2622590
rs925489
rs2798334
rs6801957
rs6801957
rs13137008
rs3807989
rs10850409
rs2798269
rs549182
There are no extra spaces or anything.
File.txt
snpid hg18chr bp a1 a2 zscore pval CEUmaf
rs3131972 1 742584 A G 0.289 0.7726 .
rs3131969 1 744045 A G 0.393 0.6946 .
rs3131967 1 744197 T C 0.443 0.658 .
rs1048488 1 750775 T C -0.289 0.7726 .
rs12562034 1 758311 A G -1.552 0.1207 0.09167
rs4040617 1 769185 A G -0.414 0.6786 0.875
rs4970383 1 828418 A C 0.214 0.8303 .
rs4475691 1 836671 T C -0.604 0.5461 .
rs1806509 1 843817 A C -0.262 0.7933 .
The file.txt was downloaded directly from a med directory.
I'm pretty new to UNIX so any help would be amazing!
Sorry edit: I have definitely tried every single thing you guys are recommending and the result is blank. Am I maybe missing a syntax issue or something in my text files?
P.P.S. I know there are matches, as doing individual greps works. I'll move this question to unix.stackexchange. Thanks for your answers guys, I'll try them all out.
Issue solved: the files had DOS carriage returns. I didn't know about this before, so thank you everyone that answered. For future users who are having this issue, here is the solution that worked:
dos2unix *
awk 'NR==FNR{p[$0];next} $1 in p' Patterns.txt File.txt > Output.txt
You can use grep -Fw here:
grep -Fw -f Pattern.txt File.txt
Options used are:
-F - fixed-string search, to treat the patterns as literal strings rather than regexes
-w - match full words only
-f file - read the patterns from a file
idk if it's what you want or not, but this will print every line from File.txt whose first field equals a string from Patterns.txt:
awk 'NR==FNR{p[$0];next} $1 in p' Patterns.txt File.txt
If that is not what you want, tell us what you do want. If it is what you want but doesn't produce the output you expect then one or both of your files contains control characters courtesy of being created in Windows so run dos2unix or similar on them both first.
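If dos2unix isn't installed, one common alternative is to strip the carriage returns with tr (the output file names here are just illustrative):
tr -d '\r' < Pattern.txt > Pattern.clean.txt
tr -d '\r' < File.txt > File.clean.txt
and then run the grep or awk command on the cleaned copies.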
Use a shell script to read each line of the file containing your patterns then fgrep it.
#!/bin/bash
FILENAME=$1   # the file containing your patterns, i.e. Pattern.txt from the question
awk '{kount++;print $0}' "$FILENAME" | fgrep -f - File.txt
I have a simple awk script named "script.awk" that contains:
/\/some_simple_string/ { print $0;}
I'm using it (by running: cat file | awk -f script.awk) to parse a file that contains:
14 catcat one_two/some_thing
15 catcat one_three/one_more_some_simple_string
16 dogdog one_two/some_simple_string_again
17 dogdog one_four/some_simple_string
18 qweqwe firefire/ppp
I want the script to only print the line that ends exactly with "/some_simple_string" (i.e. "/some_simple_string[END_OF_LINE]"), but not lines 2 or 3 of the example.
Is there any simple way to do it?
I think the most appropriate way is to add an end-of-line sign to the regular expression.
So it will match only lines that contain "/some.." and where the line ends right after "..string" ("..string[END_OF_LINE]").
Desired output:
17 dogdog one_four/some_simple_string
Sorry for the confusion, I was asking about the end-of-line sign in regular expressions.
The correct answer is:
/\/some_simple_string$/ { print $0;}
You can always use:
/\/some_simple_string$/ { print $0 }
I.e. match not only "some_simple_string" but match "/some_simple_string" followed by the end of the line ($ is end of line in regex)
grep '/some_simple_string$' file | tail -n 1 should do the trick.
Or if you really want to use awk do awk '/\/some_simple_string/{x = $0}END{print x}'
To return just the last of a group of matches, ...
Store the line in a variable and print it in the END block.
/some_simple_string/ { x = $0 }
END{ print x }
To print all the matches that end with the string /some_simple_string using a regular expression you need to anchor to the end of the line using $. The most suitable tool for this job is grep:
$ grep '/some_simple_string$' file
In awk the command is much the same:
$ awk '/[/]some_simple_string$/' file
To print the line following each match you would do:
$ awk 'print_flag{print;print_flag=0} /[/]some_simple_string$/{print_flag=1}' file
Or just combine grep and tail if it makes it clearer, using the context option -A to print the following line:
$ grep -A1 '/some_simple_string$' file | tail -n 1
I sometimes find that the input records can have a trailing carriage return (\r).
Yes, I deal with both Windows and Linux text files.
So I add the following 'pre-processor' to my awk scripts:
1 == 1 { # preprocess all records
res = gsub("\r", "") # remove unwanted trailing char
if(res>0 && NR<100) { print "(removed stuff)" > "/dev/stderr" } # optional
}
More optimally, let FS do the matching instead of having awk perform unnecessary and unrelated field splitting (the [\r]? bit is added for Windows/DOS completeness):
mawk '!_<NF' FS='[/]some_simple_string[\r]?$'
17 dogdog one_four/some_simple_string
I am trying to figure out how to take a log that has millions of lines in
a day and easily dump a range (based on begin and end timestamp) of lines to
another file. Here is an excerpt from the log to show how it is constructed:
00:04:59.703: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
00:04:59.703: 20121114070459 - XXX - 7028429950500220900257201211131000000003536
00:04:59.703: </abcxyz,v1>
00:04:59.711: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
00:04:59.711: 20121114070459 - XXX - 7028690080500220900257201211131000000003538
00:04:59.711: </abcxyz,v1>
00:04:59.723: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
00:04:59.723: 20121114070459 - XXX - 7028395150500220900257201211131000000003540
00:04:59.723: </abcxyz,v1>
00:04:59.744: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
As you can see there are multiple lines per millisecond. What I would like to
do is be able to give as an input a begin and end timestamp such as
begin=11:00: and end=11:45: and have it dump all the lines in that range.
I have been racking my brain trying to figure this one out, but so far haven't
come up with a satisfactory result.
UPDATE: Of course just the first thing I try after I post the question seems to
work. Here is what I have:
sed -n '/^06:25/,/^08:25:/p' logFile > newLogFile
More than happy to take suggestions if there is a better way.
I think your sed oneliner is ok for the task.
Besides, you can optimize that for speed (considering the file has millions of lines), exiting the sed script when the desired block was printed (assuming there are no repeated blocks of time in a file).
sed -n '/^06:25/,/^08:25/{p;/^08:25/q}' logFile > newLogFile
This tells sed to quit when the last line of the block was found.
You can use the following one-liner:
awk -v start='00:04:59.000' -v end='00:04:59.900' \
'{if(start <= $1 && end >= $1) print $0}' < your.log > reduced.log
Notice the full format of the start and end values; keeping them in the same format as the log's timestamps keeps the string comparison simple and shouldn't cause much trouble IMO.
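For the 11:00 to 11:45 range mentioned in the question, the call would look something like this (the exact boundary values are illustrative):
awk -v start='11:00:00.000' -v end='11:45:59.999' \
    '{if(start <= $1 && end >= $1) print $0}' < logFile > newLogFile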