Bash scripting does my head in. I have searched for regex assignment, but I'm not really finding answers I understand.
I have files in a directory. I need to loop through the files and check if they fit certain criteria. File names under a certain sequence need to have their sequence increased. Those over a certain sequence need to trigger an alert.
I have pseudo code and need help turning it into correct bash syntax:
#!/bin/sh
function check_file()
{
# example file name "LOG_20101031144515_001.csv"
filename=$1
# attempt to get the sequence (ex. 001) part of file
# if sequence is greater than 003, then raise alert
# else change file name to next sequence (ex. LOG_20101031144515_002.csv)
}
for i in `ls -Ar`; do check_file $i; done;
If PHP were an option, I could do the following:
function check_file($file){
//example file 'LOG_20101031144515_001.csv';
$parts = explode('.',$file);
preg_match('/\d{3}$/', $parts[0], $matches);
if ($matches){
$sequence = $matches[0];
$seq = intval($sequence);
if ($seq > 3){
// do some code to fire alert (ex. email webmaster)
}
else{
// rename file with new sequence
$new_seq = sprintf('%03d', ++$seq);
$new_file = str_replace("_$sequence", "_$new_seq", $file);
rename($file, $new_file);
}
}
}
So long story short, I'm hoping someone can help port the PHP check_file function to the bash equivalent.
Thank you
First of all, your question is tagged [bash], but your shebang is #!/bin/sh. I'm going to assume Bash.
#!/bin/bash
function check_file()
{
# example file name "LOG_20101031144515_001.csv"
filename=$1
# attempt to get the sequence (ex. 001) part of file
seq=${filename%.csv}
seq=${seq##*_}
# if sequence is greater than 003, then raise alert
if (( 10#$seq > 3 ))
then
echo "Alert!"
else
# else change file name to next sequence (ex. LOG_20101031144515_002.csv)
printf -v newseq "%03d" $((10#$seq + 1))   # force base 10 so sequences like 008/009 aren't read as octal
echo "${filename/$seq/$newseq}" # or you could set a variable or do a mv
fi
}
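To drive that function, a glob loop is safer than parsing ls (a quick sketch, assuming the logs sit in the current directory):
for f in LOG_*.csv; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    check_file "$f"
done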
PHP IS an option. If you know PHP, you can run it from the shell.
Run
php myfile.php
and get the output right in the console. If the PHP file is executable and begins with
#!/path/to/php/executable
then you can run
./myfile.php
I'm no big expert in bash programming, but in order to obtain the list of files that match a certain pattern you can use the command
ls -l | grep "pattern_unquoted"
I suggest you go with PHP ;-)
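For example (assuming myfile.php starts with that shebang line):
which php             # find the interpreter path to put in the shebang
chmod +x myfile.php   # make the script executable
./myfile.php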
A different take on the problem:
#!/bin/sh
YOUR_MAX_SEQ=3
find /path/to/files -maxdepth 1 -name 'LOG_*.csv' -print \
| sed -e 's/\.csv$//' \
| awk -F_ '$3 > SEQ { print }' SEQ=$YOUR_MAX_SEQ
Brief explanation:
Find all files in /path/to/files matching LOG_*.csv
Chop the .csv off the end of each line
Using _ as a separator, print lines where the third field is greater than $YOUR_MAX_SEQ
This will leave you with a list of the files that met your criteria. Optionally, you could pipe the output through sed to stick the .csv back on.
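For example, appending one more sed to the same pipeline puts the extension back (a sketch):
find /path/to/files -maxdepth 1 -name 'LOG_*.csv' -print \
    | sed -e 's/\.csv$//' \
    | awk -F_ '$3 > SEQ { print }' SEQ=$YOUR_MAX_SEQ \
    | sed -e 's/$/.csv/'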
If you're comfortable with PHP, you'd probably be comfortable with Perl, too.
Related
Here is an example of the logs in my /var/www/apache2/log folder:
./no_domain_access.log.7.gz
./no_domain_access.log.8.gz
./no_domain_access.log.9.gz
./no_domain_error.log.10.gz
./no_domain_error.log.11.gz
./no_domain_error.log.12.gz
./no_domain_error.log.13.gz
./no_domain_error.log.14.gz
./no_domain_error.log.15.gz
./no_domain_error.log.16.gz
./no_domain_error.log.17.gz
./no_domain_error.log.18.gz
./no_domain_error.log.19.gz
./no_domain_error.log.20.gz
and goes until 50...
I would like to iterate over those files and remove all log files with a number greater than 5.
Using regex syntax like [1-9] or {1,2} will match the numbers, but it will also match the log files that I don't want to delete (the single-digit 1-5 log files that I wish to keep).
How can I match only file names with numbers higher than 5?
Thanks!
You can use awk one-liner for this:
printf '%s\n' *[0-9].gz | awk -F '.' '$(NF-1) > 5'
This awk command uses the dot as field separator and compares $(NF-1) (the numeric field before the extension) against 5, printing only the names where it is greater.
To delete these files use:
printf '%s\n' *[0-9].gz | awk -F '.' '$(NF-1) > 5' | xargs rm
xargs takes the file names from awk and passes them to rm, which deletes them.
Use the bash regex operator =~ to extract the number and list the file if the number is greater than 5:
for file in /var/www/apache2/log/*.gz; do
test -f "$file" || continue
[[ $file =~ ^.*log\.([[:digit:]]+).*$ ]] && { (( "${BASH_REMATCH[1]}" > 5 )) && printf "%s\n" "$file"; }
done
If you just want to delete the files, replace printf "%s\n" by just rm.
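For example, the delete variant might look like this (a sketch; run it with printf first to double-check what would be removed):
for file in /var/www/apache2/log/*.gz; do
    test -f "$file" || continue
    [[ $file =~ ^.*log\.([[:digit:]]+).*$ ]] && (( "${BASH_REMATCH[1]}" > 5 )) && rm -- "$file"
done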
Find with regular expressions
find . -regex './no_domain_access.log.*gz' ! -regex './no_domain_access.log.[1-5].gz'
Find all files matching no_domain... and then run another regular expression to exclude the files numbered 1 to 5 from those results.
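If your find supports -delete, the same command can remove the matches directly (a sketch; run it without -delete first to check the list):
find . -regex './no_domain_access.log.*gz' ! -regex './no_domain_access.log.[1-5].gz' -delete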
Without regular expressions, using shell globs and entirely native & portable POSIX shell code:
rm -f no_domain_access.log.[6-9].gz no_domain_access.log.[0-9][0-9].gz
It's easier in bash:
rm -f no_domain_access.log.{6..50}.gz
These are probably created with logrotate or a similar log rotation utility. You might want to just change its configuration to only store five logs.
If it's controlled by logrotate, you can find the documentation with man logrotate and you'll probably find something like this:
/var/log/no_domain_access.log {
rotate 50
daily
}
Change the 50 to 5 and you're done. You probably(?) still have to clean up the current old logs using one of the above commands.
I want to make a shell script which gets two parameters from the command line: the first should be an existing file, the other the new file which will contain the result. From the first file, I want to select the lowercase words, sort them, and copy the result into the second file. The grep command is obviously not good; how should I change it to get the result?
#!/bin/bash
file1=$1
file2=$2
if [ ! -f $file1]
then
echo "this file doesn't exist or is not a file
break
else
grep '/[a-z]*/' $file1 | sort > $file2
You can change the grep command like this:
grep -o '\<[[:lower:]][[:lower:]]*\>' "$file1" | sort -u > "$file2"
The -o is an output control switch that forces grep to print each match on its own line.
\< is a left word boundary and \> a right word boundary. (this way the word Site doesn't return ite)
[[:lower:]][[:lower:]]* ensures there's at least one lower case letter.
(The use of [[:lower:]] instead of the range [a-z] is preferable because with some locales, letters are collated regardless of case, e.g. aBbCcDd...YyZz, so [a-z] could match some uppercase letters.)
Notice: I added the -u switch to the sort command to remove duplicate entries, if you don't want this behaviour, remove it.
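For example, a quick illustration with made-up input:
$ printf 'The Site is up\nhello World hello\n' | grep -o '\<[[:lower:]][[:lower:]]*\>' | sort -u
hello
is
up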
I'm in a hurry so I won't rewrite what I pointed out in a comment, but here is your code with all these problems fixed:
#!/bin/bash
file1=$1
file2=$2
if [ ! -f $file1 ]
then
echo "this file doesn't exist or is not a file"
else
grep '[a-z]*' $file1 | sort > $file2
fi
ShellCheck gives one more tip which you should definitely apply, I'll let you check it out.
It would also be a good practice to exit with a non-zero code when the script can't execute its task, that is in your case when the file isn't found.
Using awk and sort. First, the test file:
$ cat file
This is a test.
This is another one.
Code:
$ awk -v RS="[ .\n]+" '/^[[:lower:]]+$/' file | sort
a
another
is
is
one
test
I'm using space, newline and period as the record separator so that each word becomes its own record, and printing words that consist of only lowercase letters.
Your shell code could use some fixing up.
#!/bin/bash
file1=$1
file2=$2
if [ ! -f "$file1" ] # need space before ]; quote expansions
# send error messages to stderr instead of stdout
# include program and file name in message
printf >&2 '%s: file "%s" does not exist or is not a file\n' "$0" "$file1"
# exit with nonzero code when something goes wrong
exit 1
fi
# -w to get only whole words
# -o to print out each match on a separate line
grep -wo '[a-z][a-z]*' "$file1" | sort > "$file2"
As written that will include multiple copies of the same word if it occurs multiple times in the file; change to sort -u if you don't want that.
I have a bunch of daily printer logs in CSV format and I'm writing a script to keep track of how much paper is being used and save the info to a database, but I've come across a small problem
Essentially, some of the document names in the logs include commas in them (which are all enclosed within double quotes), and since it's in comma separated format, my code is messing up and pushing everything one column to the right for certain records.
From what I've been reading, it seems like the best way to go about fixing this would be using awk or sed, but I'm unsure which is the best option for my situation, and how exactly I'm supposed to implement it.
Here's a sample of my input data:
2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"MicrosoftWordDocument1",COMSYRWS14,A4,PCL6,,,NOT DUPLEX,GRAYSCALE,35kb,
And here's what I have so far:
#!/bin/bash
#Get today's file name
yearprefix="20"
currentdate=$(date +"%m-%d-%y");
year=${currentdate:6};
year="$yearprefix$year"
month=${currentdate:0:2};
day=${currentdate:3:2};
filename="papercut-print-log-$year-$month-$day.csv"
echo "The filename is: $filename"
# Remove commas in between quotes.
#Loop through CSV file
OLDIFS=$IFS
IFS=,
[ ! -f "$filename" ] && { echo "Input file $filename not found"; exit 99; }
while read time user pages copies printer document client size pcl blank1 blank2 duplex greyscale filesize blank3
do
#Remove headers
if [ "$user" != "" ] && [ "$user" != "User" ]
then
#Remove any file name with an apostrophe
if [[ "$document" =~ "'" ]];
then
document="REDACTED"; # Lazy. Need to figure out a proper solution later.
fi
echo "$time"
#Save results to database
mysql -u username -p -h localhost -e "USE printerusage; INSERT INTO printerlogs (time, username, pages, copies, printer, document, client, size, pcl, duplex, greyscale, filesize) VALUES ('$time', '$user', '$pages', '$copies', '$printer', '$document', '$client', '$size', '$pcl', '$duplex', '$greyscale', '$filesize');"
fi
done < $filename
IFS=$OLDIFS
Which option is more suitable for this task? Will I have to create a second temporary file to get this done?
Thanks in advance!
As I wrote in another answer:
Rather than interfere with what is evidently source data, i.e. the stuff inside the quotes, you might consider replacing the field-separator commas (with say |) instead:
s/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g
And then splitting on | (assuming none of your data has | in it).
Is it possible to write a regular expression that matches a particular pattern and then does a replace with a part of the pattern
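Applied from the shell, that substitution might look something like this (a sketch reusing the $filename variable from your script; you would then split on | in the read loop by setting IFS='|'):
perl -pe 's/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g' "$filename"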
There is probably an easier way using sed alone, but this should work. Loop over the file; for each line, match the quoted parts with grep -o, then replace the commas inside them with spaces (or whatever you would like to use to get rid of the commas; if you want to preserve the data you can use a non-printable character and convert it back to commas afterward).
i=1 && IFS=$(echo -en "\n\b") && for a in $(< test.txt); do
var="${a}"
for b in $(sed -n ${i}p test.txt | grep -o '"[^"]*"'); do
repl="$(sed "s/,/ /g" <<< "${b}")"
var="$(sed "s#${b}#${repl}#" <<< "${var}")"
done
let i+=1
echo "${var}"
done
I have a specific question about shell scripting.
Simple scripting is no problem for me, but I am new to this and want to make myself a simple database file.
So, what I want to do is:
- Search for file types (e.g. .nfo) <-- should be no problem :)
- read inside each found file and use some strings from it
- these strings from each file should be written to a new file; each found file's
information should become one row in the new file
I hope I explained my "project" well.
My problem now is understanding how to tell the script to search for the files, then read each of them and use some of the information inside to write into a new file.
I will explain a bit better.
I am searching for files and that gives me back:
file1.nfo
file2.nfo
file3.nfo
OK, now from each of those files I need the information between two strings, e.g.
file1.nfo:
<user>test1</user>
file2.nfo:
<user>test2</user>
so in the new file there should now be:
file1.nfo:test1
file2.nfo:test2
OK so:
find -name *.nfo > /test/database.txt
is printing out the list of files.
and
sed -n '/<user*/,/<\/user>/p' file1.nfo
gives me back the complete file and not only the information between <user> and </user>
I am trying to go step by step and reading a lot, but it seems to be very difficult.
What am I doing wrong and what should be the best way to list all files, and write the files and the content between two strings to a file?
EDIT-NEW:
OK, here is an update with more information.
I have learned a lot by now and searched the web for my problems. I can find a lot of information, but I don't know how to put it together so that I can use it.
Working with awk now, I get back the filename and the string.
Here now is the complete information (I thought I could go on by myself with a bit of help, but I can't :( )
Here is an example of: /test/file1.nfo
<string1>STRING 1</string1>
<string2>STRING 2</string2>
<string3>STRING 3</string3>
<string4>STRING 4</string4>
<personal informations>
<hobby>Baseball</hobby>
<hobby>Basketball</hobby>
</personal informations>
Here is an example of /test/file2.nfo
<string1>STRING 1</string1>
<string2>STRING 2</string2>
<string3>STRING 3</string3>
<string4>STRING 4</string4>
<personal informations>
<hobby>Soccer</hobby>
<hobby>Traveling</hobby>
</personal informations>
The File i want to create has to look like this.
STRING 1:::/test/file1.nfo:::Date of file:::STRING 4:::STRING 3:::Baseball, Basketball:::STRING 2
STRING 1:::/test/file2.nfo:::Date of file:::STRING 4:::STRING 3:::Soccer, Traveling:::STRING 2
"Date of file" should be the creation date of the file. So that i can see how old is the file.
So, that´s what i need and it seems not easy.
Thanks a lot.
UPDATE: error with -printf
find: unrecognized: -printf
Usage: find [PATH]... [OPTIONS] [ACTIONS]
Search for files and perform actions on them.
First failed action stops processing of current file.
Defaults: PATH is current directory, action is '-print'
-follow Follow symlinks
-xdev Don't descend directories on other filesystems
-maxdepth N Descend at most N levels. -maxdepth 0 applies
actions to command line arguments only
-mindepth N Don't act on first N levels
-depth Act on directory *after* traversing it
Actions:
( ACTIONS ) Group actions for -o / -a
! ACT Invert ACT's success/failure
ACT1 [-a] ACT2 If ACT1 fails, stop, else do ACT2
ACT1 -o ACT2 If ACT1 succeeds, stop, else do ACT2
Note: -a has higher priority than -o
-name PATTERN Match file name (w/o directory name) to PATTERN
-iname PATTERN Case insensitive -name
-path PATTERN Match path to PATTERN
-ipath PATTERN Case insensitive -path
-regex PATTERN Match path to regex PATTERN
-type X File type is X (one of: f,d,l,b,c,...)
-perm MASK At least one mask bit (+MASK), all bits (-MASK),
or exactly MASK bits are set in file's mode
-mtime DAYS mtime is greater than (+N), less than (-N),
or exactly N days in the past
-mmin MINS mtime is greater than (+N), less than (-N),
or exactly N minutes in the past
-newer FILE mtime is more recent than FILE's
-inum N File has inode number N
-user NAME/ID File is owned by given user
-group NAME/ID File is owned by given group
-size N[bck] File size is N (c:bytes,k:kbytes,b:512 bytes(def.))
+/-N: file size is bigger/smaller than N
-links N Number of links is greater than (+N), less than (-N),
or exactly N
-prune If current file is directory, don't descend into it
If none of the following actions is specified, -print is assumed
-print Print file name
-print0 Print file name, NUL terminated
-exec CMD ARG ; Run CMD with all instances of {} replaced by
file name. Fails if CMD exits with nonzero
-delete Delete current file/directory. Turns on -depth option
The pat1,pat2 notation of sed is line based. Think of it like this: pat1 sets an enable flag for its commands and pat2 disables the flag. If pat1 and pat2 both match the same line, the flag is still set (pat2 is only tested against later lines), and thus in your case it prints everything following and including the <user> line. See grymoire's sed howto for more.
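If you want to stay with sed here, a substitution (rather than an address range) pulls out just the text between the tags; something like:
sed -n 's:.*<user>\(.*\)</user>.*:\1:p' file1.nfo
which prints test1 for your example file.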
An alternative to sed, in this case, would be to use a grep that supports look-around assertions, e.g. GNU grep:
find . -type f -name '*.nfo' | xargs grep -oP '(?<=<user>).*(?=</user>)'
If grep doesn't support -P, you can use a combination of grep and sed:
find . -type f -name '*.nfo' | xargs grep -o '<user>.*</user>' | sed 's:</\?user>::g'
Output:
./file1.nfo:test1
./file2.nfo:test2
Note, you should be aware of the issues involved with passing files on to xargs and perhaps use -exec ... instead.
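For example, the -exec variant avoids those xargs quoting issues (a sketch; -H keeps the file name in front of each match):
find . -type f -name '*.nfo' -exec grep -oPH '(?<=<user>).*(?=</user>)' {} +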
It so happens that grep outputs in the format you need and is enough for a one-liner.
By default a grep '' *.nfo will output something like:
file1.nfo:random data
file1.nfo:<user>test1</user>
file1.nfo:some more random data
file2.nfo:not needed
file2.nfo:<user>test2</user>
file2.nfo:etc etc
By adding the -P option (Perl RegEx) you can restrict the output to matches only:
grep -P "<user>\w+<\/user>" *.nfo
output:
file1.nfo:<user>test1</user>
file2.nfo:<user>test2</user>
Now the -o option (only show what matched) saves the day, but we'll need a bit more advanced RegEx since the tags are not needed:
grep -oP "(?<=<user>)\w+(?=<\/user>)" *.nfo > /test/database.txt
output of cat /test/database.txt:
file1.nfo:test1
file2.nfo:test2
Explained RegEx here: http://regex101.com/r/oU2wQ1
And your whole script just became a single command.
Update:
If you don't have the --perl-regexp option try:
grep -oE "<user>\w+<\/user>" *.nfo|sed 's#</?user>##g' > /test/database.txt
All you need is:
find -name '*.nfo' | xargs awk -F'[><]' '{print FILENAME,$3}'
If you have more in your file than just what you show in your sample input then this is probably all you need:
... awk -F'[><]' '/<user>/{print FILENAME,$3}' file
Try this (untested):
> outfile
find -name '*.nfo' -printf "%p %Tc\n" |
while read -r fname tstamp
do
awk -v tstamp="$tstamp" -F'[><]' -v OFS=":::" '
{ a[$2] = a[$2] sep[$2] $3; sep[$2] = ", " }
END {
print a["string1"], FILENAME, tstamp, a["string4"], a["string3"], a["hobby"], a["string2"]
}
' "$fname" >> outfile
done
The above will only work if your file names do not contain spaces. If they can, we'd need to tweak the loop.
Alternative if your find doesn't support -printf (suggestion - seriously consider getting a modern "find"!):
> outfile
find -name '*.nfo' -print |
while IFS= read -r fname
do
tstamp=$(stat -c"%x" "$fname")
awk -v tstamp="$tstamp" -F'[><]' -v OFS=":::" '
{ a[$2] = a[$2] sep[$2] $3; sep[$2] = ", " }
END {
print a["string1"], FILENAME, tstamp, a["string4"], a["string3"], a["hobby"], a["string2"]
}
' "$fname" >> outfile
done
If you don't have "stat" then google for alternatives to get a timestamp from a file or consider parsing the output of ls -l - it's unreliable but if it's all you've got...
I need a bash script for Mac OS X working in this way:
./script.sh * folder/to/files/
#
# or #
#
./script.sh xx folder/to/files/
This script should:
read a list of files
open each file and read each line
if a line ends with the same letters ('*' mode) or with custom letters ('xx') then
remove the line and re-save the file
back up the original file first
My first approach to do this:
#!/bin/bash
# ck init params
if [ $# -le 0 ]
then
echo "Usage: $0 <letters>"
exit 0
fi
# list files in current dir
list=`ls BRUTE*`
for i in $list
do
# prepare regex
case $1 in
"*") REGEXP="^.*(.)\1+$";;
*) REGEXP="^.*[$1]$";;
esac
FILE=$i
# backup file
cp $FILE $FILE.bak
# removing line with same letters
sed -Ee "s/$REGEXP//g" -i '' $FILE
cat $FILE | grep -v "^$"
done
exit 0
But it doesn't work as I want...
What's wrong?
How can I fix this script?
Example:
$cat BRUTE02.dat BRUTE03.dat
aa
ab
ac
ad
ee
ef
ff
hhh
$
If I use '*', I want lines that end with the same (repeated) letters removed from every file.
If I use 'ff', I want lines that end with 'ff' removed from every file.
Ah, it's on Mac OS X. Remember that its sed is a little different from the classic Linux sed.
man sed
sed [-Ealn] command [file ...]
sed [-Ealn] [-e command] [-f command_file] [-i extension] [file ...]
DESCRIPTION
The sed utility reads the specified files, or the standard input if no files are specified, modifying the input as specified by a list of commands. The input is then written to the standard output.
A single command may be specified as the first argument to sed. Multiple commands may be specified by using the -e or -f options. All commands are applied to the input in the order they are specified regardless of their origin.
The following options are available:
-E    Interpret regular expressions as extended (modern) regular expressions rather than basic regular expressions (BRE's). The re_format(7) manual page fully describes both formats.
-a    The files listed as parameters for the ``w'' functions are created (or truncated) before any processing begins, by default. The -a option causes sed to delay opening each file until a command containing the related ``w'' function is applied to a line of input.
-e command
      Append the editing commands specified by the command argument to the list of commands.
-f command_file
      Append the editing commands found in the file command_file to the list of commands. The editing commands should each be listed on a separate line.
-i extension
      Edit files in-place, saving backups with the specified extension. If a zero-length extension is given, no backup will be saved. It is not recommended to give a zero-length extension when in-place editing files, as you risk corruption or partial content in situations where disk space is exhausted, etc.
-l    Make output line buffered.
-n    By default, each line of input is echoed to the standard output after all of the commands have been applied to it. The -n option suppresses this behavior.
The form of a sed command is as follows:
[address[,address]]function[arguments]
Whitespace may be inserted before the first address and the function portions of the command.
Normally, sed cyclically copies a line of input, not including its terminating newline character, into a pattern space (unless there is something left after a ``D'' function), applies all of the commands with addresses that select that pattern space, copies the pattern space to the standard output, appending a newline, and deletes the pattern space.
Some of the functions use a hold space to save all or part of the pattern space for subsequent retrieval.
Anything else?
Is my problem clear?
Thanks.
I don't know bash shell too well so I can't evaluate what the failure is.
This is just an observation of the regex as understood (this may be wrong).
The * mode regex looks ok:
^.*(.)\1+$ — lines that end with the same letter repeated.
But the literal mode might not do what you think.
current: ^.*[$1]$ — meant to match lines that end with the literal string
This shouldn't use a character class.
Change it to: ^.*$1$
Realize though the string in $1 (before it goes into the regex) should be escaped
in case there are any regex metacharacters contained within it.
Otherwise, do you intend to have a character class?
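A rough way to do that escaping in the script before $REGEXP is built might be (a sketch; escaped is just an illustrative variable name, and the sed backslashes the usual ERE metacharacters):
escaped=$(printf '%s\n' "$1" | sed 's/[][(){}.*+?^$|\\]/\\&/g')
REGEXP="^.*${escaped}$"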
perl -ne '
BEGIN {$arg = shift; $re = $arg eq "*" ? qr/([[:alpha:]])\1$/ : qr/$arg$/}
/$re/ && next || print
'
Example:
echo "aa
ab
ac
ad
ee
ef
ff" | perl -ne '
BEGIN {$arg = shift; $re = $arg eq "*" ? qr/([[:alpha:]])\1$/ : qr/$arg$/}
/$re/ && next || print
' '*'
produces
ab
ac
ad
ee
ef
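Since the script is also supposed to back up and re-save each file, the same one-liner can be run with -i (a sketch; .bak copies of the originals are kept, and BRUTE*.dat stands in for your file list):
perl -i.bak -ne '
BEGIN {$arg = shift; $re = $arg eq "*" ? qr/([[:alpha:]])\1$/ : qr/$arg$/}
/$re/ && next || print
' '*' BRUTE*.dat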
A possible issue:
When you put * on the command line, the shell replaces it with the name of all the files in your directory. Your $1 will never equal *.
And some tips:
You can replace:
This:
# list files in current dir
list=`ls BRUTE*`
for i in $list
With:
for i in BRUTE*
And:
This:
cat $FILE | grep -v "^$"
With:
grep -v "^$" $FILE
Besides the possible issue, I can't see anything jumping out at me. What do you mean clean? Can you give an example of what a file should look like before and after and what the command would look like?
This is the problem!
grep '\(.\)\1[^\r\n]$' *
on Mac OS X, ( ) { }, etc. must be escaped with backslashes!!!
Solved, thanks.