I have a folder with thousands of PDFs named by date, like 20100820.pdf or 20000124.pdf.
On the command line, I have used the following pattern in other projects to find all PDFs in a folder and pipe them to a command: ls | grep -E "\.pdf$" | [command here]. Now I would like to restrict it to the PDFs in a given folder from the year 2010, for example. How can I achieve that?
Wow, it was so easy in the end.
This solves the problem:
ls | grep -E "\.pdf$" | grep -E "^2010"
(anchoring with ^ makes sure 2010 is matched as the leading year, not elsewhere in the name)
ls | grep -E "\.pdf$" | awk -F "." '{ if ($1 > 20100000) print $0 }'
This command takes all the PDFs, splits each filename at the dot, and compares the leading digits with 20100000; if greater, the filename is printed. Note that this matches 2010 and every later year; to restrict it to 2010 alone, use $1 > 20100000 && $1 < 20110000.
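For what it's worth, if the names really are all YYYYMMDD.pdf, the year can also be matched more strictly, or the shell glob can do the work without parsing ls at all; a minimal sketch:
# match exactly eight digits starting with 2010, followed by .pdf
ls | grep -E '^2010[0-9]{4}\.pdf$'
# or skip ls entirely and let the glob match the prefix
printf '%s\n' 2010*.pdf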
I have a file in a folder that will be named something like version1.txt or version99.txt. I am on a Windows box that has GNU utilities installed and am doing this from command prompt. Currently, my output looks like this:
command: dir | grep version
result: 12/08/2016 04:50 PM 0 version12.txt
I want it to return the number 12 in this case.
I've written a regex that matches version12 (although I need it to match only the 12), but I cannot figure out how to get sed to apply it (I do not have awk available). This is what I am trying:
dir | grep version | sed "/version[0-9]{2}|version[0-9]/g"
How do I get only the version number to appear?
You can use awk instead of grep to extract the version number:
dir | awk '/version/{gsub(/[^0-9]+/, "", $NF); print $NF}'
12
You can also use sed:
dir | sed 's/.* version\|\..*//g'
Here is a simpler alternative which removes the grep requirement:
dir /b version*.txt | sed 's/[^0-9]*//g'
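If you would rather stay with the sed substitution the question was aiming for, a capture group does it; a minimal sketch, assuming the bare version<N>.txt names that dir /b produces and GNU sed:
# -n plus the p flag prints only lines where the substitution matched
dir /b version*.txt | sed -n 's/^version\([0-9]\+\)\.txt$/\1/p'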
I am trying to use a pcregrep multiline match on a set of files. Those files themselves come out of a search in the current directory, something like this:
l | grep -P "\d\.mt.+" | cut -d":" -f 2 | cut -d" " -f 2 | xargs
So, I want to run pcregrep on this set of files, with a multiline match, as below:
pcregrep -Mi "index(.+\n)+" list of files
I don't know if it's possible to pass the list of file names like this.
Can someone help?
Regards,
Manu
Try this:
l | grep -P "\d\.mt.+" | cut -d":" -f 2 | cut -d" " -f 2 | xargs pcregrep -Mi "index(.+\n)+"
Your command ends with xargs but gives it no command to run.
xargs reads the file names from its standard input and appends them to the command you give it, so what actually runs is just like:
pcregrep -Mi "index(.+\n)+" <list of all found files>
That's the idea behind xargs.
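A quick way to see what xargs builds is to put echo in front of the target command (file names here are hypothetical):
printf '%s\n' a.log b.log | xargs echo pcregrep -Mi "index(.+\n)+"
# prints: pcregrep -Mi index(.+\n)+ a.log b.log
Note that xargs splits its input on whitespace, so this breaks on file names containing spaces; find ... -print0 | xargs -0 is the robust variant when that matters.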
I have a large text file containing a list of emails called "main", and I have sent mails to some of them. I have a list of 'sent' emails. Now, I want to remove the 'sent' emails from the list "main".
In other words, I want to remove both matching rows from the text file, removing duplicates in the process. Example:
I have:
email@email.com
test@test.com
email@email.com
I want:
test@test.com
Is there an easier way to achieve this? Please suggest a tool or method, keeping in mind that the text file is larger than 10 MB.
In terminal:
cat main | sort | uniq -c | awk '{ if ($1 == 1) print $2 }'
uniq -c prefixes each line with its count, and the awk keeps only the lines that occur exactly once.
I use Cygwin a lot for such tasks, as the Unix command line is incredibly powerful.
Here's how to achieve what you want:
cat main.txt | sort -u | grep -Fvxf sent.txt
sort -u will remove duplicates (by sorting the main.txt file first), and grep will take care of removing the unwanted addresses.
Here's what the grep options mean:
-F plain text search
-v invert results
-x will force the whole line to match the pattern
-f read patterns from the specified file
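A quick demonstration with two hypothetical files (demo names, so as not to clobber your real lists):
printf 'email@email.com\ntest@test.com\nemail@email.com\n' > demo-main.txt
printf 'email@email.com\n' > demo-sent.txt
sort -u demo-main.txt | grep -Fvxf demo-sent.txt
# prints only: test@test.com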
Oh, and if your files are in the Windows format (CRLF line endings), you'll have to do this instead:
cat main.txt | dos2unix | sort -u | grep -Fvxf <(cat sent.txt | dos2unix)
Just like on the Windows command line, you can simply add:
> output.txt
at the end of the command to redirect the output to a text file.
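An alternative worth knowing for large lists is comm, which compares two sorted files line by line; a sketch assuming both lists are sorted and deduplicated first:
sort -u main.txt > main.sorted
sort -u sent.txt > sent.sorted
comm -23 main.sorted sent.sorted    # keep only the lines unique to main.sorted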
I have a file that was generated containing MD5 info along with filenames. I want to remove those files from the directory they are in, but I'm not sure how to go about doing it.
filelist (file) contains:
MD5 (dupe) = 1fb218dfef4c39b4c8fe740f882f351a
MD5 (somefile) = a5c6df9fad5dc4299f6e34e641396d38
My command (which I would like to combine with rm) looks like this:
grep -o "\((.*)\)" filelist
returns this:
(dupe)
(somefile)
Almost good, although the parentheses need to be eliminated (not sure how). I tried grep -Po "(?<=\().*(?=\))" filelist using lookarounds, but the command didn't work.
The next thing I would like to do is take the output filenames and delete them from the directory they are in. I'm not sure how to script it, but it would essentially do:
<returned results from grep>
rm dupe $target
rm somefile $target
If I understand correctly, you want to take lines like these
MD5 (dupe) = 1fb218dfef4c39b4c8fe740f882f351a
MD5 (somefile) = a5c6df9fad5dc4299f6e34e641396d38
extract the second column without the parentheses to get the filenames
dupe
somefile
and then delete the files?
Assuming the filenames don't have spaces, try this:
# this is where your duplicate files are.
dupe_directory='/some/path'
# Check that you found the right files:
awk '{print $2}' file-with-md5-lines.txt | tr -d '()' | xargs -I{} ls -l "$dupe_directory/{}"
# Looks ok, delete:
awk '{print $2}' file-with-md5-lines.txt | tr -d '()' | xargs -I{} rm -v "$dupe_directory/{}"
xargs -I{} replaces every {} in the command with the incoming argument (the dupe filename), so it can be placed anywhere in a more complex command.
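If the filenames might contain spaces, a read loop over the extracted names is safer than the whitespace-splitting awk; a sketch under the same assumptions:
# strip the "MD5 (" prefix and the ") = <hash>" suffix, keeping only the name
sed 's/^MD5 (\(.*\)) = .*$/\1/' file-with-md5-lines.txt |
while IFS= read -r name; do
    rm -v "$dupe_directory/$name"
done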
The tool you're looking for is xargs: http://unixhelp.ed.ac.uk/CGI/man-cgi?xargs
It's pretty standard on *nix systems.
UPDATE: Given that target equals the directory where the files live...
I believe the syntax would look something like:
yourgrepcmd | xargs -I{} rm "$target/{}"
The -I creates a placeholder string, and each line from your grep command gets inserted there.
UPDATE:
Removing the parens takes a little use of sed's substitution command (http://unixhelp.ed.ac.uk/CGI/man-cgi?sed).
Something like this:
cat filelist | sed "s/MD5 (\([^)]*\)) .*$/\1/" | xargs -I{} rm "$target/{}"
The moral of the story is: if you learn to use sed and xargs (or awk if you want something a little more advanced), you'll be a more capable Linux user.
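As an aside, the lookaround attempt from the question is valid PCRE; grep -P just isn't supported everywhere (GNU grep has it, the BSD grep shipped with macOS does not, which may be why it failed). Where it is available, this extracts the names directly:
grep -Po '(?<=\().*(?=\))' filelist
# dupe
# somefile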
I have 1000s of log files generated by a very verbose PHP script. The general structure is as follows:
###Unknown no of lines, which I want to ignore###
=================================================
$insert_vars['cdr_pkey']=17568
$id<TAB>$g1<TAB>$i1<TAB>$rating1<TAB>$g2<TAB>$i2<TAB>$rating2 #<TAB>more $gX,$iX,$ratingX
#numerical values of $id $g1 $i1 etc. separated by tab
#numerical values of ---""---
#I do not know how many lines will be there (unique column is $id)
=================================================
###Unknown no of lines, which I want to ignore###
I have to process these log files and create an Excel sheet (I am thinking CSV format) and report the data back. I am really bad at Excel, but I thought of outputting something like:
cdr_pkey<TAB>id<TAB>g1<TAB>i1<TAB>rating1<TAB>g2<TAB>i2<TAB>rating2 #and so on
17568<TAB>1349<TAB>0.0004532<TAB>0.01320<TAB>2.014E-4<TAB>...#rest of numerical values
17568<TAB>1364<TAB>...#values for id=1364
17568<TAB>1321<TAB>...#values for id=1321
...
17569<TAB>1048<TAB>...#values for id=1048
17569<TAB>1426<TAB>...#values for id=1426
...
...
So my cdr_pkey is the unique column in the sheet, and for each $cdr_pkey I have multiple $ids, each with its own set of $g1,$i1,$rating1...
After testing, such a format can be read by Excel. Now I just want to extend it to all those 1000s of files.
I am just not sure how to proceed further. What's the next step?
The following bash script does something that might be related to what you want. It is parameterized by what you meant by <TAB>. I assume you mean the ASCII tab character, but if your logs are so verbose that they spell out <TAB>, you will need to modify the variable $WHAT_DID_YOU_MEAN_BY_TAB accordingly. Note that there is very little about this script that does The Right Thing™; it reads the entire file into a string variable, which might not even be possible depending on how big your log files are. On the upside, the script could easily be modified to make two passes instead, if you think that's better.
#!/bin/bash
# the output separator; change this if your logs literally spell out <TAB>
WHAT_DID_YOU_MEAN_BY_TAB='\t'
if [[ $# -ne 1 ]] ; then echo "Requires one argument: the file to process" ; exit 1 ; fi
FILENAME="$1"
# grab everything between the two ===== marker lines, exclusive
RELEVANT=$(sed -n '/^==*$/,/^==*$/p' "$FILENAME" | sed '1d' | head -n '-1')
# pull the value after the = on the $insert_vars['cdr_pkey'] line
CDR_PKEY=$(echo "$RELEVANT" | \
    grep '$insert_vars\['"'cdr_pkey'\]" | \
    sed 's/.*=\(.*\)/\1/')
# drop the cdr_pkey line and the header line, then prefix each data row with the key
echo "$RELEVANT" | sed '1,2d' | \
    sed "s/.*/${CDR_PKEY}$WHAT_DID_YOU_MEAN_BY_TAB&/"
The following find command is an example use, but your case will depend on how your logs are organized.
find . -name 'LOG_PATTERN' -exec THIS_SCRIPT '{}' \;
Lastly, I have ignored the issue of putting the CSV headers on the output. This is easily done out-of-band.
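For example, the header line could be prepended once when collecting everything into one file (report.csv and the column list are placeholders):
{
    echo -e "cdr_pkey\tid\tg1\ti1\trating1"   # extend with the remaining column names
    find . -name 'LOG_PATTERN' -exec THIS_SCRIPT '{}' \;
} > report.csv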
(Edit: updated the script to reflect discussion in the comments.)
EDIT: James tells me that changing the sed in the last echo from '1d' to '1,2d' and dropping the grep -v 'id' should do the trick.
Confirmed that it works, so I've changed it below. Thanks again to James Wilcox.
Based on @James's script, this is what I came up with; originally I had just piped the final echo to grep -v 'id' (kept below, commented out).
#!/bin/bash
WHAT_DID_YOU_MEAN_BY_TAB='\t'
if [[ $# -lt 1 ]] ; then echo "Requires at least one argument: the files to process" ; exit 1 ; fi
echo -e "key\tid\tg1\ti1\td1\tc1\tr1\tg2\ti2\td2\tc2\tr2\tg3\ti3\td3\tc3\tr3"
for i in "$@"
do
    FILENAME="$i"
    RELEVANT=$(sed -n '/^==*$/,/^==*$/p' "$FILENAME" | sed '1d' | head -n '-1')
    CDR_PKEY=$(echo "$RELEVANT" | \
        grep '$insert_vars\['"'cdr_pkey'\]" | \
        sed 's/.*=\(.*\)/\1/')
    echo "$RELEVANT" | sed '1,2d' | \
        sed "s/.*/${CDR_PKEY}$WHAT_DID_YOU_MEAN_BY_TAB&/"
    # the version with grep looked like:
    #echo "$RELEVANT" | sed '1d' | \
    #    sed "s/.*/${CDR_PKEY}$WHAT_DID_YOU_MEAN_BY_TAB&/" | grep -v 'id'
done