Using a regular expression to search for a specific pattern in UNIX

I have file names like ABCD20140207090842, ABCD20140207090847, ABCD20140207090849, ABCD20140207090850, ABCD2014556644219268, ABCD20140508525691, and tf in my directory.
I want to search for files with a specific pattern, i.e. FileNameYearMonthDayHourMinSec.txt.
Note: the files tf and ABCD2014556644219268 should not be matched.
An answer with the exact pattern would be appreciated.

Based on NAME followed by DATE.txt, I get this. Note that find's default (emacs) regex dialect needs \( \| \) for grouping and alternation; [19|20] would be a character class matching the single characters 1, 9, |, 2 or 0:
find . -regex ".*\(19\|20\)[0-9][0-9][01][0-9][0-3][0-9][0-2][0-9][0-5][0-9][0-5][0-9]\.txt$"
This doesn't validate the calendar though: it doesn't account for leap years and can still match impossible dates such as 31 Feb.
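Run against the sample names (assuming they carry the .txt suffix the stated pattern requires), only the well-formed timestamps match:
$ find . -regex ".*\(19\|20\)[0-9][0-9][01][0-9][0-3][0-9][0-2][0-9][0-5][0-9][0-5][0-9]\.txt$"
./ABCD20140207090842.txt
./ABCD20140207090847.txt
./ABCD20140207090849.txt
./ABCD20140207090850.txt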

A different approach: instead of encoding calendar rules in the regex, let GNU date validate the timestamp, leap years included.
Using GNU date to validate the date:
for file in *.txt
do
    time=${file%.*}      # remove the suffix, leaving e.g. ABCD20140207090842
    time=${time:(-14)}   # keep the last 14 characters: 20140207090842
    # reformat as 2014/02/07 09:08:42 so GNU date can parse it
    time="${time:0:4}/${time:4:2}/${time:6:2} ${time:8:2}:${time:10:2}:${time:12}"
    date -d "$time" >/dev/null 2>&1 && echo "$file"
done
ABCD20140207090842.txt
ABCD20140207090847.txt
ABCD20140207090849.txt
ABCD20140207090850.txt
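This works because GNU date simply rejects impossible dates. A quick check, assuming GNU coreutils date:
$ date -d "2014/02/31 09:08:42"; echo $?
date: invalid date '2014/02/31 09:08:42'
1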

Related

How to pass captured regex group to a shell command inside perl-rename

I have a set of files that I want to batch rename using the rename utility available in WSL Ubuntu. My file names contain the following pattern, and I want to correct the date format in the names.
file_10Feb2022.pptx
file_10Mar2022.pptx
file_17Feb2022.pptx
file_17Mar2022.pptx
file_24Feb2022.pptx
file_3Feb2022.pptx
file_3Mar2022.pptx
I tried the following command to rename them:
rename -n "s/_(.*)\./_`date +%F -d \1`\./g" *.pptx
I capture the date part with a regex and try to use the date command (inside the backticks) to format it correctly, but I am unable to pass the captured regex group (\1) to the shell command.
Running it with the -n (dry-run) flag returns the following output, where it tries to rename every file to the current date (2022-08-03) instead of the date specified:
rename(file_10Feb2022.pptx, file_2022-08-03.pptx)
rename(file_10Mar2022.pptx, file_2022-08-03.pptx)
rename(file_17Feb2022.pptx, file_2022-08-03.pptx)
rename(file_17Mar2022.pptx, file_2022-08-03.pptx)
rename(file_24Feb2022.pptx, file_2022-08-03.pptx)
rename(file_3Feb2022.pptx, file_2022-08-03.pptx)
rename(file_3Mar2022.pptx, file_2022-08-03.pptx)
I have another folder full of files whose suffixes use varying date formats, and I would like to capture the suffix and let the date command deal with the format, instead of me capturing individual parts like day, month and year. Any ideas on how to execute this properly?
This is probably getting beyond where a one-liner is a good idea, but:
$ rename 's{_(.*?)(\.[^.]+)$}{
my ($d,$s) = ($1,$2);
my $nd = `date +%F -d "$d"`;
chomp $nd;
$? ? $& : "_$nd$s"
}e' file_*
s{}{}e - search and replace, treating the replacement as Perl code
the code in backticks is the shell command (with interpolation)
if $? is set, something went wrong, so return the original value ($&); otherwise do the replacement
date errors will appear on stderr as usual; affected files will not be renamed
chomp removes the trailing newline received from the backticks, which would otherwise end up in the filename
?: is the ternary operator: condition ? value-if-true : value-if-false
some input could be unsafe when passed to the shell and should be escaped; @ikegami notes that the shell can be avoided entirely, for example with something like:
use IPC::System::Simple qw( capturex );
my $nd = capturex( "date", "+%F", "-d", $d );
The reason your command returns the current date:
rename -n "s/_(.*)\./_`date +%F -d \1`\./g" *.pptx
is that, because you used double-quotes, the backticks are expanded by the shell before rename ever runs (and even with single quotes you would need the /e modifier, or rename would take the backtick expression as literal text to insert).
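You can see the premature expansion by echoing the same double-quoted string: the shell runs date +%F -d 1 (the \1 reaches date as a literal 1) before rename is ever invoked, so date just prints the current date. A quick check, with the output as it would have looked on the day of the question:
$ echo "s/_(.*)\./_`date +%F -d \1`\./g"
s/_(.*)\./_2022-08-03\./g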

Extract Google drive folder id from URL's

I am just trying to extract the Google Drive folder ID from a bunch of different Google Drive URLs:
cat links.txt
https://drive.google.com/drive/mobile/folders/1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE?usp=sharing
https://drive.google.com/open?id=1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
https://drive.google.com/folderview?id=1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
https://drive.google.com/file/d/1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_/view?usp=drivesdk
https://drive.google.com/drive/folders/0AKzaqn_X7nxiUk9PVA
https://drive.google.com/drive/mobile/folders/0AKzaqn_X7nxiUk9PVA
https://drive.google.com/drive/mobile/folders/0AKzaqn_X7nxiUk9PVA/1re_-YAGfTuyE1Gt848vzTu4ZDC6j23sG/1Ye90fM5qYMYkXp4QMAcQftsJCFVHswWj/149W7xNROO33zaPvIYTNwvtVGAXFxCg_b?sort=13&direction=a
https://drive.google.com/drive/mobile/folders/1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9?sort=13&direction=a
https://drive.google.com/drive/folders/1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF?sort=13&direction=a
Expected Output
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF
After an hour of trial and error, I came up with this regex - ([01A-Z])(?=[\w-]*[A-Za-z])[\w-]+
It seems to work almost everywhere, except that it can't process the third-to-last link properly. If there are multiple nested folder IDs in a URL, I need the innermost one in the output. Can someone please help me fix this, and possibly improve the regex if it can be done more efficiently?
You may try this sed:
sed -E 's~.*[/=]([01A-Z][-_[:alnum:]]+)([?/].*|$)~\1~' links.txt
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF
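The greedy .*[/=] is what makes this pick the innermost ID: it consumes everything up to the last / or = that still leaves a plausible ID behind. On the nested-folders URL alone:
$ echo 'https://drive.google.com/drive/mobile/folders/0AKzaqn_X7nxiUk9PVA/1re_-YAGfTuyE1Gt848vzTu4ZDC6j23sG/1Ye90fM5qYMYkXp4QMAcQftsJCFVHswWj/149W7xNROO33zaPvIYTNwvtVGAXFxCg_b?sort=13&direction=a' | sed -E 's~.*[/=]([01A-Z][-_[:alnum:]]+)([?/].*|$)~\1~'
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b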
With GNU awk:
awk '{print $NF}' FPAT='[a-zA-Z0-9_-]{19,34}' file
$NF: the last field of the record
FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator.
Output:
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF
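As a minimal illustration of how FPAT changes field splitting (GNU awk only; the input here is made up):
$ echo 'a12b345c' | gawk '{ for (i = 1; i <= NF; i++) print $i }' FPAT='[0-9]+'
12
345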

Remove end of file after matching regex keeping the expression matched in multiple files (sed?)

I'm cleaning up a lot of Markdown files to import them into Pelican (a static website generator). While compiling I get errors about the date format in multiple files. What I need to do is keep the date (yyyy-mm-dd) and delete everything after it to the end of the line. This is my latest attempt with sed and a regex:
sed -i "s/\(\d{4}-\d{2}-\d{2}\)\*/\1 /g" *.md
My hope was that sed would capture the whole pattern within the parentheses as \1 and then keep it as the substitution string.
This is an example of the errors (all numbers change):
ERROR: Could not process ./2010-12-28-the-open-internet-a-case-for-net-neutrality.html.md
| ValueError: '2010-12-28 21:22:00.000000000 +01:00 true' is not a valid date
ERROR: Could not process ./2011-05-27-two-one-must-read-internet-business-book.html.md
| ValueError: '2011-05-27 13:08:00.000000000 +02:00 true' is not a valid date
I've looked around SO, but everything I've found deals with static strings, while mine change all the time.
Thanks for your help.
Please be careful with these files; at the very least make a backup before using sed on them. This can be done by giving the -i flag an extension: -i.bckup.
Also, I am not sure whether you want to modify the contents of the files or the file names themselves.
An expression that would only keep the date would be:
sed -r 's/([^-]*-[^-]*-[^- ]*).*/\1/'
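A quick check on one of the error strings:
$ echo '2010-12-28 21:22:00.000000000 +01:00 true' | sed -r 's/([^-]*-[^-]*-[^- ]*).*/\1/'
2010-12-28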
I suspect your sed does not treat \d as a metacharacter meaning [0-9], so use [0-9] instead.
sed -i -r 's/([0-9]{4}-[0-9]{2}-[0-9]{2}).*/\1/' *.md
Note:
# with the -r extended-regex option you do not escape your pattern groupings ()
# no need for the /g option, since you are removing everything after the first match
# .* is probably the wildcard you meant to use: * matches any number of the preceding pattern, and . matches any single character
Here is a command line test:
echo '2011-05-27 13:08:00.000000000 +02:00 true' | sed -r 's/([0-9]{4}-[0-9]{2}-[0-9]{2}).*/\1/'
which outputs:
2011-05-27

Adding YYYY-mm-dd to a hh:mm:ss timestamp in a CSV file

Let's consider the following CSV file format :
server_name status_code timestamp probe_name
where status_code can be either I or E, and all fields are separated by tabs.
As an example, we can consider the following CSV line :
albatros.benches.com I 14:55:23.145 througput_probe
Every single CSV file contains a whole day's worth of logs. I'm trying to have every hh:mm:ss timestamp prefixed with the actual YYYY-mm-dd, so that the resulting line would be as follows:
albatros.benches.com I 2013-02-25 14:55:23.145 througput_probe
As a bonus, since the filename holds the date (the filename is log_2013_02_25.txt) and since I have many of these files (for different days, of course) to run sed over, I wish I could automatically use the filename as the source of the date to apply in the timestamp transform.
EDIT: The filename is log_YYYY_MM_DD.txt, not log_YYYY-MM-DD.txt as described previously.
My sed and regex knowledge is rather limited. So far I'm using something like :
s/I^T/I^T 2013-02-25 /g
s/E^T/E^T 2013-02-25 /g
(^T is actually a ^V followed by a Tab keypress)
on all of my files, but this really looks very awkward to me. If one day we add another status code (for example X), this trick will not work. I guess it would be less error-prone to have sed handle the 3rd field and prefix it, but I can't figure out how to do that properly.
Any ideas welcome!
Thank you
Assuming that all your CSV files are named like log_YYYY_MM_DD.txt, you can try running this bash script in the directory where they live:
#!/bin/bash
for file in log_*.txt; do
    [[ $file =~ [0-9]{4}_[0-9]{2}_[0-9]{2} ]] \
        && date="${BASH_REMATCH}" \
        && sed -E -i.bak "s/\t(E|I)\t/\t\1\t${date//_/-} /" "$file"
done
All status codes that should be handled can be put in the parentheses. E.g. to also handle status code X, just change (E|I) to (E|I|X).
Once you have verified that everything works as expected, you can remove the .bak to stop creating backup files.
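A before/after sketch on the example line from the question (assuming the script is saved as add_date.sh, a hypothetical name, and the fields are tab-separated):
$ cat log_2013_02_25.txt
albatros.benches.com	I	14:55:23.145	througput_probe
$ ./add_date.sh
$ cat log_2013_02_25.txt
albatros.benches.com	I	2013-02-25 14:55:23.145	througput_probe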

How can I extract a pattern from all files in a directory, using Perl?

I am running a command which returns 96 .txt files for each hour of a particular date,
so it finally gives me 24*96 files for one day in a directory.
My aim is to extract data for four months, which will result in 30*24*96*4 files in a directory.
After I get the data I need to extract a certain pattern from each of the files and display it as output.
1) The script below is only for one day, where the date is hardcoded in the script.
2) I need to make it work for all days in a month, and I need to run it from June to October.
3) As the data is huge, my disk will run out of space, so I don't want to create that many files; instead I just want to grep on the fly and collect everything into one output file.
How can I do this efficiently?
My shell script looks like this
for R1 in {0..9}; do
    for S1 in {0..95}; do
        echo $R1 $S1
        curl -H "Accept-Encoding: gzip" "http://someservice.com/getValue?Count=96&data=$S1&fields=hitType,QueryString,pathInfo" | zcat > 20101008-mydata-$R1-$S1.txt
    done
done
This returns the files I need.
After that, I extract a URL pattern from each of the files: grep "test/link/link2" * | grep category > 1.output
You can use this awk command to extract the URLs:
awk -vRS="</a>" '/href/&&/test.*link2/&&/category/{gsub(/.*<a.*href=\"|\".*/,"");print}' file
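For instance, on a hypothetical HTML snippet (sample.txt and its URLs are made up here), the command splits the input into one record per </a> and strips everything but the href value:
$ cat sample.txt
<a href="http://x/test/link/link2?category=1">hit</a> <a href="http://x/other">miss</a>
$ awk -vRS="</a>" '/href/&&/test.*link2/&&/category/{gsub(/.*<a.*href=\"|\".*/,"");print}' sample.txt
http://x/test/link/link2?category=1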
Here's how to loop over four months' worth of dates:
#!/usr/bin/perl
use strict;
use warnings;
use Date::Simple ':all';

# June through September inclusive: four months' worth of dates
for (my $date = ymd(2010,6,1), my $end = ymd(2010,10,1); $date < $end; $date++) {
    my $YYYYMMDD = $date->format("%Y%m%d");
    process_one_day($YYYYMMDD); # Add more formats if needed as parameters
}

sub process_one_day {
    my $YYYYMMDD = shift;
    # ...
    # ... Insert your code to process that date:
    # ... either call system() on the sample code in your question,
    # ... or better yet write a native Perl equivalent.
    # ...
    # ... For native processing, use WWW::Mechanize to extract the data from the URL
    # ... and Perl's native grep() to grep for it.
}