Retrieve file name when regexp matches content - regex

So I have a directory which includes a bunch of text files, and inside each file there is a line that has the file's time stamp which has the format:
TimeStamp: mm/dd/yyyy
I am writing a script that takes in 3 inputs: month, date, and year, and I want to retrieve the name of the files that have the time stamps matched with the inputs.
I am using this line of code to match the files and output all the rows found to another file.
egrep 'TimeStamp: "$2"/"$3"/"$1"' inFile > outFile
However, I have not figured out a way to get the files names during the process.
Also, I believe there is a quick and simple way to do this with awk, but I am new to awk, so I am still struggling with it.

Note:
I'm assuming you want to BOTH capture matching lines AND, SEPARATELY, the names (paths) of the files that had matches (therefore, using just egrep -l is not enough).
Based on your question, I've changed 'TimeStamp: "$2"/"$3"/"$1"' to "TimeStamp: $2/$3/$1", because the former would treat $2, ... as literals (would not expand them), due to being enclosed in a single-quoted string.
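To see the quoting difference in isolation, here is a minimal sketch. The positional parameters set with set -- are made-up values standing in for the script's arguments (year, month, day in that order, which is an assumption based on the pattern).

```sh
#!/bin/sh
# Hypothetical arguments: $1 = year, $2 = month, $3 = day
set -- 2014 03 15

# Single quotes: nothing is expanded, the text stays exactly as typed
single='TimeStamp: "$2"/"$3"/"$1"'
# Double quotes: $2, $3 and $1 are replaced by their values
double="TimeStamp: $2/$3/$1"

printf '%s\n' "$single" "$double"
# prints:
#   TimeStamp: "$2"/"$3"/"$1"
#   TimeStamp: 03/15/2014
```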
If you already have a single filename to pass to egrep, you can use && to conditionally output that filename if that file contained matches (in addition to capturing the matches in a file).
egrep "TimeStamp: $2/$3/$1" inFile > outFile && printf '%s\n' inFile
When processing an entire directory, the simple and POSIX-compliant - but inefficient - approach is to process files in a loop:
for f in *; do
[ -f "$f" ] || continue # skip non-files or break, if dir. is empty
egrep "TimeStamp: $2/$3/$1" "$f" >> outFile && printf '%s\n' "$f"
done
If you use bash and GNU grep or BSD grep (also used on OSX), there's a more efficient solution:
egrep -sH "TimeStamp: $2/$3/$1" * |
tee >(cut -d: -f1 | sort -u > outFilenames) |
cut -d: -f2- > outFile
Since * potentially also matches directories, -s suppresses the error messages stemming from (invariably failing) attempts to process them as files.
-H ensures that each matching line is prefixed with the input filename followed by :
tee >(...) ... sends input to both stdout and the command inside >(...).
cut -d: -f1 | sort -u extracts the matching filenames from the result lines, creates a sorted list without duplicates from them, and sends them to file outFilenames.
cut -d: -f2- then extracts the matching lines (stripped of their filename prefix) and captures them in file outFile.
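If process substitution is not available (plain sh), the same split can be sketched with one grep pass and awk writing both files. The helper name split_matches, the output file names and the sample files are all made up for illustration, and like the cut version this assumes file names contain no colon.

```sh
#!/bin/sh
# usage: split_matches PATTERN NAMES_FILE LINES_FILE FILE...
split_matches() {
    pattern=$1 names_out=$2 lines_out=$3
    shift 3
    : > "$names_out"; : > "$lines_out"       # start with empty output files
    grep -sH -- "$pattern" "$@" |
    awk -v names="$names_out" -v lines="$lines_out" '
        {
            i = index($0, ":")               # first colon ends the file name prefix
            name = substr($0, 1, i - 1)
            if (!(name in seen)) {           # record each file name only once
                seen[name] = 1
                print name > names
            }
            print substr($0, i + 1) > lines  # matching line without its prefix
        }'
}

# Demo with throwaway sample files
work=$(mktemp -d)
printf 'TimeStamp: 03/15/2014\nfirst\n' > "$work/a.txt"
printf 'TimeStamp: 04/01/2014\nsecond\n' > "$work/b.txt"
( cd "$work" && split_matches 'TimeStamp: 03/15/2014' names.out lines.out *.txt )
cat "$work/names.out" "$work/lines.out"
```

Passing *.txt rather than * keeps the two output files out of the search on a second run.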

grep -l
Explanation
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which
output would normally have been printed. The scanning will stop on the first
match. (-l is specified by POSIX.)
Source
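A small sketch of what -l gives you, using made-up sample files in a temporary directory. Only the names of the files that contain the time stamp are printed.

```sh
#!/bin/sh
work=$(mktemp -d)
printf 'TimeStamp: 03/15/2014\nfoo\n' > "$work/a.txt"
printf 'TimeStamp: 04/01/2014\nbar\n' > "$work/b.txt"
printf 'TimeStamp: 03/15/2014\nbaz\n' > "$work/c.txt"

# -l prints one file name per matching file instead of the matching lines
names=$(cd "$work" && grep -l 'TimeStamp: 03/15/2014' *)
printf '%s\n' "$names"
# prints:
#   a.txt
#   c.txt
```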

Related

How to find specific text in a text file, and append it to the filename?

I have a collection of plain text files which are named as yymmdd_nnnnnnnnnn.txt, which I want to append another number sequence to the filenames, so that they each become named as yymmdd_nnnnnnnnnn_iiiiiiiii.txt instead, where the iiiiiiiii is taken from the one line in each file which contains the text "GST: 123456789⏎" (or similar) at the end of the line. While I am sure that there will only be one such matching line within each file, I don't know exactly which line it will be on.
I need an elegant one-liner solution that I can run over the collection of files in a folder, from a bash script file, to rename each file in the collection by appending the specific GST number for each filename, as found within the files themselves.
Before even getting to the renaming stage, I have encountered a problem with this. Here is what I tried, which didn't work...
# awk '/\d+$/' | grep -E 'GST: ' 150101_2224567890.txt
The grep command alone works perfectly to find the relevant line within the file, but the awk doesn't return just the final digits group. It fails with the error "warning: regexp escape sequence \d is not a known regexp operator". I had assumed that this regex should return any number of digits which are at the end of the line. The text file in question contains a line which ends with "GST: 112060340⏎". Can someone please show me how to make this work, and maybe also to help with the appropriate coding to move the collection of files to the new filenames? Thanks.
Thanks to a comment from @Renaud, I now have the following code working to obtain just the GST registration number from within a text file, which puts me a step closer towards a workable solution.
awk '/GST: / {printf $NF}' 150101_2224567890.txt
I still need to loop this over the collection instead of just specifying one filename. I also need to be able to use the output from @Renaud's contribution, to rename the files. I'm getting closer to a working solution, thanks!
This awk should work for you:
awk '$1=="GST:" {fn=FILENAME; sub(/\.txt$/, "", fn); print "mv", FILENAME, fn "_" $2 ".txt"; nextfile}' *_*.txt | sh
To make it more readable:
awk '$1 == "GST:" {
fn = FILENAME
sub(/\.txt$/, "", fn)
print "mv", FILENAME, fn "_" $2 ".txt"
nextfile
}' *_*.txt | sh
Remove | sh from above to see all mv commands together.
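On the warning from the question: \d is a Perl-style shorthand that awk's regular expressions do not define, so the portable spelling is a bracket expression such as [0-9]. Also, in the attempted pipeline awk was never given the file, since grep read it directly. A minimal sketch of pulling out just the trailing digits with match() follows; the sample file content is made up.

```sh
#!/bin/sh
sample=$(mktemp)
printf 'Invoice 42\nGST: 112060340\nTotal: 10 units\n' > "$sample"

# On the GST line, match() locates the run of digits at the end of the line
# and sets RSTART/RLENGTH, which substr() then uses to extract it
digits=$(awk '/GST: / && match($0, /[0-9]+$/) {
    print substr($0, RSTART, RLENGTH)
}' "$sample")

printf '%s\n' "$digits"
# prints: 112060340
```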
You may try
for f in *_*.txt; do echo mv "$f" "${f%.txt}_$(sed '/.*GST: /!d; s///; q' "$f").txt"; done
Drop the echo if you're satisfied with the output.
As you are sure there is only one matching line, you can try:
$ n=$(awk '/GST:/ {print $NF}' 150101_2224567890.txt)
$ mv 150101_2224567890.txt "150101_2224567890_$n.txt"
Or, for all .txt files:
for f in *.txt; do
n=$(awk '/GST:/ {print $NF}' "$f")
if [[ -z "$n" ]]; then
printf '%s: GST not found\n' "$f"
continue
fi
mv "$f" "${f%.txt}_$n.txt"
done
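To try the renaming safely, the name computation can be pulled into a small function and run against a throwaway copy first. This is only a sketch: new_name_for is a hypothetical helper name and the sample file content is invented.

```sh
#!/bin/sh
# Print the target name for one file, or fail if no GST line is found
new_name_for() {
    # last field of the first line containing "GST:", then stop reading
    gst=$(awk '/GST:/ { print $NF; exit }' "$1")
    [ -n "$gst" ] || return 1
    # ${1%.txt} strips the extension so the number can go before it
    printf '%s_%s.txt\n' "${1%.txt}" "$gst"
}

work=$(mktemp -d)
printf 'Invoice\nGST: 112060340\nTotal: 10\n' > "$work/150101_2224567890.txt"

for f in "$work"/*_*.txt; do
    target=$(new_name_for "$f") || { printf '%s: GST not found\n' "$f"; continue; }
    mv -- "$f" "$target"
done
ls "$work"
# prints: 150101_2224567890_112060340.txt
```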
Another one-line solution to consider, although perhaps not so elegant.
for original_filename in *_*.txt; do \
new_filename=${original_filename%'.txt'}_$(
grep -E 'GST: ' "$original_filename" | \
sed -E 's/.*GST//g; s/[^0-9]//g'
)'.txt' && \
mv "$original_filename" "$new_filename"; \
done
Output:
150101_2224567890_123456789.txt
If you are open to a multi line script:-
#!/bin/sh
for f in *.txt; do
prefix=$(echo "${f}" | sed s'#\.txt##')
cp "${f}" f1
sed -i s'#GST#%GST#' "./f1"
cat "./f1" | tr '%' '\n' > f2
number=$(cat "./f2" | sed -n '/GST/'p | cut -d':' -f2 | tr -d ' ')
newname="${prefix}_${number}.txt"
mv -v "${f}" "${newname}"
rm -v "./f1"
rm -v "./f2"
done
In general, if you want to make your files easy to work with, then leave as many potential places for them to be split with newlines as possible. It is much easier to alter files by simply being able to put what you want to delete or print on its own line, than it is to search for things horizontally with regular expressions.

sed regex command to output lines that end with html?

I need a sed regex command that will output every line in a file that ends with 'html', and does NOT start with 'a'.
Would my current code work?
sed 's/\[^a]\*\.\(html)\/p' text.txt
The sed command would be
sed -n '/^[^a].*html$/p'
But the canonical command to print matching lines is grep:
grep '^[^a].*html$'
Sed just overcomplicates things; you can use grep to handle that easily!
egrep "^[^a].+\.html$" sourcefile > text.txt
//egrep opens sourcefile itself
egrep "^[^a].+\.html$" < sourcefile > text.txt
//The input redirect connects sourcefile to egrep's stdin
//for this stage of the pipeline
are equivalent functionally.
or
pipe input | xargs -n1 egrep "^[^a].+\.html$" > text.txt
//xargs -n1 runs egrep once per whitespace-separated item read from the pipe, passing that item to egrep as the name of a file to search (so the pipe must supply file names, not the text to filter)
// ^ means from start of line,
//. means any one character
//+ means the previous matched expression(which can be a
//(pattern group)\1 or [r-ange], etc) one or more times
//\. means escape the single character match and match the period character
//$ means end of line(new line character)
//egrep is grep with extended regular expressions (the same as grep -E), which are really nice
(assuming you aren't using a pipe or cat, etc)
You can convert a newline delimited file into a single input line with this command:
cat file | tr '\n' ' '
//It converts all newlines into spaces!
Anyway, get creative with simple utilities and you can do a lot:
xargs, grep, tr are a good combo that are easy to learn. Without the sedness of it all.
Don't do this with sed. Do it with two different calls to grep
grep -v ^a file.txt | grep 'html$'
The first grep gets all the lines that do not start with "a", and sends the output from that into the second grep, which pulls out all the lines that end with "html".
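Here is a quick comparison of the single-pattern and two-grep approaches on made-up sample lines. The two agree except on a line that is exactly html: the single pattern needs at least one character before html, so it skips that line, while the two-grep pipeline keeps it.

```sh
#!/bin/sh
sample=$(mktemp)
printf '%s\n' 'index.html' 'about.html' 'notes.txt' 'b/page.html' 'apple' 'html' > "$sample"

# One pattern: first character is not "a", line ends in "html"
one_pass=$(grep '^[^a].*html$' "$sample")
# Two filters: drop lines starting with "a", then keep lines ending in "html"
two_pass=$(grep -v '^a' "$sample" | grep 'html$')

printf 'one pass:\n%s\n' "$one_pass"
printf 'two pass:\n%s\n' "$two_pass"
# one pass prints index.html and b/page.html
# two pass prints index.html, b/page.html and html
```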

Sed multiple conditions regex matching

I am making a bash script that will take a txt file as input, delete all lines containing dash ("-") or any integer (anywhere in the line) from it and parse it to a new file.
I tried multiple ways but I had 0 success.
I'm stuck trying to figure out correct regex for "delete all lines containing number OR dash" since I can't make it work.
Here's my code:
wget -q awsfile1.csv.zip # downloads file
unzip "awsfile1".zip # unzips it
cut -d, -f 2 file1.csv > file2.csv # cuts it
sort file2.csv > file2.txt # translates csv into text
printf "Removing lines containing numbers.\n" # prints output
sed 's/[0-9][0-9]*/Number/g' file2.txt > file2-b.txt # doesn't do anything, file is empty on the output
Thanks.
You can combine the cut and the filter into one awk script and sort afterwards:
... get and unzip file
$ awk -F, '$2!~/[-0-9]/{print $2}' file | sort
This prints field 2 if it doesn't contain any digit or hyphen.
This might work for you (GNU sed):
sed -E 'h;s/\S+/\n&\n/2;/\n.*[-0-9].*\n/d;x' file
Copy the current line, isolate the 2nd field and delete the line if it contains the required strings, otherwise revert to the original line.
N.B. This prints the original line, if you only want the 2nd field, use:
sed -E 's/\S+/\n&\n/2;s/.*\n(.*)\n.*/\1/;/[-0-9]/d' file
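For the narrower task stated in the question (drop every line that contains a digit or a dash), a plain sed delete is probably enough. The s command in the question substitutes text rather than deleting lines, whereas d removes the whole line. A sketch on made-up sample lines:

```sh
#!/bin/sh
input=$(mktemp)
printf '%s\n' 'alpha' 'beta-1' 'gamma2' 'delta' 'us-east' > "$input"

# [-0-9] matches a dash or any digit; the dash comes first so it is literal.
# d deletes every line on which the bracket expression matches.
kept=$(sed '/[-0-9]/d' "$input")

printf '%s\n' "$kept"
# prints:
#   alpha
#   delta
```

grep -v '[-0-9]' would give the same result.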

Move around the characters in a filename

I've got a folder full of files with the names abc1234, abc5678, etc., and I want to switch them to abc3412, abc7856, etc. – just swap the last two characters out with the second-to-last two characters. The filenames are all in this format, no surprises. What's the easiest way to do this with a regex?
Use perl rename?
rename 's/(..)(..)$/$2$1/' *
Depending on your platform, you may have a rename utility that can directly do what you want.
For instance, anishsane's answer shows an elegant option using a Perl-based renaming utility.
Here's a POSIX-compliant way to do it:
printf '"%s"\n' * | sed 'p; s/\(..\)\(..\)"$/\2\1"/' | xargs -L 2 mv
printf '"%s"\n' * prints all files in the current folder line by line (if there are subdirs., you'd have to exclude them), enclosed in literal double-quotes.
sed 'p; s/\(..\)\(..\)"$/\2\1"/' produces 2 output lines:
p prints the input line as-is.
s/\(..\)\(..\)"$/\2\1"/ matches the last 2 character pairs (before the closing ") on the input lines and swaps them.
The net effect is that each input line produces a pair of output lines: the original filename, and the target filename, each enclosed in double-quotes.
xargs -L 2 mv then reads pairs of input lines (-L 2) and invokes the mv utility with each line as its own argument, which results in the desired renaming. Having each line enclosed in double-quotes ensures that xargs treats them as a single argument, even if they should contain whitespace.
Tip of the hat to anishsane for the enclose-in-literal-double-quotes approach, which makes the solution robust.
Note: If you're willing to use non-POSIX features, you can simplify the command as follows, to bypass the need for extra quoting:
GNU xargs:
printf '%s\n' * | sed 'p; s/\(..\)\(..\)$/\2\1/' | xargs -d '\n' -L 2 mv
Nonstandard option -d '\n' tells xargs to not perform word splitting on lines and treat each line as a single argument.
BSD xargs (also works with GNU xargs):
printf '%s\n' * | sed 'p; s/\(..\)\(..\)$/\2\1/' | tr '\n' '\0' | xargs -0 -L 2 mv
tr '\n' '\0' replaces newlines with \0 (NUL) chars, which is then specified as the input-line separator char. for xargs with the nonstandard -0 option, again ensuring that each input line is treated as a single argument.
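If you'd rather avoid xargs and quoting concerns altogether, the swap can also be sketched with plain shell parameter expansion in a loop. Note that this is a different technique from the sed pipelines above, not a regex. swap_last_pairs is a hypothetical helper name and the files are created in a throwaway directory.

```sh
#!/bin/sh
# Print NAME with its last two character pairs swapped (abc1234 -> abc3412)
swap_last_pairs() {
    name=$1
    stem=${name%????}                    # everything except the last four characters
    [ "$stem" != "$name" ] || return 1   # fewer than four characters: nothing to swap
    last4=${name#"$stem"}
    # ${last4#??} is the final pair, ${last4%??} is the pair before it
    printf '%s%s%s\n' "$stem" "${last4#??}" "${last4%??}"
}

work=$(mktemp -d)
: > "$work/abc1234"
: > "$work/abc5678"

for f in "$work"/*; do
    [ -f "$f" ] || continue
    new=$(swap_last_pairs "${f##*/}") || continue
    mv -- "$f" "$work/$new"
done
ls "$work"
# prints:
#   abc3412
#   abc7856
```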

sed while loop text format

I need to change some lines into another format using this while loop.
while IFS= read -r line;
do
var=$(echo "$line" | grep -oE "http://img[0-9].domain.xy/t/[0-9][0-9][0-9]/[0-9][0-9][0-9]/" | uniq);
echo "$line" | sed -e 's|http://img[0-9].domain.xy/t/[0-9][0-9][0-9]/[0-9][0-9][0-9]/||g' -e "s|.*|&"${var}"|g" >> newFile;
done < file;
that changes this format
<iframe src="http://domain.xy/load.php?file=2259929" frameborder="0" scrolling="no"></iframe>|http://img9.domain.xy/t/929/320/1_2259929.jpg;http://img9.domain.xy/t/929/320/2_2259929.jpg;http://img9.domain.xy/t/929/320/3_2259929.jpg;http://img9.domain.xy/t/929/320/4_2259929.jpg;http://img9.domain.xy/t/929/320/5_2259929.jpg;http://img9.domain.xy/t/929/320/6_2259929.jpg;http://img9.domain.xy/t/929/320/7_2259929.jpg;http://img9.domain.xy/t/929/320/8_2259929.jpg;http://img9.domain.xy/t/929/320/9_2259929.jpg;http://img9.domain.xy/t/929/320/10_2259929.jpg|13m5s
and gives me that output.
<iframe src="http://domain.xy/load.php?file=2259929" frameborder="0" scrolling="no"></iframe>|1_2259929.jpg;2_2259929.jpg;3_2259929.jpg;4_2259929.jpg;5_2259929.jpg;6_2259929.jpg;7_2259929.jpg;8_2259929.jpg;9_2259929.jpg;10_2259929.jpg|13m5s|http://img9.domain.xy/t/929/320/
That all works correctly.
But there is also a time value that I want to change: 13m5s to 00:13:5 or, better, 13m5s to 00:13:05.
I tried to use another grep + sed command at the end of the loop.
while IFS= read -r line;
do
var=$(echo "$line" | grep -oE "http://img[0-9].domain.xy/t/[0-9][0-9][0-9]/[0-9][0-9][0-9]/" | uniq);
echo "$line" | sed -e 's|http://img[0-9].domain.xy/t/[0-9][0-9][0-9]/[0-9][0-9][0-9]/||g' -e "s|.*|&"${var}"|g" >> newFile;
done < file;
grep -oE "[0-9]*m[0-9]*[0-9]s" newFile | sed -e 's|^|00:|' -e s'|m|:|' -e s'|s||'
this gives me only the output of the numbers not the full line.
00:13:5
00:3:18
00:1:50
and so on
How can I get the full line and just change 13m5s to 00:13:5?
If I just use sed after the while loop without grep, it changes the wrong letters and puts 00: at the beginning of every line.
What is the best way to handle that? I think it would be best to integrate the command into the existing loop, but I have tried many different variations without a result.
Thanks for helping.
I broke your code apart in to a few additional pieces to make understanding what was going on easier. Here's the result which I believe is correct:
# Read each field in to separate variables
while IFS='|' read iframe urls time; do
# Get the first URL from the ';'-separated list
url="${urls%%;*}"
# Get the base URL by matching up to the last '/' (and add it back since the match is exclusive)
base_url="${url%/*}"'/'
# Remove the base URL from the list of full URLs so only the filenames are left
files="${urls//$base_url/}"
# Parse the minute and second out from the '##m#s' string
IFS='ms' read min sec <<<"$time"
# Print the new line - note the numeric formatting in the third column
printf '%s|%s|00:%02d:%02d|%s\n' "$iframe" "$files" "$min" "$sec" "$base_url"
done <file
The lines that answer your specific request about how to turn 13m5s in to 00:13:05 are these two:
IFS='ms' read min sec <<<"$time"
printf '%s|%s|00:%02d:%02d|%s\n' "$iframe" "$files" "$min" "$sec" "$base_url"
The read line uses IFS to tell it to split on the characters m or s making it able to easily read the minute and second variables.
The printf line, with 00:%02d:%02d specifically, formats the $min and $sec variables as zero-padded two-digit numbers.
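The time conversion on its own can be sketched as a small function. to_hms is a hypothetical name, and a here-document is used instead of <<< so it also runs in plain sh. One addition that is my own assumption about the data: a leading zero is stripped before printf, because %d would otherwise read a value such as 08 as invalid octal.

```sh
#!/bin/sh
# Convert a value like 13m5s to 00:13:05
to_hms() {
    # Split on the letters m and s; rest absorbs the empty field after the trailing s
    IFS=ms read -r min sec rest <<EOF
$1
EOF
    # Strip one leading zero so printf does not treat 08 or 09 as octal
    min=${min#0}
    sec=${sec#0}
    printf '00:%02d:%02d\n' "${min:-0}" "${sec:-0}"
}

to_hms 13m5s    # prints 00:13:05
to_hms 3m18s    # prints 00:03:18
to_hms 1m08s    # prints 00:01:08
```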
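grep -o only outputs the parts of lines that match the expression. Use sed's built-in line matching to restrict the substitution to certain lines:
sed '/[0-9]*m[0-9]*[0-9]s/{s|^|00:|;s|m|:|;s|s||;}'
or, since that form edits the first m and s on the line wherever they occur, maybe this, which rewrites only the time value: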
sed 's/\([0-9]*\)m\([0-9]*[0-9]\)s/00:\1:\2/'
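To keep the full line and also get the zero padding, the substitution can be anchored on the surrounding | separators and followed by two padding steps. This is a sketch only: the sample line is an abbreviated, made-up version of the loop output, and it assumes the time value always sits between two | characters.

```sh
#!/bin/sh
line='<iframe src="x"></iframe>|1_2259929.jpg;2_2259929.jpg|13m5s|http://img9.domain.xy/t/929/320/'

converted=$(printf '%s\n' "$line" | sed \
    -e 's/|\([0-9][0-9]*\)m\([0-9][0-9]*\)s|/|00:\1:\2|/' \
    -e 's/|00:\([0-9]\):/|00:0\1:/' \
    -e 's/|\(00:[0-9][0-9]\):\([0-9]\)|/|\1:0\2|/')
# 1st expression: |13m5s| becomes |00:13:5|
# 2nd expression: pads a one-digit minute value (no change here)
# 3rd expression: pads a one-digit second value, giving |00:13:05|

printf '%s\n' "$converted"
```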