Advanced pattern matching in Makefile - regex

Is it possible to create a Makefile pattern rule with two or three varying parts? I'm using GNU make.
In my current set-up, in simplified form, I'm using two Bash for-loops in order to convert a certain set of files to another set, and to create the final result file. Example:
#!/bin/bash
XMIN=$1
XMAX=$2
YMIN=$3
YMAX=$4
z=$5
FINAL_LIST=
for y in `seq $YMIN $YMAX`;
do
SOURCE_LIST=
echo Processing column $y
for x in `seq $XMIN $XMAX`;
do
# Convert from file source/something_${x}_${y}_${z} to
# target/something_else_${x}_${y}_${z}
echo Processing X ${x} Y ${y} with Z ${z}
# do_something
SOURCE_LIST+="target/something_else_${x}_${y}_${z} "
done
# Create something for this line
echo Processing ${SOURCE_LIST} target_line_${y}_${z}
# process the line
FINAL_LIST+="target_line_${y}_${z} "
done
# Finally, compose the final thing
echo Process the final result: ${FINAL_LIST} result_${z}
# process the final result
# We're done
I would like to do this more effectively with a Makefile, since make would let me execute things in parallel and would also regenerate a "line result" only when something changes for that particular line.
I'm already using a Makefile to convert single datafiles to another format, with simple pattern matching. Make is very good at handling my base of >500k datafiles: it detects changed source files very quickly and executes the conversion only for the changed datafiles.
The problem here is that I don't know how to write Makefile patterns with more than one varying part. The following is an easy pattern:
%.target : %.source
# do something
But I don't know whether the following would be possible (as pseudocode):
<var_pat_Z>_<var_pat_Y>.target: <var_pat_Z>_<var_pat_Y>.source
# do something else
It is not necessary to implement this with a Makefile, but I would still need a way to detect changed source files and the ability to execute things in parallel. Currently I handle the change detection in my bash scripts, and the parallelization by running the bash scripts in parallel with GNU parallel. That is most likely not the optimal way.

If I understood your question correctly, you have a bunch of *.source files, and want a rule that turns each into a *.target file, while picking two sub-strings from whatever the * expands to.
Why not pick the stem in $* apart at the underscore? Here's a solution.
If you have these files
$ ls *.source
1_1.source
1_2.source
1_3.source
a_b.source
foo_bar.source
then running this GNUmakefile's default target
# all should depend on all targets for which a source exists.
all: $(shell echo *.source | sed 's/source/target/g')
%.target: %.source
	@z="$*" y="$*"; \
	z=$${z%%_*} y=$${y##*_}; \
	echo z=$$z y=$$y
will give you
$ gmake
z=1 y=1
z=1 y=2
z=1 y=3
z=a y=b
z=foo y=bar
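The two parameter expansions used in the recipe can be tried in plain bash, outside make. This is only a minimal sketch: the function name split_stem is made up for illustration, and inside a Makefile recipe every $ has to be doubled as in the rule above.

```shell
#!/bin/bash
# split_stem is a hypothetical helper, not part of the answer above.
# It splits a stem such as "foo_bar" at the underscore.
split_stem() {
    local stem=$1
    local z=${stem%%_*}   # remove the longest suffix that starts at the first "_"
    local y=${stem##*_}   # remove the longest prefix that ends at the last "_"
    echo "z=$z y=$y"
}

split_stem foo_bar
split_stem 1_3
```

With a three-part stem such as a_b_c, the same expansions keep only the first and last fields, so a third pattern would need one more expansion step.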

Related

Mass rename in shell script

I have a bunch of files which are of this format:
blabla.log.YYYY.MM.DD
Where YYYY.MM.DD is something like (2016.01.18)
I have quite a few folders with about 1000 files in each, so I wanted to have a simple script to rename them. I want to rename them to
blabla.log
So basically, I'm just stripping the date at the end. Here is what I have:
for f in [a-zA-Z]*.log.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]; do
mv -v $f ${f#[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]};
done
This script outputs this:
mv: `blabla.log.2016.01.18' and `blabla.log.2016.01.18' are the same file
For more information:
I'm on windows, but I run this script in gitbash
For some reason, my gitbash doesn't recognize the "rename" command
Some regex patterns (like [0-9]{4}) don't seem to work
I'm really at a loss. Thanks.
EDIT: I need to rename every single file that has a date at the end and that is of the form: *.log.2016.01.18. They all need to keep their original names. All that should change is the removal of the date.
You have to use % instead of #: you want to remove from the end, not the start of your string.
Also, you're missing a . in what has to be removed, you don't want to end up with blabla.log..
Quoting the variable names prevents surprises when file names contain special characters.
Together:
mv -v "$f" "${f%.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]}"
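The effect of the % expansion can be checked without renaming anything. A small sketch, assuming the hypothetical helper name strip_date:

```shell
#!/bin/bash
# strip_date is a hypothetical helper name used only for illustration.
strip_date() {
    local f=$1
    # % removes the shortest matching suffix; the leading "." is part of the match
    echo "${f%.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]}"
}

strip_date blabla.log.2016.01.18   # name with a date suffix
strip_date blabla.log              # name without one is left unchanged
```

Printing the new names first like this is a cheap way to review the result before putting the expansion into the mv loop.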

Run multiple tools as single bash script

I run different programs in isolation: say, one command line for a C++ tool and another one for an R tool. I first run the command line for the C++ app, which gives me a result file. Only then can I run the command line for the R app, which requires the result file from the C++ app.
I may have many different data sets to process. Is there a way to write a bash script that loops over the different tools (C++, R, or any other), so that I don't have to type many command lines manually?
I would like to go to sleep, while a time consuming loop is making noise in my computer.
Running multiple, different programs in some defined order is the fundamental idea of a (systems) scripting language like bash:
## run those three programs in sequence
first argument
second parameter
third
# same as
first argument; second parameter; third
You can do a lot of fancy things, like redirecting input and output streams:
grep secret secrets.file | grep -v strong | sort > result.file
# pipe | feeds everything from the standard output
# of the program on the left into
# the standard input of the one on the right
This includes also things like conditionals and of course, loops:
while IFS= read -r -d '' file; do
preprocess "$file"
some_work | generate "$file.output"
done < <(find ./data -type f -name 'source*' -print0)
As you might see, bash is a programming language on its own, with a bit of a weird syntax IMHO.
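For the two-step case in the question, a sketch might look like the following. The names cpp_tool and r_tool are placeholders for the real programs (they are defined here as shell functions only so that the sketch runs on its own), and the .dat, .out and .res suffixes are assumptions.

```shell
#!/bin/bash
# Placeholder stand-ins for the real programs; replace them with the actual commands.
cpp_tool() { tr 'a-z' 'A-Z' < "$1" > "$2"; }   # hypothetical first step
r_tool()   { wc -c < "$1" > "$2"; }            # hypothetical second step

# Run both steps for every *.dat file in a directory.
# && makes the second step run only if the first one succeeded.
process_all() {
    local dir=$1 input
    for input in "$dir"/*.dat; do
        [ -e "$input" ] || continue            # no matches: skip the literal pattern
        cpp_tool "$input" "$input.out" &&
            r_tool "$input.out" "$input.res" ||
            echo "failed: $input" >&2
    done
}
```

It would be called as process_all ./data. A failure in either step is reported on standard error and the loop moves on to the next input file.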

Specifying a range of files using regex

I have a huge amount of files (in the hundreds of thousands) that all have the same format of name.
The filename format is:
[prefix][number][suffix]
where the [prefix] and [suffix] of all the files are the same, and just the number part changes. The number part is something like 004732
So the filenames are:
[prefix]004732[suffix]
[prefix]004733[suffix]
[prefix]004734[suffix]
etc.
I need to move a range of about 100,000 files (with consecutive numbers) to another directory, and I was wondering if it is possible to do this with a regular expression.
You're looking for character classes. It's a bit difficult to specify number ranges using regex because it works on text, not numbers, but it can be done something like this (for files 000-199):
prefix[0-1][0-9][0-9]suffix
prefix[0-1]\d\dsuffix # \d also works in Perl-style regex
More complicated numbers get trickier. For 0-211:
prefix([0-1][0-9][0-9]|20[0-9]|21[0-1])suffix
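The 0-211 pattern can be checked against a few names with bash's [[ =~ ]] operator. This is a sketch only: in_range is a hypothetical helper, and "prefix" and "suffix" are literal placeholders for the real strings.

```shell
#!/bin/bash
# in_range is a hypothetical helper; "prefix" and "suffix" are literal placeholders.
in_range() {
    # anchored so that longer numbers such as 2110 do not match by accident
    local re='^prefix([0-1][0-9][0-9]|20[0-9]|21[0-1])suffix$'
    [[ $1 =~ $re ]]
}

for name in prefix000suffix prefix199suffix prefix211suffix prefix212suffix; do
    if in_range "$name"; then
        echo "$name: match"
    else
        echo "$name: no match"
    fi
done
```

The first three names fall inside the range and match; prefix212suffix does not.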
If you're on Windows, install Cygwin; if you're on Mac OS X or Linux, just open a terminal. Then run the following:
ls PREFIX* | sed -n 's/PREFIX\(0[0-9]\)SUFFIX/mv & tmp\/PREFIX\1SUFFIX/p' | sh
What is this doing?
Lists all files starting with the specified prefix
Pipes this list to sed, which uses a regex pattern to match only files that fall within the range you specify
Create a new string using the move command
Pipes the move command string to the shell (sh) and executes it
You can tweak the regex to match your number range by looking at the following:
http://www.regular-expressions.info/numericranges.html
To the best of my knowledge, there is no regex (to handle complex cases), but you can easily use a loop:
The following code runs on Linux. I ran similar code on Windows using Cygwin and it works as well. Maybe there is a similar way to do this natively in Windows.
If the two numbers have the same number of digits:
Example: from
[prefix]000012345[suffix]
to
[prefix]000056789[suffix]
:
for (( i=12345; i<=56789; i++ )); do mv "[prefix]0000${i}[suffix]" /newDirectoryPath; done
Otherwise you can do it with multiple (usually two or three) commands:
Example: from
[prefix]000012345[suffix]
to
[prefix]003456789[suffix]
:
for (( i=12345; i<=99999; i++ )); do mv "[prefix]0000${i}[suffix]" /newDirectoryPath; done
for (( i=100000; i<=999999; i++ )); do mv "[prefix]000${i}[suffix]" /newDirectoryPath; done
for (( i=1000000; i<=3456789; i++ )); do mv "[prefix]00${i}[suffix]" /newDirectoryPath; done
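A possible alternative to splitting the range by digit count is to let printf do the zero padding, so that one loop covers the whole range. This is a sketch under the assumption that the number part is always 9 digits wide, as in the example names; padded_name is a hypothetical helper.

```shell
#!/bin/bash
# padded_name is a hypothetical helper; the width of 9 digits matches the
# example names above and is an assumption about the real files.
padded_name() {
    printf '[prefix]%09d[suffix]\n' "$1"
}

# Dry run: print the mv commands for a small range instead of executing them.
for (( i = 99998; i <= 100001; i++ )); do
    echo mv "$(padded_name "$i")" /newDirectoryPath
done
```

Dropping the echo turns the dry run into the real move once the printed names look right.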

Apply regular expression substitution globally to many files with a script

I want to apply a certain regular expression substitution globally to about 40 Javascript files in and under a directory. I'm a vim user, but doing this by hand can be tedious and error-prone, so I'd like to automate it with a script.
I tried sed, but handling more than one line at a time is awkward, especially if there is no limit to how many lines the pattern might match.
I also tried this script (on a single file, for testing):
ex $1 <<EOF
gs/,\(\_\s*[\]})]\)/\1/
EOF
The pattern will eliminate a trailing comma in any Perl/Ruby-style list, so that "[a, b, c,]" will come out as "[a, b, c]" in order to satisfy Internet Explorer, which alone among browsers, chokes on such lists.
The pattern works beautifully in vim but does nothing if I run it in ex, as per the above script.
Can anyone see what I might be missing?
You asked for a script, but you mentioned that you are vim user. I tend to do project-wide find and replace inside of vim, like so:
:args **/*.js | argdo %s/,\(\_\s*[\]})]\)/\1/ge | update
This is very similar to the :bufdo solution mentioned by another commenter, but it will use your args list rather than your buflist (and thus doesn't require a brand new vim session nor for you to be careful about closing buffers you don't want touched).
:args **/*.js - sets your arglist to contain all .js files in this directory and subdirectories
| - pipe is vim's command separator, letting us have multiple commands on one line
:argdo - run the following command(s) on all arguments. it will "swallow" subsequent pipes
% - a range representing the whole file
:s - substitute command, which you already know about
:s_flags, ge - global (substitute as many times per line as possible) and suppress errors (i.e. "No match")
| - this pipe is "swallowed" by the :argdo, so the following command also operates once per argument
:update - like :write but only when the buffer has been modified
This pattern will obviously work for any vim command which you want to run on multiple files, so it's a handy one to keep in mind. For example, I like to use it to remove trailing whitespace (%s/\s\+$//), set uniform line-endings (set ff=unix) or file encoding (set fileencoding=utf8), and retab my files.
1) Open all the files with vim:
bash$ vim $(find . -name '*.js')
2) Apply substitute command to all files:
:bufdo %s/,\(\_\s*[\]})]\)/\1/ge
3) Save all the files and quit:
:wall
:q
I think you'll need to recheck your search pattern, it doesn't look right. I think where you have \_\s* you should have \_s* instead.
Edit: You should also use the /ge options for the :s... command (I've added these above).
You can automate the actions of both vi and ex by passing the argument +'command' from the command line, which enables them to be used as text filters.
In your situation, the following command should work fine:
find /path/to/dir -name '*.js' | xargs ex +'%s/,\(\_\s*[\]})]\)/\1/g' +'wq!'
You can use a combination of the find command and sed:
find /path -type f -iname "*.js" -exec sed -i.bak 's/,[ \t]*]/]/' "{}" +;
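Before running sed -i on real files, the substitution can be tried on sample input. The sketch below uses a slightly broader pattern than the command above (it also handles } and ) as closing characters, like the vim pattern in the question); strip_trailing_comma is a hypothetical name.

```shell
#!/bin/bash
# strip_trailing_comma is a hypothetical name; it reads stdin and writes stdout.
strip_trailing_comma() {
    # drop a comma that is followed only by optional whitespace and a closing ], } or )
    sed 's/,\([[:space:]]*[]})]\)/\1/g'
}

echo '[a, b, c,]' | strip_trailing_comma
echo 'f(x, y, )' | strip_trailing_comma
```

Because sed works line by line, a comma at the end of one line whose closing bracket sits on the next line is not touched by this sketch.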
If you are on windows, Notepad++ allows you to run simple regexes on all opened files.
Search for ,\s*\] and replace with ]
should work for the type of lists you describe.

Controlling shell command line wildcard expansion in C or C++

I'm writing a program, foo, in C++. It's typically invoked on the command line like this:
foo *.txt
My main() receives the arguments in the normal way. On many systems, argv[1] is literally *.txt, and I have to call system routines to do the wildcard expansion. On Unix systems, however, the shell expands the wildcard before invoking my program, and all of the matching filenames will be in argv.
Suppose I wanted to add a switch to foo that causes it to recurse into subdirectories.
foo -a *.txt
would process all text files in the current directory and all of its subdirectories.
I don't see how this is done, since, by the time my program gets a chance to see the -a, the shell has already done the expansion and the user's *.txt input is lost. Yet there are common Unix programs that work this way. How do they do it?
In Unix land, how can I control the wildcard expansion?
(Recursing through subdirectories is just one example. Ideally, I'm trying to understand the general solution to controlling the wildcard expansion.)
Your program has no influence over the shell's command line expansion. Which program will be called is determined after all the expansion is done, so it's already too late to change anything about the expansion programmatically.
The user calling your program, on the other hand, has the possibility to create whatever command line he likes. Shells allow you to easily prevent wildcard expansion, usually by putting the argument in single quotes:
program -a '*.txt'
If your program is called like that it will receive two parameters -a and *.txt.
On Unix, you should just leave it to the user to manually prevent wildcard expansion if it is not desired.
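Once the pattern arrives unexpanded, the program has to do the matching itself. As a rough shell-level sketch of that idea, the hypothetical function find_matches below stands in for what foo -a might do internally, using find for the recursive walk:

```shell
#!/bin/bash
# find_matches is a hypothetical stand-in for what "foo -a" would do internally:
# it receives the pattern unexpanded and walks the directory tree itself.
find_matches() {
    local dir=$1 pattern=$2
    find "$dir" -type f -name "$pattern"
}

# Demo in a scratch directory so that no real files are touched.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
touch "$demo/a.txt" "$demo/sub/b.txt" "$demo/c.log"
find_matches "$demo" '*.txt' | sort
rm -rf "$demo"
```

The quotes around '*.txt' in the call are what keep the shell from expanding the pattern before find_matches sees it.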
As the other answers said, the shell does the wildcard expansion - and you stop it from doing so by enclosing arguments in quotes.
Note that options -R and -r are usually used to indicate recursive - see cp, ls, etc for examples.
Assuming you organize things appropriately so that wildcards are passed to your program as wildcards and you want to do recursion, then POSIX provides routines to help:
nftw - file tree walk (recursive access).
fnmatch, glob, wordexp - to do filename matching and expansion
There is also ftw, which is very similar to nftw but it is marked 'obsolescent' so new code should not use it.
Adrian asked:
But I can say ls -R *.txt without single quotes and get a recursive listing. How does that work?
To adapt the question to a convenient location on my computer, let's review:
$ ls -F | grep '^m'
makefile
mapmain.pl
minimac.group
minimac.passwd
minimac_13.terminal
mkmax.sql.bz2
mte/
$ ls -R1 m*
makefile
mapmain.pl
minimac.group
minimac.passwd
minimac_13.terminal
mkmax.sql.bz2
mte:
multithread.ec
multithread.ec.original
multithread2.ec
$
So, I have a sub-directory 'mte' that contains three files. And I have six files with names that start 'm'.
When I type 'ls -R1 m*', the shell notes the metacharacter '*' and uses its equivalent of glob() or wordexp() to expand that into the list of names:
makefile
mapmain.pl
minimac.group
minimac.passwd
minimac_13.terminal
mkmax.sql.bz2
mte
Then the shell arranges to run '/bin/ls' with 9 arguments (program name, option -R1, plus 7 file names and terminating null pointer).
The ls command notes the options (recursive and single-column output), and gets to work.
The first 6 names (as it happens) are simple files, so there is nothing recursive to do.
The last name is a directory, so ls prints its name and its contents, invoking its equivalent of nftw() to do the job.
At this point, it is done.
This uncontrived example doesn't show what happens when there are multiple directories, and so the description above over-simplifies the processing.
Specifically, ls processes the non-directory names first, and then processes the directory names in alphabetic order (by default), and does a depth-first scan of each directory.
foo -a '*.txt'
Part of the shell's job (on Unix) is to expand command line wildcard arguments. You prevent this with quotes.
Also, on Unix systems, the "find" command does what you want:
find . -name '*.txt'
will list all .txt files recursively from the current directory down.
Thus, you could do
foo `find . -name '*.txt'`
I wanted to point out another way to turn off wildcard expansion. You can tell your shell to stop expanding wildcards with the noglob option.
With bash use set -o noglob:
> touch a b c
> echo *
a b c
> set -o noglob
> echo *
*
And with csh, use set noglob:
> echo *
a b c
> set noglob
> echo *
*
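In a bash script, the option can be confined to a subshell so that globbing is switched off only for one command. A small sketch, with echo_literal as a hypothetical helper:

```shell
#!/bin/bash
# echo_literal is a hypothetical helper. Its body runs in a subshell, so the
# noglob setting does not leak into the calling shell.
echo_literal() (
    set -o noglob
    # $1 is deliberately left unquoted: with noglob set it is not expanded
    echo $1
)

# Demo in a scratch directory, also inside a subshell to keep the cwd unchanged.
(
    cd "$(mktemp -d)" && touch a b c
    echo_literal '*'   # the pattern itself
    echo *             # globbing still works outside the helper
)
```

The same scoping can be had with set -o noglob followed later by set +o noglob, at the cost of having to remember the second call.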