Replace groups of text all together with gVim - regex

Consider the following data:
Class Gender Condition Tenis
A Male Fail Fail 33
A Female Fail NotFail 23
S Male Yellow 14
BC Male Happy Elephant 44
I have a comma separated value with unformatted tabulation (it varies among tabs and whitespaces).
In one specific column I have compound words from which I would like to eliminate the space. In the above example, I would like to replace "Fail " with "Fail_" and "Happy " with "Happy_".
The result would be the following:
Class Gender Condition Tenis
A Male Fail_Fail 33
A Female Fail_NotFail 23
S Male Yellow 14
BC Male Happy_Elephant 44
I already managed to do that in two steps:
:%s/Fail /Fail_/g
:%s/Happy /Happy_/g
Question: As I'm very new to gVim, I am trying to perform these replacements in a single command, but I could not find out how to do that.
After this step, I will tabulate my data with the following:
:%s/\s\+/,/g
And get the final result:
Class,Gender,Condition,Tenis
A,Male,Fail_Fail,33
A,Female,Fail_NotFail,23
S,Male,Yellow,14
BC,Male,Happy_Elephant,44
On SO, I searched for [vim] :%s two is:question and some variations, but I could not find a related thread, so I guess I am lacking the correct terminology.
Edit: This is the actual data (with more than 1 million rows). The problem starts in the 12th column (e.g. "Fail Planting" should be "Fail_Planting").
SP1 51F001 3 1 1 2 3 2001 52 52 H Normal 17,20000076 23,39999962 NULL NULL
SP1 51F001 3 1 1 2 3 2001 53 53 F Fail Planting 0 0 NULL NULL
SP1 51F001 3 1 1 2 3 2001 54 54 N Normal 13,89999962 0 NULL NULL

You can use an expression on the right hand side of the substitution.
:%s/\(Fail\|Happy\) \|\s\+/\= submatch(0) =~# '^\s\+$' ? ',' : submatch(1).'_'/g
This matches either Fail or Happy followed by a space, or a run of whitespace, and then checks whether the matched part is entirely whitespace. If it is, it is replaced by a comma; if it is not, the captured part is used with an underscore appended. submatch(0) is the whole match and submatch(1) is the first capture group.
Take a look at :h sub-replace-expression. If you want to do something very complex, you can define a function.
Very magic version
:%s/\v(Fail|Happy) |\s+/\= submatch(0) =~# '^\v\s+$' ? ',' : submatch(1).'_'/g

You have all the parts; you just need to combine them with |. Example:
:%s/\>\s\</_/g|%s/\s\+/,/g
I am using \> and \< to find words that only have one space between them so we can replace it with _.
For more help see:
:h /\>
:h :range
:h :bar

You could perhaps try a macro if there are certain conditions that are true (or write a vimscript, but my vimscript is very rusty). I will show a sample macro you could use:
Go to first line in file after the headings
press q to begin recording a macro
press t to choose the register t for recording to (I use t for "temp")
press ^ to move to the beginning of the line
press 2w to move to the third word (move 2 words to the right)
press e to move to the end of the word
press l (letter l) to move right one character (to the space)
press r to enter replace single character mode
press _ to enter an underscore
press j to move down a line
press q to stop recording the macro
Now that you have the macro stored in register t you can run the macro on every line in the file. If there are 100 lines in the file, you have already done 1 and there is a header, so you would type the following to run it on the remaining 98 lines:
98@t

These two commands:
:%s/\(\a\) \(\a\)/\1_\2/g
:%s/\s\+/,/g
seem to work on your sample:
SP1,51F001,3,1,1,2,3,2001,52,52,H,Normal,17,20000076,23,39999962,NULL,NULL
SP1,51F001,3,1,1,2,3,2001,53,53,F,Fail_Planting,0,0,NULL,NULL
SP1,51F001,3,1,1,2,3,2001,54,54,N,Normal,13,89999962,0,NULL,NULL
but you have decimal numbers here with a comma as separator that will mess with the "comma-separated-ness" of your data. Changing those commas into periods beforehand might be a good idea:
:%s/,/./g
SP1,51F001,3,1,1,2,3,2001,52,52,H,Normal,17.20000076,23.39999962,NULL,NULL
SP1,51F001,3,1,1,2,3,2001,53,53,F,Fail_Planting,0,0,NULL,NULL
SP1,51F001,3,1,1,2,3,2001,54,54,N,Normal,13.89999962,0,NULL,NULL
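The same three steps can be sketched outside Vim, which may be convenient for a file with more than a million rows. The sketch below is in Python and rests on an assumption that is not visible in the pasted sample: columns are separated by tabs (or other non-single-space whitespace), and a single literal space occurs only inside compound values such as "Fail Planting". The function name normalize_row is hypothetical.

```python
import re


def normalize_row(line):
    """Apply the three substitutions from the answer to one row."""
    # 1. Decimal commas become periods, so commas are free to act as separators.
    line = line.replace(",", ".")
    # 2. A single space between two letters marks a compound value.
    line = re.sub(r"([A-Za-z]) ([A-Za-z])", r"\1_\2", line)
    # 3. Every remaining run of whitespace is a column separator.
    return re.sub(r"\s+", ",", line.strip())


if __name__ == "__main__":
    sample = "SP1\t51F001\t3\t1\t1\t2\t3\t2001\t53\t53\tF\tFail Planting\t0\t0\tNULL\tNULL"
    print(normalize_row(sample))
```

If the columns were really separated by single spaces, step 2 would also join neighbouring one-word columns, so the separator assumption matters.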

Related

Trying to Extract Numeric from a text field

I have a field containing free-form text with a 13- or 17-digit ID embedded in it. I need to extract that ID from this field.
regexp_substr(TXT,'CTRL ACDV\\s+(\\d+)',1,1,'ie')..
TXT can look like this:
SUPPRESSED AND FORWARDING CTRL{ACDV 36608732875895776 } {DRID 12345
SUPPRESSED AND FORWARDING CTRL 9809770899005 TO FRAUD DUE TO ID TH
SUPPRESSED AND FORWARDING CTRL ACDV 987878829039161097 .DRID 87569
need to get
36608732875895776
9809770899005
987878829039161097
If you can assume the digits are a minimum length, this works for your 3 examples:
SELECT regexp_substr('SUPPRESSED AND FORWARDING CTRL{ACDV 36608732875895776 } {DRID 12345',
'(\\d{13,})', 1,1, 'e');
SELECT regexp_substr('SUPPRESSED AND FORWARDING CTRL 9809770899005 TO FRAUD DUE TO ID TH',
'(\\d{13,})', 1,1, 'e');
SELECT regexp_substr('SUPPRESSED AND FORWARDING CTRL ACDV 987878829039161097 .DRID 87569',
'(\\d{13,})', 1,1, 'e');
You might use a capturing group together with the e parameter, which (per the docs) returns only the part of the string that matches the first sub-expression in the pattern.
Note that the last number has 18 digits instead of 17.
\bCTRL\D+(\d{13,18})
Explanation
\bCTRL Match word boundary and CTRL
\D+ Match a non-digit 1+ times
(\d{13,18}) Capture group 1, matching 13 to 18 digits
Regex demo
Another option is to match 13 or more digits using \d{13,}
The docs state that patterns are implicitly anchored at both ends; in that case you could use:
.*\bCTRL\D+(\d{13,18})\b.*
Regex demo
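To check the capture-group pattern against the three sample strings, here is a small sketch using Python's re module. It illustrates the pattern only: it is not the database's regexp_substr, and the helper name extract_id is made up for this example.

```python
import re

# \bCTRL, then any non-digits, then the 13 to 18 digit ID in group 1.
ID_PATTERN = re.compile(r"\bCTRL\D+(\d{13,18})")


def extract_id(txt):
    """Return the ID following CTRL, or None if there is none."""
    match = ID_PATTERN.search(txt)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = [
        "SUPPRESSED AND FORWARDING CTRL{ACDV 36608732875895776 } {DRID 12345",
        "SUPPRESSED AND FORWARDING CTRL 9809770899005 TO FRAUD DUE TO ID TH",
        "SUPPRESSED AND FORWARDING CTRL ACDV 987878829039161097 .DRID 87569",
    ]
    for sample in samples:
        print(extract_id(sample))
```

Because search() is used, the pattern does not need the leading and trailing .* that an implicitly anchored engine would require.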
If the only big numbers are the ID's, then this is the shortest and fastest:
\d{13,17}
Test it here.
Be aware that the third ID (987878829039161097) is actually 18 digits long.
Therefore, if the minimum length is 13, you may want to use:
\d{13,}
Alternatively, if you want to delete everything except the long ID's, you can search for the regex:
([^\d]+|\d{,12})
and replace it with \n (= new line) or whatever you want (e.g. a space).
Test it here.
You may get better result if you do the replace in two steps. First for:
[^\d]+
(for non-digits)
and then for:
\s\d{1,12}(\s|$)
(for numbers with fewer than 13 digits)

Regular Expression for parsing a sports score

I'm trying to validate that a form field contains a valid score for a volleyball match. Here's what I have, and I think it works, but I'm not an expert on regular expressions, by any means:
r'^ *([0-9]{1,2} *- *[0-9]{1,2})((( *[,;] *)|([,;] *)|( *[,;])|[,;]| +)[0-9]{1,2} *- *[0-9]{1,2})* *$'
I'm using python/django, not that it really matters for the regex match. I'm also trying to learn regular expressions, so a more optimal regex would be useful/helpful.
Here are rules for the score:
1. There can be one or more valid set (set=game) results included
2. Each result must be of the form dd-dd, where 0 <= dd <= 99
3. Each additional result must be separated by any of [ ,;]
4. Allow any number of sets >=1 to be included
5. Spaces should be allowed anywhere except in the middle of a number
So, the following are all valid:
25-10 or 25 -0 or 25- 9 or 23 - 25 (could be one or more spaces)
25-10,25-15 or 25-10 ; 25-15 or 25-10 25-15 (again, spaces allowed)
25-1 2 -25, 25- 3 ;4 - 25 15-10
Also, I need each result as a separate unit for parsing. So in the last example above, I need to be able to separately work on:
25-1
2 -25
25- 3
4 - 25
15-10
It'd be great if I could strip the spaces from within each result. I can't just strip all spaces, because a space is a valid separator between result sets.
I think this is a solution for your problem. In Python, use re.sub (str.replace does not accept a regex), where s is the input string:
re.sub(r"(\d{1,2})\s*-\s*(\d{1,2})", r"\1-\2", s)
How it works:
(\d{1,2}) captures a group of 1 or 2 digits.
\s* matches 0 or more whitespace characters.
- matches a literal -.
\1 inserts the content of capture group 1.
\2 inserts the content of capture group 2.
you can also look at this.
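Putting validation and splitting together, one possible sketch in Python is shown below. It assumes the rules listed in the question (one or more results, each dd-dd, separated by a comma, semicolon, or spaces); the names RESULT, VALID, PAIR and parse_score are invented for this example and the pattern is a simplification of the asker's regex, not a drop-in replacement that has been tested against Django forms.

```python
import re

# One result: 1-2 digits, optional spaces, '-', optional spaces, 1-2 digits.
RESULT = r"\d{1,2} *- *\d{1,2}"
# Whole field: results separated by [,;] (with optional spaces) or by spaces.
VALID = re.compile(rf"^ *{RESULT}(?:(?: *[,;] *| +){RESULT})* *$")
# Used after validation to pull out each pair of numbers.
PAIR = re.compile(r"(\d{1,2}) *- *(\d{1,2})")


def parse_score(text):
    """Validate the field and return each set result with spaces stripped."""
    if not VALID.match(text):
        raise ValueError("not a valid score: %r" % text)
    return ["%s-%s" % (a, b) for a, b in PAIR.findall(text)]


if __name__ == "__main__":
    print(parse_score("25-1 2 -25, 25- 3 ;4 - 25 15-10"))
```

Validating first and extracting second keeps each regex small, and the extracted pairs are already free of inner spaces.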

Find repeating gps using regular expression

I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need to be able to figure out a way to search for matching strings of 7 different numbers, (the decimal portion of the gps) but so far I've only been able to figure out how to search for repeating #s or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
To go a bit into detail
([^,]+(?:,[^,]+){2})\v+ is a group of three comma-separated fields (one or more non-commas, followed twice by a comma and one or more non-commas), followed by vertical space (a line break) that is not part of the group (e.g. 1,1,1\n)
(?:[^,]+,){3} matches one or more non-commas followed by comma, three times (your columns that don't have to be considered)
\1 is a backreference to group 1, matching if it contains exactly the same as group 1
(?:\v+|$) matches either another vertical whitespaces or the end of the text
{3,} for 3 or more repetitions - increase it if you want more
Here you can see, how it works
However, if you are using a programming language to check this, I wouldn't go down the regex path, as checking for those repetitions can be done a lot more easily. Here is one example in Python; I hope you can adapt it to your needs:
oldcoords = [0, 0, 0]
repetitions = 0
# read all lines up front; the path is only an example
with open(r'C:\temp\gps.csv') as f:
    lines = [line.rstrip('\n') for line in f]
for line in lines:
    gpscoords = line.split(',')[3:6]  # the last three columns are the GPS fix
    if gpscoords == oldcoords:
        repetitions += 1
    else:
        oldcoords = gpscoords
        repetitions = 0
    if repetitions == 4:  # or however you define more than a few
        print(', '.join(gpscoords) + ' is repeated')
If you can use perl, and if I understood you correctly, this should do the trick (note the string comparison eq; a numeric == would compare only the leading number of each field group):
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/g; $current_line=$1; ++$line_number; if ($prev_line eq $current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equals to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath
Sample output:
Last three fields in lines 1 to 7 are equals to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equals to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equals to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equals to:
35.5128764,71.2874545,575.631
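The same run detection, including the line ranges that the one-liner prints, can be sketched in Python with itertools.groupby. The function name stuck_runs and the min_len default of 7 (matching the one-liner's threshold of six repeats after the first line) are assumptions for this example. Unlike the one-liner, this sketch also reports a run that extends to the last line of the input.

```python
from itertools import groupby


def stuck_runs(lines, min_len=7):
    """Return (first_line, last_line, coords) for each run of identical GPS fixes."""
    numbered = [
        (number, tuple(line.strip().split(",")[3:6]))  # last three CSV columns
        for number, line in enumerate(lines, start=1)
        if line.strip()
    ]
    runs = []
    # groupby collects consecutive rows that share the same coordinates.
    for coords, group in groupby(numbered, key=lambda item: item[1]):
        items = list(group)
        if len(items) >= min_len:
            runs.append((items[0][0], items[-1][0], coords))
    return runs


if __name__ == "__main__":
    sample = ["36859,1598,2026%02d.00,35.8867316,99.2515545,555.700" % n for n in range(7)]
    for first, last, coords in stuck_runs(sample):
        print("lines %d to %d repeat %s" % (first, last, ",".join(coords)))
```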

Regex to match numbers and commas, but not numbers starting with 0 unless it's 0,

Well I tried to sum it up in the title.
I need a reg ex to match numbers and commas, but not numbers starting with 0 unless it's 0,number
My users enter hours in a field, so they have to be able to enter 0,3 hours, but they are not allowed to write 002 or 09.
I have this reg ex
^[0-9]*\,?[0-9]+$
How can I extend it to not allow start with 0 unless the 0 is followed by a comma
Another one :)
^(0|[1-9]\d*(|,\d+)|0,\d+)$
This one should suit your needs:
^(?:0,\d*[1-9]|[1-9]\d*)$
either 0,\d*[1-9]: a 0, followed by a comma, followed by zero or more digits, followed by one digit between 1 and 9
or [1-9]\d*: a digit between 1 and 9, followed by zero or more digits
Matches:
0,3
0,03
3
30
Doesn't match:
0
0,0
0,30
03
You don't need to force everything into a single regex to do this.
It will be far clearer if you use multiple regexes, each one making a specific check.
if ( /^[0-9]+,[0-9]+$/ || /^[1-9][0-9]*$/ )
Here we are making two different checks. "Either this one matches, or the other one matches", and then you don't have to jam both conditions into one regex.
Let the expressive form of your host language be used, rather than trying to cram logic into a regex.
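For a quick check of the first pattern above, ^(0|[1-9]\d*(|,\d+)|0,\d+)$, against the inputs mentioned in the question, here is a minimal sketch in Python. The helper name is_valid_hours is hypothetical, and re.fullmatch stands in for the explicit ^ and $ anchors.

```python
import re

# 0 alone, or a number without a leading zero (optionally with ",digits"),
# or 0 followed by ",digits".
HOURS = re.compile(r"0|[1-9]\d*(?:,\d+)?|0,\d+")


def is_valid_hours(text):
    """True if the whole field is an acceptable hours value."""
    return HOURS.fullmatch(text) is not None


if __name__ == "__main__":
    for value in ["0,3", "12,5", "3", "002", "09"]:
        print(value, is_valid_hours(value))
```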

Is it possible to increment numbers using regex substitution?

Is it possible to increment numbers using regex substitution? Not using evaluated/function-based substitution, of course.
This question was inspired by another one, where the asker wanted to increment numbers in a text editor. There are probably more text editors that support regex substitution than ones that support full-on scripting, so a regex might be convenient to float around, if one exists.
Also, often I've learned neat things from clever solutions to practically useless problems, so I'm curious.
Assume we're only talking about non-negative decimal integers, i.e. \d+.
Is it possible in a single substitution? Or, a finite number of substitutions?
If not, is it at least possible given an upper bound, e.g. numbers up to 9999?
Of course it's doable given a while-loop (substituting while matched), but we're going for a loopless solution here.
This question's topic amused me for one particular implementation I did earlier. My solution happens to be two substitutions so I'll post it.
My implementation environment is solaris, full example:
echo "0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909" |
perl -pe 's/\b([0-9]+)\b/0$1~01234567890/g' |
perl -pe 's/\b0(?!9*~)|([0-9])(?=9*~[0-9]*?\1([0-9]))|~[0-9]*/$2/g'
1 2 3 4 8 9 10 11 20 100 110 200 910 1000 1100 1910
Pulling it apart for explanation:
s/\b([0-9]+)\b/0$1~01234567890/g
For each number (#) replace it with 0#~01234567890. The first 0 is in case rounding 9 to 10 is needed. The 01234567890 block is for incrementing. The example text for "9 10" is:
09~01234567890 010~01234567890
The individual pieces of the next regex can be described separately; they are joined via pipes to reduce the substitution count:
s/\b0(?!9*~)/$2/g
Select the "0" digit in front of all numbers that do not need rounding and discard it.
s/([0-9])(?=9*~[0-9]*?\1([0-9]))/$2/g
(?=) is positive lookahead, \1 is match group #1. So this means match all digits that are followed by 9s until the '~' mark then go to the lookup table and find the digit following this number. Replace with the next digit in the lookup table. Thus "09~" becomes "19~" then "10~" as the regex engine parses the number.
s/~[0-9]*/$2/g
This regex deletes the ~ lookup table.
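The two substitutions port to Python's re module almost unchanged, which makes it easy to experiment with them. This is a sketch under the assumption of Python 3.5 or later, where an unmatched group in a replacement template expands to an empty string (mirroring Perl's empty $2); the function name increment_all is made up for this example.

```python
import re


def increment_all(text):
    """Increment every non-negative integer in text using two substitutions."""
    # Pass 1: prefix a spare 0 (room for a carry) and append the lookup table.
    prepared = re.sub(r"\b([0-9]+)\b", r"0\g<1>~01234567890", text)
    # Pass 2: drop an unneeded leading 0, replace each digit that is followed
    # only by 9s with its successor from the table, and delete the table.
    pattern = r"\b0(?!9*~)|([0-9])(?=9*~[0-9]*?\1([0-9]))|~[0-9]*"
    return re.sub(pattern, r"\2", prepared)


if __name__ == "__main__":
    print(increment_all("0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909"))
```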
Wow, turns out it is possible (albeit ugly)!
In case you do not have the time or cannot be bothered to read through the whole explanation, here is the code that does it:
$str = '0 1 2 3 4 5 6 7 8 9 10 11 12 13 19 20 29 99 100 139';
$str = preg_replace("/\d+/", "$0~", $str);
$str = preg_replace("/$/", "#123456789~0", $str);
do
{
$str = preg_replace(
"/(?|0~(.*#.*(1))|1~(.*#.*(2))|2~(.*#.*(3))|3~(.*#.*(4))|4~(.*#.*(5))|5~(.*#.*(6))|6~(.*#.*(7))|7~(.*#.*(8))|8~(.*#.*(9))|9~(.*#.*(~0))|~(.*#.*(1)))/s",
"$2$1",
$str, -1, $count);
} while($count);
$str = preg_replace("/#123456789~0$/", "", $str);
echo $str;
Now let's get started.
So first of all, as the others mentioned, it is not possible in a single replacement, even if you loop it (because how would you insert the corresponding increment to a single digit). But if you prepare the string first, there is a single replacement that can be looped. Here is my demo implementation using PHP.
I used this test string:
$str = '0 1 2 3 4 5 6 7 8 9 10 11 12 13 19 20 29 99 100 139';
First of all, let's mark all numbers we want to increment by appending a marker character (I use ~, but you should probably use some crazy Unicode character or ASCII character sequence that definitely will not occur in your target string).
$str = preg_replace("/\d+/", "$0~", $str);
Since we will be replacing one digit per number at a time (from right to left), we will just add that marking character after every full number.
Now here comes the main hack. We add a little 'lookup' to the end of our string (also delimited with a unique character that does not occur in your string; for simplicity I used #).
$str = preg_replace("/$/", "#123456789~0", $str);
We will use this to replace digits by their corresponding successors.
Now comes the loop:
do
{
$str = preg_replace(
"/(?|0~(.*#.*(1))|1~(.*#.*(2))|2~(.*#.*(3))|3~(.*#.*(4))|4~(.*#.*(5))|5~(.*#.*(6))|6~(.*#.*(7))|7~(.*#.*(8))|8~(.*#.*(9))|9~(.*#.*(~0))|(?<!\d)~(.*#.*(1)))/s",
"$2$1",
$str, -1, $count);
} while($count);
Okay, what is going on? The matching pattern has one alternative for every possible digit. This maps digits to successors. Take the first alternative for example:
0~(.*#.*(1))
This will match any 0 followed by our increment marker ~, then it matches everything up to our cheat-delimiter and the corresponding successor (that is why we put every digit there). If you glance at the replacement, this will get replaced by $2$1 (which will then be 1 and then everything we matched after the ~ to put it back in place). Note that we drop the ~ in the process. Incrementing a digit from 0 to 1 is enough. The number was successfully incremented, there is no carry-over.
The next 8 alternatives are exactly the same for the digits 1 to 8. Then we take care of two special cases.
9~(.*#.*(~0))
When we replace the 9, we do not drop the increment marker, but place it to the left of the resulting 0 instead. This (combined with the surrounding loop) is enough to implement carry-over propagation. Now there is one special case left. For all numbers consisting solely of 9s we will end up with the ~ in front of the number. That is what the last alternative is for:
(?<!\d)~(.*#.*(1))
If we encounter a ~ that is not preceded by a digit (therefore the negative lookbehind), it must have been carried all the way through a number, and thus we simply replace it with a 1. I think we do not even need the negative lookbehind (because this is the last alternative that is checked), but it feels safer this way.
A short note on the (?|...) around the whole pattern. This makes sure that we always find the two matches of an alternative in the same references $1 and $2 (instead of ever larger numbers down the string).
Lastly, we add the DOTALL modifier (s), to make this work with strings that contain line breaks (otherwise, only numbers in the last line will be incremented).
That makes for a fairly simple replacement string. We simply first write $2 (in which we captured the successor, and possibly the carry-over marker), and then we put everything else we matched back in place with $1.
That's it! We just need to remove our hack from the end of the string, and we're done:
$str = preg_replace("/#123456789~0$/", "", $str);
echo $str;
> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 30 100 101 140
So we can do this entirely in regular expressions. And the only loop we have always uses the same regex. I believe this is as close as we can get without using preg_replace_callback().
Of course, this will do horrible things if we have numbers with decimal points in our string. But that could probably be taken care of by the very first preparation-replacement.
Update: I just realised, that this approach immediately extends to arbitrary increments (not just +1). Simply change the first replacement. The number of ~ you append equals the increment you apply to all numbers. So
$str = preg_replace("/\d+/", "$0~~~", $str);
would increment every integer in the string by 3.
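For comparison, here is a rough Python sketch of the same marker-and-lookup loop. Python's re module has no branch-reset group (?|...), so this sketch picks the two groups that actually matched inside a small helper function; that helper is a workaround for the sketch, not part of the original technique. The names LOOKUP, STEP, _swap and increment_all are made up, and the text is assumed to contain neither ~ nor #.

```python
import re

LOOKUP = "#123456789~0"

# One alternative per digit: "d~", then everything up to its successor in the
# lookup. 9 carries by re-emitting the marker; a bare marker becomes a new 1.
_alternatives = [rf"{d}~(.*#.*({d + 1}))" for d in range(9)]
_alternatives.append(r"9~(.*#.*(~0))")
_alternatives.append(r"(?<!\d)~(.*#.*(1))")
STEP = re.compile("|".join(_alternatives), re.DOTALL)


def _swap(match):
    # Exactly two groups are set: the skipped text, then the successor.
    rest, successor = [g for g in match.groups() if g is not None]
    return successor + rest


def increment_all(text, by=1):
    """Add `by` to every integer in text by looping one regex replacement."""
    marked = re.sub(r"\d+", lambda m: m.group(0) + "~" * by, text) + LOOKUP
    while True:
        marked, count = STEP.subn(_swap, marked)
        if count == 0:
            break
    return marked[: -len(LOOKUP)]


if __name__ == "__main__":
    print(increment_all("0 1 2 3 4 5 6 7 8 9 10 11 12 13 19 20 29 99 100 139"))
```

Passing by=3 appends three markers per number, which corresponds to the arbitrary-increment variant described in the update.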
I managed to get it working in 3 substitutions (no loops).
tl;dr
s/$/ ~0123456789/
s/(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)/$2$3$4$5/g
s/9(?=9*~)(?=.*(0))|~| ~0123456789$/$1/g
Explanation
Let ~ be a special character not expected to appear anywhere in the text.
If a character is nowhere to be found in the text, then there's no way to make it appear magically. So first we insert the characters we care about at the very end.
s/$/ ~0123456789/
For example,
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909
becomes:
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909 ~0123456789
Next, for each number, we (1) increment the last non-9 (or prepend a 1 if all are 9s), and (2) "mark" each trailing group of 9s.
s/(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)/$2$3$4$5/g
For example, our example becomes:
1 2 3 4 8 9 19~ 11 29~ 199~ 119~ 299~ 919~ 1999~ 1199~ 1919~ ~0123456789
Finally, we (1) replace each "marked" group of 9s with 0s, (2) remove the ~s, and (3) remove the character set at the end.
s/9(?=9*~)(?=.*(0))|~| ~0123456789$/$1/g
For example, our example becomes:
1 2 3 4 8 9 10 11 20 100 110 200 910 1000 1100 1910
PHP Example
$str = '0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909';
echo $str . '<br/>';
$str = preg_replace('/$/', ' ~0123456789', $str);
echo $str . '<br/>';
$str = preg_replace('/(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)/', '$2$3$4$5', $str);
echo $str . '<br/>';
$str = preg_replace('/9(?=9*~)(?=.*(0))|~| ~0123456789$/', '$1', $str);
echo $str . '<br/>';
Output:
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909 ~0123456789
1 2 3 4 8 9 19~ 11 29~ 199~ 119~ 299~ 919~ 1999~ 1199~ 1919~ ~0123456789
1 2 3 4 8 9 10 11 20 100 110 200 910 1000 1100 1910
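The three substitutions can also be tried in Python with the patterns unchanged. This is a sketch assuming Python 3.5 or later, where unmatched groups in a replacement template expand to an empty string, and single-line input without ~ in it; the names TABLE and increment_all are invented for this example.

```python
import re

TABLE = " ~0123456789"


def increment_all(text):
    """Increment every non-negative integer in text with three substitutions."""
    # Step 1: append the character set the later steps copy from.
    step1 = text + TABLE
    # Step 2: bump the last non-9 digit (or prepend 1) and mark trailing 9s.
    step2 = re.sub(
        r"(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)",
        r"\2\3\4\5",
        step1,
    )
    # Step 3: turn marked 9s into 0s, drop the markers, drop the table.
    return re.sub(r"9(?=9*~)(?=.*(0))|~| ~0123456789$", r"\1", step2)


if __name__ == "__main__":
    print(increment_all("0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909"))
```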
Is it possible in a single substitution?
No.
If not, is it at least possible in a single substitution given an upper bound, e.g. numbers up to 9999?
No.
You can't even replace the numbers between 0 and 8 with their respective successor. Once you have matched, and grouped this number:
/([0-8])/
you need to replace it. However, regex doesn't operate on numbers, but on strings. So you can replace the "number" (or better: digit) with twice this digit, but the regex engine does not know it is duplicating a string that holds a numerical value.
Even if you'd do something (silly) as this:
/(0)|(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)/
so that the regex engine "knows" that if group 1 is matched, the digit '0' is matched, it still cannot do a replacement. You can't instruct the regex engine to replace group 1 with the digit '1', group '2' with the digit '2', etc. Sure, some tools like PHP will let you define a couple of different patterns with corresponding replacement strings, but I get the impression that is not what you were thinking about.
It is not possible by regular expression search and substitution alone.
You have to use something else to help achieve that. You have to use the programming language at hand to increment the number.
Edit:
The regular expressions definition, as part of the Single Unix Specification, doesn't mention regular expressions supporting evaluation of arithmetic expressions or capabilities for performing arithmetic operations.
Nonetheless, I know some flavors (TextPad, an editor for Windows) allow you to use \i as a substitution term, which is an incremental counter of how many times the search string has been found, but it doesn't evaluate or parse found strings into a number, nor does it allow adding a number to it.
I have found a solution in two steps (Javascript) but it relies on indefinite lookaheads, which some regex engines reject:
const incrementAll = s =>
s.replaceAll(/(.+)/gm, "$1\n101234567890")
.replaceAll(/(?:([0-8]|(?<=\d)9)(?=9*[^\d])(?=.*\n\d*\1(\d)\d*$))|(?<!\d)9(?=9*[^\d])(?=(?:.|\n)*(10))|\n101234567890$/gm, "$2$3");
The key thing is to add a list of digits in order at the end of the string in the first step, and in the second, to locate the relevant digit and capture the digit to its right via a lookahead. There are two other branches in the second step, one for dealing with initial nines, and the other for removing the number sequence.
Edit: I just tested it in Safari and it throws an error, but it definitely works in Firefox.
I needed to increment indices of output files by one from a pipeline I can't modify. After some searches I got a hit on this page. While the readings are meaningful, they really don't give a readable solution to the problem. Yes it is possible to do it with only regex; no it is not as comprehensible.
Here I would like to give a readable solution using Python, so that others don't need to reinvent the wheel. I can imagine many of you may have ended up with a similar solution.
The idea is to partition the file name into three groups, and format your match string so that the index to increment is the middle group. Then it is possible to increment only the middle group, after which we piece the three groups together again.
import re
import sys
import argparse
from os import listdir
from os.path import isfile, join

def main():
    parser = argparse.ArgumentParser(description='index shift of input')
    parser.add_argument('-r', '--regex', type=str,
                        help='regex match string for the index to be shift')
    parser.add_argument('-i', '--indir', type=str,
                        help='input directory')
    parser.add_argument('-o', '--outdir', type=str,
                        help='output directory')
    args = parser.parse_args()
    # parse input regex string
    regex_str = args.regex
    regex = re.compile(regex_str)
    # target directories
    indir = args.indir
    outdir = args.outdir
    try:
        for input_fname in listdir(indir):
            input_fpath = join(indir, input_fname)
            if not isfile(input_fpath):  # not a file
                continue
            matched = regex.match(input_fname)
            if matched is None:  # not our target file
                continue
            # middle group is the index and we increment it
            index = int(matched.group(2)) + 1
            # reconstruct output
            output_fname = '{prev}{index}{after}'.format(**{
                'prev' : matched.group(1),
                'index' : str(index),
                'after' : matched.group(3)
            })
            output_fpath = join(outdir, output_fname)
            # write the command required to stdout
            print('mv {i} {o}'.format(i=input_fpath, o=output_fpath))
    except BrokenPipeError:
        pass

if __name__ == '__main__': main()
I have this script named index_shift.py. To give an example of the usage, my files are named k0_run0.csv, for bootstrap runs of machine learning models using parameter k. The parameter k starts from zero, and the desired index map starts at one. First we prepare input and output directories to avoid overwriting files:
$ ls -1 test_in/ | head -n 5
k0_run0.csv
k0_run10.csv
k0_run11.csv
k0_run12.csv
k0_run13.csv
$ ls -1 test_out/
To see how the script works, just print its output:
$ python3 -u index_shift.py -r '(^k)(\d+?)(_run.+)' -i test_in -o test_out | head -n5
mv test_in/k6_run26.csv test_out/k7_run26.csv
mv test_in/k25_run11.csv test_out/k26_run11.csv
mv test_in/k7_run14.csv test_out/k8_run14.csv
mv test_in/k4_run25.csv test_out/k5_run25.csv
mv test_in/k1_run28.csv test_out/k2_run28.csv
It generates bash mv commands to rename the files. Now we pipe the lines directly into bash.
$ python3 -u index_shift.py -r '(^k)(\d+?)(_run.+)' -i test_in -o test_out | bash
Checking the output, we have successfully shifted the index by one.
$ ls test_out/k0_run0.csv
ls: cannot access 'test_out/k0_run0.csv': No such file or directory
$ ls test_out/k1_run0.csv
test_out/k1_run0.csv
You can also use cp instead of mv. My files are kinda big, so I wanted to avoid duplicating them. You can also refactor the shift amount into an input argument. I didn't bother, because shifting by one covers most of my use cases.
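As a sketch of that last refactoring, the renaming step can be pulled into a small function that takes the shift as a parameter. The function name shift_index and its shift argument are hypothetical and not part of index_shift.py; the regex is expected to have the same three groups as above, with the index in the middle.

```python
import re


def shift_index(fname, regex, shift=1):
    """Return fname with its middle-group index shifted, or None if no match."""
    matched = re.match(regex, fname)
    if matched is None:  # not our target file
        return None
    index = int(matched.group(2)) + shift
    return "{}{}{}".format(matched.group(1), index, matched.group(3))


if __name__ == "__main__":
    pattern = r"(^k)(\d+?)(_run.+)"
    print(shift_index("k0_run0.csv", pattern))
    print(shift_index("k25_run11.csv", pattern, shift=3))
```

Exposing it on the command line would then only need one more argparse option that is passed through as shift.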