Regex in bash scripting

I have two similar file names that need to go into different directories. I tried using the following regexes.
File 1: abc_xyz_2016_12_02.out
File 2: abc_xyz_test_2016-12-02.out
Regex used:
regex_abc_xyz="abc_xyz_[0-9]{1,4}-[0-9]{1,2}-[0-9]{1,2}.out"
regex_abc_xyz_test="abc_xyz_test_[0-9]{1,4}-[0-9]{1,2}-[0-9]{1,2}.out"
regex_abc_xyz works but regex_abc_xyz_test is failing.

I entered your example test strings (the first of which I assume was mistyped, with underscores between the date components instead of hyphens) together with your regular expressions into RegEx 101. Both matched the appropriate filenames.
As one user stated, you ought to escape your period, i.e. \.out, since an unescaped . matches any character, but otherwise your regular expressions are fine.
However, if all you need is to separate two lots of files into two different directories, and each begins with a fixed string (I'm inferring this from your regex patterns, which start with abc_xyz and abc_xyz_test), could you not use a wildcard expression to move the latter group first, then the remaining group second?
So:
mv abc_xyz_test*.out /path/to/new/folder/
Then:
mv abc_xyz*.out /path/to/other/folder/
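As a rough cross-check of the advice above, here is a minimal Python sketch (Python rather than bash, purely for illustration) that applies both patterns with the period escaped and tests the more specific pattern first. The classify helper and its return labels are hypothetical names, not part of the original question.

```python
import re

# Date part copied from the question's patterns; the period before "out" is escaped.
DATE = r"[0-9]{1,4}-[0-9]{1,2}-[0-9]{1,2}"
REGEX_ABC_XYZ = re.compile(r"abc_xyz_" + DATE + r"\.out")
REGEX_ABC_XYZ_TEST = re.compile(r"abc_xyz_test_" + DATE + r"\.out")


def classify(filename):
    """Return "test", "plain", or None for a file name (hypothetical labels)."""
    # Check the more specific pattern first, mirroring the order of the mv commands.
    if REGEX_ABC_XYZ_TEST.fullmatch(filename):
        return "test"
    if REGEX_ABC_XYZ.fullmatch(filename):
        return "plain"
    return None
```

With the date written with hyphens, each name falls into exactly one group. The underscore-separated name from the question matches neither pattern.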

Related

Is there a bash script for finding a specific character between two given expressions?

I have a 3-step problem: I need to
find all occurrences of the character : in a latex file but only when it is in a \ref{} or in a \label{}, in which there can be other characters. Example: The system's total energy (\ref{eq:E}).
replace those : with _. Example becomes: The system's total energy (\ref{eq_E}).
do this for all such occurrences of : in references or labels, in about 100 files.
I've never done this before. I've worked out that I can use regular expressions to find complex occurrences. I can find either \ref{ or \label{ with (\\ref\{|\\label\{), but I can't put it in a lookbehind because it is not fixed width. My other problem with lookbehind and lookahead is that I can only match everything between my assertions, not specific characters (from what I've understood).
I've also worked out that I can use sed for find and replace. I was planning on using a regular expression as my sed "find". Does that make sense?
And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?
I know that my questions are all over the place, as I said, never done this before and there is a mountain of documentation I'm only beginning to tackle. Any help or pointers would be appreciated.

You can use the following command, which relies on capturing groups to extract the different parts of a ref or label containing a colon and replace it with the equivalent using an underscore:
sed -E 's/\\(ref|label)\{([^:}]*):([^}]*)}/\\\1\{\2_\3}/g'
The expression captures the whole ref or label tag, matching the tag name in the first capturing group, the part that precedes the colon in the second capturing group and the part that follows the colon in the third capturing group. The second group excludes } as well as :, so a match cannot run past the closing brace of a tag that contains no colon. The replacement pattern uses references to these capturing groups and can be read as \<tagName>{<before colon>_<after colon>}. Only the first colon in each tag is replaced; a key with several colons needs the command run more than once.
Note that it would be preferable to use a parser that understands the LaTeX format, as the regex is likely to fail for some edge cases.
"And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?"
sed accepts a list of files as parameters and applies its command to all of them. The list of files can be produced by the expansion of a glob, e.g. sed 'sedCommand' /your/directory/*.txt, which would work on all files in /your/directory/ whose names end in .txt.
In this case you will likely want to use sed's -i "in place" flag, which asks sed to write its result directly to the target file rather than to its standard output. The flag can be followed by a suffix if you want a backup of the original, for instance sed -i.bak 'command' file.txt will have file.txt contain the result and file.txt.bak the original.
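For comparison, here is a minimal Python sketch of the same idea. It is not the sed command above but a swapped-in variant: it matches each whole \ref{...} or \label{...} tag and uses a callback to replace every colon inside it, so keys with several colons are handled in one pass. The function names fix_tags and fix_files are hypothetical, and the .bak suffix simply mirrors the sed -i.bak example.

```python
import re
import shutil
from pathlib import Path

# A whole \ref{...} or \label{...} tag, up to its first closing brace.
TAG = re.compile(r"\\(?:ref|label)\{[^}]*\}")


def fix_tags(text):
    """Replace every ':' inside \\ref{...} and \\label{...} tags with '_'."""
    return TAG.sub(lambda m: m.group(0).replace(":", "_"), text)


def fix_files(paths):
    """Rewrite each file in place, keeping a .bak copy of any file that changes."""
    for path in map(Path, paths):
        original = path.read_text(encoding="utf-8")
        updated = fix_tags(original)
        if updated != original:
            shutil.copy2(path, path.with_name(path.name + ".bak"))
            path.write_text(updated, encoding="utf-8")
```

Like the sed version, this is only a regex and will miss edge cases such as nested braces; colons outside the two tags are left alone.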

How to exclude list of folders from Mercurial/TortoiseHG's .hgignore file?

Ok. I need to exclude a list of files from version control, except for files in three particular folders (let's call them Folder1, Folder2 and Folder3). I could list all the folders I need to ignore as a plain list, but I don't consider that elegant, so I wrote the following regex:
.*/(Bin|bin)/(?!Folder1/|Folder2/|Folder3/).*
My thoughts were as follows, from left to right:
.* - Any number of any characters.
/ - Slash symbol, which separates folders from one another.
(Bin|bin) - Folder with "Bin" or "bin" name.
/ - Slash symbol, which separates folders from one another.
(?!Folder1/|Folder2/|Folder3/) - Folder name is not "Folder1/" and is not "Folder2/" and is not "Folder3/". This part was the most complicated, but I googled it somehow. I don't understand why it should work, but it works during the tests.
.* - Any number of any characters.
This expression works perfectly when I test it at regex101.com with a couple of text strings, representing paths to files, but nothing works when I put it in my .hgignore file, as follows:
syntax: regexp
.*/(Bin|bin)/(?!Folder1/|Folder2/|Folder3/).*
For some reason it ignores all files and sub-folders in all "Bin" and "bin" folders. How can I accomplish my task?
P.S. As far as I know, Mercurial/TortoiseHG uses Python/Perl regular expressions.
Many thanks in advance.

To adjust the question a bit to make it clearer (at least to me), we have any number of .../bin/somename/... and .../bin/anothername/... names that should be ignored, along with three sets of names, .../bin/folder1/..., .../bin/2folder/..., and .../Bin/third/..., that should not be ignored.
Hence, we want a regular expression that (without anchoring) will match the names-to-be-ignored but not the ones-to-be-kept. (Furthermore, glob matching won't work, since it's not as powerful: we'll either match too little or too much, and Mercurial lacks the "override with later un-ignore" feature of Git.)
The shortest regular expression for this should be:
/[Bb]in/(?!(folder1|2folder|third)/)
(The part of this regex that actually matches a string like /bin/somename/... is only the /bin/ part, but Mercurial does not look at what matched, only whether something matched.)
The thing is, your example regular expression should work, it's just a longer variant of this same thing with not-required but harmless (except for performance) .* added at the front and back. So if yours isn't working, the above probably won't work either. A sample repository, with some dummy files, that one could clone and experiment with, would help diagnose the issue.
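To see how the short pattern behaves as a plain regular expression, here is a minimal Python sketch. It only exercises the pattern with Python's re.search (an unanchored search); it does not reproduce Mercurial's own ignore machinery, and the sample paths and the is_ignored helper are made up for illustration.

```python
import re

# The answer's pattern: a bin/Bin component not followed by one of the kept folders.
IGNORE = re.compile(r"/[Bb]in/(?!(folder1|2folder|third)/)")


def is_ignored(path):
    """True if the pattern is found anywhere in the path (unanchored search)."""
    return IGNORE.search(path) is not None


# Hypothetical sample paths.
samples = [
    "proj/bin/somename/a.dll",    # ignored
    "proj/Bin/anothername/b.obj", # ignored
    "proj/bin/folder1/keep.txt",  # kept
    "proj/Bin/third/keep.txt",    # kept
]
results = {path: is_ignored(path) for path in samples}
```

If the pattern classifies paths as expected here but not in the repository, the cause is more likely to lie in the .hgignore file or in which files are already tracked than in the regex itself.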
Original (wrong) answer (to something that's not the question)
The shortest regular expression for the desired case is:
/[Bb]in/Folder[123]/
However, if the directory / folder names do not actually meet this kind of pattern, we need:
/[Bb]in/(somedir|another|third)/
Explanation
First, a side note: the default syntax is regexp, so the initial syntax: regexp line is unnecessary. It's also possible that your .hgignore file is not in proper UTF-8 format: see Mercurial gives "invalid pattern" error for simple GLOB syntax. (But that would produce different behavior, so that's probably not the problem. It's just worth mentioning in any answer about .hgignore files malfunctioning.)
Next, it's worth noting a few items:
Mercurial tracks only files, not directories / folders. So the real question is whether any given file name matches the pattern(s) listed in .hgignore. If it does match, and the file is currently untracked, the file will not be automatically added with a sweeping "add everything" operation, and Mercurial will not gripe that the file is untracked.
If some file is already tracked, the fact that its name matches an ignore pattern is irrelevant. If the file a/b/c.ext is not tracked and does match a pattern, hg add a/b/c.ext will add it anyway, while hg add a/b will en-masse add everything in a/b but won't add c.ext because it matches the pattern. So it's important to know whether the file is already tracked, and consider what you explicitly list to hg add. See also How to check which files are being ignored because of .hgignore?, for instance.
Glob patterns are much easier to write correctly than regular expressions. Unless you're doing this for learning or teaching purposes, or glob is just not powerful enough, stick with the glob patterns. (In very old versions of Mercurial, glob matching was noticeably slower than regexp matching, but that's been fixed for a long time.)
Mercurial's regexp ignore entries are not automatically anchored: if you want anchored behavior, use ^ at the front, and $ at the end, as desired. Here, you don't want anchored behavior, so you can eliminate the leading and trailing .*. (Mercurial refers to this as rooted rather than anchored, and it's important to note that some patterns are anchored, but .hgignore ones are not.)
Python/Perl regexp (?!...) syntax is the negative lookahead syntax: (?!...) succeeds at a given position if the parenthesized expression does not match starting there, and it consumes no characters. This is part of the problem.
We need not worry about capturing groups (see capturing group in regex) as Mercurial does nothing with the groups that come out of the regular expression. It only cares if we match.
Path names are really slash-separated components. The leading components are the various directories (folders) above the file name, and the final component is the file name. (That is, try not to think of the first parts as folders: it's not that it's wrong, it's that it's less general than "components", since the last part is also a component.)
What we want, in this case, is to match, and therefore "ignore", names that have one component that matches either bin or Bin followed immediately by another component that matches Folder1, Folder2, or Folder3 that is followed by a component-separator (so that we haven't stopped at /bin/Folder1, for instance, which is a file named Folder1 in directory /bin).
The strings bin and Bin both end with a common trailing part of in, so this is recognizable as (B|b)in, but single-character alternation is more easily expressed as a character class: [Bb], which eliminates the need for parentheses and vertical-bars.
The same holds for the names Folder1, Folder2, and Folder3, except that their common string leads rather than trails, so we can use Folder[123].
Suppose we had anchored matches. That is, suppose Mercurial demanded that we match the whole file name, which might be, say, /foo/hello/bin/Folder2/bar/world.ext. Then we'd need .*/[Bb]in/Folder[123]/.*, because we'd need to match any number of characters to skip over /foo/hello before matching /bin/Folder2/, and again skip over any number of characters to match bar/world.ext, in order to match the whole string. But since we don't have anchored matches, we'll find the pattern /bin/Folder2/ within the whole string, and hence ignore this file, using the simpler pattern without the leading and trailing .*.
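The difference between anchored and unanchored matching can be sketched in Python, where re.search looks for the pattern anywhere in the string and re.fullmatch demands that the whole string match. This only illustrates the regex behaviour described above, using the file name from the explanation; the variable names are arbitrary.

```python
import re

name = "/foo/hello/bin/Folder2/bar/world.ext"

bare = re.compile(r"/[Bb]in/Folder[123]/")
padded = re.compile(r".*/[Bb]in/Folder[123]/.*")

# Unanchored: the bare pattern is found inside the longer name.
found_anywhere = bare.search(name) is not None

# Anchored (whole string): the bare pattern alone is not enough...
bare_whole = bare.fullmatch(name) is not None

# ...but the padded variant with leading and trailing .* is.
padded_whole = padded.fullmatch(name) is not None

# The trailing slash matters: a file named Folder1 directly in /bin is not matched.
file_named_folder1 = bare.search("/bin/Folder1") is not None
```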

Regular Expression remove specific text in file name

I am using a file transferring tool that allows the use of Regular Expression to rename files as they are copied into a new folder (so I am working with Regular Expression only and not inside a code base). I have a large set of files with a specific naming convention with a version number at the end of the file name. My goal is to remove this file version number along with the underscore.
Here are some examples of the file names:
the_file_name_DS_017_EN_35.pdf
the_file_name_DS_037_SP_35.pdf
different_filename_DS_EN_5.pdf
I am looking to change them to:
the_file_name_DS_017_EN.pdf
the_file_name_DS_037_SP.pdf
different_filename_DS_EN.pdf
I am trying to remove the version number so that the file naming convention on my new server will always be the same. I am not good with regex and this is what I tried so far but to no avail:
Using _[^_]+$ it selects last underscore along with the .pdf extension.
Using \_(.*?)\. it selects the first underscore until the period.
How do I select the last underscore until the period removing that text but keeping the period? Maybe there is a better method? Thanks in advance!

If your regex engine supports positive lookaheads, you can match the following and replace it with nothing:
(_\d+)(?=\.pdf$)
Explanation :
(_\d+) matches an underscore followed by one or more digits
(?=\.pdf$) is a positive lookahead that requires the .pdf extension at the end of the file name without including it in the match

Try this regular expression:
_[0-9]*\.
and replace it by
.
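Both suggestions can be tried out on the question's file names with a short Python sketch. The file transfer tool's own regex engine may differ from Python's re module, so this is only a sanity check of the patterns; the function names are hypothetical.

```python
import re

# First answer: drop "_<digits>" only when ".pdf" at the end of the name follows.
LOOKAHEAD = re.compile(r"(_\d+)(?=\.pdf$)")

# Second answer: match "_<digits>." and put the period back in the replacement.
SIMPLE = re.compile(r"_[0-9]*\.")


def strip_version_lookahead(name):
    return LOOKAHEAD.sub("", name)


def strip_version_simple(name):
    return SIMPLE.sub(".", name)


names = [
    "the_file_name_DS_017_EN_35.pdf",
    "the_file_name_DS_037_SP_35.pdf",
    "different_filename_DS_EN_5.pdf",
]
renamed = [strip_version_lookahead(n) for n in names]
```

The lookahead version leaves the period untouched, while the second version consumes it and writes it back, which is why its replacement is a single period.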

Elasticsearch Regex to match url starting with one string and not ending with another, without look ahead/behind

I have two groups of strings that take the formats
http://example.com/foo/something
and
http://example.com/foo/something/something-else/bar/1
Where example.com, foo and bar are fixed, something and something else could be any string and 1 is any number.
I want to use regex to match strings following the first format (they must start with http://example.com/foo/) and not the second. The exclusion could be around number of slashes, the "bar" string or ending in a number.
I don't have support for look ahead or look back.
What's the best approach?
Examples of strings that should match
http://example.com/foo/apple
http://example.com/foo/bear-bear
http://example.com/foo/cake-cake
Examples of strings that should NOT match
http://example.com/baa/apple
http://example.com/foo/apple/cake/bar/1
http://example.com/foo/bear-apple/camel/bar/2
Examples of strings that wouldn't exist in the data set
(So it doesn't matter if they match or not)
http://example.com/foo/bear-bear/cake/bar/two
http://example.com/foo/bear/camel/tar/2
http://example.com/foo/bear-bear/camel
http://example.com/foo/bear/camel/
http://example.com/foo/bear-bear/camel/tar/2
UPDATE
It turns out that the regex engine in the application I'm using is Elasticsearch's, so this documentation (and one of our developers) was helpful: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
The end solution was:
(http://example.com/foo.*)&~(.*bar.*)

All your examples have a specific prefix URL, followed by one-and-only-one path element. If this is the general case, you can do this by simply looking for the prefix URL followed by a word which doesn't contain a path separator, followed by EOL.
You didn't say what engine you're using, so here's an example with GNU grep in bash:
grep -e '^http://example.com/foo/[^/]\+$'
Bash makes for readable examples, because single-quoting means very few characters need escaping. The sole exception in my example is the + character.
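The same approach, a fixed prefix followed by exactly one path element, can be checked against the question's sample URLs with a minimal Python sketch. This uses Python's re module, not Elasticsearch's regex engine, and escapes the period in example.com; is_first_format is a hypothetical helper name.

```python
import re

# Prefix, then one path element containing no slash, then end of string.
URL = re.compile(r"^http://example\.com/foo/[^/]+$")


def is_first_format(url):
    return URL.search(url) is not None


should_match = [
    "http://example.com/foo/apple",
    "http://example.com/foo/bear-bear",
    "http://example.com/foo/cake-cake",
]
should_not_match = [
    "http://example.com/baa/apple",
    "http://example.com/foo/apple/cake/bar/1",
    "http://example.com/foo/bear-apple/camel/bar/2",
]
```

No lookahead or lookbehind is needed, because the exclusion is expressed through the character class [^/] rather than through a negative assertion.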

Regex for SublimeText Snippet

I've been stuck for a while on this Sublime Snippet now.
I would like to display the correct package name when creating a new class, using TM_FILEPATH and TM_FILENAME.
When printing TM_FILEPATH variable, I get something like this:
/Users/caubry/d/[...]/src/com/[...]/folder/MyClass.as
I would like to transform this output, so I could get something like:
com.[...].folder
This includes:
Removing anything before /com/[...]/folder/MyClass.as;
Removing the TM_FILENAME, with its extension; in this example MyClass.as;
And finally finding all the slashes and replacing them by dots.
So far, this is what I've got:
${1:${TM_FILEPATH/.+(?:src\/)(.+)\.\w+/\l$1/}}
and this displays:
com/[...]/folder/MyClass
I do understand how to replace slashes with dots, such as:
${1:${TM_FILEPATH/\//./g/}}
However, I'm having difficulty adding this logic to the previous one, as well as removing the TM_FILENAME at the end.
I'm really inexperienced with Regex, thanks in advance.
:]
EDIT: [...] indicates variable number of folders.

We can do this in a single replacement with some trickery. What we'll do is, we put a few different cases into our pattern and do a different replacement for each of them. The trick to accomplish this is that the replacement string must contain no literal characters, but consist entirely of "backreferences". In that case, those groups that didn't participate in the match (because they were part of a different case) will simply be written back as an empty string and not contribute to the replacement. Let's get started.
First, we want to remove everything up until the last src/ (to mimic the behaviour of your snippet - use an ungreedy quantifier if you want to remove everything until the first src/):
^.+/src/
We just want to drop this, so there's no need to capture anything - nor to write anything back.
Now we want to match subsequent folders until the last one. We'll capture the folder name, also match the trailing /, but write back the folder name and a period. But I said no literal text in the replacement string! So the period has to come from a capture as well. This is where the assumption that your file always has an extension comes into play: we can grab the period from the file name with a lookahead. We'll also use that lookahead to make sure that there's at least one more folder ahead:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))
And we'll replace this with $1$2. Now if the first alternative catches, groups $1 and $2 will be empty, and the leading bit is still removed. If the second alternative catches, $1 will be the folder name, and $2 will have captured a period. Sweet. The \G is an anchor that ensures that all matches are adjacent to one another.
Finally, we'll match the last folder and everything that follows it, and only write back the folder name:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))|\G([^/]+)/[^/]+$
And now we'll replace this with $1$2$3 for the final solution.
A conceptually similar variant would be:
^.+/src/|\G([^/]+)/(?:(?=[^/]+/.*([.]))|[^/]+$)
replaced with $1$2. I've really only factored out the beginning of the second and third alternatives.
Finally, if Sublime is using Boost's extended format string syntax, it is actually possible to get characters into the replacement conditionally (without magically conjuring them from the file extension):
^.+/src/|\G/[^/]+$|\G(/)?([^/]+)
Now we have the first alternative for everything up to src (which is to be removed), the second alternative for the last slash and file name (which is to be removed), and the last alternative for all folders you want to keep. The file-name alternative has to come before the folder alternative, because alternatives are tried from left to right and the folder alternative would otherwise also match the final slash and file name. This time I put the slash to be replaced optionally at the beginning. With a conditional replacement we can write a . there if and only if that slash was matched:
(?1.:)$2
Unfortunately, I can't test this right now and I don't know an online tester that uses Boost's regex engine. But this should do the trick just fine.
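The three-alternative pattern replaced with $1$2$3 can be approximated in Python to see the "empty groups" trick at work. Python's re module has no \G, so this sketch drops \G from the pattern and emulates it by matching repeatedly at the position where the previous match ended. The sample path is hypothetical (the question elides the real folders), and package_name is a made-up helper name; Sublime's Boost engine is not involved.

```python
import re

# The three alternatives from the answer, with \G removed (emulated below).
PATTERN = re.compile(r"^.+/src/|([^/]+)/(?=[^/]+/.*([.]))|([^/]+)/[^/]+$")


def package_name(path):
    """Apply the pattern as adjacent matches, writing back $1$2$3 each time."""
    out = []
    pos = 0
    while pos < len(path):
        # match() at pos stands in for \G; ^ still only matches at index 0.
        m = PATTERN.match(path, pos)
        if m is None or m.end() == pos:
            out.append(path[pos:])  # chain broken: keep the rest unchanged
            break
        # Groups that did not participate are None and contribute nothing.
        out.append("".join(g or "" for g in m.groups()))
        pos = m.end()
    return "".join(out)


# Hypothetical path standing in for TM_FILEPATH.
example = package_name("/Users/caubry/d/proj/src/com/example/folder/MyClass.as")
```

Each folder except the last is written back together with a period captured from the file extension, and the last folder is written back on its own, which gives the dotted package name.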
