Find and trim part of what is found using regular expression - regex

I'm a newbie in writing regular expressions
I have a file name like this TST0101201304-123.txt and my target is to get the numbers between '-' and '.txt'
So I wrote this formula -([0-9]*)\.txt this will get me the numbers that I want, but in addition, it is retrieving the highfin '-' and the last part of the string also '.txt' so the result in the example above is '-123.txt'
So my question is:
Is there a way in regular expressions to get only part of the matched string, like a submatch of the match without the need to trim it in my shell script code for unix?
I found this answer but it is getting the same result:
Regexp: Trim parts of a string and return what ever is left
Tip: To test my regular expressions is used this website

You can use lookbehind and lookahead
(?<=-)[0-9]*(?=[.]txt)
Don't know if it would work in unix

Different regex-engines are different. Since you're using expr match, you need to make two changes:
expr match expects a regex that matches the entire string; so, you need to add .* at the beginning of yours, to cover everything before the hyphen.
expr match uses POSIX Basic Regular Expressions (BREs), which use \( and \) for grouping (and capturing) rather than merely ( and ).
But, conveniently, when you give expr match a regex that contains a capture-group, its output is the content of that capture-group; you don't need to do anything else special. So:
$ expr match TST0101201304-123.txt '.*-\([0-9]*\)\.txt'
123

sed is your friend.
echo filename | sed -e 's/-\([0-9]*\)/\1'
should get you what you want.

Related

foo[E1,E2,...]* glob matches desired contents, but foo[E1,E2,...]_* does not?

I saw something weird today in the behaviour of the Bash Shell when globbing.
So I ran an ls command with the following Glob:
ls GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]* | grep ":"
the result was as expected
GM12878_Hs_InSitu_MboI_rE1_TagDirectory:
GM12878_Hs_InSitu_MboI_rE2_TagDirectory:
GM12878_Hs_InSitu_MboI_rF_TagDirectory:
GM12878_Hs_InSitu_MboI_rG1_TagDirectory:
GM12878_Hs_InSitu_MboI_rG2_TagDirectory:
GM12878_Hs_InSitu_MboI_rH_TagDirectory:
however when I change the same regex by introducing an underscore to this
ls GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]_* | grep ":"
my expected result is the complete set as shown above, however what I get is a subset:
GM12878_Hs_InSitu_MboI_rF_TagDirectory:
GM12878_Hs_InSitu_MboI_rH_TagDirectory:
Can someone explain what's wrong in my logic when I introduce an underscore sign before the asterisk?
I am using Bash.
You misunderstand what your glob is doing.
You were expecting this:
GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]*
to be a glob of files that have any of those comma-separated segments but that's not what [] globbing does. [] globbing is a character class expansion.
Compare:
$ echo GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]
GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]
to what you were trying to get (which is brace {} expansion):
$ echo GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}
GM12878_Hs_InSitu_MboI_rE1 GM12878_Hs_InSitu_MboI_rE2 GM12878_Hs_InSitu_MboI_rF GM12878_Hs_InSitu_MboI_rG1 GM12878_Hs_InSitu_MboI_rG2 GM12878_Hs_InSitu_MboI_rH
You wanted that latter expansion.
Your expansion uses a character class which matches the character E-H, 1-2, and ,; it's identical to:
GM12878_Hs_InSitu_MboI_r[EFGH12,]_*
which, as I expect you can now see, isn't going to match any two character entries (where the underscore-less version will).
* in fileystem globs is not like * in regex. In a regex * means "0 or more of the preceeding pattern," but in filesystem globs it means "anything at all of any size". So in your first example, the _ is just part of the "anything" from the * but in the second you're matching any single character within your character class (not the patterns you seem to be trying to define) followed by _ followed by anything at all.
Also, character classes don't work the way you're trying to use them. [...] will match any character within the brackets, so your pattern is actually the same as [EFGH12,] since those are all the letters in class you define.
To get the grouping of patterns you want, you should use { instead of [ like
ls GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}_* | grep ":"
As far as I know, and this article supports my me, the square brackets don't work as a choice but as a character set, so using [E1,E2,F,G1,G2,H] actually is equivalent to exactly one occurrence of [EGHF12,]. You can then interpret the second result as "one character of EGHF12, and an underscore", which matches GM12878_Hs_InSitu_MboI_rF_TagDirectory: but not GM12878_Hs_InSitu_MboI_rG1_TagDirectory: (there is the r followed by more that "one occurrence of...").
The first regex works because you used the asterisk, which matches what is missed by the wrong [...].
A correct expression would be:
ls GM12878_Hs_InSitu_MboI_r{E1|E2|F|G1|G2|H}* | grep ":"

Regular Expression - Perl

I am trying to get the a sub string from a string using regular expression but it getting error as my regular expression is not working. Can any one help me out in writing correct one :
Here is the Pattern on which i am trying to write the regular expression :
MSM8_BD_V4.3_1-1_idle-Kr_Run3.xlsx
MSM8_BD_V4.3_2-6_mp3-Kr_Run2.xlsx
MSM8_BD_V4.3_Camera_snap-7.xlsx
MSM8_BD_V4.3_Camera_snap-8.xlsx
MSM8_BD_V4.3_Radio_202.16-0.xlsx
I am trying to get the bold part of the substring .
below is the Regular expression i tried:
my $line = "MSM8939_BD_V4.3_1-1_idle-Kratos_Run3.xlsx";
my ($captured) = $line =~ /MSM8939_BD_V4\.\3\_[d]*(.+?)\w/gx;
print "$captured\n";
[d] matches nothing but the literal letter d. You want \d, without the brackets, to match a digit. However, it looks like you also want to include underscores. That would be [\d_].
Try this:
/^MSM8_BD_V4\.3_[\d_]*-?([^-]+)/
If I run this on your input (with e.g. perl -nE 'say $1 if /^MSM8_BD_V4\.3_[\d_]*-?([^-]+)/'), I get this output:
1_idle
6_mp3
Camera_snap
Camera_snap
Radio_202.16
my $line = "MSM8939_BD_V4.3_1-1_idle-Kratos_Run3.xlsx";
for (qw(
MSM8939_BD_V4.3_1-1_idle-Kratos_Run3.xlsx
MSM8939_BD_V4.3_2-6_mp3-Kratos_Run2.xlsx
MSM8939_BD_V4.3_Camera_snap-7.xlsx
MSM8939_BD_V4.3_Camera_snap-8.xlsx
MSM8939_BD_V4.3_Radio_202.16-0.xlsx
)) {
my ($captured) = ($_ =~ /.*[-_]([^\W_]+_[\w.]+)-/gx);
print "$captured\n";
}
Use a greedy pattern to go as far as possible, then grab the last two strings that look like what you want which are still followed by a hyphen.
As does the other answer which was just edited while I was typing, this produces:
1_idle
6_mp3
Camera_snap
Camera_snap
Radio_202.16
This one may be more general in that the beginning of the substring is not hard-coded, i.e., you could use it in other cases which did not necessarily start with MSM8_BD_V4.3.

Regex expression to find file extension in a file with multiple periods

How would you write a regular expression to find the file extension of the following files, keeping in mind that what I am looking for is the ".pdf" or ".xls" portion of the string?
REPORTPDF.20130810.pdf.pgp
REPORTXLS.20130810.xls.pgp
EDIT:
The resulting filenames I want to end up with are the following:
REPORT20130810.PDF
REPORT20130810.XLS
I am on a Windows platform. I've played around with this a bit at http://regexpal.com/ but so far I can only figure out how to match the date:
([0-9]{4}[0-9]{2}[0-9]{2})
Using sed:
sed 's/^\(.*[^.]*\)\.[^.]*$/\1/' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
Using grep -P (PCRE regex):
grep -oP '^.+[^.]+(?=\.[^.]+$)' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
.+\.(\w+)\.\w+$ would deliver the last but one extension as group 1, how this is accessed would then be dependent of your host language for the regex.
If you don't need the file extension to be capitalized, this should work
([a-zA-Z]+)\.([0-9]{4}[0-9]{2}[0-9]{2})\.(xls|pdf)\.pgp
Matches:
REPORTXLS.20130810.xls.pgp
And then the groups you'd use are two and three
REPORT\2.\3
Matches:
REPORT20130810.xls
Problem is that you don't provide much context for how you're going about changing these file names.
You don't say what language/library you're using, but this Perl one-liner does the trick:
perl -lpe "s/^([^.]*)(...)\.(\d+)(\.\2)\.pgp/\1\3\4/i; $_=uc"
I think this will work for you :)
^(([A-Z a-z]*)(?:XLS.|PDF.)(\d{8})(.pdf|.xls))
Edit live on Debuggex
^ starts at the beginning of the string
(.*) any character before
\d any number 0-9
{8} only 8 times for that character section (in this case 8 times of
the numbers 0-9)
?: is non capture groups
I wrapped the capture groups into one large one so the thing that you want will be in the first capture group :).
This can be prob be replaced
([A-Z a-z]*)
with
(REPORT)
This (.*?(?:\..*)?)(\..*) will hold things like:
'hello.1a.2bb.3' ---> group(1) == 'hello.1a.2bb', group(2) == '.3'
'yep.1' ---> group(1) == 'yep', group(2) == '.1'
If the format is pretty much fixed you could use
(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)
and cherry pick replacement based on what you want
Used java here but regex match would still be same
String a = "REPORTPDF.20130810.pdf.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
;
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
System.out.println(a);
System.out.println(b);
REPORT--PDF--20130810--pdf--pgp
REPORT--XLS--20130810--xls--pgp
in your case "$1$3.$2"
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1$3.$2");
which produces intended result
REPORT20130810.XLS

Annotating mismatches in regular expression

I need to "annotate" with a X character each mismatch in a regular expression, For example if I have a text file like:
Line1Name: this is a (string).
Line2Name: (a string)
Line3Name this is a line without parenthesis
Line4Name: (a string 2)
Now following regular expression will match everything before a :
^[^:]+(?=:)
so the result will be
Line1Name:
Line2Name:
Line4Name:
However I would need to annotate the mismatch at the 3rd line, having this output:
Line1Name:
Line2Name:
X
Line4Name:
Is this possible with regular expressions?
If you have a look at what a regular expression is, you will realize that it is not possible to do logical operations with a regex alone. Quoting Wikipedia:
In computing, a regular expression provides a concise and flexible means to “match” (specify and recognize) strings of text, such as particular characters, words, or patterns of characters.
emphasis mine – simply put, a regex is a fancy way to find a string; it either does (it matches), or not.
To achieve what you are after, you need some kind of logic switch that operates on the match / not-match result of your regex search and triggers an action. You haven’t specified in what environment you are using your regex, so providing a solution is a bit pointless, but as an example, this would do what you are trying to do in pure bash:
# assuming your string is in $str
result="$([[ $str =~ ^[^:]+: ]] && echo "${str%:*}" || echo "X")"
and this does the same thing in a language supporting your regex pattern (Ruby):
# assuming your string is in str
result = str.match(/^[^:]+(?=:)/) || "X"
As a side note, your example code does not match the output: you are using a lookahead for the colon, which excludes it in the match, but your output includes it. I’ve opted for sticking with your regex over your output pattern in my examples, thus excluding the colon from the result.

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
$values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
to match the first number with a minimum of 4 digits
(?<=bdfg34f;\d{4,};)\d{0,9}
and to match the first number with 1 or more length
(?<=bdfg34f;\d+;)\d{0,9}
or to match the first number only if the length is between 4 and 6
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in php, I have no idea what you're doing)
foreach (explode("\n", $string) as $line) {
$bits = explode(";", $line);
echo $bits[3]; // third column
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.