Middle-portion regex

I'm trying to write a regex that matches the start and end of a string.
start:
https://www.example.com.au/
end:
-end
Example input/match:
Input IsMatch
https://www.example.com.au/hithere-end Y
https://www.example.com.au/hi-there-end Y
https://www.example.com.au/hithere-endx N
https://www.example.com.au/end N
This is what I have so far:
^https?://(www\.)?example\.com\.au/[A-z](\-end)$
Any help?
Thanks.

Try this pattern:
^https?:\/\/(?:www\.)?example\.com\.au\/(.+)-end$
Changes from your pattern:
Each / is escaped with \ (3 times).
The first group is changed to a non-capturing one, (?:...).
[A-z] matches only a single character (any character in the ASCII range from A to z, which also includes [ \ ] ^ _ and `). Changed to (.+) (a capturing group).
Removed the parentheses from the last group (you don't want to capture it); the \ before - is also not needed.
The "middle part" you want to capture is in group 1.
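As a quick check, the pattern can be run against the four sample inputs from the question. This is only a minimal Python sketch: the helper name middle_part is made up for illustration, and the \/ escapes are dropped because Python does not need them.

```python
import re

# Same pattern as in the answer, without the \/ escapes (not needed in Python).
PATTERN = re.compile(r"^https?://(?:www\.)?example\.com\.au/(.+)-end$")

def middle_part(url):
    """Return the captured middle part, or None when the URL does not match."""
    m = PATTERN.match(url)
    return m.group(1) if m else None

samples = [
    "https://www.example.com.au/hithere-end",
    "https://www.example.com.au/hi-there-end",
    "https://www.example.com.au/hithere-endx",
    "https://www.example.com.au/end",
]
for url in samples:
    print(url, "->", middle_part(url))
```

The first two inputs give hithere and hi-there; the last two give None, which lines up with the Y/N column in the question.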

Check this:
^(https?://(www\.)?example\.com\.au/)[A-z]*(-end)$
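Should work when the middle part contains only letters. Note that [A-z] does not include -, so hi-there-end would not match; use (.+) or a class that contains - for that case.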

Try this C# code
Somestring.StartsWith("https://www.example.com.au/") && Somestring.EndsWith("-end")
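The same prefix/suffix idea, sketched in Python for comparison. The names is_match, PREFIX and SUFFIX are illustrative, and the extra length check (which rejects an empty middle part) is an assumption that goes beyond the C# snippet. Unlike the regex answers, this only accepts the exact https://www. form of the URL.

```python
PREFIX = "https://www.example.com.au/"
SUFFIX = "-end"

def is_match(url):
    """True when url starts with PREFIX, ends with SUFFIX and has text in between."""
    return (
        url.startswith(PREFIX)
        and url.endswith(SUFFIX)
        # Assumption: at least one character must sit between prefix and suffix.
        and len(url) > len(PREFIX) + len(SUFFIX)
    )

print(is_match("https://www.example.com.au/hi-there-end"))
print(is_match("https://www.example.com.au/hithere-endx"))
```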

Related

Using regex to parse/reduce text file

edit
I've realized I made a mistake when explaining myself. Apologies for that.
Most of the artifacts come from this path:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\
then breaks into Artifact folders and its sub-folders like this:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.2\data.xxx
I would appreciate help with following thing:
I have this list (around 5k rows) of paths to different artifacts and they have different versions, to give you an example:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.3\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.3\data.xxx
And my goal to achieve is this:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
Basically to scope it down to just 1 version.
I've tried using ^(.*)(\n\1)+$ and $1. but that obviously didn't work. So I was wondering if you have an idea how to approach this. Greatly appreciate help, thanks!

You can use
Find what: ^(.*\.)(\d+)\\[^\\\n]+(\n\1\d+\\[^\\\n]+)+$
Replace: $1$2\\
See the regex demo. Details:
^ - start of a line (it is the default ^ behavior in Visual Studio Code)
(.*\.) - Group 1: any zero or more chars other than line break chars, as many as possible, and then a .
(\d+) - Group 2: one or more digits
\\ - a \ char
[^\\\n]+ - one or more chars other than \ and a line break
(\n\1\d+\\[^\\\n]+)+ - Group 3 capturing one or more sequences of a line break and then the value captured into Group 1, one or more digits, a \ char and then one or more chars other than \ and a line break
$ - end of a line.

Here is another attempt, see regex101 demo.
The basic idea is to isolate someText-\d?. in capture group 2.
Then look for $2 in following lines. What precedes $2 or follows $2 in those following lines can vary.
Find: ^(.*\\(?=.*\\))(.*-\d+\.)(.*\\?.*)(\n.*\2.*)*
Replace: $1$2$3
So here is the most interesting part: ^(.*\\(?=.*\\))(.*-\d+\.)
This will get your Artifact-1. or Artifact-17. or someText-2. into capture group 2. Because using a positive lookahead (?=.*\\) the following group 2 (.*-\d+\.) will be in the last directory only. And then (.*\\?.*) gathers the rest of that line into group 3.
Finally (\n.*\2.*)* checks to see if there is a backreference to group 2, \2, in any following lines. [Technically, that backreference could be anywhere in a line, even at the beginning. That can be fixed if necessary - let me know if you need that for your data. See the safer regex101 demo if 'someText-\d.' could appear anywhere and should be ignored when it is not in the last directory, and use that find.]

You cannot use a single capture group for the whole line with ^(.*): you want to repeat only the part before the last dot using a backreference, and that will not work when the whole line is captured.
Therefore you have to capture the digits in the first match in a separate capture group to keep it in the replacement.
If you want to match all following lines with the same text before the last dot, you can use a repeating group:
^\s*(.*\.)(\d+\\[^\\\r\n]*)(?:\r?\n\s*\1\d*\\[^\\\r\n]*)+
The pattern matches:
^ Start of string
\s* Match optional whitespace chars
(.*\.) Capture group 1, match till the last dot
(\d+\\[^\\\r\n]*) Capture group 2, match 1+ digits, \ and optional chars other than \ or a newline
(?: Non capture group
\r?\n\s*\1 Match a newline, optional whitespace chars and a backreference to group 1
\d+\\[^\\\r\n]* Same pattern as in the first part
)+ Close the non capture group and repeat 1+ times
See a regex demo.
In the replacement use the 2 capture groups $1$2
The replacement will look like
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
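Outside an editor, the same find/replace can be tried in Python. The following is a rough sketch under a few assumptions: the helper name keep_first_version and the BASE constant are invented for the example, the pattern is copied as written above, and re.MULTILINE is added so that ^ matches at the start of every line.

```python
import re

# Pattern from the answer; group 1 is the text up to the last dot before the
# version digits, group 2 is the first version's digits plus the file name.
DUPLICATES = re.compile(
    r"^\s*(.*\.)(\d+\\[^\\\r\n]*)(?:\r?\n\s*\1\d*\\[^\\\r\n]*)+",
    re.MULTILINE,
)

def keep_first_version(text):
    """Collapse each run of lines sharing the same prefix to its first line."""
    return DUPLICATES.sub(r"\1\2", text)

BASE = r"D:\Folder1\Folder2\Folder3\Folder4\Folder5"
sample = "\n".join([
    BASE + r"\ArtifactFolder\Artifact\Artifact-1.0\data.xxx",
    BASE + r"\ArtifactFolder\Artifact\Artifact-1.1\data.xxx",
    BASE + r"\ArtifactFolder\Artifact\Artifact-1.2\data.xxx",
    BASE + r"\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx",
    BASE + r"\ArtifactFolder2\Artifact\Artifact-1.2\data.xxx",
    BASE + r"\ArtifactFolder2\Artifact\Artifact-1.3\data.xxx",
    BASE + r"\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx",
    BASE + r"\ArtifactFolder3\Artifact\Artifact-1.3\data.xxx",
])
print(keep_first_version(sample))
```

An artifact that appears on only one line is not matched at all (the repeated group needs at least one following line), so it is left untouched.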

Can regular expression assert that 2 of submatches to be equal?

Let's say, for this simple regexp,
(?P<first>\d+)\.(?P<second>\d+)
it can match strings like "123.456" so that,
first -> 123, second -> 456
Based on this example, is there a way to assert "first" should equal "second", otherwise the input string won't be a match?

You could capture the first digits before the dot in a capturing group and use a backreference after the dot to group 1:
(?P<first>\d+)\.(?P<second>\1)
Or you can refer to the first capturing group by name:
(?P<first>\d+)\.(?P<second>(?P=first))
As per comment from UnbearableLightness you could use word boundaries \b or use anchors ^ and $ to assert the start and the end of the line.
\b(?P<first>\d+)\.(?P<second>(?P=first))\b
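The (?P<name>...) syntax is the one used by Python's re module, so the effect of the word boundaries can be sketched there. The helper names below are hypothetical; the point is only to show what changes when \b is added.

```python
import re

# Without boundaries a partial match inside a longer number is possible.
LOOSE = re.compile(r"(?P<first>\d+)\.(?P<second>(?P=first))")
# With \b on both sides the two numbers must be complete and equal.
STRICT = re.compile(r"\b(?P<first>\d+)\.(?P<second>(?P=first))\b")

def loose_match(text):
    m = LOOSE.search(text)
    return m.group(0) if m else None

def strict_match(text):
    m = STRICT.search(text)
    return m.group(0) if m else None

for text in ["123.123", "123.456", "1123.1234"]:
    print(text, loose_match(text), strict_match(text))
```

For 1123.1234 the loose pattern finds 123.123 in the middle of the string, while the bounded pattern finds nothing.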

You can backreference to the matched group in capture one with expression:
^(?P<first>\d+)\.(?P<second>\1)$
You can check it live here.

Regex- to extract a string before and after string

I want to extract the parts of a string before and after a word. The content is below.
Content:
1. http://www.example.com/myplan/mp/public/pl_be?Id=543543&timestamp=06280435435
2. http://www.example.com/course/df/public/pl_de?Id=454354&timestamp=0628031746
3. http://www.example.com/book/rg/public/pl_fo?Id=4445577&timestamp=0628031734
4. http://www.example.com/trip/tr/public/pl_ds?Id=454354&timestamp=06280314546
5. http://www.example.com/trip/tr/public/pl_ds
I want to capture the data from the strings above as follows:
1. http://www.example.com/myplan/mp/public/?Id=543543
2. http://www.example.com/course/df/public/?Id=454354
3. http://www.example.com/book/rg/public/?Id=4445577
4. http://www.example.com/trip/tr/public/?Id=454354
5. http://www.example.com/trip/tr/public/
I have tried (./(?![A-Za-z]{2}_[A-Za-z]{2}).(?=&)), but it doesn't work.
I hope somebody can help me with this.

This pattern will catch what you want in two groups. It's safer than the other examples that have been suggested so far because it allows for some variance in the URL.
(.*)\w\w_\w\w.*?(?:(?:[&?]\w+=\d+|%\w*)*?(\?Id=\d+)(?:.*))?
(.*) captures everything up until your xx_xx part (capture group 1)
\w\w_\w\w.*? matches xx_xx and everything up until the next capture section
(?:[&?]\w+=\d+|%\w*)*? allows for there to be other & % or ? properties in your URL before your ?Id= property
(\?Id=\d+) captures your Id property (capture group 2)
(?:.*) is unnecessary but it bugs me when not all of the text is highlighted on regex101 ¯\_(ツ)_/¯
the optional non-capturing group here (?:(?:[&?]\w+=\d+|%\w*)*?(\?Id=\d+)(?:.*))? allows it to match URLs that don't have ID properties.
Here's an example of how it works

Response updated:
This pattern will do the work for you:
(.*\/)[^?]*(?:(\?[^&]*).*)?
Explanation:
(.*\/) -> Will match and capture every character up to and including the last / (.* is greedy, so it backtracks from the end of the string to the last /).
[^?]* -> Will match everything that's not a ? character.
(?:(\?[^&]*).*)? -> First of all, (?: ... ) is a non-capturing group, the ? at the end of this makes this group optional, (\?[^&]*) will match and capture the ? character and every non & character next to it, the last .* will match everything after the first param in the URL.
Then you can replace the string using only the first and second capture groups.
Here is a working example in regex101
Edit 2:
As emsimpson92 mentioned in the comments, the Id might not always be the first param, so you can use this pattern to match the Id param:
(.*\/)[^?]*(?:(\?).*?(Id=[^&]*).*)?
The important part here is that .*?(Id=[^&]*).* matches the Id param no matter its position.
.*? -> It matches as few characters as possible until Id= is found. The trick here is that .* is a greedy quantifier, but when it is followed by ? it becomes a lazy one.
Here is an Example of this scenario in regex101
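To see how the three capture groups are put back together, here is a small Python sketch of the second pattern applied to the URLs from the question. The function name trim_url is an assumption made for the example; the pattern itself is the one from Edit 2, without the \/ escape.

```python
import re

# Group 1: everything up to the last "/", group 2: the "?", group 3: the Id param.
PATTERN = re.compile(r"(.*/)[^?]*(?:(\?).*?(Id=[^&]*).*)?")

def trim_url(url):
    """Rebuild the URL from the captured groups, skipping groups that did not match."""
    m = PATTERN.match(url)
    if m is None:
        return url  # no "/" at all, leave the input unchanged
    return "".join(g for g in m.groups() if g is not None)

urls = [
    "http://www.example.com/myplan/mp/public/pl_be?Id=543543&timestamp=06280435435",
    "http://www.example.com/trip/tr/public/pl_ds",
]
for url in urls:
    print(trim_url(url))
```

When the query string is missing, groups 2 and 3 are None and only the path prefix is returned.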

Regular Expression to Extract Text Bounded by '/'

I need a regular expression to extract names from a GEDCOM file. The format is:
Fred Joseph /Smith/
Where the text bounded by the / is the surname and the Fred Joseph are the forenames. The complication is that the surname could be at any place in the text or may not be there at all. I need something that will extract the surname and capture everything else as the forenames.
This is as far as I have got and I have tried making groups optional with the ? qualifier but to no avail:
As you can see it has several problems: If the surname is missing nothing gets captured, the forename(s) sometimes have leading and trailing spaces, and I have 3 capture groups when I'd really like 2. Even better would be if the capture group for the surname didn't include the '/' characters.
Any help would be much appreciated.

For your last line, I'm not sure there is a way to join the group 1 with group 3 into a single group.
Here is my proposed solution. It doesn't capture spaces around forenames.
^(?:\h*([a-z\h]+\b)\h*)?(?:\/([a-z\h]+)\/)?(?:\h*([a-z\h]+\b)\h*)?$
To match the names correctly, be sure to use the case-insensitive flag, and if you test all lines at once, use the multiline flag.
See the demo
Explanation
^ start of the line
(?:\h*([a-z\h]+\b)\h*)? first non-capturing group that matches 0 or 1 time:
\h* 0 or more horizontal spaces
([a-z\h]+\b) captures in a group letters and spaces, but stops at the end of the last word
\h* matches the possible remaining spaces without capturing
(?:\/([a-z\h]+)\/)? second non-capturing group that matches 0 or 1 time a name in a capturing group surrounded by slashes
(?:\h*([a-z\h]+\b)\h*)? third non-capturing group doing the same as first one, capturing the names in a third group.
$ end of the line

For your requirements
([A-z a-z /])+\w*
Sample

Hope this helps
(.*?)\/(.*?)\/(.*)

Try this: ^([^/\n]*)(/[^/\n]+/)?([^/\n]*)$
This matches the following:
^ start of string (or with multiline modifier start of line)
([^/\n]*) anything other than / or new line zero or more times - this is captured as group 1
(/[^/\n]+/)? a single / followed by one or more non / or new line characters, then a single '/' character - this is captured as group 2, and is optional
([^/\n]*) anything other than / or new line zero or more times - this is captured as group 3
$ end of string (or with multiline modifier end of line)
You can see in action with your example text here: https://regex101.com/r/9kmKpy/1
To not capture the slashes you can add a non capturing group by adding ?: to the second set of brackets, and then adding another pair between the slashes:
^([^\/\n]*)(?:\/([^\/\n]+)\/)?([^\/\n]*)$
https://regex101.com/r/9kmKpy/2
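The question also asked for two results rather than three groups. One way to get there is to join groups 1 and 3 after matching; the Python sketch below does that, with split_name as a made-up helper name and whitespace normalisation added as an assumption about what the forenames should look like.

```python
import re

# Pattern from the answer above (slashes need no escaping in Python).
NAME = re.compile(r"^([^/\n]*)(?:/([^/\n]+)/)?([^/\n]*)$")

def split_name(raw):
    """Return (forenames, surname); surname is None when no /.../ part exists."""
    m = NAME.match(raw)
    if m is None:
        return None
    before, surname, after = m.groups()
    # Join the text on both sides of the surname and collapse extra spaces.
    forenames = " ".join((before + " " + after).split())
    return forenames, surname

for raw in ["Fred Joseph /Smith/", "/Smith/ Fred Joseph", "Fred /Smith/ Joseph", "Fred Joseph"]:
    print(repr(raw), "->", split_name(raw))
```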

I am not sure I follow what language is being used to extract the data, but based on what you have so far, you simply need to add '?':
(.*)(\/?.*\/?)(.*)
Note that this does not give you groupings for EACH name, as some solutions will have multiple names in a single group.
Edit:
Extending Niitaku's solution and looking at having each individual name in its own group, you could use:
^\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*$
As explained though, if using a language like ruby it would simply be:
ruby -pe '$_ = $_.scan(/\w+/)' file

Regular expression to match string only if trailed by a character

I need help creating a regular expression.
Here are two sample strings:
/path/to/file.jpg
/path/to/file.type.jpg
Respectively, I'm trying to capture:
file.jpg
file.type.jpg
But I want to capture the three as separate strings.
file,jpg
file,type,jpg
Note that I'm not capturing the periods.
I thought something like this could work (excluding the new lines):
([a-z]+)\.
[([a-z]+)[\.]{1}]?
([a-z]{3})
Guidance would be appreciated.
I'm wondering if there is another modifier I would need to use to have it capture properly.
The above expression errors out, by the way :(

I suggest using the pattern
\/([^.]+)\.?([^.]+|)\.([^.]+)$
and you will have 3 groups: file, type (which will be empty, if not present) and extension

You'd have to use:
/(\w+)(\.(\w+))?\.(\w{3,4})\b
Then capturing groups 1, 3 and 4 would be your: file(1) type(3) and jpg/png whatever(4)
Groups taken apart:
(\w+) - matches one or more word characters (equivalent to {1,})
(\.(\w+))? - matches the 3rd group with a dot in front, and makes the whole group optional ( ? )
(\w+) - same as group 1
(\w{3,4})\b - matches 3 or 4 word characters ( {3,4} ) and ensures that no other word characters follow them (word boundary - \b - if supported!)

You can use: "\/(?:\w+\/)+(\w+)\.?(\w+)?\.(\w+)" as regex.
Edit: didn't read about not matching dots.
Live Demo

This regex should work:
/(\w+)\.(\w+)(?:\.(\w+))?$/
Live Demo
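For reference, this last pattern can be checked against the two sample paths in Python. The sketch is illustrative only: name_parts is a hypothetical helper, and dropping the unmatched optional group is an assumption about the desired output (file,jpg versus file,type,jpg).

```python
import re

# Pattern from the answer above, without the surrounding / delimiters.
PARTS = re.compile(r"(\w+)\.(\w+)(?:\.(\w+))?$")

def name_parts(path):
    """Return the dot-separated parts of the file name, without the dots."""
    m = PARTS.search(path)
    if m is None:
        return []
    # The third group is None when the name has only one dot.
    return [g for g in m.groups() if g is not None]

print(name_parts("/path/to/file.jpg"))
print(name_parts("/path/to/file.type.jpg"))
```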
