Using regex to parse/reduce text file

Using regex to parse/reduce text file - regex

edit
I've realized I made a mistake when explaining myself. Apologies for that.
Most of the artifacts come from this path:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\
then breaks into Artifact folders and its sub-folders like this:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.2\data.xxx
I would appreciate help with following thing:
I have this list (around 5k rows) of paths to different artifacts and they have different versions, to give you an example:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.3\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.3\data.xxx
And my goal to achieve is this:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
Basically to scope it down to just 1 version.
I've tried using ^(.*)(\n\1)+$ and $1. but that obviously didn't work. So I was wondering if you have an idea how to approach this. Greatly appreciate help, thanks!

You can use
Find what: ^(.*\.)(\d+)\\[^\\\n]+(\n\1\d+\\[^\\\n]+)+$
Replace: $1$2\\
See the regex demo. Details:
^ - start of a line (it is the default ^ behavior in Visual Studio Code)
(.*\.) - Group 1: any one or more chars other than line break chars as many as possible and then a .
(\d+) - Group 2:
\\ - a \ char
[^\\\n]+ - one or more chars other than \ and a line break
(\n\1\d+\\[^\\\n]+)+ - Group 3 capturing one or more sequences of a line break and then the value captured into Group 1, one or more digits, a \ char and then one or more chars other than \ and a line break
$ - end of a line.

Here is another attempt, see regex101 demo.
The basic idea is to isolate someText-\d?. in capture group 2.
Then look for $2 in following lines. What precedes $2 or follows $2 in those following lines can vary.
Find: ^(.*\\(?=.*\\))(.*-\d+\.)(.*\\?.*)(\n.*\2.*)*
Replace: $1$2$3
So here is the most interesting part: ^(.*\\(?=.*\\))(.*-\d+\.)
This will get your Artifact-1. or Artifact-17. or someText-2. into capture group 2. Because using a positive lookahead (?=.*\\) the following group 2 (.*-\d+\.) will be in the last directory only. And then (.*\\?.*) gathers the rest of that line into group 3.
Finally (\n.*\2.*)* checks to see if there is a backreference to group 2, \2, in any following lines. [Technically, that backreference could be anywhere in a line, even the beginning, that can be fixed if necessary - let me know if you need that for your data. See safer regex101 demo if 'someText-/d.' could appear anywhere and should be ignored if not last directory and use that find.]

You can not use a single capture group for the whole line using ^(.*), as you want to repeat only the part before the last dot using a backreference and that will not work capturing the whole line.
Therefore you have to capture the digits in the first match in a separate capture group to keep it in the replacement.
If you want to match all following lines with the same text before the last dot, you can use a repeating group:
^\s*(.*\.)(\d+\\[^\\\r\n]*)(?:\r?\n\s*\1\d*\\[^\\\r\n]*)+
The pattern matches:
^ Start of string
\s* Match optional whitespace chars
(.*\.) Capture group 1, match till the last dot
(\d+\\[^\\\r\n]*) Capture group 2, match 1+ digits, \ and optional chars other than \ or a newline
(?: Non capture group
\r?\n\s*\1 Match a newline and a backreference to group 1
\d+\\[^\\\r\n]* Same pattern as in the first part
)+ Close the non capture group and repeat 1+ times
See a regex demo.
In the replacement use the 2 capture groups $1$2
The replacement will look like
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx

Related

replaceAll regex to remove last - from the output

I was able to achieve some of the output but not the right one. I am using replace all regex and below is the sample code.
final String label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";
System.out.println(label.replaceAll(
"([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)", "$3"));
i want this output:
abc-nyd-request-xyxpt
but getting:
abc-nyd-request-xyxpt-
here is the code https://ideone.com/UKnepg

You may use this .replaceFirst solution:
String label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";
label.replaceFirst("(?:[^-]*-){2}(.+?)(?:--1)?-[^-]+$", "$1");
//=> "abc-nyd-request-xyxpt"
RegEx Demo
RegEx Details:
(?:[^-]+-){2}: Match 2 repetitions of non-hyphenated string followed by a hyphen
(.+?): Match 1+ of any characters and capture in group #1
(?:--1)?: Match optional --1
-: Match a -
[^-]+: Match a non-hyphenated string
$: End

The following works for your example case
([^-]+)-([^-]+)-(.+[^-])-+([^-]+)-([^-]+)
https://regex101.com/r/VNtryN/1
We don't want to capture any trailing - while allowing the trailing dashes to have more than a single one which makes it match the double --.

With your shown samples and attempts, please try following regex. This is going to create 1 capturing group which can be used in replacement. Do replacement like: $1in your function.
^(?:.*?-){2}([^-]*(?:-[^-]*){3})--.*
Here is the Online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^(?:.*?-){2} ##Matching from starting of value in a non-capturing group where using lazy match to match very near occurrence of - and matching 2 occurrences of it.
([^-]*(?:-[^-]*){3}) ##Creating 1st and only capturing group and matching everything before - followed by - followed by everything just before - and this combination 3 times to get required output.
--.* ##Matching -- to all values till last.

Regular expression with multiline matching (subtitles strings)

Need some help in regexp matching pattern.
The text goes like here (it's subtitles for video)
...
223
00:20:47,920 --> 00:20:57,520
- Hello! This is good subtitle text.
- Yes! How are you, stackoverflow?
224
00:20:57,520 --> 00:21:11,120
Wow, seems amazing.
- We're good, thanks.
Like, you know, everyone is happy around here with their laptops.
225
00:21:11,120 --> 00:21:14,440
- Understood. Some dumb text
...
I need a set of groups:
startTime, endTime, text
For now my achievements are not very good. I can get startTime, endTime and some text, but not all the text, only the last sentence. I've attached a screenshot.
As you can see, group 3 is capturing text, but only last sentence.
Please, explain me what I'm doing wrong.
Thank you.

Accounting for the possibility there is no new-line character after the final text of your string; Would the following work for you:
(\d\d:\d\d:\d\d,\d\d\d)[ >-]*?((?1))\n(.*?(?=\n\n|\Z))
See the online demo
(\d\d:\d\d:\d\d,\d\d\d) - The same pattern as you used to capture starting time in 1st capture group.
[ >-]*? - 0+ (but lazy) character from the character class up to:
((?1)) - A 2nd capture group which matches the same pattern as 1st group.
\n - A newline-character.
(.*?(?=\n\n|\Z)) - A 3rd capture group that captures anything (including newline with the s-flag) up to a positive lookahead for either two newline characters or the end of the whole string.
Note, some (not all) engines allow for backreferencing a previous subpattern. I guess the app you are using does not. Therefor you can swap the (?1) with your own pattern to capture the 2nd group.

Another option is to use a pattern that would capture all lines in group 3 that do not start with 3 digits.
(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d\d\d\b).*)*)
Explanation
(\d\d:\d\d:\d\d,\d\d\d) Capture group 1 Match a time like pattern
--> Match literally
(\d\d:\d\d:\d\d,\d\d\d) Capture group 2 Same pattern as group 1
( Capture group 3
(?: Non capture group
\r?\n(?!\d\d\d\b).* Match a newline and assert using a negative lookahead that the line does not start with 3 digits followed by word boundary. If that is the case, match the whole line
)* Optionally repeat all lines
) Close group 3
Regex demo
A bitmore specific pattern could be matching all lines that do not start with 3 digits or a start/end time like pattern.
^(\d\d:\d\d:\d\d,\d\d\d)[^\S\r\n]+-->[^\S\r\n]+(\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d+$|\d\d:\d\d:\d\d,\d\d\d\b).*)*)
Regex demo

RegEx - if then else

I am trying to work out a regex expression but struggle with conditionals. I have a list of 100s of URLs that look like this:
/name/something/details/55334
/name/page/1/2
/name/somethingdifferent/34523
/name/page/1
/name/something/553/1
Bottom line is that I want to remove everything when a number appears apart from a scenario where the last thing before the number is a word 'page'.
1. /name/something/details/
2. /name/page/1/2
3. /name/somethingdifferent/
4. /name/page/1
5. /name/something
I will be removing it with Google Analytics Content Grouping or potentially with DataStudio. I already removed /name/ so I have:
1. /something/details/55334
2. /page/1/2
3. /somethingdifferent/34523
4. /page/1
5. /something/553/1
but want to add another rule and remove the numbers so I get:
1. /something/details/
2. /page/1/2
3. /somethingdifferent/
4. /page/1
5. /something
have already tried:
\(?(?=(page\/[0-9]+))(\2)|(\/\d+)
following the syntax of:
(?(?=condition))(IF)|(ELSE)
but it highlights all numbers after text.
Thanks for your help.
sampak

Try ^(\/page.*|[^0-9]*), works with your example.
A Version incl. name: ^(page[\/\d]*|[^\d\s])*

One option might be to match not a whitespace or digit while not matching /page.
Then match a forward slash and 1+ digits followed by any char 0+ times to omit that from the result.
^((?:(?!\/page)[^\d\s])*\/)\d.*
In parts
^ Start of string
( Capture group 1
(?: Non capturing group
(?!\/page) Negative lookahead, assert what is directly to the right is not
[^\d\s] Match any char except a digit or whitespace char
)* Close non capturing group and repeat 0+ times
\/ Match /
) Close group 1
\d.* Match a digit followed by any char except a newline 0+ times
In the replacement use the first capturing group
Regex demo
If you also want to remove /name you could use:
^\/name((?:(?!\/page)[^\d\s])*\/)\d.*
Regex demo

Regex - optional capture group after wildcard

Say I have the following list:
No 1 And Your Bird Can Sing (4)
No 2 Baby, You're a Rich Man (5)
No 3 Blue Jay Way S
No 4 Everybody's Got Something to Hide Except Me and My Monkey (1)
And I want to extract the number, the title and the number of weeks in the parenthesis if it exists.
Works, but the last group is not optional (regstorm):
No (?<no>\d{1,3}) (?<title>.*?) \((?<weeks>\d)\)
Last group optional, only matches number (regstorm):
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?
Combining one pattern with week capture with a pattern without week capture works, but there gotta be a better way:
(No (?<no>\d{1,3}) (?<title>.*) \((?<weeks>\d)\))|(No (?<no>\d{1,3}) (?<title>.*))
I use C# and javascript but I guess this is a general regex question.

Your regex is almost there!
First and most importantly, you should add a $ at the end. This makes (?<title>.*?) match all the way towards the end of the string. Currently, (?<title>.*?) matches an empty string and then stops, because it realises that it has reached a point where the rest of the regex matches. Why does the rest of the regex match? Because the optional group can match any empty string. By putting the $, you are making the rest of the regex "harder" to match.
Secondly, you forgot to match an open parenthesis \(.
This is how your regex should look like:
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?$
Demo

You may use this regex with an optional last part:
^No (?<no>\d{1,3}) (?<title>.*?\S)(?: \((?<weeks>\d)\))?$
RegEx Demo

Another option could be for the title to match either not ( or when it does encounter a ( it should not be followed by a digit and a closing parenthesis.
^No (?<no>\d{1,3}) (?<title>(?:[^(\r\n]+|\((?!\d\)))+)(?:\((?<weeks>\d)\))?
In parts
^No
(?\d{1,3}) Group no and space
(?<title>
(?: Non capturing group
[^(\r\n]+ Match any char except ( or newline
| Or
\((?!\d\)) Match ( if not directly followed by a digit and )
)+ Close group and repeat 1+ times
) Close group title
(?: Non capturing group
\((?<weeks>\d)\) Group weeks between parenthesis
)? Close group and make it optional
Regex demo
If you don't want to trim the last space of the title you could exclude it from matching before the weeks.
Regex demo

Regular Expression to Extract Text Bounded by '/'

I need to a regular expression to extract names from a GEDCOM file. The format is:
Fred Joseph /Smith/
Where the text bounded by the / is the surname and the Fred Joseph are the forenames. The complication is that the surname could be at any place in the text or may not be there at all. I need something that will extract the surname and capture everything else as the forenames.
This is as far as I have got and I have tried making groups optional with the ? qualifier but to no avail:
As you can see it has several problems: If the surname is missing nothing gets captured, the forename(s) sometimes have leading and trailing spaces, and I have 3 capture groups when I'd really like 2. Even better would be if the capture group for the surname didn't include the '/' characters.
Any help would be much appreciated.

For your last line, I'm not sure there is a way to join the group 1 with group 3 into a single group.
Here is my proposed solution. It doesn't capture spaces around forenames.
^(?:\h*([a-z\h]+\b)\h*)?(?:\/([a-z\h]+)\/)?(?:\h*([a-z\h]+\b)\h*)?$
To correctly match the names, care to use the insensitive flag, and if you test all lines at once, use multiline flag.
See the demo
Explanation
^ start of the line
(?:\h*([a-z\h]+\b)\h*)? first non-capturing group that matches 0 or 1 time:
\h* 0 or more horizontal spaces
([a-z\h]+\b) captures in a group letters and spaces, but stops at the end of the last word
\h* matches the possible remaining spaces without capturing
(?:\/([a-z\h]+)\/)? second non-capturing group that matches 0 or 1 time a name in a capturing group surrounded by slashes
(?:\h*([a-z\h]+\b)\h*)? third non-capturing group doing the same as first one, capturing the names in a third group.
$ end of the line

For your requirements
([A-z a-z /])+\w*
Sample

Hope this helps
(.\*?)\\/(.\*?)\\/(.\*)

Try this: ^([^/]*)(/[^/]+/)?([^/]*)$
This matches the following:
^ start of string (or with multiline modifier start of line)
([^/\n]*) anything other than / or new line zero or more times - this is captured as group 1
(/[^/\n]+/)? a single / followed by one or more non / or new line characters, then a single '/' character - this is captured as group 2, and is optional
([^/\n]*) anything other than / or new line zero or more times - this is captured as group 3
$ end of string (or with multiline modifier end of line)
You can see in action with your example text here: https://regex101.com/r/9kmKpy/1
To not capture the slashes you can add a non capturing group by adding ?: to the second set of brackets, and then adding another pair between the slashes:
^([^\/\n]*)(?:\/([^\/\n]+)\/)?([^\/\n]*)$
https://regex101.com/r/9kmKpy/2

I am not sure I follow what language is being used to extract the data, but based on what you have so far, you simply need to add '?':
(.*)(\/?.*\/?)(.*)
Not that this does not give you groupings for EACH name as some solutions will have multiple names in a single group
Edit:
Extending on Niitaku solution and looking at having each individual name in its own group, you could use:
^\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*$
As explained though, if using a language like ruby it would simply be:
ruby -pe '$_ = $_.scan(/\w+/)' file

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Using regex to parse/reduce text file - regex

Related

replaceAll regex to remove last - from the output

Regular expression with multiline matching (subtitles strings)

RegEx - if then else

Regex - optional capture group after wildcard

Regular Expression to Extract Text Bounded by '/'

Categories

Resources