I need a regular expression to extract names from a GEDCOM file. The format is:
Fred Joseph /Smith/
Here the text bounded by the / characters is the surname, and Fred Joseph are the forenames. The complication is that the surname could be anywhere in the text, or may not be there at all. I need something that will extract the surname and capture everything else as the forenames.
This is as far as I have got. I have tried making groups optional with the ? quantifier, but to no avail:
As you can see it has several problems: If the surname is missing nothing gets captured, the forename(s) sometimes have leading and trailing spaces, and I have 3 capture groups when I'd really like 2. Even better would be if the capture group for the surname didn't include the '/' characters.
Any help would be much appreciated.
For your last line, I'm not sure there is a way to join the group 1 with group 3 into a single group.
Here is my proposed solution. It doesn't capture spaces around forenames.
^(?:\h*([a-z\h]+\b)\h*)?(?:\/([a-z\h]+)\/)?(?:\h*([a-z\h]+\b)\h*)?$
To match the names correctly, take care to use the case-insensitive flag, and if you test all lines at once, use the multiline flag.
See the demo
Explanation
^ start of the line
(?:\h*([a-z\h]+\b)\h*)? first non-capturing group that matches 0 or 1 time:
\h* 0 or more horizontal spaces
([a-z\h]+\b) captures in a group letters and spaces, but stops at the end of the last word
\h* matches the possible remaining spaces without capturing
(?:\/([a-z\h]+)\/)? second non-capturing group that matches 0 or 1 time a name in a capturing group surrounded by slashes
(?:\h*([a-z\h]+\b)\h*)? third non-capturing group doing the same as first one, capturing the names in a third group.
$ end of the line
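For anyone who wants to try the pattern above from code, here is a minimal Python sketch. Python's re module does not support \h, so [ \t] is substituted for it here (my assumption of a close enough equivalent for ordinary text), and the function name parse_name is made up for illustration.

```python
import re

# \h is not available in Python's re module, so [ \t] stands in for it here.
NAME_RE = re.compile(
    r"^(?:[ \t]*([a-z \t]+\b)[ \t]*)?"   # group 1: forenames before the surname
    r"(?:/([a-z \t]+)/)?"                # group 2: surname, slashes not captured
    r"(?:[ \t]*([a-z \t]+\b)[ \t]*)?$",  # group 3: forenames after the surname
    re.IGNORECASE,
)

def parse_name(line):
    """Return the three groups as a tuple, or None if the line does not match."""
    m = NAME_RE.match(line)
    return m.groups() if m else None
```

Groups that did not take part in the match come back as None, so a line without a surname still yields its forenames in group 1.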
For your requirements
([A-z a-z /])+\w*
Sample
Hope this helps
(.*?)\/(.*?)\/(.*)
Try this: ^([^/\n]*)(/[^/\n]+/)?([^/\n]*)$
This matches the following:
^ start of string (or with multiline modifier start of line)
([^/\n]*) anything other than / or new line zero or more times - this is captured as group 1
(/[^/\n]+/)? a single / followed by one or more non / or new line characters, then a single '/' character - this is captured as group 2, and is optional
([^/\n]*) anything other than / or new line zero or more times - this is captured as group 3
$ end of string (or with multiline modifier end of line)
You can see in action with your example text here: https://regex101.com/r/9kmKpy/1
To not capture the slashes you can add a non capturing group by adding ?: to the second set of brackets, and then adding another pair between the slashes:
^([^\/\n]*)(?:\/([^\/\n]+)\/)?([^\/\n]*)$
https://regex101.com/r/9kmKpy/2
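A regex alone cannot merge the text before and after the surname into one group, but that step is easy once the groups are in code. A rough Python sketch, assuming the non-capturing variant of the pattern above; split_name is a hypothetical helper name, and collapsing whitespace in the forenames is my own choice:

```python
import re

# Optional /surname/ with forename text allowed on either side of it.
NAME_RE = re.compile(r"^([^/\n]*)(?:/([^/\n]+)/)?([^/\n]*)$")

def split_name(line):
    """Return (forenames, surname); surname is None when there is no /.../ part."""
    m = NAME_RE.match(line)
    if m is None:
        return None
    before, surname, after = m.groups()
    # Join the text from both sides of the surname and collapse extra spaces.
    forenames = " ".join((before + " " + after).split())
    return forenames, surname
```

This gives two values per name, with no slashes in the surname and no leading or trailing spaces in the forenames.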
I am not sure I follow what language is being used to extract the data, but based on what you have so far, you simply need to add '?':
(.*)(\/?.*\/?)(.*)
Note that this does not give you a group for EACH name, as some solutions will have multiple names in a single group.
Edit:
Extending Niitaku's solution and looking at having each individual name in its own group, you could use:
^\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*$
As explained though, if using a language like ruby it would simply be:
ruby -pe '$_ = $_.scan(/\w+/)' file
edit
I've realized I made a mistake when explaining myself. Apologies for that.
Most of the artifacts come from this path:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\
then breaks into Artifact folders and its sub-folders like this:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.2\data.xxx
I would appreciate help with the following:
I have this list (around 5k rows) of paths to different artifacts and they have different versions, to give you an example:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.3\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.3\data.xxx
My goal is this:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
Basically to scope it down to just 1 version.
I've tried using ^(.*)(\n\1)+$ and $1. but that obviously didn't work. So I was wondering if you have an idea how to approach this. Greatly appreciate help, thanks!
You can use
Find what: ^(.*\.)(\d+)\\[^\\\n]+(\n\1\d+\\[^\\\n]+)+$
Replace: $1$2\\
See the regex demo. Details:
^ - start of a line (it is the default ^ behavior in Visual Studio Code)
(.*\.) - Group 1: any zero or more chars other than line break chars, as many as possible, and then a .
(\d+) - Group 2: one or more digits
\\ - a \ char
[^\\\n]+ - one or more chars other than \ and a line break
(\n\1\d+\\[^\\\n]+)+ - Group 3 capturing one or more sequences of a line break and then the value captured into Group 1, one or more digits, a \ char and then one or more chars other than \ and a line break
$ - end of a line.
Here is another attempt, see regex101 demo.
The basic idea is to isolate someText-\d?. in capture group 2.
Then look for $2 in following lines. What precedes $2 or follows $2 in those following lines can vary.
Find: ^(.*\\(?=.*\\))(.*-\d+\.)(.*\\?.*)(\n.*\2.*)*
Replace: $1$2$3
So here is the most interesting part: ^(.*\\(?=.*\\))(.*-\d+\.)
This will get your Artifact-1. or Artifact-17. or someText-2. into capture group 2. Because of the positive lookahead (?=.*\\), group 1 stops at the second-to-last backslash, so the following group 2 (.*-\d+\.) matches in the last directory only. And then (.*\\?.*) gathers the rest of that line into group 3.
Finally (\n.*\2.*)* checks to see if there is a backreference to group 2, \2, in any following lines. [Technically, that backreference could be anywhere in a line, even the beginning, that can be fixed if necessary - let me know if you need that for your data. See safer regex101 demo if 'someText-/d.' could appear anywhere and should be ignored if not last directory and use that find.]
You cannot use a single capture group for the whole line with ^(.*), because you want to repeat only the part before the last dot using a backreference, and that will not work when the whole line is captured.
Therefore you have to capture the digits of the first match in a separate capture group to keep them in the replacement.
If you want to match all following lines with the same text before the last dot, you can use a repeating group:
^\s*(.*\.)(\d+\\[^\\\r\n]*)(?:\r?\n\s*\1\d+\\[^\\\r\n]*)+
The pattern matches:
^ Start of string
\s* Match optional whitespace chars
(.*\.) Capture group 1, match till the last dot
(\d+\\[^\\\r\n]*) Capture group 2, match 1+ digits, \ and optional chars other than \ or a newline
(?: Non capture group
\r?\n\s*\1 Match a newline and a backreference to group 1
\d+\\[^\\\r\n]* Same pattern as in the first part
)+ Close the non capture group and repeat 1+ times
See a regex demo.
In the replacement use the 2 capture groups $1$2
The replacement will look like
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
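If the list lives in a file rather than an editor buffer, the same find/replace can be run from a script. A minimal Python sketch, assuming the repeating-group pattern above with the multiline flag; keep_first_version is a name I made up, and \1\2 is Python's spelling of the $1$2 replacement:

```python
import re

# Group 1: everything up to the dot before the minor version.
# Group 2: the first version's digits plus the rest of that line.
# The repeated non-capturing group swallows the following lines
# that share the same group 1 prefix.
DEDUP_RE = re.compile(
    r"^\s*(.*\.)(\d+\\[^\\\r\n]*)(?:\r?\n\s*\1\d+\\[^\\\r\n]*)+",
    re.MULTILINE,
)

def keep_first_version(text):
    """Collapse each run of same-artifact lines to its first line."""
    return DEDUP_RE.sub(r"\1\2", text)
```

Lines for an artifact that appears only once do not match at all, so they pass through unchanged.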
Need some help in regexp matching pattern.
The text goes like here (it's subtitles for video)
...
223
00:20:47,920 --> 00:20:57,520
- Hello! This is good subtitle text.
- Yes! How are you, stackoverflow?
224
00:20:57,520 --> 00:21:11,120
Wow, seems amazing.
- We're good, thanks.
Like, you know, everyone is happy around here with their laptops.
225
00:21:11,120 --> 00:21:14,440
- Understood. Some dumb text
...
I need a set of groups:
startTime, endTime, text
For now my achievements are not very good. I can get startTime, endTime and some text, but not all the text, only the last sentence. I've attached a screenshot.
As you can see, group 3 is capturing text, but only last sentence.
Please explain what I'm doing wrong.
Thank you.
To account for the possibility that there is no newline character after the final text of your string, would the following work for you?
(\d\d:\d\d:\d\d,\d\d\d)[ >-]*?((?1))\n(.*?(?=\n\n|\Z))
See the online demo
(\d\d:\d\d:\d\d,\d\d\d) - The same pattern as you used to capture starting time in 1st capture group.
[ >-]*? - 0+ (but lazy) character from the character class up to:
((?1)) - A 2nd capture group which matches the same pattern as 1st group.
\n - A newline-character.
(.*?(?=\n\n|\Z)) - A 3rd capture group that captures anything (including newline with the s-flag) up to a positive lookahead for either two newline characters or the end of the whole string.
Note: some (not all) engines support calling a previous subpattern as a subroutine, which is what (?1) does. I guess the app you are using does not. Therefore you can replace the (?1) with your own pattern to capture the 2nd group.
Another option is to use a pattern that would capture all lines in group 3 that do not start with 3 digits.
(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d\d\d\b).*)*)
Explanation
(\d\d:\d\d:\d\d,\d\d\d) Capture group 1 Match a time like pattern
--> Match literally
(\d\d:\d\d:\d\d,\d\d\d) Capture group 2 Same pattern as group 1
( Capture group 3
(?: Non capture group
\r?\n(?!\d\d\d\b).* Match a newline and assert using a negative lookahead that the line does not start with 3 digits followed by word boundary. If that is the case, match the whole line
)* Optionally repeat all lines
) Close group 3
Regex demo
A bit more specific pattern could match all following lines that neither consist only of digits nor start with a start/end time-like pattern.
^(\d\d:\d\d:\d\d,\d\d\d)[^\S\r\n]+-->[^\S\r\n]+(\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d+$|\d\d:\d\d:\d\d,\d\d\d\b).*)*)
Regex demo
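To get startTime, endTime and text out as a set of groups in code, something along these lines might do. It is only a sketch in Python, assuming the more specific pattern above with the multiline flag and \n line endings; parse_cues is a hypothetical name, and stripping the leading newline from group 3 is my own addition.

```python
import re

CUE_RE = re.compile(
    r"^(\d\d:\d\d:\d\d,\d\d\d)[^\S\r\n]+-->[^\S\r\n]+(\d\d:\d\d:\d\d,\d\d\d)"
    # Group 3: every following line that is neither a bare counter
    # nor the start of another timing line.
    r"((?:\r?\n(?!\d+$|\d\d:\d\d:\d\d,\d\d\d\b).*)*)",
    re.MULTILINE,
)

def parse_cues(srt_text):
    """Return a list of (start_time, end_time, text) tuples."""
    cues = []
    for m in CUE_RE.finditer(srt_text):
        start, end, body = m.groups()
        # Group 3 begins with the newline after the timing line; strip it.
        cues.append((start, end, body.strip()))
    return cues
```

Multi-line subtitle text stays together in one string, with the line breaks between sentences preserved.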
Here I have a text string.
Serial#......... 12345678910123456\nCust#........... 654321\nCustomer Name... Some Customer\nBILL TO NO NAME. Bill To: 123456 - Some Company Pty Ltd\nDATE...... 01/01/00
I want to capture 2 parts of this string.
Cust#........... 654321 BILL TO NO NAME. Bill To: 123456 - Some Company Pty Ltd
using regex.
So far I have Cust#.*?\d+ which captures
Cust#........... 654321
However, I don't think this is the best approach.
Note: this is one string out of thousands, so the data within the strings is dynamic. Can I use the end-of-line \n characters to capture what I need?
Try this regex: ^.*?\n(.*?)\n.*?\n(.*?)\n.*$ at least it should give you a different way of looking at the problem.
It describes the entire string, using newlines as element delimiters. The parentheses define the groups you want to save, which are the 2nd and 4th lines.
Of course this depends on the elements you want always being the 2nd and 4th and being delimited by the newlines.
https://regex101.com/r/harmzn/1
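A quick way to check this positional pattern is to run it in Python. This is just a sketch; second_and_fourth is a made-up name, and it assumes each record has exactly five lines as in the sample, since the pattern has no multiline flag and must describe the whole string.

```python
import re

# Five newline-delimited elements; the 2nd and 4th are captured.
LINES_RE = re.compile(r"^.*?\n(.*?)\n.*?\n(.*?)\n.*$")

def second_and_fourth(record):
    """Return (line 2, line 4) of a five-line record, or None if it does not fit."""
    m = LINES_RE.match(record)
    return m.groups() if m else None
```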
You might use 2 capturing groups. In the first group, use your pattern without the lazy quantifier, as the digits are at the end of the line.
Then match (not capture) all the lines that do not start with BILL
After that, capture in group 2 the whole line that starts with BILL
^(Cust#.*\d+)(?:\r?\n(?!BILL ).*)*\r?\n(BILL .*)
Explanation
^ Start of string
( Capture group 1
Cust#.*\d+ The pattern to match Cust# with the digits at the end
) Close group
(?:\r?\n(?!BILL ).*)*\r?\n Match all lines that do not start with BILL
( Capture group 2
BILL .* Match the line that starts with BILL
) Close group
Regex demo
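Here is how that pattern might be applied in Python, as a sketch rather than a finished solution. The re.MULTILINE flag is my assumption: it is needed so that ^ can match at the Cust# line, which is the second line of the sample. The name cust_and_bill is hypothetical.

```python
import re

CUST_BILL_RE = re.compile(
    r"^(Cust#.*\d+)"             # group 1: the Cust# line, ending in digits
    r"(?:\r?\n(?!BILL ).*)*"     # skip any lines that do not start with BILL
    r"\r?\n(BILL .*)",           # group 2: the whole BILL line
    re.MULTILINE,
)

def cust_and_bill(record):
    """Return (Cust# line, BILL line), or None when either is missing."""
    m = CUST_BILL_RE.search(record)
    return m.groups() if m else None
```

Unlike a purely positional pattern, this keeps working if extra lines appear between the Cust# line and the BILL line.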
When using character delimited text, what code allows me to pull out specific segments within a given row? Out of a given set of data (focusing on bold):
1194459945,11/07/2007 18:25:45,2,vnta,287.78,2,7.783,2,34.111,2,1.3,2,89.54,2,1485.31,26.612
Trying to get it like:
11/07/2007 7.783 89.54
Currently, the progress I've made has been: (\w+,)(.+) (
which has given me the first two columns, but I'm stuck as to how to reach 7.783 and segment that out. Without including the entire row. I cannot put \, because that doesn't help.
Something like this might work.. ^.*?,([^ ,]+)(?:.*?,){5}([^ ,]+)(?:.*?,){6}([^ ,]+).*$
Explanation:
^ - Start of the string / line
.*?, - matches anything up until the first comma
([^ ,]+) - matches anything not a space or comma and stores it in capture group 1 (your date)
(?:.*?,){5} - non capture group to match the fields and commas for the next 5 fields
([^ ,]+) - matches anything not a space or comma and stores it in capture group 2 (your 7.783)
(?:.*?,){6} - another non capture group to match the fields and commas for the next 6 fields
([^ ,]+) - matches anything not a space or comma and stores it in capture group 3 (your 89.54)
.*$ - matches anything trailing after this match to the end of string / line
Notepad++:
You can use the find and replace tool in Notepad++ to replace the strings with only the capture groups which can be accessed by using a dollar sign followed by the capture group number like so:
Find: ^.*?,([^ ,]+)(?:.*?,){5}([^ ,]+)(?:.*?,){6}([^ ,]+).*$
Replace: $1 $2 $3
Test:
Before:
1194459945,11/07/2007 18:25:45,2,vnta,287.78,2,7.783,2,34.111,2,1.3,2,89.54,2,1485.31,26.612
After:
11/07/2007 7.783 89.54
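The same find and replace can be scripted if there are many files. A small Python sketch, assuming the pattern above with the multiline flag so that every row is handled; pick_fields is a name I invented, and \1 \2 \3 is Python's form of $1 $2 $3.

```python
import re

FIELDS_RE = re.compile(
    r"^.*?,([^ ,]+)(?:.*?,){5}([^ ,]+)(?:.*?,){6}([^ ,]+).*$",
    re.MULTILINE,
)

def pick_fields(text):
    """Replace each matching row with its three captured fields."""
    return FIELDS_RE.sub(r"\1 \2 \3", text)
```

Because . never matches a newline here, the counted groups cannot run over into the next row.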
I have some lines which I need to alter. They are protein sequences. How would I copy the first 4 characters of the line to the end of the line, and also copy the last 4 characters to the beginning of the line?
The strings are variable which complicates it, for example:
>X
LTGLGIGTGMAATIINAISVGLSAATILSLISGVASGGAWVLAGAKQALKEGGKKAGIAF
>Y
LVATGMAAGVAKTIVNAVSAGMDIATALSLFSGAFTAAGGIMALIKKYAQKKLWKQLIAA
Moreover, how could I exclude lines with a '>' at the beginning (these are names of the corresponding sequence)?
Does anyone know a regex which will allow this to work?
I've already tried some regex solutions but I'm not very experienced with this sort of thing and I can find the end string but can't get it to replace:
Find:
(...)$
Replace:
^$2$1"
An example of what I want to achieve is:
>1
ABCDEFGHIJKLMNOPQRSTUVWXYZ
becomes:
>1
WXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCD
Thanks
Try doing a find, in regex mode, on the following pattern:
^([A-Z]{4}).*([A-Z]{4})$
Then replace with the last four characters, the whole match, and the first four characters:
$2$0$1
Demo
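Outside an editor, the same replacement could look like this in Python. It is a sketch under a few assumptions: the multiline flag is on so each sequence line is handled separately, \g<0> is Python's equivalent of $0, and wrap_ends is a hypothetical name.

```python
import re

# First four and last four capital letters of a line; header lines
# starting with '>' cannot match because '>' is not in [A-Z].
SEQ_RE = re.compile(r"^([A-Z]{4}).*([A-Z]{4})$", re.MULTILINE)

def wrap_ends(fasta_text):
    """Prefix each sequence with its last 4 letters and append its first 4."""
    return SEQ_RE.sub(r"\g<2>\g<0>\g<1>", fasta_text)
```

Sequence lines shorter than eight letters are left untouched, since the two four-letter groups cannot overlap.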
You can use the regex below.
^(([A-Z]{4})([A-Z]*)([A-Z]{4}))$
^ asserts the position at the start of the line, so nothing can come before it.
( is the start of a capture group, this is group 1.
( is the start of a capture group, this is group 2. This group is inside group 1.
[A-Z]{4} means exactly 4 capital characters from A to Z.
) is the end of capture group 2.
( is the start of a capture group, this is group 3.
[A-Z]* matches capital characters from A to Z between zero and infinite times.
) is the end of capture group 3.
( is the start of a capture group, this is group 4.
[A-Z]{4} means exactly 4 capital characters from A to Z.
) is the end of capture group 4.
$ asserts the position at the end of the line, so nothing can come after it.
See how it works with a replace here: https://regex101.com/r/W786uL/3.
$4$1$2
$4 means put capture group 4 here. Which is the last 4 characters.
$1 means put capture group 1 here. Which is everything in the entire string.
$2 means put capture group 2 here. Which is the first 4 characters.
You can use
^(.{4})(.*?)(.{4})$
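^ - start of string
(.{4}) - match any four characters except newline (group 1, the first four)
(.*?) - match any character zero or more times (lazy mode) (group 2, the middle)
(.{4}) - match any four characters except newline (group 3, the last four)
$ - end of string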
Demo