Regex to remove time zone stamp - regex

In Google Sheets, I have time stamps with formats like the following:
5/25/2022 14:13:05
5/25/2022 13:21:07 EDT
5/25/2022 17:07:39 GMT+01:00
I am looking for a regex that will remove everything after the time, so the desired output would be:
5/25/2022 14:13:05
5/25/2022 13:21:07
5/25/2022 17:07:39
I have come up with the following regex after some trial and error, although I am not sure if it is prone to errors: [^0-9:\/' '\n].*
And the function in Google Sheets that I plan to use is REGEXREPLACE().
My goal is to be able to do calculations regardless of one's time zone, however the result will be stamped with the user's local time zone.
Could someone confirm this is correct? Appreciate any feedback I can get!

You can use
=REGEXREPLACE(A1, "^(\S+\s\S+).*", "$1")
=REGEXREPLACE(A1, "^([\d/]+\s[\d:]+).*", "$1")
See the regex demo #1 / regex demo #2.
Details:
^ - start of string
(\S+\s\S+) - Group 1: one or more non-whitespaces, one or more whitespaces and one or more non-whitespaces
[\d/]+\s[\d:]+ - one or more digits or / chars, a whitespace, one or more digits or colons
.* - any zero or more chars other than line break chars as many as possible.
The $1 is a replacement backreference that refers to the Group 1 value.

With your shown samples, attempts please try following regex in REGEXREPLACE. This will help to match time stamp specifically. Here is the Online demo for following regex. This will create only 1 capturing group with which we are replacing the whole value(as per requirement).
=REGEXREPLACE(A1, "^((?:\d{1,2}\/){2}\d{4}\s+(?:\d{1,2}:){2}\d{1,2}).*", "$1")
Explanation: Adding detailed explanation for above used regex.
^( ##Matching from starting of the value and creating/opening one and only capturing group.
(?:\d{1,2}\/){2} ##Creating a non-capturing group and matching 1 to 2 digits followed by / with 2 times occurrence here.
\d{4}\s+ ##Matching 4 digits occurrence followed by 1 or more spaces here.
(?:\d{1,2}:){2} ##In a non-capturing group matching 1 to 2 occurrence of digits followed by colon and this combination should occur2 times.
\d{1,2} ##Matching 1 to 2 occurrences of digits.
) ##Closing capturing group here.
.* ##This will match everything till last but its not captured.

You can do this without REGEX by simply splitting and adding the first and second index.
=ARRAYFORMULA(
IF(ISBLANK(A2:A),,
INDEX(SPLIT(A2:A," "),0,1)+
INDEX(SPLIT(A2:A," "),0,2)))

Related

Using JOIN and REGEXEXTRACT with ARRAYFORMULA to Switch First and Last Names Not Working

I have a column of names in my spreadsheet that are structured like this...
Albarran Basten, Thalia Aylin
I'm using the formula below to extract every word BEFORE the comma (last name), and then only the first word AFTER the comma (first name), and then switch their places. It works great.
=join(" ",REGEXEXTRACT(D2,",\s(\S+)"),REGEXEXTRACT(D2,"^(.*?),"))
The formula above returns the name mentioned above like this, exactly as I need it to...
Thalia Albarran Basten
But, when I try to get it to automatically update the entire column of names using ARRAYFORMULA, it joins together all the names in the column all together into one cell, in each of the cells all the way down the column. Here's the formula I'm using that won't work...
={"Student Full Name";arrayformula(if(D2:D="",,join(" ",REGEXEXTRACT(D2:D,",\s(\S+)"),REGEXEXTRACT(D2:D,"^(.*?),"))))}
Any idea on what I could change in this arrayformula to make it work? Thanks for your help.
You can replace your REGEXEXTRACTs with a single REGEXREPLACE:
REGEXREPLACE(D2:D, "^(.*?),\s*(\S+).*", "$2 $1")
Or,
REGEXREPLACE(D2:D, "^([^,]*),\s*(\S+).*", "$2 $1")
See the regex demo.
Details:
^ - start of string
(.*?) - Group 1 ($1): zero or more chars other than line break chars as few as possible
, - a comma
\s* - zero or more whitespaces
(\S+) - Group 2 ($2): one or more non-whitespaces
.* - zero or more chars other than line break chars as many as possible.
With your shown samples please try following regex with REGEXREPLACE.
REGEXREPLACE(D2:D, "^([^,]*),\s*([^\s]+)\s\S*$", "$2 $1")
Here is the Online demo for used regex.
Explanation: Adding detailed explanation for used regex.
^ ##Matching from starting of the value.
([^,]*) ##Creating 1st capturing group which matches everything before comma comes.
,\s* ##Matching comma followed by 0 or more occurrences of spaces here.
([^\s]+) ##Creating 2nd capturing group where matching all non-spaces here.
\s\S*$ ##Matching space followed by 0 or more non-spaces till end of the value.

Using REGEXEXTRACT on an IMPORTRANGE in Google Docs

I am importing a range from another Google sheet and I need to pull a specific number from the data that is imported. The data looks something like:
R2.word.4.word
I want to extract the second number. It will always follow this format (a letter and a number then a period then a word then a period then a number (might be single or double digit) then a period and a word). The regex to extract the second number should be: (\d+)(?!.*\d) and I have tested it in multiple regex test sites. However, Google docs gives me an error stating it is not a regular expression. I tried something like this (edited out URL and the sheet name):
=REGEXEXTRACT(IMPORTRANGE(URL,Sheet!A2:A200), "(\d+)(?!.*\d"))
Can anyone help me understand how I can fix this?
And the other issue here is that it isn't actually importing the range. I only get it to import on the first cell and not down the column.
You could write a pattern like:
=REGEXEXTRACT(A2,"^[A-Z]\d+\.\w+\.(\d+)")
Explanation
^ Start of string
[A-Z] Match a single uppercase char
\d+ Match 1+ digits
\. Match a dot
\w+ Match 1+ word characters
\. Match a dot
(\d+) Capture group 1, match 1+ digits
Regex demo
With your shown samples please try following regex.
=REGEXEXTRACT(A2,"^[a-zA-Z]\d+\.[^.]*\.(\d+)\.\S+$")
Here is the Online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^[a-zA-Z] ##From starting of value matching a-zA-Z here.
\d+ ##Matching 1 or more occurrences of digits.
\.[^.]*\. ##Matching literal dot till next occurrence of dot here.
(\d+) ##Creating 1 capturing group and which has 1 or more digits matching in it.
\.\S+$ ##Matching literal dot followed by 1o or more non-spaces till end of value.
"It will always follow this format"
Based on the above; you can use REGEXEXTRACT() but it's slow compared to simple SPLIT() which in your standardized format is ideal:
Formula in B2:
=INDEX(SPLIT(A2:A3,"."),0,3)
This is an array-formula by default and will spill all values down. Just apply it to your entire range.

replaceAll regex to remove last - from the output

I was able to achieve some of the output but not the right one. I am using replace all regex and below is the sample code.
final String label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";
System.out.println(label.replaceAll(
"([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)", "$3"));
i want this output:
abc-nyd-request-xyxpt
but getting:
abc-nyd-request-xyxpt-
here is the code https://ideone.com/UKnepg
You may use this .replaceFirst solution:
String label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";
label.replaceFirst("(?:[^-]*-){2}(.+?)(?:--1)?-[^-]+$", "$1");
//=> "abc-nyd-request-xyxpt"
RegEx Demo
RegEx Details:
(?:[^-]+-){2}: Match 2 repetitions of non-hyphenated string followed by a hyphen
(.+?): Match 1+ of any characters and capture in group #1
(?:--1)?: Match optional --1
-: Match a -
[^-]+: Match a non-hyphenated string
$: End
The following works for your example case
([^-]+)-([^-]+)-(.+[^-])-+([^-]+)-([^-]+)
https://regex101.com/r/VNtryN/1
We don't want to capture any trailing - while allowing the trailing dashes to have more than a single one which makes it match the double --.
With your shown samples and attempts, please try following regex. This is going to create 1 capturing group which can be used in replacement. Do replacement like: $1in your function.
^(?:.*?-){2}([^-]*(?:-[^-]*){3})--.*
Here is the Online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^(?:.*?-){2} ##Matching from starting of value in a non-capturing group where using lazy match to match very near occurrence of - and matching 2 occurrences of it.
([^-]*(?:-[^-]*){3}) ##Creating 1st and only capturing group and matching everything before - followed by - followed by everything just before - and this combination 3 times to get required output.
--.* ##Matching -- to all values till last.

Using regex to parse/reduce text file

edit
I've realized I made a mistake when explaining myself. Apologies for that.
Most of the artifacts come from this path:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\
then breaks into Artifact folders and its sub-folders like this:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.2\data.xxx
I would appreciate help with following thing:
I have this list (around 5k rows) of paths to different artifacts and they have different versions, to give you an example:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.3\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.3\data.xxx
And my goal to achieve is this:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
Basically to scope it down to just 1 version.
I've tried using ^(.*)(\n\1)+$ and $1. but that obviously didn't work. So I was wondering if you have an idea how to approach this. Greatly appreciate help, thanks!
You can use
Find what: ^(.*\.)(\d+)\\[^\\\n]+(\n\1\d+\\[^\\\n]+)+$
Replace: $1$2\\
See the regex demo. Details:
^ - start of a line (it is the default ^ behavior in Visual Studio Code)
(.*\.) - Group 1: any one or more chars other than line break chars as many as possible and then a .
(\d+) - Group 2:
\\ - a \ char
[^\\\n]+ - one or more chars other than \ and a line break
(\n\1\d+\\[^\\\n]+)+ - Group 3 capturing one or more sequences of a line break and then the value captured into Group 1, one or more digits, a \ char and then one or more chars other than \ and a line break
$ - end of a line.
Here is another attempt, see regex101 demo.
The basic idea is to isolate someText-\d?. in capture group 2.
Then look for $2 in following lines. What precedes $2 or follows $2 in those following lines can vary.
Find: ^(.*\\(?=.*\\))(.*-\d+\.)(.*\\?.*)(\n.*\2.*)*
Replace: $1$2$3
So here is the most interesting part: ^(.*\\(?=.*\\))(.*-\d+\.)
This will get your Artifact-1. or Artifact-17. or someText-2. into capture group 2. Because using a positive lookahead (?=.*\\) the following group 2 (.*-\d+\.) will be in the last directory only. And then (.*\\?.*) gathers the rest of that line into group 3.
Finally (\n.*\2.*)* checks to see if there is a backreference to group 2, \2, in any following lines. [Technically, that backreference could be anywhere in a line, even the beginning, that can be fixed if necessary - let me know if you need that for your data. See safer regex101 demo if 'someText-/d.' could appear anywhere and should be ignored if not last directory and use that find.]
You can not use a single capture group for the whole line using ^(.*), as you want to repeat only the part before the last dot using a backreference and that will not work capturing the whole line.
Therefore you have to capture the digits in the first match in a separate capture group to keep it in the replacement.
If you want to match all following lines with the same text before the last dot, you can use a repeating group:
^\s*(.*\.)(\d+\\[^\\\r\n]*)(?:\r?\n\s*\1\d*\\[^\\\r\n]*)+
The pattern matches:
^ Start of string
\s* Match optional whitespace chars
(.*\.) Capture group 1, match till the last dot
(\d+\\[^\\\r\n]*) Capture group 2, match 1+ digits, \ and optional chars other than \ or a newline
(?: Non capture group
\r?\n\s*\1 Match a newline and a backreference to group 1
\d+\\[^\\\r\n]* Same pattern as in the first part
)+ Close the non capture group and repeat 1+ times
See a regex demo.
In the replacement use the 2 capture groups $1$2
The replacement will look like
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx

Regular expression with multiline matching (subtitles strings)

Need some help in regexp matching pattern.
The text goes like here (it's subtitles for video)
...
223
00:20:47,920 --> 00:20:57,520
- Hello! This is good subtitle text.
- Yes! How are you, stackoverflow?
224
00:20:57,520 --> 00:21:11,120
Wow, seems amazing.
- We're good, thanks.
Like, you know, everyone is happy around here with their laptops.
225
00:21:11,120 --> 00:21:14,440
- Understood. Some dumb text
...
I need a set of groups:
startTime, endTime, text
For now my achievements are not very good. I can get startTime, endTime and some text, but not all the text, only the last sentence. I've attached a screenshot.
As you can see, group 3 is capturing text, but only last sentence.
Please, explain me what I'm doing wrong.
Thank you.
Accounting for the possibility there is no new-line character after the final text of your string; Would the following work for you:
(\d\d:\d\d:\d\d,\d\d\d)[ >-]*?((?1))\n(.*?(?=\n\n|\Z))
See the online demo
(\d\d:\d\d:\d\d,\d\d\d) - The same pattern as you used to capture starting time in 1st capture group.
[ >-]*? - 0+ (but lazy) character from the character class up to:
((?1)) - A 2nd capture group which matches the same pattern as 1st group.
\n - A newline-character.
(.*?(?=\n\n|\Z)) - A 3rd capture group that captures anything (including newline with the s-flag) up to a positive lookahead for either two newline characters or the end of the whole string.
Note, some (not all) engines allow for backreferencing a previous subpattern. I guess the app you are using does not. Therefor you can swap the (?1) with your own pattern to capture the 2nd group.
Another option is to use a pattern that would capture all lines in group 3 that do not start with 3 digits.
(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d\d\d\b).*)*)
Explanation
(\d\d:\d\d:\d\d,\d\d\d) Capture group 1 Match a time like pattern
--> Match literally
(\d\d:\d\d:\d\d,\d\d\d) Capture group 2 Same pattern as group 1
( Capture group 3
(?: Non capture group
\r?\n(?!\d\d\d\b).* Match a newline and assert using a negative lookahead that the line does not start with 3 digits followed by word boundary. If that is the case, match the whole line
)* Optionally repeat all lines
) Close group 3
Regex demo
A bitmore specific pattern could be matching all lines that do not start with 3 digits or a start/end time like pattern.
^(\d\d:\d\d:\d\d,\d\d\d)[^\S\r\n]+-->[^\S\r\n]+(\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d+$|\d\d:\d\d:\d\d,\d\d\d\b).*)*)
Regex demo