I am trying to build a regular expression that would pull the first time out of a string.
The issue is the time format is not standardized.
Here are the possible variations.
':' with 1 hour digit before the ':' (ex. 9:00 pm)
':' with 2 hour digits before the ':' (ex. 10:00pm)
no minutes with 1 hour digit (ex. 9pm)
no minutes with 2 hour digits (ex. 10pm)
Additionally there may or may not be a space before "am" or "pm"
Here is an example string.
7:30 pm -9 pm Lorem Ipsum is simply dummy text. 9pm-10pm Lorem Ipsum is simply dummy text
I would like this string to return "7:30 pm"
You did not specify which tool you want to use, so here is a simple implementation using sed:
echo '7:30 pm -9 pm Lorem Ipsum is simply dummy text. 9pm-10pm Lorem Ipsum is simply dummy text' | sed 's/\([0-2]\?[0-9]\(:[0-5][0-9]\)\? *[ap]m\).*/\1/i'
Legend:
'[0-2]\?[0-9]' matches the hour (with 1 or 2 digits)
'\(:[0-5][0-9]\)\?' matches the minutes (optional)
' *' matches optional spaces
'[ap]m' matches am, pm, AM, PM (also Am, aM, pM, Pm)*
'.*' matches the rest of the string
In addition: the outer \(...\) creates a group around all the above elements, which the substitution part of the command refers to with the backreference \1.
*: The trailing /i modifier makes the regex case insensitive (a GNU sed extension).
You can rewrite all of this as a standard Perl regex:
/(?i)[0-2]?\d(?::[0-5]\d)?\s*[ap]m/
A short Ruby example:
#!/usr/bin/env ruby
input = "7:30 pm -9 pm Lorem Ipsum is simply dummy text. 9pm-10pm Lorem Ipsum is simply dummy text"
puts input[/(?i)[0-2]?\d(?::[0-5]\d)?\s*[ap]m/]
Try this regex:
(?i)\d{1,2}(?::\d{2})?\s*[ap]m
Explaining:
(?i) # insensitive case
\d{1,2} # one or two digits
(?: # optional group
:\d{2} # the minutes
)? # end optional group
\s* # any spaces
[ap]m # "am" or "pm"
Hope it helps.
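As a rough illustration of the pattern explained above, here is a minimal Python sketch. The function name first_time is hypothetical, and re.IGNORECASE is used in place of the inline (?i) flag.

```python
import re

# Same pattern as above; re.IGNORECASE replaces the inline (?i) flag.
TIME_RE = re.compile(r"\d{1,2}(?::\d{2})?\s*[ap]m", re.IGNORECASE)


def first_time(text):
    """Return the first time-like substring in text, or None if there is none."""
    match = TIME_RE.search(text)
    return match.group(0) if match else None


sample = ("7:30 pm -9 pm Lorem Ipsum is simply dummy text. "
          "9pm-10pm Lorem Ipsum is simply dummy text")
print(first_time(sample))  # 7:30 pm
```

Because search() stops at the leftmost match, no extra logic is needed to get only the first time.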
You can use the following regex:
\d{1,2}\:?(?:\d{1,2}|)\s*[ap]m
An almost generic solution may be achieved using the following expression:
([012]?\d(:[0-5]\d)?\s*(pm|am|PM|AM))
It uses capturing groups, and with the global flag it can find every time string in the input.
In JavaScript, it can be tested as follows:
var testTime = "7:30 pm -9 pm Lorem Ipsum is simply dummy text. 9pm-10pm Lorem Ipsum is simply dummy text";
var timeRex = /([012]?\d(:[0-5]\d)?\s*(pm|am|PM|AM))/g;
var firstTime = timeRex.exec(testTime)[0];
console.log(firstTime);
I believe there is a better general solution. I will try to find something more robust and then post it here.
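To see every time string rather than only the first, the same expression can be run over the whole input. This is a minimal Python sketch, not the answer's own code; the name all_times is hypothetical, and the groups are made non-capturing so that each result is the whole time string.

```python
import re

# Same expression as above, with non-capturing groups.
TIME_RE = re.compile(r"[012]?\d(?::[0-5]\d)?\s*(?:pm|am|PM|AM)")


def all_times(text):
    """Return every time-like substring in text, in order of appearance."""
    return [m.group(0) for m in TIME_RE.finditer(text)]


sample = ("7:30 pm -9 pm Lorem Ipsum is simply dummy text. "
          "9pm-10pm Lorem Ipsum is simply dummy text")
print(all_times(sample))  # ['7:30 pm', '9 pm', '9pm', '10pm']
```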
I've been trying to create a regex that matches spaces and alphanumeric values.
Below I'm sharing the sample strings.
Manchester United 8547|12345678910
|12345678910
Manchester |12345678910
124587933 |12345678910
8457 Manchester United|12345678910
Manchester United|12345678910
I want to capture everything before the pipe (|) separator. Sometimes there are only spaces and no alphanumeric values before the pipe, as shown in the 2nd example. The regex should not capture the pipe (|) or the numeric values after it (12345678910).
I've tried the regexes below, but none of them work for me.
^.*$
^[\s\w\d]+$
[a-zA-Z0-9\s]+
[a-zA-Z0-9\s\W]+
^[\sa-z|A-Z|0-9]+$
^[\sa-z|A-Z|0-9]+$
[^\s]*$
([^\"]*)
^[a-zA-Z0-9]$
^([^?]*)$
.+?(?=\w)
\s[a-zA-Z0-9]+
^[\sa-zA-Z0-9]+
I need a full match, not a group match.
For example, if I try to match
Manchester 8457, then the regex would be Manchester \d+. This gives me a full match, not a group match.
You can try this.
input.substring(0,input.indexOf("|"))
If you want to match alphanumeric before the pipe and not get a group match, but a match only, you can use a character class with a positive lookahead (?=\|) (if that is supported) to assert the pipe at the right.
^[A-Za-z0-9 ]+(?=\|)
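A minimal Python sketch of the lookahead approach, assuming each line is tested on its own. The function name before_pipe is hypothetical. The text before the pipe comes back as the full match (group 0), so no capturing group is involved.

```python
import re

# Alphanumerics and spaces at the start of the line, only if a pipe follows.
BEFORE_PIPE = re.compile(r"^[A-Za-z0-9 ]+(?=\|)")


def before_pipe(line):
    """Return the text before the pipe as a full match, or None."""
    match = BEFORE_PIPE.match(line)
    return match.group(0) if match else None


print(before_pipe("Manchester United 8547|12345678910"))  # Manchester United 8547
```

Note that the + quantifier needs at least one character, so a line that starts directly with the pipe yields no match.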
Assuming that every line would have a pipe, you could split the input string on CRLF, and then extract the portion to the left of the pipe:
// Requires java.util.Arrays, java.util.List and java.util.stream.Collectors
String input = "Manchester United 8547|12345678910\n |12345678910\nManchester |12345678910\n124587933 |12345678910\n8457 Manchester United|12345678910\n Manchester United|12345678910\n";
String[] parts = input.split("\r?\n");
List<String> contents = Arrays.stream(parts)
    .map(x -> x.split("\\|")[0].trim()) // keep only the text left of the pipe
    .collect(Collectors.toList());
System.out.println(contents);
This prints:
[Manchester United 8547, , Manchester, 124587933, 8457 Manchester United, Manchester United]
To get the alphanumeric part, use the following:
^\s*\w(.+?)\|
This should answer your question, I guess.
^(.+?)\|
Please try this one; it matches only from the beginning of the string.
The \| is for the pipe.
Any help will be appreciated. I have written a regex which fails in some edge cases, and I am not sure if there is a way to handle this.
I am trying to extract the values that have a 1.1, 1.2, etc. prefix.
The regex I am using is
"[1-9]\.[1-9]([^\s]+)". If I use it, it extracts the first three values, but for 4.1, which contains a space, only part is extracted. If I use "[1-9]\.1.*[(XDX)]$", it captures the whole line.
Currently I have written logic which checks for MR, splits on it, and puts the pieces in an array, which is a very inefficient way to do it.
Let me know if you can think of a better solution than this one.
GIBBERISH
1.1CDDAX/SXEVEN MR*XDX 2.1CDDAX/JEROME MR*XDX
3.1CDDAX/SIXM MR*XDX 4.1CDDAX AMX/SIXM MR*XDX
1 OXP EY 31SED W PK3 MEL/REDOOK DEOPRE 31SED21 XO XRXVEL DEF
EXPRESSA VERO IN IIS AETATIBUS, QUAE IAM CONFIRMATAE SUNT. ATQUI
PERSPICUUM EST HOMINEM E CORPORE ANIMOQUE CONSTARE,
CUM PRIMAE SINT ANIMI PARTES, SECUNDAE CORPORIS. TUM QUINTUS:
EST PLANE, PISO, UT DICIS, INQUIT. BONA AUTEM CORPORIS HUIC SUNT,
QUOD POSTERIUS POSUI, SIMILIORA. ILLA TAMEN SIMPLICIA
You may use
(?<!\S)[1-9]\.[1-9](.*?)(?=\s+MR\*XDX|$)
Or,
(?<!\S)[1-9]\.[1-9]((?:(?!\s+MR\*XDX).)+)
Details
(?<!\S) - a whitespace should come right before the current location or start of string
[1-9]\.[1-9] - a digit from 1 to 9, then a ., and then again a digit from 1 to 9
(.*?) - Capturing group 1: any 0+ chars other than line break chars, as few as possible
(?=\s+MR\*XDX|$) - .*? will stop matching before the first occurrence of
\s+MR\*XDX - 1+ whitespace and then MR*XDX substring
| - or
$ - end of string.
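The first expression can be tried out with a minimal Python sketch such as the one below. The function name extract_entries is hypothetical, and the sample is an excerpt of the data from the question. The sketch returns the full match, prefix included.

```python
import re

# Expression #1 from above: stop before " MR*XDX" or at the end of the string.
ENTRY_RE = re.compile(r"(?<!\S)[1-9]\.[1-9](.*?)(?=\s+MR\*XDX|$)")


def extract_entries(text):
    """Return every 'N.N...' entry, without the trailing MR*XDX marker."""
    return [m.group(0) for m in ENTRY_RE.finditer(text)]


sample = ("1.1CDDAX/SXEVEN MR*XDX 2.1CDDAX/JEROME MR*XDX\n"
          "3.1CDDAX/SIXM MR*XDX 4.1CDDAX AMX/SIXM MR*XDX\n"
          "1 OXP EY 31SED W PK3 MEL/REDOOK DEOPRE 31SED21 XO XRXVEL DEF")
print(extract_entries(sample))
# ['1.1CDDAX/SXEVEN', '2.1CDDAX/JEROME', '3.1CDDAX/SIXM', '4.1CDDAX AMX/SIXM']
```

Use m.group(1) instead of m.group(0) if only the part after the numeric prefix is wanted.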
I want to delete everything except IPs.
For example
1 138.68.161.60:1080 SOCKS5 HIA United States (New York NY) 138.68.161.60 (DigitalOcean, LLC) 0.143 75% (3) - 12-jan-2018 14:37 (10 minutes ago)
2 174.64.234.29:17501 SOCKS5 HIA United States wsip-174-64-234-29.sd.sd.cox.net (Cox Communications Inc.) 0.956
100% (5) - 12-jan-2018 14:36 (10 minutes ago)
3 45.79.219.154:63189 SOCKS5 HIA United States (Atlanta GA) li1318-154.members.linode.com (Linode, LLC) 6.973
90% (103) - 12-jan-2018 14:36 (11 minutes ago)
to
138.68.161.60:1080
174.64.234.29:17501
45.79.219.154:63189
I need a regex to do this conversion.
In Notepad++, it requires some finesse to delete text not containing matched strings, but you can choose Find, Mark, then check the Regular expression box and use the regex:
([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}+) and Mark All to bookmark all rows containing IP addresses.
Then select Find, Replace, enter ^[0-9]\W in Find what:, and Replace All with nothing.
Then select Find, Replace, enter \w+S.+ in Find what:, and Replace All with nothing.
Then, go to Search, Bookmark, Remove Unmarked Lines.
Et voilà!
You could use this regex in Notepad++ and replace each match with group 1, \1:
(?s)\d+ (\d+\.\d+\.\d+\.\d+:\d+).*?\(\d+ minutes ago\)
The regex matches all the text for each of the 3 blocks from your example and uses a capturing group for the text that you want to keep. In the replacement you then use only the captured group, which holds your data.
Explanation
Inline modifier to make the dot match a line break (?s)
The leading row number and the space after it, kept outside the group so that they are dropped \d+
Group 1 with the pattern that you want to capture (\d+\.\d+\.\d+\.\d+:\d+)
Match any character zero or more times non greedy .*?
The pattern that is at the end of every part \(\d+ minutes ago\)
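Outside Notepad++, the same result can be reached by collecting the matches instead of deleting everything else. This is a minimal Python sketch of that different technique, not the replacement above. The function name extract_proxies is hypothetical, and the pattern only accepts addresses followed by a port.

```python
import re

# Four dot-separated groups of 1-3 digits, then a colon and a port number.
IP_PORT_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}:\d+\b")


def extract_proxies(text):
    """Return every ip:port pair found in text, in order of appearance."""
    return IP_PORT_RE.findall(text)


sample = (
    "1 138.68.161.60:1080 SOCKS5 HIA United States (New York NY) "
    "138.68.161.60 (DigitalOcean, LLC) 0.143 75% (3) - 12-jan-2018 14:37 (10 minutes ago)\n"
    "2 174.64.234.29:17501 SOCKS5 HIA United States "
    "wsip-174-64-234-29.sd.sd.cox.net (Cox Communications Inc.) 0.956\n"
    "100% (5) - 12-jan-2018 14:36 (10 minutes ago)\n"
    "3 45.79.219.154:63189 SOCKS5 HIA United States (Atlanta GA) "
    "li1318-154.members.linode.com (Linode, LLC) 6.973\n"
    "90% (103) - 12-jan-2018 14:36 (11 minutes ago)"
)
print("\n".join(extract_proxies(sample)))
```

The bare address 138.68.161.60 in the first row is skipped because it has no port.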
The general problem
I am trying to understand how to prevent some pattern from appearing before or after a sought-out pattern when writing regexes.
A more specific example
I'm looking for a regex that will match dates in the format YYMMDD ((([0-9]{2})(0[1-9]|1[0-2])(0[1-9]|[1-2][0-9]|3[0-1]))) inside a long string while ignoring longer numeric sequences
it should be able to match:
text151124moretext
123text151124moretext
text151124
text151124moretext1944
151124
but should ignore:
text15112412moretext
(reason: it has 8 numbers instead of 6)
151324
(reason: it is not a valid date YYMMDD - there is no 13th month)
How can I make sure that a number with more than these 6 digits won't be picked up as a date, using one single regex (meaning that I would rather avoid preprocessing the string)?
I've thought of \D((19|20)([0-9]{2})(0[1-9]|1[0-2])(0[1-9]|[1-2][0-9]|3[0-1]))\D, but doesn't this mean that there has to be some character before and after?
I'm using bash 3.2 (ERE)
thanks!
Try:
#!/usr/bin/env bash
extract_date() {
  local string="$1"
  local _date
  # Pull out a run of exactly 6 digits that has a non-digit on each side.
  # If there is no such run, sed prints the input unchanged, and the date
  # check below rejects it unless the input is itself a bare date.
  _date=$(echo "$string" | sed -E 's/.*[^0-9]([0-9]{6})[^0-9].*/\1/')
  #date -d "$_date" &> /dev/null # for Linux
  date -jf '%y%m%d' "$_date" &> /dev/null # for MacOS
  if [ $? -eq 0 ]; then
    echo "$_date"
  else
    return 1
  fi
}
extract_date text15111224moretext # ignore n_digits > 6
extract_date text151125moretext # take
extract_date text151132 # ignore day 32
extract_date text151324moretext1944 # ignore month 13
extract_date text150931moretext1944 # ignore 31 Sept
extract_date 151126 # take
Output:
151125
151126
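The same two-step idea (isolate a 6-digit run, then let a date parser decide whether it is a real calendar date) can be sketched in Python. This is an illustration under assumptions, not a port of the script above: the function name extract_date is reused for readability, and datetime.strptime stands in for the date command.

```python
import re
from datetime import datetime


def extract_date(text):
    """Return the first digit run that is exactly 6 digits long and a real
    YYMMDD calendar date, or None if there is no such run."""
    for run in re.findall(r"\d+", text):   # every maximal run of digits
        if len(run) != 6:
            continue                       # too short or too long
        try:
            datetime.strptime(run, "%y%m%d")
        except ValueError:
            continue                       # e.g. month 13, day 32, 31 Sept
        return run
    return None


for s in ("text15111224moretext", "text151125moretext", "text151132",
          "text151324moretext1944", "text150931moretext1944", "151126"):
    print(s, "->", extract_date(s))
```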
If your tokens are line-separated (i.e. there is only one token per line):
^[\D]*[\d]{6}([\D]*|[\D]+[\d]{1,6})$
Basically, this regex looks for:
Any number of non-digits at the beginning of the string;
Exactly 6 digits
Any number of non-digits until the end OR at least one non-digit and at least one digit (up to 6) to the end of the string
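This regex handles most of your sample inputs correctly, but not all: it does not match 123text151124moretext (because of the leading digits), and it accepts 151324 because it does not check the month or day ranges.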
You could use non-capturing groups to define non-digits either side of your date Regex. I had success with this expression and your same test data.
(?:\D)([0-9]{2})(0[1-9]|1[0-2])(0[1-9]|[1-2][0-9]|3[0-1])(?:\D)
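Since \D has to consume a real character, a date at the very start or end of the string has nothing for it to match. Where the regex engine supports lookarounds, a negative lookbehind and lookahead express "no digit on either side" without consuming anything. The sketch below uses Python's re module for illustration; the function name find_date is hypothetical. POSIX ERE, as used by bash, has no lookarounds, so there the usual substitute is (^|[^0-9]) before the date and ([^0-9]|$) after it.

```python
import re

DATE_RE = re.compile(
    r"(?<!\d)"                 # no digit immediately before
    r"(\d{2})"                 # YY
    r"(0[1-9]|1[0-2])"         # MM
    r"(0[1-9]|[12]\d|3[01])"   # DD
    r"(?!\d)"                  # no digit immediately after
)


def find_date(text):
    """Return the first standalone YYMMDD run in text, or None."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


print(find_date("text151124moretext1944"))  # 151124
print(find_date("text15112412moretext"))    # None
```

This checks the month and day ranges only; it does not reject dates such as 31 September.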
I have some files and must find identical lines starting with "abc" that have exactly one line between them.
lorem
abcdefg
lorem
abcdefg
lorem
lorem
abcdefg
abcdefg
lorem
lorem
In this sample, lines 2 and 4 should match, but not lines 4 and 7, and not lines 7 and 8. Is it possible?
Since you don't say which language you are using, I would do something like:
abc([^\n]+)\n[^\n]*\nabc(\1)
which checks for:
The letters abc.
A captured group without newlines.
The newline character.
A complete line (any characters except a newline).
The newline character.
The letters abc again.
The previously matched first group content.
Check if it's available for your language:
http://www.regular-expressions.info/refext.html (for instance in .NET this is not valid).
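As one concrete case, Python's re module supports backreferences, so the idea can be sketched there. This is a variation on the expression above, not the same pattern: it anchors both lines with ^ and $ so that only whole identical lines count, and it puts the second half in a lookahead so that overlapping pairs are also found. The function name pair_start_lines is hypothetical.

```python
import re

# A line starting with "abc", followed (without being consumed) by any
# single line and then an identical copy of the first line.
PAIR_RE = re.compile(r"^(abc.*)$(?=\n.*\n\1$)", re.MULTILINE)


def pair_start_lines(text):
    """Return the 1-based line number of the first line of each pair."""
    return [text.count("\n", 0, m.start()) + 1 for m in PAIR_RE.finditer(text)]


sample = "\n".join([
    "lorem", "abcdefg", "lorem", "abcdefg", "lorem",
    "lorem", "abcdefg", "abcdefg", "lorem", "lorem",
])
print(pair_start_lines(sample))  # [2]
```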