How to regex something after first whitespace(s) - regex

I have a string like TEST123 DATA, so these are two words separated by whitespace. How can I write a regex that matches the right-hand part after the whitespace(s) to get DATA? I am new to this and hope someone can tell me how to do it. Any characters at the beginning should be skipped, including the first whitespace. I need everything after the first whitespace(s). These are example strings:
TEST_1 DATA
TEST DATA
123 DATA
and the result should always be "DATA".
Thanks

^\S*\s+(\S+)
matches from the beginning of the string through the first word after the first whitespace(s). Group 1 will then contain the string DATA (in your example).
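As a minimal sketch of the capture-group approach, assuming Python's re module (the answer itself does not name a language), group 1 could be read like this. The helper name after_first_whitespace is made up for illustration.

```python
import re

# Hypothetical helper, not part of the original answer.
PATTERN = re.compile(r"^\S*\s+(\S+)")

def after_first_whitespace(text):
    """Return the word following the first run of whitespace, or None."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

samples = ["TEST_1 DATA", "TEST   DATA", "123 DATA"]
results = [after_first_whitespace(s) for s in samples]
print(results)
```

A string without any whitespace yields None here, since \s+ requires at least one whitespace character.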

If you only want to match DATA, and you have access to a Perl-compatible regex engine, you can use
^\S*\s+\K\S+
The \K token tells the regex engine to ignore all the text that has been matched so far.
See it live on regex101.com.
With a .NET regex engine, you can use a positive lookbehind assertion:
(?<=^\S*\s+)\S+
See it live on regexhero.net.

Starting from the end of the string, matching everything that isn't a whitespace character:
[^\s]*$
With the global and multiline flags, this matches all 3 DATA occurrences in the sample.
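A sketch of the same idea, assuming Python's re module: re.findall stands in for the global flag, and re.MULTILINE lets $ match at the end of every line. It uses \S+ instead of [^\s]* to avoid the empty matches that * can produce.

```python
import re

text = "TEST_1 DATA\nTEST DATA\n123 DATA"

# \S+$ : the last run of non-whitespace characters on each line
last_words = re.findall(r"\S+$", text, flags=re.MULTILINE)
print(last_words)
```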

You can also try the following regex: (Python)
>>> import re
>>> s = "TEST_1 DATA"
>>> result = re.sub(r".*?(\w+)\s*$", r"\1", s)
>>> result
'DATA'

Related

Regex match last substring among same substrings in the string

For example we have a string:
asd/asd/asd/asd/1#s_
I need to match this part: /asd/1#s_ or asd/1#s_
How is it possible to do with plain regex?
I've tried a negative lookahead like this, but it didn't work:
\/(?:.(?!\/))?(asd)(\/(([\W\d\w]){1,})|)$
it matches this '/asd/asd/asd/asd/asd/asd/1#s_'
from this 'prefix/asd/asd/asd/asd/asd/asd/1#s_'
and I need to match '/asd/1#s_' without all preceding /asd/'s
Match should work with plain regex
Without any helper functions of any programming language
https://regexr.com/
I use this site to check if regex matches or not
here's the possible strings:
prefix/asd/asd/asd/1#s
prefix/asd/asd/asd/1s#
prefix/asd/asd/asd/s1#
prefix/asd/asd/asd/s#1
prefix/asd/asd/asd/#1s
prefix/asd/asd/asd/#s1
and asd part could be replaced with any word like
prefix/a1sd/a1sd/a1sd/1#s
prefix/a1sd/a1sd/a1sd/1s#
...
So I need to match last repeating part with everything to the right
And everything to the right could be word characters, non-word characters, or digits, in any order
A more complicated string example:
prefix/a1sd/a1sd/a1sd/1s#/ds/dsse/a1sd/22$$#!/123/321/asd
this should match that part:
/a1sd/22$$#!/123/321/asd
Try this one. This works in python.
import re
reg = re.compile(r"\/[a-z]{1,}\/\d+[#a-z_]{1,}")
s = "asd/asd/asd/asd/1#s_"
print(reg.findall(s))
# ['/asd/1#s_']
Update:
Since the question lacks clarity, this only works with the given order and hence, I suppose any other combination simply fails.
Edits:
New Regex
reg = r"\/\w+(\/\w*\d+\W*)*(\/\d+\w*\W*)*(\/\d+\W*\w*)*(\/\w*\W*\d+)*(\/\W*\d+\w*)*(\/\W*\w*\d+)*$"

How to fix the regex for this String "#*abc" I want to match this exact string where abc can be any words containing spaces too

I want to read multiple strings between specific characters from a file using regex. I have tried the following code but could not get expected results.
My input file contains data in this format:
#*OQL[C++]: Extending C++ with an Object Query Capability
##José A. Blakeley
#t1995
#cModern Database Systems
#index0
#*Transaction Management in Multidatabase Systems
##Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#t1995
#cModern Database Systems
#index1
Expected output:
OQL[C++]: Extending C++ with an Object Query Capability
Transaction Management in Multidatabase Systems
What I tried
[^#*][a-z]\w+[\n$]
It does not match the spaces in the strings.
If you want to match # and * at the start of the line and get what follows, you could use a capturing group. Note that the leading characters must be placed outside the character class, and the * must be escaped as \*.
To match the space you could use a repeating pattern starting with a space. To match all the words in your example, you could use a character class to allow which characters to match.
^#\*([a-zA-Z][+:a-zA-Z\]\[]+(?: [+:a-zA-Z\]\[]+)*)
Regex demo
Or as an alternative use a positive lookbehind:
(?<=^#\*)[a-zA-Z][+:a-zA-Z\]\[]+(?: [+:a-zA-Z\]\[]+)*
Regex demo
To match either of the chars you could use a character class
^#[*#c]([a-zA-Z][+:a-zA-Z\]\[]+(?: [+:a-zA-Z\]\[]+)*)
Regex demo
Try this regex. It will match the text just after #*, #c, or ##:
#[\*c#]\K[\S].*$
Here Is Demo
Here's the regex you are looking for :
^#\*(.*)$
You can test it here
Explanation:
^ // start at the beginning of the line
#\* // match the literal '#*'
(.*) // match any character that follows
$ // until the end of the line
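For illustration, a short sketch of applying this pattern to the sample data, assuming Python's re module. re.MULTILINE is needed so that ^ and $ match at every line rather than only at the ends of the whole text.

```python
import re

sample = """#*OQL[C++]: Extending C++ with an Object Query Capability
##José A. Blakeley
#t1995
#cModern Database Systems
#index0
#*Transaction Management in Multidatabase Systems
##Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#t1995
#cModern Database Systems
#index1"""

# findall returns the contents of group 1 for every line starting with "#*"
titles = re.findall(r"^#\*(.*)$", sample, flags=re.MULTILINE)
for title in titles:
    print(title)
```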

regex to select only the zipcode

,Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,
I want to select only 13732 from the line. I came up with this regex:
(\d)(\s*\d+)*(\,y,,)
But it also selects the ,y,, part. If I remove that part from the regex, the regex also matches the date. Please help me with this.
Generally, if you want to require some text without including it in the match, use a zero-length lookaround (lookahead or lookbehind). In your case, you can use a lookahead:
(\d)(\s*\d+)*(?=\,y,,)
The syntax (?=<stuff>) means "followed by <stuff>, without matching it".
More information on lookarounds can be found in this tutorial.
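A minimal sketch of the lookahead version, assuming Python's re module; the backslash before the comma is dropped because a comma needs no escaping. The whole match (group 0) is then just the digits, because the lookahead does not consume ,y,,.

```python
import re

line = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"

# (?=,y,,) requires ",y,," to follow but leaves it out of the match
m = re.search(r"(\d)(\s*\d+)*(?=,y,,)", line)
zipcode = m.group(0) if m else None
print(zipcode)
```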
Regex: \D*(\d{5})\D*
Explanation: match 5 digits surrounded by zero or more non-digits on both sides. Then you can extract group containing the match.
Here's the code in Python:
import re
string = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"
# Raw string so the backslashes reach the regex engine unchanged
search = re.search(r"\D*(\d{5})\D*", string)
print(search.group(1))
Output:
13732

How to extract big mgrs using regex

I have an input json:
{"id":12345,"mgrs":"04QFJ1234567890","code":"12345","user":"db3e1a-3c88-4141-bed3-206a"}
I would like to extract with regular expression MGRS of 1000 kilometer, in my example result should be: 04QFJ1267
The first 2 symbols are always digits, the next 3 are always letters, and the rest are always digits. The MGRS always has a fixed length of 15 characters.
Is it possible?
Thanks.
All you really need to do is remove characters 8-10 and 13-15. If you want/need to do that using regex, then you could use the replace method with regex: (EDIT Edited to remove the rest of the string).
.*?(\w{7})\d{3}(\d{2})\d+.*
and replacement string:
$1$2
I see now you are using Java. So the relevant code line might look like:
resultString = subjectString.replaceAll(".*?(\\w{7})\\d{3}(\\d{2})\\d+.*", "$1$2");
The above assumes all your strings look like what you showed, and there is no need to test to be sure that "mgrs" is in the string.
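For comparison, here is a sketch of both ideas in Python rather than Java (an assumption made only for illustration): plain slicing of the mgrs field, and the same regex replacement applied to the whole JSON text.

```python
import json
import re

raw = '{"id":12345,"mgrs":"04QFJ1234567890","code":"12345","user":"db3e1a-3c88-4141-bed3-206a"}'

# Slicing: keep characters 1-7 and 11-12 (1-based), i.e. drop 8-10 and 13-15.
mgrs = json.loads(raw)["mgrs"]
by_slicing = mgrs[:7] + mgrs[10:12]

# Same pattern as the Java line; \1\2 is Python's spelling of $1$2.
by_regex = re.sub(r".*?(\w{7})\d{3}(\d{2})\d+.*", r"\1\2", raw)

print(by_slicing, by_regex)
```

Parsing the JSON first avoids relying on the mgrs value being the first run of seven or more word characters in the text.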

regular expression matching issue

I've got a string which has the following format
some_string = ",,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,"
and this is the content of a text file called f
I want to search for a specific term within the xxx (let's say that term is 'silicon')
note that the xxx can all be different and can contain any special characters (including meta characters) except for a new line
match = re.findall(r",{3}(.*?silicon.*?),{3}", f.read())
print(match)
But this doesn't seem to work because it returns results which are in the format:
["xxx,,,xxx,,,xxx,,,xxx,,,silicon", "xxx,,,xxx,,,xxx,,,xxsiliconxx"] but I only want it to return ["silicon", "xxsiliconxx"]
What am I doing wrong?
Try the following regex:
(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})
Example:
>>> s = ',,,xxx,,,silicon,,,xxx,,,xxsiliconxx,,,xxx'
>>> re.findall(r'(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})', s)
['silicon', 'xxsiliconxx']
I am assuming that the content in the xxx can contain commas, just not three consecutive commas or it would end the field. If the content in the xxx sections cannot contain any commas, you can use the following instead:
(?<=,{3})[^,\r\n]*?silicon.*?(?=,{3})
The reason your current approach doesn't work is that even though .*? will try to match as few characters as possible, the match will still start as early as possible. So for example the regex a*?b would match the entire string "aaaab". The only time the regex will advance the starting position is when the regex fails to match, and since ,,, can be matched by the .*?, your match will always start at the beginning of the string or just after the previous match.
The lookbehind and lookahead address the issue raised by JaredC in the comments: re.findall() won't return overlapping matches, so the leading and trailing ,,, must not be part of the match.
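The leftmost-match behavior described above can be checked with a short sketch, using the same sample string as the earlier example. The variable names are chosen here for illustration only.

```python
import re

# A lazy quantifier does not move the start of the match to the right:
lazy = re.search(r"a*?b", "aaaab").group(0)

s = ",,,xxx,,,silicon,,,xxx,,,xxsiliconxx,,,xxx"

# Original pattern: the match starts at the first ,,, and .*? runs on
# across the next ,,, until it reaches "silicon".
naive = re.findall(r",{3}(.*?silicon.*?),{3}", s)

# Lookaround version: the delimiters are required but not consumed,
# and the tempered dot cannot cross a ,,, delimiter.
fixed = re.findall(r"(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})", s)

print(lazy)
print(naive)
print(fixed)
```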