I have a long list of strings formatted as follows:
"astringsmaller alongstringthatisgreaterthan32characterstomatch"
"alongstringthatisgreaterthan32characterstomatch astringsmaller"
"astring string alongstringthatisgreaterthan32 598931"
Using regex, how would I match a string that contains a part longer than 31 characters, using the space as a delimiter?
For example, the string below would be "seen" as two parts with lengths 14 and 47, and would thus be a valid match.
"astringsmaller alongstringthatisgreaterthan32characterstomatch"
Unfortunately, the delimiters/spaces are not consistent in position or number. I also have a number of other special characters that should be treated as "delimiters":
("!")
("#")
('"')
("#")
("$")
("&")
("'")
("(")
(")")
("*")
("+")
(",")
(".")
("/")
(":")
(";")
("<")
("=")
(">")
("?")
("^")
("`")
("{")
("|")
("}")
("~")
(" ")
(" ")
("“")
("”")
("’")
("%")
Thanks in advance!
In that case, what you want to do is match runs of 31 or more characters that are not delimiters (use {32,} if the run must be strictly longer than 31):
[^!#"#$&'()*+,./:;<=>?^`{|}~ “”’%]{31,}
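As a rough sketch of how the negated character class behaves, here is the same idea in Python's re module. The names DELIMITERS, LONG_RUN and long_runs are made up for this illustration, and re.escape is used so that none of the delimiters is misread inside the class.

```python
import re

# Delimiters taken from the question's list (the duplicate "#" is dropped).
DELIMITERS = "!#\"$&'()*+,./:;<=>?^`{|}~ \u201c\u201d\u2019%"

# A run of 31 or more characters, none of which is a delimiter.
LONG_RUN = re.compile("[^" + re.escape(DELIMITERS) + "]{31,}")

def long_runs(text):
    """Return every run of 31+ non-delimiter characters found in text."""
    return LONG_RUN.findall(text)

samples = [
    "astringsmaller alongstringthatisgreaterthan32characterstomatch",
    "astring string alongstringthatisgreaterthan32 598931",
]
for s in samples:
    print(long_runs(s))
```

The first sample yields its 47-character part; the second yields nothing, because its longest part is only 30 characters.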
Also, instead of using a delimiter "blacklist", you could perhaps match only valid word characters (but that depends on your exact use case):
\w{31,}
(\w is the same as [a-zA-Z0-9_])
As far as I understand, you want to validate lines that contain a string longer than 31 characters (not containing a separator). I would suggest using a lookahead like this:
^(?=.*[^!#"#$&'()*+,.\/:;<=>?^`{|}~ “”’%\r\n]{31,}).+$
Then, you can further process those lines as needed.
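A minimal sketch of that line filter in Python, assuming the input arrives as one multi-line string. DELIMS, LINE_OK and valid_lines are illustrative names, not part of the answer above.

```python
import re

DELIMS = re.escape("!#\"$&'()*+,./:;<=>?^`{|}~ \u201c\u201d\u2019%")

# The lookahead only asserts that a 31+ run exists somewhere in the line;
# the trailing .+ then consumes the whole line as the match.
LINE_OK = re.compile(r"^(?=.*[^" + DELIMS + r"\r\n]{31,}).+$")

def valid_lines(text):
    """Keep only the lines that contain a long non-delimiter run."""
    return [line for line in text.splitlines() if LINE_OK.match(line)]

text = "\n".join([
    "astringsmaller alongstringthatisgreaterthan32characterstomatch",
    "alongstringthatisgreaterthan32characterstomatch astringsmaller",
    "astring string alongstringthatisgreaterthan32 598931",
])
print(valid_lines(text))
```

Only the first two lines survive the filter, whichever side the long part is on.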
Trying to come up with a regex, or combination of regexes, that returns false if a) the user has entered only BLANK(s), or b) they entered "non-legal" characters. Lastly, the number of characters has a set limit.
The closest I have so far is below. It fails because leading spaces are not counted; only the non-BLANKs count toward the limit. Using JS.
const reg = /^([ ]*[!-~\u2018-\u201d\u2013\u2014]){1,10}$/;
EDIT: I think the above is incorrect, and I meant to post this:
const re4 = /^(?!\s*$)[!-~\u2018-\u201d\u2013\u2014]{1,10}$/;
EDIT 2: this has less clutter; allow space and all other 'standard' keyboard chars:
const re5 = /^(?!\s*$)[ -~]{1,10}$/;
So, this says you can enter a bunch of spaces, and must include at least 1 other character from the list following; but the {1,10} only counts the non-spaces and so I can end up with too many in total.
EDIT:
So, using re5 above --
s = ' '; // should fail
s = ' blah blah'; // should pass
s = '  blah blah'; // should fail, as there are 11 characters
Try ^(?:\s*\S){1,10}\s*$
This allows 1-10 non-whitespace characters; change \S to a class of the allowed characters.
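To see what this pattern does and does not count, here is a small Python check. It is only an illustration, and the helper name matches is made up. Note that whitespace is never counted toward the limit of ten.

```python
import re

# Up to ten non-whitespace characters, each optionally preceded by
# whitespace, followed by optional trailing whitespace.
PATTERN = re.compile(r"^(?:\s*\S){1,10}\s*$")

def matches(s):
    return PATTERN.match(s) is not None

print(matches(" "))             # whitespace only: nothing for \S to match
print(matches(" blah blah"))    # 8 non-whitespace characters
print(matches("abcdefghijk"))   # 11 non-whitespace characters
print(matches("   blah blah"))  # 12 characters in total, but only 8 are counted
```

The last case shows why this pattern alone does not cap the total length: leading and inner spaces pass through uncounted.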
Update 2: After learning that you cannot invert the match result in code, here's one last suggestion using negative lookahead (like you already tried yourself).
This regex matches only strings of 1-10 non-banned characters that are not all whitespace:
const re4 = /^(?!\s+$)[^\!-\~\u2018-\u201d\u2013\u2014]{1,10}$/
Update 1: Use this regex to match all-whitespace string OR strings longer than 10 chars OR strings containing bad characters:
const re4 = /(^\s+$|^.{11,}$|[\!-\~\u2018-\u201d\u2013\u2014])/
I understand that you want to impose a length restriction via regex. I would suggest against that and recommend using str.length instead.
This regex will match whitespace-only strings and strings containing one or more bad characters:
const re4 = /(^\s+$|[\!-\~\u2018-\u201d\u2013\u2014])/;
Regarding prohibition of all-whitespace strings: Instead of packing it into a regex, you might consider using something more explicit like if (s.trim().length == 0). IMO this makes your intention clearer and your code probably more readable, leaving you with this easy-to-read regex:
// matches any string containing a *bad* character
const re4 = /[\!-\~\u2018-\u201d\u2013\u2014]/;
If you use trim for the all-whitespace check, you might convert your regex into a positive assertion, even with length restriction:
// matches any string consisting of 1-10 characters not considered *bad*
const re4 = /^[^\!-\~\u2018-\u201d\u2013\u2014]{1,10}$/;
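The split into an explicit whitespace check, a length check and a small regex might look like this in Python. This is a sketch only: BAD_CHAR is a hypothetical placeholder class, and the real set of disallowed characters would have to be substituted.

```python
import re

# Hypothetical "bad" characters, chosen purely for illustration.
BAD_CHAR = re.compile(r"[<>{}]")
MAX_LEN = 10

def is_valid(s):
    if s.strip() == "":      # explicit empty / all-whitespace check
        return False
    if len(s) > MAX_LEN:     # length is checked outside the regex
        return False
    return BAD_CHAR.search(s) is None  # reject on the first bad character

for candidate in [" ", " blah blah", "  blah blah", "a<b"]:
    print(repr(candidate), is_valid(candidate))
```

Each rule can then be reported to the user separately, which a single combined regex cannot do.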
To match the input when it’s from 1 to 10 chars long and can't be all blanks, use a negative look ahead to assert not all blanks:
^(?! *$).{1,10}$
If you want to restrict allowable chars, change the dot to a suitable character class of allowable chars.
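For instance, with the dot swapped for a class of allowed characters and an end anchor so that longer inputs are rejected, a Python sketch could look like the following. The choice of printable ASCII from space to tilde is an assumption made for the example.

```python
import re

# Assumed allowed set: printable ASCII, space through tilde.
ALLOWED = r"[ -~]"

# The negative lookahead rejects empty and all-blank input;
# {1,10} then counts every character, spaces included.
PATTERN = re.compile(r"^(?! *$)" + ALLOWED + r"{1,10}$")

def is_valid(s):
    return PATTERN.match(s) is not None

print(is_valid(" "))            # all blanks
print(is_valid(" blah blah"))   # 10 characters including the spaces
print(is_valid("  blah blah"))  # 11 characters
```

Because the lookahead consumes nothing, the spaces are still counted by the quantifier that follows it.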
I am trying to extract the words after the first space using
species <- gsub(".* ([A-Za-z]+)", "\\1", x=genus)
This works fine for the other rows, which have two words. However, row [9] "Eulamprus tympanum marnieae" has 3 words, and my code returns only the last word in the string, "marnieae". How can I extract the words after the first space so that I retrieve "tympanum marnieae" instead of "marnieae", with the answers stored in one variable called species?
genus
[9] "Eulamprus tympanum marnieae"
Your original pattern didn't work because the subpattern [A-Za-z]+ doesn't match spaces, and therefore will only match a single word.
You can use the following pattern to match one or more words after the first, within double quotes:
"[A-Za-z]+ ([A-Za-z ]+)"
https://regex101.com/r/p6ET3I/1
https://regex101.com/r/p6ET3I/2
This is a relatively simple, but imperfect, solution. It will also match trailing spaces, or one or more spaces after the first word even if a second word doesn't exist. "Eulamprus" followed by six spaces, for example, will match the pattern and return 5 spaces, because the first space is consumed outside the capture group. You should only use this pattern if you trust your data to be properly formatted.
A more reliable approach would be the following:
"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)"
https://regex101.com/r/p6ET3I/3
This pattern will capture one word (following the first), followed by any number of additional words (including 0), separated by spaces.
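The question uses R, but the capture-group behaviour can be checked quickly in Python. This is a sketch without the surrounding double quotes; species_of is a made-up helper name.

```python
import re

# First word, one space, then one or more further words
# separated by single spaces (captured as group 1).
PATTERN = re.compile(r"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)")

def species_of(name):
    """Return everything after the first word, or None if the format is off."""
    m = PATTERN.fullmatch(name)
    return m.group(1) if m else None

print(species_of("Eulamprus tympanum marnieae"))
print(species_of("Eulamprus tympanum"))
print(species_of("Eulamprus "))
```

Unlike the simpler pattern, a name followed only by spaces yields no match here.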
However, from what I remember from biology class, species are only ever comprised of one or two names, and never capitalized. The following pattern will reflect this format:
"[A-Za-z]+ ([a-z]+(?: [a-z]+)?)"
https://regex101.com/r/p6ET3I/4
I am searching strings matching the following one in my source code:
<CONSTANT_STRING_1> <CONSTANT_STRING_2> <VARIABLE_DIGITS> <CONSTANT_STRING_3>
where
<CONSTANT_STRING_1>, <CONSTANT_STRING_2> and <CONSTANT_STRING_3> are constant strings like "ABC", "DEF" and "GHI".
<VARIABLE_DIGITS> is a random number of 14 digits like "12345678901234"
Note: there are white spaces between words.
What I am looking for is to search <CONSTANT_STRING_1> <CONSTANT_STRING_2> <WHATEVER> <CONSTANT_STRING_3>. How can I build the Regex?
I take it that by "constant string" you mean strings of letters? If so, the regex below should find the full string you are looking for. By the way, the website linked below is really great for visualizing this type of problem... give it a try :)
(([a-zA-Z]+\s){2})[0-9]{14}\s([a-zA-Z]+)$
Debuggex Demo
To break it down...
(([a-zA-Z]+\s){2}) means a string of one or more characters comprised of either LC or UC letters followed by a space and that whole thing (chars + space) repeated twice
[0-9]{14}\s 14 digits followed by a space. As @Avinash said, \d{14}\s is another way of writing this portion
([a-zA-Z]+)$ Another string of one or more characters. The $ indicates that this ends the string you are searching for
You could try the below regex.
<CONSTANT_STRING_1> <CONSTANT_STRING_2> \d{14} <CONSTANT_STRING_3>
Where, \d{14} matches exactly the 14 digit number.
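A small Python sketch of assembling that pattern from the known constants. "ABC", "DEF" and "GHI" are the question's example values, build_pattern is a made-up helper, and re.escape guards against constants that happen to contain regex metacharacters.

```python
import re

def build_pattern(first, second, third):
    """Constants separated by single spaces, with a 14-digit number in third position."""
    parts = [re.escape(first), re.escape(second), r"\d{14}", re.escape(third)]
    return re.compile(" ".join(parts))

pattern = build_pattern("ABC", "DEF", "GHI")

print(bool(pattern.search("x = ABC DEF 12345678901234 GHI;")))
print(bool(pattern.search("ABC DEF 1234 GHI")))
```

To accept anything in the variable position rather than exactly 14 digits, the \d{14} part could be replaced with something looser such as \S+.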
I want to split a text into its single words using regular expressions. The obvious solution would be to use the regex \\b; unfortunately, this one also splits words on the hyphen.
So I am looking for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String[] b = s.split("\\b+");
for (int i = 0; i < b.length; i++) {
    System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but not a 100% fit; it gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
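A quick way to check this is to split the question's sentence on that pattern. The question uses Java, so this Python version is only a sketch of the same idea. Note that splitting this way drops the punctuation instead of returning it as separate tokens.

```python
import re

s = ("This is my text! It uses some odd words like user-generated "
     "and need therefore a special regex.")

# Split on runs of characters that are neither word characters nor hyphens,
# then drop the empty string left behind by the final full stop.
words = [w for w in re.split(r"[^\w\-]+", s) if w]
print(words)
```

The hyphenated word comes back as a single token, "user-generated".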
For anyone who needs this for another purpose (e.g. inserting something), the following is closer to an equivalent of the \b-based solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
You can use this:
(?<!-)\\b(?!-)
I apologize for the terrible title...it can be hard to try to summarize an entire situation into a single sentence.
Let me start by saying that I'm asking because I'm just not a Regex expert. I've used it a bit here and there, but I just come up short with the correct way to meet the following requirements.
The Regex that I'm attempting to write is for use in an XML schema for input validation, and used elsewhere in Javascript for the same purpose.
There are two different possible formats that are supported. There is a literal string, which must be surrounded by quotation marks, and a Hex-value string which must be surrounded by braces.
Some test cases:
"this is a literal string" <-- Valid string, enclosed properly in "s
"this should " still be correct" <-- Valid string, "s are allowed within (if possible, this requirement could be forgiven if necessary)
"{00 11 22}" <-- Valid string, {}'s allow in strings. Another one that can be forgiven if necessary
I am bad output <-- Invalid string, no "s
"Some more problemss"you know <-- Invalid string, must be fully contained in "s
{0A 68 4F 89 AC D2} <-- Valid string, hex characters enclosed in {}s
{DDFF1234} <-- Valid string, spaces are ignored for Hex strings
DEADBEEF <-- Invalid string, must be contained in either "s or {}s
{0A 12 ZZ} <-- Invalid string, 'Z' is not a valid Hex character
To satisfy these general requirements, I came up with the following regex, which seems to work well enough. I'm still fairly new to regex, so there could be a huge hole here that I'm missing:
".+"|\{([0-9]|[a-f]|[A-F]| )+\}
If I recall correctly, the XML Schema regex automatically assumes beginning and end of line (^ and $ respectively). So, essentially, this regex accepts any string that starts and ends with a ", or that starts with { and ends with } and contains only valid hexadecimal characters. This has worked well for me so far, except that I had forgotten about another (less common) input option that completely breaks my regex.
Where I made my mistake:
Valid input should also allow a user to separate valid strings (of either type, literal/hex) by a comma. This means that a single string should be able to contain more than one of the above valid strings, separated by commas. Luckily, however, a comma is not a supported character within a literal string (although I see that my existing regex does not care about commas).
Example test cases:
"some string",{0A F1} <-- Valid
{1122},{face},"peanut butter" <-- Valid
{0D 0A FF FE},"string",{FF FFAC19 85} <-- Valid (Spaces don't matter in Hex values)
"Validation is allowed to break, if a comma is found not separating values",{0d 0a} <-- Invalid, comma is a delimiter, but "Validation is allowed to break" and "if a comma..." are not marked as separate strings with "s
hi mom,"hello" <-- Invalid, String1 was not enclosed properly in "s or {}s
My thought is that it should be possible to use commas as a delimiter and check each "section" of the string against a regex similar to the original, but I am just not advanced enough in regex yet to come up with a solution on my own. Any help would be appreciated, but ultimately a final solution with an explanation would be just stellar.
Thanks for reading this huge wall of text!
According to http://www.regular-expressions.info/xml.html the regex language to be used in XSD is less expressive than the one used in Java, but expressive enough for your task.
Now for a construction, take your own regex. Replace the dot with a negated character class [^,] to match everything except the comma, and (for increased clarity) merge the hexadecimal character classes into one. You get the following regex:
"[^,]+"|\{[0-9a-fA-F ]+\}
If we name this regex <S> (for "single string"), the additional feature is validated by the regex matching any number of <S>,, followed by a single <S>:
(<S>,)*<S>
Expanded, this yields the desired regex:
(("[^,]+"|\{[0-9a-fA-F ]+\}),)*("[^,]+"|\{[0-9a-fA-F ]+\})
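The construction can be checked against the question's test cases with a short Python sketch. re.fullmatch stands in for the implicit anchoring of XML Schema patterns, and SINGLE, VALUE_LIST and is_valid are made-up names.

```python
import re

# One value: a quoted literal without commas, or a braced hex string.
SINGLE = r'("[^,]+"|\{[0-9a-fA-F ]+\})'

# Any number of "value," groups followed by one final value.
VALUE_LIST = re.compile("(" + SINGLE + ",)*" + SINGLE)

def is_valid(text):
    # fullmatch mimics XSD, where a pattern must cover the whole value.
    return VALUE_LIST.fullmatch(text) is not None

cases = [
    '"some string",{0A F1}',
    '{1122},{face},"peanut butter"',
    '{0D 0A FF FE},"string",{FF FFAC19 85}',
    '"Validation is allowed to break, if a comma is found not separating values",{0d 0a}',
    'hi mom,"hello"',
    "{0A 12 ZZ}",
]
for case in cases:
    print(is_valid(case), case)
```

The first three cases are accepted and the last three rejected, in line with the Valid and Invalid labels in the question.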
Maybe something along the lines of
(?:(?:"[^,]+?"|\{(?:[0-9]|[a-f]|[A-F]| )+?\}),)*(?:(?:"[^,]+?"|\{(?:[0-9]|[a-f]|[A-F]| )+?\}))