I have a string such as "option1;option2;option3" where the ";" separator might be anything. Any string of at least 1 character that the user puts.
I am looking for a simple/clean way to determine the separator without any information other than the input string.
I can guarantee the separator exists only between 2 elements but consider the possibility of only one option in the input string. I can also guarantee that the separator will only be non alphanumerical and may contain space and $ or # or % etc.
Couldn't create a regular expression for this, but perhaps someone will be able to, though I am not particularly looking for a regex expression.
To find the seperator
in = "option1;option2;option3"
separator=re.search("[ ;'#/.,<>?~#;,:}{\]\[+=\-_]+", in).group()
Sorry it was easier to use regexp for this
Now it's back to you. You need to prove that this works as you intend against all possible inputs
Here's a perhaps easier to use version
possible=""" ;'#/.,<>?~#,:}{][+=-_"""
seperator=re.search("[%s]+" % re.escape(possible), input).group()
This means that characters with special meaning in regexp can be added or taken away easier
This would work only if you knew for certain that only characters [A-Za-z0-9_] would appear inf fields:
^(\w+)\W(\w+)\W(\w+)$
This is probably not the case, so my solution would be to:
Create a list of all possible separators.
For each of these separators run a regex (dynamically constructed in a loop): ^([^X]+)X([^X]+)X([^X]+)$ where X is a separator character.
Check if number of matches equals expected number of columns (or go to 4. if you don't know the number of columns).
Run it for every line to see if number of matches changes, because a match in first line could be a blind luck.
If it matches everywhere, then you have your separator and the number of columns. If it doesn't match then start checking next separator for every line.
The downside of this solution is that in worst case you'd run your regex for every line of text and for every separator.
Possible optimizations would be to:
Start checking with most common separators first
Instead of running regex for every line for every separator, just count the number of separator characters in entire text. If the number of lines divides the number of separator characters without a remainder, then there's high probability that the separator is valid.
Related
I have gene sequences that can have actual string text in them I want to remove with regex. I would like to try to remove the errant text in a generic way with regex. I'd like to remove all characters up to 10 chars between any invalid characters. I am assuming that anything between invalid chars up to 10 chars apart is part of the invalid text.
example :
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
Valid sequence characters are ATCG. Can we create a regex to reduce the above string to
GATCATCGGCCCATGCATGCGGGGATCGCCCCTTTAAAAT?
I understand that the G at the beginning of this final sequence is the last character of the word BEGINNING, which is the "bad" text at the beginning of the string. I realize with regex, it is impossible to identify words, so I am willing to live this limitation. Same with the T at the end, which is the first letter of "THIS".
I've tried to do something with repeated capture groups that allow for a certain number of chars between bad characters, but I can't seem to make it work right. Maybe someone can help me...
This regex does not quite work to capture everything.
([^ACTG].{1,10}[^ACTG])+
Initial string:
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
String after replacing non-ACGT:
-A-T--TATT----G-----GATCATCGGCCCATGCAT-----A-T--T--T--------GCGGGGATCGCCCCTTTAAAAT---------T--TATT-------A-T-------
For this sample, a run of up to four ACGT characters can appear in the unwanted text. Examining other samples may give a sensible upper bound.
Perhaps "starts and ends with invalid character and contains no long runs of valid characters" is a better measure to use than "1 to 10 characters, starting and ending with invalid character"?
A regex for this is:
[^ACGT]((?![ACGT]{5,}).)*[^ACGT]
and matches:
BADTEXTATTHEBEGINNIN
MOREBADTEXTINTHEMIDDLE
HISISSOMETEXTATTHEENDIWANTREMOVED
I want to use positive lookaheads so that RegEx will pick up two words from two different sets in any order, but with a string between them of length 1 to 20 that is always in the middle.
It also is already case insensitive, allow for any number of characters including 0 before the first word found and the same after the second word found - I am unsure if it is more correct to terminate in $.
Without the any order matching I am so far as:
(?i:.*(new|launch|releas)+.{1,20}(product1|product2)+.*)
I have attempted to add any order matching with the following but it only picks up the first word:
(?i:.*(?=new|launch|releas)+.{1,20}(?=product1|product2)+.*)
I thought perhaps this was because of the +.{1,20} in the middle but I am unsure how it could work if I add this to both sets instead, as for instance this could cause a problem if the first word is the very first part of the source text it is parsing, and so no character before it.
I have seen example where \b is used for lookaheads but that also seems like it may cause a problem as I want it to match when the first word is at the start of the source text but also when it is not.
How should I edit my RegEx here please?
I have a word list in alphabetical order.
It is ranked as a column.
I do not use any programming languages.
The list in notepad format.
I need to match every similar words and take them on same line.
I use regex but I can't achieve correct results.
First list is like:
accept
accepted
accepts
accepting
calculate
calculated
calculates
calculating
fix
fixed
A list I want:
accept accepted accepts accepting
calculate calculated calculates calculating
fix fixed
This seems to work, but you will have to do Replace All multiple times:
Find (^(.+?)\s*?.*?)\R\2 and replace with \1\t\2. . matches newline should be disabled.
How it works:
It finds some characters at the start of line ^(.+?), then any linebreak \R, and those same characters again \2.
\s*?.*? is used to skip unnecessary characters after multiple Replace All. \s*? skips the first whitespace, and .*? any remaining chars on the line.
Match is replaced with \1\t\2, where \1 is anything matched in (^(.+?)\s*?.*?), and \2 is anything matched with (.+?). \t is used to insert tab character to replace linebreak.
How it breaks:
Note that this will not work well with different words with similar prefix, like:
hand
hands
handle
handles
This will be hand hands handle handles after 2 replaces.
I can imagine doing this programatically with limited success (take first word which comes as a root and if derived word with this root follows, place it on the same line, else take the word as a new root and put it to new line). This will still fail at irregular words where root is not the same for all forms.
Without programming there is a way only with (manual) preprocessing – if there are less than 4 forms for given word in the list, you insert blank line for each missing verb form, so there are always 4 lines for each word. Then you can use regex to get each such a quadruple into one line.
I have a regex to replace a certain pattern with a certain string, where the string is built dynamically by repeating a certain character as many times as there are characters in the match.
For example, say I have the following substitution command:
%s/hello/-----/g
However, I would like to do something like this instead:
%s/hello/-{5}/g
where the non-existing notation -{5} would stand for the dash character repeated five times.
Is there a way to do this?
Ultimately, I'd like to achieve something like this:
%s/(hello)*/-{\=strlen(\0)}/g
which would replace any instance of a string of only hellos with the string consisting of the dash character repeated the number of times equal to the length of the matched string.
%s/\v(hello)*/\=repeat('-',strlen(submatch(0)))/g
As an alternative to using the :substitute command (the usage of
which is already covered in #Peter’s answer), I can suggest automating
the editing commands for performing the replacement by means of
a self-referring macro.
A straightforward way of overwriting occurrences of the search pattern
with a certain character by hand would the following sequence of
Normal-mode commands.
Search for the start of the next occurrence.
/\(hello\)\+
Select matching text till the end.
v//e
Replace selected text.
r-
Repeat from step 1.
Thus, to automate this routine, one can run the command
:let[#/,#s]=['\(hello\)\+',"//\rv//e\rr-#s"]
and execute the contents of that s register starting from the
beginning of the buffer (or anther appropriate location) by
gg#s
I'm looking for some regex code that I can use to check for a valid username.
I would like for the username to have letters (both upper case and lower case), numbers, spaces, underscores, dashes and dots, but the username must start and end with either a letter or number.
Ideally, it should also not allow for any of the special characters listed above to be repeated more than once in succession, i.e. they can have as many spaces/dots/dashes/underscores as they want, but there must be at least one number or letter between them.
I'm also interested to find out if you think this is a good system for a username? I've had a look for some regex that could do this, but none of them seem to allow spaces, and I would like for the usernames to have some spaces in them.
Thank you :)
So it looks like you want your username to have a "word" part (sequence of letters or numbers), interspersed with some "separator" part.
The regex will look something like this:
^[a-z0-9]+(?:[ _.-][a-z0-9]+)*$
Here's a schematic breakdown:
_____sep-word…____
/ \
^[a-z0-9]+(?:[ _.-][a-z0-9]+)*$ i.e. "word ( sep word )*"
|\_______/ \____/\_______/ |
| "word" "sep" "word" |
| |
from beginning of string... till the end of string
So essentially we want to match things like word, word-sep-word, word-sep-word-sep-word, etc.
There will be no consecutive sep without a word in between
The first and last char will always be part of a word (i.e. not a sep char)
Note that for [ _.-], - is last so that it's not a range definition metacharacter. The (?:…) is what is called a non-capturing group. We need the brackets for grouping for the repetition (i.e. (…)*), but since we don't need the capture, we can use (?:…)* instead.
To allow uppercase/various Unicode letters etc, just expand the character class/use more flags as necessary.
References
regular-expressions.info/Anchors, Character Class, Repetition, Grouping
Although I'm sure someone will shortly post a 1 million lines regex to do exactly what you want, I don't think in this case a regex is a good solution.
Why don't you write a good old fashioned parser? It will take about as long as writing the regex that does everything you mentioned, but it's going to be much easier to maintain and read.
In particular, this is the tricky part:
it should also not allow for any of
the special characters listed above to
be repeated more than once in
succession
Alternatively you can always do a hybrid of the two. A regex for the other checks ([a-zA-Z0-9][a-zA-Z0-9 _-\.]*[a-zA-Z0-9]) and a non-regex method for the no-repeat requirement.
You don't have to use a regex for everything. I find that requirements like the "no two consecutive characters" usually make the regexes so ugly that it's better to do that bit with a simple procedural loop.
I'd just use something like ^[A-Za-z0-9][A-Za-z0-9 \.\-_]*[A-Za-z0-9]$ (or the equivalents like ::alnum:: if your regex engine is more advanced) and then just check every character in a loop to make sure the next character isn't the same.
By doing it procedurally, you can check all the other rules you're likely to want at some point without resorting to what I call "regex gymnastics", things like:
not allowed to contain your first or last name.
no more than two consecutive digits.
and so forth.