Regex - How do you match everything except four digits in a row? - regex

Using Regex, how do you match everything except four digits in a row? Here is a sample text that I might be using:
foo1234bar
baz 1111bat
asdf 0000 fdsa
a123b
Matches might look something like the following:
"foo", "bar", "baz ", "bat", "asdf ", " fdsa", "a123b"
Here are some regular expressions I've come up with on my own that have failed to capture everything I need:
[^\d]+ (this one includes a123b)
^.*(?=[\d]{4}) (this one does not include the line after the 4 digits)
^.*(?=[\d]{4}).* (this one includes the numbers)
Any ideas on how to get matches before and after a four digit sequence?

You haven't specified your app language, but practically every app language has a split function, and you'll get what you want if you split on \d{4}.
E.g., in Java:
String[] stuffToKeep = input.split("\\d{4}");
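For comparison, here is a minimal Python sketch of the same split idea. It assumes each line is handled separately and that empty pieces should be dropped; the names `sample` and `pieces` are illustrative only.

```python
import re

sample = "foo1234bar\nbaz 1111bat\nasdf 0000 fdsa\na123b"

pieces = []
for line in sample.splitlines():
    # Split wherever four digits appear in a row; shorter runs such as "123" stay intact.
    # Empty strings (produced when a line starts or ends with the digits) are dropped.
    pieces.extend(part for part in re.split(r"\d{4}", line) if part)

print(pieces)
# ['foo', 'bar', 'baz ', 'bat', 'asdf ', ' fdsa', 'a123b']
```

Note that a run of five or more digits would be split after its first four digits, leaving the rest in the output; whether that is desirable depends on the data.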

You can use a negative lookahead:
(?!\b\d{4}\b)(\b\w+\b)

In Python the following is very close to what you want:
In [1]: import re
In [2]: sample = '''foo1234bar
...: baz 1111bat
...: asdf 0000 fdsa
...: a123b'''
In [3]: re.findall(r"([^\d\n]+\d{0,3}[^\d\n]+)", sample)
Out[3]: ['foo', 'bar', 'baz ', 'bat', 'asdf ', ' fdsa', 'a123b']

Related

How to combine independent regular expressions and apply them on all rows of a dataset using Pandas?

Problem Statement:
I have two separate regular expressions that I am trying to "combine" into one and apply to each row in a dataset. The matching part of each row should go to a new Pandas dataframe column called "Wanted". Please see the example data below for how values that match should be formatted in the "Wanted" column.
Example Data (how I want it to look):
Column0 | Wanted (want "Column0" to look like this)
Alice\t12-345-623/ 10-1234 | Alice, 12-345-623, 10-1234
Bob 201-888-697 / 12-0556a | Bob, 201-888-697, 12-0556a
Tim 073-110-101 / 13-1290 | Tim, 073-110-101, 13-1290
Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c | Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
In other words...:
2-3 digits ----- hyphen ---- 3 digits --- hyphen ---- 3 digits ---- any character ----
2 digits --- hyphen --- 4 digits ---- permit one single character
What I have tried #1:
After dinking around for a while I figured out two different regular expressions that on their own will solve part of the problem. Kinda.
The first regex I tried matches the first group of numbers in each row, but it doesn't get the second group, which I also want. I'm not sure how robust it is, though.
Example Problem Row (regex = r"(?:\d{1,3}-){0,3}\d{1,3}")
search_in = "Alice\t12-345-623/ 10-1234"
wanted_regex = r"(?:\d{1,3}\-){0,3}\d{1,3}"
match = re.search(wanted_regex, search_in)
match.group(0)
Wanted: Alice, 12-345-623, 10-1234
Got: 12-345-623 # matches the group of numbers but isn't formatted how I would like (see example data)
What I have tried #2:
This will match the second part in each row, but only if it's the only value in the column. The problem I have is that it matches on the first group of digits instead of the second.
Example Problem Row (regex = r"(?:\d{2,3}-){1}\d{3,4}") # different regex than above!
search_in = "Alice\t12-345-623/ 10-1234"
wanted_regex = r"(?:\d{2,3}\-){1}\d{3,4}"
match = re.search(wanted_regex, search_in)
match.group(0)
Wanted : Alice, 12-345-623, 10-1234
Got: 12-345 # matched on the first part
Known Problems:
When I try, "Alice\t12-345-623/ 10-1234", it will match "12-345" when I'm trying to match "10-1234"
Thank you!
Thanks in advance to all you wizards being willing to help me with this problem. I really appreciate it:)
Note: I have asked regarding regex that may make solving this problem easier. It might not, but here is the link anyways --> How to use regex to select a row and a fixed number of rows following a row containing a specific substring in a pandas dataframe
So this works for the four test examples you gave. How's this using the .split() method? Technically this returns a list of values and not a string.
import re
# text here
text = "Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c"
# split this out to a list. remove the ending parenthesis since you are *splitting* on this
new_splits = re.split(r'\t|/|and|\(| ', text.replace(')',''))
# filter out the blank spaces
list(filter(None,new_splits))
['Joe', '74-111-333', '33-1290', 'Amy', '12-345-623', '10-1234c']
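and if you are using pandas you can try the same steps above (removing the closing parenthesis first, as in the plain Python version, and splitting on a real tab as well as a literal \t):
df['answer_Step1'] = df['Column0'].str.replace(')', '', regex=False).str.split(r'\\t+|\t+|/|and|\(| ')
df['answer_final'] = df['answer_Step1'].apply(lambda x: list(filter(None,x)))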
You can use
re.sub(r'\s*\band\b\s*|[^\w-]+', ', ', text)
Pandas version:
df['Wanted'] = df['Column0'].str.replace(r'\s*\band\b\s*|[^\w-]+', ', ', regex=True)
Details:
\s*\band\b\s* - the whole word and (\b are word boundaries), surrounded by zero or more whitespace chars
| - or
[^\w-]+ - one or more chars other than letters, digits, _ and -
See a Python demo:
import re
texts = ['Alice 12-345-623/ 10-1234',
'Bob 201-888-697 / 12-0556a','Tim 073-110-101 / 13-1290',
'Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c']
for text in texts:
    print(re.sub(r'\s*\band\b\s*|[^\w-]+', ', ', text))
# => Alice, 12-345-623, 10-1234
# Bob, 201-888-697, 12-0556a
# Tim, 073-110-101, 13-1290
# Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
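An alternative is to extract the wanted tokens instead of replacing the separators. The sketch below is only one possible way to combine the two patterns from the question into a single alternation. The token shapes are assumptions taken from the question's description (a name, a 2-3/3/3 digit group, and a 2/4 digit group with one optional trailing letter), and `to_wanted` is a hypothetical helper name.

```python
import re

# Assumed token shapes, based on the question's description:
#   1. a name made of letters (the connecting word "and" is skipped)
#   2. 2-3 digits, hyphen, 3 digits, hyphen, 3 digits
#   3. 2 digits, hyphen, 4 digits, plus one optional trailing letter
TOKEN = re.compile(r"\b(?!and\b)[A-Za-z]+|\d{2,3}-\d{3}-\d{3}|\d{2}-\d{4}[A-Za-z]?")

def to_wanted(text):
    """Join every recognised token with ', ' (hypothetical helper)."""
    return ", ".join(TOKEN.findall(text))

print(to_wanted("Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c"))
# Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
```

With pandas this could be applied row by row, for example `df['Wanted'] = df['Column0'].apply(to_wanted)`. Unlike the replacement approach, anything that does not fit one of the three shapes is silently dropped, which may or may not be what you want.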

match only the digits from percentage using regular expression

I have a pandas dataframe in which some of the cells contain percentages. On closer inspection, each cell looks like '\u200b68%', '\u200b.75%', '\u200b3.4%'. I only want to extract the number.
I tried re.findall('(\d*(\.\d+)?)','\u200b.75%') but got too many results.
What I expected: 68, .75, 3.4.
Something like this should work..
\\u200b([0-9.]{1,5})%
Demo: https://paiza.io/projects/L9ZgArU-WZXxcRlGzVHH0Q
You could add the percentage sign after it, and use a single capturing group to be used with re.findall
(\d*(?:\.\d+)?)%
If the \u200b part should be present, you could also match that:
\\u200b(\d*(?:\.\d+)?)%
For example
import re
regex = r"\\u200b(\d*(?:\.\d+)?)%"
test_str = ("\\u200b68%\n"
"\\u200b.75%\n"
"\\u200b3.4%")
matches = re.findall(regex, test_str)
print(matches)
Output
['68', '.75', '3.4']
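The example above tests against the literal text \u200b (a backslash followed by "u200b"). If the cells instead hold the actual zero-width space character U+200B, which is what '\u200b68%' means as a Python string literal, the shorter pattern without the prefix is enough. A minimal sketch under that assumption (`extract_number` is a hypothetical helper name):

```python
import re

# Real zero-width spaces (U+200B), as Python stores the string literals from the question.
cells = ["\u200b68%", "\u200b.75%", "\u200b3.4%"]

# One capturing group only, so group(1) is exactly the number before '%'.
NUMBER_BEFORE_PERCENT = re.compile(r"(\d*(?:\.\d+)?)%")

def extract_number(cell):
    """Return the digits before '%' as a string, or None if nothing matches."""
    m = NUMBER_BEFORE_PERCENT.search(cell)
    return m.group(1) if m else None

numbers = [extract_number(c) for c in cells]
print(numbers)
# ['68', '.75', '3.4']
```

The original attempt returned extra results because it had two capturing groups and no anchor such as '%', so re.findall also reported empty matches at every position.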

python3: regex needs a character to match, but I don't want it in the output

I have a string such as
Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/
I am trying to extract 839518730.47873.0000 from it. For this exact string my regex works, but if there is any digit before the first =, it all goes wrong.
No Digit
>>> m=re.search('[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/')
>>> m.group()
'839518730.47873.0000'
With Digit
>>> m=re.search('[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
>>> m.group()
'2'
Is there any way I can extract only '839518730.47873.0000', no matter what else is in the string?
I tried
>>> m=re.search('=[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/')
>>> m.group()
'=839518730.47873.0000'
as well, but the output then starts with '=', which I don't want.
Any ideas?
Thank you.
If your substring always comes after the first =, you can just use capture group with =([\d.]+) pattern:
import re
result = ""
m = re.search(r'=([0-9.]+)','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
if m:
    result = m.group(1) # Get Group 1 value only
print(result)
The main point is that you match anything you do not need and match and capture (with the unescaped round brackets) the part of pattern you need. The value you need is in Group 1.
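Another way to require the = without returning it is a lookbehind, which Python's re module supports for fixed-width patterns. This is a sketch of an alternative technique, not part of the answer above; the variable names are illustrative.

```python
import re

header = "Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/"

# (?<==) asserts that the previous character is "=" without making it part of the match,
# so group(0) already holds only the value.
m = re.search(r"(?<==)[0-9.]+", header)
value = m.group(0) if m else ""

print(value)
# 839518730.47873.0000
```

The capture group version is usually easier to read; the lookbehind is mainly useful when the whole match must be the value, for example with re.findall or re.sub.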
You can use word boundaries:
\b[\d.]+
Or to make match more targeted use lookahead for next semi-colon after your matched text:
\b[\d.]+(?=\s*;)
Update:
>>> m=re.search(r'\b[\d.]+','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
>>> m.group(0)
'839518730.47873.0000'

How to match regex with same format but different in terms of character set?

Suppose I have a string and I want to match only the parameters whose value is empty, not the ones whose value is present.
For example: &lang=&val=1233
I need only &lang and not &val, as the latter has an actual value.
I have this regex
&(.+)=(?!\s\S)
which matches &lang=&val= in the string.
Can anyone help me out?
Use the following regular expression:
(?:(?<=\?)|&)[^=]+=(?=&|$)
It can be explained as follows:
(?: ....): non-capturing group (it does not create a capture group); this may not be needed, depending on your purpose.
\?: escaped ? to match ? literally.
(?<=\?): means "preceded by ?"; the ? is not included in the result.
(?=&|$): means "followed by &" or "at the end of the input".
The following are sample tests in the Python interactive shell:
>>> pattern = r'(?:(?<=\?)|&)[^=]+=(?=&|$)'
>>> re.findall(pattern, '&lang=&val=')
['&lang=', '&val=']
>>> re.findall(pattern, '&lang=&val=1233')
['&lang=']
>>> re.findall(pattern, '&lang=&val=&val2=123&val3=')
['&lang=', '&val=', '&val3=']
>>> re.findall(pattern, '?lang=&val=&val2=123&val3=')
['lang=', '&val=', '&val3=']
>>> re.findall(pattern, '?lang=blah&val=&val2=123&val3=')
['&val=', '&val3=']
>>> re.findall(pattern, 'www.html.com?user=&lang=eng&code=.in')
['user=']
Do you mean
(&|\?)([^&=]+)=(&|$)
(you can use non capturing groups if you need)
but I would just build a hash of all query string parameters and pick the keys without values. it is cheaper.
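The "build a hash and pick the keys without values" idea could look like the following in Python. This is a sketch assuming the input is a plain query string; `empty_params` is a hypothetical helper name, and the standard library's urllib.parse.parse_qsl does the parsing.

```python
from urllib.parse import parse_qsl

def empty_params(query):
    """Return the names of query parameters whose value is empty (hypothetical helper)."""
    # keep_blank_values=True is required; by default parse_qsl drops empty values.
    pairs = parse_qsl(query.lstrip("?&"), keep_blank_values=True)
    return [name for name, value in pairs if value == ""]

print(empty_params("&lang=&val=1233"))
# ['lang']
print(empty_params("?lang=&val=&val2=123&val3="))
# ['lang', 'val', 'val3']
```

This avoids the regex entirely and also takes care of URL decoding, at the cost of parsing every parameter.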
Try this:
[?&]([^&]+)=(&|$)
The first group will have the name of your parameter.
Note that this regex will also catch an empty first parameter (val1 in foo.php?val1=&val2=ok)
Try this one:
(&([^=]+))=(?=&)

Regex to determine if a string starts with more than one capital letter

I am trying to determine if a string has more than 1 capital letter in a row at the start of a string. I have the following regex but it doesn't work:
`^[A-Z]{2,1000}`
I want it to return true for:
ABC
ABc
ABC ABC
ABc Abc
But false for:
Abc
AbC
Abc Abc
Abc ABc
I have the 1000 just because I know the value won't be more than 1000 characters, but I don't care about restricting the length.
I am working with PHP, if it makes any difference.
Wouldn't leaving out the second number do it?
^[A-Z]{2,}
Which basically says "string starts with 2 or more capital letters"
Here's some tests with the strings you provided that should match:
>>> 'ABC'.match(/^[A-Z]{2,}/);
["ABC"]
>>> 'ABc'.match(/^[A-Z]{2,}/);
["AB"]
>>> 'ABC ABC'.match(/^[A-Z]{2,}/);
["ABC"]
>>> 'ABc Abc'.match(/^[A-Z]{2,}/);
["AB"]
And then the ones it shouldn't match for:
>>> 'Abc'.match(/^[A-Z]{2,}/);
null
>>> 'AbC'.match(/^[A-Z]{2,}/);
null
>>> 'Abc Abc'.match(/^[A-Z]{2,}/);
null
>>> 'Abc ABc'.match(/^[A-Z]{2,}/);
null
If you only want to match the first two, you can just do {2}
I ran ^[A-Z]{2,} through the Regex Tester for egrep searches, and your test cases worked.
Try ^[A-Z][A-Z]
So do you want it to match the entire line if the first 2 letters are capital? If so, this should do it...
^[A-Z]{2,}.*$
php > echo preg_match("/^[A-Z]{2}/", "ABc");
1
php > echo preg_match("/^[A-Z]{2}/", "Abc");
0
/^[A-Z]{2}/ seems to work for me. Since you're doing a substring match anyways, there's no need to do {2,} or {2,1000}.
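For readers who want to check the question's test cases outside PHP, here is the same idea as a small Python sketch. The function name is illustrative, and re.match is used because it only matches at the start of the string.

```python
import re

def starts_with_two_capitals(s):
    """True if the string begins with at least two capital letters A-Z."""
    # re.match anchors at the start of the string, so no "^" is needed.
    return re.match(r"[A-Z]{2}", s) is not None

should_match = ["ABC", "ABc", "ABC ABC", "ABc Abc"]
should_not_match = ["Abc", "AbC", "Abc Abc", "Abc ABc"]

print([starts_with_two_capitals(s) for s in should_match])
# [True, True, True, True]
print([starts_with_two_capitals(s) for s in should_not_match])
# [False, False, False, False]
```

As the answer above notes, {2} is enough for a yes/no test; {2,} only changes how much text the match contains.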
Since your regex works fine at finding which lines begin with 2 caps, I assume you had another question.
Maybe you have case-insensitive matching turned on.
Try
(?-i)^[A-Z]{2,}
Or maybe you meant "match the whole line"
(?-i)^[A-Z]{2,}.*$
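A non-regex version:
$str = "Abccc";
$ch = substr($str, 0, 2);
// ctype_upper() is true only if every character is an uppercase letter,
// so digits or punctuation in the first two positions do not count as capitals
if (strlen($ch) == 2 && ctype_upper($ch)) {
    print "ok";
} else {
    print "not ok";
}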
In short, here is the answer in regex:
^[A-Z]{2,} or ^[A-Z][A-Z]+
whichever looks easier to you :)