I am trying to match only the first '.' character in '-2.232.232'.
I got close with this regex:
^[^\.]*(\.)(?=.*\.)
but it matches '-2.' instead of '.'.
Thank you very much
You can use Series.str.replace with the n argument set to 1:
import pandas as pd
df = pd.DataFrame({'Data':['-2.232.232']})
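# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['Data'].str.replace(r"\.(?=[^.]*\.)", "", n=1, regex=True)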
# => 0 -2232.232
Here,
\.(?=[^.]*\.) - matches a dot that is followed by zero or more chars other than a dot and then another dot
n=1 - sets the number of replacements, so only the first match is replaced.
Alternatively, you may use
>>> df['Data'].str.replace(r"^([^.]*)\.(?=[^.]*\.)", r"\1", n=1, regex=True)
0 -2232.232
Here, ^([^.]*) matches and captures into Group 1 any zero or more chars other than . from the start of the string, and \1 in the replacement pattern refers to that captured value.
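As a quick check outside pandas, the same lookahead pattern can be tried with the standard re module. This is only a sketch; the variable names are illustrative and not from the original answer.

```python
import re

s = "-2.232.232"  # sample value from the question

# A dot that is followed by non-dot chars and then another dot
pattern = re.compile(r"\.(?=[^.]*\.)")

m = pattern.search(s)
first_dot_span = m.span()             # span of the matched dot only
result = pattern.sub("", s, count=1)  # remove just that first dot

print(first_dot_span)  # (2, 3)
print(result)          # -2232.232
```

Because the lookahead consumes nothing, the match is the single dot itself rather than '-2.'.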
I have a data frame column in the below format:
header
THIS IS an example
ALSO this
ONE LAST
J. one more
I want to split it into two columns:
header1     header2
THIS IS     an example
ALSO        this
ONE LAST    null
null        J. one more
I have tried extracting the information like this:
df1['header'].str.split('[A-Z]', expand=True)
but my regular expressions are not up to par. Any help is much appreciated!
Greek Letter Notice
To match only Greek uppercase letters, replace [A-Z] in the patterns below with [\u0391-\u03A1\u03A3-\u03A9]. To match both ASCII and Greek uppercase letters, use [A-Z\u0391-\u03A1\u03A3-\u03A9].
For example:
rx = r'^\s*(?P<header1>(?:[\u0391-\u03A1\u03A3-\u03A9]+\b(?!\.)(?:\s+[\u0391-\u03A1\u03A3-\u03A9]+)*\b)?)(?:\s+(?P<header2>.*))?'
new_df = df['header'].str.extract(rx, expand=True)
See the regex demo.
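The substitution described in this notice can be sketched with the standard re module by building the pattern from a character-class string. The names UPPER, rx and split_header are assumptions made for this illustration, and the Greek sample strings are invented test inputs.

```python
import re

# Uppercase class: ASCII plus Greek capitals (two ranges, skipping U+03A2)
UPPER = r"[A-Z\u0391-\u03A1\u03A3-\u03A9]"

# Same shape as the ASCII pattern, with [A-Z] swapped for UPPER
rx = re.compile(rf"^\s*((?:{UPPER}+\b(?!\.)(?:\s+{UPPER}+)*)?)\s*(.*)")

def split_header(text):
    """Return (header1, header2) for one cell value."""
    m = rx.match(text)
    return m.group(1), m.group(2)

greek = split_header("ΑΒΓ ΔΕ κάτι")
ascii_row = split_header("THIS IS an example")
initial = split_header("Α. κάτι")

print(greek)      # ('ΑΒΓ ΔΕ', 'κάτι')
print(ascii_row)  # ('THIS IS', 'an example')
print(initial)    # ('', 'Α. κάτι')
```

Keeping the class in one variable makes it easy to switch between ASCII-only, Greek-only and combined matching without retyping the pattern.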
You can use
df[['header1', 'header2']] = df['header'].str.extract(r'^\s*((?:[A-Z]+\b(?!\.)(?:\s+[A-Z]+)*)?)\s*(.*)', expand=True)
Output:
>>> df
               header   header1      header2
0  THIS IS an example   THIS IS   an example
1           ALSO this      ALSO         this
2            ONE LAST  ONE LAST
3         J. one more            J. one more
See the regex demo.
Details:
^ - start of string
\s* - zero or more whitespaces
((?:[A-Z]+\b(?!\.)(?:\s+[A-Z]+)*)?) - Group 1 (header1): an optional sequence of one or more uppercase ASCII letters (not followed by a . char) and then zero or more sequences of one or more whitespaces and one or more uppercase ASCII letters
\s* - zero or more whitespaces
(.*) - Group 2 (header2): any zero or more chars other than line break chars, as many as possible.
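The breakdown above can also be written directly into the pattern using verbose mode. The sketch below uses the standard re module rather than pandas so it runs on its own; the variable names are illustrative.

```python
import re

# Verbose-mode copy of the pattern; whitespace and # comments are ignored
rx = re.compile(r"""
    ^\s*                      # start of string, optional leading whitespace
    (                         # Group 1 (header1), may be empty
      (?:
        [A-Z]+\b(?!\.)        # uppercase word not followed by a dot
        (?:\s+[A-Z]+)*        # further uppercase words
      )?
    )
    \s*                       # whitespace between the two parts
    (.*)                      # Group 2 (header2): rest of the line
""", re.VERBOSE)

rows = ["THIS IS an example", "ALSO this", "ONE LAST", "J. one more"]
pairs = [rx.match(row).groups() for row in rows]
for row, pair in zip(rows, pairs):
    print(repr(row), "->", pair)
```

Every part of the pattern is optional, so rx.match never returns None for these rows; an unmatched header1 simply comes back as an empty string.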
You may extract to a new dataframe using named capturing groups:
>>> new_df = df['header'].str.extract(r'^\s*(?P<header1>(?:[A-Z]+\b(?!\.)(?:\s+[A-Z]+)*)?)\s*(?P<header2>.*)', expand=True)
>>> new_df
    header1      header2
0   THIS IS   an example
1      ALSO         this
2  ONE LAST
3            J. one more
You can also use 2 named capture groups, and join the columns.
^(?P<header1>[A-Z]+(?:[^\S\n]+[A-Z]+)*)?(?:(?:^|[^\S\n]+)(?P<header2>.+))?$
(or use [a-z].* instead of .+ if it must start with a lowercase char)
^ Start of string
(?P<header1>[A-Z]+ Capture group header1, match 1+ chars A-Z
(?:[^\S\n]+[A-Z]+)*)? Optionally match spaces and 1+ chars A-Z
(?: Non capture group
(?:^|[^\S\n]+) Either assert the start of the string or match 1+ spaces
(?P<header2>.+) Named group header2 match 1+ chars
)? Close group and make it optional
$ End of string
See a regex demo and a Python demo.
Example
import pandas as pd
strings = [
    "THIS IS an example",
    "ALSO this",
    "ONE LAST",
    "J. one more"
]
df1 = pd.DataFrame(strings, columns=["header"])
# Raw string so that \S and \n reach the regex engine unchanged
df1 = df1.join(
    df1['header'].str.extract(
        r'^(?P<header1>[A-Z]+(?:[^\S\n]+[A-Z]+)*)?(?:(?:^|[^\S\n]+)(?P<header2>.+))?$',
        expand=True
    )
    .fillna('')
)
print(df1)
Output
               header   header1      header2
0  THIS IS an example   THIS IS   an example
1           ALSO this      ALSO         this
2            ONE LAST  ONE LAST
3         J. one more            J. one more
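Using str.extract, we can try:
df["header1"] = df["header"].str.extract(r'^([A-Z]+(?: [A-Z]+)?)', expand=False)
df["header2"] = df["header"].str.extract(r'\b([a-z]+(?: [a-z]+)?)', expand=False)
Note that these patterns capture at most two words each, and the first one also captures the J of "J. one more" as header1.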
I'm struggling to parse the following combinations of characters.
I have two types of text:
1. AF-B-W23F4-USLAMC-X99-JLK
2. LS-V-A23DF-SDLL--X22-LSM
I want to get the last two groups of characters, separated by a dash.
From 1. that is X99-JLK, and from 2. it is X22-LSM.
I accomplished 2. with the following regex: '--(.*-.*)'
How can I parse sample 1., and is there any option to parse both at once with something like an OR operator?
Thanks for any help!
The pattern --(.*-.*) that you tried matches the second example because that string contains --, and the match starts at its first occurrence.
The .* then matches until the end of the string and backtracks to find another hyphen.
As .* can match any character (including -) and there are no anchors or boundaries set, this is a very broad match.
If there have to be 2 dashes, you can match the first one and use a capture group for the part containing the second one, using a negated character class [^-].
The negated character class can also match a newline. If you don't want to match a newline, you can use [^-\r\n], or [^-\s] to exclude all whitespace as well (there is none in the example data).
-([^-]+-[^-]+)$
Explanation
- Match -
( Capture group 1
[^-]+-[^-]+ Match the second dash between chars other than -
) Close group 1
$ End of string
See a regex demo
For example, using JavaScript:
const regex = /-([^-]+-[^-]+)$/;
[
  "AF-B-W23F4-USLAMC-X99-JLK",
  "LS-V-A23DF-SDLL--X22-LSM"
].forEach(s => {
  const m = s.match(regex);
  if (m) {
    console.log(m[1]);
  }
});
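For comparison, here is a hedged Python sketch of the same idea. It runs the pattern above with re.search and, as an alternative that is not part of the original answer, takes the last two dash-separated fields with str.rsplit. The helper names are made up for this example.

```python
import re

samples = ["AF-B-W23F4-USLAMC-X99-JLK", "LS-V-A23DF-SDLL--X22-LSM"]

def last_pair_regex(text):
    """Capture the last two dash-separated fields with the anchored pattern."""
    m = re.search(r"-([^-]+-[^-]+)$", text)
    return m.group(1) if m else None

def last_pair_split(text):
    """Regex-free alternative: split twice from the right, rejoin the tail."""
    return "-".join(text.rsplit("-", 2)[-2:])

for s in samples:
    print(last_pair_regex(s), last_pair_split(s))
```

The two helpers agree on the sample data, but only the regex version returns None when one of the last two fields is empty, so it is the stricter choice.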
You can try a lookahead to match the last pair before the newline. JavaScript example:
const str = `
AF-B-W23F4-USLAMC-X99-JLK
LS-V-A23DF-SDLL--X22-LSM
`;
const re = /[^-]*-[^-]*(?=\n)/g;
console.log(str.match(re));
I have a string that looks like this:
'Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
I need to parse the last set of numbers, the ones between the last period and the closing paren (in this case, 241384081), out of the string, keeping in mind that there may be one or more sets of parentheses in the filename "yada_yada.mov".
So far I have this:
mo = re.match('.*([0-9])\)$', data1)
...where data1 is the string. But that is only returning the very last digit.
Any help, please?
Thanks!
You may use
(\d[\d.]*)\)$
See the regex demo.
Details
(\d[\d.]*) - Capturing group 1: a digit and then any amount of . and digits, 0 or more times
\) - a )
$ - end of string.
See the Python demo:
import re
s='Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
m = re.search(r'(\d[\d.]*)\)$', s)
if m:
    print(m.group(1)) # => 22.4338.241384081
    # print(m.group(1).replace(".", "")) # => 224338241384081
Alternative patterns:
(\d+(?:\.\d+)*)\)$ # To match digits and then 0 or more repetitions of . + digits
(\d+(?:\.\d+)*)\)\s*$ # To allow any 0+ trailing whitespaces
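The question asks for only the final group of digits (241384081). A small variation on the patterns above, offered as a sketch rather than as part of the original answer, anchors on the last period instead. The variable names are illustrative.

```python
import re

s = 'Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'

# Digits between a literal period and the closing paren at the end of the string
m = re.search(r"\.(\d+)\)$", s)
last_number = m.group(1) if m else None

# Equivalent route: take the full dotted value, then keep the part after the last period
full = re.search(r"(\d[\d.]*)\)$", s)
last_from_full = full.group(1).rsplit(".", 1)[-1] if full else None

print(last_number)     # 241384081
print(last_from_full)  # 241384081
```

Both routes rely on the `\)$` anchor, so parentheses or periods earlier in the filename do not affect the result.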
Using regex in Python 3, I have a 16-digit number as input.
I need to check that no 4 consecutive digits in the number are the same.
1234567890111234 ------> valid
1234555567891234 ------> invalid
You could search the string for the pattern (.)\1{3}, which matches 4 consecutive identical characters. If re.search returns None, the string is valid:
import re
lst = ['12345678901112', '12345555678912']
for x in lst:
    print(x)
    print('Valid: ', re.search(r'(.)\1{3}', x) is None)
#12345678901112
#Valid: True
#12345555678912
#Valid: False
Here, (.) matches any single character and captures it as group 1, which the back reference \1 then refers to. The quantifier {3} on \1 requires three more copies of that character, so the 4 matched characters are all the same.
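Combining the back-reference check with a length check gives a single validation step. The sketch below assumes the input must be exactly 16 ASCII digits, as in the question; the names NUMBER_RX and is_valid are hypothetical.

```python
import re

# Negative lookahead rejects any run of four identical digits;
# \d{16} then enforces the length (re.ASCII limits \d to 0-9)
NUMBER_RX = re.compile(r"(?!.*(\d)\1{3})\d{16}", re.ASCII)

def is_valid(number):
    """True if number is 16 digits with no digit repeated 4 times in a row."""
    return NUMBER_RX.fullmatch(number) is not None

print(is_valid("1234567890111234"))  # True
print(is_valid("1234555567891234"))  # False
```

Using (\d) instead of (.) keeps the repeat check tied to digits, which matters only if the input might contain other characters.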
I have the following string:
p=1A Testing$A123
I need a regex to get only the "A123". Consider the following:
The "A" can be any character between a-z (possibly more than a single character).
There is only one $ sign after the "p=".
There can be other "$" signs before the "p=".
Any ideas?
Your regex should be something like: p=.*\$(\w+[0-9]*)
p= matches the p=
.* matches any character greedily
\$ matches the $
(\w+[0-9]*) is a capture group: \w+ (one or more word characters) followed by [0-9]* (zero or more digits)
Change the [0-9]* to [0-9]+ if there should be at least one digit following the characters after the $.
Using capturing group:
JavaScript:
'a=3$bcc,p=1A Testing$A123'.match(/p=.*\$(\w+)/)[1]
// => "A123"
Python:
>>> import re
>>> re.search(r'p=.*\$(\w+)', r'a=3$bcc,p=1A Testing$A123').group(1)
'A123'
Using JavaScript:
var s = 'p=1A Testing$A123';
var m = s.match(/p=[^$]*\$(.*)$/);
// ["p=1A Testing$A123", "A123"]
// use m[1]
This should work:
/^p=[^$]*\$[A-Z]\d+$/
.*?p=[^$]*\$([a-z].*)$
Replace the capturing group ([a-z].*) with the pattern you want to use to match the end of the input.
For instance,
([a-z]\d*) : starts with [a-z], followed by zero or more decimal digits.
([a-z]\d{3}) : starts with [a-z], followed by exactly 3 decimal digits.
...
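Since the sample value starts with an uppercase A while these patterns use [a-z], one way to make such a variant match the sample is the case-insensitive flag. The following Python sketch assumes that reading of the requirements; it allows one or more leading letters, and the input string with an extra $ before p= is borrowed from the capturing-group example.

```python
import re

s = "a=3$bcc,p=1A Testing$A123"

# p=, then anything except $, then the single $, then letter(s) and digits up to the end
rx = re.compile(r"p=[^$]*\$([a-z]+\d*)$", re.IGNORECASE)

m = rx.search(s)
value = m.group(1) if m else None
print(value)  # A123
```

Using [^$]* instead of .* means the match cannot run past the single $ that follows p=, so earlier $ signs in the string are ignored.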