Remove given string from both start and end of a word - regex

Data :
col 1
AL GHAITHA
AL ASEEL
EMARAT AL
LOREAL
ISLAND CORAL
My code :
import re

def remove_words(df, col, letters):
    regular_expression = '^' + '|'.join(letters)
    df[col] = df[col].apply(lambda x: re.sub(regular_expression, "", x))
Desired output :
col 1
GHAITHA
ASEEL
EMARAT
LOREAL
ISLAND CORAL
SUNRISE
Function call :
letters = ['AL','SUPERMARKET']
remove_words(df=df, col='col 1', letters=letters)
Basically, I want to remove the given words when they appear at the start or at the end of the string (note: each must be a separate word).
For example, "EMARAT AL" should become "EMARAT".
Note that "LOREAL" should not become "LORE".
Code to build the df :
import pandas as pd

raw_data = {'col1': ['AL GHAITHA', 'AL ASEEL', 'EMARAT AL', 'LOREAL UAE',
                     'ISLAND CORAL', 'SUNRISE SUPERMARKET']}
df = pd.DataFrame(raw_data)

You may use
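pattern = r'(?s)^(?:{0})\b|(.*)\b(?:{0})$'.format("|".join(map(re.escape, letters)))
df['col 1'] = df['col 1'].str.replace(pattern, r'\1', regex=True).str.strip()
The r'(?s)^(?:{0})\b|(.*)\b(?:{0})$'.format("|".join(map(re.escape, letters))) expression creates a pattern like (?s)^(?:word1|word2)\b|(.*)\b(?:word1|word2)$, which matches any of the words as a whole word at the start or at the end of the string. The non-capturing groups make ^, \b and $ apply to every alternative rather than only to the first or last one.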
When checking the word at the end of the string, the whole text before it will be captured into Group 1, hence the replacement pattern contains the \1 placeholder to restore that text in the resulting string.
If your letters list contains only items made up of word characters, you may omit re.escape and replace map(re.escape, letters) with letters.
The .str.strip() call removes any leading/trailing whitespace left over after the replacement.
See the regex demo.
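As a minimal sketch of the same idea outside pandas, the pattern can be applied to plain Python strings. The helper name strip_edge_words is hypothetical (it is not part of the answer above), and the sample values are taken from the question's DataFrame.

```python
import re

def strip_edge_words(text, words):
    """Remove any of `words` when it is a whole word at the start or end of `text`."""
    alternation = "|".join(map(re.escape, words))
    pattern = r'(?s)^(?:{0})\b|(.*)\b(?:{0})$'.format(alternation)
    # Group 1 holds the text before a trailing word; it is None when a leading word matched.
    return re.sub(pattern, lambda m: m.group(1) or "", text).strip()

names = ['AL GHAITHA', 'AL ASEEL', 'EMARAT AL', 'LOREAL UAE',
         'ISLAND CORAL', 'SUNRISE SUPERMARKET']
cleaned = [strip_edge_words(name, ['AL', 'SUPERMARKET']) for name in names]
print(cleaned)
```

"LOREAL UAE" and "ISLAND CORAL" are left alone because \b finds no word boundary inside "LOREAL" or "CORAL".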

Related

match everything but a given string and do not match single characters from that string

Let's start with the following input.
Input = 'blue, blueblue, b l u e'
I want to match everything that is not the string 'blue'. Note that blueblue should not match, but single characters should (even if present in match string).
If I then replace the matches with an empty string, it should return:
Result = 'blueblueblue'
I have tried with [^\bblue\b]+
but this matches the last four single characters 'b', 'l','u','e'
Another solution:
(?<=blue)(?:(?!blue).)+(?=blue|$)|^(?:(?!blue).)+(?=blue|$)
Regex demo
If your regex engine supports \K, then we can try:
/blue\K|.*?(?=blue|$)/gm
Demo
This pattern says to match:
blue match "blue"
\K but then forget that match
| OR
.*? match anything else until reaching
(?=blue|$) the next "blue" or the end of the string
Edit:
On JavaScript, we can try the following replacement:
var input = "blue, blueblue, b l u e";
var output = input.replace(/blue|.*?(?=blue|$)/g, (x) => x != "blue" ? "" : "blue");
console.log(output);
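Python's built-in re module has no \K, so the callback approach from the JavaScript snippet carries over more directly. This is only a sketch; the function name keep_only is made up for illustration.

```python
import re

def keep_only(text, keep="blue"):
    """Drop everything from `text` except whole occurrences of `keep`."""
    token = re.escape(keep)
    # Either match the token itself, or lazily match anything up to the next token / end.
    pattern = "{0}|.*?(?={0}|$)".format(token)
    # Keep the match when it is the token, otherwise replace it with nothing.
    return re.sub(pattern, lambda m: m.group(0) if m.group(0) == keep else "", text)

print(keep_only("blue, blueblue, b l u e"))
```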

How do I replace the nth occurrence of a special character, say, a pipe delimiter with another in Scala?

I'm new to Spark using Scala and I need to replace every nth occurrence of the delimiter with the newline character.
So far, I have been successful at entering a new line after the pipe delimiter.
I'm unable to replace the delimiter itself.
My input string is
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
println(txt.replaceAll(".\\|", "$0\n"))
The above statement generates the following output.
January|
February|
March|
April|
May|
June|
July|
August|
September|
October|
November|
December
I referred to the suggestion at https://salesforce.stackexchange.com/questions/189923/adding-comma-separator-for-every-nth-character but when I enter the number in the curly braces, I only end up adding the newline after 2 characters after the delimiter.
I'm expecting my output to be as given below.
January|February
March|April
May|June
July|August
September|October
November|December
How do I change my regular expression to get the desired output?
Update:
My friend suggested I try the following statement
println(txt.replaceAll("(.*?\\|){2}", "$0\n"))
and this produced the following output
January|February|
March|April|
May|June|
July|August|
September|October|
November|December
Now I just need to get rid of the pipe symbol at the end of each line.
You want to move the 2nd bar | outside of the capture group.
txt.replaceAll("([^|]+\\|[^|]+)\\|", "$1\n")
//val res0: String =
// January|February
// March|April
// May|June
// July|August
// September|October
// November|December
Regex Explained (regex is not Scala)
( - start a capture group
[^|] - any character as long as it's not the bar | character
[^|]+ - 1 or more of those (any) non-bar chars
\\| - followed by a single bar char |
[^|]+ - followed by 1 or more of any non-bar chars
) - close the capture group
\\| - followed by a single bar char (not in capture group)
"$1\n" - replace the entire matching string with just the first $1 capture group ($0 is the entire matching string) followed by the newline char
UPDATE
For the general case of N repetitions, regex becomes a bit more cumbersome, at least if you're trying to do it with a single regex formula.
The simplest thing to do (not the most efficient but simple to code) is to traverse the String twice.
val n = 5
txt.replaceAll(s"(\\w+\\|){$n}", "$0\n")
.replaceAll("\\|\n", "\n")
//val res0: String =
// January|February|March|April|May
// June|July|August|September|October
// November|December
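For comparison, the general "every Nth delimiter" case can also be done without a regex by splitting and regrouping. The sketch below is in Python rather than Scala, and break_every_nth is a hypothetical helper name.

```python
def break_every_nth(text, n, delimiter="|"):
    """Replace every n-th delimiter in `text` with a newline."""
    parts = text.split(delimiter)
    # Re-join the items in chunks of n; the last chunk may be shorter.
    lines = [delimiter.join(parts[i:i + n]) for i in range(0, len(parts), n)]
    return "\n".join(lines)

txt = ("January|February|March|April|May|June|July|August|"
       "September|October|November|December")
print(break_every_nth(txt, 5))
```

Because the chunks are joined with the newline, no delimiter is left dangling at the end of a line and no second pass is needed.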
You could first split the string using '|' to get the array of string and then loop through it to perform the logic you want and get the output as required.
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
val out = txt.split("\\|")
var output: String = ""
for (i <- 0 until out.length - 1 by 2) {
  // join each pair of items with "|" and end the line
  val ref = out(i) + "|" + out(i + 1) + "\n"
  output = output + ref
}
// an odd number of items leaves one unpaired item at the end
if (out.length % 2 == 1) output = output + out.last
val finalout = output.stripLineEnd // drop the trailing newline, if any
println(finalout)

RegEx. Check and pad string to ensure certain string format is used

Is it possible to take a string and reformat it so that the output always has the same format?
I have an identification number that always follows the same format:
e.g.
166688205F02
16 66882 05 F 02
(15/16) (any 5 digit no) (05/06) (A-Z) (any 2 digit no)
Sometimes these are expressed as:
66882 5F 2
668825F2
66882 5 F 2
I want to take any of these lazy expressions and pad them to the proper format shown above (defaulting to 16 for the first group).
Is this possible?
Your numbers can be matched by the following regex:
^ *(1[56])? *(\d{5}) *(0?[56]) *([A-Z]) *(\d{1,2}) *$
Here is a rough breakdown. I named the parts of the identification number; you may have more appropriate names for them:
^ *         # Start the match at the beginning of the string and consume leading spaces, if any.
(1[56])?    # GROUP 1: The ID number prefix (optional).
 *          # Consume spaces, if any.
(\d{5})     # GROUP 2: The five-digit identifier code.
 *          # Consume spaces, if any.
(0?[56])    # GROUP 3: The two-digit indicator code.
 *          # Consume spaces, if any.
([A-Z])     # GROUP 4: The letter code.
 *          # Consume spaces, if any.
(\d{1,2})   # GROUP 5: The end code.
 *$         # End the match with any remaining spaces and the end of the string.
You didn't mention the language you are using. Here is a function I wrote in C# that uses this regex to reformat an input identification number.
private string FormatIdentificationNumber(string inputIdNumber) {
    const string DEFAULT_PREFIX = "16";
    const string REGEX_ID_NUMBER = @"^ *(1[56])? *(\d{5}) *(0?[56]) *([A-Z]) *(\d{1,2}) *$";
    const int REGEX_GRP_PREFIX = 1;
    const int REGEX_GRP_IDENTIFIER = 2;
    const int REGEX_GRP_INDICATOR = 3;
    const int REGEX_GRP_LETTER_CODE = 4;
    const int REGEX_GRP_END_CODE = 5;
    Match m = Regex.Match(inputIdNumber, REGEX_ID_NUMBER, RegexOptions.IgnoreCase);
    // Return the input unchanged when it does not look like an identification number.
    if (!m.Success) return inputIdNumber;
    string prefix = m.Groups[REGEX_GRP_PREFIX].Value.Length == 0 ? DEFAULT_PREFIX : m.Groups[REGEX_GRP_PREFIX].Value;
    string identifier = m.Groups[REGEX_GRP_IDENTIFIER].Value;
    string indicator = m.Groups[REGEX_GRP_INDICATOR].Value.PadLeft(2, '0');
    string letterCode = m.Groups[REGEX_GRP_LETTER_CODE].Value.ToUpper();
    string endCode = m.Groups[REGEX_GRP_END_CODE].Value.PadLeft(2, '0');
    return String.Concat(prefix, identifier, indicator, letterCode, endCode);
}
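Since the question does not name a language, here is a rough Python sketch of the same logic as the C# function above. The name format_identification_number and the default_prefix parameter are illustrative only.

```python
import re

ID_NUMBER = re.compile(
    r"^ *(1[56])? *(\d{5}) *(0?[56]) *([A-Z]) *(\d{1,2}) *$",
    re.IGNORECASE,
)

def format_identification_number(raw, default_prefix="16"):
    """Pad a loosely written ID into the full 12-character format."""
    m = ID_NUMBER.match(raw)
    if m is None:
        # Not recognisable as an ID: return the input unchanged, as the C# version does.
        return raw
    prefix, identifier, indicator, letter, end_code = m.groups()
    return "".join([
        prefix or default_prefix,  # the optional group is None when absent
        identifier,
        indicator.zfill(2),        # "5" -> "05"
        letter.upper(),
        end_code.zfill(2),         # "2" -> "02"
    ])

for sample in ["66882 5F 2", "668825F2", "66882 5 F 2", "166688205F02"]:
    print(format_identification_number(sample))
```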
You can remove the space characters by replacing them with an empty string.
In JS, for example:
"66882 5F 2".replace(/\s/g, '') // Will output "668825F2"
"66882 5 F 2".replace(/\s/g, '') // Will output "668825F2"
The \s character class matches whitespace, and the g flag replaces every occurrence (passing the string ' ' as the first argument would replace only the first space).
Once the spaces are removed, you can check the fully padded format with this regex:
^1[56]([0-9]{5})0[56][A-Z]([0-9]{2})$

Regex to extract value at fixed position index

I have the following string of characters:
73746174652C313A312C310D
                     |
                     - extract the value at this position
I would like to extract the value 1 (the 1 at the end of the string) using regex.
So basically a regex that acts as a charAt(index).
I need this solution for a 3rd party application that only supports regular expressions. Note that the application cannot access capture groups and does not support negative lookbehinds.
In C#:
(?<=^.{21})(.)
in JS:
/.(?=.{2}$)/
You could try:
(?<=^.{21}).
It won't work in JavaScript engines that lack lookbehind support, but perhaps it will work in your app.
It means: a single character preceded (?<= ... ) by the beginning of the string ^ plus 21 characters .{21} . So, in the end, it returns the 22nd character.
The 22nd character is in capture group 1.
/^.{21}(.)/
But what system are you in that requires this instead of normal string processing?
Depends how you want to match it ( x distance from the beginning or x distance from the end )
/(.).{2}$/ Third from the end (capturing group 1)
/^.{21}(.)/ 22nd character (capturing group 1)
//PHP
$str = '73746174652C313A312C310D';
$char = preg_replace('/^.*(.).{2}$/','$1',$str); //3rd from last (the pattern must cover the whole string so only group 1 remains)
preg_match('/(.).{2}$/',$str,$chars); //3rd from last
$char = $chars[1];
preg_match('/^.{21}(.)/',$str,$chars); //22nd character
$char = $chars[1];
//JS
var str = '73746174652C313A312C310D';
var ch = str.replace(/^.*(.).{2}$/,'$1'); //3rd from last (whole string replaced by group 1)
var ch = str.match(/(.).{2}$/)[1]; //3rd from last
var ch = str.match(/^.{21}(.)/)[1]; //22nd character
If you have to work with the "First match:" result of your tool, run it twice:
73746174652C313A312C310D - ^.{21}. = 73746174652C313A312C31
73746174652C313A312C31 - .$ = 1
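The two capture-free variants mentioned in this thread can be checked quickly in Python, whose re module accepts the fixed-width lookbehind used here. This is just a sketch against the sample string; the variable names are arbitrary.

```python
import re

value = "73746174652C313A312C310D"

# 22nd character, counted from the start: positive lookbehind, no capture group.
from_start = re.search(r"(?<=^.{21}).", value)

# Third character from the end: positive lookahead, no lookbehind at all.
from_end = re.search(r".(?=.{2}$)", value)

print(from_start.group(0), from_end.group(0))
```

In both cases the whole match is the single wanted character, so a tool that cannot read capture groups can still use it. The lookahead form also avoids lookbehind entirely.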

Regular expression that matches string equals to one in a group

E.g. I want to match a string that has the same word at the end as at the beginning, so that the following strings match:
aaa dsfj gjroo gnfsdj riier aaa
sdf foiqjf skdfjqei adf sdf sdjfei sdf
rew123 jefqeoi03945 jq984rjfa;p94 ajefoj384 rew123
This one could do the job:
/^(\w+\b).*\b\1$/
explanation:
/ : regex delimiter
^ : start of string
( : start capture group 1
\w+ : one or more word character
\b : word boundary
) : end of group 1
.* : any number of any char
\b : word boundary
\1 : group 1
$ : end of string
/ : regex delimiter
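The pattern explained above can be tried out with a short Python sketch. The function name same_edge_words is hypothetical, and the sample lines are the ones from the question.

```python
import re

# Same pattern as above: first word captured in group 1, then required again at the end.
EDGE_WORDS = re.compile(r"^(\w+\b).*\b\1$")

def same_edge_words(line):
    """Return True when the line starts and ends with the same word."""
    return EDGE_WORDS.search(line) is not None

samples = [
    "aaa dsfj gjroo gnfsdj riier aaa",
    "sdf foiqjf skdfjqei adf sdf sdjfei sdf",
    "rew123 jefqeoi03945 jq984rjfa;p94 ajefoj384 rew123",
    "aaa bbb ccc",
]
print([same_edge_words(s) for s in samples])
```

With this exact pattern a line consisting of a single word does not match, because the backreference needs a second copy of the word.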
M42's answer is fine except for a degenerate case: it will not match a string that contains only one word. To accept those as well within a single regexp, use:
/^(?:(\w+\b).*\b\1|\w+)$/
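Also, matching only the necessary part may be significantly faster on very large strings. Here are my solutions in JavaScript:
RegExp:
function areEdgeWordsTheSame(str) {
    var m = str.match(/^(\w+)\b/);
    if (!m) return false; // the string does not start with a word
    // \b makes sure the last word is the same whole word, not just a suffix
    return (new RegExp('\\b' + m[1] + '$')).test(str);
}
String:
function areEdgeWordsTheSame(str) {
    var idx = str.indexOf(' ');
    if (idx < 0) return true; // single word
    // compare the text before the first space with the text after the last space
    return str.substring(0, idx) == str.substring(str.lastIndexOf(' ') + 1);
}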
I don't think a regular expression is the right choice here. Why not split the line into an array and compare the first and the last item?
In C#:
string[] words = line.Split(' ');
return words.Length >= 2 && words[0] == words[words.Length - 1];