Regex help for Alphanumeric and International characters

I only want to allow
Numbers
Letters
Spaces
International Letters
Anything else I want to remove.
I am using ColdFusion. I haven't tried much because I have never really used regex before. I am trying to remove the "bad" characters.
Here is what I am doing so far:
<cfset theText = "Baum -$&*( 5 Steine hoch groß 3 Stück grün****">
<cfset test1 = rereplace(theText, '[\p{L}0-9 ]', ' ', 'all')>
<cfset test2 = rereplace(theText, '[^\p{L}0-9 ]', ' ', 'all')>
The results:
Original Text: Baum -$&*( 5 Steine hoch groß 3 Stück grün****
Test 1 Result: Baum -$&*( Steine hoch groß Stück grün****
Test 2 Result: 5 3
In the end, I wound up doing this, and it seems to give me what I need:
<cfset finalFile = varData.replaceAll('[^\p{L}0-9-.: ]',' ') />

Your question is a bit vague, but this regex sounds like it might fit your description.
[^\p{L}0-9 ]
You don't specify a language or flavor, so assuming \p{L} is supported, simply replace anything that matches this pattern with an empty string "".
Small demo: http://rubular.com/r/W4q5PFSJRg
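Not every engine supports \p{L}; Python's built-in re module, for example, rejects it. A minimal sketch of the same idea there, assuming Python 3, where \w on str patterns is Unicode-aware. The helper name keep_letters_digits_spaces is hypothetical, and note that \w also accepts non-ASCII digits, which is slightly broader than 0-9.

```python
import re

def keep_letters_digits_spaces(text):
    # [^\w ] matches anything that is not a Unicode letter, digit, underscore or space;
    # the extra |_ alternative removes underscores too, since \w includes them.
    cleaned = re.sub(r'[^\w ]|_', '', text)
    # Collapse the runs of spaces left behind by the removed characters.
    return re.sub(r' +', ' ', cleaned).strip()

print(keep_letters_digits_spaces("Baum -$&*( 5 Steine hoch groß 3 Stück grün****"))
# Baum 5 Steine hoch groß 3 Stück grün
```

Deleting the unwanted characters and then collapsing spaces avoids the long gaps that replacing each character with a space would leave.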

Related

How to combine independent regular expressions and apply them on all rows of a dataset using Pandas?

Problem Statement:
I have two separate regular expressions that I am trying to "combine" into one and apply to each row in a dataset. The matching part of each row should go to a new Pandas dataframe column called "Wanted". Please see the example data below for how values that match should be formatted in the "Wanted" column.
Example Data (how I want it to look):
Column0 | Wanted (Want "Column0" to look like this)
Alice\t12-345-623/ 10-1234 | Alice, 12-345-623, 10-1234
Bob 201-888-697 / 12-0556a | Bob, 201-888-697, 12-0556a
Tim 073-110-101 / 13-1290 | Tim, 073-110-101, 13-1290
Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c | Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
In other words...:
2-3 digits ----- hyphen ---- 3 digits --- hyphen ---- 3 digits ---- any character ----
2 digits --- hyphen --- 4 digits ---- permit one single character
What I have tried #1:
After dinking around for a while I figured out two different regular expressions that on their own will solve part of the problem. Kinda.
This matches the first group of numbers in each row, but it does not get the second group, which I also want. I'm not sure how robust it is, though.
Example Problem Row (regex = r"(?:\d{1,3}-){0,3}\d{1,3}")
import re
search_in = "Alice\t12-345-623/ 10-1234"
wanted_regex = r"(?:\d{1,3}\-){0,3}\d{1,3}"
match = re.search(wanted_regex, search_in)
match.group(0)
Wanted: Alice, 12-345-623, 10-1234
Got: 12-345-623 # matches the group of numbers but isn't formatted how I would like (see example data)
What I have tried #2:
This matches the second part of each row, but only if it's the only value in the column. Otherwise it matches the first group of digits instead of the second.
Example Problem Row (regex = r"(?:\d{2,3}-){1}\d{3,4}") # different regex than above!
search_in = "Alice\t12-345-623/ 10-1234"
wanted_regex = r"(?:\d{2,3}\-){1}\d{3,4}"
match = re.search(wanted_regex, search_in)
match.group(0)
Wanted : Alice, 12-345-623, 10-1234
Got: 12-345 # matched on the first part
Known Problems:
When I try, "Alice\t12-345-623/ 10-1234", it will match "12-345" when I'm trying to match "10-1234"
Thank you!
Thanks in advance to all you wizards being willing to help me with this problem. I really appreciate it:)
Note: I have asked regarding regex that may make solving this problem easier. It might not, but here is the link anyways --> How to use regex to select a row and a fixed number of rows following a row containing a specific substring in a pandas dataframe

This works for the four test examples you gave, using re.split(). Technically it returns a list of values rather than a string.
import re
# text here
text = "Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c"
# split this out to a list. remove the ending parenthesis since you are *splitting* on this
new_splits = re.split(r'\t|/|and|\(| ', text.replace(')',''))
# filter out the blank spaces
list(filter(None,new_splits))
['Joe', '74-111-333', '33-1290', 'Amy', '12-345-623', '10-1234c']
and if you are using pandas you can try the same steps above:
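# drop ')' first, as in the plain-Python version above, so it does not stick to the values
df['answer_Step1'] = df['Column0'].str.replace(')', '', regex=False).str.split(r'\\t+|/|and|\(| ')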
df['answer_final'] = df['answer_Step1'].apply(lambda x: list(filter(None,x)))

You can use
re.sub(r'\s*\band\b\s*|[^\w-]+', ', ', text)
See the regex demo.
Pandas version:
df['Wanted'] = df['Column0'].str.replace(r'\s*\band\b\s*|[^\w-]+', ', ', regex=True)
Details:
\s*\band\b\s* - the whole word and (\b are word boundaries), enclosed with zero or more whitespace chars
| - or
[^\w-]+ - one or more chars other than letters, digits, _ and -
See a Python demo:
import re
texts = ['Alice 12-345-623/ 10-1234',
         'Bob 201-888-697 / 12-0556a',
         'Tim 073-110-101 / 13-1290',
         'Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c']
for text in texts:
    print(re.sub(r'\s*\band\b\s*|[^\w-]+', ', ', text))
# => Alice, 12-345-623, 10-1234
#    Bob, 201-888-697, 12-0556a
#    Tim, 073-110-101, 13-1290
#    Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
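Another option is to extract the wanted fields rather than replace the separators. This follows the format spelled out in the question (2-3 digits, hyphen, 3 digits, hyphen, 3 digits, then 2 digits, hyphen, 4 digits and at most one trailing letter). It is only a sketch: the RECORD pattern and the extract_wanted helper are hypothetical names, and it assumes names consist of ASCII letters only.

```python
import re

# A name, the 2-3/3/3 digit group, then the 2/4 digit group with at most
# one trailing letter, each separated by one or more non-word characters.
RECORD = re.compile(r'([A-Za-z]+)\W+(\d{2,3}-\d{3}-\d{3})\W+(\d{2}-\d{4}[A-Za-z]?)')

def extract_wanted(row):
    # findall returns one (name, first_id, second_id) tuple per record in the row.
    parts = [field for record in RECORD.findall(row) for field in record]
    return ', '.join(parts)

print(extract_wanted("Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c"))
# Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
```

Because the word "and" is never followed by a digit group, it is skipped without special handling. With pandas, the same function could be applied per row, for example through df['Column0'].apply(extract_wanted).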

Match a part of a sentence and replace white space in that sentence

I've been scratching my head over this one for a while now. Is it possible, with a single regex, to modify the following text:
123456 ABC - 14 days there are eels in my hovercraft [blablabla]
to look like this:
there+are+eels+in+my+hovercraft
The main points are match whatever is after days minus the white space and whatever is before the last [ minus the space before it. On top of that, the white spaces should be replaced by plus characters. I can do this with two regexes where one gets the desired text and the second one replaces the white spaces with plus characters. But I'm wondering if there is a clever trick (lookaround comes to mind), which could accomplish the same in one go.

The most straightforward and probably most efficient way to do this is to just use two regular expressions, however if the language you are using allows for using a function as the replacement then you can do this with one call. For example with Javascript:
var s = '123456 ABC - 14 days there are eels in my hovercraft [blablabla]'
var regex = /^.*days *| \[.*$|( )/g;
var result = s.replace(regex, function (match, p1) {
    return p1 ? '+' : '';
});
Example: http://jsfiddle.net/5fsEA/
Same approach using Python:
import re
s = '123456 ABC - 14 days there are eels in my hovercraft [blablabla]'
result = re.sub(r'^.*days *| \[.*$|( )', lambda m: '+' if m.group(1) else '', s)

This could be done in a 2-step process: (1) isolate the text you want using a regex match; (2) use the output of #1 in a regex substitution operation.
Here's an example in python:
import re
line = "123456 ABC - 14 days there are eels in my hovercraft [blablabla]"
m = re.match(r"^.*days\s+(.+)\s+\[.*$", line) # this gives us "there are eels in my hovercraft"
print(re.sub(r'\s+', '+', m.group(1))) # this substitutes white spaces with '+'

Regex to remove last letter of each word in list

I have a list that I created in ColdFusion. Let's use the following list as an example:
<cfset arguments.tags = "battlefieldx, testx, wonderful, ererex">
What I would like to do is remove the "x" from the words that have an x at the end and keep the words in the list. Order doesn't matter. A regex would be fine or looping with coldfusion would be okay too.

Removing x from end of each list element...
To remove all x characters that precede a comma or the end of string, do:
rereplace( arguments.tags , "x(?=,|$)" , "" , "all" )
The (?= ) part here is a lookahead - it matches the position of its contents, but does not include them in what is replaced. The | is alternation - it'll try to match a literal , and if that fails it'll try to match the end of the string ($).
If you don't want to remove a lone x from, e.g. "x,marks,the,spot"...
If you want to make sure that x is at the end of a word (i.e. is not alone), you can use a non-word boundary check:
rereplace( arguments.tags , "\Bx(?=,|$)" , "" , "all" )
The \B will not match if there isn't a [a-zA-Z0-9_] before the x - for more complex/accurate rules on what constitutes "end of a word", you would need a lookbehind, which can't be done with rereplace, but is still easy enough by doing:
arguments.tags.replaceAll("(?<=\S)x(?=,|$)" , "" )
(That looks for a single non-whitespace character before the x to consider it part of a word, but you can put any limited-width expression within the lookbehind.)
Obviously, to do any letter, switch the x with [a-zA-Z] or whatever is appropriate.
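For comparison, the lookahead and \B patterns behave the same way in Python's re module. This is only a sketch to show the effect of each pattern; the variable names are illustrative, and the second input adds a trailing x to the "x,marks,the,spot" example so both cases show up.

```python
import re

tags = "battlefieldx, testx, wonderful, ererex"

# Lookahead: drop an x only when a comma or the end of the string follows.
stripped = re.sub(r'x(?=,|$)', '', tags)
print(stripped)
# battlefield, test, wonderful, erere

# \B keeps a lone x, because there is no word character in front of it.
guarded = re.sub(r'\Bx(?=,|$)', '', "x,marks,the,spotx")
print(guarded)
# x,marks,the,spot
```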

The regex to grab the 'x' from the end of a word is pretty straightforward. Supposing you have a given element as a string, the regex you need is simply:
REReplace(myString, "x$", "")
This matches an x at the end of the given string and replaces it with an empty string.
To do this for each element in a comma-delimited list, capture the delimiter and put it back, so that no extra comma is added at the end of the string:
REReplace(myString, "x(,|$)", "\1", "ALL")

REReplace(myString, "x$", "")
The $ symbol anchors the match to the end of the string, so the pattern only finds an 'x' in the last position. The empty quotes replace it with nothing, which removes the 'x'.

This has already been answered, but thought I'd post a ColdFusion only solution since you said you could use either. (The RegEx is obviously much easier, but this will work too)
<cfset arguments.tags = "battlefieldx, testx, wonderful, ererex">
<cfset temparray = []>
<cfloop list="#arguments.tags#" index="i">
    <cfif right(i,1) EQ 'X'>
        <cfset arrayappend(temparray,left(i,len(i) - 1))>
    <cfelse>
        <cfset arrayappend(temparray,i)>
    </cfif>
</cfloop>
<cfset arguments.tags = arraytolist(temparray)>
If you have ColdFusion 9+ or Railo you can simplify the loop using a ternary operator
<cfloop list="#arguments.tags#" index="i">
    <cfset arrayappend(temparray, right(i,1) EQ 'X' ? left(i,len(i) - 1) : i)>
</cfloop>
You could also convert arguments.tags to an array and loop that way
<cfloop array="#listtoarray(arguments.tags)#" index="i">
    <cfset arrayappend(temparray, right(i,1) EQ 'X' ? left(i,len(i) - 1) : i)>
</cfloop>
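The same loop-without-regex idea, sketched in Python for comparison. Two things here are assumptions rather than part of the ColdFusion version: the elements are trimmed of surrounding whitespace, and the check is made case-insensitive to mirror ColdFusion's EQ operator.

```python
tags = "battlefieldx, testx, wonderful, ererex"

# Split on commas, trim whitespace, and drop one trailing x where present.
cleaned = [t[:-1] if t.lower().endswith('x') else t
           for t in (item.strip() for item in tags.split(','))]
result = ','.join(cleaned)
print(result)
# battlefield,test,wonderful,erere
```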

How to remove text after certain word

I have a string
The best laid schemes of mice and men
How do I remove all text after the word "schemes" in ColdFusion? I suppose this can be done with regex.

Here ya go:
<cfset myString = "The best laid schemes of mice and men" />
<cfoutput>#REReplace(myString, "schemes(.*)", "schemes")#</cfoutput>

Your regex is:
schemes.*$
and replace with "schemes"
Explanation
.*$ means match any character (.) 0 or more times (*) until the end of the line ($)

Try this regex:
schemes(.*)
Replace the whole match with "schemes"; the captured group (.*) holds the text that gets dropped.
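The patterns in these answers are not specific to ColdFusion. As a rough illustration, here is the same replacement in Python, followed by a non-regex variant using str.partition; the variable names are illustrative only.

```python
import re

s = "The best laid schemes of mice and men"

# Regex version: replace "schemes" plus everything after it with "schemes".
truncated = re.sub(r'schemes.*$', 'schemes', s)
print(truncated)
# The best laid schemes

# Non-regex version: keep the text before the word, then the word itself.
head, word, _ = s.partition('schemes')
print(head + word)
# The best laid schemes
```

Note that . does not match a newline by default, so for multi-line input the regex version only trims up to the end of the line containing the word.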

how to trim a string without any spaces

How do I remove spaces and other whitespace characters from inside a string? I don't want to remove the space just from the ends of the string, but throughout the entire string.

You can use a regular expression
<cfset str = reReplace(str, "[[:space:]]", "", "ALL") />

You can also simply use ColdFusion's Replace() if you don't want to use regular expressions for some reason, but don't forget the optional "ALL" parameter.
I've run into this in the past, trying to remove 5 spaces in the middle of a string - I would do something like:
<cfset str = Replace(str, " ", "")/>
Forgetting the "ALL" will only replace the first occurrence so I would end up with 4 spaces, if that makes sense.
Be sure to use:
<cfset str = Replace(str, " ", "", "ALL")/>
to replace multiple spaces. Hope this helps!
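For readers outside ColdFusion, the two approaches above (a whitespace character class versus plain string handling) look roughly like this in Python. This is a sketch, and the sample string is made up for illustration.

```python
import re

s = "  a b\tc\nd  "

# Regex version: \s covers spaces, tabs and newlines, like [[:space:]].
no_ws_regex = re.sub(r'\s', '', s)
print(no_ws_regex)
# abcd

# Non-regex version: split() with no arguments splits on any whitespace run.
no_ws_plain = ''.join(s.split())
print(no_ws_plain)
# abcd
```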
