How to write an expression in the expression builder in an ADF data flow - regex

I am transforming data in ADF and, as part of it, I have to perform the following transformation.
I need an expression in the expression builder to transform the customer name as below.
Take the first character of each word in the name, followed by *. The customer name may contain one or more words.
The name can be Tim or Tim John or Tim John Zac or Tim John Mike Zac.

I have reproduced the above and got the below results using a derived column.
I used the same data that you have given in a single column and used the below dataflow expression in the derived column.
dropLeft(toString(reduce(map(split(Name, ' '), regexReplace(#item, concat('[^', left(#item,1), ']'), '*')), '', #acc + '  ' + #item, #result)), 2)
Here, some general regular expressions gave errors for me in the dataflow, which is why I used the above approach.
First, I used split() by space to get an array of strings. Then I applied the regular expression to every item of the array as above.
As we do not have a join function in dataflow expressions, I used the code from this SO answer by @Jarred Jobe to convert the array to a string separated by spaces.
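If you want to sanity-check the logic outside ADF, here is a rough Python model of the same steps (this is not dataflow code, just an illustration; the function name is made up, and it assumes words are separated by single spaces). It splits on spaces, masks every character of each word except the first, and joins the words back together, which side-steps the dropLeft()/leading-separator trick entirely:

import re

def mask_name(name):
    # For each word, replace every character that is not the first
    # letter with '*', mirroring regexReplace(#item, '[^<first>]', '*').
    # re.escape guards against special characters such as '&'.
    words = name.split(' ')
    masked = [re.sub('[^' + re.escape(w[0]) + ']', '*', w) for w in words]
    return ' '.join(masked)

print(mask_name('Tim John Zac'))  # T** J*** Z**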
Result:
NOTE:
Make sure you give two spaces in the concatenation inside toString() in the above code to get the required result. If you give only one space, it will give results like below.
Update:
Thank you so much for sharing this. I have tried your solution but I got a few names wrong. Also, I want to replace the rest of the characters with just 5 '*' irrespective of how many characters the name has. Also, the name Mia hellah came out as M* h****h instead of M***** h*****. Another one, SAM & JOHN TIBEH, should be S***** &***** J***** T*****. I tried to update your expression but I couldn't get it right.
If you want to do it like the above, you can directly use the concat() function in the dataflow expression:
dropLeft(toString(reduce(map(split(Name, ' '), concat(left(#item,1), '*****')), '', #acc + '  ' + #item, #result)), 2)
Results:
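For a quick offline check of this variant, the equivalent logic in Python (again just an illustrative sketch, not dataflow code) is a one-liner:

def mask_name_fixed(name):
    # First letter of each word plus exactly five asterisks,
    # regardless of how long the word is.
    return ' '.join(w[0] + '*****' for w in name.split())

print(mask_name_fixed('Mia hellah'))        # M***** h*****
print(mask_name_fixed('SAM & JOHN TIBEH'))  # S***** &***** J***** T*****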

Related

string replace method to be replaced by regular expression

I am using the string replace method to clean up column names.
df.columns=df.columns.str.replace("#$%./- ","").str.replace(' ', '_').str.replace('.', '_').str.replace('(','').str.replace(')','').str.replace('.','').str.lower()
Though it works, it certainly does not look Pythonic. Any suggestions?
I need only A-Za-z and an underscore _ (if required) as column names.
Update:
I tried using a regular expression in the first replace method, but I still need to chain the string methods like this:
terms.columns=terms.columns.str.replace(r"^[^a-zA-Z1-9]*", '').str.replace(' ', '_').str.replace('(','').str.replace(')','').str.replace('.', '').str.replace(',', '')
Update showing test data:
Original string (Tab separated):
[Sr.No. Course Terms Besic of Education Degree Course Course Approving Authority (i.e Medical Council, etc.) Full form of Course 1 year Duration 2nd year 3rd year Duration 4 th year Duration]
Change column names:
terms.columns=terms.columns.str.replace(r"^[^a-zA-Z1-9]*", '').str.replace(' ', '_').str.replace('(','').str.replace(')','').str.replace('.', '').str.replace(',', '').str.lower()
Output:
['srno', 'course', 'terms', 'besic_of_education', 'degree_course',
'course_approving_authority_ie_medical_council_etc',
'full_form_of_course', '1_year_duration', '2nd_year_',
'3rd_year_duration', '4_th_year_duration']
The above output is correct. The question: is there any way to achieve the same result other than the way I have used?
You can use fewer .replace operations by first replacing characters that are neither word characters nor whitespace with an empty string, and then replacing whitespace with an underscore.
df.columns = df.columns.str.replace(r"[^\w\s]+", "", regex=True).str.replace(r"\s+", "_", regex=True).str.lower()
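For example, applied to a few of the column names from the question (a minimal sketch; the sample frame here is made up):

import pandas as pd

df = pd.DataFrame(columns=['Sr.No.', 'Besic of Education', '2nd year '])

df.columns = (df.columns
              .str.replace(r"[^\w\s]+", "", regex=True)  # drop punctuation
              .str.replace(r"\s+", "_", regex=True)      # whitespace -> underscore
              .str.lower())
print(list(df.columns))  # ['srno', 'besic_of_education', '2nd_year_']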
I hope this helps.

Is there an efficient way to scrape substrings from column values in Postgres?

I have a column called user_response, on which I want to do a variety of operations, like taking out words contained in quotes and taking out the part of the string after a colon (:).
One such operation is this:
Let's say for a record
user_response = "My company: 'XYZ Co.' has allowed to use:: the following \n \n kind of product: RealMadridTShirts"
Now, I want to scrape the part of the string after the last colon (:). Hence, my output should be RealMadridTShirts.
I could achieve this somehow with the following hack:
SELECT reverse(split_part(reverse(user_response), ' :', 1))
However, this is grossly inefficient, especially when I have to do this over 500,000 rows. It's not an operation that I will be doing throughout the day; it's for a once-a-day load, but even then the load is becoming very expensive.
Coming from Oracle, I know I could have used the INSTR and SUBSTR functions to achieve it in a more elegant fashion (without having to reverse the string and all).
Also, what if I had to scrape the text after the second-last colon?
Find the string after the last colon, right?
My company: 'XYZ Co.' has allowed to use:: the following \n \n kind of product: RealMadridTShirts
It's trivial with a regular expression:
regress=> SELECT (regexp_matches(
'My company: ''XYZ Co.'' has allowed to use:: the following \n \n kind of product: RealMadridTShirts',
'.*:(.*?)$')
)[1];
regexp_matches
--------------------
RealMadridTShirts
(1 row)
The apparent lack of a function to find the position of a substring counting from a particular starting point makes this harder to do without a regexp, but since a regexp is sure to be the fastest way to solve it, I doubt that's an issue.
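If you want to prototype the pattern logic outside the database first, the same idea in plain Python (just a sketch, not PostgreSQL code) looks like this, including the "second-last colon" variant:

import re

s = ("My company: 'XYZ Co.' has allowed to use:: the following \n \n "
     "kind of product: RealMadridTShirts")

# After the last colon: the greedy '.*:' eats everything up to the final ':'.
print(re.match(r'.*:(.*?)$', s, re.S).group(1).strip())
# -> RealMadridTShirts

# After the second-last colon: require exactly one more ':' in the tail.
print(re.match(r'.*:([^:]*:[^:]*)$', s, re.S).group(1).strip())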
Your bigger problem is likely to be that you're scanning so much data. That's never going to be fast.

variable number of capturing groups

I have an XPath expression which I want to use to extract a city and a date from a td which contains a string of this kind:
City(may contain spaces and may be missing, but the following space is always present) on 2013/07/20
So far, I got to the following solution for extracting the date, which works partially:
//path/to/my/td/text()/replace(.,'(.*) on (.*)','$2')
This works when City is present, but when City is missing I get "on 2013/07/20" as a result.
I think this is because the first capturing group fails and so the number of groups is different.
How can I get this expression to work?
I did not fully check your regex, but it looks fine at first sight. Anyway, you can also go an easier way if you only want to get the date, by extracting the text after "on ":
//path/to/my/td/text()/substring-after(.,'on ')
Edit: or you may go the substring way and select the last 10 characters of the content:
//path/to/my/td/text()/substring(., string-length(.) - 9)
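Outside XPath, the same end-anchored idea (take the date from the end of the string, so a city that happens to contain "on " can't confuse it) can be prototyped in Python; a small sketch:

import re

for s in ('New York on 2013/07/20', 'on 2013/07/20'):
    print(re.search(r'\d{4}/\d{2}/\d{2}$', s).group(0))
# 2013/07/20 both times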

Regular expression for matching different name formats in Python

I need a regular expression in Python that will be able to match different name formats.
I have 4 different name formats for the same person, like:
R. K. Goyal
Raj K. Goyal
Raj Kumar Goyal
R. Goyal
What will be the regular expression that gets all these names, from a single regular expression, in a list of thousands?
PS: My list has thousands of such names, so I need a generic solution so that I can combine these names together. In the above example, R and Goyal can be used to write the RE.
Thanks
"R(\.|aj)? (K(\.|umar)? )?Goyal" will only match those four cases. You can modify this for other names as well.
Fair warning: I haven't used Python in a while, so I won't be giving you specific function names.
If you're looking for a generic solution that will apply to any possible name, you're going to have to construct it dynamically.
ASSUMING that the first name is always the one that won't be dropped (I know people whose names follow the format "John David Smith" and go by David), you should be able to grab the first letter of the string and call that the first initial.
Next, you need to grab the last name; if you have no Jr's or Sr's or such, you can just take the last word (find the last occurrence of ' ', then take everything after that).
From there, "<firstInitial>* <lastName>" is a good start. If you bother to grab the whole first name as well, you can reduce your false positive matches further with "<firstInitial>(\.|<restOfFirstName>)* <lastName>" as in joon's answer.
If you want to get really fancy, detecting the presence of a middle name could reduce false positives even more.
I may be misunderstanding the problem, but I'm envisioning a solution where you iterate over the list of names, dynamically construct a new regexp for each name, and then store all of these regexps in a dictionary to use later:
import re

names = ['John Kelly Smith', 'Billy Bob Jones', 'Joe James', 'Kim Smith']

regexps = {}
for name in names:
    elements = name.split()
    if len(elements) == 3:
        # First name or initial, optional middle name or initial, last name
        pattern = r'(%s(\.|%s)?)?(\ )?(%s(\.|%s)? )?%s$' % (elements[0][0],
                                                            elements[0][1:],
                                                            elements[1][0],
                                                            elements[1][1:],
                                                            elements[2])
    elif len(elements) == 2:
        pattern = r'%s(\.|%s)? %s$' % (elements[0][0],
                                       elements[0][1:],
                                       elements[1])
    else:
        continue
    regexps[name] = re.compile(pattern)

jksmith_regexp = regexps['John Kelly Smith']
print(bool(jksmith_regexp.match('K. Smith')))
print(bool(jksmith_regexp.match('John Smith')))
print(bool(jksmith_regexp.match('John K. Smith')))
print(bool(jksmith_regexp.match('J. Smith')))
This way you can easily keep track of which regexp will find which name in your text.
And you can also do handy things like this:
if sum(bool(reg.match('K. Smith')) for reg in regexps.values()) > 1:
    print("This string matches multiple names!")
Where you check to see if some of the names in your text are ambiguous.

Postgres regular expressions and regexp_split_to_array

In PostgreSQL, I need to extract the first two words of the value in a given column. So if the value is "hello world moon and stars" or "hello world moon" or even just "hello world", I need "hello world".
I was hoping to use regexp_split_to_array, but it doesn't seem that I can call it and access the returned elements in the same query.
Do I need to create a function for what I'm trying to do?
I can't believe that this has been here for 5 years and no one noticed that you can access elements of the regexp_split_to_array result if you surround the call with parentheses.
I saw many people try to access the elements of the array like this:
select regexp_split_to_array(my_field, E'my_pattern')[1] from my_table
The previous will return an error, but the following will not:
select (regexp_split_to_array(my_field, E'my_pattern'))[1] from my_table
You can use POSIX regular expressions with PostgreSQL's substring():
select substring('hello world moon' from E'^\\w+\\s+\\w+');
Or with a very liberal interpretation of what a word is:
select substring('it''s a nice day' from E'^\\S+\\s+\\S+');
Note the \S (non-whitespace) instead of \w ("word" character, essentially alphanumeric plus underscore).
Don't forget all the extra quoting nonsense though:
The E'' to tell PostgreSQL that you're using escape string syntax.
And then double backslashes to get single backslashes past the string parser and into the regular expression parser.
If you really want to use regexp_split_to_array, then you can, but the above quoting issues apply, and I think you'd want to slice off just the first two elements of the array:
select (regexp_split_to_array('hello world moon', E'\\s+'))[1:2];
I'd guess that the escaping was causing some confusion; I usually end up adding backslashes until it works and then picking it apart until I understand why I needed the number of backslashes that I ended up using. Or maybe the extra parentheses and array-slicing syntax were an issue (they were for me, but a bit of experimentation sorted it out).
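One way to take PostgreSQL's quoting out of the equation while you develop the pattern is to prototype it somewhere with plainer string rules first, e.g. a throwaway Python check (just a sketch; note the single backslashes in the raw string):

import re

for s in ('hello world moon and stars', 'hello world moon', 'hello world'):
    print(re.match(r'^\S+\s+\S+', s).group(0))
# hello world (for each input)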
Found one answer:
select split_part('hello world moon', ' ', 1) || ' ' || split_part('hello world moon', ' ', 2);
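For comparison, the same "first two words" idea outside SQL is a one-liner in Python (sketch):

print(' '.join('hello world moon'.split()[:2]))  # hello world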
select substring(my_text from $$^\S+\s+\S+$$) from v;
substring
-------------
hello world
hello world
hello world
(3 rows)
where for the purpose of demonstration, v is:
create view v as select 'hello world moon and stars' as my_text union all
select 'hello world moon' union all
select 'hello world';
if you want to ignore whitespace at the beginning:
select substring(my_text from $$^\s*\S+\s+\S+$$) from v;