Group by similar strings in SAS within a column

I have the following table:
Name
----
John Smith
John Smth
Jane Lee
Jane Line
Timothy Brown
Timmothy Brown
Agnes James
Aaron James
Using SAS, how can I group these strings on a large scale to identify those that are similar, so that I can get this table:
Name
----
John Smith
John Smth
Timothy Brown
Timmothy Brown

There are many ways in SAS to perform comparisons of strings. A simple example is using SOUNDEX to find two strings that sound alike.
data have;
input Name $char20.;
datalines;
John Smith
John Smth
Jane Lee
Jane Line
Timothy Brown
Timmothy Brown
Agnes James
Aaron James
;
proc sql;
  create table want as
  select
    A.name
    , B.name as name2
    , soundex(A.name) as sxname
    , soundex(B.name) as sxname2
  from have a
  cross join
    have b
  where a.name lt b.name
  having sxname = sxname2
  ;
quit;
Other techniques use a matching criterion based on a metric such as Levenshtein edit distance, which can be computed with COMPLEV. You can also look into SPEDIS.
Search for "How to perform a fuzzy match using SAS functions" and you will find plenty to chew on. Keep an eye out for papers by Charles Patridge.
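To make the edit-distance idea concrete, here is a minimal sketch in Python (used purely for illustration; in SAS, COMPLEV computes the distance for you). The helper names and the threshold of 1 are assumptions chosen for this sample data, not part of the original answer.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


names = [
    "John Smith", "John Smth", "Jane Lee", "Jane Line",
    "Timothy Brown", "Timmothy Brown", "Agnes James", "Aaron James",
]

MAX_DISTANCE = 1  # assumed threshold; tune it for real data


def similar_pairs(names, max_distance=MAX_DISTANCE):
    """Compare every pair once, like the cross join with a.name lt b.name."""
    return [(a, b) for a, b in combinations(names, 2)
            if levenshtein(a, b) <= max_distance]


for a, b in similar_pairs(names):
    print(a, "|", b)
```

With this threshold only the John Smith and Timothy Brown pairs survive; raising it to 2 would also pair Jane Lee with Jane Line. Like the cross join, this compares every pair, so the cost grows quadratically with the number of names.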

Related

DAX Distinct Strings looking in column of Lists

I'm trying to do a DISTINCT function with DAX, however looking within list values and not just the column value. Sample data (sorry for the formatting):
Name Word List
Bob {aye, bee, cee}
Bob {aye, bee, cee}
Jim {dee, eee, eff}
Jim {dee, eee, eff}
Ray {aye, bee, cee}
Ray {dee, eeee, eff}
Desired Measure Output
Distinct Words for Jim: 3
Distinct Words for Bob: 3
Distinct Words for Ray: 6
Is there a way for the measure to look through the list and count distinct values?

This doesn't really answer your question of how to count distinct values in columns of lists, but it is a workaround that gets the desired results.
I would use the Query Editor with Split Column > By Delimiter (under the Home tab) and Replace Values (under the Transform tab) to get your table to look something like this:
Name 1 2 3
Bob aye bee cee
Bob aye bee cee
Jim dee eee eff
Jim dee eee eff
Ray aye bee cee
Ray dee eeee eff
After that I would select all columns except Name and use Unpivot Columns, which would make your table look like this (after removing the Attribute column).
Name Word
Bob aye
Bob bee
Bob cee
Bob aye
Bob bee
Bob cee
Jim dee
Jim eee
Jim eff
Jim dee
Jim eee
Jim eff
Ray aye
Ray bee
Ray cee
Ray dee
Ray eeee
Ray eff
By simply dragging the Name and Word columns into a Matrix visual, you would now get the distinct count.
If you have a lot of data, you could instead do the aggregation in the Query Editor: use Group By on Name with Count Distinct Rows as the operation, and you would get a table that looks like your desired result.
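The logic behind the split, unpivot and distinct-count steps can be sketched in Python (an illustration only, not DAX or Power Query code; the function name and the in-memory `rows` structure are assumptions):

```python
from collections import defaultdict

# Sample data from the question: one (name, word list) tuple per row.
rows = [
    ("Bob", ["aye", "bee", "cee"]),
    ("Bob", ["aye", "bee", "cee"]),
    ("Jim", ["dee", "eee", "eff"]),
    ("Jim", ["dee", "eee", "eff"]),
    ("Ray", ["aye", "bee", "cee"]),
    ("Ray", ["dee", "eeee", "eff"]),
]


def distinct_word_counts(rows):
    """Flatten each list to (name, word) pairs, then count unique words per name."""
    words_by_name = defaultdict(set)
    for name, words in rows:
        words_by_name[name].update(words)  # the set drops duplicates
    return {name: len(words) for name, words in words_by_name.items()}


for name, count in distinct_word_counts(rows).items():
    print(f"Distinct Words for {name}: {count}")
```

Collecting the words into a set per name plays the same role as unpivoting followed by a distinct count.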

Sitecore query syntax: Select all female descendents whose parent isn't called Jack

Given a Sitecore content tree of Males and Females (each sex with own template) representing a family tree, how would I select all Female descendents of an item where the parent wasn't called Jack using Sitecore query?
Context: My context item is one of Bob's children. My query shouldn't return Bob himself. Bob also has hundreds of brothers with thousands of descendants that I really don't want appearing in my results.
Bob
Sarah
Jim
Julie
John
Sue
Jack
Anne
Jack
Claire
Mary
The query should return: Sarah, Julie, Sue and Mary but not Anne or Claire.
I can select all female descendents of Bob with:
..//*[@@templateid='{insert female template id here}']
But how do I add the parent name != Jack clause?

If you had a "family root" node that did not represent a person by itself, you could do this:
/path/to/family root//*[@@name != 'Jack']/*[@@templateid = '{template id}']
In your case, you want only a certain person's descendants to be returned. The person themselves should not be included in the result set. In that case, the approach from your comment is the way to go:
..//*[../@@name != 'Jack' AND @@templateid = '{template id}']
The results of both queries will include Mary since her direct parent is not called Jack.
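The filter the second query expresses (every descendant that uses the female template and whose direct parent is not named Jack) can be sketched in Python. This is only an illustration of the selection logic, not Sitecore API code. The `Item` class and the tree's nesting are assumptions, since the indentation of the tree in the question was lost.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    template: str  # "male" or "female", standing in for the template id
    children: list = field(default_factory=list)


# Hypothetical nesting, chosen to be consistent with the expected results.
bob = Item("Bob", "male", [
    Item("Sarah", "female"),
    Item("Jim", "male", [Item("Julie", "female")]),
    Item("John", "male", [Item("Sue", "female")]),
    Item("Jack", "male", [
        Item("Anne", "female"),
        Item("Jack", "male", [
            Item("Claire", "female", [Item("Mary", "female")]),
        ]),
    ]),
])


def female_descendants_without_jack_parent(root):
    """Visit every descendant of root; keep females whose direct parent is not Jack."""
    matches = []

    def walk(parent):
        for child in parent.children:
            if child.template == "female" and parent.name != "Jack":
                matches.append(child.name)
            walk(child)  # keep descending even below a Jack

    walk(root)
    return matches


print(female_descendants_without_jack_parent(bob))
```

Only the direct parent is checked, which is why Mary is returned even though she sits below a Jack further up the tree.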

PostgreSQL regex - split column to array

I have a table music:
author | music
----------------------+-------
Kevin Clein | a
Gucio G. Gustawo | b
R. R. Andrzej | c
John McKnight Burman | d
How can I split a column which contains two different symbols (space and dot), and how can I split name and surname correctly to get a result like:
author | name | surname
----------------------+---------+----------------
Kevin Clein | Kevin | Clein
Gucio G. Gustawo | Gucio G.| Gustawo
R. R. Andrzej | R. R. | Andrzej
John McKnight Burman | John | McKnight Burman
So far I have tried something like this:
WITH ad AS(
SELECT author,
s[1] AS name,
s[2] AS surname
FROM (SELECT music.*,
regexp_split_to_array(music.author,E'\\s[.]') AS s
FROM music)t
)SELECT * FROM ad;

I've created a possible solution for you. Be aware that it may not solve every case, and you will need an extra table to handle the rules problem. By rule I mean what I said in the comments, such as:
When to decide which part is the name and which is the surname.
So in order to solve your problem I had to create another table that holds the surnames that should be treated as such.
The test case scenario:
create table surname (
id SERIAL NOT NULL primary key,
sample varchar(100)
);
--Test case inserts
insert into surname (sample) values ('McKnight'), ('McGregory'), ('Willian'), ('Knight');
create table music (
id SERIAL NOT NULL primary key,
author varchar(100)
);
insert into music (author) values
('Kevin Clein'),
('Gucio G. Gustawo'),
('R. R. Andrzej'),
('John McKnight Burman'),
('John Willian Smith'),
('John Williame Smith');
And my proposed solution:
select author,
       trim(replace(author, surname, '')) as name,
       surname
from (
    select author,
           case when position(s.sample in m.author) > 0
                then (regexp_split_to_array(m.author, '\s(?=' || s.sample || ')'))[2]::text
                else trim(substring(author from '\s\w+$'))
           end as surname
    from music m
    left join surname s
           on m.author like '%' || s.sample || '%'
    where case when position(s.sample in m.author) > 0
               then (regexp_split_to_array(m.author, '\s(?=' || s.sample || ')'))[2]::text
               else trim(substring(author from '\s\w+$'))
          end is not null
) as x;
The output will be:
AUTHOR               | NAME          | SURNAME
---------------------+---------------+----------------
Kevin Clein          | Kevin         | Clein
Gucio G. Gustawo     | Gucio G.      | Gustawo
R. R. Andrzej        | R. R.         | Andrzej
John McKnight Burman | John          | McKnight Burman
John Willian Smith   | John          | Willian Smith
John Williame Smith  | John Williame | Smith
See it working here: http://sqlfiddle.com/#!15/c583f/2
In the surname table you insert every name that should be treated as part of a surname.
You may want to wrap the query that evaluates the case expression in a sub-query, so that the where clause can reference the resulting field instead of repeating the whole case expression.
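The same rule (split at a known surname word if one is present, otherwise treat the last word as the surname) can be sketched in Python to make the logic easier to follow. This is an illustration under assumptions, not a port of the SQL: the function name is hypothetical, and unlike the SQL it requires the known surname to end at a word boundary.

```python
import re

# Same sample values as the surname table above.
KNOWN_SURNAME_PARTS = ["McKnight", "McGregory", "Willian", "Knight"]


def split_author(author, surname_parts=KNOWN_SURNAME_PARTS):
    """Return (name, surname) for one author string."""
    for part in surname_parts:
        # Whitespace directly followed by the known surname as a whole word.
        match = re.search(r"\s(?=" + re.escape(part) + r"\b)", author)
        if match:
            return author[:match.start()].strip(), author[match.end():].strip()
    # Fallback: everything before the last space is the name.
    name, _, surname = author.rpartition(" ")
    return name, surname


for author in ["Kevin Clein", "Gucio G. Gustawo", "R. R. Andrzej",
               "John McKnight Burman", "John Willian Smith",
               "John Williame Smith"]:
    print(author, "->", split_author(author))
```

"Knight" does not split "John McKnight Burman" because it is not preceded by whitespace there, which mirrors the row that the SQL filters out with its is not null condition.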

Postgresql substring using regex

I need to run a PostgreSQL query to get names from a database, and I need to sort these names alphabetically.
The names that I am getting from the database are as follows:
(123) Jone Lee
(22) Hans Hee
2 Dean Alloni
Alen Khan
I need the output to be
Alen Khan
2 Dean Alloni
(22) Hans Hee
(123) Jone Lee
I tried the following psql query:
select name from table order by substring(name, E'\\W+\ +(.*)');
select name from table order by substring(name, E'\\(?\\w+?\\)?\ +?(.*)');
My problem is that if the name is Alen Khan, it only returns Khan, so I get:
Khan
Dean Alloni
Hans Hee
Jone Lee
Any help would be appreciated.
Kind regards

select name
from table
order by substring(name, E'[a-zA-Z]+')
Edit as per OP's comment
select name
from table order by regexp_replace(name, '[^a-zA-Z]', '', 'g')
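The effect of ordering by the regexp_replace expression can be checked with a small Python sketch (illustration only; the helper name is an assumption, and Python compares strings by code point, so a database collation may order mixed-case data differently):

```python
import re

names = ["(123) Jone Lee", "(22) Hans Hee", "2 Dean Alloni", "Alen Khan"]


def letters_only(name):
    """Sort key: drop every character that is not an ASCII letter."""
    return re.sub(r"[^a-zA-Z]", "", name)


sorted_names = sorted(names, key=letters_only)
for name in sorted_names:
    print(name)
```

The full name is still returned; only the sort key has the digits, parentheses and spaces stripped.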

This will sort by the string's last word:
select name from table
order by (string_to_array(trim(name),' '))[ array_upper(string_to_array(trim(name),' '),1) ]

Rearranging the order of the text in a character string in SAS?

I have a data set with a character variable called "name". It contains the full name of a person like this:
"firstname middlename lastname".
I want to have the data rearranged so that it becomes:
"lastname, firstname middlename".
I'm not that hardcore in SAS functions, but I have used some of the few I know.
(My code can be seen below).
In the first try (test2) I don't get the result I want - I get:
"lastName , firstName middleName" and not
"lastName, firstName middleName" - my problem is the comma.
So I thought that I would solve my problem by making a new last name variable containing the comma at the end (in test2_new). But I don't get what I want: SAS puts three dots at the end, and not a comma.
I hope a person with more SAS skills than me can answer my question.
Kind Regards
Maria

data have ;
input #1 text & $64. ;
datalines ;
Susan Smith
David A Jameson
Bruce Thomas Forsyth
;
run ;
data want ;
set have ;
lastname = scan(text,-1,' ') ;
firstnames = substr(text,1,length(text)-length(lastname)) ;
newname = catx(', ',lastname,firstnames) ;
run ;
Which gives
text                  lastname  firstnames    newname
Susan Smith           Smith     Susan         Smith, Susan
David A Jameson       Jameson   David A       Jameson, David A
Bruce Thomas Forsyth  Forsyth   Bruce Thomas  Forsyth, Bruce Thomas

Perl regular expressions are a useful tool here, particularly PRXCHANGE. The SAS Support website provides a good example of how to reverse first and last names; here is a slight modification of that code. I've only catered for people with either 2 or 3 names, but it should be fairly simple to expand this if necessary. My code is based on the HAVE dataset created in the answer from @Chris J.
data want;
set have;
if countw(text)=2 then text = prxchange('s/(\w+) (\w+)/$2, $1/', -1, text);
else if countw(text)=3 then text = prxchange('s/(\w+) (\w+) (\w+)/$3, $1 $2/', -1, text);
run;
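For comparison, the "last word goes first" rule used by both answers can be written without a fixed word count. The sketch below is in Python purely as an illustration of the logic; the function name is an assumption.

```python
def last_name_first(full_name):
    """Turn 'first [middle ...] last' into 'last, first [middle ...]'."""
    parts = full_name.split()
    if len(parts) < 2:
        return full_name.strip()  # nothing to rearrange
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


for text in ["Susan Smith", "David A Jameson", "Bruce Thomas Forsyth"]:
    print(last_name_first(text))
```

Splitting on whitespace and rejoining also avoids the stray blank before the comma that the question ran into.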
