Replacing multiple values per cell in Pandas - list

I have the following column in a dataframe:
Q2
1 4
1 3
3 4 11
1 4 6 15 16
I want to replace multiple values in a cell, if present: 1 gets replaced by Facebook, 2 by Instagram, and so on.
I split the values as follows:
columns_to_split = ['Q2']  # a list, so the loop iterates over column names, not characters
for c in columns_to_split:
    df[c] = df[c].str.split(' ')
which outputs
code
DSOKF31 [1, 4]
DSOVH39 [1, 3]
DSOVH05 [3, 4, 16]
DSOVH23 [1, 4, 6, 15, 16]
Name: Q2, dtype: object
but when trying to replace the multiple values with a dictionary as follows:
social_media_2 = {'1':'Facebook', '2':'Instagram', '3':'Twitter', '4':'Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)', '5':'SnapChat', '6':'Imo', '7':'Badoo', '8':'Viber', '9':'Twoo', '10':'Linkedin', '11':'Flickr', '12':'Meetup', '13':'Tumblr', '14':'Pinterest', '15':'Yahoo', '16':'Gmail', '17':'Hotmail', '18':'M-Pesa', '19':'M-Shwari', '20':'KCB-Mpesa', '21':'Equitel', '22':'MobiKash', '23':'Airtel money', '24':'Orange Money', '25':'Mobile Bankig Accounts', '26':'Other specify'}
df['Q2'] = df['Q2'].replace(social_media_2)
I get the same output:
code
DSOKF31 [1, 4]
DSOVH39 [1, 3]
DSOVH05 [3, 4, 16]
DSOVH23 [1, 4, 6, 15, 16]
Name: Q2, dtype: object
How do I replace multiple values in one cell in this case?

Since the number of items varies, there isn't a lot of structure. Still, after you split the string, you can apply a function that maps each list element through the dictionary:
In [36]: df = pd.DataFrame({'Q2': ['1 4', '1 3', '1 2 3']})
In [37]: df.Q2.str.split(' ').apply(lambda l: [social_media_2[e] for e in l])
Out[37]:
0 [Facebook, Messenger (Google hangout, Tagg, Wh...
1 [Facebook, Twitter]
2 [Facebook, Instagram, Twitter]
Name: Q2, dtype: object
Edit: Following Jezrael's excellent comment, here's a version that accounts for missing values as well:
In [58]: df = pd.DataFrame({'Q2': ['1 4', '1 3', '1 2 3', None]})
In [59]: df.Q2.str.split(' ').apply(lambda l: [] if type(l) != list else [social_media_2[e] for e in l])
Out[59]:
0 [Facebook, Messenger (Google hangout, Tagg, Wh...
1 [Facebook, Twitter]
2 [Facebook, Instagram, Twitter]
3 []
Name: Q2, dtype: object
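If some codes might be absent from the dictionary, dict.get avoids a KeyError and leaves the unmapped code visible. Here is a sketch (the shortened mapping and the Q2_labels column name are illustrative, not from the original post):

```python
import pandas as pd

# Shortened stand-in for the full social_media_2 dict
social_media_2 = {'1': 'Facebook', '2': 'Instagram', '3': 'Twitter', '4': 'Messenger'}

df = pd.DataFrame({'Q2': ['1 4', '1 3', '1 2 3', None]})

# .get(e, e) keeps unmapped codes as-is instead of raising KeyError;
# isinstance guards against the NaN produced by str.split on missing rows
df['Q2_labels'] = df['Q2'].str.split(' ').apply(
    lambda l: [social_media_2.get(e, e) for e in l] if isinstance(l, list) else []
)
print(df['Q2_labels'].tolist())
```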

Here is an alternative solution:
In [45]: df
Out[45]:
Q2
0 1 4
1 1 3
2 3 4 16
3 1 4 6 15 16
In [47]: (df.Q2.str.split(expand=True)
....: .stack()
....: .map(social_media_2)
....: .unstack()
....: .apply(lambda x: x.dropna().values.tolist(), axis=1)
....: )
Out[47]:
0 [Facebook, Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)]
1 [Facebook, Twitter]
2 [Twitter, Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO), Gmail]
3 [Facebook, Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO), Imo, Yahoo, Gmail]
dtype: object
Explanation:
In [50]: df.Q2.str.split(expand=True).stack().map(social_media_2)
Out[50]:
0 0 Facebook
1 Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)
1 0 Facebook
1 Twitter
2 0 Twitter
1 Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)
2 Gmail
3 0 Facebook
1 Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)
2 Imo
3 Yahoo
4 Gmail
dtype: object
In [51]: df.Q2.str.split(expand=True).stack().map(social_media_2).unstack()
Out[51]:
0 1 2 3 4
0 Facebook Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO) None None None
1 Facebook Twitter None None None
2 Twitter Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO) Gmail None None
3 Facebook Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO) Imo Yahoo Gmail
Timing against a 40K-row DF:
In [86]: big = pd.concat([df] * 10**4, ignore_index=True)
In [87]: big.shape
Out[87]: (40000, 1)
In [88]: %%timeit
....: (big.Q2.str.split(expand=True)
....: .stack()
....: .map(social_media_2)
....: .unstack()
....: .apply(lambda x: x.dropna().values.tolist(), axis=1)
....: )
....:
1 loop, best of 3: 19.6 s per loop
In [89]: %timeit big.Q2.str.split(' ').apply(lambda l: [social_media_2[e] for e in l])
10 loops, best of 3: 72.6 ms per loop
Conclusion: Ami's solution is approx. 270 times faster!
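The stack/map/unstack pipeline can be run end-to-end as a self-contained sketch (a shortened mapping stands in for the full social_media_2 dict):

```python
import pandas as pd

# Shortened stand-in for the full social_media_2 dict
social_media_2 = {'1': 'Facebook', '3': 'Twitter', '4': 'Messenger', '16': 'Gmail'}

df = pd.DataFrame({'Q2': ['1 4', '1 3', '3 4 16']})

out = (df.Q2.str.split(expand=True)   # one code per column, NaN-padded
         .stack()                     # long format: (row, position) -> code
         .map(social_media_2)         # dict lookup per code
         .unstack()                   # back to wide format
         .apply(lambda x: x.dropna().tolist(), axis=1))
print(out.tolist())
```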

If you don't need a list as output, just add regex=True to replace:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Q2': ['1 4', '1 3', '3 4 11']})
print (df)
Q2
0 1 4
1 1 3
2 3 4 11
social_media_2 = {'1':'Facebook', '2':'Instagram', '3':'Twitter', '4':'Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)', '5':'SnapChat', '6':'Imo', '7':'Badoo', '8':'Viber', '9':'Twoo', '10':'Linkedin', '11':'Flickr', '12':'Meetup', '13':'Tumblr', '14':'Pinterest', '15':'Yahoo', '16':'Gmail', '17':'Hotmail', '18':'M-Pesa', '19':'M-Shwari', '20':'KCB-Mpesa', '21':'Equitel', '22':'MobiKash', '23':'Airtel money', '24':'Orange Money', '25':'Mobile Bankig Accounts', '26':'Other specify'}
df['Q2'] = df['Q2'].replace(social_media_2, regex=True)
print (df)
Q2
0 Facebook Messenger (Google hangout, Tagg, What...
1 Facebook Twitter
2 Twitter Messenger (Google hangout, Tagg, Whats...
If you need lists, use one of the other solutions.
EDIT by comment:
You can replace the whitespace with ; and then it works nicely:
df = pd.DataFrame({'Q2': ['1 4', '1 3', '3 4 11']})
print (df)
Q2
0 1 4
1 1 3
2 3 4 11
df['Q2'] = df['Q2'].str.replace(' ',';')
print (df)
Q2
0 1;4
1 1;3
2 3;4;11
social_media_2 = {'1':'Facebook', '2':'Instagram', '3':'Twitter', '4':'Messenger (Google hangout, Tagg, WhatsAPP, MSG, Facetime, IMO)', '5':'SnapChat', '6':'Imo', '7':'Badoo', '8':'Viber', '9':'Twoo', '10':'Linkedin', '11':'Flickr', '12':'Meetup', '13':'Tumblr', '14':'Pinterest', '15':'Yahoo', '16':'Gmail', '17':'Hotmail', '18':'M-Pesa', '19':'M-Shwari', '20':'KCB-Mpesa', '21':'Equitel', '22':'MobiKash', '23':'Airtel money', '24':'Orange Money', '25':'Mobile Bankig Accounts', '26':'Other specify'}
df['Q2'] = df['Q2'].replace(social_media_2, regex=True)
print (df)
Q2
0 Facebook;Messenger (Google hangout, Tagg, What...
1 Facebook;Twitter
2 Twitter;Messenger (Google hangout, Tagg, Whats...
EDIT1:
You can also change the dict a bit by appending ; to the keys, after replacing each whitespace with a double ; (plus a trailing ;):
df = pd.DataFrame({'Q2': ['1 2', '1 3', '3 2 11']})
print (df)
Q2
0 1 2
1 1 3
2 3 2 11
df['Q2'] = df['Q2'].str.replace(' ',';;') + ';'
print (df)
Q2
0 1;;2;
1 1;;3;
2 3;;2;;11;
social_media_2 = {'1':'Fa', '2':'I', '3':'T', '11':'KL'}
#add ; to keys in dict
social_media_2 = dict((key + ';', value) for (key, value) in social_media_2.items())
print (social_media_2)
{'1;': 'Fa', '2;': 'I', '3;': 'T', '11;': 'KL'}
df['Q2'] = df['Q2'].replace(social_media_2, regex=True)
print (df)
Q2
0 Fa;I
1 Fa;T
2 T;I;1Fa
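Note that the last row above comes out as T;I;1Fa rather than T;I;KL, because the pattern '1;' still matches inside '11;'. One way around this (a sketch, not from the original answer) is a single word-boundary-anchored regex with a callable replacement, so each code is matched whole:

```python
import pandas as pd

social_media_2 = {'1': 'Fa', '2': 'I', '3': 'T', '11': 'KL'}  # shortened mapping
df = pd.DataFrame({'Q2': ['1 2', '1 3', '3 2 11']})

# One pass with a single alternation, longest keys first; the \b anchors
# prevent '1' from matching inside '11'
pattern = r'\b(' + '|'.join(sorted(social_media_2, key=len, reverse=True)) + r')\b'
df['Q2'] = df['Q2'].str.replace(pattern, lambda m: social_media_2[m.group(1)], regex=True)
print(df['Q2'].tolist())  # ['Fa I', 'Fa T', 'T I KL']
```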

Related

Create list from pandas dataframe

I have a function that takes every (non-distinct) MatchId together with its paired (xG_Team1, xG_Team2) values and outputs an array, which is then summed up to give an SSE value.
The problem with the function is that it iterates through each row, so duplicated MatchIds are processed repeatedly. I want to stop this.
For each distinct MatchId I need the corresponding home and away goal times as lists, i.e. Home_Goal and Away_Goal, taken from the Home_Goal_Time and Away_Goal_Time columns of the dataframe, to be used in each iteration. The list below doesn't seem to work.
MatchId Event_Id EventCode Team1 Team2 Team1_Goals
0 842079 2053 Goal Away Huachipato Cobresal 0
1 842079 2053 Goal Away Huachipato Cobresal 0
2 842080 1029 Goal Home Slovan lava 3
3 842080 1029 Goal Home Slovan lava 3
4 842080 2053 Goal Away Slovan lava 3
5 842080 1029 Goal Home Slovan lava 3
6 842634 2053 Goal Away Rosario Boca Juniors 0
7 842634 2053 Goal Away Rosario Boca Juniors 0
8 842634 2053 Goal Away Rosario Boca Juniors 0
9 842634 2054 Cancel Goal Away Rosario Boca Juniors 0
Team2_Goals xG_Team1 xG_Team2 CurrentPlaytime Home_Goal_Time Away_Goal_Time
0 2 1.79907 1.19893 2616183 0 87
1 2 1.79907 1.19893 3436780 0 115
2 1 1.70662 1.1995 3630545 121 0
3 1 1.70662 1.1995 4769519 159 0
4 1 1.70662 1.1995 5057143 0 169
5 1 1.70662 1.1995 5236213 175 0
6 2 0.82058 1.3465 2102264 0 70
7 2 0.82058 1.3465 4255871 0 142
8 2 0.82058 1.3465 5266652 0 176
9 2 0.82058 1.3465 5273611 0 0
For example MatchId = 842079, Home_goal =[], Away_Goal = [87, 115]
x1 = [1, 0, 0]
x2 = [0, 1, 0]
x3 = [0, 0, 1]
m = 1  # arbitrary constant used to optimise sse
k = 196
total_timeslot = 196
Home_Goal = []  # No Goal
Away_Goal = []  # No Goal
def sum_squared_diff(x1, x2, x3, y):
    ssd = []
    for k in range(total_timeslot):  # k will take multiple values
        if k in Home_Goal:
            ssd.append(sum((x2 - y) ** 2))
        elif k in Away_Goal:
            ssd.append(sum((x3 - y) ** 2))
        else:
            ssd.append(sum((x1 - y) ** 2))
    return ssd
def my_function(row):
    xG_Team1 = row.xG_Team1
    xG_Team2 = row.xG_Team2
    return np.array([1 - (xG_Team1*m + xG_Team2*m)/k, xG_Team1*m/k, xG_Team2*m/k])
results = df.apply(lambda row: sum_squared_diff(x1, x2, x3, my_function(row)), axis=1)
results
sum(results.sum())
For the three matches above the desired outcome should look like the following.
If I need an individual sse, sum(sum_squared_diff(x1, x2, x3, y)) gives me:
MatchId = 842079 = 3.984053038520635
MatchId = 842080 = 7.882189570700502
MatchId = 842634 = 5.929085973050213
Given the size of the original data, realistically I am after the total sum of the SSE. For the above sample data, simply adding up the values gives total sse = 17.79532858227135. Once I achieve this, I will try to optimise the SSE by updating the arbitrary value m.
Here are the lists I hoped the function would iterate over.
Home_scored = s.groupby('MatchId')['Home_Goal_Time'].apply(list)
Away_scored = s.groupby('MatchId')['Away_Goal_Time'].apply(list)
type(Home_scored)
pandas.core.series.Series
Then convert it to lists.
Home_Goal = Home_scored.tolist()
Away_Goal = Away_scored.tolist()
type(Home_Goal)
list
Home_Goal
Out[303]: [[0, 0], [121, 159, 0, 175], [0, 0, 0, 0]]
Away_Goal
Out[304]: [[87, 115], [0, 0, 169, 0], [70, 142, 176, 0]]
But the function still takes Home_Goal and Away_Goal as empty lists.
If you only want to consider one MatchId at a time, you should .groupby('MatchId') first:
df.groupby('MatchId').apply(...)
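To make that concrete, here is an illustrative sketch: per_match and the trimmed inline dataframe are hypothetical stand-ins (column names taken from the question), showing how groupby gives each match its own goal-time lists exactly once.

```python
import pandas as pd

# Trimmed stand-in for the question's dataframe
df = pd.DataFrame({
    'MatchId': [842079, 842079, 842080],
    'Home_Goal_Time': [0, 0, 121],
    'Away_Goal_Time': [87, 115, 0],
})

def per_match(g):
    # goal-time lists for this match only; 0 means "no goal on that row"
    home = [t for t in g['Home_Goal_Time'] if t != 0]
    away = [t for t in g['Away_Goal_Time'] if t != 0]
    # ...sum_squared_diff(x1, x2, x3, y) would be called here per match...
    return pd.Series({'Home_Goal': home, 'Away_Goal': away})

result = df.groupby('MatchId').apply(per_match)
print(result)
```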

for loop in pandas to search dataframe and update list stuck

I want to count areas of interest in my dataframe column 'which_AOI' (ranging from 0 to 9). I would like a new column with the results added to a dataframe, depending on a variable 'marker' (ranging from 0 to x) which tells me when one 'picture' ends and the next begins (one marker can go on for a variable number of rows). This is my code so far, but it seems to get stuck and runs without giving output. I tried reconstructing it from the beginning, but as soon as I get to 'if df.marker == num' it doesn't stop. What am I missing?
(example dataframe below)
## AOI count of spec. type function (in progress):
import numpy as np
import pandas as pd

path_i = "/Users/Desktop/Pilot/results/gazedata_filename.csv"
df = pd.read_csv(path_i, sep=",")

# create a new dataframe for AOIs:
d = {'marker': []}
df_aoi = pd.DataFrame(data=d)

### Creating an AOI list
item = df.which_AOI
aoi = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # list for search
aoi_array = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # list for filling
num = 0

for i in range(0, len(df.marker)):  # loop through the dataframe
    if df.marker == num:  # if marker == num it's one picture
        for index, item in enumerate(aoi):  # look for item (a number in which_AOI) in aoi list
            if item == aoi[index]:
                aoi_array[index] += 1
        print(aoi)
        print(aoi_array)
        se = pd.Series(aoi_array)  # make list into a series to attach to dataframe
        df_aoi['new_col'] = se.values  # add list to dataframe
        aoi_array.clear()  # clears list before next picture
    else:
        num += 1
index pos_time pos_x pos_y pup_time pup_diameter marker which_AOI fixation Picname shock
1 16300 168.608779907227 -136.360855102539 16300 2.935715675354 0 7 18 5 save
2 16318 144.97673034668 -157.495513916016 16318 3.08838820457459 0 8 33 5 save
3 16351 152.92560577392598 -156.64172363281298 16351 3.0895299911499 0 7 17 5 save
4 16368 152.132453918457 -157.989685058594 16368 3.111008644104 0 7 18 5 save
5 16386 151.59835815429702 -157.55587768554702 16386 3.09514689445496 0 7 18 5 save
6 16404 150.88092803955098 -152.69479370117202 16404 3.10009074211121 1 7 37 5 save
7 16441 152.76554107666 -142.06188964843798 16441 3.0821495056152304 1 7 33 5 save
Not 100% clear based on your question but it sounds like you want to count the number of rows for each which_AOI value in each marker.
You can accomplish this using groupby
df_aoi = df.groupby(['marker','which_AOI']).size().unstack('which_AOI',fill_value=0)
In:
pos_time pos_x pos_y pup_time pup_diameter marker \
0 16300 168.608780 -136.360855 16300 2.935716 0
1 16318 144.976730 -157.495514 16318 3.088388 0
2 16351 152.925606 -156.641724 16351 3.089530 0
3 16368 152.132454 -157.989685 16368 3.111009 0
4 16386 151.598358 -157.555878 16386 3.095147 0
5 16404 150.880928 -152.694794 16404 3.100091 1
6 16441 152.765541 -142.061890 16441 3.082150 1
which_AOI fixation Picname shock
0 7 18 5 save
1 8 33 5 save
2 7 17 5 save
3 7 18 5 save
4 7 18 5 save
5 7 37 5 save
6 7 33 5 save
Out:
which_AOI 7 8
marker
0 4 1
1 2 0
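With the sample rows from the question, the groupby approach runs like this (a self-contained sketch; the inline data is a trimmed stand-in for the CSV):

```python
import pandas as pd

# Trimmed stand-in for the gaze data (marker / which_AOI columns from the post)
df = pd.DataFrame({
    'marker':    [0, 0, 0, 0, 0, 1, 1],
    'which_AOI': [7, 8, 7, 7, 7, 7, 7],
})

# rows per (marker, which_AOI) pair, pivoted so each AOI becomes a column
counts = df.groupby(['marker', 'which_AOI']).size().unstack('which_AOI', fill_value=0)
print(counts)
```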

Function I defined is not cleaning my list properly

Here is my minimal working example:
list1 = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] #len = 21
list2 = [1,1,1,0,1,0,0,1,0,1,1,0,1,0,1,0,0,0,1,1,0] #len = 21
list3 = [0,0,1,0,1,1,0,1,0,1,0,1,1,1,0,1,0,1,1,1,1] #len = 21
list4 = [1,0,0,1,1,0,0,0,0,1,0,1,1,1,1,0,1,0,1,0,1] #len = 21
I have four lists and I want to "clean" list1 using the following rule: if any of list2[i], list3[i], or list4[i] is equal to zero, then I want to eliminate item i from list1. So basically I only keep those elements of list1 where the other lists all have ones.
here is the function I wrote to solve this
def clean(list1, list2, list3, list4):
    for i in range(len(list2)):
        if list2[i] == 0 or list3[i] == 0 or list4[i] == 0:
            list1.pop(i)
    return list1
however it doesn't work. If I apply it, it gives the error:
Traceback (most recent call last): line 68, in clean list1.pop(i)
IndexError: pop index out of range
What am I doing wrong? Also, I was told pandas is really good at dealing with data. Is there a way I can do it with pandas? Each of these lists is actually a column (after removing the heading) of a csv file.
EDIT
For example, at the end I would like to get: list1 = [4, 9, 12, 18]
I think the main problem is that at each iteration, when I pop out an element, the indices of all its successors change, and the overall length of the list shrinks, so the index passed to pop() eventually becomes too large. Hopefully there is another strategy or function I can use.
This is definitely a job for pandas:
import pandas as pd
df = pd.DataFrame({
'l1':list1,
'l2':list2,
'l3':list3,
'l4':list4
})
no_zeroes = df.loc[(df['l2'] != 0) & (df['l3'] != 0) & (df['l4'] != 0)]
Where df.loc[...] takes the full dataframe, then filters it by the criteria provided. In this example, your criteria are that you only keep the items where l2, l3, and l4 are not zero (!= 0).
Gives you a pandas dataframe:
l1 l2 l3 l4
4 4 1 1 1
9 9 1 1 1
12 12 1 1 1
18 18 1 1 1
or if you need just list1:
list1 = df['l1'].tolist()
if you want the criteria to be where all other columns are 1, then use:
all_ones = df.loc[(df['l2'] == 1) & (df['l3'] == 1) & (df['l4'] == 1)]
Note that I'm creating new dataframes for no_zeroes and all_ones and that the original dataframe stays intact if you want to further manipulate the data.
Update:
Per Divakar's answer (far more elegant than my original answer), much the same can be done in pandas:
df = pd.DataFrame([list1, list2, list3, list4])
list1 = df.loc[0, (df[1:] != 0).all()].astype(int).tolist()
Here's one approach with NumPy -
import numpy as np
mask = (np.asarray(list2)==1) & (np.asarray(list3)==1) & (np.asarray(list4)==1)
out = np.asarray(list1)[mask].tolist()
Here's another way with NumPy that stacks those lists into rows to form a 2D array and thus simplifies things quite a bit -
arr = np.vstack((list1, list2, list3, list4))
out = arr[0,(arr[1:] == 1).all(0)].tolist()
Sample run -
In [165]: arr = np.vstack((list1, list2, list3, list4))
In [166]: print arr
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
[ 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 0 0 1 1 0]
[ 0 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 1]
[ 1 0 0 1 1 0 0 0 0 1 0 1 1 1 1 0 1 0 1 0 1]]
In [167]: arr[0,(arr[1:] == 1).all(0)].tolist()
Out[167]: [4, 9, 12, 18]
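If pulling in pandas or NumPy feels like overkill, plain zip over the four lists does the same filtering in pure Python:

```python
list1 = list(range(21))
list2 = [1,1,1,0,1,0,0,1,0,1,1,0,1,0,1,0,0,0,1,1,0]
list3 = [0,0,1,0,1,1,0,1,0,1,0,1,1,1,0,1,0,1,1,1,1]
list4 = [1,0,0,1,1,0,0,0,0,1,0,1,1,1,1,0,1,0,1,0,1]

# keep list1[i] only where every other list has a 1 at position i
kept = [a for a, b, c, d in zip(list1, list2, list3, list4) if b == c == d == 1]
print(kept)  # [4, 9, 12, 18]
```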

scikit-learn: One hot encoding of column with list values [duplicate]

This question already has answers here:
How to one hot encode variant length features?
(2 answers)
Closed 5 years ago.
I am trying to encode a dataframe like below:
A B C
2 'Hello' ['we', 'are', 'good']
1 'All' ['hello', 'world']
Now as you can see, I can label-encode the string values of the second column, but I cannot figure out how to encode the third column, which holds lists of strings of varying length. Even if I one-hot encode it, I get an array that I don't know how to merge with the encoded elements of the other columns. Please suggest a good technique.
Assuming we have the following DF:
In [31]: df
Out[31]:
A B C
0 2 Hello [we, are, good]
1 1 All [hello, world]
Let's use sklearn.feature_extraction.text.CountVectorizer
In [32]: from sklearn.feature_extraction.text import CountVectorizer
In [33]: vect = CountVectorizer()
In [34]: X = vect.fit_transform(df.C.str.join(' '))
In [35]: df = df.join(pd.DataFrame(X.toarray(), columns=vect.get_feature_names()))
In [36]: df
Out[36]:
A B C are good hello we world
0 2 Hello [we, are, good] 1 1 0 1 0
1 1 All [hello, world] 0 0 1 0 1
alternatively, you can use sklearn.preprocessing.MultiLabelBinarizer, as @VivekKumar suggested in this comment:
In [56]: from sklearn.preprocessing import MultiLabelBinarizer
In [57]: mlb = MultiLabelBinarizer()
In [58]: X = mlb.fit_transform(df.C)
In [59]: df = df.join(pd.DataFrame(X, columns=mlb.classes_))
In [60]: df
Out[60]:
A B C are good hello we world
0 2 Hello [we, are, good] 1 1 0 1 0
1 1 All [hello, world] 0 0 1 0 1
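There is also a pandas-only alternative (no scikit-learn) worth knowing; a sketch using Series.str.get_dummies after joining each list with a delimiter:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1], 'B': ['Hello', 'All'],
                   'C': [['we', 'are', 'good'], ['hello', 'world']]})

# join each list into one delimited string, then one-hot on the delimiter
dummies = df['C'].str.join('|').str.get_dummies()
df = df.join(dummies)
print(df)
```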

R regex to fetch strings between characters at specific positions

I have some string data as follows in R.
DT <- structure(list(ID = c(1, 2, 3, 4, 5, 6), GKT = c("G1:GRST, G45:KRPT",
"G48932:KD56", "G7764:MGI45, K7786:IRE4R, K45:TG45", "K4512:3345, G51:56:34, K22:45I67",
"K678:RT,IG, G123:TGIF, G33:IG56", "T4534:K456")), .Names = c("ID",
"GKT"), class = "data.frame", row.names = c(NA, 6L))
DT
ID GKT
1 1 G1:GRST, G45:KRPT
2 2 G48932:KD56
3 3 G7764:MGI45, K7786:IRE4R, K45:TG45
4 4 K4512:3345, G51:56:34, K22:45I67
5 5 K678:RT,IG, G123:TGIF, G33:IG56
6 6 T4534:K456
I want to get the output out from DT$GKT using gsub and regex in R.
out <- c("G1, G45", "G48932", "G7764, K7786, K45", "K4512, G51, K22",
"K678, G123, G33", "T4534")
DT$out <- out
DT
ID GKT out
1 1 G1:GRST, G45:KRPT G1, G45
2 2 G48932:KD56 G48932
3 3 G7764:MGI45, K7786:IRE4R, K45:TG45 G7764, K7786, K45
4 4 K4512:3345, G51:56:34, K22:45I67 K4512, G51, K22
5 5 K678:RT,IG, G123:TGIF, G33:IG56 K678, G123, G33
6 6 T4534:K456 T4534
I have tried gsub(x=DT$GKT, pattern = "(:)(.*)(, |\\b)", replacement=""), but the greedy .* strips everything from the first colon onward, keeping only the first token:
gsub(x=DT$GKT, pattern = "(:)(.*)(, |\\b)", replacement="")
[1] "G1" "G48932" "G7764" "K4512" "K678" "T4534"
Another option using gsub is to use a lookahead:
DT$out <- gsub("(?=:)(.[A-Z0-9,]+)(?=\\b)", "", DT$GKT, perl = TRUE)
DT
# ID GKT out
# 1 1 G1:GRST, G45:KRPT G1, G45
# 2 2 G48932:KD56 G48932
# 3 3 G7764:MGI45, K7786:IRE4R, K45:TG45 G7764, K7786, K45
# 4 4 K4512:3345, G51:56:34, K22:45I67 K4512, G51, K22
# 5 5 K678:RT,IG, G123:TGIF, G33:IG56 K678, G123, G33
# 6 6 T4534:K456 T4534
EDIT
You can use the following regular expression for replacing ...
DT$out <- gsub(':\\S+\\b', '', DT$GKT)
DT
# ID GKT out
# 1 1 G1:GRST, G45:KRPT G1, G45
# 2 2 G48932:KD56 G48932
# 3 3 G7764:MGI45, K7786:IRE4R, K45:TG45 G7764, K7786, K45
# 4 4 K4512:3345, G51:56:34, K22:45I67 K4512, G51, K22
# 5 5 K678:RT,IG, G123:TGIF, G33:IG56 K678, G123, G33
# 6 6 T4534:K456 T4534
You could use a lookahead (?=) to check for : and capture just the group before it:
unlist(regmatches(DT$GKT, gregexpr("([A-Z0-9]+)(?=:)", DT$GKT, perl=T)))
# [1] "G1" "G45" "G48932" "G7764" "K7786" "K45" "K4512" "G51"
# [9] "56" "K22" "K678" "G123" "G33" "T4534"