I have a string stored in Hive, and I want to split the text on the 4th occurrence of , (or any other character).
I would really appreciate it if someone could give me a hint about the regular expression to do this.
The text is below:
The Band,The Band,,Up On Cripple Creek (2000 Digital Remaster),2000,Greatest Hits,The Band,,The Weight (2000 Digital Remaster),2003,Rhythm Of The Rain,The Cascades,,Rhythm Of The Rain (LP Version),2005,Chronicle Volume One,Creedence Clearwater Revival,,Who'll Stop the Rain,1976,The Complete Sun Singles, vol. 1,Johnny Cash,,I Walk the Line,2001,Greatest Hits,Bob Seger,,Against The Wind,1980,Their Greatest Hits,The Eagles,,Lyin' Eyes,1975,Johnny Horton's Greatest Hits,Johnny Horton,,North To Alaska,1987,Super Hits,Marty Robbins,,You Gave Me A Mountain,1969,Greatest Hits,Bob Seger,,Night Moves,1976,Hello Darlin' 15 #1 Hits,Conway Twitty,,It's Only Make Believe,2003,Anthology,Kenny Rogers & The First Edition,,Ruby, Don't Take Your Love To Town,1996,Greatest Hits,Neil Young,,Old Man,2004,Harvest,Neil Young,,Heart Of Gold,2009,The Very Best Of,The Springfields,,Silver Threads And Golden Needles,2011,The Best Of The Statler Brothers,The Statler Brothers,,Susan When She Tried,1987,The Definitive Collection,The Statler Brothers,,The Class Of '57,2005,The Definitive Collection,The Statler Brothers,,I'll Go To My Grave Loving You,2005,Greatest Hits: 1974-1978,Steve Miller Band,,The Joker,2006,Greatest Hits: 1974-1978,Steve Miller Band,,Rock'n Me,2006,Early Girl 7" Hits,Gale Garnett,,We'll Sing In The Sunshine,2010,King of the Road,Various Artists,,I Can't Stop Loving You - Don Gibson,2004,America's Troubador,Willie Nelson,,Angel Flying To Close To The Ground,2005,Their Greatest Hits,The Eagles,,Take It To The Limit,1975,Their Greatest Hits,The Eagles,,Desperado,1973,Highwayman,The Highwaymen,,Desperados Waiting For A Train,1985,Super Hits,Marty Robbins,,My Woman, My Woman, My Wife,1970,Super Hits,Marty Robbins,,Some Memories Just Won't Die,1982,Highwayman,The Highwaymen,,Committed To Parkview,1985,Greatest Hits - Roy Clark,Roy Clark,,Yesterday When I Was Young,1995,Greatest Hits - Roy Clark,Roy Clark,,I Never Picked Cotton,1995,Simon & Garfunkel's Greatest Hits,Simon & Garfunkel,,Bridge Over Troubled Water [Live],1970,Collection,The Oak Ridge Boys,,Y'all Come Back Saloon,1977,Super Hits,Vern Gosdin,,Chiseled In Stone,1987,Super Hits,Vern Gosdin,,Who You Gonna Blame It On This Time,1987,The Very Best Of John Denver [Disc 2],John Denver,,Rocky Mountain High,1972,The Very Best Of John Denver [Disc 2],John Denver,,Take Me Home, Country Roads,1971,Souvenirs,Vince Gill,,Never Knew Lonely,1995,Souvenirs,Vince Gill,,When I Call Your Name,1995,Souvenirs,Vince Gill,,Pocket Full Of Gold,1995,Greatest Hits - Waylon Jennings,Waylon Jennings,,Bob Wills Is Still King,2000,Greatest Hits - Waylon Jennings,Waylon Jennings,,Just To Satisfy You,2000
The Regex...
/ # Start regex
( # Start group
[^,]*, # not `,` - zero or more times (*) followed by `,`
){4} # repeat group four times
/g # match globally (using String.match)
All together...
console.log( str.match(/([^,]*,){4}/g) );
Note that a final chunk without a trailing comma is not matched, because each repetition of the group requires a closing comma.
^([^,]*,[^,]*,[^,]*,[^,]*,)(.*)$
should give you two strings: (1) the part up to and including the fourth comma, and (2) the rest of the string.
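For what it's worth, here is a minimal Python sketch of the same idea (plain Python rather than Hive, and the helper name is mine):

import re

# Hypothetical helper: split a string at the fourth comma.
def split_at_fourth_comma(s):
    # Group 1: everything up to and including the fourth comma; group 2: the rest.
    m = re.match(r'^((?:[^,]*,){4})(.*)$', s, re.DOTALL)
    if m is None:
        return None  # fewer than four commas
    return m.group(1), m.group(2)

head, tail = split_at_fourth_comma("a,b,c,d,e,f")
print(head)  # a,b,c,d,
print(tail)  # e,f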
Split by comma, and join again in a simple loop...
var arr, i, out, str;
str = "The Band,The Band,,Up On Cripple Creek (2000 Digital Remaster),2000,Greatest Hits,The Band,,The Weight (2000 Digital Remaster),2003,Rhythm Of The Rain,The Cascades,,Rhythm Of The Rain (LP Version),2005,Chronicle Volume One,Creedence Clearwater Revival,,Who'll Stop the Rain,1976,The Complete Sun Singles, vol. 1,Johnny Cash,,I Walk the Line,2001,Greatest Hits,Bob Seger,,Against The Wind,1980,Their Greatest Hits,The Eagles,,Lyin' Eyes,1975,Johnny Horton's Greatest Hits,Johnny Horton,,North To Alaska,1987,Super Hits,Marty Robbins,,You Gave Me A Mountain,1969,Greatest Hits,Bob Seger,,Night Moves,1976,Hello Darlin' 15 #1 Hits,Conway Twitty,,It's Only Make Believe,2003,Anthology,Kenny Rogers & The First Edition,,Ruby, Don't Take Your Love To Town,1996,Greatest Hits,Neil Young,,Old Man,2004,Harvest,Neil Young,,Heart Of Gold,2009,The Very Best Of,The Springfields,,Silver Threads And Golden Needles,2011,The Best Of The Statler Brothers,The Statler Brothers,,Susan When She Tried,1987,The Definitive Collection,The Statler Brothers,,The Class Of '57,2005,The Definitive Collection,The Statler Brothers,,I'll Go To My Grave Loving You,2005,Greatest Hits: 1974-1978,Steve Miller Band,,The Joker,2006,Greatest Hits: 1974-1978,Steve Miller Band,,Rock'n Me,2006,Early Girl 7\" Hits,Gale Garnett,,We'll Sing In The Sunshine,2010,King of the Road,Various Artists,,I Can't Stop Loving You - Don Gibson,2004,America's Troubador,Willie Nelson,,Angel Flying To Close To The Ground,2005,Their Greatest Hits,The Eagles,,Take It To The Limit,1975,Their Greatest Hits,The Eagles,,Desperado,1973,Highwayman,The Highwaymen,,Desperados Waiting For A Train,1985,Super Hits,Marty Robbins,,My Woman, My Woman, My Wife,1970,Super Hits,Marty Robbins,,Some Memories Just Won't Die,1982,Highwayman,The Highwaymen,,Committed To Parkview,1985,Greatest Hits - Roy Clark,Roy Clark,,Yesterday When I Was Young,1995,Greatest Hits - Roy Clark,Roy Clark,,I Never Picked Cotton,1995,Simon & Garfunkel's Greatest Hits,Simon & Garfunkel,,Bridge Over Troubled Water [Live],1970,Collection,The Oak Ridge Boys,,Y'all Come Back Saloon,1977,Super Hits,Vern Gosdin,,Chiseled In Stone,1987,Super Hits,Vern Gosdin,,Who You Gonna Blame It On This Time,1987,The Very Best Of John Denver [Disc 2],John Denver,,Rocky Mountain High,1972,The Very Best Of John Denver [Disc 2],John Denver,,Take Me Home, Country Roads,1971,Souvenirs,Vince Gill,,Never Knew Lonely,1995,Souvenirs,Vince Gill,,When I Call Your Name,1995,Souvenirs,Vince Gill,,Pocket Full Of Gold,1995,Greatest Hits - Waylon Jennings,Waylon Jennings,,Bob Wills Is Still King,2000,Greatest Hits - Waylon Jennings,Waylon Jennings,,Just To Satisfy You,2000";
arr = str.split(/,/);
out = [];
for (i = 0; i + 3 < arr.length; i += 4) {
  out.push(arr[i] + ", " + arr[i + 1] + ", " + arr[i + 2] + ", " + arr[i + 3]);
}
console.log(out);
I have a weekly dataset. I use this code to plot the causality between variables. Stata shows the number of weeks of each year on the X-axis. Is it possible to show only year or year-month instead of year-week on the X-axis?
generate Date = wofd(D)
format Date %tw
tsset Date
tvgc Momentum supply, p(3) d(3) trend window(25) prefix(_) graph
The fact that you have weekly data is only a distraction here.
You should only use Stata's weekly date functions if your weeks satisfy Stata's rules:
Week 1 starts on 1 January, always.
Later weeks start 7 days later in turn, except that week 52 is always 8 or 9 days long.
Hence there is no week 53.
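For illustration only, here is a minimal Python sketch of those rules (my own paraphrase, not Stata code):

from datetime import date

def stata_week(d):
    # Week 1 starts on 1 January; each week is 7 days long,
    # except week 52, which absorbs the last 1-2 days of the year (no week 53).
    doy = d.timetuple().tm_yday         # 1-based day of year
    return min((doy - 1) // 7 + 1, 52)

print(stata_week(date(2011, 1, 1)))    # 1
print(stata_week(date(2011, 12, 31)))  # 52, not 53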
These are documented rules, and they do not match your data. You are lucky that you have no 53-week years in your data; otherwise you would get some bizarre results.
See the much more detailed discussion in the references turned up by typing search week, sj in Stata.
The good news is that you just need to build on what you have and put labels and ticks on your x axis. It's a little bit of work, but no more than using the standard, documented label and tick options. The main ideas are blindingly obvious once spelled out:
Labels: Put informative labels in the middle of time intervals. Suppress the associated ticks. You can suppress a tick by setting its length to zero or its colour to invisible.
Ticks: Put ticks at the ends (equivalently, the beginnings) of time intervals. Lengthen ticks as needed.
Grid lines: Lines demarcating years could be worth adding. None are shown here, but the syntax is just an extension of that given.
Axis titles: If the time (usually x) axis is adequately explained, the axis title is redundant, and even dopey if it is some arbitrary variable name.
See especially https://www.stata-journal.com/article.html?article=gr0030 and https://www.stata-journal.com/article.html?article=gr0079
With your data, showing years is sensible but showing months too is likely to produce crowded detail that is hard to read and not much use. I compromised on quarters.
* Example generated by -dataex-. For more info, type help dataex
clear
input str10 D float(Momentum Supply)
"12/2/2010" -1.235124 4.760894
"12/9/2010" -1.537671 3.002344
"12/16/2010" -.679893 1.5665628
"12/23/2010" 1.964229 .5875537
"12/30/2010" -1.1872853 -1.1315695
"1/6/2011" .028031677 .065580264
"1/13/2011" .4438451 1.2316793
"1/20/2011" -.3865465 1.7899017
"1/27/2011" -.4547117 1.539866
"2/3/2011" 1.6675532 1.352376
"2/10/2011" -.016190516 3.72986
"2/17/2011" .5471755 2.0804555
"2/24/2011" .2695233 2.1094923
"3/3/2011" .5136591 -1.0686383
"3/10/2011" .606721 3.786967
"3/17/2011" .004175631 .4544936
"3/24/2011" 1.198901 -.3316304
"3/31/2011" .1973385 .5846249
"4/7/2011" 2.2470737 1.0026894
"4/14/2011" .3980386 -2.6676855
"4/21/2011" -1.530687 -7.214682
"4/28/2011" -.9735931 3.246654
"5/5/2011" .13312873 .9581707
"5/12/2011" -.8017629 -.468076
"5/19/2011" -.11491735 -4.354526
"5/26/2011" .3627179 -2.233418
"6/2/2011" .13805833 2.2697728
"6/9/2011" .27832976 .58203816
"6/16/2011" -1.9467738 -.2834298
"6/23/2011" -.9579238 -1.0356172
"6/30/2011" 1.1799787 1.1011268
"7/7/2011" -2.0982232 .5292908
"7/14/2011" -.2992591 -.4004747
"7/21/2011" .5904395 -2.5159726
"7/28/2011" -.21626104 1.936029
"8/4/2011" -.02421602 -.8160484
"8/11/2011" 1.5797064 -.6868965
"8/18/2011" 1.495294 -1.8621664
"8/25/2011" -1.2188485 -.8388996
"9/1/2011" .4991612 -1.6689343
"9/8/2011" 2.1691883 1.3244398
"9/15/2011" -1.2074957 .9707839
"9/22/2011" -.3399567 .6742781
"9/29/2011" 1.9860272 -3.331345
"10/6/2011" 1.935733 -.3882593
"10/13/2011" -1.278119 .6796986
"10/20/2011" -1.3209987 .2258049
"10/27/2011" 4.315368 .7879103
"11/3/2011" .58669937 -.5040554
"11/10/2011" 1.460597 -2.0426705
"11/17/2011" -1.338189 -.24199644
"11/24/2011" -1.6870773 -1.1143018
"12/1/2011" -.19232976 -1.2156726
"12/8/2011" -2.655519 -2.054406
"12/15/2011" 1.7161795 -.15301673
"12/22/2011" -1.43026 -3.138013
"12/29/2011" .03427247 -.28446484
"1/5/2012" -.15930523 -3.362428
"1/12/2012" .4222094 4.0962815
"1/19/2012" -.2413332 3.8277814
"1/26/2012" -2.850591 .067359865
"2/2/2012" -1.1785052 -.3558361
"2/9/2012" -1.0380571 .05134211
"2/16/2012" .8539951 -4.421839
"2/23/2012" .2636529 1.3424703
"3/1/2012" .022639304 2.734022
"3/8/2012" .1370547 .8043283
"3/15/2012" .1787796 -.56465846
"3/22/2012" -2.0645525 -2.9066684
"3/29/2012" 1.562931 -.4505192
"4/5/2012" 1.2587242 -.6908772
"4/12/2012" -1.5202224 .7883849
"4/19/2012" 1.0128288 -1.6764873
"4/26/2012" -.29182148 1.920932
"5/3/2012" -1.228097 -3.7068026
"5/10/2012" -.3124508 -3.034149
"5/17/2012" .7570716 -2.3398724
"5/24/2012" -1.0697783 -2.438565
"5/31/2012" 1.2796624 1.299344
"6/7/2012" -1.5482885 -1.228557
"6/14/2012" 1.396692 3.2158935
"6/21/2012" .3116726 8.035475
"6/28/2012" -.22332123 .7450229
"7/5/2012" .4655248 .04986914
"7/12/2012" .4769497 4.045938
"7/19/2012" .08743203 .25987592
"7/26/2012" -.402533 .3213503
"8/2/2012" -.1564897 1.5290447
"8/9/2012" -.0919008 .13955575
"8/16/2012" -1.3851573 1.0860283
"8/23/2012" .020250637 -.8858514
"8/30/2012" -.29458764 -1.6602173
"9/6/2012" -.39921495 -.8043483
"9/13/2012" 1.76396 4.2867813
"9/20/2012" -1.2335806 2.476225
"9/27/2012" .176066 -.5992883
"10/4/2012" .1075483 1.7167135
"10/11/2012" .06365488 1.1636261
"10/18/2012" -.2305842 -1.506699
"10/25/2012" -.1526354 -2.669866
"11/1/2012" -.06311637 -2.0813057
"11/8/2012" .55959195 .8805096
"11/15/2012" 1.5306772 -2.708766
"11/22/2012" -.5585792 .26319882
"11/29/2012" -.035690214 -1.6176193
"12/6/2012" -.7885767 1.1719254
"12/13/2012" .9131169 -1.1135346
"12/20/2012" -.6910864 -.4893669
"12/27/2012" .9836168 .4052487
"1/3/2013" -.8828759 .7161615
"1/10/2013" 1.505474 -.1768004
"1/17/2013" -1.3013282 -1.333739
"1/24/2013" -1.3670077 1.0568022
"1/31/2013" .05846912 -.7845241
"2/7/2013" .4923012 -1.202816
"2/14/2013" -.06551787 -.9198701
"2/21/2013" -1.8149366 -.1746187
"2/28/2013" .3370621 1.0104061
"3/7/2013" 1.2698976 1.273357
"3/14/2013" -.3884514 .7927139
"3/21/2013" -.1437847 1.7798674
"3/28/2013" -.2325031 .9336611
"4/4/2013" .03971701 .6680117
"4/11/2013" -.25990707 -3.0261614
"4/18/2013" .7046488 -.458615
"4/25/2013" -2.1198323 -.14664523
"5/2/2013" 1.591287 -.3687443
"5/9/2013" -1.1266721 -2.0973356
"5/16/2013" -.7595757 -1.1238302
"5/23/2013" 2.2590933 2.124479
"5/30/2013" -.7447268 .7387985
"6/6/2013" 1.3409324 -1.3744274
"6/13/2013" -.3844476 -.8341842
"6/20/2013" -.8135379 -1.7971268
"6/27/2013" -2.506065 -.4194731
"7/4/2013" -.4755843 -5.216218
"7/11/2013" -1.256806 1.8539237
"7/18/2013" -.13328764 -1.0578626
"7/25/2013" 1.2412375 1.7703875
"8/1/2013" 1.5033063 -2.2505422
"8/8/2013" -1.291876 -1.5896243
"8/15/2013" 1.0093634 -2.8861396
"8/22/2013" -.6952878 -.23103845
"8/29/2013" -.05459245 1.53916
"9/5/2013" 1.2413216 .749662
"9/12/2013" .19232245 2.81967
"9/19/2013" -2.6861706 -4.520664
"9/26/2013" .3105677 -5.274343
"10/3/2013" -.2184027 -3.251637
"10/10/2013" -1.233326 -5.031735
"10/17/2013" 1.9415965 -1.250861
"10/24/2013" -1.2008202 -1.5703772
"10/31/2013" -.6394427 -1.1347327
"11/7/2013" 2.715824 2.0324607
"11/14/2013" -1.5833142 2.5080755
"11/21/2013" .9940037 4.117931
"11/28/2013" -.8226601 3.752914
"12/5/2013" .09966203 1.865995
"12/12/2013" -.18744355 2.5426314
end
gen ddate = daily(D, "MDY")
gen year = year(ddate)
gen dow = dow(ddate)
tab year
tab dow
forval y = 2010/2013 {
    local Y = `y' + 1
    local yend `yend' `=mdy(1,1,`Y')'
    if `y' > 2010 local ymid `ymid' `=mdy(7,1,`y')' "`y'"
    forval q = 1/4 {
        if `q' > 4 | `y' > 2010 {
            local qmid : word `q' of 2 5 8 11
            local qmids `qmids' `=mdy(`qmid', 15, `y')' "Q`q'"
            local qend : word `q' of 4 7 10 4
            local qends `qends' `=mdy(`qend', 1, `y')'
        }
    }
}
line M S ddate, xla(`ymid', tlength(*3) tlc(none)) xtic(`yend', tlength(*5)) xmla(`qmids', tlc(none) labsize(small) tlength(*.5)) xmti(`qends', tlength(*5)) xtitle("") scheme(s1color)
An extension to:
Removing list of words from a string
I have the following dataframe and I want to delete frequently occurring words from the data.name column:
data:
name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark
I'm creating a new dataframe of words and their frequencies with the following code:
df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['word', 'freq']
df = df[df['freq'] >= 3]
which will result in
df:
word freq
Clinton 4
Bill 3
James 3
Clark 3
Then I'm converting it into a dictionary with the following code snippet:
d = dict(zip(df['word'], df['freq']))
Now, to remove from data.name the words that appear in d (a dictionary of word: freq; only the keys actually matter, so a set would work just as well), I'm using the following code snippet:
def check_thresh_word(merc, d):
    # True only if none of the words in merc appear in d
    for w in merc.split(' '):
        if w in d:
            return False
    return True
def rm_freq_occurences(merc, d):
    if not check_thresh_word(merc, d):
        nwords = merc.split(' ')
        rwords = [word for word in nwords if word not in d]
        m = ' '.join(rwords)
    else:
        m = merc
    return m
data['new_name'] = data['name'].apply(lambda x: rm_freq_occurences(x, d))
But my actual dataframe contains nearly 240k rows, and I have to use a threshold (3 in the sample above) greater than 100.
So the code above takes a long time to run because of all the repeated searching.
Is there an efficient way to make it faster?
The desired output is:
name
Hayden
Rock
Gates
Vishal
Cameroon
Micky
Michael
Tony Waugh
Tom
Tom
Avinash
Shreyas
Ramesh
Adam
Thanks in advance!
Use replace with a regex created by joining all values of the word column, and finally strip the leftover whitespace:
data.name = data.name.replace('|'.join(df['word']), '', regex=True).str.strip()
Another solution is to add \s* to match zero or more whitespace characters:
pat = '|'.join([r'\s*{}\s*'.format(x) for x in df['word']])
print (pat)
\s*Clinton\s*|\s*James\s*|\s*Bill\s*|\s*Clark\s*
data.name = data.name.replace(pat, '', regex=True)
print (data)
name
0 Hayden
1 Rock
2 Gates
3 Vishal
4 Cameroon
5 Micky
6 Michael
7 Tony Waugh
8 Tom
9 Tom
10 Avinash
11 Shreyas
12 Ramesh
13 Adam
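One caveat: a pattern built this way also matches inside longer words (e.g. Clark inside Clarkson). If that is a concern, wrapping each word in \b word boundaries is a possible refinement; a sketch, assuming the same df and data as above:

import re

# \b prevents matches inside longer words; re.escape guards any special characters
pat = '|'.join(r'\b{}\b'.format(re.escape(x)) for x in df['word'])
data.name = data.name.replace(pat, '', regex=True).str.strip()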
I have the following data frame df, which I converted from an SFrame:
URI name text
0 <http://dbpedia.org/resource/Digby_M... Digby Morrell digby morrell born 10 october 1979 i...
1 <http://dbpedia.org/resource/Alfred_... Alfred J. Lewy alfred j lewy aka sandy lewy graduat...
2 <http://dbpedia.org/resource/Harpdog... Harpdog Brown harpdog brown is a singer and harmon...
3 <http://dbpedia.org/resource/Franz_R... Franz Rottensteiner franz rottensteiner born in waidmann...
4 <http://dbpedia.org/resource/G-Enka> G-Enka henry krvits born 30 december 1974 i...
I have done the following:
from textblob import TextBlob as tb
import math
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

bloblist = []
for i in range(0, df.shape[0]):
    bloblist.append(tb(df.iloc[i, 2]))

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
But this is taking a lot of time as there are 59000 documents.
Is there a better way to do it?
I am a bit confused about this subject myself, but I found a few solutions on the internet that use Spark. Have a look at:
https://www.linkedin.com/pulse/understanding-tf-idf-first-principle-computation-apache-asimadi
On the other hand, I tried the method below and did not get bad results. Maybe you want to try it:
I have a word list. The list contains each word and its count.
I found the average of these word counts.
I selected a lower limit and an upper limit based on the average value
(e.g. lower bound = average / 2 and upper bound = average * 5).
Then I created a new word list keeping only the words whose counts fall within those bounds.
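A minimal Python sketch of that filtering idea (my reconstruction; the names and example data are illustrative):

from collections import Counter

def filter_vocabulary(words):
    # words: a flat list of tokens across all documents
    counts = Counter(words)
    mean = sum(counts.values()) / len(counts)
    lower, upper = mean / 2, mean * 5   # bounds chosen as described above
    return {w: c for w, c in counts.items() if lower <= c <= upper}

tokens = "the cat sat on the mat the cat ran".split()
print(filter_vocabulary(tokens))  # words with counts outside [lower, upper] are dropped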
With these I got this result:
Before normalization, word vector length: 11880
Mean: 19, lower bound: 9, upper bound: 95
After normalization, word vector length: 1595
The cosine similarity results were also better.
I'm not too confident that I'm asking this question correctly, but this is what I'd like to do.
In the Django admin, I would like to write an action that sorts the list of my contestants randomly and doesn't allow two people with the same first name to be within 4 records of each other. So basically,
if you have John L., John C., Carey J., Tracy M., Mary T., the records would be listed like this:
John L.
Mary T.
Carey J.
Tracy M.
John C.
OR
How can I write an action that would create random groups where two people with the same name wouldn't be in the same group, like so:
John L., John C., Carey J., Tracy M., Mary T. =
Group 1
John L.
Mary T.
Carey J.
Tracy M.
Group 2
John C.
Forgive me if this isn't very clear; let me know and I'll try to specify further. Any help would be appreciated.
EDIT:
Is this what you are referring to? I can't quite figure out how to compare the fields to see if they are the same.
Model:
class people(models.Model):
    # max_length is required for CharField; 50 here is an arbitrary choice
    fname = models.CharField(max_length=50)
    lname = models.CharField(max_length=50)
    group = models.IntegerField()
View:
N = 4
Num = randint(0, N-1)
for x in queryset:
    x.group = Num
    if group == group | fname == fname | lname == lname:
        x.group = (Num + 1) % N
Your first question cannot always be solved: just think of the case where all contestants have the same name; then you actually cannot find a valid ordering.
For the second question, though, I can suggest an algorithm.
Since I do not see any restriction on the number of groups, I will suggest a method that creates the fewest groups.
EDIT: I assumed you don't want two people with the same "first name" in a group.
The steps are:
Count the appearances of each name:
count = {}
for x in queryset:
    if x.fname not in count:
        count[x.fname] = 0
    count[x.fname] += 1
Find the name with the most appearance
N = 0
for x in queryset:
if count[x.fname] > N:
N = count[x.fname]
Create N groups, where N equals the appearance count of the name from step 2.
For each contestant, generate a random number X, where X < N.
Try to put the contestant into group X. If group X already has that first name, set X = (X + 1) % N and retry; repeat until success. You will always find a group, since no first name appears more than N times.
from random import randint

groups = [[] for _ in range(N)]  # N independent lists; note [[]] * N would alias a single list
for item in queryset:
    X = randint(0, N - 1)
    while item.fname in groups[X]:
        X = (X + 1) % N
    groups[X].append(item.fname)
    item.group = X
EDIT:
Added details in steps 1, 2, 4.
From the code segment in your edit, I think you do not actually need a separate definition of "group" in the model, as it seems you only need a group number.
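Putting the pieces together, a hypothetical admin action might look like this (a sketch only; the field names follow your edit, and the grouping logic is the algorithm above):

from random import randint

def assign_random_groups(modeladmin, request, queryset):
    # N = number of appearances of the most frequent first name
    count = {}
    for x in queryset:
        count[x.fname] = count.get(x.fname, 0) + 1
    N = max(count.values())

    groups = [set() for _ in range(N)]   # first names already placed in each group
    for x in queryset:
        g = randint(0, N - 1)
        while x.fname in groups[g]:      # probe until a group without this name is found
            g = (g + 1) % N
        groups[g].add(x.fname)
        x.group = g
        x.save()

assign_random_groups.short_description = "Assign contestants to random groups"

You would still need to register it in your ModelAdmin's actions list.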
Not really a specific question, since I don't know enough; it's more a question of how to approach this.
An example file can be seen below:
LOADING CONDITION : 4-Homogenous cargo 98% 1.018t/m3, draught 3.35m
- outgoing
ITEMS OF LOADING
-------------------------------------------------------------------------------
CAPA ITEM REFERENCE X1 X2 WEIGHT KG LCG YG FSM
No (m) (m) (t) (m) (m) (m) (t.m)
-------------------------------------------------------------------------------
13 No2 CARGO TK P 1.650 29.400 609.04 2.745 15.525 -3.384 483.49
14 No2 CARGO TK S 1.650 29.400 603.61 2.745 15.525 3.384 483.49
15 No1 CARGO TK P 29.400 56.400 587.23 2.745 42.900 -3.384 470.42
16 No1 CARGO TK S 29.400 56.400 592.45 2.745 42.900 3.384 470.42
17 MGO tank aft 21.150 23.400 23.42 6.531 22.275 -0.500 15.70
18 TO storage tank 21.150 23.400 2.68 7.225 22.275 2.300 0.00
19 MGO fore tank 33.150 35.400 25.90 6.643 34.275 -0.212 0.00
-------------------------------------------------------------------------------
DEADWEIGHT 2444.34 2.828 29.007 -0.005 1923.52
SUMMARY OF LOADING
WEIGHT KG LCG YG FSM
(t) (m) (m) (m) (t.m)
-------------------------------------------------------------------------------
DEADWEIGHT 2444.34 2.828 29.007 -0.005 1923.52
LIGHT SHIP 634.00 3.030 28.654 0.000 0.00
-------------------------------------------------------------------------------
TOTAL WEIGHT 3078.34 2.869 28.935 -0.004 1923.52
LOADING CONDITION : 4-Homogenous cargo 98% 1.018t/m3, draught 3.35m
- outgoing
Damage Case : 1bott: all cargo & void3
Flooding Percentage : 100 %
Flooded Volumes : No.3 Void space P No.3 Void space S No2 CARGO TK P
No2 CARGO TK S No1 CARGO TK P No1 CARGO TK S
-------------------------------------------------------------------------------
WEIGHT KG LCG YG FSM CORR.KG
(t) (m) (m) (m) (t.m) (m)
-------------------------------------------------------------------------------
TOTAL WEIGHT 3078.34 2.869 28.935 -0.004 1923.52 3.494
RUN-OFF WEIGHTS 0.00 0.000 0.000 0.000 0.00 0.000
-------------------------------------------------------------------------------
DAMAGE CONDITION 3078.34 2.869 28.935 -0.004 1923.52 3.494
EQUILIBRIUM NOT FOUND ON STARBOARD
LOADING CASE :
4-Homogenous cargo 98% 1.018t/m3, draught 3.35m - outgoing
-------------------------------------------------------------------------------
WEIGHT KG LCG YG FSM CORR.KG
(t) (m) (m) (m) (t.m) (m)
-------------------------------------------------------------------------------
TOTAL WEIGHT 3078.34 2.869 28.935 -0.004 1923.52 3.494
SUMMARY OF RESULTS OF DAMAGE STABILITY
-------------------------------------------------------------------------------
DAMAGE CASE % R HEEL GM FBmin GZ>0 GZmax Area
(deg) (m) (m) (deg) (m) (m.rad)
-------------------------------------------------------------------------------
1bott: all cargo & void3 100 0 EQUILIBRIUM NOT FOUND
% : Flooding percentage.
R : R=1 if run-off weights considered, R=0 if no run-off.
HEEL : Heel at equilibrium (negative if equilibrium is on port).
GM : GM at equilibrium.
FBmin : Minimum distance of margin line, weathertight or non-weathertight
points from waterline.
GZ>0 : Range of positive GZ limited to immersion of non-weathertight openings.
GZmax : Maximum GZ value.
This is one of many; they can differ a bit, but they all come down to tables in textual form. I need to clean up some items from them before pasting them into a report.
So I was wondering: what would be the best way to delete a certain table? For example, SUMMARY OF LOADING (it starts with the line containing "SUMMARY OF LOADING" and ends at the line containing "TOTAL WEIGHT").
How do I match that table and delete it?
Try the following from within vim
:g/SUMMARY OF LOADING/, /TOTAL WEIGHT/d
sed works in the same way:
sed '/SUMMARY OF LOADING/, /TOTAL WEIGHT/d' input_with_tables.txt
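If you have many such files, the same deletion is easy to script. Here is a minimal Python sketch (assuming each start marker is eventually followed by its end marker):

def drop_table(lines, start='SUMMARY OF LOADING', end='TOTAL WEIGHT'):
    out, skipping = [], False
    for line in lines:
        if not skipping and start in line:
            skipping = True        # first line of the table to delete
        if not skipping:
            out.append(line)
        elif end in line:
            skipping = False       # the end-marker line is dropped too
    return out

with open('input_with_tables.txt') as f:
    print(''.join(drop_table(f.readlines())), end='')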
Fredrik Pihl's solution with :g works well if you need to delete all such tables. For more specific edits, you could use my CountJump plugin to create custom motions and text objects by defining start and end patterns (like SUMMARY OF LOADING and TOTAL WEIGHT in your case), and then quickly jump to the next table and delete a table with a quick mapping.