How to extract labels from a Binary Image in SimpleITK in python - python-2.7

I would like to extract the labels from the 2D Binary image I get using the following code:
image2DThresh = sitk.Threshold(image2D, lower=stats.GetMinimum(), upper=127.500)
cca = sitk.ConnectedComponentImageFilter()
cca_image = cca.Execute(image2DThresh)  # run CCA on the thresholded slice (a name starting with a digit is not a valid Python identifier)
# Get the shape statistics of the labels using
labelStats = sitk.LabelShapeStatisticsImageFilter()
The basic idea is to find the mean intensity, the area of the ROI, and the min/max indices of each label in the main image. What I am trying to do is binarize the image with the Threshold filter, then run CCA on it to get all the labels. Then I use LabelShapeStatisticsImageFilter() to get the physical attributes of every label (except label 0, of course) and check whether each label meets my conditions. The problem is that I am not able to get the average intensity of the main image where the label is. That is why I suggest using LabelIntensityStatisticsImageFilter, which, however, isn't available to me on Python 2.7 with SimpleITK 0.10.

The two filters you may be interested in are LabelStatisticsImageFilter and LabelIntensityStatisticsImageFilter. Both are available in SimpleITK 0.10; if not, you have a distribution problem. Both filters compute the mean, but the latter also computes a bounding box and many more advanced statistics.
Usage would go something like this:
In [1]: import SimpleITK as sitk
In [2]: print sitk.Version()
SimpleITK Version: 0.10.0 (ITK 4.10)
Compiled: Aug 16 2016 17:21:32
In [3]: img = sitk.ReadImage("cthead1.png")
In [4]: cc = sitk.ConnectedComponent(img>100)
In [5]: stats = sitk.LabelIntensityStatisticsImageFilter()
In [6]: stats.Execute(cc,img)
Out[6]: <SimpleITK.SimpleITK.Image; proxy of <Swig Object of type 'std::vector< itk::simple::Image >::value_type *' at 0x2a6b540> >
In [7]: for l in stats.GetLabels():
   ...:     print("Label: {0} -> Mean: {1} Size: {2}".format(l, stats.GetMean(l), stats.GetPhysicalSize(l)))
   ...:
Label: 1 -> Mean: 157.494210868 Size: 3643.8348071
Label: 2 -> Mean: 151.347826087 Size: 2.86239969136
Label: 3 -> Mean: 123.75 Size: 0.497808641975
Label: 4 -> Mean: 106.0 Size: 0.248904320988
Label: 5 -> Mean: 104.0 Size: 0.124452160494
Label: 6 -> Mean: 106.0 Size: 0.124452160494
Label: 7 -> Mean: 103.0 Size: 0.124452160494
Label: 8 -> Mean: 121.5 Size: 1.49342592593
Label: 9 -> Mean: 106.0 Size: 0.124452160494
Instead of printing, you could build lists of labels to preserve or to relabel to 0 (erase). The ChangeLabelImageFilter can then be used to apply this change to the label image.
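For example, here is a minimal sketch of that relabeling step, reusing the cc and stats objects from the session above (the 1.0 physical-size cut-off is purely illustrative):
# Labels considered too small under an assumed size criterion.
small_labels = [l for l in stats.GetLabels() if stats.GetPhysicalSize(l) < 1.0]

# Map each unwanted label to 0 (background) and apply the mapping to the label image.
change_filter = sitk.ChangeLabelImageFilter()
change_filter.SetChangeMap({l: 0 for l in small_labels})
cleaned_cc = change_filter.Execute(cc)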
The combination of thresholding, statistics, and label selection is a powerful segmentation approach which can be used and customized for many tasks. It also serves as a starting point for more complicated methods.

So I solved the problem using numpy. I'm posting the code; maybe it helps someone else in the future!
import numpy as np
import matplotlib.pyplot as plt
import SimpleITK as sitk

def get_label(ccaimage, label, image2D):
    # labelImage is the mask for a particular label
    labelImage = sitk.Threshold(ccaimage, lower=label, upper=label)
    # sitk_show(labelImage)

    # get the images as arrays
    labelImageArray = sitk.GetArrayFromImage(labelImage)
    image2Darray = sitk.GetArrayFromImage(image2D)

    # ROI_1 marks the pixels of the original image equal to the minimum intensity under the mask
    ROI_1 = image2Darray == np.min(image2Darray[labelImageArray == label])
    plt.imshow(ROI_1)
    plt.show()

    # ROI_2 is the mask image
    ROI_2 = labelImageArray == label
    plt.imshow(ROI_2)
    plt.show()

    # AND gives only those pixels which satisfy both conditions
    ROI = np.logical_and(image2Darray == np.min(image2Darray[labelImageArray == label]),
                         labelImageArray == label)

    # average intensity of the original image under the mask
    avg = np.mean(image2Darray[labelImageArray == label])
    print np.min(image2Darray[labelImageArray == label])
    print np.where(ROI)
    plt.imshow(ROI)
    plt.show()
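A short usage sketch, assuming the cca_image produced in the question and a LabelShapeStatisticsImageFilter to enumerate the labels (the variable names are illustrative):
labelStats = sitk.LabelShapeStatisticsImageFilter()
labelStats.Execute(cca_image)

# Inspect every connected component; label 0 (background) is not returned by GetLabels().
for label in labelStats.GetLabels():
    get_label(cca_image, label, image2D)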

Related

Pine Script for Trading View: How to display on certain time frames

Could you kindly show me how to hide the vertical line on timeframes of 1 hour or more?
I want to show the vertical lines only when the market opens.
I am having difficulty using the iff statement together with the likes of isintraday, and the script does not compile.
Thank you in advance.
//#version=2
study("Line", overlay=true)
t2 = time(period, "1800-1801")
t1 = time(period, "0300-0301")
t3 = time(period, "0930-0931")
Open2 = na(t2) ? na : blue
Open1 = na(t1) ? na : green
Open3 = na(t3) ? na : green
The condition below is true when the chart's timeframe is >= 60 min. You can use it to build your display conditions:
//#version=2
study("")
tf60AndMore = not (isseconds or (isintraday and interval < 60))
plotchar(tf60AndMore, "tf60AndMore", "•", location.top)

Extract strings based on pattern in python and writing them into pandas dataframe columns

I have text data inside a column of a dataset, as shown below:
Record Note 1
1 Amount: $43,385.23
Mode: Air
LSP: Panalpina
2 Amount: $1,149.32
Mode: Ocean
LSP: BDP
3 Amount: $1,149.32
LSP: BDP
Mode: Road
4 Amount: U$ 3,234.01
Mode: Air
5 No details
I need to extract each of the details inside the text data and write them into new columns as shown below. How can I do this in Python?
Expected Output
Record Amount Mode LSP
1 $43,385.23 Air Panalpina
2 $1,149.32 Ocean BDP
3 $1,149.32 Road BDP
4 $3,234.01 Air
5
Is this possible? How can this be done?
Write a custom function and then use pd.apply() -
def parse_rec(x):
    note = x['Note']
    details = note.split('\n')
    x['Amount'] = None
    x['Mode'] = None
    x['LSP'] = None
    if len(details) > 1:
        for detail in details:
            if 'Amount' in detail:
                x['Amount'] = detail.split(':')[1].strip()
            if 'Mode' in detail:
                x['Mode'] = detail.split(':')[1].strip()
            if 'LSP' in detail:
                x['LSP'] = detail.split(':')[1].strip()
    return x

df = df.apply(parse_rec, axis=1)
import re

Amount = []
Mode = []
LSP = []

def extract_info(txt):
    Amount_lst = re.findall(r"amounts?\s*:\s*(.*)", txt, re.I)
    Mode_lst = re.findall(r"modes?\s*:\s*(.*)", txt, re.I)
    LSP_lst = re.findall(r"LSP\s*:\s*(.*)", txt, re.I)
    Amount.append(Amount_lst[0].strip() if Amount_lst else "No details")
    Mode.append(Mode_lst[0].strip() if Mode_lst else "No details")
    LSP.append(LSP_lst[0].strip() if LSP_lst else "No details")

df["Note"].apply(lambda x: extract_info(x))
df["Amount"] = Amount
df["Mode"] = Mode
df["LSP"] = LSP
df = df[["Record", "Amount", "Mode", "LSP"]]
By using regex as above we can extract each piece of information and write it to a separate column.
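As a possibly simpler alternative, here is a minimal sketch using pandas' vectorized str.extract, assuming df has the Record and Note columns shown in the question:
import re

# Pull each key out separately; rows without that key get NaN.
df["Amount"] = df["Note"].str.extract(r"amount\s*:\s*(.+)", flags=re.I, expand=False).str.strip()
df["Mode"] = df["Note"].str.extract(r"mode\s*:\s*(.+)", flags=re.I, expand=False).str.strip()
df["LSP"] = df["Note"].str.extract(r"LSP\s*:\s*(.+)", flags=re.I, expand=False).str.strip()
df = df[["Record", "Amount", "Mode", "LSP"]]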

TypeError when attempting stacking

I would be grateful for some help here... I am trying to implement stacking, but with the code below I keep getting TypeError: object of type 'generator' has no len(). Would anyone know how to correct this? Many thanks.
y_lr = clf_lr.predict(X_test) # Linear
y_rf = forest.predict(X_test) # Random Forest
y_gb = clf_gb.predict(X_test) # Gradient Boosting
dtest = xgb.DMatrix(X_test)
y_xgb = clf_xgb.predict(dtest) # XGBoost
y_nn = model_nn.predict(X_test) # DNN
stack = pd.DataFrame(data={'lr':y_lr, 'gb':y_gb, 'xgb':y_xgb, 'nn':y_nn, 'true':y_test})
This is what I get when I do just data = {'rf':y_rf, 'gb':y_gb, 'xgb':y_xgb, 'nn':y_nn, 'true':y_test} and print data:
{'gb': array([ 5176163.73806255, 6717797.72382604, 7079943.66873864, ...,
12224999.12632363, 6632903.39968627, 7314008.41080324]),
'nn': <generator object _as_iterable at 0x7f535ca0a780>,
'rf': array([ 3525000. , 6713017.2, 5577500. , ..., 11708300. ,
6255000. , 6290000. ]),
'true': 1715 5200000.0
17126 6796548.0
28143 7300000.0
10037 12581315.0
16133 7500000.0
17356 6450000.0
28348 7300000.0
24818 6100000.0
2240 5000000.0
25878 3300000.0
8533 4058255.0
4374 5063160.0
29140 6200000.0
16599 3606128.0
5647 6500000.0
4878 4347200.0
30267 7500000.0
18793 4762800.0
22865 5850000.0
20585 6600000.0
1166 3000000.0
21417 6100000.0
13557 4200000.0
8716 10000000.0
1486 8000000.0
7916 4650776.0
28010 8600000.0
21926 5972181.0
3567 4498491.0
6729 6850000.0
...
1224 5550000.0
19234 5100000.0
1201 9500000.0
11412 5000000.0
27141 6623516.0
28107 6800000.0
19347 3834328.0
17965 8300000.0
18584 6440000.0
11473 6518400.0
16907 11200000.0
28412 5950000.0
18744 5700000.0
15247 7000000.0
19411 6907232.0
3185 8000000.0
6348 3413300.0
19544 4800000.0
21309 12800000.0
10733 5107200.0
17367 6900000.0
11761 14500000.0
27435 7251680.0
13039 6300000.0
1966 12800000.0
3664 4506800.0
3626 4467294.0
25682 15250000.0
8988 1000000.0
24637 7700000.0
Name: price_doc, dtype: float64,
'xgb': array([ 4634399. , 5984703. , 6499839.5, ..., 12502588. ,
6457020.5, 7572096. ], dtype=float32)}
After trying different iterations, this is the code that worked! Wrapping each prediction in list() materializes the generator returned by the DNN model, so the DataFrame constructor can take its length.
y_lr = list(clf_lr.predict(X_test)) # Linear
y_rf = list(forest.predict(X_test)) # Random Forest
y_gb = list(clf_gb.predict(X_test)) # Gradient Boosting
dtest = xgb.DMatrix(X_test)
y_xgb = list(clf_xgb.predict(dtest)) # XGBoost
y_nn = list(model_nn.predict(X_test)) # DNN
stack = pd.DataFrame({'lr':y_lr, 'gb':y_gb, 'xgb':y_xgb,'nn':y_nn, 'true':y_test})
For sklearn-compatible estimators you can use the StackingClassifier to make this job easier. It will merge the outputs of each model into a single dataset and then pass these on to a final meta-estimator (which is "stacked" on top of the other models).
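A rough sketch of that API, shown here with StackingRegressor since the example above predicts prices; the base estimators are illustrative placeholders rather than the exact models from the question, and X_train/y_train are assumed to exist alongside X_test:
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

stack = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor()),
        ("gb", GradientBoostingRegressor()),
    ],
    final_estimator=LinearRegression(),  # meta-estimator fitted on the base models' predictions
)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)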
As an alternative, you could try using a library called skdag (disclaimer: I am the author) which lets you compose your classifiers in any kind of workflow, including the stacking architecture you described in your example:
from skdag import DAGBuilder

stack = (
    DAGBuilder()
    .add_step("pass", "passthrough")
    .add_step("lr", clf_lr, deps=["pass"])
    .add_step("rf", forest, deps=["pass"])
    .add_step("gb", clf_gb, deps=["pass"])
    .add_step("xgb", clf_xgb, deps=["pass"])
    .add_step("meta", LinearRegression(), deps={
        "lr": [1], "rf": [1], "gb": [1], "xgb": [1]
    })
    .make_dag()
)
stack.fit(X_train, y_train)
stack.predict(X_test)
You can read more about this in the docs for skdag.
Both the StackingClassifier and the DAG are slightly different from your example in that they use predict_proba rather than predict as the inputs for your final meta-estimator, but maybe this is what you want anyway? A call to predict will apply thresholds and drop lots of valuable information.

Python: How to calculate tf-idf for a large data set

I have the following data frame df, which I converted from an SFrame:
URI name text
0 <http://dbpedia.org/resource/Digby_M... Digby Morrell digby morrell born 10 october 1979 i...
1 <http://dbpedia.org/resource/Alfred_... Alfred J. Lewy alfred j lewy aka sandy lewy graduat...
2 <http://dbpedia.org/resource/Harpdog... Harpdog Brown harpdog brown is a singer and harmon...
3 <http://dbpedia.org/resource/Franz_R... Franz Rottensteiner franz rottensteiner born in waidmann...
4 <http://dbpedia.org/resource/G-Enka> G-Enka henry krvits born 30 december 1974 i...
I have done the following:
from textblob import TextBlob as tb
import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

bloblist = []
for i in range(0, df.shape[0]):
    bloblist.append(tb(df.iloc[i, 2]))

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
But this is taking a lot of time as there are 59000 documents.
Is there a better way to do it?
I am a bit confused about this subject myself, but I found a few solutions on the internet that use Spark. You can have a look here:
https://www.linkedin.com/pulse/understanding-tf-idf-first-principle-computation-apache-asimadi
On the other hand, I tried the following method and the results were not bad. Maybe you want to try it (a minimal sketch follows after the results below):
I have a word list. This list contains each word and its count.
I found the average of the word counts.
I selected a lower limit and an upper limit based on the average value
(e.g. lower bound = average / 2 and upper bound = average * 5).
Then I created a new word list keeping only the words between the lower and upper bounds.
With these bounds I got this result:
Before normalization word vector length : 11880
Mean : 19 lower bound : 9 upper bound : 95
After normalization word vector length : 1595
The cosine similarity results were also better.
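A minimal sketch of that frequency-bound filtering, assuming a plain dict of word counts; the function name and the reuse of bloblist from the question are illustrative:
from collections import Counter

def filter_vocabulary(word_counts):
    # Keep only words whose count lies between average/2 and average*5.
    average = sum(word_counts.values()) / float(len(word_counts))
    lower, upper = average / 2.0, average * 5.0
    return {w: c for w, c in word_counts.items() if lower <= c <= upper}

# Example: pool the counts over all documents in bloblist, then filter.
word_counts = Counter(word for blob in bloblist for word in blob.words)
vocabulary = filter_vocabulary(word_counts)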

Fetching top n records in pandas pivot , based on multiple criteria and plotting them with matplotlib

Use case: extending the pivot functionality of pandas. Fetch the top n names and plot, for each one, its own click % against the number of records with that name.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'name':['A', 'A', 'B', 'B','C','A'], 'click':[1,1,0,1,1,0]})
click name
0 1 A
1 1 A
2 0 B
3 1 B
4 1 C
5 0 A
[6 rows x 2 columns]
# fraction of records present & clicks as a fraction of each name's OWN records present
f=df1.pivot_table(rows='name', aggfunc=[len, np.sum])
f['len']['click']/sum(f['len']['click']) , f['sum']['click']/sum(f['sum']['click'])
(name
A 0.500000
B 0.333333
C 0.166667
Name: click, dtype: float64, name
A 0.50
B 0.25
C 0.25
Name: click, dtype: float64)
But to be able to plot them, I need to store the top n records in an object that matplotlib supports.
I tried storing the "top names" A, B, C, etc. by creating a dict from the output of
f['len']['click']/sum(f['len']['click'])
sorted by values, after which I stored the "click %" [A -> 0.50, B -> 0.25, C -> 0.25] in the same dictionary.
Since this is clearly overkill, I am wondering if there's a more pythonic way to do this.
I also tried head with a groupby clause, but it doesn't give me what I am looking for. I am looking for a dataframe as above
A 0.500000
B 0.333333
C 0.166667
Name: click, dtype: float64, name
A 0.50
B 0.25
C 0.25
except that the top n logic should be embedded (head(n) does not work because n depends on my data set; I guess I need to use "apply"?), and after this the object, which is a Series, needs to be recognized by matplotlib with its own labels (the top n names here).
Here's my dict-based implementation: # this is an OVERKILL just to fetch the top n by a custom criterion as above
def freq_counts(df_var, n):  # df_var is like df1.name, to keep the top n logic generic for any column
    perct_freq = dict((df_var.value_counts() * 100) / len(df_var))
    vec = []
    for key, value in perct_freq.items():
        if value >= n:
            vec.append([key, value])
    return vec

freq_counts(df1.name, 3)  # e.g. top 3 freq counts - the names are in vec[i][0]
# In this example "perct_freq" is a Series that I convert to a dict - what an overkill!
(1) Store the actual occurrences (len of names) and find each name's fraction of the population.
(2) Against this, also find the "success outcome" and express it as a fraction of its OWN population.
(3) Finally plot the top n name(s), i.e. the output of (1) & (2), in the same plot; the criterion for top n should be based on (1) as a percentage.
i.e. for (1) & (2) use dataframes that support plot with
name as labels on the x axis
(1) on the primary y axis
(2) on the secondary y axis
PPS: In the code above,
(1) is f['len']['click']/sum(f['len']['click']) and
(2) is f['sum']['click']/sum(f['sum']['click'])
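A hedged sketch of one more pythonic way to get both fractions, pick the top n by record share, and plot them with name labels and twin y axes (the 20% threshold and the styling are illustrative, not from the question):
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'C', 'A'], 'click': [1, 1, 0, 1, 1, 0]})

# (1) each name's share of all records, (2) each name's share of all clicks
grouped = df1.groupby('name')['click'].agg(['size', 'sum'])
record_frac = grouped['size'] / grouped['size'].sum()
click_frac = grouped['sum'] / grouped['sum'].sum()

# top n by record share: here, keep names holding at least 20% of the records
top = record_frac[record_frac >= 0.20].sort_values(ascending=False)

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.bar(top.index, top.values)                            # (1) on the primary y axis
ax2.plot(top.index, click_frac[top.index].values, 'o-r')  # (2) on the secondary y axis
ax1.set_xlabel('name')
ax1.set_ylabel('fraction of records')
ax2.set_ylabel('fraction of clicks')
plt.show()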