What are hp.Discrete and hp.RealInterval? Can I include more values in hp.RealInterval instead of just 2? - tensorboard

I am using hyperparameter tuning with the HParams Dashboard in TensorFlow 2.0-beta0, as suggested here: https://www.tensorflow.org/tensorboard/r2/hyperparameter_tuning_with_hparams
I am confused by step 1 and could not find a better explanation. My questions relate to the following lines:
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16, 32]))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
My question:
I want to try more dropout values instead of just two (0.1 and 0.2). If I put more values in it, it throws an error: 'maximum 2 arguments can be given'. I tried to look for documentation but could not find anything about where these hp.Discrete and hp.RealInterval functions come from.
Any help would be appreciated. Thank you!

Good question. The notebook tutorial is lacking in many respects. At any rate, here is how you do it at a certain resolution res:
for dropout_rate in tf.linspace(
    HP_DROPOUT.domain.min_value,
    HP_DROPOUT.domain.max_value,
    res,
):
    ...  # train and log one run with this dropout_rate

Looking at the implementation, to me it really doesn't seem to be GridSearch but Monte Carlo/random search (note: this is not 100% correct, please see my edit below).
So on every iteration a random float from that real interval is chosen.
If you want GridSearch behavior, just use "Discrete". That way you can even mix and match GridSearch with random search, which is pretty cool!
Edit (27th of July '22, based on the comment of dpoiesz):
Just to make it a little clearer: since concrete values are sampled from the intervals and returned, those are added to the grid dimension, and grid search is performed using them.
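For comparison, here is a minimal sketch of the pure-GridSearch alternative (the dropout values are arbitrary placeholders): list concrete values in hp.Discrete and iterate over the domain's values directly.

from tensorboard.plugins.hparams import api as hp

# Grid-search behavior: every listed value is tried exactly once.
HP_DROPOUT_GRID = hp.HParam('dropout', hp.Discrete([0.1, 0.15, 0.2, 0.25]))

for dropout_rate in HP_DROPOUT_GRID.domain.values:
    print(dropout_rate)  # train and log one run per value here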

RealInterval is a (min, max) pair from which the hparam will pick a number.
Here is a link to the implementation, for better understanding.
The thing is that, as currently implemented, there does not seem to be any difference between the two, except if you call the sample_uniform method.
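For instance, a small sketch of that call (assuming the HP_DROPOUT domain defined in the question, and that sample_uniform accepts a random-number generator, as in the linked implementation):

import random

# Draw one dropout rate uniformly from [0.1, 0.2].
rate = HP_DROPOUT.domain.sample_uniform(random.Random(42))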

Note that tf.linspace breaks the mentioned sample code when saving the current value.
See https://github.com/tensorflow/tensorboard/issues/2348
In particular, see OscarVanL's comment about his quick & dirty workaround.


Extracting number of bits in a macroblock from VVC VTM reference software

[Image (final): Result after calculating and displaying the difference]
I am new to VVC and I am going through the reference software's code, trying to understand it. I have encoded and decoded videos using the reference software. I want to extract the bitstream from it; I want to know the number of bits there are in each macroblock. I am not sure which class I should be working with; for now I am looking at mv.cpp, QuantRDOQ.cpp, and TrQuant.cpp.
I am afraid of messing the code up completely, and I don't know where to add what lines of code.
[Image (start): Result after calculating and displaying the difference]
P.S. The linked pictures are from after my problem was solved; I attached these pictures because of my query in the comments.
As the error says, getNumBins() is not supported by the CABAC estimator. So you should make sure you call it "only" during the encoding, and not during the RDO.
This should do the job:
if (isEncoding())
{
  before = m_BinEncoder.getNumBins();   // bins written so far
}
coding_unit( cu, partitioner, cuCtx );  // context-code the current CU
if (isEncoding())
{
  after = m_BinEncoder.getNumBins();    // bins written after the CU
  diff  = after - before;               // bins spent on this CU
}
The simplest solution that I'm aware of is on the encoder side.
The trick is to compute the difference in the number of written bits "before" and "after" encoding a Coding Unit (CU) (aka macroblock). This happens in the CABACWriter.cpp file.
You should go to the coding_tree() function, where the coding_unit() function is called; it is responsible for context-coding all syntax elements in the current CU.
There, you may call the function getNumBins() twice: once before and once after coding_unit(). The difference of the two values should do the job for you.

UserWarning in pymc3: What does reparameterize mean?

I built a pymc3 model using the DensityDist distribution. I have four parameters, of which three use Metropolis and one uses NUTS (this is chosen automatically by pymc3). However, I get two different UserWarnings:
1. Chain 0 contains [number] diverging samples after tuning. If increasing target_accept does not help, try to reparameterize.
May I know what reparameterize means here?
2. The acceptance probability in chain 0 does not match the target. It is [value], but should be close to 0.8. Try to increase the number of tuning steps.
Digging through a few examples, I used random_seed, discard_tuned_samples, step = pm.NUTS(target_accept=0.95), and so on, and got rid of these user warnings. But I couldn't find details of how these parameter values are decided. I am sure this has been discussed in various contexts, but I am unable to find solid documentation on it. I was doing trial and error as below:
with patten_study:
    # SEED = 61290425  # 51290425
    step = pm.NUTS(target_accept=0.95)
    trace = pm.sample(step=step)
    # trace = pm.sample(4000, tune=10000, step=step,
    #                   discard_tuned_samples=False, random_seed=SEED)
I need to run these on different datasets, so I am struggling to fix these parameter values for each dataset I am using. Is there any way to set these values, check the outcome (whether there are any user warnings, and then try other values), and run it in a loop?
Pardon me if I am asking something stupid!
In this context, reparametrization basically means finding a different but equivalent model that is easier to compute. There are many things you can do, depending on the details of your model:
- Instead of using a Uniform distribution, use a Normal distribution with a large variance.
- Change from a centered hierarchical model to a non-centered one (see the sketch after this list).
- Replace a Gaussian with a Student-T.
- Model a discrete variable as a continuous one.
- Marginalize variables, like in this example.
Whether these changes make sense or not is something you should decide, based on your knowledge of the model and the problem.
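As an illustration of the non-centered change mentioned above, here is a minimal pymc3 sketch (the model shape and priors are made up for the example):

import pymc3 as pm

# Centered form (often causes divergences):  theta ~ Normal(mu, sigma)
# Non-centered form: sample a standard-normal offset and rescale it,
# which gives an equivalent model with an easier posterior geometry.
with pm.Model() as model:
    mu = pm.Normal("mu", mu=0.0, sd=5.0)
    sigma = pm.HalfNormal("sigma", sd=5.0)
    theta_offset = pm.Normal("theta_offset", mu=0.0, sd=1.0, shape=8)
    theta = pm.Deterministic("theta", mu + sigma * theta_offset)
    step = pm.NUTS(target_accept=0.95)
    trace = pm.sample(step=step)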

Distinguishing between terms of different domains

What I am trying to do:
I am trying to take a list of terms and distinguish which domain they come from. For example, "intestine" would be from the anatomical domain, while the term "cancer" would be from the disease domain. I am getting these terms from different ontologies such as DOID and FMA (they can be found at bioportal.bioontology.org).
The problem:
I am having a hard time working out the best way to implement this. Currently I am naively taking the terms from the DOID and FMA ontologies and removing from the DOID list any term that also appears in the FMA list, which we know is anatomical (the DOID list contains terms that may be partly anatomical, such as colon carcinoma: colon being anatomical and carcinoma being disease).
Thoughts:
I was thinking that I could extract root words, prefixes, and suffixes for the different term domains and try to match them to the terms in the list. Another idea is to use more information from the ontologies, such as their metadata, to distinguish between the terms.
Any ideas are welcome.
As a first run, you'll probably have the best luck with bigrams. As an initial hypothesis, diseases are usually noun phrases, and usually have a very English-specific structure where NP -> N N, like "liver cancer", which means roughly the same thing as "cancer of the liver." Doctors tend not to use the latter, while the former should be caught with bigrams quite well.
Use the two ontologies you have there as starting points to train some kind of bigram model. Like Rcynic suggested, you can count them up and derive probabilities. A Naive Bayes classifier would work nicely here. The features are the bigrams; classes are anatomy or disease. sklearn has Naive Bayes built in. The "naive" part means, in this case, that all your bigrams are independent of each other. This assumption is fundamentally false, but it works well in a lot of circumstances, so we pretend it's true.
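A minimal sketch of that setup with sklearn (the term lists here are made-up placeholders for terms pulled from FMA and DOID):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training terms standing in for the two ontologies.
anatomy_terms = ["liver", "small intestine", "colon"]
disease_terms = ["liver cancer", "colon carcinoma", "anemia"]

X = anatomy_terms + disease_terms
y = ["anatomy"] * len(anatomy_terms) + ["disease"] * len(disease_terms)

# ngram_range=(1, 2) makes unigrams and bigrams the features.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(X, y)

print(clf.predict(["colon cancer"]))  # e.g. ['disease']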
This won't work perfectly. Since it's your first pass, you should be prepared to probe the output to understand how it arrived at its answers and to find the cases where it failed. When you find trends of errors, tweak your model and try again.
I wouldn't recommend WordNet here. It wasn't written by doctors, and since what you're doing relies on precise medical terminology, it's probably going to add bizarre meanings. Consider, from nltk.corpus.wordnet:
>>> from nltk.corpus import wordnet
>>> from pprint import pprint
>>> livers = wordnet.synsets("liver")
>>> pprint([l.definition() for l in livers])
[u'large and complicated reddish-brown glandular organ located in the upper right portion of the abdominal cavity; secretes bile and functions in metabolism of protein and carbohydrate and fat; synthesizes substances involved in the clotting of the blood; synthesizes vitamin A; detoxifies poisonous substances and breaks down worn-out erythrocytes',
u'liver of an animal used as meat',
u'a person who has a special life style',
u'someone who lives in a place',
u'having a reddish-brown color']
Only one of these is really of interest to you. As a null hypothesis, there's an 80% chance WordNet will add noise, not knowledge.
The naive approach - what precision and recall is it getting you? If you set up a test case now, you can track your progress as you apply more sophisticated methods.
I don't know what initial set you are dealing with, but one thing to try is to get your hands on annotated documents (maybe use Mechanical Turk). The documents need to be tagged with the domains you're looking for: anatomical or disease.
Then counting and dividing will tell you how likely a word you encounter is to belong to a domain. From there, the next step can be to tweak some weights.
Another approach (going in a whole other direction) is using WordNet. I don't know if it will be useful for exactly your purposes, but it's a massive ontology, so it might help.
Python has bindings to use Wordnet via nltk.
from nltk.corpus import wordnet as wn
wn.synsets('cancer')
which gives [Synset('cancer.n.01'), Synset('cancer.n.02'), Synset('cancer.n.03'), Synset('cancer.n.04'), Synset('cancer.n.05')]
http://wordnetweb.princeton.edu/perl/webwn
Let us know how it works out.

How can I select a Yes/No questionID dynamically in a Weka J48 app

I'm developing a Weka app, like Akinator, using the J48 method.
Sample:
http://jbossews-vdoctor.rhcloud.com/doctor
The following is the app's table definition and sample data.
qa means question id (please refer to the master list, which can be set by the user) + answer (1: Yes, 2: I don't know, 3: No).
One line per question & answer.
id,qa,class
A,13,1
A,23,1
B,13,2
B,21,2
The point is to find a way to select the question which maximizes the entropy.
Currently this app regards the first node id of the decision tree as the best question,
and then it narrows down the options by elimination.
But the accuracy was too bad for it to work correctly, so I'd like to improve it.
I noticed that the qa column was identified as numeric, so it could not build a correct decision tree.
I am confused about what I should change to improve it: the dataset, the table definition, or the logic?
This is quite a broad question that you are asking, and without code or a clear understanding of the problem it is quite difficult to answer, but I'll give some tips for improvement:
Table Definition
What may have made more sense here is to have an attribute for each question, instead of using a single instance per question-and-answer pair. For example, instead of id, qa and class, you could have A, B, C, D, E, F and Disease, as in the reshaped sample below. (I believe there were six questions; naming each attribute descriptively would be recommended instead of A-F.)
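For instance, the four sample rows from the question reshape into one row per case, with one column per question (Q1 and Q2 are placeholder names for the first two questions):

id,Q1,Q2,class
A,3,3,1
B,3,1,2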
Dataset
You will need at least as many cases as there are diseases, if not more, to define multiple subsets of the problem space for the same disease. There will likely be cases where some questions are irrelevant or missing, and the model may need to handle such situations.
Logic
In such a case, you could run the questionnaire by starting at the root node and asking questions from node to node until an estimated class is reached, as in the sketch below.
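A minimal, self-contained sketch of that traversal (a hypothetical node structure, not Weka's API):

class Node:
    def __init__(self, question=None, children=None, label=None):
        self.question = question        # text to ask; None at a leaf
        self.children = children or {}  # answer -> child Node
        self.label = label              # class name at a leaf

def run_questionnaire(node, ask):
    while node.question is not None:
        answer = ask(node.question)     # expected '1', '2' or '3'
        node = node.children[answer]
    return node.label

# Example: one question, two outcomes.
tree = Node("Do you have a fever?", {
    "1": Node(label="disease_x"),
    "3": Node(label="no_disease"),
})
print(run_questionnaire(tree, lambda q: "1"))  # -> disease_x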
I hope this helps in improving your existing model.
NOTE: I tried your questionnaire and answered No to all of your questions, and I strangely ended up with Trichomoniasis. Perhaps there could be a 'No Disease' category for your training data also.
My nominal qa data builds a decision tree like this with binary splits.
Actually this structure doesn't make sense, because there is a tree on only one side. When qa equals 23 the answer would always be '3'. It's irrational.
http://www.fastpic.jp/viewer.php?file=2693704973.jpg
You should first reformat your features so that all possible questions A, B, C, D... become binary features and your final answer (i.e. what to guess) becomes the target class, if you want your tree to produce a sequence of questions leading to your answer. Your data will certainly be sparse (many questions without data/answers).
By the way, a binary decision tree is not the right ML structure and algorithm for building an Akinator-like or 20Q/Guess Who game. Please look at some suggestions here: https://stats.stackexchange.com/questions/6074/akinator-com-and-naive-bayes-classifier

Some confusion over Numpy + Scipy + matplotlib Spectrum Analyzer code

I've been attempting to understand the code at the bottom of http://www.frank-zalkow.de/en/code-snippets/create-audio-spectrograms-with-python.html, though sadly I haven't been getting anywhere with it. I don't think I'm expected to understand most of the code, as I have limited experience with FFTs, but unfortunately I'm also having trouble understanding how the graph is generated. I'm also making very limited progress with a trial-and-error approach, because my computer lags heavily and because of the relatively long time it takes to generate a graph.
With that being said, I need a way to scale the graph so that it only displays values up to 5000 Hz, though still on a logarithmic scale. I'd also like to understand how the wav file is sampled, and what values I can edit in order to take more samples per second. Can somebody explain how both of these points work, and how I can edit the code to fulfill these requirements?
Hm, this code is by me, so I'll gladly help you understand it. It's maybe not best practice and there may be several ways to improve it – suggestions are welcome. But at least it worked for me.
The function stft does a standard short-time Fourier transform of an audio signal with the help of numpy strides. The function logscale_spec takes an STFT and scales it logarithmically. This is maybe a bit dirty and there must be a better way to do it, but it worked for me. plotstft is the function that finally reads a wave file via scipy.io.wavfile, combines the prior two functions, and makes a plot with matplotlib's imshow. If you have a mono wave file, you should be able to just call plotstft("/path/to/mono.wav").
That was an overview – if I should explain some things in more detail, just say so.
To your questions. To leave out some frequency values: you can get the frequency values of the FFT with np.fft.fftfreq(binsize, 1./sr). You just have to find the index of your cutoff value and leave out the values of the STFT above it.
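A minimal sketch of that cutoff (binsize, sr and spec are placeholders standing in for the FFT size, sample rate, and STFT array from the linked snippet, with frequency along the second axis):

import numpy as np

binsize, sr = 1024, 44100             # FFT size and sample rate (placeholders)
spec = np.zeros((100, binsize // 2))  # stand-in for the STFT from the snippet

# Frequencies of the positive FFT bins, in ascending order.
freqs = np.fft.fftfreq(binsize, 1.0 / sr)[:binsize // 2]
cutoff_idx = np.searchsorted(freqs, 5000)  # first bin above 5000 Hz
spec = spec[:, :cutoff_idx]                # keep only bins up to 5000 Hz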
I don't understand your second question... You can have a look at all the samples of your wave file with:
>>> import scipy.io.wavfile as wav
>>> x = wav.read("/path/to/file.wav")
>>> x
(44100, array([4554752, 4848551, 3981874, ..., 2384923, 2040309, 294912], dtype=int32))
>>> x[1]
array([4554752, 4848551, 3981874, ..., 2384923, 2040309, 294912], dtype=int32)
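The first element of that tuple is the sample rate (44100 Hz here) and the second is the raw sample array. Note that the sample rate is a property of the wav file itself, so taking more samples per second would mean resampling the audio rather than changing the reading code. For example:

>>> sr, samples = wav.read("/path/to/file.wav")
>>> sr                        # sample rate in Hz
44100
>>> len(samples) / float(sr)  # duration of the file in seconds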