I've looked at the other answers to this issue and none of them are helping me. I am trying to run a simple random cut forest algorithm. I have a small data set of IPs which have been stripped down to only have numbers. I still get this error. It only has one column of these numbers. The CSV looks like this:
176162144
176862141
176762141
176761141
176562141
Have you looked at this sample notebook, and tried using it with your own data?
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb
In a nutshell, it reads the CSV file with Pandas and trains the model like this:
rcf = RandomCutForest(role=execution_role,
train_instance_count=1,
train_instance_type='ml.m4.xlarge',
data_location='s3://{}/{}/'.format(bucket, prefix),
output_path='s3://{}/{}/output'.format(bucket, prefix),
num_samples_per_tree=512,
num_trees=50)
# automatically upload the training data to S3 and run the training job
rcf.fit(rcf.record_set(taxi_data.value.as_matrix().reshape(-1,1)))
You didn't say what your use case was, but as you're working with IP addresses, you may find the IP Insights built-in algorithm useful too: https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html
I was utilizing the sample notebook Julien Simon mentioned earlier, but at some point the data was ending up as strings! The funny thing about RCF algorithms is they have to run on integer data.
What I did is I made sure to cast the array as an int array as a double check and vallah! It worked. I am at loss over how the data ended up in a string format but alas, that was the issue. Simple solution.
Related
I'm currently trying to implement YOLOv3 object detection model in C(only detection, not training).
I have tested my convolution method with arbitrary values and it seems to be working as I expected.
Before stacking up multiple method calls to do forward propagation, I thought it would be safe to test with the actual pretrained weight file data.
When I look up Darknet's pre-trained weight file, it was a huge chunk of binary files. I tried to convert it to hex and decimals, but it still doesn't look simple to pinpoint what part of values to use.
So, my question is, what should I do to extract the decimal numbers of the weights or the filter values so that I can use them in the same order of the forward propagation happening in YOLOv3?
*I'm currently trying to build my c version of YOLOv3 using the structure image shown in https://www.itread01.com/content/1541167345.html
*My c code will be run on an FPGA board called MicroZed, along with other HDL code.
*I tried to plug some printf functions into some places of Darknet code to see what kinds of data are moving around when YOLOv3 runs, however, when I ran it on in Linux terminal, it didn't show anything new and kept outputting the same results.
Any help or advice will be really appreciated. Thank you!
I am not too sure if there is a direct way to read darknet weights, but you can convert it into .h5 format and obtain the weight values from it
You can convert the darknet yolov3 weights into .h5 format (used by keras) by using the appropriate command from this repository.
You can choose the command based on your Yolo version from the list shown in the ReadMe of the linked repo. For the standard yolov3, the command for converting is
python tools/model_converter/convert.py cfg/yolov3.cfg weights/yolov3.weights weights/yolov3.h5
Once you have the .h5weights, you can use the below code snippet for obtaining the
values from the weights. credit/source
import h5py
path = "<path to weights>.h5"
weights = {}
keys = []
with h5py.File(path, 'r') as f: # open file
f.visit(keys.append) # append all keys to list
for key in keys:
if ':' in key: # contains data if ':' in key
param_name = f[key].name
weights[f[key].name] = f[key].value
print(param_name,weights[f[key].name])
I'm new to Google Data Studio, and currently stuck on the following problem.
I have some log data in BigQuery and I am trying to visualize some info out of my logs using Google Data Studio.
The problem is when I use REGEXP_MATCH on one specific dimension of my data, it cannot match the RegEx. When I use REGEXP_MATCH against any other dimension of my log data, it works without a problem.
I am wondering, whether that could be due to the long strings that I have in that specific dimension, or are there any other thoughts?
By the way, I am able to make changes to that dimension using REGEXP_REPLACE, but cannot even do REGEXP_MATCH against the text that REGEXP_REPLACE replaced in there.
I have been working around this for days now and any advise will be really appreciated.
I finally found out that REGEXP_EXTRACT works against that dimension. Then it's also possible to do REGEXP_MATCH on the output of REGEXP_EXTRACT with no problem.
I still don't know why REGEXP_MATCH does not work directly against that specific dimension, maybe due to long strings or some specific letters that I have in my data in that dimension or some other reason.
I have created a rather large CSV file (63000 rows and around 40 columns) and I want to join it with an ESRI Shapefile.
I have used ArcPy but the whole process takes 30! minutes. If I make the join with the original (small) CSV file, join it with the Shapefile and then make my calculations with ArcPy and continously add new fields and calculate the stuff it takes 20 minutes. I am looking for a faster solution and found there are other Python modules such as PySHP or DBFPy but I have not found any way for joining tables, hoping that could go faster.
My goal is already to get away from ArcPy as much as I can and preferable only use Python, so preferably no PostgreSQL and alikes either.
Does anybody have a solution for that? Thanks a lot!
Not exactly a programmatical solution for my problem but a practical one:
My shapefile is always static, only the attributes of the features will change. So I copy my original shapefile (only the basic files with endings .shp, .shx, .prj) to my output folder and rename it to the name I want.
Then I create my CSV-File with all calculations and convert it to DBF and save it with the name of my new shapefile to the output folder too. ArcGIS will now load the shapefile along with my own DBF file and I don't even need to do any tablejoin at all!
Now my program runs through in only 50 seconds!
I am still interested in more solutions for the table join problem, maybe I will encounter that problem again in the future where the shapefile is NOT always static. I did not really understand Nan's solution, I am still at "advanced beginner" level in Python :)
Cheers
I've been following Dave Miller's ANN C++ Tutorial, and I've been having some problems getting it to function as expected.
You can view the code I'm working with here. It's an XCode project, but includes the main.cpp and data set file.
Previously, this program would only gives outputs between -1 and 1, I'm presuming due to the use of the tanh function. I've manipulated the data inputs so I can input my data that is much larger and have valid outputs. I've simply done this by multiplying the input values by 0.0001, and multiplying the output values by 10000.
The training data I'm using is the included CSV file. The last column is the expected output, the rest are inputs. Am I using the wrong mathematical function for these data?
Would you say that this is actually learning? This whole thing has stressed me out so much, I understand the theory behind ANN's but just can't implement from scratch for myself.
The net recent average error definitely gets smaller and smaller, which to me would say it is learning.
I'm sorry if I haven't explained myself very well, I'm very new to ANN's and this whole thing is very confusing to me. My university lecturers are useless when it comes to the practical side, they only teach us the theory of it.
I've been playing around with the eta and alpha values, along with the number of hidden layers.
You explained yourself quite well, if the net recent average is getting lower and lower it probably means that the network is actually learning, but here is my suggestion about how to be completely sure.
Take you CSV file and split it into 2 files one should be about 10% of the all data and the other all the remaining.
You start with an untrained network and you run your 10% file trough the net and for each line you save the difference between actual output and expected result.
Then you train the network only with the 90% of the CSV file you have and finally you re run trough the NET the first 10% file again and you compare the differences you had on the first run with the the latest ones.
You should find out that the new results are much closer to the expected values than the first time, and this would be the final proof that your network is learning.
Does this make any sense ? if not please send share some code or send me a link to the exercise you are running and I will try to explain it in code.
Is there any way to pinpoint the badrecord when we are loading the data using hive or while processing the data.
The scenario Goes like this.
Suppose I have file that need to be loaded as table using hive which got 1 Million records in it. Delimited by some '|' symbol.
So suppose after Half a million record processing I encounter a problem. IS there anyway to debug it or precisely pinpoint the record/records having the issues.
If you are not clear about my question please let me know.
I know there is a skipping of bad record in mapreduce (Kind of percentage). I would like to get this in the perspective of hive.
Thanks In Advance.