Searching for a way to get a smaller RDF (N3) dataset - C++

I have downloaded the yago.n3 dataset.
However, for testing I wish to work with a smaller version of the dataset (the full dataset is 2 GB), because even a small change takes a long time to debug.
Therefore, I tried copying a small portion of the data into a separate file, but this did not work and threw lexical errors.
I have seen earlier posts, but they are about big datasets, whereas I am looking for smaller ones.
Is there any way to obtain a smaller portion of the same dataset?

If you have an RDF parser at hand to read your yago.n3 file, you can parse it and write to a separate file as many RDF triples as you want/need for the smaller dataset you run your experiments with.
If you find the data in N-Triples format (i.e. one RDF triple per line), you can just take as many lines as you want and make the dataset as small as you like: head -n 10 filename.nt would give you a tiny dataset of 10 triples.
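If you would rather do the trimming from C++ (for example on Windows, where head is not available by default), a minimal sketch that copies the first N lines of an N-Triples file could look like the following; the file names and the line limit are just placeholders.

#include <fstream>
#include <string>

int main() {
    std::ifstream in("yago.nt");        // input in N-Triples format, one triple per line
    std::ofstream out("yago_small.nt"); // the reduced dataset
    std::string line;
    std::size_t kept = 0;
    const std::size_t limit = 10;       // how many triples to keep

    while (kept < limit && std::getline(in, line)) {
        out << line << '\n';
        ++kept;
    }
}

Note that this only works on N-Triples, not on general N3/Turtle, because those formats allow a single triple to span several lines.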

Related

Concatenate monthly MODIS data

I downloaded daily MODIS DATA LEVEL 3 data for a few months from https://disc.gsfc.nasa.gov/datasets. The filenames are of the form MCD06COSP_M3_MODIS.A2006001.061.2020181145945, but the files do not contain any time dimension. Hence, when I use ncecat to concatenate the files, the date information is missing from the resulting file. I want to know how to add the time information to the combined dataset.
Your commands look correct. Good job crafting them. Not sure why it's not working. Possibly the input files are in HDF4 format (do they have a .hdf suffix?) and your NCO is not HDF4-enabled. Try to download the files in netCDF3 or netCDF4 format and your commands above should work. If that's not what's wrong, then examine the output files at each step of your procedure, identify which step produces the unintended results, and narrow your question from there. Good luck.

I need help designing my C++ console application

I have a task to complete.
There are 4000+ CSV files of two types, both related to each other.
The two types are:
1. Country2.csv
2. Security_Name.csv
Contents of Country2.csv:
Company Name;Security Name;;;;Final NOS;Final FFR
Contents of Security_Name.csv:
Date;Close Price;Volume
There are multiple countries, and for each country there are multiple security files.
Now I need to READ them, do some CALCULATIONS, and then WRITE the output to other files.
READ
Read both files, Country 2.csv and Security.csv, and extract all the data from them.
For example:
Read France 2.csv and extract Security_Name, Final NOS, and Final FFR.
Then read the Security.csv file that matches the Security_Name and extract Date, Close Price, and Volume.
Calculation
The calculations basically involve finding the median of the extracted values, which is quite simple.
For Example:
Monthly Median Traded Values
Daily Traded Value of a Security ... and so on
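The median itself is just a sort-and-pick over the extracted values, roughly like this (a rough sketch; the value type and function name are only illustrative):

#include <algorithm>
#include <vector>

// Rough sketch: median of a non-empty list of traded values.
double median(std::vector<double> values) {
    std::sort(values.begin(), values.end());
    const std::size_t n = values.size();
    return (n % 2 == 1) ? values[n / 2]
                        : (values[n / 2 - 1] + values[n / 2]) / 2.0;
}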
Write
Based on the month, I need to write the output to two different files with the following formats:
If Month % 3 = 0
Save it as MONTH_NAME.csv in the following format:
Security name; 12-month indicator; 3-month indicator; FOT
Else
Save it as MONTH_NAME.csv in the following format:
Security Name; Monthly Median Traded Value Ratio; Number of days Volume > 0
My question is: how do I design my application so that it is maintainable and the flow of data throughout the execution is seamless?
So first thing. Based on the kind of data you are looking to generate, I would probably be looking at moving this data to a SQL db if at all possible. This is "one SQL query" kind of stuff. And far more maintainable than C++ that generates CSV files from CSV files.
Barring that, I would probably look at using datamash and/or perl. On a Windows platform, you could do this through Cygwin or WSL. Probably less maintainable, but so much easier it's not too much of an issue.
That said, if you're looking for something moderately maintainable, C++ could work. The first thing I would do is design my input classes. Data-centric, but it can work. It sounds like you could have a Country class, a Security class, and a SecurityClose class... or something along those lines. You can think about whether a Security should contain a collection of SecurityClose objects (data), or whether the data should just be "loose" and reference the Security it belongs to. Same with the Country->Security relationship.
Once you've decided how all that's going to look, you want something (likely a function) that can tokenize a CSV line, so "1,2,3" gets turned into a vector<string> with the contents "1" "2" "3". Then each of your input classes should have a constructor or initializer that takes a vector<string> and populates itself. You might need to pass higher-level data along too, like the filename, if you want the security data to know which security it belongs to.
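A rough sketch of that tokenizer and one of the input classes might look like the following; the field layout follows the Date;Close Price;Volume description above, and the names are only illustrative.

#include <sstream>
#include <string>
#include <vector>

// Split one CSV line on the given delimiter; the files described here use ';'.
std::vector<std::string> tokenize(const std::string& line, char delim = ';') {
    std::vector<std::string> fields;
    std::istringstream ss(line);
    std::string field;
    while (std::getline(ss, field, delim))
        fields.push_back(field);
    return fields;
}

// One row of a security file: Date;Close Price;Volume.
struct SecurityClose {
    std::string date;
    double closePrice = 0.0;
    long long volume = 0;

    explicit SecurityClose(const std::vector<std::string>& f)
        : date(f.at(0)),
          closePrice(std::stod(f.at(1))),
          volume(std::stoll(f.at(2))) {}
};

Each line read from a security file then becomes SecurityClose(tokenize(line)), and the Country and Security classes can be populated the same way.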
That's basically most of the battle there. Once you've pulled your data into sensibly organized classes, the rest should come more easily. And if you run into bumps, hopefully you can ask specific design or implementation questions from there.

How to report a list in BehaviorSpace NetLogo?

I am running a NetLogo model in BehaviorSpace, varying the number of runs each time. I have a turtle breed, pigs, and they accumulate a table with patch types as keys and the number of visits to each patch type as values.
At the end I calculate a list of the mean number of visits across all pigs. The list has the same length as long as the original table has the same number of keys (patch types). I would like to export this mean number of visits per patch type with BehaviorSpace.
Perhaps I could write a separate CSV file (I tried; it creates many files, so there is a lot of work later putting them together), but I would rather have everything in the same output file after a run.
I could make a global variable for each patch type, but this seems crude and wrong, especially if I load a different patch configuration.
I tried just exporting the list, but then in Excel I see it with brackets, e.g. [49 0 31.5 76 7 0].
So my questions. Q1: Is there a proper way to export a list of values so that the BehaviorSpace table output CSV has a column for each value?
Q2: Or is there an example of how to output a single CSV that looks exactly the way I want from BehaviorSpace?
PS: In my case the patch types are costs, and I might change those in the future and rerun everything. Ideally I would like the output to be a graph of costs vs. frequency of visits.
Thanks
If the lists are a fixed length that doesn't vary from run to run, you can get the items into separate columns by using one metric for each item. So in your BehaviorSpace experiment definition, instead of putting mylist, put item 0 mylist and item 1 mylist and so on.
If the lists aren't always the same length, you're out of luck. BehaviorSpace isn't flexible that way. You would have to write a separate program (in the programming language of your choice, perhaps NetLogo itself, perhaps an Excel macro, perhaps something else) to postprocess the BehaviorSpace output and make it look how you want.
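As a rough illustration of that postprocessing step, a small helper (sketched here in C++, though any language would do) could strip the brackets from the exported list and split it into separate columns; the output delimiter is just an assumption.

#include <iostream>
#include <sstream>
#include <string>

// Turn a NetLogo list cell such as "[49 0 31.5 76 7 0]" into semicolon-separated columns.
std::string splitNetLogoList(const std::string& cell) {
    std::string inner = cell.substr(1, cell.size() - 2);  // drop '[' and ']'
    std::istringstream ss(inner);
    std::string value, out;
    while (ss >> value) {
        if (!out.empty()) out += ';';
        out += value;
    }
    return out;
}

int main() {
    std::cout << splitNetLogoList("[49 0 31.5 76 7 0]") << '\n';  // prints 49;0;31.5;76;7;0
}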

Missing the obvious with inconsistently delimited data?

I have built something in SAS to pull down Yahoo! Finance .csv data. The code now works fine and has some robust error handling built in. The problem I have had with the data, though, is that the .csv feed is unsupported and not clean.
The data is comma delimited, but some of the data also has commas in it. Some of the fields are in quotes and some are not. The length of the fields varies wildly as well; a field like Market Capitalisation, for example, could run from a few million to hundreds of billions.
As a result, if you pass multiple stock metrics for multiple stocks through to the Yahoo! API at the same time, you will get rows of .csv data where each field is in a different place, is a different length and is inconsistently delimited.
I have tried multiple infile options that could handle some of these errors in isolation, but not all of them together. The only solution that works is to download a single stock metric for multiple stocks at a time.
This gives me what I want, but it takes over an hour to run the data for the NASDAQ and the NYSE. Have I overlooked another method for handling this type of problem?
Thanks
This is the outline of a way to do what you are looking for. The whole of the code would be too long to post here and out of scope for this site.
Create a SAS program that takes a stock ticker from the SYSPARM automatic macro variable and downloads the data into a data set, named after the ticker, in a permanent library.
The SYSPARM macro variable is set from the value you pass on the command line when calling SAS:
sas.exe myprog.sas -sysparm XYZ
This would set &SYSPARM to resolve to XYZ.
Write a SAS program that merges all the ticker data sets together for further processing.
Create a program in a language like Perl or Python (or a shell script, etc.) that loops over a range of tickers and calls your SAS program, passing the ticker through SYSPARM.
Use a threading, forking, etc. package from that language to have multiple of these running at the same time. You can probably go to some multiple of the CPU cores on your machine, as this processing will not be CPU intensive. Test values until you find one that works.
From that same language, call your SAS program to merge the data sets, as in the sketch below.
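The suggestion above is Perl or Python for the driver, but the same loop-and-run-in-parallel idea can be sketched in C++ as well; the program names and tickers are placeholders, and a real version would cap the number of concurrent SAS sessions.

#include <cstdlib>
#include <future>
#include <string>
#include <vector>

int main() {
    // Placeholder tickers; in practice you would read the full list from a file.
    std::vector<std::string> tickers = {"XYZ", "ABC", "DEF"};
    std::vector<std::future<int>> jobs;

    // Launch one SAS session per ticker, passing the ticker via -sysparm.
    for (const auto& t : tickers)
        jobs.push_back(std::async(std::launch::async, [t] {
            return std::system(("sas.exe myprog.sas -sysparm " + t).c_str());
        }));

    for (auto& j : jobs)
        j.get();  // wait for every download to finish

    // Then run the (hypothetical) merge program once everything has completed.
    std::system("sas.exe merge.sas");
}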

Fast CSV parser in C++

I am trying to read a .csv file with 20k+ lines, and each line has ~300 fields.
I am using my own code to read it line by line, then I separate each line into fields and convert the fields to the corresponding data types (integer, double, etc.). These data are then transferred to class objects via their constructors.
However, I found this is not very efficient: it takes about 1 minute to read these 20k+ lines and create 20k+ objects.
I've googled for fast CSV parsers and found there are many options. I've tried some of them, but I'm not very satisfied with their performance.
Does anyone have a better method to read large .csv files? Many thanks in advance.
An efficient method for parsing, or for that matter processing, files is to read as much of the file as possible into memory before you start parsing.
File I/O has been, since the dawn of computers, one of the slower parts of a computer system. For example, parsing your data may take 1 microsecond, while reading it from a hard drive may take 1 millisecond (1000 microseconds).
I've made programs faster by allocating a large array for the data, then reading the data into the array. Next I process the data in the array and repeat until the entire file is processed, as in the sketch below.
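A minimal sketch of that buffering approach, reading the whole file in one go and then splitting the in-memory buffer into lines (the file name is a placeholder):

#include <fstream>
#include <iterator>
#include <string>
#include <vector>

int main() {
    // Read the whole file into one buffer in a single pass.
    std::ifstream in("data.csv", std::ios::binary);
    std::string buffer((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());

    // Split the in-memory buffer into lines; field splitting and type
    // conversion then work on these strings instead of the file stream.
    std::vector<std::string> lines;
    std::size_t start = 0;
    while (start < buffer.size()) {
        std::size_t end = buffer.find('\n', start);
        if (end == std::string::npos) end = buffer.size();
        lines.emplace_back(buffer, start, end - start);
        start = end + 1;
    }
}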
Another technique is called memory mapping, where the OS handles reading the file into memory as needed.
Please edit your post to show the code where the bottleneck is.