Warning (cdfInqContents): Coordinates variable XTIME can't be assigned - cdo-climate

I have a large number of daily WRF output files, each consisting of 24 time steps, one for every hour of the day. I would now like to combine these individual output files into a single file covering the entire time period using cdo mergetime. I have done this before with other output files in another context and it worked well.
When I apply this command for example:
cdo mergetime wrf_file1.nc wrf_file2.nc output_file.nc
I get the following message many times: Warning (cdfInqContents): Coordinates variable XTIME can't be assigned!
Since it is only a warning and not an error, the process continues. But it takes way too much time and the resulting output file is way too big. For example, when the two input files are about 6 GB, the resulting output file is above 40 GB, which does not make sense at all.
Anybody with an idea how to solve this?

The merged file is probably so large because CDO does not compress its output by default, while the WRF input files most likely are compressed.
You can modify your call to compress the output as follows:
cdo -z zip -mergetime wrf_file1.nc wrf_file2.nc output_file.nc
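If you want to check whether compression really is the difference, you can compare the hidden storage attributes of the input and output files with the standard ncdump utility (this is a generic netCDF check, nothing WRF-specific); a _DeflateLevel greater than zero means the variable is compressed:
# -h prints only the header, -s also shows the special attributes such as _DeflateLevel and _ChunkSizes
ncdump -hs wrf_file1.nc | grep -i _DeflateLevel
ncdump -hs output_file.nc | grep -i _DeflateLevel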

Related

Concatenate monthly MODIS data

I downloaded daily MODIS Level 3 data for a few months from https://disc.gsfc.nasa.gov/datasets. The filenames are of the form MCD06COSP_M3_MODIS.A2006001.061.2020181145945, but the files do not contain any time dimension. Hence, when I use ncecat to concatenate the various files, the date information is missing from the resulting file. I want to know how to add the time information to the combined dataset.
Your commands look correct. Good job crafting them. I'm not sure why it's not working. Possibly the input files are in HDF4 format (do they have a .hdf suffix?) and your NCO is not HDF4-enabled. Try to download the files in netCDF3 or netCDF4 format and your commands should work. If that's not what's wrong, then examine the output files at each step of your procedure, identify which step produces the unintended results, and narrow your question accordingly. Good luck.
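Incidentally, once you do have the files as netCDF, a generic NCO recipe for attaching a time dimension is sketched below; the dimension name, the .nc suffix, and the placeholder coordinate values 0,1,2,... are assumptions, and you would still have to overwrite them with real dates and a proper units attribute afterwards:
# create a new record dimension named "time" instead of ncecat's default "record"
ncecat -u time MCD06COSP_M3_MODIS.A2006*.nc combined.nc
# add a placeholder time coordinate (0,1,2,...) along that dimension
ncap2 -O -s 'time=array(0,1,$time)' combined.nc combined.nc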

How to concatenate using ncrcat or cdo merge with packed netcdf files

I have ERA5 files that I am trying to concatenate into monthly files. It appears the files have been packed to reduce their size, making the data type within the file a short. When I try ncrcat, it warns about encountering the packing attribute "add_offset" and then concatenates all the files together, but the values of the data become messed up. I tried using ncpdq -U to unpack the files and then ncrcat to concatenate them, which works, but the resulting files are too large to be useful, and when I try ncpdq to repack the resulting file I receive a malloc() failure, which seems to be a memory/RAM issue.
I've also tried cdo merge, which strangely works perfectly for most of the concatenations, but a few of the files fail with the error "Error (cdf_put_vara_double): NetCDF: Numeric conversion not representable".
So is there any way to concatenate these files while they are still packed, or at least a way to repack the large files once they are concatenated?
Instead of repacking the large files once they are concatenated, you could try netCDF4 compression, e.g.,
ncpdq -U -7 -L 1 inN.nc in_upk_cmpN.nc # -U unpacks, -7 writes netCDF4-classic, -L 1 applies deflate level 1; loop over N
ncrcat in_upk_cmp*.nc out.nc           # concatenate the unpacked, compressed files
Good luck!
When data is packed, CDO will often throw an error like this due to too much loss of precision;
cdo -b 32 mergetime in*.nc out.nc
should do the trick and avoid the error. If you then want to compress the files you can try this:
cdo -z zip_9 copy out.nc out_compressed.nc
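If your CDO build has netCDF4 support, you can also do both steps in a single pass, promoting to 32-bit floats and compressing at the same time (the exact set of options accepted can vary a little between CDO versions, so treat this as a sketch):
cdo -f nc4 -z zip_1 -b 32 mergetime in*.nc out.nc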

Best way to read this file to manipulate later?

I am given a config file that looks like this for example:
Start Simulator Configuration File
Version/Phase: 2.0
File Path: Test_2e.mdf
CPU Scheduling Code: SJF
Processor cycle time (msec): 10
Monitor display time (msec): 20
Hard drive cycle time (msec): 15
Printer cycle time (msec): 25
Keyboard cycle time (msec): 50
Mouse cycle time (msec): 10
Speaker cycle time (msec): 15
Log: Log to Both
Log File Path: logfile_1.lgf
End Simulator Configuration File
I am supposed to be able to take this file and output the cycles and cycle times to a log and/or the monitor. I am then supposed to pull data from a meta-data file that tells me how many cycles each of these runs (among other things), and then I'm supposed to calculate and log the total time. For example, 5 hard drive cycles would be 75 msec. The config and meta-data files can come in any order.
I am thinking I will put each item in an array and then cycle through, waiting for true when the strings match (this will also help detect file errors). The config file should always be the same size despite a different order. The meta-data file can be any size, so I figured I would do a similar thing but with a vector.
Then I will multiply the cycle times from the config file by the number of cycles in the matching meta-data file string. I think the best way to read the data from the vector is with a queue.
Does this sound like a good idea?
I understand most of the concepts, but my grasp of data structures is shaky when it comes to actually coding it. For example, when reading from the files, should I read them line by line, or would it be best to separate the ints from the strings to calculate with them later? I've never had to do this with a file whose contents can change.
If I separate them, would I have to use separate arrays/vectors?
I'm using C++, by the way.
Your logic should be:
1. Create two std::map variables: one that maps a string to a string, and another that maps a string to a float.
2. Read each line of the file.
3. If the line contains a :, split the string into two parts:
3a. Part A is the portion of the line from index 0 up to one less than the index of the :.
3b. Part B is the portion of the line starting from one past the index of the :.
4. Use these two parts as the key and value to store in your two std::map variables, choosing the map based on the value's type.
Now you have read the file properly. When you read the meta-data file, you simply look up each key from the meta-data file in your configuration map to get the corresponding value, then do whatever mathematical operation is required.
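Here is a minimal sketch of that approach in C++; the file name config.conf is a placeholder, values that parse as numbers go into the float map and everything else into the string map, and the meta-data side and error handling are left out:
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main() {
    std::ifstream config("config.conf");  // placeholder file name
    std::map<std::string, std::string> textSettings;
    std::map<std::string, float> numericSettings;

    std::string line;
    while (std::getline(config, line)) {
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos)
            continue;  // skip the Start/End marker lines

        std::string key = line.substr(0, colon);      // Part A
        std::string value = line.substr(colon + 1);   // Part B
        if (!value.empty() && value.front() == ' ')
            value.erase(0, 1);                        // drop the space after ':'

        // store numeric values in one map, everything else in the other
        std::istringstream parser(value);
        float number;
        if (parser >> number)
            numericSettings[key] = number;
        else
            textSettings[key] = value;
    }

    // example: 5 hard drive cycles at the configured cycle time
    float total = 5 * numericSettings["Hard drive cycle time (msec)"];
    std::cout << "5 hard drive cycles = " << total << " msec\n";
}
Because the maps are keyed by the text before the colon, the order of lines in the config file no longer matters, which addresses the "files can come in any order" concern.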

Number of mappers for a Mapreduce program

How many mappers are going to be executed if my MapReduce job reads 60 files, each 1 MB in size, available in a directory? Let's assume that under this /user/cloudera/inputs/ directory there are 60 files and the size of each file is 1 MB.
In my MapReduce configuration class I specified the directory /user/cloudera/inputs/.
Can somebody tell me how many blocks are used for storing those 60 files of 1 MB each, and how many mappers are executed.
Is it 60 blocks and 60 mappers? If so, can somebody explain to me how?
Map tasks usually process one block of input at a time (using the default FileInputFormat). If the files are very small and there are a lot of them, then each map task processes very little input, and there are many more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into 16 blocks of 64 MB with 10,000 or so 100 KB files: the 10,000 files use one map each, and the job time can be tens or hundreds of times slower than for the equivalent job with a single input file.
In your case, 60 maps are used for the 60 files, and 60 blocks are used.
If you're using something like TextInputFormat, the problem is that each file has at least one split, so the number of files is a lower bound on the number of maps. In your case, with many very small files, you end up with many mappers each processing very little data.
To remedy that, you should use CombineFileInputFormat, which packs multiple files into the same split (I think up to the block size limit). With that format, the number of mappers is independent of the number of files; it simply depends on the amount of data.
You will have to create your own input format by extending CombineFileInputFormat; you can find an implementation here. Once you have your InputFormat defined, let's call it CombinedInputFormat as in the link, you can tell your job to use it by doing:
job.setInputFormatClass(CombinedInputFormat.class);
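For reference, here is a rough sketch of what such a CombinedInputFormat can look like with the Hadoop 2.x mapreduce API (the class and wrapper names are illustrative, not taken from the linked implementation); it delegates each file inside a combined split to the ordinary line-based text reader:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    public CombinedInputFormat() {
        // Cap each combined split at 128 MB; without a cap, everything on a
        // node/rack may be lumped into a single split.
        setMaxSplitSize(128 * 1024 * 1024);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader iterates over the files in the split and
        // creates one wrapped reader per file.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, TextReaderWrapper.class);
    }

    // Wrapper with the (CombineFileSplit, TaskAttemptContext, Integer) constructor
    // that CombineFileRecordReader expects, delegating to TextInputFormat's reader.
    public static class TextReaderWrapper
            extends CombineFileRecordReaderWrapper<LongWritable, Text> {
        public TextReaderWrapper(CombineFileSplit split, TaskAttemptContext context,
                                 Integer index) throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, index);
        }
    }
}
On Hadoop 2.x you may not even need the custom class: the built-in CombineTextInputFormat does essentially the same thing for plain text input.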

Searching for means to get a smaller RDF (N3) dataset

I have downloaded the yago.n3 dataset.
However, for testing I wish to work with a smaller version of it (the full dataset is 2 GB), because even after a small change it takes me a lot of time to debug.
Therefore, I tried to copy a small portion of the data into a separate file, but this did not work and threw lexical errors.
I saw the earlier posts; however, they are about big datasets, whereas I am searching for a smaller one.
Is there any means by which I may obtain a smaller amount of the same dataset?
If you have an RDF parser at hand that can read your yago.n3 file, you can parse it and write as many RDF triples as you want/need for your smaller dataset to a separate file, and run your experiments with that.
If you find the data in N-Triples format (i.e., one RDF triple per line), you can just take as many lines as you want and make your dataset as small as you like: head -n 10 filename.nt would give you a tiny dataset of 10 triples.
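A concrete way to do this, assuming Apache Jena's command-line tools are installed (an assumption, they are not mentioned in the question), is to convert the N3 file to line-oriented N-Triples first and then truncate it; cutting the original N3 file directly tends to produce exactly the kind of lexical errors described above, because N3/Turtle statements can span multiple lines and rely on @prefix declarations at the top of the file:
# riot parses the N3/Turtle input and writes N-Triples (its default output), one complete triple per line
riot yago.n3 > yago.nt
# keep the first 100,000 triples as a small test dataset
head -n 100000 yago.nt > yago_sample.nt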