I downloaded daily MODIS Level 3 data for a few months from https://disc.gsfc.nasa.gov/datasets. The filenames are of the form MCD06COSP_M3_MODIS.A2006001.061.2020181145945, but the files do not contain any time dimension. Hence, when I use ncecat to concatenate several files, the date information is missing from the resulting file. I want to know how to add the time information to the combined dataset.
Your commands look correct. Good job crafting them. Not sure why it's not working. Possibly the input files are HDF4 format (do they have a .hdf suffix?) and your NCO is not HDF4-enabled. Try to download the files in netCDF3 or netCDF4 format and your commands above should work. If that's not what's wrong, then examine the output files in each step of your procedure and identify which step produces the unintended results and then narrow your question. Good luck.
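Since the date only survives in the filename, one option is to recover it there before or after concatenating. Below is a minimal Python sketch, assuming the `.AYYYYDDD.` field in the name is the year followed by the day of year (so `A2006001` would be 1 January 2006). The function names and the reference date are made up for illustration; the resulting offsets could then be written into whatever time variable you add to the combined file.

```python
import re
from datetime import date, datetime

# Matches the ".AYYYYDDD." field of a name such as
# MCD06COSP_M3_MODIS.A2006001.061.2020181145945
# (assumption: four-digit year followed by three-digit day of year).
ACQUISITION_RE = re.compile(r"\.A(\d{4})(\d{3})\.")


def date_from_filename(filename: str) -> date:
    """Return the calendar date encoded in the filename."""
    match = ACQUISITION_RE.search(filename)
    if match is None:
        raise ValueError(f"no .AYYYYDDD. field in {filename!r}")
    year, day_of_year = match.groups()
    return datetime.strptime(year + day_of_year, "%Y%j").date()


def days_since(filename: str, reference: date = date(2006, 1, 1)) -> int:
    """Offset in days from a reference date (hypothetical choice of epoch)."""
    return (date_from_filename(filename) - reference).days


if __name__ == "__main__":
    name = "MCD06COSP_M3_MODIS.A2006001.061.2020181145945"
    print(date_from_filename(name), days_since(name))
```

Sorting the input files by this date before concatenating keeps the record order and the time values aligned.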
Scenario: We have a list of over 200 SAS files. We need to identify all SAS libraries and data sets used as inputs to these programs, and write out a table linking the SAS input data sets to the associated program files. We are not SAS programmers and are just now becoming familiar with the language. The intent is to design a rearchitecture of the logic of the SAS files to be more modular.
We are conducting this analysis statically - i.e., we are not running SAS, we are attempting to extract this data purely from interrogating the code in program files themselves and we do not have access to the data files.
Solution attempted: we have parsed the SAS programs to identify inputs to SAS Procs and SAS Data steps, however there are several challenges. The approach we are using is as follows:
We have obtained a python-based parser (https://github.com/benjamincorcoran/sasdocs) that extracts key information from SAS files. We have applied it to all 200+ files and extracted parsed content into a text file. However, not all SAS syntax is supported; in particular, DataSet blocks are left as unparsed raw text, Procs with a variable number and names of arguments may be missed, and some commands, like various constructs of “set” and “merge” are missed completely by the grammar that has been implemented in the parser so far.
The parser correctly locates about 60% of the files, especially the libraries and files preceded by a "Set" statement. For reasons we do not understand, not all libraries/files preceded by a "SET" command are captured by this parser.
In addition to the "Set" command, we have observed that SAS can also reference a library/file within a Merge or Sort procedure, without a specific Set command.
We are ignoring SAS files from within the 'work' library that are created during processing; we are only concerned with external input files.
Note that we are not running these programs, we only have access to the SAS Program file sources - hence we do not have access to a SAS log.
Questions:
Is there a more direct way to accomplish this goal? Does SAS understand what files it reads and writes, and is there a method of extracting a list of all libraries and files read by SAS associated with a SAS program?
If there is no way to obtain this information programmatically, what are all the ways that SAS can access or reference an external library/file, other than within a SET, MERGE or SORT procedure?
SAS has a procedure that does this, PROC SCAPROC. If you do have access to SAS, this is by far the best solution. You would technically need to run SAS, but even if there are errors, in theory it might work okay - the fact that the dataset doesn't exist should be okay, unless your code is data driven.
If you're unable to run the code or run anything in SAS, you'd need to do something with text analysis.
The key things to look for which would catch most of the possibilities would be (in sort of pseudo regex code):
data [lib.]dataset(could have parens but ignore them);
set( [lib.]dataset(ignore parens))* (could have multiple)
merge( [lib.]dataset(ignore parens))* (could have multiple)
update( [lib.]dataset(ignore parens))* (could have multiple)
modify( [lib.]dataset(ignore parens))* (could have multiple)
data=[lib.]dataset(ignore parens) - this is for most PROCs input, could have spaces around the equals sign
out=[lib.]dataset(ignore parens) - this is for most PROCs output, could have spaces around the equals sign
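The patterns above can be sketched in Python roughly as follows. This is only a text-matching approximation, not a SAS parser: all names here are made up for illustration, and it will be fooled by comments, macro variables, and PROC SQL (`update ... set ...`), so treat the output as candidates to review.

```python
import re

# A dataset reference: optional "lib." prefix followed by a dataset name.
DATASET = r"(?:[A-Za-z_]\w*\.)?[A-Za-z_]\w*"

# DATA step statements that read datasets; captures everything up to the ";".
STATEMENT_RE = re.compile(r"\b(set|merge|update|modify)\b([^;]*);", re.IGNORECASE)
# PROC-style options such as data=lib.name or out = lib.name.
PROC_OPTION_RE = re.compile(r"\b(data|out)\s*=\s*(" + DATASET + r")", re.IGNORECASE)
# Innermost parenthesised dataset options, e.g. (keep=a b).
PAREN_RE = re.compile(r"\([^()]*\)")
# Statement options such as end=last that are not dataset names.
ASSIGNMENT_RE = re.compile(r"\b\w+\s*=\s*\S+")


def strip_parens(text):
    """Remove parenthesised dataset options, including nested ones."""
    previous = None
    while previous != text:
        previous = text
        text = PAREN_RE.sub(" ", text)
    return text


def find_dataset_refs(sas_source):
    """Return sorted (keyword, dataset) pairs found in the program text."""
    refs = set()
    for match in STATEMENT_RE.finditer(sas_source):
        keyword = match.group(1).lower()
        body = ASSIGNMENT_RE.sub(" ", strip_parens(match.group(2)))
        for name in re.findall(DATASET, body):
            refs.add((keyword, name.lower()))
    for match in PROC_OPTION_RE.finditer(sas_source):
        refs.add((match.group(1).lower(), match.group(2).lower()))
    return sorted(refs)


def external_inputs(refs):
    """Keep input references that name a library other than WORK."""
    inputs = {"set", "merge", "update", "modify", "data"}
    return sorted({name for keyword, name in refs
                   if keyword in inputs and "." in name
                   and not name.startswith("work.")})
```

Running `find_dataset_refs` over each program file and writing `(filename, dataset)` rows from `external_inputs` would give a first draft of the program-to-input table.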
To get more than the "most" above, you'd want to analyze which PROCs were used. Each PROC can have its own output/input options, for example proc surveyselect could use various different datasets for different things, proc format uses CNTLIN and CNTLOUT, etc. You'd also have to see if there are hash tables or other objects used in the code as that has its own elements.
The other thing you could do, only caring about external files, is identify the libname statements. Once you find them, it's possible you could just look for libname.data in the program - that's how all of the datasets in the external folders (libraries) will be referred to. This won't work, though, if you are using metadata-assigned libraries, unless there are a small enough number of them that you could possibly list them all out (and you have access to SAS to find out the list).
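A rough Python sketch of the libname approach, assuming the libraries are assigned with plain `libname libref 'path';` statements (optionally with an engine name before the path). The function name is hypothetical, and the match is purely textual, so something like a quoted filename `'raw.csv'` would be picked up as a false positive for a libref called `raw`.

```python
import re

# libname <libref> [<engine>] '<path>';  (quoted path, single or double quotes)
LIBNAME_RE = re.compile(
    r"\blibname\s+([A-Za-z_]\w*)\s+(?:[A-Za-z_]\w*\s+)?(['\"])(.*?)\2",
    re.IGNORECASE,
)


def find_library_refs(sas_source):
    """Return ({libref: path}, {libref: [dataset names referenced as libref.name]})."""
    libs = {m.group(1).lower(): m.group(3) for m in LIBNAME_RE.finditer(sas_source)}
    refs = {}
    for libref in libs:
        pattern = re.compile(
            r"\b" + re.escape(libref) + r"\.([A-Za-z_]\w*)", re.IGNORECASE
        )
        refs[libref] = sorted({name.lower() for name in pattern.findall(sas_source)})
    return libs, refs
```

Because a libref may be assigned in one program and used in another, it is probably worth collecting the libname statements across all 200+ files first and then searching every file for each libref.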
Ultimately, your 100% solution is to hire a SAS consultant to look at the code; without being able to run the code (and thus use SCAPROC), there's not really a perfect solution.
The requirement is that the structure of the source files changes daily/dynamically. How can we handle this in Informatica Cloud?
For example,
Consider a flat-file source that arrives in different formats: with a header, without a header, or with different metadata (today's file has 4 columns, tomorrow's has 7 different columns, the day after tomorrow's has no header, and another day's file includes a count of the records in the file).
I need to consume all of these dynamically changing files in one Informatica Cloud mapping. Could you please help me with this?
This is a tricky situation. I know it's not a perfect solution, but here is my idea:
Create a source file structure with the maximum number of columns, all of type text, say 50. Read the file and apply a filter to clean up header data etc. Then use a router to treat files according to their structure; the filename may give you a hint about what it contains. Once you identify the type of file, convert the columns according to their data types and load them into the correct target.
Mapping would look like Source -> SQ -> EXP -> FIL -> RTR -> TGT1, TGT2
There has to be a pattern to identify the dynamic file structure.
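To make the idea concrete outside Informatica, here is a small Python sketch of the same flow: read every row into a fixed-width, all-text layout, filter out header rows, then route rows by their original column count. It is only an illustration of the logic, not Informatica code; the function names, the 50-column width, and the use of column count as the routing pattern are assumptions.

```python
import csv
import io

MAX_COLUMNS = 50  # width of the generic all-text layout (illustrative value)


def read_padded(text, max_columns=MAX_COLUMNS):
    """Read CSV text into (original_width, padded_row) pairs, all values as text."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        padded = row[:max_columns] + [""] * (max_columns - len(row))
        rows.append((len(row), padded))
    return rows


def route(rows, known_headers):
    """Drop header rows (filter step) and group data rows by column count (router step)."""
    routed = {}
    for width, row in rows:
        if tuple(row[:width]) in known_headers:
            continue
        routed.setdefault(width, []).append(row[:width])
    return routed


if __name__ == "__main__":
    sample = "id,name,amount,date\n1,a,10,2020-01-01\n2,b,20,2020-01-02\n"
    headers = {("id", "name", "amount", "date")}
    print(route(read_padded(sample), headers))
```

Each key of the routed result plays the role of one router group feeding its own target.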
HTH...
To summarise my understanding of the problem:
You have a random number of file formats
You don't know the file formats in advance
The files don't contain the necessary information to determine their format.
If this is correct then I don't believe this is a solvable problem in Informatica or in any other tool, coding language, etc. You don't have enough information available to enable you to define the solution.
The only solution is to change your source files. Possibilities include:
a standard format (or one of a small number of standard formats with information in the file that allows you to programmatically determine the format being used)
a self-documenting file type such as JSON
I have many .csv files stored in GCS and I want to load the data from the .csv files into BigQuery using the command below:
bq load 'dataset.table' gs://path.csv json_schema
I have tried this, but it gives errors; the same error occurs for many files.
error screenshot
How can I remove unwanted values from the .csv files before importing them into the table?
Please suggest the easiest way to load the files.
The answer depends on what you want to do with these junk rows. If you look at the documentation, you have several options:
Number of errors allowed. By default, it's set to 0, and that's why the load job fails at the first bad line. If you know the total number of rows, set the number of errors allowed to that value and all the errors will be ignored in the load job.
Ignore unknown values. If your errors occur because some lines contain more columns than are defined in the schema, this option keeps those lines, loading only the known columns and ignoring the others.
Allow jagged rows. If your errors are caused by lines that are too short (and that is what your message says) and you still want to keep the first columns (because the last ones are optional and/or not relevant), you can check this option.
For more advanced and specific filters, you have to perform pre- or post-processing. If that's the case, let me know and I'll add this part to my answer.
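On the command line, the three options above correspond to the `bq load` flags `--max_bad_records`, `--ignore_unknown_values` and `--allow_jagged_rows`. Here is a small Python sketch that just assembles such a command; the helper name is made up, and the table, URI and schema are the placeholders from the question. It only builds the argument list and does not run anything.

```python
import shlex


def build_bq_load_command(table, uri, schema, max_bad_records=0,
                          ignore_unknown_values=False, allow_jagged_rows=False,
                          skip_leading_rows=0):
    """Assemble a `bq load` command for a CSV file as a list of arguments."""
    command = ["bq", "load", "--source_format=CSV"]
    if skip_leading_rows:
        # Skip header lines at the top of the file.
        command.append(f"--skip_leading_rows={skip_leading_rows}")
    if max_bad_records:
        # Number of bad rows tolerated before the job fails (default 0).
        command.append(f"--max_bad_records={max_bad_records}")
    if ignore_unknown_values:
        # Keep rows that have extra columns, loading only the known ones.
        command.append("--ignore_unknown_values=true")
    if allow_jagged_rows:
        # Accept rows that are missing trailing optional columns.
        command.append("--allow_jagged_rows=true")
    command += [table, uri, schema]
    return command


if __name__ == "__main__":
    cmd = build_bq_load_command("dataset.table", "gs://path.csv", "json_schema",
                                max_bad_records=100, allow_jagged_rows=True)
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` once you are happy with the flags.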
I have built something in SAS to pull down Yahoo! finance .csv data. The code I have built now works fine and I have built some robust error handling into the code. The problem I have had with the data though is that the .csv feed is unsupported and not clean.
The data is comma delimited, but some of the data also has commas in it. Some of the fields are in quotes and some are not. Also, the length of the fields varies wildly. A field like Market Capitalisation, for example, could run from a few million to hundreds of billions.
As a result, if you pass multiple stock metrics for multiple stocks through to the Yahoo! API at the same time, you will get rows of .csv data where each field is in a different place, is a different length and is inconsistently delimited.
I have tried multiple infile options that could handle some of these errors in isolation, but not all of them together. My only solution that works is to download single stock metrics by multiple stocks at the same time.
This gives me what I want, but it takes over an hour to run the data for the NASDAQ and the NYSE. Have I overlooked another method for handling this type of problem?
Thanks
This is the outline of a way to do what you are looking for. The whole of the code to do this would be too long to post here and out of scope of what this site looks to do.
Create a SAS program that takes a stock ticker from the SYSPARM automatic macro variable and downloads the data to a data set, named the same as the ticker, in a permanent library.
The SYSPARM macro variable is set from the value you pass on the command line when calling SAS:
sas.exe myprog.sas -sysparm XYZ
This would set &SYSPARM to resolve to XYZ.
Write a SAS program that merges all the ticker data sets together for further processing.
Create a program in a language like Perl or Python, (or shell script, etc.) that loops over a range of tickers and calls your SAS program, passing the ticker through SYSPARM.
Use a threading, forking, etc. package from that language to have several of these running at the same time. You can probably go to some multiple of the CPU cores on your machine, as this processing will not be CPU intensive. Test values until you find one that works.
From that same language call your SAS program to merge the datasets.
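The driver described in the last three steps might look something like this in Python. It is a sketch under assumptions: `sas.exe` and `myprog.sas` are the names from the command line above, the function names and the default worker count are made up, and the `runner` argument exists only so the loop can be tried without SAS installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(ticker, sas_exe="sas.exe", program="myprog.sas"):
    """Command line that passes one ticker to the SAS program through SYSPARM."""
    return [sas_exe, program, "-sysparm", ticker]


def run_all(tickers, max_workers=8, runner=subprocess.run):
    """Run the SAS program once per ticker, several at a time.

    Returns {ticker: return code}. Threads are enough here because each
    worker just waits on an external SAS process.
    """
    def run_one(ticker):
        result = runner(build_command(ticker))
        # subprocess.run returns a CompletedProcess; a stub may return an int.
        return ticker, getattr(result, "returncode", result)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, tickers))


if __name__ == "__main__":
    # Dry run with a stub runner, so nothing is actually launched.
    print(run_all(["XYZ", "ABC"], max_workers=2, runner=lambda cmd: 0))
```

Once `run_all` has finished and the return codes look clean, the same script can launch the merge program with one more `subprocess.run` call.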
I have downloaded yago.n3 dataset
However, for testing I wish to work on a smaller version of the dataset (the dataset is 2 GB), because even when I make a small change it takes me a lot of time to debug.
Therefore, I tried to copy a small portion of the data and create a separate file, however this did not work and threw lexical errors.
I saw the earlier posts, however the earlier post is about big datasets, whereas I am searching for smaller ones.
Is there any means by which I may obtain a smaller amount of the same dataset?
If you have an RDF parser at hand to read your yago.n3 file, you can parse it and write on a separate file as many RDF triples as you want/need for your smaller dataset to run your experiments with.
If you find some data in N-Triples format (i.e. one RDF triple per line) you can just take as many lines as you want and make your dataset as small as you want: head -n 10 filename.nt would give you a tiny dataset of 10 triples.
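If you prefer to do the same from a script, here is a minimal Python sketch of the `head` approach. It assumes the input really is N-Triples (one complete triple per line); it will not work on N3/Turtle with prefixes or multi-line statements, which is probably why copying a chunk of yago.n3 by hand produced lexical errors. The function name is made up, and unlike `head` it skips blank lines and `#` comment lines.

```python
def take_triples(source_path, target_path, count):
    """Copy the first `count` triples of an N-Triples file to a new file.

    Blank lines and comment lines are skipped; returns the number written.
    """
    written = 0
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as dst:
        for line in src:
            if written >= count:
                break
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue
            dst.write(line)
            written += 1
    return written
```

The file is read line by line, so this stays cheap even on a 2 GB input.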