Automate data loading from folder to SAS lib - sas

I want to automate the process of loading data from a folder into the SAS LASR Server (though I expect it to be similar to loading data into a normal SAS library). I have a folder where users can put their data (let's say *.csv files with the same structure). I want to create some sort of process that will automatically scan this folder, check whether there are any new files, and if so, append them to the existing data and upload it so that it is available to all users for further analysis.
I know how to read a single CSV into a SAS dataset, and I'm looking for the easiest way to solve two problems: comparing the current CSVs with those already uploaded, and scheduling this process.
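To make the idea concrete, here is a rough sketch (in Python, just to illustrate the bookkeeping part): keep a small log of already-loaded file names, and hand any new CSVs to the actual load step. The paths and the load_csv_to_lasr() helper are placeholders, not real SAS tooling.
# Rough Python sketch of the bookkeeping only: remember which CSVs were already
# loaded and pass any new ones to the real SAS/LASR load step.
from pathlib import Path

DATA_DIR = Path("/data/incoming")          # folder the users drop CSVs into (placeholder)
LOG_FILE = Path("/data/loaded_files.txt")  # simple record of already-loaded file names

def load_csv_to_lasr(path):
    # Placeholder: call your SAS program here (e.g. via the SAS command line or SASPy);
    # this function name is illustrative only, not an existing API.
    print(f"would append {path.name} to the LASR table")

def main():
    already_loaded = set(LOG_FILE.read_text().splitlines()) if LOG_FILE.exists() else set()
    new_files = sorted(p for p in DATA_DIR.glob("*.csv") if p.name not in already_loaded)
    for path in new_files:
        load_csv_to_lasr(path)
        with LOG_FILE.open("a") as log:
            log.write(path.name + "\n")

if __name__ == "__main__":
    main()  # schedule this script with cron / Windows Task Scheduler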
Many thanks in advance for any help!

Related

AWS Glue: Add an Attribute to CSV to Distinguish Between Data Sets

I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data being distinguished with attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL; would someone mind explaining to me how to add an attribute to the data, or pointing me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running (which automatically downloads the data from another site and then uploads it to S3) could be an option, deleting the NaN columns and adding a column for the site name, but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I find this website to generally be pretty helpful with figuring out SQL stuff. I've linked to the ALTER TABLE commands that would allow you to do this through SQL.
If you are running a Python script to edit the .csv to start with, then I would edit the data there, personally. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to S3. Then you can run your Glue crawler or whatever process you're using to map the columns.
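As an illustration only, a minimal pandas/boto3 sketch of that "edit the .csv in Python first" idea; the file names, bucket, key and site value are made-up placeholders.
# Drop all-NaN columns, add a column identifying the site, then push the result to S3.
import boto3
import pandas as pd

def prepare_and_upload(local_csv, site_name, bucket, key):
    df = pd.read_csv(local_csv)
    df = df.dropna(axis=1, how="all")   # remove columns that are entirely NaN
    df["site"] = site_name              # attribute that distinguishes the data sets
    cleaned = "/tmp/cleaned.csv"
    df.to_csv(cleaned, index=False)
    boto3.client("s3").upload_file(cleaned, bucket, key)

prepare_and_upload("company_a_site_1.csv", "company_a_site_1",
                   "my-bucket", "company_a/site_1/data.csv")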

Pulling CSV files into Oracle and MySQL database

I am creating several Informatica jobs to pull CSV data files from different third-party servers, and I would like to ingest and load this data into different databases, both Oracle and MySQL. I have sample data files, and I am looking for anyone to assist me in creating the jobs or to provide me with a template. The biggest problem is that some CSV files hold data for more than two or three tables, and it is becoming very hard for me to work out how to transform those files. I am also looking for a way to automate the loading of the daily CSV files.
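Not an Informatica answer, but purely to illustrate the splitting idea, here is a small pandas + SQLAlchemy sketch that separates one wide CSV into per-table frames and appends them to Oracle and MySQL; every column name, table name and connection string below is hypothetical.
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv("daily_extract.csv")

# Split the wide file into the frames for each target table (made-up columns).
customers = df[["customer_id", "customer_name"]].drop_duplicates()
orders = df[["order_id", "customer_id", "order_date", "amount"]]

oracle_engine = create_engine("oracle+cx_oracle://user:pwd@host:1521/?service_name=ORCL")
mysql_engine = create_engine("mysql+pymysql://user:pwd@host:3306/mydb")

customers.to_sql("customers", oracle_engine, if_exists="append", index=False)
orders.to_sql("orders", mysql_engine, if_exists="append", index=False)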

Sorting blobs in Google Cloud Storage with respect to the last modified date field using the Python API

I have a scenario where I want to list blobs and then sort them by their last modified time.
I am trying to do this with the Python API.
I want to execute this script n number of times, and in each execution I want to list 10 files and perform some operation on them (e.g. copy). I want to save the date of the last file in a config file and, in the next iteration, list only the files modified after that last saved date.
I need some suggestions, as the Google API doesn't let us sort the files when listing them.
blobs = storage_client.list_blobs(bucket_name, prefix=prefix, max_results=10)
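For illustration, a rough client-side sketch of what is described above, assuming it is acceptable to list all blobs under the prefix once per run: sort by last-modified time after listing, take the next 10 after the saved date, and store the new cursor in a small JSON config file (the file name is arbitrary).
import json
from datetime import datetime, timezone
from google.cloud import storage

CONFIG = "cursor.json"

def next_batch(bucket_name, prefix, batch_size=10):
    client = storage.Client()
    try:
        with open(CONFIG) as f:
            cursor = datetime.fromisoformat(json.load(f)["last_date"])
    except FileNotFoundError:
        cursor = datetime.min.replace(tzinfo=timezone.utc)

    # List everything under the prefix, then sort client-side by last-modified time.
    blobs = sorted(
        (b for b in client.list_blobs(bucket_name, prefix=prefix) if b.updated > cursor),
        key=lambda b: b.updated,
    )
    batch = blobs[:batch_size]

    for blob in batch:
        print("processing", blob.name)  # e.g. copy the blob somewhere

    if batch:
        with open(CONFIG, "w") as f:
            json.dump({"last_date": batch[-1].updated.isoformat()}, f)
    return batch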
Several solutions I can think of:
Get a Pub/Sub notification every time a file is created. Read 10 messages each time, or save the topic data to BigQuery.
After processing a file, move it to another folder together with a metadata file, or update the processed file's metadata.
Use a Cloud Storage trigger to invoke a function and save the event data to a database.
If you control the file names and paths, save the files under an easy-to-query path and use the prefix parameter.
I think the database solution is the most flexible one; it gives you the best control over the data and the ability to create a dashboard for your data.
Knowing more about your flow would help in giving you a more fine-grained solution.

Append CSV Data to Apache Superset Dataset

Using CSV upload in Apache Superset works as expected. I can use it to add data from a CSV to a database, e.g. Postgres. Now I want to append data from a different CSV to this table/dataset. But how?
The CSVs all have the same format. But there is a new one for every day. In the end I want to have a dashboard which updates every day, taking the new data into account.
Generally, I agree with Ana that if you want to repeatedly upload new CSV data, then you're better off operationalizing this into some type of process, pipeline, etc. that runs on a schedule.
But if you need to stick with the CSV-upload route through the Superset UI, then you can set the Table Exists field to Append instead of Replace.
You can find a helpful GIF in the Preset docs: https://docs.preset.io/docs/tips-tricks#append-csv-to-a-database
Probably you'll be better served by creating a simple process to load the CSV to a table in the database and then querying that table in Superset.
Superset is a tool to visualize data. It allows uploading a CSV for quick-and-dirty, one-off charts, but if this is going to be a recurrent, structured, periodic load of data, it's better to use whatever integration tool you want to load the data. There are zillions of ETL (Extract-Transform-Load) tools out there (or scripting programs to do it); ask whether your company is already using one, or choose the one that is simplest for you.
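For example, a minimal sketch of such a process using pandas and SQLAlchemy; the connection string, file name and table name are placeholders.
# Append each day's CSV into a Postgres table that Superset then queries.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pwd@localhost:5432/analytics")

def append_daily_csv(csv_path, table="daily_data"):
    df = pd.read_csv(csv_path)
    df.to_sql(table, engine, if_exists="append", index=False)

append_daily_csv("data_2023-01-15.csv")   # run once per day from a scheduler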

Keeping existing data in the data model and just extending it with new data

What I do:
I built ETL processes with Power Query to load data (production machine stop history) from multiple Excel files directly into Power BI.
On each new shift (every 8 hrs.) a new Excel file is generated by the production machine, and it needs to be loaded into the data model as well.
How I did it:
To do so, Power Query processes all files found in a specific folder.
The problem:
During each query refresh it needs to process all the data files again and again (old files + new files).
If I remove the old files from the folder, Power Query also removes their data from the data model during the next refresh cycle.
What I need / My question:
A batch process copies new files into the folder while removing all the old files.
Is there a possibility to configure Power Query in such a way that it keeps the existing data inside the data model and just extends it with the data from the new files?
What I would like to avoid:
I know building a database would be one solution, but this requires a second system with a new ETL process, and Power Query already does a very good job of preprocessing the data. Therefore, if possible, it would be highly appreciated if this problem could be solved directly inside Power Query / Power BI.
If you want to shoot sparrows with a cannon, you could try incremental refresh, but it's a Premium feature.
In Power BI, refreshing a dataset reloads it, so first it is cleared, and second, you will need all the files in order to reload them and recalculate everything. If you don't want this, you have to either change your ETL to store the data outside of the report's dataset (e.g. a database would be a very good choice), or push the data from the new files only to a dataset (which I wouldn't recommend in your case).
To summarize: the best solution is to build an ETL process that puts the data in a data warehouse, and then use that as the data source for your reports.
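As a rough illustration of that approach (not Power Query code): a small Python sketch that appends each new shift's Excel file to a database table, which Power BI then reads on refresh. The paths, table name and connection string are assumptions, not anything from the original post.
from pathlib import Path
import pandas as pd
from sqlalchemy import create_engine

INCOMING = Path(r"C:\machine_data\incoming")
engine = create_engine("postgresql+psycopg2://user:pwd@localhost:5432/production")

def load_new_shift_files():
    try:
        loaded = set(pd.read_sql("SELECT DISTINCT source_file FROM stop_history", engine)["source_file"])
    except Exception:   # first run: the table does not exist yet
        loaded = set()
    for xlsx in sorted(INCOMING.glob("*.xlsx")):
        if xlsx.name in loaded:
            continue
        df = pd.read_excel(xlsx)
        df["source_file"] = xlsx.name   # remember which file each row came from
        df.to_sql("stop_history", engine, if_exists="append", index=False)

load_new_shift_files()   # schedule every 8 hours, then refresh the Power BI dataset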