I have some data grouped by organizational group, and I'd like to compute a moving average within each group. I've played around but can't find a quick and easy way to make the command operate within groups rather than over the entire data set.
See the SPLIT FILE command. You can simply specify your grouping variable and run the appropriate CREATE command to estimate your moving average. Brief example below:
SORT CASES BY GROUPING_VAR.
SPLIT FILE BY GROUPING_VAR.
CREATE
  /group_MA=PMA(X 5).
SPLIT FILE OFF.
I'm looking for a way to process a ~4 GB file that is the result of an Athena query, and I am trying to find out:
Is there some way to split Athena's query result file into small pieces? As I understand it, this is not possible from the Athena side. It also looks like it is not possible to split it with Lambda: the file is too large, and s3.open(input_file, 'r') does not seem to work in Lambda :(
Are there other AWS services that can solve this issue? I want to split this CSV file into small pieces (about 3-4 MB) and send them to an external source via POST requests.
You can use CTAS (CREATE TABLE ... AS SELECT) with Athena and take advantage of its built-in partitioning capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.
It is usually advised to have the newly created table in a compressed format such as Parquet; however, you can also define it to be CSV ('TEXTFILE').
Lastly, it is advisable to partition a large table into meaningful partitions to reduce the cost of querying the data, especially in Athena, which charges by data scanned. The meaningful partitioning depends on your use case and the way you want to split your data. The most common approach is time-based partitions, such as yearly, monthly, weekly, or daily. Use the logic by which you would like to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
    format = 'TEXTFILE',
    external_location = 's3://bucket/folder/',
    partitioned_by = ARRAY['year', 'month']
)
AS SELECT ...
When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.
Note that you might have different sizes of files based on the amount of data in each partition. If this is a problem or you don't have any meaningful partition logic, you can add a random column to the data and partition with it:
substr(to_base64(sha256(to_utf8(some_column_in_your_data))), 1, 1) AS partition_char
Or you can use bucketing and provide how many buckets you want:
WITH (
    format = 'TEXTFILE',
    external_location = 's3://bucket/folder/',
    bucketed_by = ARRAY['column_with_high_cardinality'],
    bucket_count = 100
)
You won't be able to do this with Lambda as your memory is maxed out around 3GB and your file system storage is maxed out at 512 MB.
Have you tried just running the split command on the filesystem (if you are using a Unix-based OS)?
If this job is recurring and needs to be automated, and you want to stay "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specifics of how to use split, this other Stack Overflow question may help:
How to split CSV files as per number of rows specified?
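If you end up scripting the split yourself instead, for example inside the Docker/Fargate task mentioned above, a minimal C++ sketch that breaks a CSV into fixed-row chunk files could look like the following; the input name, output prefix, and rows-per-chunk value are assumptions to tune toward your 3-4 MB target.

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    const std::string input = "result.csv";    // assumed input file
    const std::string prefix = "chunk_";       // assumed output prefix
    const std::size_t rowsPerChunk = 50000;    // tune to hit the desired chunk size

    std::ifstream in(input);
    if (!in) { std::cerr << "cannot open " << input << '\n'; return 1; }

    std::string header, line;
    std::getline(in, header);                  // repeat the header in every chunk

    std::ofstream out;
    std::size_t row = 0, chunk = 0;
    while (std::getline(in, line)) {
        if (row % rowsPerChunk == 0) {
            if (out.is_open()) out.close();
            out.open(prefix + std::to_string(chunk++) + ".csv");
            out << header << '\n';
        }
        out << line << '\n';
        ++row;
    }
    return 0;
}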
You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-999 to get the first 1000 bytes.
If you want to process the whole file in the same Lambda invocation you can request a range that is about what you think you can fit in memory, process it, and then request the next. Start the next range right after the last line break you saw, and prepend the trailing partial line to the next chunk. As long as you make sure that the previous chunk gets garbage collected and you don't aggregate a huge data structure, you should be fine.
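Here is a minimal C++ sketch of that carry-over logic. fetch_range is a stand-in that reads a byte range from a local file so the example is self-contained; in Lambda you would replace it with a ranged S3 GET. The file name, range size, and the per-line handling (just printing) are assumptions.

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

// Stand-in for a ranged S3 GET: read [offset, offset + length) from a local file.
std::string fetch_range(const std::string& path, std::size_t offset, std::size_t length) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(offset));
    std::string buf(length, '\0');
    in.read(&buf[0], static_cast<std::streamsize>(length));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

int main() {
    const std::string path = "result.csv";          // assumed file
    const std::size_t rangeSize = 8 * 1024 * 1024;  // how much to request per call

    std::string leftover;
    std::size_t offset = 0;
    for (;;) {
        std::string chunk = fetch_range(path, offset, rangeSize);
        if (chunk.empty()) break;
        offset += chunk.size();
        chunk = leftover + chunk;

        // Only hand out complete lines; keep the trailing partial line for the next chunk.
        std::size_t lastNewline = chunk.rfind('\n');
        if (lastNewline == std::string::npos) { leftover = chunk; continue; }
        leftover = chunk.substr(lastNewline + 1);

        std::size_t start = 0;
        while (start <= lastNewline) {
            std::size_t end = chunk.find('\n', start);
            std::cout << chunk.substr(start, end - start) << '\n';  // process one line
            start = end + 1;
        }
    }
    if (!leftover.empty()) std::cout << leftover << '\n';  // final line without trailing newline
    return 0;
}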
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work, though: you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line you can pad each chunk with that amount (both at the beginning and end). When you read a chunk, you skip ahead until you see the last line break in the start padding, and you skip everything after the first line break inside the end padding (with special handling of the first and last chunk, obviously).
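Below is a minimal C++ sketch of that boundary handling. To keep it self-contained it reads a local file instead of issuing ranged S3 GETs, and it uses a slightly simpler convention than the padding described above: each worker owns every line that starts inside its nominal byte range and reads just past its end to finish the last one. The file name and offsets are placeholders.

#include <fstream>
#include <iostream>
#include <string>

// Process every line whose first byte lies in [begin, end) of the file.
// With ranged S3 GETs you would fetch [begin, end) plus some end padding;
// here a local ifstream stands in for that.
void process_chunk(const std::string& path, std::streamoff begin, std::streamoff end) {
    std::ifstream in(path, std::ios::binary);
    if (begin > 0) {
        // Back up one byte and discard through the next newline: if we landed
        // mid-line, that line belongs to the previous worker; if byte begin-1
        // is itself a newline, this discards an empty string and we keep the
        // line starting exactly at begin.
        in.seekg(begin - 1);
        std::string discard;
        std::getline(in, discard);
    }
    for (;;) {
        std::streamoff pos = in.tellg();      // where the next line starts
        if (pos < 0 || pos >= end) break;      // next line belongs to another worker
        std::string line;
        if (!std::getline(in, line)) break;
        std::cout << line << '\n';             // process one complete line
    }
}

int main() {
    // Example: two "workers" splitting a file at byte 1000 (placeholder values);
    // in practice each call would be its own parallel Lambda invocation.
    process_chunk("result.csv", 0, 1000);
    process_chunk("result.csv", 1000, 1LL << 40);  // effectively "to end of file"
}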
I have a task to complete.
There are two types of CSV files, 4000+ of them in total, related to each other.
The 2 types are:
1. Country2.csv
2. Security_Name.csv
Contents of Country2.csv:
Company Name;Security Name;;;;Final NOS;Final FFR
Contents of Security_Name.csv:
Date;Close Price;Volume
There are multiple countries, and for each country there are multiple security files.
Now I need to READ them, do some CALCULATIONS, and then WRITE the output to other files.
READ
Read both files, Country 2.csv and Security.csv, and extract all the data from them.
For example:
Read France 2.csv and extract Security_Name, Final NOS, Final FFR.
Then read the Security.csv that matches the Security_Name and extract Date, Close Price, Volume.
Calculation
The calculations are basically finding the median of the extracted values, which is quite simple.
For example:
Monthly Median Traded Values
Daily Traded Value of a Security ... and so on
Write
Based on the month, I need to write the output to two different files with the following formats:
If Month % 3 = 0
Save it as MONTH_NAME.csv in the following format:
Security name; 12-month indicator; 3-month indicator; FOT
Else
Save it as MONTH_NAME.csv in the following format:
Security Name; Monthly Median Traded Value Ratio; Number of days Volume > 0
My question is how do I design my application in such a way that it is maintainable and the flow of data throughout the execution is seamless?
So first thing. Based on the kind of data you are looking to generate, I would probably be looking at moving this data to a SQL db if at all possible. This is "one SQL query" kind of stuff. And far more maintainable than C++ that generates CSV files from CSV files.
Barring that, I would probably look at using datamash and/or perl. On a Windows platform, you could do this through Cygwin or WSL. Probably less maintainable, but so much easier it's not too much of an issue.
That said, if you're looking for something moderately maintainable, C++ could work. The first thing I would do is design my input classes. Data-centric, but it can work. It sounds like you could have a Country class, a Security class, and a SecurityClose class...or something along those lines. You can think about whether a Security class should contain a collection of SecurityCloses (data), or whether the data should just be "loose" and reference the Security it belongs to. Same with the Country->Security relationship.
Once you've decided how all that's going to look, you want something (likely a function) that can tokenize a CSV line, so "1,2,3" gets turned into a vector<string> with the contents "1" "2" "3". Then, each of your input classes should have a constructor or initializer that takes a vector<string> and populates itself. You might need to pass higher-level data along too, like the filename, if you want the security data to know which security it belongs to.
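A minimal sketch of such a tokenizer might look like this; the delimiter is a parameter since the files in the question use ';' rather than ',', and quoted fields are deliberately not handled:

#include <sstream>
#include <string>
#include <vector>

// Split one CSV line into fields. Does not handle quoted fields containing
// the delimiter; add that if your data needs it.
std::vector<std::string> tokenize(const std::string& line, char delimiter = ';') {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream stream(line);
    while (std::getline(stream, field, delimiter)) {
        fields.push_back(field);
    }
    // A trailing delimiter means a trailing empty field.
    if (!line.empty() && line.back() == delimiter) {
        fields.push_back("");
    }
    return fields;
}

Each input class can then take the resulting vector<string> in its constructor and convert individual fields with std::stod / std::stoi as needed.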
That's basically most of the battle there. Once you've pulled your data into sensibly organized classes, the rest should come more easily. And if you run into bumps, hopefully you can ask specific design or implementation questions from there.
I have a file recording the entries of a lookup table. If the number of entries is small, I can simply load the file into an STL map and search it in my code. But what if there are very many entries? Doing it the way above may cause an error such as running out of memory. I'm here to listen to your advice...
P.S. I just want to perform the search without loading all entries into memory.
Can a key-value database solve this problem?
You'll have to load the data from the hard drive eventually, but if a table is huge it won't fit into memory for a linear search through it, so:
think about whether you can split the data into a set of files
make an index table of which file contains which entries (say the first 100 entries are in "file1_100", the second hundred in "file101_200", and so on)
using the index table from step 2, locate the file to load
load the file and do a linear search (a C++ sketch of this scheme follows below)
That is a really simplified scheme for a typical database management system, so you may want to just use one, such as MySQL, PostgreSQL, MS SQL Server, or Oracle.
If this is a study project, then after you're done with the search problem, consider optimizing the linear operations (by switching to something like binary search) and the tables (real databases use balanced tree structures, hash tables and the like).
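A minimal C++ sketch of that scheme (step numbers refer to the list above), assuming integer keys, data files that each hold a contiguous key range, and a simple key=value line format; the keys, file names, and line format are all assumptions:

#include <algorithm>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// One index entry: the first key stored in a data file.
struct IndexEntry {
    long firstKey;
    std::string fileName;
};

// Step 2: the index table, kept sorted by firstKey (small enough for memory).
std::vector<IndexEntry> fileIndex = {
    {1,   "file1_100"},     // keys 1..100   (assumed layout)
    {101, "file101_200"},   // keys 101..200
    {201, "file201_300"},
};

// Steps 3-4: locate the right file, load only that file, linear-search it.
std::optional<std::string> lookup(long key) {
    auto it = std::upper_bound(fileIndex.begin(), fileIndex.end(), key,
        [](long k, const IndexEntry& e) { return k < e.firstKey; });
    if (it == fileIndex.begin()) return std::nullopt;  // key smaller than anything indexed
    --it;                                              // file whose range contains the key

    std::ifstream in(it->fileName);
    std::string line;
    while (std::getline(in, line)) {                   // each line assumed to be "key=value"
        std::istringstream row(line);
        long k; char eq; std::string value;
        if (row >> k >> eq && std::getline(row, value) && k == key) return value;
    }
    return std::nullopt;
}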
One method would be to reorganize the data in the file into groups.
For example, let's consider a full language dictionary. Usually, dictionaries are too huge to read completely into memory. So one idea is to group the words by first letter.
In this example, you would first read in the appropriate group based on the letter. So if the word you are searching for begins with "m", you would load the "m" group into memory.
There are other methods of grouping, such as by word (key) length. There can also be subgroups. In this example, you could divide the "m" group by word length or by second letter.
After grouping, you may want to write the data back to another file so you don't have to modify the data anymore.
There are many ways to store the groups in the file, such as using a "section" marker. Those would be for another question, though.
The ideas here, including those in the other answer, are to structure the data for the most efficient search, given your memory constraints.
I am running a NetLogo model in BehaviorSpace, varying the number of runs each time. I have a turtle breed, pigs, and each pig accumulates a table with patch types as keys and the number of visits to each patch type as values.
In the end I calculate a list of the mean number of visits across all pigs. The list always has the same length, as long as the original table has the same number of keys (the number of patch types). I would like to export this mean number of visits to each patch type with BehaviorSpace.
Perhaps I could write a separate CSV file (I tried; it creates many files, so there is a lot of work later putting them together). But I would rather have everything in the same output file after a run.
I could make a global variable for each patch type, but this seems crude and wrong, especially if I upload a different patch configuration.
I tried just exporting the list, but then in Excel I see it with brackets, e.g. [49 0 31.5 76 7 0].
So my question, Q1: is there a proper way to export a list of values so that the BehaviorSpace table output CSV has a column for each value?
Q2: Or perhaps there is an example of how to output a single CSV that looks exactly the way I want from BehaviorSpace?
PS: In my case the patch types are costs, and I might change those in the future and rerun everything. Ideally I would like the output to be a graph of costs vs. frequency of visits.
Thanks
If the lists are a fixed length that doesn't vary from run to run, you can get the items into separate columns by using one metric for each item. So in your BehaviorSpace experiment definition, instead of putting mylist, put item 0 mylist and item 1 mylist and so on.
If the lists aren't always the same length, you're out of luck. BehaviorSpace isn't flexible that way. You would have to write a separate program (in the programming language of your choice, perhaps NetLogo itself, perhaps an Excel macro, perhaps something else) to postprocess the BehaviorSpace output and make it look how you want.
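If you go the postprocessing route, here is a minimal C++ sketch of the core transformation: it takes a bracketed NetLogo list as it appears in the exported cell (e.g. [49 0 31.5 76 7 0]) and writes the values as separate CSV fields. Reading the full BehaviorSpace table output, with its header rows and quoting, is left out, so treat this as a sketch of the list-to-columns step only.

#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Turn a NetLogo list literal like "[49 0 31.5 76 7 0]" into separate values.
std::vector<std::string> explodeList(const std::string& cell) {
    std::string inner = cell;
    // Strip the surrounding brackets if present.
    if (!inner.empty() && inner.front() == '[') inner.erase(0, 1);
    if (!inner.empty() && inner.back() == ']') inner.pop_back();

    std::vector<std::string> values;
    std::istringstream stream(inner);
    std::string value;
    while (stream >> value) values.push_back(value);   // split on whitespace
    return values;
}

int main() {
    std::string cell = "[49 0 31.5 76 7 0]";           // example cell from the question
    std::vector<std::string> values = explodeList(cell);
    for (std::size_t i = 0; i < values.size(); ++i) {
        std::cout << values[i] << (i + 1 < values.size() ? "," : "\n");  // one CSV column per value
    }
}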
I need to export data from 3 maps, preferably to a single CSV, and would like to do so without simply making a column for every possible key (there may be up to 65024 of them).
The output would be a CSV containing the value at each of the keys at each timestep (of which there may be several hundred thousand).
Anyone got any ideas?
Reduce the granularity by categorizing your keys into groups, and store them with one timestep per row. Then you can plot one data point per line.
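For illustration, a minimal C++ sketch of that idea, assuming the per-timestep data can be viewed as a map from an integer key to a value, a grouping rule of key / groupSize, and the group mean as the stored number; all of those are assumptions to adapt:

#include <cstddef>
#include <fstream>
#include <map>
#include <string>
#include <vector>

// Write one CSV row per timestep, one column per key group (mean of the group),
// instead of one column per key.
void writeGrouped(const std::vector<std::map<int, double>>& timesteps,
                  const std::string& outPath, int groupSize = 1000) {
    std::ofstream out(outPath);
    for (std::size_t t = 0; t < timesteps.size(); ++t) {
        std::map<int, double> sum;
        std::map<int, int> count;
        for (const auto& kv : timesteps[t]) {
            int group = kv.first / groupSize;          // assumed grouping rule
            sum[group] += kv.second;
            ++count[group];
        }
        out << t;
        for (const auto& g : sum) {
            out << ',' << g.second / count[g.first];   // mean value of the group
        }
        out << '\n';
    }
}

Note that if different timesteps touch different key groups you would want to emit a fixed set of group columns (plus a header row) so the columns line up; that detail is omitted here.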
Let me know if you need clarification; I'd need some more info.