I'm using reportstats edge to download some reports in CSV format. (It probably applies to XLS as well)
What I've noticed:
headers have different descriptions than the data_columns parameters - is there a resource describing the mapping? (e.g. adgroup_id -> 'Ad ID', adgroup_name -> 'Ad Name', unique_impressions -> 'Reach', ...)
will the order of the CSV columns be as defined in the data_columns param?
some columns are not returned in CSV format - two I've identified so far are inline_actions and unique_social_clicks. The column is skipped in CSV format but is available in JSON - is this a bug, or is there a reason for it?
general question - does the CSV format require pagination, or will I always get all of the data?
value mapping - the constant values in CSV/XLS format have different labels, e.g. placement (desktop_feed -> 'News Feed on Desktop Computers'). Is there a resource describing all the possible values?
asynchronous report requests - it happens quite often that although I'm checking the report_run_id for async_percent_completion, the data is still not available when it should be. I get the text response "No data available." I need to retry, and then it's usually available. Is this expected?
Thanks!
The different names in the API and XLS are intentional; API developers prefer naming consistent with the rest of the Ads API, but people using XLS exports are often not developers and prefer human-friendly naming.
You can use export_columns to define the order.
inline_actions/unique_social_clicks - not sure; these might be deprecated.
CSV format will give you all of the data; no pagination is needed.
I don't think there's a public resource for the mapping between placement values :-(
You need to check the report_run_id for the job status (field "async_status"); that should work reliably. Once it's "Job Completed" you should be able to get the data.
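As a sketch of that polling loop in Python (the Graph API host and the download step here are assumptions - check the exact URLs against the Ads API version you're using; requests is assumed installed):

import time
import requests

ACCESS_TOKEN = "..."      # placeholder token
REPORT_RUN_ID = "123456"  # placeholder report run id

# Poll the report run until async_status says the job is done,
# rather than relying on async_percent_completion alone.
status_url = f"https://graph.facebook.com/{REPORT_RUN_ID}"
while True:
    status = requests.get(status_url, params={
        "fields": "async_status,async_percent_completion",
        "access_token": ACCESS_TOKEN,
    }).json()
    if status.get("async_status") == "Job Completed":
        break
    time.sleep(5)  # back off between polls

# Only now request the CSV export; the exact reportstats download URL
# depends on the endpoint/version you are using.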
Looking at the documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter.
From testing, it looks like setting dataset=True allows, among other things, appending new data to an already existing dataset. It also looks like when dataset=True, I can't specify the file name, and AWS autogenerates the names of the files which are added to the specified path.
Apart from that, I can't find more information on what dataset means. Is it just referring to the general concept or is there a specific meaning within the context of AWS? What exactly is dataset and when should it be set to True?
The dataset=True option allows you to store the entire dataset, including all metadata, indexes, etc.
The dataset parameter documentation:
dataset (bool) – If True store as a dataset instead of ordinary file(s) If True, enable all follow arguments: partition_cols, mode, database, table, description, parameters, columns_comments, concurrent_partitioning, catalog_versioning, projection_enabled, projection_types, projection_ranges, projection_values, projection_intervals, projection_digits, catalog_id, schema_evolution.
Note all the extra things that get saved when you save a dataset. All that information, like columns_comments, concurrent_partitioning, and projection_values, will be lost when you save to a plain CSV or Parquet file instead. On the other hand, those values are probably only useful if you plan to do further manipulation of the data via awswrangler/pandas at some later date.
Also note that if you set dataset=True, you have to give it a path prefix instead of a single file name, because the output generated will be spread across multiple files.
If you want to use the data in any other tool besides Pandas, such as loading the CSV into Excel, then you most likely want to set dataset=False and output to a single file.
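To make the difference concrete, here's a minimal sketch (the bucket and paths are placeholders):

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2], "value": [10.0, 20.0]})

# dataset=False (the default): write exactly one file at the given key.
wr.s3.to_csv(df, path="s3://my-bucket/exports/report.csv", index=False)

# dataset=True: path is a prefix, file names are auto-generated, and
# dataset-only arguments like mode and partition_cols become available.
wr.s3.to_parquet(
    df,
    path="s3://my-bucket/datasets/customers/",
    dataset=True,
    mode="append",                   # append to the existing dataset
    partition_cols=["customer_id"],  # written as Hive-style partitions
)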
I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest among them for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one
Flatten it, convert to table rows, and insert into BQ.
But I am unable to proceed with the grouping step. I see GroupByKey.create() but am unable to make it use customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey, you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
PCollection<JsonObject> objects =
    p.apply(TextIO.read().from("gs://my-bucket/input/*.json"))  // placeholder path; assumes newline-delimited JSON
     .apply(MapElements.into(TypeDescriptor.of(JsonObject.class))
         .via((String line) -> JsonParser.parseString(line).getAsJsonObject()));
// Once we have the data in JsonObjects, we key by customer ID:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects.apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via((JsonObject elm) -> KV.of(elm.get("customerId").getAsString(), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check the timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.
I have a pbix file that takes an Azure Storage account as a parameter and reads data from there accordingly. The next step is to be able to embed this Power BI dashboard on a webpage and let the end user specify the storage account. I see a lot of questions and answers surrounding passing in filter query parameters--this is different; we're trying to read from a completely different data source, not filtering a static data source.
Another way to ask this question is: is there a way to embed Power BI template files? If not, is there a feature request somewhere we can upvote?
The short answer is no.
There is a reason to use filters in this case instead of parameters. Parameters are part of the report itself. Each user that looks at your report will get the same parameter values as the others. If one of them changes a parameter, this will affect all other users. Filters, on the other hand, are local to your session. You can filter the report the way you like, and this will not affect other users' experience in any way.
You can't embed templates, because a template is simply a saved state of the report on disk. When you open it, it's not a template anymore, but becomes a report.
You can either combine the data from all of your data sources in a single report, adding one more column to indicate where the data comes from, and then filter on this new column; or create/modify an ETL process (for example, dataflows can be used for this) to combine these data sources into a single one.
I have a SearchDomain on AWS CloudSearch, and I know all the defined facet names.
I would like to build a web query form to use it, but I want to show my category values (facets) on the side, like it is done on the Amazon webstore.
The only way I have to get facet values is to make a query (params query), and the answer will contain the facets linked to my query results.
Is there a way to fetch all the possible facet.FIELD values to build the query form?
If not (as read here), how should I design a form using facets?
You could also use the matchall keyword in a structured query. Also, since you don't need the results, you can pass size=0 so you only get the facets, which will reduce latency.
/2013-01-01/search?q=matchall&q.parser=structured&size=0&facet.field={sort:'count',size:100}
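For example, with boto3 (the endpoint URL and the "category" facet field are placeholders for your domain's search endpoint and your facet-enabled field):

import boto3

client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-mydomain-xxxx.us-east-1.cloudsearch.amazonaws.com",
)

# matchall + size=0 returns no hits, only the facet counts.
response = client.search(
    query="matchall",
    queryParser="structured",
    size=0,
    facet='{"category": {"sort": "count", "size": 100}}',
)
for bucket in response["facets"]["category"]["buckets"]:
    print(bucket["value"], bucket["count"])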
I am wondering how I can efficiently manage formats in SAS for a reporting office that takes in data from various sources, some with proper lookup tables / metadata, and some without.
For data sources that have proper metadata, joining tables for value descriptions works fine, but when metadata doesn't exist and needs to be maintained separately, how should that be done? Some straightforward examples/ideas:
Plain .sas files with a native PROC FORMAT step that is maintained separately.
External files (e.g., Excel, CSV) that are maintained separately and imported into SAS to create a format library.
Database tables maintained separately that can be read from to create a format library.
In addition to just the formatted values, managing value changes (i.e., effective dates for certain values) is also a concern.
Any help in conventions or standards that work well for this type of task is greatly appreciated.
I'm not sure there's a single best solution here - it depends largely on your environment, your users, etc.
If you have fairly naive users, then I'd definitely recommend a single complete repository if possible, whether that is a .sas7bcat file (if you are using a single SAS version/OS/bitness) or a ready-made table/dataset to input into PROC FORMAT (with a .sas file included in their autoexec to do the importing). The biggest drawbacks to this are that you have to manage it actively (you cannot allow users to write their own formats to the master format dataset, for example, as they may overwrite other ones), and that there will be additional work to ensure format names do not conflict - YNF. might be 1=YES 2=NO, or 1=YES 0=NO, or something else. This also doesn't allow you to handle effective dates very easily, but it's possible this is better for your users (and then you just handle the documentation separately).
If you have more advanced users, then you might consider a table/dataset that is more relational in nature. A hybrid approach might include a dataset with columns:
Dataset Name (qualified as needed to ensure uniqueness)
Format Name
Start
Label
Other elements (Type, HLO, etc.)
Effective date
That would allow users to make their own modifications (assuming you trust them enough to add the dataset name properly - or you set up a stored proc to do the adding from a temp table that checks for conflicts) and allow you to handle format names that conflict. You'd still have to have a way for the user to handle using multiple datasets, if that's necessary (such as by adding some unique element to the format name by default, like a 'dataset ID').
In my mind, however, the best option is using a data dictionary to handle the metadata, which combines self-documentation with metadata management. Similar to the above, you have a table with dataset and format elements, but you add columns for descriptive text (the question description, for example) and other useful information, depending on your use cases. This can be maintained in a database table or dataset, or perhaps more usefully in an Excel or similar document that can be shared with non-programmers and easily edited. I use this method for several projects, and it has paid off by allowing my users to help write the documentation for my code, keeping my programs accurate and up to date while minimizing back-and-forth discussions of updates. I just import the spreadsheet and run PROC FORMAT each time I run my data.
You can then have one spreadsheet per dataset, one tab, or one full spreadsheet with all datasets in them - whichever is easiest to use. This easily handles 'effective date' type issues as well - or even versioning, as that can be handled in the spreadsheet.
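If it helps, here is one sketch of the spreadsheet-to-PROC FORMAT import described above, done from Python via pandas and saspy rather than pure SAS (the file, sheet, and column names are hypothetical; the same CNTLIN approach works with PROC IMPORT inside SAS itself):

import pandas as pd
import saspy

# Read the data dictionary maintained in Excel; filter on effective
# date here if the dictionary versions its rows.
dd = pd.read_excel("data_dictionary.xlsx", sheet_name="formats")

# PROC FORMAT's CNTLIN= option expects FMTNAME/START/LABEL columns
# (TYPE, END, HLO, etc. are optional).
cntlin = dd.rename(columns={
    "format_name": "FMTNAME",
    "start": "START",
    "label": "LABEL",
})[["FMTNAME", "START", "LABEL"]]

sas = saspy.SASsession()
sas.df2sd(cntlin, table="fmtdata", libref="work")  # push the table into SAS
result = sas.submit("proc format library=work cntlin=work.fmtdata; run;")
print(result["LOG"])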