Google Cloud Platform URL resolving

In the billing usage report, links are given in the form of:
com.google.cloud/services/big-query/ActiveStorage
Does this correspond to an actual URL? If so, what would that be?

On closer inspection it looks like a namespace, since it starts with com.
What you posted is actually documented as a MeasurementId:
The ID of the type of resource that is being measured. For example,
VmimageN1Standard_1 to represent an n1-standard-1 machine type.
Example: com.google.cloud/services/compute-engine/VmimageN1Standard_1
The documentation also gives an example that contains a Resource URI:
+-------------+--------------------------------------------------------------+----------+---------+----------------------------------------------------------------------------------------------------+-------------+---------------+
| Report Date | MeasurementId | Quantity | Unit | Resource URI | Resource ID | Location |
+-------------+--------------------------------------------------------------+----------+---------+----------------------------------------------------------------------------------------------------+-------------+---------------+
| 02/13/2014 | com.google.cloud/services/compute-engine/VmimageN1Standard_1 | 86400 | seconds | https://www.googleapis.com/compute/v1/projects/myproject/zones/us-central1-a/instances/my-instance | 16557630484 | us-central1-a |
+-------------+--------------------------------------------------------------+----------+---------+----------------------------------------------------------------------------------------------------+-------------+---------------+
I think you could actually use Resource URI instead of MeasurementId to match billing export items.
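For example, a minimal sketch of matching export rows by Resource URI, assuming the usage report has been exported as a CSV with the columns shown above (the file name here is only illustrative):

import csv

with open("gcp_usage_report.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Select rows by the Resource URI column rather than the MeasurementId namespace string.
        if row["Resource URI"].startswith("https://www.googleapis.com/compute/"):
            print(row["MeasurementId"], row["Quantity"], row["Unit"])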

Related

Filtering by tags not working when using aws ec2 describe-instances from command line (cli)

I'm currently attempting to write an aws ec2 query from the command line (in AWS Linux, not that it should matter). I am trying to set a filter that matches both of the following:
Shows those instances that are in the off state (code 80); and
Shows those instances that have a tag "ShortPurpose" whose value is "Fleet".
What is actually happening is that all instances in the off state are being returned, regardless of whether they have the tag "ShortPurpose":"Fleet" set.
My instances are set up as so:
+-------------+--------------+------------------------+--+
| Instance ID | Tag | Tag Value | |
+-------------+--------------+------------------------+--+
| i-09876 | ShortPurpose | Fleet | |
| | Organisation | UmbrellaCorp | |
| | Name | cloud-01 | |
| | Owner | ORG-UMBR-ELLA | |
| | Purpose | Cloud processing fleet | |
+-------------+--------------+------------------------+--+
| | | | |
| i-12345 | (no tags) | | |
| | | | |
+-------------+--------------+------------------------+--+
The command I am using is:
aws ec2 describe-instances --query "Reservations[*].Instances[*].InstanceId" --filters "Name=tag:ShortPurpose,Values=Fleet,Name=instance-state-code,Values=80"
The results are the standard array style response. The instance state is successfully filtered upon, but not the tags.
I tried to verify your command, and as written it produces an error:
Error parsing parameter '--filters': Second instance of key "Name" encountered for input:
Name=tag:ShortPurpose,Values=Fleet,Name=instance-state-code,Values=80
^
This is often because there is a preceding "," instead of a space.
However, I was able to use it successfully on my sandbox instances as follows:
aws ec2 describe-instances \
--query "Reservations[*].Instances[*].InstanceId" \
--filters Name=tag:ShortPurpose,Values=Fleet Name=instance-state-code,Values=80
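If you are calling the API from Python rather than the CLI, the same two ANDed filters would look roughly like this with boto3 (an untested sketch, using the tag:ShortPurpose form from above):

import boto3

ec2 = boto3.client("ec2")

# The two filter dicts are ANDed together, just like the space-separated --filters above.
response = ec2.describe_instances(
    Filters=[
        {"Name": "tag:ShortPurpose", "Values": ["Fleet"]},
        {"Name": "instance-state-code", "Values": ["80"]},
    ]
)

instance_ids = [
    instance["InstanceId"]
    for reservation in response["Reservations"]
    for instance in reservation["Instances"]
]
print(instance_ids)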
I discovered from an example in the AWS documentation that I had the wrong query format. The correct query is:
aws ec2 describe-instances --query "Reservations[*].Instances[*].InstanceId" --filters "Name=tag-value,Values=Fleet" "Name=instance-state-code,Values=80"
Note that I'm ignoring the ShortPurpose tag and instead hunting directly for the value, which may exist in any tag.

How do I find a change point in a time series in Power BI

I have a group of people who started receiving a specific type of social benefit called BenefitA. I am interested in knowing what (if any) social benefits the people in the group might have received immediately before they started receiving BenefitA.
My ideal result would be a table with the number of people who were receiving BenefitB, BenefitC, or no benefit at all ("BenefitNon") immediately before they started receiving BenefitA.
My data is organized as a relational database with a fact table containing an ID for each person in my data and several dimension tables connected to the fact table. The important ones here are DimDreamYdelse (showing the type of benefit received) and DimDreamTid (showing week and year). Here is an example of the raw data.
Data Example
I'm not sure how to approach this in PowerBi as I am fairly new to this program. Any advice is most welcome.
I have tried to solve the problem in SQL, but as I need this as part of a running report, I need to do it in Power BI. This bit of code might, however, give some context for what I want to do.
USE FLISDATA_Beskaeftigelse;

SELECT dbo.FactDream.DimDreamTid,
       dbo.FactDream.DimDreamBenefit,
       dbo.DimDreamTid.Aar,
       dbo.DimDreamTid.UgeIAar,
       dbo.DimDreamYdelse.Ydelse
FROM dbo.FactDream
     INNER JOIN dbo.DimDreamTid ON dbo.FactDream.DimDreamTid = dbo.DimDreamTid.DimDreamTidID
     INNER JOIN dbo.DimDreamYdelse ON dbo.FactDream.DimDreamBenefit = dbo.DimDreamYdelse.DimDreamBenefitID
WHERE dbo.DimDreamYdelse.Ydelse LIKE 'Benefit%' AND dbo.DimDreamTid.Aar = '2019'
ORDER BY dbo.DimDreamTid.Aar, dbo.DimDreamTid.UgeIAar
I suggest using Power Query to transform your table into a form more suitable for this analysis. Things would be much easier if each row of the table represented a "change" of benefit plan, like this:
| Person ID | Benefit From | Benefit To | Date |
|-----------|--------------|------------|------------|
| 15 | BenefitNon | BenefitA | 2019-07-01 |
| 15 | BenefitA | BenefitNon | 2019-12-01 |
| 17 | BenefitC | BenefitA | 2019-06-01 |
| 17 | BenefitA | BenefitB | 2019-08-01 |
| 17 | BenefitB | BenefitA | 2019-09-01 |
| ...
Then you can simply count the rows with COUNTROWS(BenefitChanges), filtering/slicing on both Benefit From and Benefit To.
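Just to illustrate the shape of that transformation (this is pandas rather than Power Query, and the column names are only examples), the idea is to sort per person and compare each row's benefit with the previous one:

import pandas as pd

# Toy input in the shape of the fact table: one row per person per period, with the benefit received.
raw = pd.DataFrame({
    "PersonID": [15, 15, 15, 17, 17, 17],
    "Date": ["2019-06-01", "2019-07-01", "2019-12-01", "2019-05-01", "2019-06-01", "2019-08-01"],
    "Benefit": ["BenefitNon", "BenefitA", "BenefitNon", "BenefitC", "BenefitA", "BenefitB"],
})

raw = raw.sort_values(["PersonID", "Date"])
raw["Benefit From"] = raw.groupby("PersonID")["Benefit"].shift()
changes = raw.dropna(subset=["Benefit From"]).rename(columns={"Benefit": "Benefit To"})

# How many people moved into BenefitA from each previous benefit:
print(changes[changes["Benefit To"] == "BenefitA"].groupby("Benefit From").size())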

AWS Glue Crawlers - How to handle large directory structure of CSVs that may only contain strings

Been at this for a few days and any help is greatly appreciated.
Background:
I am attempting to create 1+ glue crawlers to crawl the following S3 "directory" structure:
.
+-- _source1
| +-- _item1
| | +-- _2019 #year
| | | +-- _08 #month
| | | | +-- _30 #day
| | | | | +-- FILE1.csv #files
| | | | | +-- FILE2.csv
| | | | +-- _31
| | | | | +-- FILE1.csv
| | | | | +-- FILE2.csv
| | | +-- _09
| | | | +-- _01
| | | | +-- _02
| +-- _item2
| | +-- _2019
| | | +-- _08
| | | | +-- _30
| | | | +-- _31
| | | +-- _09
| | | | +-- _01
| | | | +-- _02
+-- _source2
| +-- ....
........ # and so on...
This goes on for several sources, each with potentially 30+ items, each of which has the year/month/day directory structure within.
All files are CSVs, and files should not change once they're in S3. However, the schemas for the files within each item folder may have columns added in the future.
2019/12/01/FILE.csv may have additional columns compared to 2019/09/01/FILE.csv.
What I've Done:
In my testing so far, crawlers created at source-level directories (see above) have worked perfectly as long as no CSV contains only string-type columns.
This is due to the following restriction, as stated in the AWS docs:
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
Normally, I'd imagine you could get around this by creating a custom classifier that expects a certain CSV schema, but seeing as I may have 200+ items (different schemas) to crawl, I'd like to avoid this.
Proposed Solutions:
Ideally, I'd like to force my crawlers to interpret the first row of every CSV as a header, but this doesn't seem possible...
Add a dummy INT column to every CSV to force my crawlers to read the CSV headers, and delete/ignore the column down the pipeline. (Seems very hackish)
Find another file format that works (will require changes throughout my ETL pipeline)
DON'T USE GLUE
Thanks again for any help!
Found the issue: it turns out that for an updated Glue crawler classifier to take effect, a new crawler must be created with the updated classifier applied. As far as I can tell this is not explicitly mentioned in the AWS docs, and I've only seen mention of it over on GitHub.
Early on in my testing I modified an existing CSV classifier to specify "Has Columns", but never created a new crawler to apply my modified classifier to. Once I created a new crawler and applied the classifier, all Data Catalog tables were created as expected, regardless of column types.
TL;DR: Modified classifiers will not take effect unless they are applied to a new crawler. Source
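For reference, a rough boto3 sketch of that workflow, i.e. a CSV classifier declaring the header row and a brand-new crawler that uses it (the role, database, and bucket names here are placeholders):

import boto3

glue = boto3.client("glue")

# A CSV classifier that declares the first row as a header ("Has Columns" in the console).
glue.create_classifier(
    CsvClassifier={
        "Name": "csv-with-header",
        "Delimiter": ",",
        "ContainsHeader": "PRESENT",
    }
)

# Per the finding above, attach the classifier to a *new* crawler rather than editing an old one.
glue.create_crawler(
    Name="source1-crawler-v2",
    Role="AWSGlueServiceRole-example",
    DatabaseName="my_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/source1/"}]},
    Classifiers=["csv-with-header"],
)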

Reading S3 files in nested directory through Spark EMR

I figured out how to read files into my pyspark shell (and script) from an S3 directory, e.g. by using:
rdd = sc.wholeTextFiles('s3n://bucketname/dir/*')
But, while that's great in letting me read all the files in ONE directory, I want to read every single file from all of the directories.
I don't want to flatten them or load everything at once, because I will have memory issues.
Instead, I need it to automatically go load all the files from each sub-directory in a batched manner. Is that possible?
Here's my directory structure:
S3_bucket_name -> year (2016 or 2017) -> month (max 12 folders) -> day (max 31 folders) -> sub-day folders (max 30; basically just partitioning each day's collection).
Something like this, except it'll go for all 12 months and up to 31 days...
BucketName
|
|
|---Year(2016)
| |
| |---Month(11)
| | |
| | |---Day(01)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| | |---Day(02)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| |---Month(12)
|
|---Year(2017)
| |
| |---Month(1)
| | |
| | |---Day(01)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| | |---Day(02)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| |---Month(2)
Each arrow above represents a fork. e.g. I've been collecting data for 2 years, so there are 2 years in the "year" fork. Then for each year, up to 12 months max, and then for each month, up to 31 possible day folders. And in each day, there will be up to 30 folders just because I split it up that way...
I hope that makes sense...
I was looking at another post (read files recursively from sub directories with spark from s3 or local filesystem) where I believe they suggested using wildcards, so something like:
rdd = sc.wholeTextFiles('s3n://bucketname/*/data/*/*')
But the problem with that is it tries to find a common folder among the various subdirectories - in this case there are no guarantees and I would just need everything.
However, on that line of reasoning, I thought: what if I did...
rdd = sc.wholeTextFiles('s3n://bucketname/*/*/*/*/*')
But the issue is that now I get OutOfMemory errors, probably because it's loading everything at once and freaking out.
Ideally, what I would be able to do is this:
Go to the sub-directory level of the day and read those in, so e.g.
First read in 2016/12/01, then 2016/12/02, up until 2016/12/31, and then 2017/01/01, then 2017/01/02, ... 2017/01/31 and so on.
That way, instead of using five wildcards (*) as I did above, I would somehow have it know to look through each sub-directory at the level of "day".
I thought of using a python dictionary to specify the file path to each of the days, but that seems like a rather cumbersome approach. What I mean by that is as follows:
file_dict = {
    0: '2016/12/01/*/*',
    1: '2016/12/02/*/*',
    ...
    30: '2016/12/31/*/*',
}
basically for all the folders, and then iterating through them and loading them in using something like this:
sc.wholeTextFiles('s3n://bucketname/' + file_dict[i])
But I don't want to manually type out all those paths. I hope this made sense...
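One way around typing the paths by hand is to let boto3 enumerate the day-level prefixes and then loop over them, something along these lines (a sketch assuming boto3 is available on the driver and the bucket layout shown above):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Walk year/ -> month/ -> day/ using Delimiter='/' so only the "folder" names come back.
day_prefixes = []
for year_page in paginator.paginate(Bucket="bucketname", Delimiter="/"):
    for year in year_page.get("CommonPrefixes", []):
        for month_page in paginator.paginate(Bucket="bucketname", Prefix=year["Prefix"], Delimiter="/"):
            for month in month_page.get("CommonPrefixes", []):
                for day_page in paginator.paginate(Bucket="bucketname", Prefix=month["Prefix"], Delimiter="/"):
                    day_prefixes.extend(p["Prefix"] for p in day_page.get("CommonPrefixes", []))

# Load one day at a time instead of everything through five wildcards.
for prefix in day_prefixes:
    rdd = sc.wholeTextFiles("s3n://bucketname/" + prefix + "*/*")
    # ... process this day's batch here before moving on to the next one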
EDIT:
Another way of asking the question is, how do I read the files from a nested sub-directory structure in a batched way? How can I enumerate all the possible folder names in my s3 bucket in python? Maybe that would help...
EDIT2:
The structure of the data in each of my files is as follows:
{json object 1},
{json object 2},
{json object 3},
...
{json object n},
For it to be "true JSON", it either just needed to be like the above without a trailing comma at the end, or something like this (note the square brackets, and the lack of a final trailing comma):
[
{json object 1},
{json object 2},
{json object 3},
...
{json object n}
]
The reason I did it entirely in PySpark, as a script I submit, is that I forced myself to handle this formatting quirk manually. If I use Hive/Athena, I am not sure how to deal with it.
Why don't you use Hive, or even better, Athena? These will both deploy tables on top of file systems, to give you access to all the data. Then you can capture this into Spark.
Alternatively, I believe you can also use HiveQL in Spark to set up a temp table on top of your file system location, and it will register it all as a Hive table which you can execute SQL against. It's been a while since I've done that, but it is definitely doable.
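A minimal sketch of that temp-table idea, assuming a Spark 2.x SparkSession named spark and that the files have first been cleaned into valid newline-delimited JSON (the trailing-comma quirk above would otherwise trip up the reader):

# Read one day's batch (the path is illustrative), register it, and query it with SQL.
df = spark.read.json("s3n://bucketname/2016/12/01/*/*")
df.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) FROM events").show()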

Cucumber: repeating steps

I am learning Cucumber and trying to write a feature file.
Following is my feature file:
Feature: Doctors handover Notes Module
Scenario: Search for patients on the bases of filter criteria
Given I am on website login page
When I put username, password and select database:
| Field | Value |
| username | test |
| password | pass |
| database | test |
Then I login to eoasis
Then I click on doctors hand over notes link
And I am on doctors handover notes page
Then I select sites, wards, onCallTeam, grades, potential Discharge, outstanding task, High priority:
| siteList | wardsList | onCallTeamList | gradesList | potentialDischargeCB | outstandingTasksCB | highPriorityCB |
| THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | null | null | null | null | null |
| THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | GENERAL MEDICINE | null | null | null | null |
| THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | GENERAL MEDICINE | CONSULTANT | null | null | null |
| THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | GENERAL MEDICINE | CONSULTANT | true | null | null |
| THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | GENERAL MEDICINE | CONSULTANT | true | true | null |
| THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | GENERAL MEDICINE | CONSULTANT | true | true | true |
Then I click on search button
Then I should see search results
I want to repeat the last three steps: select the search criteria, click the search button, and then check the search results. How should I break up this feature file? If I use a Scenario Outline, there would be two different scenarios, one for login and one for the search criteria. Is that fine? Will the session be maintained in that case? What's the best way to write such a feature file?
Or is this the right way to write it?
I don't think we can have multiple example sets in a Scenario Outline.
Most of the scenario steps in the example are too procedural to each deserve their own step.
The first three steps could be reduced to something like:
Given I am logged into eoasis as a <user>
Code this in the step definition, which could make calls to a separate login method that takes care of entering the username and password and selecting the database.
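For example, with Python's behave this could look roughly like the sketch below; the credentials table and the login_page helper are made-up names, just to show the delegation to a single login routine:

from behave import given

# Hypothetical credentials lookup; in a real suite this might live in environment.py or config.
CREDENTIALS = {"doctor": {"username": "test", "password": "pass", "database": "test"}}

@given("I am logged into eoasis as a {user}")
def step_logged_in(context, user):
    creds = CREDENTIALS[user]
    # Delegate the whole login flow to one shared helper instead of separate click-by-click steps.
    context.login_page.login(creds["username"], creds["password"], creds["database"])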
Another rule is to avoid statements like "When I click the doctor's handover link". The keyword to avoid here is click. Today it's a click, tomorrow it could be a drop-down or a button. So the focus should be on the functional expectation of the user, which is viewing the handover notes. So we modify this to:
When I view the doctor's handover notes link
To summarize, this is how I would write this test.
Scenario Outline: Search for patients on the basis of filter criteria
Given I am logged into eoasis as a <user>
When I view the doctor's handover notes link
And I select sites, wards, onCallTeam, grades, potential Discharge, outstanding task, High priority
And perform a search
Then I should see the search results
Examples:
|sites |wards |onCallTeam |grades |potential Discharge |outstanding task |High priority|
| THE INFIRMARY | INFIRMARY WARD 9 - ASSESSMENT | null | null | null | null | null |
This really is the wrong way to write features. This feature is very imperative; it's all about HOW you do something. What a feature should do is explain WHY you are doing something.
Another bad thing this feature does is mix up the details of two different operations: signing in and searching for patients. Write a feature for each one, e.g.:
Feature: Signing in
As a doctor
I want my patients data to only be available if I sign in
So I ensure their confidentiality
Scenario: Sign in
Given I am a doctor
When I sign in
Then I should be signed in
Feature: Search for patients
Explain why searching for patients gives value to the doctor
...
You should focus on the name of the feature and the bit at the top that explains why this has value first. If you do that well then the scenarios are much easier to write (look how simple my sign in scenario is).
The art of writing features is doing this bit well, so that you end up with simple scenarios.