AWS CloudSearch Expression to limit certain values per page

I am trying to write an AWS CloudSearch Expression that could dynamically adjust the number of documents containing the same value (for a specific field) returned per page.
A public test domain containing the default AWS sample dataset "IMDB Movies" (5,000 documents) can be queried here: https://search-test-dzexg56e6mtgl6njjnjshe6ekq.eu-west-1.cloudsearch.amazonaws.com/2013-01-01/search?q=year:[2000,2020]&q.parser=structured&size=20
This query returns all movies with a "year" value between 2000 and 2020 inclusive.
I am trying to write an expression to limit the field "genres" per value per page.
Example:
There are 3,376 results in the query above. I want to limit to 2 the number of movies having the same genre on each page (page size = 20).
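For reference, here is a minimal Python sketch (using the requests library) that issues the same structured query against the test domain above. It only reproduces the query as-is and does not attempt any per-genre limiting:

import requests

# Test endpoint from the question
SEARCH_URL = ("https://search-test-dzexg56e6mtgl6njjnjshe6ekq"
              ".eu-west-1.cloudsearch.amazonaws.com/2013-01-01/search")

params = {
    "q": "year:[2000,2020]",  # structured range query on the "year" field
    "q.parser": "structured",
    "size": 20,               # page size
    "start": 0,               # offset used for paging
}

resp = requests.get(SEARCH_URL, params=params)
data = resp.json()
print(data["hits"]["found"])      # total number of matching documents
for hit in data["hits"]["hit"]:   # the current page of results
    print(hit["id"])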


Using SELECTEDVALUE in PowerBI

I have a table that contains the sample data from the attached image.
The sample table can be interpreted as follows: I have a list of customers with Customer_id, Customer_name and Email that have an account on 1 or more e-commerce sites. Every e-commerce site can be identified by the EcommerceSite_Id column.
If a customer has more than one account (eg: on EcommerceSite_Id = 111 and also on EcommerceSite_Id = 112) the GlobalClient_Id will have the same value (e.g. John has an account on the following EcommerceSite_Id: 111, 113 and 114. Therefore, he has the same GlobalClient_Id – “11” which is attributed based on some automatic criteria – in this example, email address).
What I want to achieve:
By using a slicer with the EcommerceSite_Id column, when selecting the EcommerceSite_Id 114, it should display all customers with a unique email address, that do not have an account in 114, by taking into account the GlobalClient_Id.
Therefore, the output should be:
As you can see, I excluded customer_ids 5 and 9. They do not have an account on 114, but I excluded them because they share the same GlobalClient_Id with customer_ids 10 and 8, customers that do have an account on 114.
I cannot find a solution. I tried to use selected values, but I don’t know if I am using it correctly.
Can you please give me an idea on how to solve it?
Welcome to SO. Here's one solution that might work for you:
Step 1. EcommerceSite_Id needs to be added to a separate table - it will contain distinct IDs:
Here's what the data model should look like:
Step 2. Create a measure
For each client, the measure will count the number of rows where a chosen EcommerceSite_Id appeared:
mExclude =
var EcommerceSite = SELECTEDVALUE(ES[EcommerceSite_Id])
var ThisCustomer = SELECTEDVALUE(CustomerData[Customer_name])
var Check = COUNTROWS(FILTER(ALL(CustomerData), CustomerData[Customer_name] = ThisCustomer && CustomerData[EcommerceSite_Id] = EcommerceSite))
return
Check + 0
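(The + 0 at the end makes the measure return 0 instead of BLANK when COUNTROWS finds no matching rows, which is what the "equal to zero" filter in step 3 relies on.)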
When added to a table, and having a single EcommerceSite_Id selected (use your new table as a slicer), you get the following results:
Step 3. Finishing up
Finally, remove mExclude from the table and add it to the table filters. Set it up to filter values that are equal to zero.
The final result:
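If it helps to sanity-check the rule outside Power BI, here is a rough pandas sketch of the exclusion logic as described in the question (only the column names come from the question; the rows and values below are made up):

import pandas as pd

# Entirely made-up sample rows, using the column names from the question
df = pd.DataFrame({
    "Customer_id":      [1, 2, 3, 4],
    "Customer_name":    ["Alice", "Alice", "Bob", "Bob"],
    "Email":            ["alice@example.com", "alice@example.com", "bob@example.com", "bob@example.com"],
    "GlobalClient_Id":  [11, 11, 22, 22],
    "EcommerceSite_Id": [111, 113, 112, 114],
})

selected_site = 114  # the value picked in the EcommerceSite_Id slicer

# GlobalClient_Ids that do have an account on the selected site
clients_on_site = set(df.loc[df["EcommerceSite_Id"] == selected_site, "GlobalClient_Id"])

# Keep customers whose global client has no account on the selected site,
# then reduce the result to one row per unique email address
result = df[~df["GlobalClient_Id"].isin(clients_on_site)].drop_duplicates(subset="Email")
print(result)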

Getting Info from GCP Data Catalog

I notice when you query the data catalog in the Google Cloud Platform it retrieves stats for the amount of times a table has been queried:
Queried (Past 30 days): 5332
This is extremely useful information and I was wondering where this is actually stored and if it can be retrieved for all the tables in a project or a dataset.
I have trawled the Data Catalog tutorials and written some Python scripts, but these just retrieve entry names for tables in an iterator, which is not what I am looking for.
Likewise I also cannot see this data in the information schema metadata.
You can retrieve the number of queries completed or performed against any table or dataset by exporting log entries to BigQuery. Every query generates some logging in Stackdriver, so you can use advanced filters to select the logs you are interested in and store them as a new table in BigQuery.
However, the retention period for the data access logs in GCP is 30 days, so you can only export the logs in the past 30 days.
For instance, use the following advanced filter to get the logs corresponding to all the completed jobs on a specific table:
resource.type="bigquery_resource" AND
log_name="projects/<project_name>/logs/cloudaudit.googleapis.com%2Fdata_access" AND
proto_payload.method_name="jobservice.jobcompleted"
"<table_name>"
Then select BigQuery as the Sink Service and specify a name for your sink table and the dataset where it will be stored.
All the jobs completed on this table after the sink is established will appear in the new table in BigQuery. You can then query this table to get information from the logs (for instance, you can use a COUNT statement on any column to get the total number of successful jobs).
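As an illustration of that last step, here is a minimal sketch using the BigQuery Python client; the project, dataset and table names are placeholders for wherever your sink writes the exported logs:

from google.cloud import bigquery

client = bigquery.Client()

# <sink_dataset>.<sink_table> is whatever you named the sink destination
query = """
    SELECT COUNT(*) AS completed_jobs
    FROM `<project_name>.<sink_dataset>.<sink_table>`
"""
row = next(iter(client.query(query).result()))
print(row.completed_jobs)  # total number of completed jobs captured so far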
This information is available through the projects.locations.entryGroups.entries/get API. It is available as a UsageSignal, which contains usage information for the past 24 hours, 7 days, and 30 days.
Sample output:
"usageSignal": {
"updateTime": "2021-05-23T06:59:59.971Z",
"usageWithinTimeRange": {
"30D": {
"totalCompletions": 156890,
"totalFailures": 3,
"totalCancellations": 1,
"totalExecutionTimeForCompletionsMillis": 6.973312e+08
},
"7D": {
"totalCompletions": 44318,
"totalFailures": 1,
"totalExecutionTimeForCompletionsMillis": 2.0592365e+08
},
"24H": {
"totalCompletions": 6302,
"totalExecutionTimeForCompletionsMillis": 25763162
}
}
}
Reference:
https://cloud.google.com/data-catalog/docs/reference/rest/v1/projects.locations.entryGroups.entries/get
https://cloud.google.com/data-catalog/docs/reference/rest/v1/projects.locations.entryGroups.entries#UsageSignal
With the Python Data Catalog client, you first need to search the Data Catalog, and you will receive a linked_resource in the response.
Pass this linked_resource in a request to lookup_entry and you can fetch the "Queried (Past 30 days)" value:
from google.cloud import datacatalog_v1

dc_client = datacatalog_v1.DataCatalogClient()

# The scope and query below are placeholders; adjust them to your own project and search criteria
scope = datacatalog_v1.SearchCatalogRequest.Scope(include_project_ids=["<project_id>"])
request = datacatalog_v1.SearchCatalogRequest(scope=scope, query="type=table")

results = dc_client.search_catalog(request=request, timeout=120.0)
for result in results:
    linked_resource = result.linked_resource
    # Get the entry and the number of times the table was queried in the last 30 days
    table_entry = dc_client.lookup_entry(request={"linked_resource": linked_resource})
    queried_past_30_days = table_entry.usage_signal.usage_within_time_range.get("30D")
    if queried_past_30_days is not None:
        dc_num_queried_past_30_days = int(queried_past_30_days.total_completions)
    else:
        dc_num_queried_past_30_days = 0

DynamoDB QuerySpec {MaxResultSize + filter expression}

From the DynamoDB documentation
The Query operation allows you to limit the number of items that it returns in the result. To do this, set the Limit parameter to the maximum number of items that you want.
For example, suppose you Query a table, with a Limit value of 6, and without a filter expression. The Query result will contain the first six items from the table that match the key condition expression from the request.
Now suppose you add a filter expression to the Query. In this case, DynamoDB will apply the filter expression to the six items that were returned, discarding those that do not match. The final Query result will contain 6 items or fewer, depending on the number of items that were filtered.
Looks like the following query should return (at least sometimes) 0 records.
In summary, I have a UserLogins table. A simplified version is:
1. UserId - HashKey
2. DeviceId - RangeKey
3. ActiveLogin - Boolean
4. TimeToLive - ...
Now, let's say UserId = X has 10,000 inactive logins in different DeviceIds and 1 active login.
However, when I run this query against my DynamoDB table:
QuerySpec{
hashKey: null,
rangeKeyCondition: null,
queryFilters: null,
nameMap: {"#0" -> "UserId"}, {"#1" -> "ActiveLogin"}
valueMap: {":0" -> "X"}, {":1" -> "true"}
exclusiveStartKey: null,
maxPageSize: null,
maxResultSize: 10,
req: {
    TableName: UserLogins,
    ConsistentRead: true,
    ReturnConsumedCapacity: TOTAL,
    FilterExpression: #1 = :1,
    KeyConditionExpression: #0 = :0,
    ExpressionAttributeNames: {#0=UserId, #1=ActiveLogin},
    ExpressionAttributeValues: {:0={S: X,}, :1={BOOL: true}}
}
I always get 1 row. The 1 active login for UserId=X. And it's not happening just for 1 user, it's happening for multiple users in a similar situation.
Are my results contradicting the DynamoDB documentation?
It looks like a contradiction because maxResultSize=10 means that DynamoDB will only read the first 10 items (out of 10,001) and only then apply the filter active=true (which might return 0 results). It seems very unlikely that the record with active=true happened to be among the first 10 records that DynamoDB read.
This is happening to hundreds of customers that are running similar queries. It works great, when according to the documentation it shouldn't be working.
I can't see any obvious problem with the Query. Are you sure about your premise that users have 10,000 items each?
Your keys are UserId and DeviceId. That seems to mean that if your user logs in with the same device, it would overwrite the existing item. Or, put another way, I think you are saying your users have 10,000 different devices each (unless the DeviceId rotates in some way).
In your shoes I would just remove the filter expression and print the results to the log to see what you're getting in your 10 results. Then remove the limit too and see what results you get with that.
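For what it's worth, here is a rough boto3 sketch of that debugging step (the original query uses the Java SDK's QuerySpec; the table, key and attribute names below come from the question):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("UserLogins")

# Same key condition and limit as the original query, but without the
# ActiveLogin filter, so you can see exactly which items DynamoDB reads
resp = table.query(
    KeyConditionExpression=Key("UserId").eq("X"),
    Limit=10,
    ConsistentRead=True,
)
for item in resp["Items"]:
    print(item.get("DeviceId"), item.get("ActiveLogin"))

# Present only if there were more matching items beyond the ones read
print("LastEvaluatedKey:", resp.get("LastEvaluatedKey"))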

Netsuite Saved Search a list of ID's by invoice No

I have a request to extract specific invoices over the course of a year. For the target invoices I have the following information to use: Invoice Number, Invoice Date, Order Ref, Value and Currency. I tried extracting a few months at a time, but it's too much data.
Is there a way to filter on about 200 unique invoice numbers in NetSuite?
Thank you,
Daniel
You can do this with a Formula(Numeric) filter in your criteria like this:
Filter: Formula (Numeric)
Description: is greater than 0
Formula: INSTR(',SLS00000101,SLS00000102,SLS00000103,SLS00000104,SLS00000105,', ',' || {tranid} || ',')
Note that the initial string is a comma separated list of document numbers that begins and ends with a comma.
I'm not sure if there is an upper limit on formula size, but I've used this pattern to find a large number of transactions when I know the document numbers or internal IDs.
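If typing out ~200 document numbers by hand is a concern, the comma-wrapped string for the formula can be built with a short script; here is a rough Python sketch (the invoice numbers shown are placeholders):

# Placeholder invoice numbers; replace with your ~200 real document numbers
invoice_numbers = ["SLS00000101", "SLS00000102", "SLS00000103"]

# Wrap the list in leading/trailing commas so ','||{tranid}||',' only matches whole numbers
id_list = "," + ",".join(invoice_numbers) + ","
formula = "INSTR('{}', ',' || {{tranid}} || ',')".format(id_list)
print(formula)  # paste this into the Formula (Numeric) filter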

How to parse through a column in Pig to create additional columns

New Apache Pig user here. I basically have data in the format below and need to split it into 6 columns to create my desired schema, then load it into Pig for my existing script to run.
Sorry if the format below is untidy; I can't upload a picture due to my reputation score.
The existing format has 3 columns:
User-Equipment values::key:bytearray values:value:bytearray
user1-mobile 20130306-AC 9
user1-mobile 20130306-AT 21
user2-laptop 20130306-BC 0
Required format:
User Equipment Date Type "Count or Time" Value
user1 mobile 20130306 A C 9
user1 mobile 20130306 A T 21
Any suggestions on how to get this done? Is there a regex I need to write?
The tricky thing here is that all the columns have a delimiter (-) between them except "Type" and the "C or T" column.
If you don't have a common delimiter I can think of two possibilities:
You could implement your own LoadFunc as explained here: http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
You could use REGEX_EXTRACT_ALL as explained here: Apache Pig: Extra query parameters from web log
Here you go for option 2:
A = LOAD 'abc.txt' AS (line:CHARARRAY);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^(.+?)\\-(.+?)\\s(.+?)\\-(.)(.)\\s(.+)$')) AS (User:CHARARRAY,Equipment:CHARARRAY,Date:CHARARRAY,Type:CHARARRAY,CountorTime:CHARARRAY,Value:CHARARRAY);
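To see how the six capture groups fall out of that expression, here is a quick sanity check of the same regex in Python against the sample rows (illustrative only; REGEX_EXTRACT_ALL does the real work in the Pig script above):

import re

# Same pattern as in the Pig script
pattern = re.compile(r'^(.+?)-(.+?)\s(.+?)-(.)(.)\s(.+)$')

for line in ["user1-mobile 20130306-AC 9",
             "user1-mobile 20130306-AT 21",
             "user2-laptop 20130306-BC 0"]:
    # Groups map to: User, Equipment, Date, Type, "Count or Time", Value
    print(pattern.match(line).groups())
# e.g. ('user1', 'mobile', '20130306', 'A', 'C', '9')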