Creating Index based on another field in logstash - regex

This question was asked 3 months ago. One of the answers helped me, but it doesn't solve every issue.
I am new to ELK and I have an issue to build the index based on another field.
Alain Collins's solution (see link) is pretty good: I could format the index as I wanted, but the send_to field appears in the output and cannot be removed. send_to acts only as a temporary variable used to build the index. Is there any way to not output the send_to field?

Sure - use a relatively new feature called metadata.
Put the value in a field like [@metadata][send_to], which you can then refer to in the output stanza. Metadata fields aren't sent to Elasticsearch, so they won't "pollute" your documents.
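For example, a minimal sketch (the field value and index pattern are illustrative):

filter {
  mutate {
    # Build the target index name in a metadata field; it never reaches Elasticsearch
    add_field => { "[@metadata][send_to]" => "logs-%{type}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "%{[@metadata][send_to]}"
  }
}

The [@metadata] fields are available to filters and outputs but are dropped before the event is shipped.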

Model Post and Topic through DynamoDB

Here's the relationship I'm trying to model in DynamoDB:
My service contains posts and topics. A post may belong to multiple topics; a topic may have multiple posts. All posts have an interest value, which is adjusted based on a combination of likes and time since posting; interest measures the popularity of a post at the current moment. If a post gets too old, its interest value will be 0 and stay that way forever (archival).
The REST API endpoints work like this:
GET /posts/{id} returns a post object containing the title, text, author name, a link to the author's REST endpoint (doesn't matter for this example), and the number of likes (the interest value is not included)
GET /topics/{name} should return an object with both a list of the N newest posts of the topic and one of the N currently most interesting posts
POST /posts/ creates a new post, where multiple topics can be specified
POST /topics/ creates a new topic
POST /likes/ creates a like for a specified post (it does not actually create an object, just adds the user to the given post object's list of likers, which is invisible to the users)
The problem now becomes: how do I create a relationship between topics and posts in DynamoDB NoSQL?
I thought about adding a list of copies of posts to tag entries in DynamoDB, where every tag has a list of both the newest and the most interesting posts.
One way I could do this is by creating a CloudWatch job that runs every 10 minutes and loops through every topic object, finding both the most interesting and the newest entries and then replacing the topic's old lists.
Another job would also have to regularly update the interest value of every non-archived post (keep in mind both likes and time affect the interest value).
One problem with this is that a lot of posts in the tag list would be out of date for up to 10 minutes if the user makes a change or deletes the post. Likes will also not be properly tracked in the tags' post lists. This could perhaps be solved with transactions, although DynamoDB is limited to 10 objects per transaction.
Another problem is that it would require the add-posts-to-tags job to load all the non-archived posts into memory, manually sort them by both time and interest, split them up by tag, and then add the first N of both sets to the tag lists every 10 minutes.
I also had another idea: by limiting a post to a single tag, I could use the tag as the partition key, with the post time as the sort key, and use a GSI to add interest as a second sort key.
This does have several downsides though:
very popular tags may be limited to a single partition since all the posts share a single partition key
Tag limit is 1
A CloudWatch job to adjust the interest value of posts may still be required
It would require use of a GSI which may lead to dangerous race conditions
But it would have the advantage that there are no replications of the post objects aside from the GSI. It would also allow basically infinite paging of all posts by date instead of being limited to just the N newest posts.
So what is a good approach here? It seems both of my solutions have horrible dealbreakers. Is this just one of those problems that NoSQL simply can't solve?
You are trying to model relational data using a non-relational DB. To do this I would use two types of database.
I would store the post information in DynamoDB; in your example that covers:
GET /posts/{id}
POST /posts/
POST /likes/
For the topic-related information I would use Elasticsearch (Amazon Elasticsearch Service):
GET /topics/{name}: the search index would store the full topic info as well as post IDs and the relevant fields you want to search on (in your case, the update date, to get the most recent posts).
What this entails is a background process (in DynamoDB this can be done via Streams) that takes changes to the DynamoDB table (new posts, updates to like counts, etc.) and populates the search index; a sketch follows below.
Note: this can also be solved using a graph DB, but for scaling purposes it is better to separate the source of the data (posts) from the data relations (topics).
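A minimal sketch of that background process, assuming a Lambda function subscribed to the table's stream (configured with new and old images) and an Elasticsearch index named posts; the endpoint, attribute names, and index name are all illustrative:

from elasticsearch import Elasticsearch

es = Elasticsearch(["https://search-mydomain.example.com:443"])  # hypothetical endpoint

def handler(event, context):
    # Triggered by the DynamoDB stream; mirrors post changes into the search index.
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # Stream images are typed, e.g. {"S": "..."} for strings, {"N": "..."} for numbers
            doc = {
                "title": image["title"]["S"],
                "topics": [t["S"] for t in image.get("topics", {}).get("L", [])],
                "created_at": image["created_at"]["S"],
                "likes": int(image["likes"]["N"]),
            }
            es.index(index="posts", id=image["post_id"]["S"], body=doc)
        elif record["eventName"] == "REMOVE":
            old = record["dynamodb"]["OldImage"]
            es.delete(index="posts", id=old["post_id"]["S"])

With the posts index in place, both topic queries become simple searches: the newest posts sort on created_at, and the most interesting posts sort on a periodically recomputed interest field.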

Maintain an audit table through a reusable framework

I was asked to create a control table with Informatica. I am a newbie and do not have much knowledge about it. I saw the same kind of thing in my previous project but don't know how to create a mapplet for it. The requirement is that I have to create a mapplet which has the following columns:
- mapping_name
- session_name
- last_run_date
- source count
- target count
- status
So what happens is this:
Example: we executed a workflow with a particular mapping last week. Now, after 1 week, we are executing the same mapping.
The requirement is that we should fetch only those records which fall in this particular time frame (i.e. from the previous run to the current run). This is something I do not know how to do.
Can you please help me out? I can provide further details if required.
There is a solution provided in the link below, but it doesn't use a mapplet.
Note that if you want to use a mapplet, you won't get the 'status' attribute, and the mapplet approach can be difficult to implement for all mappings.
You can use this link to gather statistics as well.
http://powercenternotes.blogspot.com/2014/01/an-etl-framework-for-operational.html
Now, regarding your other requirement: it seems to me to be an issue with incremental extraction. You need to store the date when your flow last ran in a DB table or flat file.
Use that as a reference and pull anything with a greater date.
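A common way to sketch this is with a mapping variable (the variable, table, and column names here are hypothetical) in a source qualifier SQL override:

SELECT *
FROM   src_transactions
WHERE  last_updated_ts > TO_DATE('$$LAST_RUN_DATE', 'MM/DD/YYYY HH24:MI:SS')

An expression transformation can then call SETMAXVARIABLE($$LAST_RUN_DATE, last_updated_ts) so that, after a successful run, the persisted variable value becomes the new watermark for the next run.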
Mapplet: we used this approach earlier to gather statistics, but it is difficult because you need to add this mapplet plus a reusable generic target to capture the stats.
Input:
- Type_of_data (this can be source or target)
- unique_key (unique key of the mapping)
- MappingName = $PMMappingName
- SessionName = $PMSessionName
Aggregator:
- inputs: Type_of_data, unique_key, MappingName (group by), SessionName (group by)
- output: count_row = COUNT(*)
Output:
- Type_of_data
- MappingName
- SessionName
- count_row
Use a reusable generic target to capture all the rows. You need to add one set after each source and one set before each target. I think the approach in the link is better.

Django - How to add an entry in Ascending order in the database efficiently?

I am working on an application in Django that can manage my GRE words and other details of each word. Whenever I add a new word that I have learnt, it should insert the word and its details into the database alphabetically. Also, while retrieving, I want the details of a particular word extracted from the database.
Efficiency is the main issue.
Should I use SQLite? Should I use a file? Should I use a JSON object to store the data?
If using a file is the most efficient, what data structure should I implement?
Are there any functions in Django to efficiently do this?
Each word will have - meaning, sentence, picture, roots. How should I store all this information?
It's fine if the answer is not Django specific and talks about the algorithm or the type of database.
I'm going to answer from the data perspective, since this is not really Django-specific.
From your question it appears you have a fixed identifier for each "row": the word, which is a string, plus a fixed set of attributes.
I would recommend using any enterprise-level RDBMS. In the case of Django, the most popular in the Python ecosystem is PostgreSQL.
As for the ordering, just create the table with an index on the word name (this will be automatically done for you if you use the word as primary key), and retrieve your records using order_by in django.
Here's some info on Django field options (check primary_key=True),
and here's the info for the order_by method.
Keep in mind you can also set the ordering in the Meta class of the model.
For your search case, you'll have to implement an endpoint that queries your database with startswith; a sketch follows after the example model. You can check an example here.
Example model:
class Word(models.Model):
    word = models.CharField(max_length=255, primary_key=True)
    roots = ...
    picture = ...
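A minimal sketch of that lookup endpoint (the view name, query parameter, and result limit are illustrative):

from django.http import JsonResponse
from .models import Word

def search_words(request):
    prefix = request.GET.get("q", "")
    # The primary-key index keeps this lookup fast; order_by returns matches alphabetically.
    matches = Word.objects.filter(word__startswith=prefix).order_by("word")[:20]
    return JsonResponse({"words": [w.word for w in matches]})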
On your second question, "Is this costly?": it really depends. With 4,000 words I'd say no.
You probably want to add a delay in the client before issuing the query anyway (for example, only query once the user has typed something and 500 ms have passed without further input).
If I'm to give one piece of advice to any starting developer, it's this: don't optimize prematurely.

Explicit correspondence between Google spreadsheet cells and Google form input fields

I am making a web form for people to sign up for an event. I found that I can use a Google form and a Google spreadsheet for this task, and I started to learn how to use these web applications. I would also like to send an automatic confirmation e-mail to those who have signed up. For this task, I am also looking into Google Apps Script. As far as I understand, I should define a function to send e-mail in a script in the spreadsheet and trigger this function on the form-submission event. I would like to identify the e-mail address of a person who signed up from the data he/she submitted, and I would like to include all the submitted data as well as the timestamp in the confirmation e-mail.
My questions are the following.
How can I identify the cell in the spreadsheet into which the value of an input field in the Form is stored?
Or, is there any way that I can read the values of the respective input fields from a Google Apps Script?
I would be glad if you could kindly refer me to an unambiguous API reference related to these questions.
So far I learned about the applications from the help pages provided in Google Drive,
e.g.
https://developers.google.com/apps-script/overview
However, I feel the documents there are too concise.
I am learning how to send confirmation e-mail from this Google Apps Script:
FormSubmissionResponseEmail
I could not find a help document that explicitly relates an input field in a Google form to a cell in the Google spreadsheet. From my limited experiments, it seems that the timestamps are always stored in the first column of the spreadsheet. Is this guaranteed? The 'namedValues' member of the 'Spreadsheet Form Submit Events' class is said to contain "the question names and values from the form submission" (https://developers.google.com/apps-script/understanding_events).
However, when I modified the Google form, the 'namedValues' member still held the elements corresponding to deleted input fields. Is there any way to loop over only those elements in 'namedValues' that correspond to the fields actually filled in by a user?
I would also be glad to hear about alternative tools to replace Google form and Google spreadsheet.
This answer applies to the "new Forms", the 2013 version, not "Legacy Forms" which have been discontinued.
How can I identify the cell in the spreadsheet into which the value of an input field in the Form is stored?
You can identify the column that will collect answers to a form question by the label in row 1. Armed with that knowledge, you can reference the answers by column number in functions such as getRange().
...is there any way that I can read the values of the respective input fields from a Google Apps Script?
There are multiple ways to reference input values:
As you found in Understanding Events, using a function triggered by Form Submission you can retrieve input values from the event itself. Two options here: you get the set of values in an array (e.values), and e.namedValues, which you can reference using the question text as a name (a sketch using namedValues follows after these options).
You can read the data from the spreadsheet; within that same trigger function, you could use e.range.getValues() to get an array with all the submitted values, which you could then reference by index. Remember that this is a 0-based array, while the column numbering starts at 1.
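For the confirmation e-mail in the original question, a minimal sketch of such a trigger function (it assumes the form has a question titled "Email Address", and it must be attached via an installable "On form submit" trigger):

function onFormSubmit(e) {
  // Assumption: one of the form questions is titled "Email Address"
  var recipient = e.namedValues['Email Address'][0];
  var body = 'Thanks for signing up! Here is what you submitted:\n\n';
  for (var question in e.namedValues) {
    body += question + ': ' + e.namedValues[question] + '\n';
  }
  MailApp.sendEmail(recipient, 'Sign-up confirmation', body);
}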
From my limited number of experiment cases, it seems that the timestamps are always stored in the first column of the spreadsheet. Is this guaranteed?
Unless you modify the spreadsheet, the timestamp will be in the first column. It is possible to insert columns to the left of the data table created by Forms, which will affect where you would find your form results in the sheet (although e.range will adjust). The order of all following values will be the order that the questions were created in the form. Note that if you delete a column (because you removed the question from the form, say), the column for the "deleted" question will be recreated at the end of the row.
...when I modified the Google form, the 'namedValues' member still held the elements corresponding to deleted input fields. Is there any way to loop over only those elements in namedValues that correspond to the fields actually input by a user?
There are reasons for remembering past questions, although they are just a bother when they weren't used to collect any real data. Better planning can be used to avoid the problem!
Unanswered questions will be empty strings in the event (e.g. e.namedValues["Dead Question"] == ''). So you could skip them like this:
for (var ans in e.namedValues) {
  if (e.namedValues[ans] != '') {
    doSomethingWith(e.namedValues[ans]);
  }
}
Note, too, that you can get array of "headers", or the form questions, like this:
var sheet = SpreadsheetApp.getActiveSpreadsheet();
var headers = sheet.getDataRange().getValues()[0];
...and then use a search of headers to find the column containing the answer you're looking for.
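For example (the question title here is illustrative):

var col = headers.indexOf('Email Address') + 1;  // 1-based column number, usable with getRange()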

Updating two tables, retrieving Foreign Key

I am inserting data into two tables; however, I cannot figure out (after hours of Googling) how to insert data into the second table using the new ID created by the first insert.
I'm using <CFINSERT>.
Use <cfquery result="result_name"> and the new ID will be available at result_name.generatedKey. <cfinsert> and <cfupdate>, while easy and fast for simple jobs, are pretty limited.
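A minimal sketch, with the datasource, table, and column names invented for illustration:

<cfquery datasource="myDSN" result="insertResult">
    INSERT INTO parent_table (name)
    VALUES (<cfqueryparam value="#form.name#" cfsqltype="cf_sql_varchar">)
</cfquery>

<cfquery datasource="myDSN">
    INSERT INTO child_table (parent_id, detail)
    VALUES (
        <cfqueryparam value="#insertResult.generatedKey#" cfsqltype="cf_sql_integer">,
        <cfqueryparam value="#form.detail#" cfsqltype="cf_sql_varchar">
    )
</cfquery>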
I have never used cfinsert myself, but this blog post from Ben Forta says you may not be able to use cfinsert if you need a generated key http://www.forta.com/blog/index.cfm/2006/10/3/Use-CFINSERT-And-CFUPDATE
Yes, I realize that blog post is old, but it doesn't appear much has changed.
Why not use a traditional INSERT statement wrapped in a <cfquery> tag?