Parse lines from messages in a Splunk query to be displayed as a chart on a dashboard - regex

I generate events on multiple computers listing the names of services that aren't running. I want to make a chart that displays the top offending service names.
I can use the following to get a table for the dashboard:
ComputerName="*.ourDomain.com" sourcetype="WinEventLog:Application" EventCode=7223 SourceName="internalSystem"
| eval Date_Time=strftime(_time, "%Y-%m-%d %H:%M")
| table host, Date_Time, Message, EventCode
Typical Message(s) will contain:
The following services were not running after 5603 seconds and a start command has been sent:
Service1
Service2
The following services were not running after 985 seconds and a start command has been sent:
Service2
Service3
Using regex I can make a named group of everything but the first line with (?<Services>((?<=\n)).*)
However, I don't think this is the right approach, as I don't know how to turn the captured text into per-service counts for the chart.
So in essence, how do I grab and tally service names from messages in Splunk?
Edit 1:
Coming back to this after a few days.
I created a field extraction called "Services" with regex that grabs the contents of each message after the first line.
If I use | stats count BY Services it counts each message as a whole instead of the lines inside. The results look like this:
Service1 Service2 | Count: 1
Service2 Service3 | Count: 1
My intention is to have it treat each line as its own value so the results would look like:
Service1 | Count: 1
Service2 | Count: 2
Service3 | Count: 1
I tried | mvexpand Services but it didn't change the output, so I assume I'm either using it improperly or it's not applicable here.

I think you can do it with the stats command.
| stats count by service
will give the number of appearances for each service. You can then choose the bar chart visualization to create a graph.
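To chart only the top offenders, a sort and truncation can be appended (a sketch, assuming the per-service field is named Services as in the edit above):
| stats count BY Services
| sort -count
| head 10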

I ended up using split() and mvexpand to solve this problem.
This is what worked in the end:
My search
| eval events=split(Service, "
")
| mvexpand events
| eval events=replace(events, "[\n\r]", "")
| stats count BY events
I had to add the replace() call because an event listing a single service was treated differently from an event listing several: after the split, each service from a multi-service event still carried a trailing carriage return, hence the replace.
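As an aside, makemv can tokenize the field into a multivalue field in one step, which avoids the trailing-carriage-return cleanup (a sketch, assuming the extracted field is named Services; the tokenizer regex captures each run of non-newline characters):
| makemv tokenizer="([^\r\n]+)" Services
| mvexpand Services
| stats count BY Services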
My end result dashboard chart:

For a clean drop-down chart:
index="yourIndex" "<searchCriteria>" | stats count(eval(searchmatch("
<searchCriteria>"))) as TotalCount
count(eval(searchmatch("search1"))) as Name1
count(eval(searchmatch("search2" ))) as Name2
count(eval(searchmatch("search3"))) as Name3
| transpose 5
| rename column as "Name", "row 1" as "Count"
Horizontal table example with percentages:
index=something "Barcode_Fail" OR "Barcode_Success"
| stats count(eval(searchmatch("Barcode_Success"))) as SuccessCount
    count(eval(searchmatch("Barcode_Fail"))) as FailureCount
    count(eval(searchmatch("Barcode_*"))) as Totals
| eval Failure_Rate=FailureCount/Totals
| eval Success_Rate=SuccessCount/Totals
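To display the rates as rounded percentages instead of fractions, eval's round() and string concatenation can be appended (a sketch):
| eval Failure_Rate=round(FailureCount/Totals*100, 2)."%"
| eval Success_Rate=round(SuccessCount/Totals*100, 2)."%"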

Related

Apache Beam - BigQuery Upsert

I have a Dataflow job which splits up a single file into x number of records (tables). These flow into BigQuery with no problem.
What I found, though, was that there was no way to execute a further stage of the pipeline after the results were written.
For example
# Collection1 - filtered on first two characters = 95
collection1 = (
    rows
    | 'Build pCollection1' >> beam.Filter(lambda s: data_ingestion.filterRowCollection(s, '95'))
    | 'p1 Entities to JSON' >> beam.Map(lambda s: data_ingestion.SplitRowDict(s, '95'))
    | 'Load p1 to BIGQUERY' >> beam.io.WriteToBigQuery(
        data_ingestion.spec1,
        schema=parse_table_schema_from_json(data_ingestion.getBqSchema('95')),
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)  # Write to BigQuery
)

# Collection2 - filtered on first two characters = 99
collection2 = (
    rows
    | 'Build pCollection2' >> beam.Filter(lambda s: data_ingestion.filterRowCollection(s, '99'))
    | 'p2 Split Entities to JSON' >> beam.Map(lambda s: data_ingestion.SplitRowDict(s, '99'))
    | 'Load p2 to BIGQUERY' >> beam.io.WriteToBigQuery(
        data_ingestion.spec2,
        schema=parse_table_schema_from_json(data_ingestion.getBqSchema('99')),
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)  # Write to BigQuery
)
Following the above I'd like to run something like the following:
final_output = (
    collection1, collection2
    | 'Log Completion' >> beam.io.WriteToPubSub('<topic>'))
Is there any way to run another part of the pipeline following the upsert to BigQuery, or is this impossible? Thanks in advance.
Technically, there's no way to do exactly what you asked: beam.io.WriteToBigQuery consumes the PCollection, leaving nothing to chain from.
However, it's simple to duplicate the input to beam.io.WriteToBigQuery in a ParDo just before you call beam.io.WriteToBigQuery, and to send copies of your PCollection down each path. See this answer, which references this sample DoFn from the docs.
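A minimal sketch of that duplication, assuming rows already holds the transformed elements and the pipeline can publish to Pub/Sub (which expects bytes); the Duplicate class and the tag names are hypothetical, and the BigQuery sink arguments are the question's own:

import json

import apache_beam as beam

class Duplicate(beam.DoFn):
    # Emit each element on two tagged outputs so both sinks receive a copy.
    def process(self, element):
        yield beam.pvalue.TaggedOutput('to_bq', element)
        yield beam.pvalue.TaggedOutput('to_log', element)

split = rows | 'Duplicate' >> beam.ParDo(Duplicate()).with_outputs('to_bq', 'to_log')

# One copy goes to BigQuery, exactly as in the question.
split.to_bq | 'Load p1 to BIGQUERY' >> beam.io.WriteToBigQuery(
    data_ingestion.spec1,
    schema=parse_table_schema_from_json(data_ingestion.getBqSchema('95')),
    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

# The other copy is serialized and published as the completion signal.
(split.to_log
 | 'To bytes' >> beam.Map(lambda e: json.dumps(e).encode('utf-8'))
 | 'Log Completion' >> beam.io.WriteToPubSub('<topic>'))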

Using dictionary in regexp_replace function in pyspark

I want to perform a regexp_replace operation on a PySpark dataframe column using a dictionary.
Dictionary: {'RD': 'ROAD', 'DR': 'DRIVE', 'AVE': 'AVENUE', ...}
The dictionary will have around 270 key-value pairs.
Input Dataframe:
ID | Address
---|---------------------------
1  | 22, COLLINS RD
2  | 11, HEMINGWAY DR
3  | AVIATOR BUILDING
4  | 33, PARK AVE MULLOHAND DR
Desired Output Dataframe:
ID | Address                    | Address_Clean
---|----------------------------|---------------------------------
1  | 22, COLLINS RD             | 22, COLLINS ROAD
2  | 11, HEMINGWAY DR           | 11, HEMINGWAY DRIVE
3  | AVIATOR BUILDING           | AVIATOR BUILDING
4  | 33, PARK AVE MULLOHAND DR  | 33, PARK AVENUE MULLOHAND DRIVE
I cannot find any documentation on the internet, and trying to pass the dictionary as below
data = data.withColumn('Address_Clean', regexp_replace('Address', dict))
throws the error "regexp_replace takes 3 arguments, 2 given".
The dataset will be around 20 million records. Hence, a UDF solution will be slow (it operates row-wise), and we don't have access to Spark 2.3.0, which supports pandas_udf.
Is there any efficient method of doing it, other than maybe using a loop?
It is throwing this error because regexp_replace() needs three arguments:
regexp_replace('column_to_change','pattern_to_be_changed','new_pattern')
But you are right, you don't need a UDF or a loop here. You just need some more regexp and a directory table that looks exactly like your original dictionary :)
Here is my solution for this:
import pyspark.sql.functions as sf

# First, strip out all the endings you want to replace.
# You can use the OR (|) operator for that.
# You could probably automate building this pattern from the dictionary keys,
# but I will leave that for you to decide.
input_df = input_df.withColumn('start_address', sf.regexp_replace('original_address', 'RD|DR|etc...', ''))

# You still need the old ending in a separate column,
# so you have something to join with the directory table.
input_df = input_df.withColumn('end_of_address', sf.regexp_extract('original_address', '(.*) (.*)', 2))

# Now join the directory table, which has two columns:
# the endings you want to replace and the endings you want instead.
input_df = directory_df.join(input_df, 'end_of_address')

# Finally, concatenate the stripped address with the correct ending.
input_df = input_df.withColumn('address_clean', sf.concat('start_address', 'correct_end'))
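For completeness, a sketch of building directory_df from the original dictionary (assuming an active SparkSession named spark; the column names match the join above, and only three of the ~270 pairs are shown):

abbreviations = {'RD': 'ROAD', 'DR': 'DRIVE', 'AVE': 'AVENUE'}  # ...and the rest
directory_df = spark.createDataFrame(
    [(k, v) for k, v in abbreviations.items()],
    ['end_of_address', 'correct_end'])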

Regex to extract two values from single string in Splunk

I have log statements appearing in Splunk as below.
info Request method=POST, time=100, id=12345
info Response statuscode=200, time=300, id=12345
I'm trying to write a Splunk query that would extract the time parameter from the lines starting with info Request and info Response and basically find the time difference. Is there a way I can do this in a query? I'm able to extract values separately from each statement but not the two values together.
I'm hoping for something like below, but I guess the piping won't work:
... | search log="info Request*" | rex field=log "time=(?<time1>[^\,]+)" | search log="info Response*" | rex field=log "time=(?<time2>[^\,]+)" | table time1, time2
Any help is highly appreciated.
General process:
Extract type into a field
Calculate response and request times
Group by id
Calculate the diff
You may want to use something other than the latest() aggregation, but it won't matter if there's only one request/response per id.
| rex field=_raw "info (?<type>\w+).*"
| eval requestTime = if(type="Request",time,NULL)
| eval responseTime = if(type="Response",time,NULL)
| stats latest(requestTime) as requestTime latest(responseTime) as responseTime by id
| eval diff = responseTime - requestTime
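If time isn't already an extracted field, the first rex can pull it out together with the type in one pass (a sketch matching the sample format above):
| rex field=_raw "info (?<type>\w+).*?time=(?<time>\d+)"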

Regex to match last sentence of a line

Got some text:
[23/07 | DEV | FARO | QC Billable | #2032] Unable to Load label
[30/07 | QC | ROLAWN ] Selling products as a bundle
[11/08 | EST | QC BILLABLE | #2015 ISUOG ] On Demand website looping
[05/08 | EST | ROLAWN | Problems with 'find a stockist'
[29/07 | DEV | QUBA] Blog comments loading to error
[24/07 | FROG | EST| QC BILLABLE #2033] Carousel banner not working correctly
I'm trying to match the last sentence at the end of each line so the matches are as follows:
Unable to Load label
Selling products as a bundle
On Demand website looping
Problems with 'find a stockist'
Blog comments loading to error
Carousel banner not working correctly
Unfortunately, I can't depend on the structure of the line to conform, but the information I'm trying to extract should always be the last sentence. I've tried quite a few different things, but I'm struggling here.
If there is also some kind of non-word character before the last sentence, try:
[\w\s']+$
Edit: the answer above by m.cekiera, [\w\s']+$, is better.
](.+)$
Here's a pretty naive solution: https://regex101.com/r/yT8jJ7/1.
If you give more details about the actual structure it could be refined.
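Given only the samples above, one refinement is to treat the last ] or | as the delimiter and grab the trailing run of characters containing neither (a sketch; trim the leading whitespace from the match):
[^\]|]+$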

search for specific characters within column and then create different columns from it

I have a param_Value column that holds different kinds of values. I need to extract these values and create columns for all of them.
| PARAM_NAME | param_Value |
|------------|-------------|
| Step 4     | SP:0.09     |
| Procedure  | MAX:125     |
| Step 4     | SP:Ambient  |
| (null)     | +/-:N/A     |
| Steam      | SP:2        |
| Step 3     | MIN:0       |
| Step 4     | RDPHN427B   |
| Testing De | N/A         |
I only want columns for the following prefixes, with these names:
SP: SET_POINT_VALUE,
MAX: MAX_LIMIT,
MIN: MIN_LIMIT,
+/-: UPPER_LOWER_LIMIT
So what I have so far is:
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME,
REGEXP_LIKE("param_Value", 'SP:') SET_POINT_VALUE,
REGEXP_LIKE("param_Value", '+/-:') UPPER_LOWER_LIMIT,
REGEXP_LIKE("param_Value", 'MAX:') MAX_VALUE,
REGEXP_LIKE("param_Value", 'MIN:') MIN_VALUE
FROM PROCESS_STEPS
;
I'm more familiar with T-SQL and MySQL, but this ought to do what I think you're looking for. If it doesn't exactly, it should at least point you in the right direction.
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
    ("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME
     , CASE WHEN "param_Value" LIKE 'SP:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END SET_POINT_VALUE
     , CASE WHEN "param_Value" LIKE '+/-:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END UPPER_LOWER_LIMIT
     , CASE WHEN "param_Value" LIKE 'MAX:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END MAX_VALUE
     , CASE WHEN "param_Value" LIKE 'MIN:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END MIN_VALUE
FROM PROCESS_STEPS;
The basic concept here is identifying the information you want via LIKE, then using SUBSTR and INSTR to extract it. While LIKE is normally something to stay away from, since there's no leading % in your case it's sargable, and thus probably not a total efficiency sink.
Really, though, I have to ask why you're laying out your data like this - substring operations are slow in any language, and a DB is no exception. Why not use another column for your limit type? Why not store the data the way this view lays it out?
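If you'd rather keep it regex-based, Oracle's REGEXP_SUBSTR can return a capture group directly (a sketch; the sixth subexpression argument assumes Oracle 11g or later, and note the + must be escaped in the pattern):

SELECT PARAM_NAME
     , REGEXP_SUBSTR("param_Value", '^SP:(.*)$',   1, 1, NULL, 1) AS SET_POINT_VALUE
     , REGEXP_SUBSTR("param_Value", '^\+/-:(.*)$', 1, 1, NULL, 1) AS UPPER_LOWER_LIMIT
     , REGEXP_SUBSTR("param_Value", '^MAX:(.*)$',  1, 1, NULL, 1) AS MAX_VALUE
     , REGEXP_SUBSTR("param_Value", '^MIN:(.*)$',  1, 1, NULL, 1) AS MIN_VALUE
FROM PROCESS_STEPS;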