Telegraf: How to extract values from a field using the regex processor?

I would like to extract the values for connections, upstream, and downstream from this input using the Telegraf regex processor plugin:
2022/11/16 22:38:48 In the last 1h0m0s, there were 10 connections. Traffic Relayed ↑ 60 MB, ↓ 4 MB.
With the configuration below, the result key "upstream" ends up as a copy of the initial message, just with the matched part stripped out.
[[processors.regex]]
tagpass = ["snowflake-proxy"]
[[processors.regex.fields]]
## Field to change
key = "message"
## All the power of the Go regular expressions available here
## For example, named subgroups
pattern = 'Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),'
replacement = "${UPSTREAM}"
## If result_key is present, a new field will be created
## instead of changing existing field
result_key = "upstream"
Current output:
2022/11/17 10:38:48 In the last 1h0m0s, there were 1 connections. Traffic 3 MB ↓ 5 MB.
How do I get the decimals?
I'm quite confused about how to use the regex here, because according to several examples on the web it should work like this. See for example: http://wiki.webperfect.ch/index.php?title=Telegraf:_Processor_Plugins

The replacement config option specifies what you want to substitute in place of any matches.
I think you want something closer to this:
[[processors.regex.fields]]
key = "message"
pattern = '.*Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),.*$'
replacement = "${1}"
result_key = "upstream"
to get:
upstream="60 MB"

Related

Finding string around pattern of Strings

Regular expression (maybe) to find the word/string surrounded by other words.
===========================================================================
For example, I have the below sentences:
1.I’m setting up a new server, The key is ABC and want to support UTF-8 fully in my web application. Where do I need to set the encoding/charsets?”
2.XYZ is the key for the new server I am setting and it is located at address 111 abc
3.key as of the date is WWW for the new server I am setting at 111, ABC London
4.The key for server is LMN and it is being setup at location 111, abc London.
The key will be finite and will only have around 10 values. The value for the key itself can take any form, though. I have used ABC, XYZ, WWW, LMN as examples above.
I should be able to identify that the key exists in the sentence and extract its value (ABC, XYZ, WWW, LMN) from all the above examples.
I have basically tried finding matches using if/then/else, which is very cumbersome, and I don't have very good code to show yet, but I will update when I can.
Another option could be to use spaCy with dependency parsing.
Any help will be greatly appreciated
This expression is likely to return the desired output, not sure though:
^(?=.*\b(ABC|XYZ|WWW|LMN)\b).*$
Test
import re
regex = r"^(?=.*\b(ABC|XYZ|WWW|LMN)\b).*$"
test_str = """
1.I’m setting up a new server, The key is ABC and want to support UTF-8 fully in my web application. Where do I need to set the encoding/charsets?”
2.XYZ is the key for the new server I am setting and it is located at address 111 abc
3.Key as of the date is WWW for the new server I am setting at 111, ABC London
4.The key for server is LMN and it is being setup at location 111, abc London.
"""
print(re.findall(regex, test_str,re.M))
Output
['ABC', 'XYZ', 'ABC', 'LMN']
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and there you can also watch how it would match against some sample inputs, if you like.
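Not from the original answer, but worth noting: the greedy .* in the lookahead captures the last key on a line, which is why sentence 3 above yields 'ABC' (from "ABC London") rather than 'WWW'. If the goal is the first key per line, a lazy quantifier is a small tweak that should handle it:

import re

# ".*?" is lazy, so the lookahead stops at the earliest key on each line.
regex = r"^(?=.*?\b(ABC|XYZ|WWW|LMN)\b).*$"
line = "3.Key as of the date is WWW for the new server I am setting at 111, ABC London"
print(re.findall(regex, line, re.M))  # ['WWW']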

fetching name and age from a text file

I have a .txt file from which I have to fetch name and age.
The .txt file has data in the format like:
Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).
Kenner is a patient Age: 36 yrs Height: 5 feet 1 inch; weight is 56 kgs.
This medical record is 10 years old.
Output 1: John, Sam, Kenner
Output 2: 47, 29, 36
I am using regular expressions to extract the data. For example, for age, I am using the below regular expressions:
re.compile(r'age:\s*\d{1,3}',re.I)
re.compile(r'(age:|is|age|a|) \s*\d{1,3}(\s|y)',re.I)
re.compile(r'.* Age\s*:*\s*[0-9]+.*',re.I)
re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*',re.I)
I will apply another regular expression to the output of these regular expressions to extract the numbers. The problem is that with these regular expressions I am also getting data which I do not want. For example:
This medical record is 10 years old.
I am getting '10' from the above sentence which I do not want.
I only want to extract the names of people and their ages. What should the approach be? I would appreciate any kind of help.
Please take a look at the Cloud Data Loss Prevention API. Here is a GitHub repo with examples. This is what you'll likely want.
def inspect_string(project, content_string, info_types,
                   min_likelihood=None, max_findings=None, include_quote=True):
    """Uses the Data Loss Prevention API to analyze strings for protected data.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        content_string: The string to inspect.
        info_types: A list of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
        min_likelihood: A string representing the minimum likelihood threshold
            that constitutes a match. One of: 'LIKELIHOOD_UNSPECIFIED',
            'VERY_UNLIKELY', 'UNLIKELY', 'POSSIBLE', 'LIKELY', 'VERY_LIKELY'.
        max_findings: The maximum number of findings to report; 0 = no maximum.
        include_quote: Boolean for whether to display a quote of the detected
            information in the results.
    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library.
    import google.cloud.dlp

    # Instantiate a client.
    dlp = google.cloud.dlp.DlpServiceClient()

    # Prepare info_types by converting the list of strings into a list of
    # dictionaries (protos are also accepted).
    info_types = [{'name': info_type} for info_type in info_types]

    # Construct the configuration dictionary. Keys which are None may
    # optionally be omitted entirely.
    inspect_config = {
        'info_types': info_types,
        'min_likelihood': min_likelihood,
        'include_quote': include_quote,
        'limits': {'max_findings_per_request': max_findings},
    }

    # Construct the `item`.
    item = {'value': content_string}

    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)

    # Call the API.
    response = dlp.inspect_content(parent, inspect_config, item)

    # Print out the results.
    if response.result.findings:
        for finding in response.result.findings:
            try:
                if finding.quote:
                    print('Quote: {}'.format(finding.quote))
            except AttributeError:
                pass
            print('Info type: {}'.format(finding.info_type.name))
            print('Likelihood: {}'.format(finding.likelihood))
    else:
        print('No findings.')
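A hypothetical invocation (not from the linked repo), with a placeholder project id; PERSON_NAME and AGE are standard DLP info type names, and max_findings=0 means no cap on findings per the docstring above.

inspect_string(
    'my-gcp-project',  # placeholder project id
    'Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).',
    info_types=['PERSON_NAME', 'AGE'],
    max_findings=0,
)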

Spacy to Conll format without using Spacy's sentence splitter

This post shows how to get the dependencies of a block of text in CoNLL format with spaCy's taggers. This is the solution posted:
import spacy

nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_,  # Relation
        ))
It outputs this block:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
I would like to get the same output WITHOUT using doc.sents.
Indeed, I have my own sentence-splitter. I would like to use it, and then give Spacy one sentence at a time to get POS, NER, and dependencies.
How can I get POS, NER, and dependencies of one sentence in CoNLL format with spaCy without having to use spaCy's sentence splitter?
A Document in spaCy is iterable, and the documentation states that it iterates over Tokens:
| __iter__(...)
| Iterate over `Token` objects, from which the annotations can be
| easily accessed. This is the main way of accessing `Token` objects,
| which are the main way annotations are accessed from Python. If faster-
| than-Python speeds are required, you can instead access the annotations
| as a numpy array, or access the underlying C data directly from Cython.
|
| EXAMPLE:
| >>> for token in doc
Therefore I believe you would just have to make a Document for each of your pre-split sentences, then do something like the following:
def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # The original snippet referenced sent[0].i here, which is undefined
            # in this function; for a single-sentence Doc the first token's
            # index is doc[0].i.
            head_idx = word.head.i - doc[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_,  # Relation
        ))
Of course, following the CoNLL format you would have to print a newline after each sentence.
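For completeness, here is a minimal driver (not from the original answer) that feeds pre-split sentences to printConll one at a time; the naive split on '. ' is just a stand-in for whatever sentence splitter you already have.

import spacy

nlp = spacy.load('en')  # same model name the question's snippet uses

raw_text = u'Bob bought the pizza to Alice. Alice thanked Bob.'
for sentence in raw_text.split('. '):  # stand-in for your own splitter
    printConll(sentence)
    print()  # blank line between sentences, per the CoNLL convention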
This post is about a user facing unexpected sentence breaks when using spaCy's sentence boundary detection. One of the solutions proposed by the spaCy developers (in that post) is the flexibility to add one's own sentence boundary detection rules. The problem is that sentence boundary detection is solved in conjunction with dependency parsing in spaCy, not before it. Therefore, I don't think what you're looking for is supported by spaCy at the moment, though it might be in the near future.
ashu's answer is partly right: dependency parsing and sentence boundary detection are tightly coupled by design in spaCy. There is, though, a simple sentencizer:
https://spacy.io/api/sentencizer
It seems the sentencizer just uses punctuation (not the most robust way). But since such a sentencizer exists, you can create a custom one using your own rules, and it will affect sentence boundaries for sure.
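As a rough sketch (not from the original answers) of what such a custom component could look like in spaCy 2.x, where a plain function can be added to the pipeline before the parser; spaCy 3 registers components by name via @Language.component instead, so adjust accordingly.

import spacy

def custom_boundaries(doc):
    # Example rule: always start a new sentence after a semicolon.
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(custom_boundaries, before='parser')  # spaCy 2.x API
doc = nlp(u'Bob bought the pizza; Alice thanked him.')
print([sent.text for sent in doc.sents])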

R- Subset a corpus by meta data (id) matching partial strings

I'm using the R (3.2.3) tm package (0.6-2) and would like to subset my corpus according to partial string matches contained within the metadatum "id".
For example, I would like to filter all documents that contain the string "US" within the "id" column. The string "US" would be preceded and followed by various characters and numbers.
I have found a similar example here. It is recommended to download the quanteda package but I think this should also be possible with the tm package.
Another, more relevant answer to a similar problem is found here. I have tried to adapt that sample code to my context; however, I can't manage to incorporate the partial string matching.
I imagine there might be multiple things wrong with my code so far.
What I have so far looks like this:
US <- tm_filter(corpus, FUN = function(corpus, filter) any(meta(corpus)["id"] == filter), grep(".*US.*", corpus))
And I receive the following error message:
Error in structure(as.character(x), names = names(x)) :
'names' attribute [3811] must be the same length as the vector [3]
I'm also not sure how to come up with a reproducible example simulating my problem for this post.
It could work like this:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
# 502 704 708
# "502" "704" "708"
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 3
You are looking for "US" instead of "0". Have a look at ?grep for details (e.g. fixed=TRUE).

Perl RegEx for Matching 11 column File

I'm trying to write a Perl regex to match the 5th column of files that contain 11 columns. There's also a preamble and a footer which are not data. Any good thoughts on how to do this? Here's what I have so far:
if($line =~ m/\A.*\s(\b\w{9}\b)\s+(\b[\d,.]+\b)\s+(\b[\d,.sh]+\b)\s+.*/i) {
And this is what the forms look like:
No. Form 13F File Number Name
____ 28-________________ None
[Repeat as necessary.]
FORM 13F INFORMATION TABLE
TITLE OF VALUE SHRS OR SH /PUT/ INVESTMENT OTHER VOTING AUTHORITY
NAME OF INSURER CLASS CUSSIP (X$1000) PRN AMT PRNCALL DISCRETION MANAGERS SOLE SHARED NONE
Abbott Laboratories com 2824100 4,570 97,705 SH sole 97,705 0 0
Allstate Corp com 20002101 12,882 448,398 SH sole 448,398 0 0
American Express Co com 25816109 11,669 293,909 SH sole 293,909 0 0
Apollo Group Inc com 37604105 8,286 195,106 SH sole 195,106 0 0
Bank of America com 60505104 174 12,100 SH sole 12,100 0 0
Baxter Internat'l Inc com 71813109 2,122 52,210 SH sole 52,210 0 0
Becton Dickinson & Co com 75887109 8,216 121,506 SH sole 121,506 0 0
Citigroup Inc com 172967101 13,514 3,594,141 SH sole 3,594,141 0 0
Coca-Cola Co. com 191216100 318 6,345 SH sole 6,345 0 0
Colgate Palmolive Co com 194162103 523 6,644 SH sole 6,644 0 0
If you ever do write a regex this long, you should at least use the /x flag, which ignores literal whitespace in the pattern and, importantly, lets you add whitespace and comments for readability:
m/
whatever
something else # actually trying to do this
blah # for fringe case X
/xi
If you find it hard to read your own regex, others will find it impossible.
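The same idea exists outside Perl, for what it's worth. Here is the asker's pattern laid out in Python's analogous re.VERBOSE mode, purely to illustrate the readability point; the per-column comments are guesses about the asker's intent, not facts about the 13F format.

import re

pattern = re.compile(r"""
    \A .* \s
    (\b\w{9}\b)      \s+   # presumably the 9-character CUSIP column
    (\b[\d,.]+\b)    \s+   # a numeric column (digits, commas, dots)
    (\b[\d,.sh]+\b)  \s+   # another numeric-ish column, possibly with SH
    .*
""", re.IGNORECASE | re.VERBOSE)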
I think a regular expression is overkill for this.
What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the separator character (sep_char).
Like Ether said, another tool would be appropriate for this job.
my @fields = split /\t/, $line;
if (@fields == 11) {    # fewer than 11 fields is probably header/footer
    $the_5th_column = $fields[4];
    ...
}
My first thought is that the sample data is horribly mangled in your example. It'd be great to see it embedded inside some <pre>...</pre> tags so columns will be preserved.
If you are dealing with columnar data, you can go after it more easily with substr() or unpack() than with a regex. You can use a regex to parse out the data, but most of us who've been programming Perl for a while have also learned that a regex is often not the first tool to reach for. That's why you got the other comments. A regex is a powerful weapon, but it's also an easy way to shoot yourself in the foot.
http://perldoc.perl.org/functions/substr.html
http://perldoc.perl.org/functions/unpack.html
Update:
After a bit of nosing around on the SEC EDGAR site, I've found that the 13F files are nicely formatted, and you should have no problem figuring out how to process them using substr and/or unpack.
FORM 13F INFORMATION TABLE
VALUE SHARES/ SH/ PUT/ INVSTMT OTHER VOTING AUTHORITY
NAME OF ISSUER TITLE OF CLASS CUSIP (x$1000) PRN AMT PRN CALL DSCRETN MANAGERS SOLE SHARED NONE
- ------------------------------ ---------------- --------- -------- -------- --- ---- ------- ------------ -------- -------- --------
3M CO COM 88579Y101 478 6051 SH SOLE 6051 0 0
ABBOTT LABS COM 002824100 402 8596 SH SOLE 8596 0 0
AFLAC INC COM 001055102 291 6815 SH SOLE 6815 0 0
ALCATEL-LUCENT SPONSORED ADR 013904305 172 67524 SH SOLE 67524 0 0
If you are seeing the 13F files unformatted, as in your example, then you are not viewing them correctly, because there are tabs between the columns in some of the files.
I looked through 68 files to get an idea of what's out there, then wrote a quick unpack-based routine and got this:
3M CO, COM, 88579Y101, 478, 6051, SH, , SOLE, , 6051, 0, 0
ABBOTT LABS, COM, 002824100, 402, 8596, SH, , SOLE, , 8596, 0, 0
AFLAC INC, COM, 001055102, 291, 6815, SH, , SOLE, , 6815, 0, 0
ALCATEL-LUCENT, SPONSORED ADR, 013904305, 172, 67524, SH, , SOLE, , 67524, 0, 0
Based on some of the other files, here are some thoughts on how to process them:
Some of the files use tabs to separate the columns. Those are trivial to parse and you do not need regex to split the columns. 0001031972-10-000004.txt appears to be that way and looks very similar to your example.
Some of the files use tabs to align the columns, not separate them. You'll need to figure out how to compress multiple tab runs into a single tab, then probably split on tabs to get your columns (there's a small sketch of this after these notes).
Others use a blank line to separate the rows vertically so you'll need to skip blank lines.
Others wrap columns onto the next line (like a spreadsheet would in a column that is not wide enough). It's not too hard to figure out how to deal with that, but how to do it is left as an exercise for you.
Some use centered column alignment, resulting in leading and trailing whitespace in your data. s/^\s+//; and s/\s+$//; will become your friends.
The most interesting one I saw appeared to have been created correctly, then word-wrapped at column 78, leading me to think some moron loaded their spreadsheet or report into their word processor and then saved it. Reading that is a two-step process: get rid of the wrapping carriage returns, then re-process the data to parse out the columns. As an added wrinkle, they also have column headings repeated in the data at page breaks.
You should be able to get 100% of the files parsed; however, you'll probably want to do it with a couple of different parsing methods because of the use of tabs, blank lines, and embedded column headers.
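To make the tab handling above concrete, here is a quick sketch, written in Python purely to keep the illustration short (the same compress-then-split logic maps directly onto Perl's tr/// or a substitution plus split): collapse runs of tabs, skip blank lines, treat anything that doesn't yield 11 fields as preamble/footer, and take the 5th field.

import re

def fifth_column(lines):
    values = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank separator lines
            continue
        fields = re.sub(r'\t+', '\t', line).split('\t')  # collapse tab runs, then split
        if len(fields) == 11:  # anything else is preamble, footer, or a header row
            values.append(fields[4])
    return values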
Ah, the fun of processing data from the wilderness.