I would like to extract the values for connections, upstream and downstream using telegraf regex processor plugin from this input:
2022/11/16 22:38:48 In the last 1h0m0s, there were 10 connections. Traffic Relayed ↑ 60 MB, ↓ 4 MB.
Using this configuration the result key "upstream" is a copy of the initial message but without a part of the 'regexed' stuff.
[[processors.regex]]
tagpass = ["snowflake-proxy"]
[[processors.regex.fields]]
## Field to change
key = "message"
## All the power of the Go regular expressions available here
## For example, named subgroups
pattern = 'Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),'
replacement = "${UPSTREAM}"
## If result_key is present, a new field will be created
## instead of changing existing field
result_key = "upstream"
Current output:
2022/11/17 10:38:48 In the last 1h0m0s, there were 1 connections. Traffic 3 MB ↓ 5 MB.
How do I get the decimals?
I'm quite a bit confused how to use the regex here, because on several examples in the web it should work like this. See for example: http://wiki.webperfect.ch/index.php?title=Telegraf:_Processor_Plugins
The replacement config option specifies what you want to replace in for any matches.
I think you want something closer to this:
[[processors.regex.fields]]
key = "message"
pattern = '.*Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),.*$'
replacement = "${1}"
result_key = "upstream"
to get:
upstream="60 MB"
Regular expression(May be) to find the word/string surrounded by other words.
===========================================================================
For example I have below sentences
1.I’m setting up a new server, The key is ABC and want to support UTF-8 fully in my web application. Where do I need to set the encoding/charsets?”
2.XYZ is the key for the new server I am setting and it is located at address 111 abc
3.key as of the date is WWW for the new server I am setting at 111, ABC London
4.The key for server is LMN and it is being setup at location 111, abc London.
key will be finite and will only have around 10 values. The value for key itself can be any form though. I have used ACB, XYZ, WWW, LMN as example above
I should be able to identify that Key exists in the sentence and extract value(ACB, XYZ, WWW, LMN) from all the above examples.
I have basically tried finding using if then else which is very cumbersome and dont have very good code to show yet. But will update when I can
I have basically tried finding using if then else which is very cumbersome and dont have very good code to show yet. But will update when I can
I should be able to identify that Key exists in the sentence and extract value(ACB, XYZ, WWW, LMN) from all the above examples.
Another option could be to use Spacy with dependency parsing
Any help will be greatly appreciated
This expression is likely to return the desired output, not sure though:
^(?=.*\b(ABC|XYZ|WWW|LMN)\b).*$
DEMO
Test
import re
regex = r"^(?=.*\b(ABC|XYZ|WWW|LMN)\b).*$"
test_str = """
1.I’m setting up a new server, The key is ABC and want to support UTF-8 fully in my web application. Where do I need to set the encoding/charsets?”
2.XYZ is the key for the new server I am setting and it is located at address 111 abc
3.Key as of the date is WWW for the new server I am setting at 111, ABC London
4.The key for server is LMN and it is being setup at location 111, abc London.
"""
print(re.findall(regex, test_str,re.M))
Output
['ABC', 'XYZ', 'ABC', 'LMN']
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
I am trying to filter out a group of lines that match a pattern using a regexp but am having trouble getting the correct regexp to use.
The text file contains lines like this:
transaction 390134; promote; 2016/12/20 01:17:07 ; user: build
to: DEVELOPMENT ; from: DEVELOPMENT_BUILD
# some commit comment
/./som/file/path 11745/409 (22269/257)
# merged
version 22269/257 (22269/257)
ancestor: (22133/182)
transaction 390136; promote; 2016/12/20 01:17:08 ; user: najmi
to: DEVELOPMENT ; from: DEVELOPMENT_BUILD
/./some/other/file/path 11745/1 (22269/1)
version 22269/1 (22269/1)
ancestor: (none - initial version)
type: dir
I would like to filter out the lines that start with "transaction", contain "User: build all the way until the next line that starts with "transaction".
The idea is to end up with transaction lines where user is not "build".
Thanks for any help.
If you want only the transaction lines for all users except build:
grep '^transaction ' test_data| grep -v 'user: build$'
If you want the whole transaction record for such users:
awk '/^transaction /{ p = !/user: build$/};p' test_data
OR
perl -lne 'if(/^transaction /){$p = !/user: build$/}; print if $p' test_data
The -A and -v options of grep command would have done the trick if all transaction records had same number of lines.
I have the following text sample:
Sample Supplier 123
AP Invoices 123456 -229.00
AP Invoices 235435 337.00
AP Invoices 444323 228.00
AP Invoices 576432 248.00
It's from a text file with 21,000 lines, which lists invoices against a supplier.
The pattern is always the same on each block of invoices against each supplier, where:
The supplier name starts at the beginning of a line
The invoices being to be listed 2 rows down from the supplier name, indented by one space.
I wondered if I can use a Regular Expression (I'm using TextPad as a Text Editor on a Windows PC) to:
Append each invoice line with a tab (\t)
Append the supplier name in front of the tab so each invoice line now starts with the supplier name, and a tab, where the supplier name is taken from 2 rows above the start of each block of invoices
Delete the supplier name line from above the invoice block.
Expected output:
Sample Supplier 123 AP Invoices 123456 -229.00
Sample Supplier 123 AP Invoices 235435 337.00
Sample Supplier 123 AP Invoices 444323 228.00
Sample Supplier 123 AP Invoices 576432 248.00
I realise I am probably asking for "the moon on a stick" here, but the alternative is to go through a 21,000 line text file and copy and paste the data into Excel, which might not be a very good use of my time.
Maybe I can't do it using a simple regular expression, or maybe it's simply not possible at all.
Any advice would be much appreciated.
Thanks
I would use a simple Python script to solve this issue:
currentheader = ""
with open("yourfile.txt") as f:
with open("newfile.txt","w") as fw:
for line in f:
if len(line.strip()) == 0:
continue
elif line[0] != " ": #new header
currentheader = line[:-1]
else:
fw.write(currentheader + "\t" + line[1:])
For this to work, on Windows you will have to install Python. Python 2 or 3 should both work with this script. After installing Python, you open a command line (Win+R, cmd, Enter), navigate to the folder your file is in using cd foldername, if necessary, and then type python dealWithTextIssue.py (after having saved the script as "dealWithTextIssue.py" there.
I think this isn't solvable just with regex, you'll have to do some programming. I made a little script in PHP:
$string = <<<EOL
Sample Supplier 123
AP Invoices 123456 -229.00
AP Invoices 235435 337.00
AP Invoices 444323 228.00
AP Invoices 576432 248.00
Second Supplier
A B C D
B F
EOL;
$array = preg_split("~[\n\r]+~", $string);
foreach ($array as $value) {
if (strpos($value, " ") == 0) {
if (strlen(trim($value)) > 0) {
echo "\t".$header.rtrim($value).PHP_EOL;
}
}
else {
$header = $value;
}
}
You can see it at work for example here after clicking on execute code.
I'm trying to write a perl regex to match the 5th column of files that contain 11 columns. There's also a preamble and footer which are not data. Any good thoughts on how to do this? Here's what I have so far:
if($line =~ m/\A.*\s(\b\w{9}\b)\s+(\b[\d,.]+\b)\s+(\b[\d,.sh]+\b)\s+.*/i) {
And this is what the forms look like:
No. Form 13F File Number Name
____ 28-________________ None
[Repeat as necessary.]
FORM 13F INFORMATION TABLE
TITLE OF VALUE SHRS OR SH /PUT/ INVESTMENT OTHER VOTING AUTHORITY
NAME OF INSURER CLASS CUSSIP (X$1000) PRN AMT PRNCALL DISCRETION MANAGERS SOLE SHARED NONE
Abbott Laboratories com 2824100 4,570 97,705 SH sole 97,705 0 0
Allstate Corp com 20002101 12,882 448,398 SH sole 448,398 0 0
American Express Co com 25816109 11,669 293,909 SH sole 293,909 0 0
Apollo Group Inc com 37604105 8,286 195,106 SH sole 195,106 0 0
Bank of America com 60505104 174 12,100 SH sole 12,100 0 0
Baxter Internat'l Inc com 71813109 2,122 52,210 SH sole 52,210 0 0
Becton Dickinson & Co com 75887109 8,216 121,506 SH sole 121,506 0 0
Citigroup Inc com 172967101 13,514 3,594,141 SH sole 3,594,141 0 0
Coca-Cola Co. com 191216100 318 6,345 SH sole 6,345 0 0
Colgate Palmolive Co com 194162103 523 6,644 SH sole 6,644 0 0
If you ever do write a regex this long, you should at least use the x flag to ignore whitespace, and importantly allow whitespace and comments:
m/
whatever
something else # actually trying to do this
blah # for fringe case X
/xi
If you find it hard to read your own regex, others will find it Impossible.
I think a regular expression is overkill for this.
What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the record separator (sep_char).
Like Ether said, another tool would be appropriate for this job.
#fields = split /\t/, $line;
if (#fields == 11) { # less than 11 fields is probably header/footer
$the_5th_column = $fields[4];
...
}
My first thought is that the sample data is horribly mangled in your example. It'd be great to see it embedded inside some <pre>...</pre> tags so columns will be preserved.
If you are dealing with columnar data, you can go after it using substr() or unpack() easier than you can using regex. You can use regex to parse out the data, but most of us who've been programming Perl a while also learned that regex is not the first tool to grab a lot of times. That's why you got the other comments. Regex is a powerful weapon, but it's also easy to shoot yourself in the foot.
http://perldoc.perl.org/functions/substr.html
http://perldoc.perl.org/functions/unpack.html
Update:
After a bit of nosing around on the SEC edgar site, I've found that the 13F files are nicely formatted. And, you should have no problem figuring out how to process them using substr and/or unpack.
FORM 13F INFORMATION TABLE
VALUE SHARES/ SH/ PUT/ INVSTMT OTHER VOTING AUTHORITY
NAME OF ISSUER TITLE OF CLASS CUSIP (x$1000) PRN AMT PRN CALL DSCRETN MANAGERS SOLE SHARED NONE
- ------------------------------ ---------------- --------- -------- -------- --- ---- ------- ------------ -------- -------- --------
3M CO COM 88579Y101 478 6051 SH SOLE 6051 0 0
ABBOTT LABS COM 002824100 402 8596 SH SOLE 8596 0 0
AFLAC INC COM 001055102 291 6815 SH SOLE 6815 0 0
ALCATEL-LUCENT SPONSORED ADR 013904305 172 67524 SH SOLE 67524 0 0
If you are seeing the 13F files unformatted, as in your example, then you are not viewing correctly because there are tabs between columns in some of the files.
I looked through 68 files to get an idea of what's out there, then wrote a quick unpack-based routine and got this:
3M CO, COM, 88579Y101, 478, 6051, SH, , SOLE, , 6051, 0, 0
ABBOTT LABS, COM, 002824100, 402, 8596, SH, , SOLE, , 8596, 0, 0
AFLAC INC, COM, 001055102, 291, 6815, SH, , SOLE, , 6815, 0, 0
ALCATEL-LUCENT, SPONSORED ADR, 013904305, 172, 67524, SH, , SOLE, , 67524, 0, 0
Based on some of the other files here's some thoughts on how to process them:
Some of the files use tabs to separate the columns. Those are trivial to parse and you do not need regex to split the columns. 0001031972-10-000004.txt appears to be that way and looks very similar to your example.
Some of the files use tabs to align the columns, not separate them. You'll need to figure out how to compress multiple tab runs into a single tab, then probably split on tabs to get your columns.
Others use a blank line to separate the rows vertically so you'll need to skip blank lines.
Others allow wrap columns to the next line (like a spreadsheet would in a column that is not wide enough. It's not too hard to figure out how to deal with that, but how to do it is being left as an exercise for you.
Some use centered column alignment, resulting in leading and trailing whitespace in your data. s/^\s+//; and s/\s+$//; will become your friends.
The most interesting one I saw appeared to have been created correctly, then word-wrapped at column 78, leading me to think some moron loaded their spreadsheet or report into their word processor then saved it. Reading that is a two step process of getting rid of the wrapping carriage-returns, then re-processing the data to parse out the columns. As an added task they also have column headings in the data for page breaks.
You should be able to get 100% of the files parsed, however you'll probably want to do it with a couple different parsing methods because of the use of tabs and blank lines and embedded column headers.
Ah, the fun of processing data from the wilderness.