reading output of command or read from a file - regex

I'm new to bash and want to improve. I need to learn how to read specific text from a file or from the output of a command. For example, I want to sum the total Ethernet interrupt numbers for each core of the computer from the /proc/interrupts file. The content of the file is:
CPU0 CPU1 CPU2 CPU3
0: 142 0 0 0 IO-APIC-edge timer
1: 1 0 1 0 IO-APIC-edge i8042
4: 694 18 635 19 IO-APIC-edge serial
7: 0 0 0 0 IO-APIC-edge parport0
9: 0 0 0 0 IO-APIC-fasteoi acpi
12: 1 1 0 2 IO-APIC-edge i8042
14: 0 0 0 0 IO-APIC-edge ide0
19: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3
23: 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2
46: 347470 119806 340499 108227 PCI-MSI-edge ahci
47: 33568 45958 46028 49191 PCI-MSI-edge eth0-rx-0
48: 0 0 0 0 PCI-MSI-edge eth0-tx-0
49: 1 0 1 0 PCI-MSI-edge eth0
50: 28217 42237 65203 39086 PCI-MSI-edge eth1-rx-0
51: 0 0 0 0 PCI-MSI-edge eth1-tx-0
52: 0 1 0 1 PCI-MSI-edge eth1
59: 114991 338765 77952 134850 PCI-MSI-edge eth4-rx-0
60: 429029 315813 710091 26714 PCI-MSI-edge eth4-tx-0
61: 5 2 1 5 PCI-MSI-edge eth4
62: 1647083 208840 1164288 933967 PCI-MSI-edge eth5-rx-0
63: 673787 1542662 195326 1329903 PCI-MSI-edge eth5-tx-0
64: 5 6 7 4 PCI-MSI-edge eth5
I need to read all of the interrupt numbers on lines containing the "eth" keyword and then find their sum for each CPU core (whatever the CPU core name is). For example, for CPU0: 33568+0+1+28217...
What is suitable for this? Must I use awk or sed with a regex, and how?

You could use awk for this; there is no need for grep or any other tool, since awk can do the search itself.
UPDATE:
Based on the possibility of varying number of CPU columns (see first comment below), this will work:
NR==1 {
core_count = NF
print "core count: ", core_count
next
}
/eth/ {
for (i = 2; i < 2+core_count; i++)
totals[i-2] += $i
}
END {
print "Totals"
for (i = 0; i < core_count; i++)
printf("CPU%d: %d\n", i, totals[i])
}
gives output:
core count: 4
Totals
CPU0: 2926686
CPU1: 2494284
CPU2: 2258897
CPU3: 2513721
Notes:
If the first line contains only CPU headers, then using NF as shown at the start of the script will work. If other data might be present, then core_count = gsub(/CPU/, "CPU") could be used instead. Also, this script depends on the CPU columns being consecutive.
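If you want to sanity-check the awk logic (or prefer doing the summation outside awk), the same per-core totals can be sketched in Python. This mirrors the sample format shown in the question (header line of CPU columns, then one line per IRQ with the counters in fields 2..core_count+1); it is a sketch, not a full /proc/interrupts parser:

```python
def sum_eth_interrupts(lines):
    """Sum per-CPU interrupt counts for lines mentioning 'eth'.

    Mirrors the awk script above: the first line gives the core count,
    and on matching lines fields 1..core_count (0-indexed) are counters.
    """
    core_count = len(lines[0].split())  # header holds only CPU columns
    totals = [0] * core_count
    for line in lines[1:]:
        if "eth" not in line:
            continue
        fields = line.split()
        for i in range(core_count):
            totals[i] += int(fields[i + 1])  # field 0 is the IRQ number
    return totals

# e.g. sum_eth_interrupts(open("/proc/interrupts").read().splitlines())
```

On the sample data from the question this produces the same totals as the awk version (2926686 for CPU0, and so on).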

You can filter out the eth lines using grep, and then sum using awk.
e.g. for the first two CPU columns (CPU0 and CPU1):
grep eth report | awk '{ CPU0 += $2; CPU1 += $3} END { print CPU0; print CPU1} '
Note that you can filter within awk, rather than use grep for this.
I would perhaps be tempted to do this in Perl however, and create a hash of sum per CPU. It would depend on how extensible I want to make this.

Related

Quick one: what is the appropriate regex to match any number greater than 750 in the test content below?

What is the appropriate regex to match a number greater than 750 before "active" in the last line of the text below?
=============================================================================
c0000000a7470b30 Y--P--- 4207362 weblogic - c000000098078f98 0 1 1832876 0
c0000000a74853f0 Y--P--- 4376431 krizzsa LL2CE414 c0000000a19479b8 0 8 70173 170
c0000000b3a1f2c8 Y--P--- 3996541 weblogic - c0000000acd54f90 0 1 64112 0
c0000000b3a22418 Y--P--- 4371951 tinpatel tK c000000098385b70 0 1 62 0
c0000000b3a286b8 B--PR-- 4385816 ayaw SL5CG752 c00000001b0bdde0 0 5 14452 701
c0000000b3a2afd0 Y--P--- 4383560 sognenov t3 c000000099afe900 0 1 100 0
c0000000b3a2b808 Y--P--- 4368082 wenpli 66 c00000009e6f8260 0 1 461 0
c0000000b3a2c878 Y--P--- 4228342 sarbrar tc c0000000a62da668 0 1 0 0
c0000000b3a2f9c8 Y--P--- 4384060 weblogic - c0000000a2deb910 0 1 0 0
c0000000b3a35430 Y--P--- 4383243 nakahmed t1 c00000009e0b9ce8 0 1 17 0
c0000000b3a38580 Y--P--- 3937012 mvvinay 76 c0000000a162a888 0 1 0 0
c0000000b3a3d7b0 Y--P--- 616042 neminhas LD2UA442 c0000000aea25ca8 0 2 0 0
c0000000b3a43218 Y--P--- 4383236 nakahmed t1 c0000000981b4570 0 1 37 22
c0000000b3a473d8 Y--P--- 4382647 viwang 2UA40922 c0000000a268a2d0 0 7 75275 4856
412 active, 2176 total, 441 maximum concurrent
================================================================
In the last line, if the number of active connections to a DB goes above 750 we need to match it and do further processing. So can someone please help with the regex?
Please note that there is a space at the beginning of the last line.
If it's Perl, there's no need to put everything into the match itself:
if (/ ([0-9]+) active/ && $1 > 750) {
    print "Matching!\n";
}
If you need a single regex, it's
/^\ (?: 7 (?: 5 [1-9]
| [6-9][0-9])
| [89] [0-9]{2}
| [1-9][0-9]{3,})\ active/x
or shortly
/^ (?:7(?:5[1-9]|[6-9][0-9])|[89][0-9]{2}|[1-9][0-9]{3,}) active/
It may not be the best idea, yet if we might have to, probably our expression would be:
^\h*(7(?:5[1-9]|[6-9]\d)|[89]\d\d|[1-9]\d{3,})\h+active\b
((7(?:5[1-9]|[6-9]\d)|[89]\d\d|[1-9]\d{3,})\sactive)
((7(?:5[1-9]|[6-9]\d)|[89]\d\d|[1-9]\d{3,})\s+active)
((7(?:5[1-9]|[6-9]\d)|[89]\d\d|[1-9]\d{3,}) active)
based on bobble-bubble's advice,
or something similar to:
(7(?:5[1-9]|[6-9][0-9])|[8-9][0-9][0-9]|[1-9][0-9]{3,}) active
((7(?:5[1-9]|[6-9][0-9])|[8-9][0-9][0-9]|[1-9][0-9]{3,}) active)
if we might want to capture things, separately.
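If the surrounding tool allows it, matching the number and comparing it numerically is usually less fragile than encoding "greater than 750" into the regex itself, as the Perl answer above suggests. A minimal Python sketch of that approach, using the line format from the sample text:

```python
import re

def active_over_limit(line, limit=750):
    """Return True if the line reports more than `limit` active connections."""
    m = re.search(r"(\d+) active", line)
    return m is not None and int(m.group(1)) > limit

# e.g. active_over_limit(" 412 active, 2176 total, 441 maximum concurrent")
# is False, since 412 <= 750
```

The numeric comparison also makes the cutoff trivially adjustable, which the pure-regex versions are not.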

Trying to save a PDF string results in UnicodeDecodeError with WeasyPrint

So far this is my code:
from django.template import (Context, Template) # v1.11
from weasyprint import HTML # v0.42
import codecs
template = Template(codecs.open("/path/to/my/template.html", mode="r", encoding="utf-8").read())
context = Context({})
html = HTML(string=template.render(context))
pdf_file = html.write_pdf()
#with open("/path/to/my/file.pdf", "wb") as f:
# f.write(self.pdf_file)
Error stack:
[17/Jan/2019 08:14:13] INFO [handle_correspondence:54] 'utf8' codec can't
decode byte 0xe2 in position 10: invalid continuation byte. You passed in
'%PDF-1.3\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<</Author <> /Creator (cairo 1.14.6
(http://cairographics.org))\n /Keywords <> /Producer (WeasyPrint 0.42.3
\\(http://weasyprint.org/\\))>>\nendobj\n2 0 obj\n<</Pages 3 0 R /Type
/Catalog>>\nendobj\n3 0 obj\n<</Count 1 /Kids [4 0 R] /Type
/Pages>>\nendobj\n4 0 obj\n<</BleedBox [0 0 595 841] /Contents 5 0 R
/Group\n <</CS /DeviceRGB /I true /S /Transparency /Type /Group>>
MediaBox\n [0 0 595 841] /Parent 3 0 R /Resources 6 0 R /TrimBox [0 0 595
841]\n /Type /Page>>\nendobj\n5 0 obj\n<</Filter /FlateDecode /Length 15
0 R>>\nstream\nx\x9c+\xe4*T\xd0\x0fH,)I-\xcaSH.V\xd0/0U(N\xceS\xd0O4PH/\xe62P0P0\xb54U\xb001T(JUH\xe3\n\x04B\x00\x8bi\r\x89\nendstream\nendobj\n6 0
obj\n<</ExtGState <</a0 <</CA 1 /ca 1>>>> /Pattern <</p5 7 0
R>>>>\nendobj\n7 0 obj\n<</BBox [0 1123 794 2246] /Length 8 0 R /Matrix
[0.75 0 0 0.75 0 -843.5]\n /PaintType 1 /PatternType 1 /Resources
<</XObject <</x7 9 0 R>>>>\n /TilingType 1 /XStep 1588 /YStep
2246>>\nstream\n /x7 Do\n \n\nendstream\nendobj\n8 0 obj\n10\nendobj\n9 0
obj\n<</BBox [0 1123 794 2246] /Filter /FlateDecode /Length 10 0 R
/Resources\n 11 0 R /Subtype /Form /Type /XObject>>\nstream\nx\x9c+\xe4\nT(\xe42P0221S0\xb74\xd63\xb3\xb4T\xd05442\xd235R(JU\x08W\xc8\xe3*\xe42T0\x00B\x10\t\x942VH\xce\xe5\xd2O4PH/V\xd0\xaf04Tp\xc9\xe7\n\x04B\x00`\xf0\x10\x11\nendstream\nendobj\n10 0 obj\n77\nendobj\n11 0 obj\n<</ExtGState
<</a0 <</CA 1 /ca 1>>>> /XObject <</x11 12 0 R>>>>\nendobj\n12 0
obj\n<</BBox [0 1123 0 1123] /Filter /FlateDecode /Length 13 0 R
/Resources\n 14 0 R /Subtype /Form /Type /XObject>>\nstream\nx\x9c+\xe4\n
xe4\x02\x00\x02\x92\x00\xd7\nendstream\nendobj\n13 0 obj\n12\nendobj\n14 0
obj\n<<>>\nendobj\n15 0 obj\n58\nendobj\nxref\n0 16\n0000000000 65535
f\r\n0000000015 00000 n\r\n0000000168 00000 n\r\n0000000215 00000
n\r\n0000000270 00000 n\r\n0000000489 00000 n\r\n0000000620 00000
n\r\n0000000697 00000 n\r\n0000000923 00000 n\r\n0000000941 00000
n\r\n0000001165 00000 n\r\n0000001184 00000 n\r\n0000001264 00000
n\r\n0000001422 00000 n\r\n0000001441 00000 n\r\n0000001462 00000
n\r\ntrailer\n\n<</Info 1 0 R /Root 2 0 R /Size 16>>\nstartxref\n1481
n%%EOF\n' (<type 'str'>)
Actually it works via web request (returning the PDF as the response) and via the shell (typing the code manually). The code is tested and has never given me problems. The files are saved with the correct encoding, and setting the encoding kwarg in HTML doesn't help; also, the mode value used to open the template is correct, because I've seen other questions where that was the problem.
However, I was adding a management command to run it periodically (for bigger PDFs I cannot do it via web request, because the server's timeout could trigger before it finishes), and when I try to call it, I only get a UnicodeDecodeError saying 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte.
The PDF (at least from what I see) initially renders with these characters:
%PDF-1.3\n%\xe2\xe3\xcf\xd3\n1 0
which translates into this:
%PDF-1.3
%âãÏÓ
1 0 obj
So the problem is all about the character â. But it's a trap!
Instead, the problem is this line of code:
pdf_file = html.write_pdf()
Changing it to:
html.write_pdf()
Just works as expected!
So my question is: what reason could there be for Python to throw a UnicodeDecodeError when assigning a string to a variable? I've dug into weasyprint's code in my virtualenv, but I didn't see any conversions there.
So I don't know why, but now it suddenly works. I literally didn't modify anything: I just ran the command again and it works.
I'm not marking the question as answered, so that someone who runs into the same problem in the future can post a correct answer.
So disturbing.
EDIT
So it looks like I'm a very intelligent person who was trying to set the value of self.pdf_file, which is a models.FileField, to the content of the created PDF instead of to the file itself.
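For anyone who hits the same traceback: write_pdf() returns raw PDF bytes, and those bytes are not valid UTF-8, so any code path that implicitly decodes them as text will raise exactly this error. That is consistent with the EDIT above about assigning the bytes to a FileField. A minimal reproduction using the header bytes visible in the error stack:

```python
# The first bytes of a cairo/WeasyPrint PDF, as seen in the traceback:
# '%PDF-1.3\n%\xe2\xe3\xcf\xd3\n...'
pdf_header = b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"

try:
    pdf_header.decode("utf-8")
except UnicodeDecodeError as e:
    # 0xe2 (at position 10) starts a 3-byte UTF-8 sequence, but 0xe3 is
    # not a valid continuation byte -- the same message as in the stack.
    print(e)
```

The fix, as the EDIT implies, is to treat the result of write_pdf() as binary content (write it in "wb" mode or wrap it for the FileField), never as a text string.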

Sklearn CountVectorizer vocabulary is not complete

Consider the following example:
tf_vectorizer = CountVectorizer(max_df=1, min_df=0,
                                max_features=None,
                                stop_words=None)
all_docs = ['ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 0 PortA Unknown 755 0 45300 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
'ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 2 PortC Unknown 774 0 46440 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
'ETH:0x0000 00:17:A4:77:9C:0A 09:00:2B:00:00:05 0 PortA Unknown 752 0 45120 ETH FirstHourDay_21 LastHourDay_23 duration_6913 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_127 ThreatCategory_23 True Anomaly_True']
tf_v = tf_vectorizer.fit(all_docs)
The obtained vocabulary is:
{'0a': 0,
'185': 1,
'239': 2,
'45120': 3,
'45300': 4,
'46440': 5,
'752': 6,
'755': 7,
'774': 8,
'93': 9,
'duration_6913': 10,
'threatcategory_23': 11,
'threatscore_127': 12}
Some words are missing from the vocabulary, such as ETH, FirstHourDay_22 and Anomaly_True.
Why is this? How can I get a full vocabulary?
EDIT:
The missing words are most likely caused by the max_df=1 value in CountVectorizer: as an integer, max_df is an absolute document count, so every token that appears in more than one document is discarded (ETH, FirstHourDay_22 and Anomaly_True each appear in three documents). The default token_pattern also splits tokens on ':' and '.', which explains fragments like '0a' and '185' in the vocabulary.
EDIT:
I suggest reconsidering the problem with the following variable:
all_docs=['ETH0x0000 0017A4779C04 09002B000005 0 PortA Unknown 755 0 45300 FirstHourDay21 LastHourDay23 duration6911 ThreatScorenan ThreatCategorynan False AnomalyFalse',
'ETH0x0000 0017A4779C04 09002B000005 2 PortC Unknown 774 0 46440 FirstHourDay21 LastHourDay23 duration6911 ThreatScorenan ThreatCategorynan False AnomalyFalse',
'ETH0x0000 0017A4779C0A 09002B000005 0 PortA Unknown 752 0 45120 FirstHourDay21 LastHourDay23 duration6913 ThreatScorenan ThreatCategorynan False AnomalyFalse',
'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore122,127 ThreatCategory21,23 True AnomalyTrue',
'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore122,127 ThreatCategory21,23 True AnomalyTrue',
'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore127 ThreatCategory23 True AnomalyTrue']

Saving binary data in ColdFusion

I have a problem with saving a binary representation of a file to a file...
Let me show you my pain:
Everything starts with a file, file.pdf
Then the file is sent via POST to a website with some additional data:
curl --data "sector=4&name=John&surname=Smith&email=john#smith.com&isocode=PL&theFile=$(cat file.pdf | base64)" http://localhost/awesomeUpload
then the data is received and decoded:
var decoded = BinaryDecode(data.theFile, "Base64");
then I attempt to save it by:
var theFilePath = ExpandPath("/localserver/temp/theFile.pdf");
fileWrite(theFilePath , data.theFile);
or:
var file_output_steam = CreateObject("java","java.io.FileOutputStream").init(theFilePath);
file_output_steam.write(data.theFile);
file_output_steam.close();
My files do not match ;(
the original one looks like
%PDF-1.5
%µµµµ
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(pl-PL) /StructTreeRoot 13 0 R/MarkInfo<</Marked true>>>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[ 3 0 R] >>
endobj
3 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 10 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.32 841.92] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>
whereas the copy that went through ColdFusion looks like:
%PDF-1.5
%µµµµ
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(pl-PL) /StructTreeRoot 13 0 R/MarkInfo<</Marked true>> B™[™ŘšBŚŘšBŹŐ\KÔYŮ\ËĐŰÝ[ťKŇÚYÖČČ—H€Đ¦VćFö& ĐŁ2ö& ĐŁĂÂőG—RővRő&VçB""ő&W6÷W&6W3ĂÂôföçCĂÂôcR"ôc"#ŕ˝AÉ˝ŤM•Ńl˝A˝Q•áĐ˝%µ…ť•˝%µ…ť•˝%µ…ť•%t€>/MediaBox[ 0 0 595.32 841.92] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>B™[™ŘšBŤŘšBŹŃš[\‹Ń›]QXŰŮKÓ[™ÝMŚOŹBśÝ™X[CBž'cłB°Ś!8Ě1Ď]CsôŘQ&‰2  PäV˝ëËöĽ¨QŰge•ź
ďÂŃ,đť#"aKR•˘<1™[ä¸
(ÄňĄyoâ9S\Śĺ <ę8I±D¬‰#…Ć”ťLé‘ا÷ÍnU|WŸ‰t`ýuşąĽ\hlu&âĂ7ß
ů"Ĺ\Ŕ>pÇč÷÷.°ß’Ř——•‹ĚB™[™Ý™X[CB™[™ŘšBŤHŘšBŹŐ\Kћ۝ÔÝXť\KŐ\LĐ\ŮQ›ŰťĐPŃQJĐŘ[XśšKŃ[ŰŮ[™ËŇY[ť]KRŃ\ŘŮ[™[ť›ŰťČ
‹ŐŐ[šXŰŮHŚŹŹB™[™ŘšBŤŘšB–Č
Č—HB™[™ŘšBŤČŘšBŹĐ\ŮQ›ŰťĐPŃQJĐŘ[XśšKÔÝXť\KĐŇQ›Űť\L‹Ő\Kћ۝ĐŇQŃŇQX\ŇY[ť]KŃČLĐŇQŢ\Ý[R[™›Č‹Ń›Űť\ŘÜš\ÜH‹ŐČŚŕЦVćFö& ĐŁ‚ö& ĐŁĂÂô÷&FW&–ćr„–FVçF—G’’ő&Vv—7G'’„Fö&R’ő7WĆVÖVçBăŕЦVćFö& ĐŁ’ö& ĐŁĂÂőG—RôföçDFW67&—F÷"ôföçDć
please help
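One thing worth checking (an assumption, since the question doesn't show how the POST body is parsed): curl --data sends application/x-www-form-urlencoded data, and base64 output contains '+' and '=' characters that are special in that encoding. In particular, '+' decodes to a space, which would corrupt the base64 payload before BinaryDecode ever sees it. The effect is easy to demonstrate with Python's stdlib:

```python
import base64
import urllib.parse

# Bytes chosen so the resulting base64 text is guaranteed to contain '+':
payload = b"\xfb\xef\xbe" + b"%PDF-1.5 rest of the file..."
encoded = base64.b64encode(payload).decode("ascii")
assert "+" in encoded

# Sending that text unescaped in --data means the server's form decoder
# turns every '+' into a space, mangling the base64:
received = urllib.parse.unquote_plus(encoded)
print(received == encoded)   # False: '+' became ' '

# Percent-encoding the value first (what curl's --data-urlencode does)
# round-trips cleanly:
sent = urllib.parse.quote_plus(encoded)
print(urllib.parse.unquote_plus(sent) == encoded)  # True
```

If this is the cause, switching to --data-urlencode (or replacing spaces back with '+' on the server before BinaryDecode) should make the files match.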

Printing elements of a List in a different way

I need to print a List of Lists using Scala and the toString function, where every occurrence of 0 is replaced by an '_'. This is my attempt so far; the commented code represents my different attempts.
override def toString() = {
// grid.map(i => if(i == 0) '_' else i)
// grid map{case 0 => '_' case a => a}
// grid.updated(0, "_")
//grid.map{ case 0 => "_"; case x => x}
grid.map(_.mkString(" ")).mkString("\n")
}
My output should look something like this, but with underscores instead of the zeros:
0 0 5 0 0 6 3 0 0
0 0 0 0 0 0 4 0 0
9 8 0 7 4 0 0 0 5
1 0 0 0 7 0 9 0 0
0 0 9 5 0 1 6 0 0
0 0 8 0 2 0 0 0 7
6 0 0 0 1 8 0 9 3
0 0 1 0 0 0 0 0 0
Thanks in advance.
Just put an extra map in there to change 0 to _:
grid.map(_.map { case 0 => "_"; case x => x }.mkString(" ")).mkString("\n")
Nothing special:
def toString(xs: List[List[Int]]) = xs.map { ys =>
ys.map {
case 0 => "_"
case x => String.valueOf(x)
}.mkString(" ")
}.mkString("\n")
Although the other solutions are functionally correct, I believe this shows more explicitly what happens and as such is better suited for a beginner:
def gridToString(grid: List[List[Int]]): String = {
def replaceZero(i: Int): Char =
if (i == 0) '_'
else i.toString charAt 0
val lines = grid map { line =>
line map replaceZero mkString " "
}
lines mkString "\n"
}
First we define a method for converting the digit into a character, replacing zeroes with underscores. (It is assumed from your example that all the Int elements are < 10.)
Then we take each line of the grid, run each of the digits in that line through our conversion method, and assemble the resulting chars into a string.
Then we take the resulting line strings and join them into the final string.
The whole thing could be written shorter, but it wouldn't necessarily be more readable.
It is also good Scala style to use small inner methods like replaceZero in this example instead of writing all the code inline, as the name of a method helps indicate what it does, and as such enhances readability.
There's always room for another solution. ;-)
A grid:
type Grid[T] = List[List[T]]
Print a grid:
def print[T](grid: Grid[T]) = grid map(_ mkString " ") mkString "\n"
Replace all zeroes:
for (row <- grid) yield row.collect {
case 0 => "_"
case anything => anything
}