bash/sed script to get output from a file using a regex

I have a file which contains lines of the form object 0: data: 2, object 0: data: 232132 in between other lines in the file. I need to extract the data values from the file for every object i and store them, space-separated, in an output file (say, output) using bash or sed. It would be great if someone could help me achieve this.
Example input:
num objects: 3
object 0: name: 'x'
object 0: size: 4
object 0: data: 1
object 1: name: 'y'
object 1: size: 4
object 1: data: 3231
object 2: name: 'x'
object 2: size: 4
object 2: data: -32
Example output:
1 3231 -32

You could use something like this:
awk '$3=="data:"{print $4}' file
This outputs the 4th field when the 3rd field is equal to "data:".
Shorter still, you could just match the pattern /data:/:
awk '/data:/{print $4}' file
To output the numbers on the same line, use printf rather than print. To keep things cleaner, you can use an array and print the values in the END block:
awk '/data:/{a[++n]=$4}END{for(i=1;i<=n;++i)printf "%s%s",a[i],(i<n?FS:RS)}' file
Using an array like this makes it easy to separate each value with a space (FS) and add a newline (RS) at the end.
Any of these commands can produce an output file using redirection > output.
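Since the question mentions sed as well, here is a hedged sed equivalent. It is a sketch only; it assumes every "data:" value is the last field on its line, and the sample input is recreated inline so the commands run as-is:

```shell
# recreate the sample input from the question
printf 'num objects: 3\nobject 0: data: 1\nobject 1: data: 3231\nobject 2: data: -32\n' > file

# print everything after "data: " on matching lines, then join with spaces
sed -n 's/.*data: //p' file | paste -sd' ' - > output
cat output   # 1 3231 -32
```

paste -s joins all lines into one, using a space as the delimiter, which gives the required single-line output.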

Related

Read Excel (xlsx) files one by one in R

I have a bunch of Excel files in my directory. Is there a way to read all of them separately (not appending one to another) in a single command? For example,
I have 3 files in my folder
File1.xlsx
File2.xlsx
File3.xlsx
Expected output in R (Instead of reading them separately)
File1
## should have File1 contents
File2
## should have File2 contents
File3
## should have File3 contents
The function assign should help you - it assigns a value to a name given as a string. So the following code should do what you want (gsub is used to strip non-word characters, so that the result is a valid variable name):
library(readxl)
for (file in list.files(".", pattern = "xlsx?$", full.names = TRUE))
    assign(gsub("\\W", "", file), read_excel(file))
We observe that we have two objects in our workspace:
> ls()
[1] "file_1xls" "file_2xls"
Now let's see what they are:
> for(x in ls()) str(get(x))
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 2 variables:
$ a: num 1
$ b: num 2
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 2 variables:
$ c: num 4
$ d: num 5
The code below
loads the readxl library,
creates a vector of the files in the current directory with the .xlsx extension,
then reads those files into a list of data frames with lapply().
library(readxl)
file.list <- list.files(pattern='\\.xlsx$')
df.list <- lapply(file.list, read_excel)

In Python, insert one space after the 5th character in each line of a text file

I am reading a text file in Python (500 rows) and it looks like this:
File Input:
0082335401
0094446049
01008544409
01037792084
01040763890
I wanted to ask: is it possible to insert one space after the 5th character in each line?
Desired Output:
00823 35401
00944 46049
01008 544409
01037 792084
01040 763890
I have tried the below code
st = " ".join(st[i:i + 5] for i in range(0, len(st), 5))
but the below output was returned on executing it:
00823 35401
0094 44604 9
010 08544 409
0 10377 92084
0104 07638 90
I am a novice in Python. Any help would make a difference.
There seem to be two issues here. Judging by the code you provided, you are reading the whole file into one single string. It would be much preferable (in your case) to read the file in as a list of strings, like the following (assuming your input file is input_data.txt):
# Initialize a list for the data to be stored
data = []

# Iterate through your file to read the data
with open("input_data.txt") as f:
    for line in f.readlines():
        # Use .rstrip() to get rid of the newline character at the end
        data.append(line.rstrip("\r\n"))
Then, to operate on the data you obtained in a list, you could use a list comprehension similar to the one you have tried to use.
# Assumes that data is the result from the above code
data = [i[:5] + " " + i[5:] if len(i) > 5 else i for i in data]
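As a quick sanity check, here is that comprehension run on the sample lines from the question, hard-coded instead of read from a file:

```python
# sample lines from the question, standing in for the file contents
data = ["0082335401", "0094446049", "01008544409"]
data = [i[:5] + " " + i[5:] if len(i) > 5 else i for i in data]
# data == ["00823 35401", "00944 46049", "01008 544409"]
```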
Hope this helped!
If your only requirement is to insert a space after the fifth character, then you could use the following simpler version:
#!/usr/bin/env python
with open("input_data") as data:
    for line in data.readlines():
        line = line.rstrip()
        if len(line) > 5:
            print(line[0:5]+" "+line[5:])
        else:
            print(line)
If you don't mind lines with fewer than five characters getting a space at the end, you could even omit the if-else statement and go with the print call from the if branch:
#!/usr/bin/env python
with open("input_data") as data:
    for line in data.readlines():
        line = line.rstrip()
        print(line[0:5]+" "+line[5:])

How can I fetch only the files whose names contain the word 'test' from practice.txt and merge their data into one file?

cat practice.txt
test_0909_3434 test_8838 test_case_5656_5433 case_4333_3211 note_4433_2212
The practice.txt file contains some more file names.
required output:
test_0909_3434 test_8838
These test files contain some data, so I need to merge the data from these two files into one final file.
test_0909_3434 file contains:
id name
1 hh
2 ii
test_8838 file contains:
id name
2 ii
3 gg
4 kk
The final output file, mergedfile.txt, will be as follows:
id name
1 hh
2 ii
3 gg
4 kk
We also need to remove redundant data, as in mergedfile.txt above.
1) simplistic (data is "in-order" and "well-formatted" in both input files):
cat f1 f2 | sort -u > f3
2) more complex (not "in-order" and not "well-formatted"): use a regex.
Read records from both input files. Assume an input record is called 'x'.
if [[ "${x}" =~ ^[[:space:]]*([[:digit:]]+)[[:space:]]+(.*)$ ]]; then
    d="${BASH_REMATCH[1]}"
    s="${BASH_REMATCH[2]}"
    echo "d == $d, s == $s"
fi
aa["${d}"]="${s}"
Where aa is a Bash associative array (available in Bash >= 4.x), declared with:
declare -A aa=()
This assumes that the first field is an integer (and a key). You can process accordingly, as to whether or not the key is unique.
If it's any more complex than this, consider using Perl.
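Putting the two steps together, a minimal end-to-end sketch might look like the following. The sample files are recreated inline from the question, and the grep pattern ^test_[0-9] is an assumption based on the required output (it keeps test_0909_3434 and test_8838 but not test_case_5656_5433):

```shell
# recreate the sample files from the question
printf 'test_0909_3434 test_8838 test_case_5656_5433 case_4333_3211 note_4433_2212\n' > practice.txt
printf 'id name\n1 hh\n2 ii\n' > test_0909_3434
printf 'id name\n2 ii\n3 gg\n4 kk\n' > test_8838

# 1) pick only the wanted file names out of practice.txt
files=$(tr ' ' '\n' < practice.txt | grep -E '^test_[0-9]+')

# 2) keep the header from the first file once, then merge the
#    remaining rows and drop duplicates
set -- $files
head -n 1 "$1" > mergedfile.txt
for f in $files; do tail -n +2 "$f"; done | sort -u >> mergedfile.txt
cat mergedfile.txt
```

Because the header is written once before the sorted body, it stays on the first line instead of being sorted in with the data.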

Python Read then Write project

I am trying to write a program that will read a text file and convert what it reads into another text file, but using the given variables - kind of like a homemade encryption. I want the program to read 2 bytes at a time until it has read the entire file. I am new to Python but enjoy the application. Any help would be greatly appreciated.
a = 12
b = 34
c = 56
etc... up to 20 different types of variables
file2= open("textfile2.text","w")
file = open("testfile.txt","r")
file.read(2):
if file.read(2) = 12 then;
file2.write("a")
else if file.read(2) = 34
file2.write("b")
else if file.read(2) = 56
file2.write("c")
file.close()
file2.close()
Text file would look like:
1234567890182555
so the program would read 12 and write "a" in the other text file and then read 34 and put "b" in the other text file. Just having some logic issues.
I like your idea. Here is how I would do it. Note that I convert everything to lowercase using lower(); however, if you understand what I am doing, it would be quite simple to extend this to work on both lower- and uppercase:
import string

d = dict.fromkeys(string.ascii_lowercase, 0)  # Create a dictionary of all the letters in the alphabet
updates = 0
while updates < 20:  # Can only encode 20 characters
    letter = input("Enter a letter you want to encode or type encode to start encoding the file: ")
    if letter.lower() == "encode":  # Check if the user inputted encode
        break
    if len(letter) == 1 and letter.isalpha():  # Check the user's input was only 1 character long and in the alphabet
        encode = input("What do want to encode %s to: " % letter.lower())  # Ask the user what they want to encode that letter to
        d[letter.lower()] = encode
        updates += 1
    else:
        print("Please enter a letter...")

with open("data.txt") as f:
    content = list(f.read().lower())

for idx, val in enumerate(content):
    if val.isalpha():
        content[idx] = d[val]

with open("data.txt", 'w') as f:
    f.write(''.join(map(str, content)))
print("The file has been encoded!")
Example Usage:
Original data.txt:
The quick brown fox jumps over the lazy dog
Running the script:
Enter a letter you want to encode or type encode to start encoding the file: T
What do want to encode t to: 6
Enter a letter you want to encode or type encode to start encoding the file: H
What do want to encode h to: 8
Enter a letter you want to encode or type encode to start encoding the file: u
What do want to encode u to: 92
Enter a letter you want to encode or type encode to start encoding the file: 34
Please enter a letter...
Enter a letter you want to encode or type encode to start encoding the file: rt
Please enter a letter...
Enter a letter you want to encode or type encode to start encoding the file: q
What do want to encode q to: 9
Enter a letter you want to encode or type encode to start encoding the file: encode
The file has been encoded!
Encoded data.txt:
680 992000 00000 000 092000 0000 680 0000 000
I would read the source file and convert the items as you go into a string, then write the entire result string separately to the second file. This also lets you use the better with open construct for file reading, which allows Python to handle file closing for you.
The code below will not work as-is, because it only reads the first two characters. You need to come up with your own way to iterate over the input, but here is the idea (without just handing you a solution):
results = ""  # accumulate the decoded string here
with open("textfile.text","r") as f:
    # you need to create a way to iterate over these two byte/char increments
    code = f.read(2)
    decoded = <figure out what code translates to>
    results += decoded

# now you have a decoded string inside `results`
with open("testfile.txt","w") as f:
    f.write(results)
the decoded = <figure out what code translates to> part can be done much better than using a bunch of serial if/elseifs....
perhaps define a dictionary of the encodings?
codings = {
    "12": "a",
    "45": "b",
    # etc...
}
then you could just:
results += codings[code]
instead of the if statements (and it would be faster).
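Pulling those pieces together, here is a minimal runnable sketch of the dictionary approach; the sample input string and the "?" fallback for unmapped codes are illustrative assumptions, not part of the original answer:

```python
# hypothetical mapping; extend it with the rest of your 20 codes
codings = {
    "12": "a",
    "34": "b",
    "56": "c",
}

data = "123456"  # stands in for the contents of the input file
results = ""
for i in range(0, len(data), 2):       # step through the data two characters at a time
    code = data[i:i + 2]
    results += codings.get(code, "?")  # "?" marks codes with no mapping
# results == "abc"
```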

Format of google-perftools/pprof with heap profiling

There is a pprof utility in the google-perftools package. It is a utility that converts profile files from the google-perftools CPU profiler and heap profiler into nice images, like https://github.com/gperftools/gperftools/tree/master/doc/pprof-test-big.gif and
https://github.com/gperftools/gperftools/tree/master/doc/heap-example1.png
The format of pprof's input file is described for cpu profiles here: https://github.com/gperftools/gperftools/tree/master/doc/cpuprofile-fileformat.html
but the format of the heap profile input files is not described in the repository.
What is the "heapprofiling" format, and how can I generate such a file from my code? I can already generate the CPU profiler format, so I am interested in the differences between the two formats.
It seems the format is not binary, as it is for the CPU profiler, but textual:
First line:
heap profile: 1: 2 [ 3: 4] @ heapprofile
Regex (not full):
(\d+): (\d+) \[ *(\d+): (\d+)\] @ ([^/]*)(/(\d+))?
where
1 & 2 are the "in-use" stats; the first number is the count of allocations, the second is the byte count
3 & 4 are the "total allocated" stats, with the first and second numbers having the same meaning
heapprofile is the type
Then a profile itself follows in lot of lines:
1: 2 [ 3: 4] @ 0x00001 0x00002 0x00003
where "1: 2" and "3: 4" have the same meaning as in the first line, but count only what was malloced from the given call site; 0x00001 0x00002 0x00003 is the call stack of the call site.
Then comes an empty line and "MAPPED_LIBRARIES:". From the next line on, something very much like a copy of /proc/pid/maps follows.
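As a sketch of how such a record line could be parsed (illustrative only; the sample line is made up, and it assumes the '@' separator that gperftools heap profiles place before the call stack):

```python
import re

# one profile record: in-use count/bytes, total count/bytes, then the call stack
line = "   1:        2 [   3:        4] @ 0x00001 0x00002 0x00003"
m = re.match(r"\s*(\d+):\s+(\d+)\s+\[\s*(\d+):\s+(\d+)\]\s+@\s+(.*)", line)
inuse_count, inuse_bytes, total_count, total_bytes = map(int, m.groups()[:4])
stack = m.group(5).split()  # call-site return addresses
# inuse_count == 1, total_bytes == 4, stack == ["0x00001", "0x00002", "0x00003"]
```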