Julia dataframe: replace characters in a column

I am working on a dataframe in Julia and one column has zip codes. In certain instances, a state code is prepended to the zip code.
For example: if the zip code is 123456 and the state is ab, the value looks like ab 123456.
There are multiple values like that in the column. How can I replace every "ab " with "" in that column?

You can use the replace function with a regex matching your pattern, and broadcast it over all entries of the column like this:
julia> using DataFrames
julia> df = DataFrame(x = ["ab x", "y", "ab z"])
3×1 DataFrame
 Row │ x
     │ String
─────┼────────
   1 │ ab x
   2 │ y
   3 │ ab z
julia> df.x .= replace.(df.x, r"^ab " => "")
3-element Vector{String}:
 "x"
 "y"
 "z"
julia> df
3×1 DataFrame
 Row │ x
     │ String
─────┼────────
   1 │ x
   2 │ y
   3 │ z
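For readers coming from pandas, the same anchored-regex prefix removal can be done with the vectorized .str.replace (an added sketch for comparison, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'x': ['ab x', 'y', 'ab z']})
# Strip a leading "ab " from every entry, mirroring the Julia broadcast.
df['x'] = df['x'].str.replace(r'^ab ', '', regex=True)
print(df)
#    x
# 0  x
# 1  y
# 2  z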

Related

Use regular expression to extract elements from a pandas data frame

From the following data frame:
import pandas as pd
d = {'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
My ultimate goal is to extract the letters a, b or c (as strings) into a pandas Series. For that I am using the .str.findall() method (which builds on the re module), as shown below:
# import the module
import re
# define the patterns
pat = 'a|b|c'
# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)
The problem is that the output, i.e. the letter a, b or c in each row, is wrapped in a single-element list, as shown below:
Out[301]:
0 [a]
1 [b]
2 [c]
3 [a]
While I would like to have the letters a, b or c as string, as shown below:
0 a
1 b
2 c
3 a
I know that if I combine re.search() with .group() I can get a string, but if I do:
df['col1'].str.search(pat).group()
I will get the following error message:
AttributeError: 'StringMethods' object has no attribute 'search'
Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture a-b)
Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?
Use str.extract with a capturing group:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
result = df['col1'].str.extract('(a|b|c)')
print(result)
Output
0
0 a
1 b
2 c
3 a
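One caveat worth knowing (an added note, not from the original answer): str.extract returns NaN for rows where the pattern does not match, so missing prefixes are easy to spot:

import pandas as pd

s = pd.Series(['a-1524112-124', 'x-999'])
# 'x' is not covered by the pattern, so the second row yields NaN.
print(s.str.extract('(a|b|c)'))
#      0
# 0    a
# 1  NaN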
Fix your code: take the first element of each list with .str[0]:
pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]:
0 a
1 b
2 c
3 a
Name: col1, dtype: object
Simply try str.split() like this, keeping only the part before the first delimiter:
import pandas as pd
d = {'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
# Split once on "-" and keep the first piece.
df['col1'] = df["col1"].str.split("-", n=1, expand=True)[0]
print(df.head())
Output:
col1
0 a
1 b
2 c
3 a
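As the asker notes, splitting on "-" cannot capture prefixes that themselves contain the delimiter (e.g. a-b). A sketch of one way to handle that with str.extract and an anchored pattern (the a-b row is an invented example):

import pandas as pd

df = pd.DataFrame({'col1': ['a-1524112-124', 'a-b-1515', 'c-584854']})
# Match letter groups joined by dashes, stopping at the first digit.
print(df['col1'].str.extract(r'^([a-z]+(?:-[a-z]+)*)-\d'))
#      0
# 0    a
# 1  a-b
# 2    c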

Remove all words containing '#' from list in DataFrame

I have a DataFrame in which one column contains lists of words.
>>> dataset.head()
                        contain
0  [name, Place, ect#gtr, nick]
1       [gf#e, nobel, play, hi]
I want to remove all the words which contain '#'. In the above example, I want to remove "ect#gtr" and "gf#e".
Try this one. Note that it assumes string columns (not lists) and drops whole rows containing '#', rather than removing individual words:
import numpy as np
ab = np.column_stack([~df[col].str.contains(r"#") for col in df])
new_df = df.loc[ab.any(axis=1)]
print(new_df)
Use a list comprehension with filtering; a regex is not necessary here:
df = pd.DataFrame({'contain': [['name', 'Place', 'ect#gtr', 'nick'],
                               ['gf#e', 'nobel', 'play', 'hi']]})
print (df)
                        contain
0  [name, Place, ect#gtr, nick]
1       [gf#e, nobel, play, hi]
df.contain = df.contain.apply(lambda x: [y for y in x if '#' not in y])
Or:
df.contain = [[y for y in x if '#' not in y] for x in df.contain]
print (df)
contain
0 [name, Place, nick]
1 [nobel, play, hi]
EDIT: To remove such words from within strings, add split with join:
df = pd.DataFrame({'contain':['name Place ect#gtr nick',"gf#e nobel play hi"]})
print (df)
contain
0 name Place ect#gtr nick
1 gf#e nobel play hi
df.contain = df.contain.apply(lambda x: ' '.join([y for y in x.split() if '#' not in y]))
print (df)
contain
0 name Place nick
1 nobel play hi
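A pandas-native alternative (a sketch assuming pandas 0.25 or newer for explode) that avoids the Python-level loop over the lists:

import pandas as pd

df = pd.DataFrame({'contain': [['name', 'Place', 'ect#gtr', 'nick'],
                               ['gf#e', 'nobel', 'play', 'hi']]})
# Flatten the lists, drop entries containing '#', then regroup per row.
s = df['contain'].explode()
df['contain'] = s[~s.str.contains('#')].groupby(level=0).agg(list)
print(df)
#                contain
# 0  [name, Place, nick]
# 1    [nobel, play, hi]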

Implementing Iterative deepening depth-first search

I am using the following pseudocode from the Wikipedia page to implement iterative deepening depth-first search for graphs:
function IDDFS(root)
    for depth from 0 to ∞
        found ← DLS(root, depth)
        if found ≠ null
            return found

function DLS(node, depth)
    if depth = 0 and node is a goal
        return node
    if depth > 0
        foreach child of node
            found ← DLS(child, depth−1)
            if found ≠ null
                return found
    return null
Here is my code:
bool DLS(GrapheMat* graphe, Node* source, NomSom but, int limit) {
    bool found = false;
    printf("%s\n", (char*)source->etat);
    if (strcmp((char*)source->etat, (char*)but) == 0) {
        return true;
    }
    if (limit > 0) {
        List* listSon = nodeSon(graphe, source);
        while (!listEpmty(listSon)) {
            Node* son = (Node*)popList(listSon);
            if (DLS(graphe, son, but, limit-1)) {
                return true;
            }
        }
    }
    return false;
}
bool IDDLS(GrapheMat* graphe, NomSom goal, int limit) {
    bool found = false;
    Node* source = createNode(graphe, graphe->nomS[0]);
    for (int i = 0; i <= limit; i++) {
        printf("/nLimit : %d\n", i);
        DLS(graphe, source, goal, i);
    }
    return false;
}
I am testing with a graph built from the following file:
A B C D E F G H I J ;
A : B (140) C (118) D (75) ;
B : A (140) E (99) F (151) G (80);
C : A (118) ;
D : A (75) F (71) ;
E : B (99) H (211) ;
F : D (71) B (151) ;
G : B (80) I (146) J (97) ;
H : E (211) J (101) ;
I : G (146) J (138) ;
J : G (97) H (101) I (138) ;
Calling IDDLS(graphe, "J", 4) outputs the following:
/nLimit : 0
A
That's all.
Calling DLS(graphe, "A", "J", 4) outputs the following (newlines removed):
ABABAEFGCADAFEBAEFGHEJ
From what I understand, the DLS function should actually follow the following path:
ABEGHCDEFGHIJ
DLS(graphe, "A", "J", 4) is taking the right path. ABABAEFGCADAFEBAEFGHEJ is correct.
Here is the expansion tree for the depth-4 search (the remaining depth drops from 4 at the root to 0 at the leaves):
A
├─ B
│  ├─ A
│  │  ├─ B
│  │  │  ├─ A
│  │  │  ├─ E
│  │  │  ├─ F
│  │  │  └─ G
│  │  ├─ C
│  │  │  └─ A
│  │  └─ D
│  │     ├─ A
│  │     └─ F
│  ├─ E
│  │  ├─ B
│  │  │  ├─ A
│  │  │  ├─ E
│  │  │  ├─ F
│  │  │  └─ G
│  │  └─ H
│  │     ├─ E
│  │     └─ J   ← goal found
│  ├─ F   (not visited: the search returns once J is found)
│  └─ G
├─ C
└─ D
In IDDLS, replace
DLS(graphe, source, goal, i);
with
if (DLS(graphe, source, goal, i)) {
    return true;
}
There's no need to keep looking deeper once you've found the node.
The only way IDDLS(graphe, "J", 4) could output what you say it does is if the program was killed by a signal (e.g. SIGSEGV).[1] Verify this by checking the process's exit code. If that's the case, there's a problem with the functions DLS calls, or with how it calls them.
You also have a memory leak: the List created by nodeSon is never freed.
Optimized to remove needless string comparisons:
bool DLS(GrapheMat* graphe, Node* source, NomSom but, int limit) {
    printf("%s\n", (char*)source->etat);
    if (limit) {
        List* listSon = nodeSon(graphe, source);  /* NB: still leaked; free it with whatever destructor your List type provides */
        while (!listEpmty(listSon)) {
            Node* son = (Node*)popList(listSon);
            if (DLS(graphe, son, but, limit-1)) {
                return true;
            }
        }
        return false;
    } else {
        /* Only test for the goal at the depth limit, as in the pseudocode. */
        return strcmp((char*)source->etat, (char*)but) == 0;
    }
}

bool IDDLS(GrapheMat* graphe, NomSom goal, int limit) {
    Node* source = createNode(graphe, graphe->nomS[0]);
    for (int i = 0; i <= limit; ++i) {
        printf("\nLimit : %d\n", i);
        if (DLS(graphe, source, goal, i)) {
            return true;
        }
    }
    return false;
}
[1] Well, it's also possible that one of the functions you call calls exit, performs a long jump, or does something similarly weird.

Extracting Specific Columns from Multiple Files & Writing to File Python

I have seven tab-delimited files; each file has exactly the same number and names of columns but different data. Below is a sample of how any of the seven files looks:
test_id  gene_id  gene  locus  sample_1  sample_2  status  value_1  value_2  log2(fold_change)  test_stat  p_value  q_value  significant
000001   000001   ZZ    1:1    01        01        NOTEST  0        0        0                  0          1        1        no
I am trying to read all seven files and extract the third, fourth and tenth columns (gene, locus, log2(fold_change)), and write those columns to a new file, so the file looks something like this:
gene name locus log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change)
ZZ 1:1 0 0 0 0 0 0 0
All the log2(fold_change) values are obtained from the tenth column of each of the seven files.
What I have so far is below. I need help constructing a more efficient, Pythonic way to accomplish the task; note that the code does not yet accomplish the task explained above and needs some work:
import csv
import glob
from collections import defaultdict

dicti = defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        for row in reader:
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

name_of_fold = raw_input("Folder name to stored output files in: ")
for file in glob.glob("*.txt"):
    base = file[0:3] + "-log2(fold_change)"
    filetag.append(base)
    read_data(file, base)

with open("output.txt", "w") as out:
    out.write("gene name" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
    for k, v in dicti.items():
        out.write(k + "\t" + v[1][1][3] + "\t" + "".join([int(z[0][0:3]) * "\t" + z[1][9] for z in v]) + "\n")
So, the code above runs, but it is not what I am looking for, and here is why. The output code is the issue: I am writing a tab-delimited output file with the gene in the first column (k), and v[1][1][3] is the locus of that particular gene. What I am having a tough time coding is this part of the output:
"".join([ int(z[0][0:3]) * "\t" + z[1][9] for z in v ])
I am trying to list the fold change from each of the seven files at that particular gene and locus, and write each value to its correct column. I multiply the file's column number by "\t" to ensure the value goes to the right column. The problem is that when the value from the next file comes along, writing resumes from where it left off, rather than starting again from the beginning of the line, which is what I want.
Here is what I mean. For instance:
gene name  locus  log2(fold change) from file 1 .... log2(fold change) from file 7
ZZ         1:3    0
                         0
The first log2 value is recorded correctly: I multiply its column number (for instance 2) by "\t" before the fold_change value, and it lands where it should. But the last value, say from file 7, does not land in the seventh column, because writing continues from where the previous value ended.
Here is my first approach:
import glob
import numpy as np
with open('output.txt', 'w') as out:
    fns = glob.glob('*.txt')  # Here you can change the pattern of the file (e.g. 'file_experiment_*.txt')
    # Title row:
    titles = ['gene_name', 'locus'] + [str(file + 1) + '_log2(fold_change)' for file in range(len(fns))]
    out.write('\t'.join(titles) + '\n')
    # Data row:
    data = []
    for idx, fn in enumerate(fns):
        file = np.genfromtxt(fn, skip_header=1, usecols=(2, 3, 9), dtype=np.str, autostrip=True)
        if idx == 0:
            data.extend([file[0], file[1]])
        data.append(file[2])
    out.write('\t'.join(data))
Content of the created file output.txt (Note: I created just three files for testing):
gene_name locus 1_log2(fold_change) 2_log2(fold_change) 3_log2(fold_change)
ZZ 1:1 0 0 0
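If the files can contain more than one data row, here is a pandas-based sketch of the same merge (not part of the original answer; column names are taken from the sample header in the question, and it assumes (gene, locus) pairs are unique within each file):

import glob
import pandas as pd

fns = sorted(glob.glob('*.txt'))

# Read only the three wanted columns from each file, then lay the
# log2(fold_change) columns side by side, keyed on (gene, locus).
frames = []
for i, fn in enumerate(fns, start=1):
    cols = pd.read_csv(fn, sep='\t', usecols=['gene', 'locus', 'log2(fold_change)'])
    cols = cols.rename(columns={'log2(fold_change)': '%d_log2(fold_change)' % i})
    frames.append(cols.set_index(['gene', 'locus']))

pd.concat(frames, axis=1).reset_index().to_csv('output.txt', sep='\t', index=False)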
I am using re instead of csv. The main problem with your code is the for loop which writes the output to the file. I am writing the complete code below; hope this solves the problem you have.
import collections
import glob
import re

dicti = collections.defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        for row in f:
            r = re.compile(r'([^\s]*)\s*')
            row = r.findall(row.strip())[:-1]
            print row
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

def main():
    name_of_fold = raw_input("Folder name to stored output files in: ")
    for file in glob.glob("*.txt"):
        base = file[0:3] + "-log2(fold_change)"
        filetag.append(base)
        read_data(file, base)
    with open("output", "w") as out:
        data = ("genename" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
        r = re.compile(r'([^\s]*)\s*')
        data = r.findall(data.strip())[:-1]
        out.write('{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30}{0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
        out.write('\n')
        for key in dicti:
            print 'locus = ' + str(dicti[key][1])
            data = (key + "\t" + dicti[key][1][1][3] + "\t" + "".join([len(z[0][0:3]) * "\t" + z[1][9] for z in dicti[key]]) + "\n")
            data = r.findall(data.strip())[:-1]
            out.write('{0[0]:<30}{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30}{0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
            out.write('\n')

if __name__ == '__main__':
    main()
I changed the name of the output file from output.txt to output, as the former may interfere with the code, which treats all .txt files as input. I am attaching the output I got, which I assume is the format you wanted. Thanks.
gene name locus 1.t-log2(fold_change) 2.t-log2(fold_change) 3.t-log2(fold_change) 4.t-log2(fold_change) 5.t-log2(fold_change) 6.t-log2(fold_change) 7.t-log2(fold_change)
ZZ 1:1 0 0 0 0 0 0 0
Remember to append \n to the end of each line to create a line break. This method is very memory efficient, as it just processes one row at a time.
import csv
import os
import glob

# Your folder location where the input files are saved.
name_of_folder = '...'
output_filename = 'output.txt'
input_files = glob.glob(os.path.join(name_of_folder, '*.txt'))

with open(os.path.join(name_of_folder, output_filename), 'w') as file_out:
    headers_read = False
    for input_file in input_files:
        if input_file == os.path.join(name_of_folder, output_filename):
            # If the output file is in the list of input files, ignore it.
            continue
        with open(input_file, 'r') as fin:
            reader = csv.reader(fin)
            if not headers_read:
                # Read column headers just once.
                headers = reader.next()[0].split()
                headers = headers[2:4] + [headers[9]]  # Zero-based indexing.
                file_out.write("\t".join(headers + ['\n']))
                headers_read = True
            else:
                _ = reader.next()  # Ignore the header row.
            for line in reader:
                if line:  # Ignore blank lines.
                    line_out = line[0].split()
                    file_out.write("\t".join(line_out[2:4] + [line_out[9]] + ['\n']))
>>> !cat output.txt
gene locus log2(fold_change)
ZZ 1:1 0
ZZ 1:1 0
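Since the source files are tab delimited, a small variation on the same idea (an added sketch, not from the original answer; 'input1.txt' is a placeholder file name): passing delimiter='\t' to csv.reader removes the manual .split() calls:

import csv

with open('input1.txt') as fin, open('output.txt', 'w') as fout:
    reader = csv.reader(fin, delimiter='\t')
    next(reader)  # Skip the header row.
    for row in reader:
        if row:  # Ignore blank lines.
            fout.write('\t'.join([row[2], row[3], row[9]]) + '\n')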

Using Pandas to subset data from a dataframe based on multiple columns?

I am new to Python. I have to extract a subset from a pandas DataFrame based on 2 lists corresponding to 2 columns in that DataFrame. Both values should match the DataFrame values as row-wise pairs. I have tried the isin function, but obviously it doesn't work with combinations.
from pandas import *
d = {'A' : ['a', 'a', 'c', 'a','b'] ,'B' : [1, 2, 1, 4,1]}
df = DataFrame(d)
list1 = ['a','b']
list2 = [1,2]
print df
A B
0 a 1
1 a 2
2 c 1
3 a 4
4 b 1
### Using isin function
df[(df.A.isin(list1)) & (df.B.isin(list2)) ]
A B
0 a 1
1 a 2
4 b 1
###Desired outcome
d2 = {'A' : ['a'], 'B':[1]}
DataFrame(d2)
A B
0 a 1
Please let me know if this can be done without using loops and if there is a way to do it in a single step.
A quick and dirty way to do this is using zip:
df['C'] = list(zip(df['A'], df['B']))  # list() is needed on Python 3, where zip returns an iterator
list3 = list(zip(list1, list2))
d2 = df[df['C'].isin(list3)]
print(d2)
A B C
0 a 1 (a, 1)
You can of course drop the newly created column after you're done filtering on it.
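An alternative that avoids the helper column is an inner merge on both columns (an added sketch, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'c', 'a', 'b'], 'B': [1, 2, 1, 4, 1]})
pairs = pd.DataFrame({'A': ['a', 'b'], 'B': [1, 2]})

# The inner merge keeps only rows whose (A, B) pair appears in `pairs`.
d2 = df.merge(pairs, on=['A', 'B'])
print(d2)
#    A  B
# 0  a  1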