Marking strings in one list that exist in another list - compare

I have 2 lists of values in 2 variables. They contain ZIP codes stored as strings, because the codes have both numbers and letters. My first list contains 33,000 ZIP codes, the second list 1400. Now I want to check whether the ZIP codes from the second variable are also in the first variable, and if so, give a third variable the code 1. If a ZIP code is not in both lists, give it the code 0. I've tried to compare datasets, but that only checks whether the values match at the same position. Writing a loop hasn't worked so far.
Hopefully someone can help! Thanks in advance.

Assuming you have two datasets:
dataset activate list2.
compute InBothLists=1.
sort cases by zipcode.
dataset activate list1.
sort cases by zipcode.
match files /file=* /table=list2 /by zipcode.
execute.
In the code above, use your own dataset names and variable names, and make sure you have the same variable name for the zipcode in both lists.
Once you run this you will have a new variable in the dataset list1 which has the value 1 for zipcodes that also appear in list2.
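For readers who find it easier to think outside SPSS syntax, here is a minimal Python sketch of the same lookup idea. The function name `mark_in_both` and the sample ZIP codes are made up for illustration. One difference worth noting: as far as I know, the MATCH FILES approach above leaves unmatched cases system-missing rather than 0, whereas this sketch writes an explicit 0.

```python
def mark_in_both(list1, list2):
    """Pair each code in list1 with 1 if it also occurs in list2, else 0."""
    lookup = set(list2)  # a set gives fast membership tests
    return [(code, 1 if code in lookup else 0) for code in list1]


# Hypothetical sample data, not taken from the question
list1 = ["1012AB", "2511CD", "3011EF"]
list2 = ["2511CD", "9999ZZ"]

flags = mark_in_both(list1, list2)
print(flags)
```

Because the lookup is keyed on the value itself, the position of a ZIP code in either list does not matter, which is the problem the position-based dataset comparison ran into.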


Splitting based on a condition and "Array arguments to IF are of different size"

I'm working on a sheet that is extracted from a system, and all the data is in one cell, so I need to split it. This is straightforward for all cells except for the results. As you can see, the test results should follow the same pattern as the result status. So I split one column (test status) in the usual way and tried to split the test results based on an IF condition.
It worked perfectly; however, for some statuses the test results were not splitting because of the error "Array arguments to IF are of different size."
How do I fix this? Please help.
Thank you
Because the IF() function is checking only the first cell of the C2:G2 range. Concatenate the cells into a single string, then use the SEARCH function to detect the word Final. Try:
=IF(ISNUMBER(SEARCH("Final",JOIN(",",C2:G2))),SPLIT(H2,","),"")
You may try this:
=BYCOL(C2:G2,LAMBDA(Σ,(IF(Σ<>"Final",,INDEX(SPLIT(H2,","),, COUNTIF(C2:Σ,Σ))))))
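As I read the BYCOL formula, it walks the status cells left to right and hands the first comma-separated result to the first "Final" status, the second result to the second "Final", and so on. Here is a rough Python sketch of that logic. The function name `assign_results` and the sample values are hypothetical, and returning an empty string when the results run out is my own assumption rather than what the sheet would do.

```python
def assign_results(statuses, results_cell, keyword="Final", sep=","):
    """Give the n-th split result to the n-th status equal to keyword."""
    results = results_cell.split(sep)
    output = []
    seen = 0  # how many keyword statuses have been met so far
    for status in statuses:
        if status != keyword:
            output.append("")  # leave non-matching columns blank
            continue
        seen += 1
        # Assumption: stay blank if there are fewer results than statuses
        output.append(results[seen - 1] if seen <= len(results) else "")
    return output


# Hypothetical row: five status cells and one combined results cell
row = assign_results(["Final", "Pending", "Final", "", ""], "10,20")
print(row)
```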

How to get all cells that appear more than 5 times?

I have a table in OpenOffice that contains a column with region's codes (column J). Using table functions, how to get all codes that appear more than 5 times and write them in one cell?
Normally I would recommend breaking this problem down into smaller parts using helper columns. Or better yet, move the data into LibreOffice Base which can easily work with distinct values.
However, I managed to come up with a rather large formula that seems to do what you asked. Enter it as an array formula.
=TEXTJOIN(",";1;IF(COUNTIF(исходник.J$2:J$552;исходник.J2:J552)>5;IF(ROW(исходник.J2:J552)=MATCH(исходник.J2:J552;исходник.J$2:J$552;0)+ROW(J$2)-1;исходник.J2:J552;"")))
I can't test this on your actual data since your example is only an image, but let's say that there are six of both 77 and 37. Then this would show 77,37 as the result.
Here is a breakdown. Look up the functions in LibreOffice Online Help for more information.
=TEXTJOIN(",";1; — Join all results into a single cell, separated by commas.
IF(COUNTIF(исходник.J$2:J$552;исходник.J2:J552)>5; — Find codes that occur more than 5 times. This is the same as what you wrote.
IF(ROW(исходник.J2:J552)= — Compare the next result to the row number that we are currently looking at.
MATCH(исходник.J2:J552;исходник.J$2:J$552;0)+ROW(J$2)-1; — Determine the first row that has this code. We do this to get unique results instead of 6 or more of each code in the result.
исходник.J2:J552;""))) — Return the code. (Your formula simply returns 1 here, which doesn't seem to be what you want.) If it doesn't match, return an empty string rather than 0, because TEXTJOIN ignores empty strings.
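If it helps to see the same steps outside a spreadsheet, this is a small Python sketch of the logic: count every code, keep the ones that occur more than 5 times, report each only once in order of first appearance, and join them with commas. The name `frequent_codes` and the sample data are invented for illustration.

```python
from collections import Counter


def frequent_codes(codes, threshold=5):
    """Return codes occurring more than threshold times, comma-joined."""
    counts = Counter(codes)  # plays the role of COUNTIF
    seen = set()
    result = []
    for code in codes:
        # Keep only the first occurrence, like the ROW = MATCH test
        if counts[code] > threshold and code not in seen:
            seen.add(code)
            result.append(str(code))
    return ",".join(result)  # plays the role of TEXTJOIN


# Hypothetical column: six 77s, two 12s, six 37s and a single 5
codes = [77] * 6 + [12] * 2 + [37] * 6 + [5]
print(frequent_codes(codes))
```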

How do I combine a random amount of lists in Python

I'm working on a program that reads through a FASTQ file and gives the number of N's per sequence in this file. I managed to get the number of N's per line and I put these in a list.
The problem is that I need all the numbers in one list to sum up the total number of N's in the file, but each number gets printed in its own list.
C:\Users\Zokids\Desktop>N_counting.py test.fastq
[4]
4
[3]
3
[5]
5
This is my output, the List and total amount in the list. I've seen ways to manually combine lists but one can have hundreds of sequences so that's a no go.
def Count_N(line):
    '''
    This function takes a line and counts the amount of N's in the line
    '''
    List = []
    Count = line.count("N")  # Count the amount of N's that are in the line returned by import_fastq_file
    List.append(int(Count))
    Total = sum(List)
    print(List)
    print(Total)
This is what I have as code, another function selects the lines.
I hope someone can help me with this.
Thank you in advance.
The List you're defining in your function never gets more than one item, so it's not very useful. Instead, you should probably return the count from the function, and let the calling code (which is presumably running in some kind of loop) append the value to its own list. Of course, since there's not much to the function, you might just move its contents out to the loop too!
For example:
list_of_counts = []
for line in my_file:
    count = line.count("N")
    list_of_counts.append(count)
total = sum(list_of_counts)
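If you would rather keep a function, a minimal sketch of the "return the count and let the caller collect it" idea could look like this. The names `count_n` and `total_n` and the sample sequences are made up, and I'm assuming the sequence lines have already been selected by your other function.

```python
def count_n(line):
    """Return the number of N's in one sequence line."""
    return line.count("N")


def total_n(lines):
    """Collect the per-line counts in one list and sum them."""
    counts = [count_n(line) for line in lines]
    return counts, sum(counts)


# Hypothetical sequence lines, already picked out of the FASTQ file
sequences = ["ACGTNNNN", "NNNACGT", "NANCNGNTN"]
counts, total = total_n(sequences)
print(counts)
print(total)
```

This works the same whether the file has three sequences or several hundred, because the list is built by the caller rather than inside the counting function.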
It looks from your code as though you pass one line each time you call Count_N(). The List you declared is local, so it gets reinitialized each time you call the function. You can make the list global by creating it outside the function and declaring it global inside the function:
List = []  # outside the function
global List  # inside the function, before List is used
Note that global List =[] on a single line is not valid Python, so the declaration and the assignment have to be separate statements.
It would also be better to total the list outside the function. Right now you are summing the list inside the function. To move that code out, match its indentation to that of the function declaration.

Conditional Vlook up without using VBA

I want to convert an input to desired output. Kindly help.
In the output - the columns value should start from most recent (year)
Please click this to see data
Unfortunately VLOOKUP is not able to fulfill that ask. However the INDEX-function can.
Here is a good read on how to use it:
http://fiveminutelessons.com/learn-microsoft-excel/use-index-lookup-multiple-values-list
This will work for your spreadsheet if your input table starts at A1 without a header and your output table starts at H3 with the first ID.
You get this by copy&pasting the first column of your input table to column H and then remove duplicates.
{=IF(ISERROR(INDEX($A$1:$C$7,SMALL(IF($A$1:$A$7=$H$3,ROW($A$1:$A$7)),ROW(1:1)),3)),"",
INDEX($A$1:$C$7,SMALL(IF($A$1:$A$7=$H$3,ROW($A$1:$A$7)),ROW(1:1)),3))}
Let's look at the formula step by step:
The curly brackets tell Excel that this is an array formula. The interesting part for you is this: when you've inserted the formula (without the curly brackets), press Shift+Ctrl+Enter and Excel will then know that this is an array formula.
'error at formula?, then blank, else formula
=IF(ISERROR(....),"",...)
When you autofill this formula, you probably don't know how many instances of your lookup value there are. So when you put this formula in 4 cells but there are only 3 entries, this bit will keep the cell blank instead of giving an error.
INDEX($A$1:$C$7,SMALL(IF($A$1:$A$7=$H$3,ROW($A$1:$A$7)),ROW(1:1)),3)
$A$1:$C$7 is your data matrix. Your IDs (in your case 125 and 501) are to be found in $A$1:$A$7. ROW(1:1) is the absolute(!) rowID, 3 the absolute(!) column id. So when you move your input table those values have to be changed.
What exactly SMALL and INDEX do is well described in the link above. (Or at least better than I could.)
Hope that clarified some parts,
Tom
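For anyone who finds the nested formula hard to follow, here is a rough Python sketch of what the INDEX/SMALL/IF combination is doing: collect the rows whose first column equals the lookup ID, pick the n-th of them, and return its third column, or a blank if there is no n-th match. The function name `nth_match` and the sample rows are hypothetical and are not the data from the question.

```python
def nth_match(rows, key, n, col=2):
    """Return column col of the n-th row whose first column equals key."""
    # Like IF($A$1:$A$7=$H$3, ROW(...)): keep only the matching rows
    matches = [row for row in rows if row[0] == key]
    # Like the ISERROR wrapper: blank when there is no n-th match
    if n > len(matches):
        return ""
    # Like SMALL(..., n) fed into INDEX(..., 3); n counts from 1
    return matches[n - 1][col]


# Hypothetical input table: (ID, year, value)
rows = [(125, 2014, "a"), (501, 2013, "b"), (125, 2012, "c")]
print(nth_match(rows, 125, 1))
print(nth_match(rows, 125, 2))
```

In the spreadsheet, n comes from ROW(1:1), which becomes ROW(2:2), ROW(3:3) and so on as the formula is filled down.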

R: searching within split character strings with apply

Within a large data frame, I have a column containing character strings e.g. "1&27&32" representing a combination of codes. I'd like to split each element in the column, search for a particular code (e.g. "1"), and return the row number if that element does in fact contain the code of interest. I was thinking something along the lines of:
apply(df["MEDS"],2,function(x){x.split<-strsplit(x,"&")if(grep(1,x.split)){return(row(x))}})
But I can't figure out where to go from there since that gives me the error:
Error in apply(df["MEDS"], 2, function(x) { :
dim(X) must have a positive length
Any corrections or suggestions would be greatly appreciated, thanks!
I see a couple of problems here (in addition to the missing semicolon in the function).
df["MEDS"] is more correctly written df[,"MEDS"]. It is a single column. apply() is meant to operate on each column/row of a matrix as if they were vectors. If you want to operate on a single column, you don't need apply()
strsplit() returns a list of vectors. Since you are applying it to a row at a time, the list will have one element (which is a character vector). So you should extract that vector by indexing the list element strsplit(x,"&")[[1]].
You are returning row(x) as if the input to your function is a matrix or knows what row it came from. It does not. apply() will pull each row and pass it to your function as a vector, so row(x) will fail.
There might be other issues as well. I didn't get it fully running.
As I mentioned, you don't need apply() at all. You really only need to look at the 1 column. You don't even need to split it.
OneRows <- which(grepl('(^|&)1(&|$)', df$MEDS))
as Matthew suggested. Or if your intention is to subset the dataframe,
newdf <- df[grepl('(^|&)1(&|$)', df$MEDS),]
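To see why the pattern is anchored on both sides, here is a small Python sketch that applies the same regular expression to some made-up strings, next to a split-based version for comparison. The function names and sample values are hypothetical, and the positions returned are 0-based, unlike R's 1-based row numbers.

```python
import re

# The code must be bounded by the string start or "&" on the left
# and by "&" or the string end on the right
PATTERN = re.compile(r"(^|&)1(&|$)")


def rows_with_code(values):
    """Return 0-based positions of strings that contain code 1."""
    return [i for i, v in enumerate(values) if PATTERN.search(v)]


def rows_with_code_split(values, code="1"):
    """Same result by splitting on "&" and testing membership."""
    return [i for i, v in enumerate(values) if code in v.split("&")]


# Hypothetical column values; "11&2" and "21" must not match
meds = ["1&27&32", "11&2", "3&1", "21", "1"]
print(rows_with_code(meds))
print(rows_with_code_split(meds))
```

A plain search for "1" would also hit "11&2" and "21", which is what the `(^|&)` and `(&|$)` parts guard against.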