Remove duplicates from one list only - compare

I have two lists as plain text. One list contains all the email addresses of customers I have mailed in the past. The other is a list of potential customers I want to mail.
I find this hard to explain in English, so hopefully you'll get it once you look at the example below.
LIST A: List of customers I already sent email to (addresses from this list should never be outputted):
michael#aaa.nl
michael#bbb.nl
michael#ccc.nl
michael#ddd.nl (shouldn't be outputted, even though it is not in LIST B)
LIST B: List of customers I want to mail:
michael#aaa.nl (duplicate, exists in LIST A)
michael#bbb.nl (duplicate, exists in LIST A)
martin#ccc.nl (duplicate, domain ccc.nl exists in LIST A)
michael#fff.nl (not duplicate, I want to output this)
Result I want:
michael#fff.nl
Is there a script / command (OS X or Linux) to get this to work? I hope you guys can help me out.
I tried things with uniq and diff, but I cannot get just the entries that appear only in list B.

I just googled "compare two lists online tool" and found this: http://www.listdiff.com/compare-2-lists-difference-tool
But it doesn't compare domains. That would require a custom script in Node.js, Python, or almost any other language. I don't think there is a ready-made script for this, but the comparison is possible in a few lines of code...
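For example, here is a minimal Python sketch of that idea. It is a sketch under assumptions, not a finished tool: the file names lista.txt and listb.txt are made up, and it treats # as the separator, as in the example above.

def domain(addr):
    # everything after the separator ('#' in the example; '@' in real addresses)
    return addr.split('#', 1)[-1]

# addresses already mailed (LIST A) and their domains
with open('lista.txt') as f:
    sent = {line.strip() for line in f if line.strip()}
sent_domains = {domain(a) for a in sent}

# print LIST B entries whose address and domain are both unseen
with open('listb.txt') as f:
    for line in f:
        addr = line.strip()
        if addr and addr not in sent and domain(addr) not in sent_domains:
            print(addr)

Saved as filter.py, running python3 filter.py > result.txt should leave michael#fff.nl as the only output for the example lists.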

Related

Google Sheets - List the Combinations of 2 Different Columns and Indicate How Many Times They Occur

Participants are paired together to accomplish a task.
I would like to have a list automatically made showing me how many times participants are paired with each other. This way, I get to pair the participants equally.
In the picture attached, you can see how I want the list generated at the far right. I was thinking "query" would work? But I'm not so familiar with how to do it.
The example below will show you the way:
1. Data are in A:B
2. Unique pairs are in D:E - code: =UNIQUE(A3:B)
3. The number of times they were together is in F. Code for F3 and below:
=COUNTA(QUERY(A2:B6,"select B where A='"&D3&"' and B='"&E3&"'",0))
Is that what you were trying to get?

How to report a list in BehaviorSpace NetLogo?

I am running a NetLogo model in BehaviorSpace, varying the number of runs each time. I have turtle-breed pigs, and they accumulate a table with patch-types as keys and number of visits to each patch-type as values.
In the end I calculate a list of the mean number of visits from all pigs. The list always has the same length, as long as the original table has the same number of keys (patch-types). I would like to export this mean number of visits per patch-type with BehaviorSpace.
Perhaps I could write a separate CSV file (I tried; it creates many files, so there is lots of work later putting them together), but I would rather have everything in the same output file after a run.
I could make a global variable for each patch-type, but this seems crude and wrong, especially if I load a different patch configuration.
I tried just exporting the list, but then in Excel I see it with brackets, e.g. [49 0 31.5 76 7 0].
So my question Q1: is there a proper way to export a list of values so that the BehaviorSpace table output CSV has a column for each value?
Q2: Or is there an example of how to output a single CSV that looks exactly the way I want from BehaviorSpace?
PS: In my case the patch-types are costs, and I might change those in the future and rerun everything. Ideally, I would like the output to be a graph of costs vs. frequency of visits.
Thanks
If the lists are a fixed length that doesn't vary from run to run, you can get the items into separate columns by using one metric for each item. So in your BehaviorSpace experiment definition, instead of putting mylist, put item 0 mylist and item 1 mylist and so on.
If the lists aren't always the same length, you're out of luck. BehaviorSpace isn't flexible that way. You would have to write a separate program (in the programming language of your choice, perhaps NetLogo itself, perhaps an Excel macro, perhaps something else) to postprocess the BehaviorSpace output and make it look how you want.
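If you do go the postprocessing route, here is a rough Python sketch of that step (the file names experiment-table.csv and experiment-wide.csv are made up): it reads the BehaviorSpace table output and splits any bracketed list cell such as [49 0 31.5 76 7 0] into one column per value.

import csv

# Widen bracketed NetLogo list cells into separate columns;
# all other cells pass through unchanged.
with open('experiment-table.csv', newline='') as src, \
        open('experiment-wide.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        expanded = []
        for cell in row:
            cell = cell.strip()
            if cell.startswith('[') and cell.endswith(']'):
                # '[49 0 31.5 76 7 0]' -> '49', '0', '31.5', ...
                expanded.extend(cell[1:-1].split())
            else:
                expanded.append(cell)
        writer.writerow(expanded)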

How to delete a URL in each string from a dataset

I have a dataset in which one column has the tweets and the other column has labels for the tweets. My problem is that I want the links present in the tweets to be removed, for example:
RT #AmDiabetesAssn: Know what’s scary? These #diabetes statistics. Spread awareness this November for #DiabetesMonth! http://t.co/qIiiSc4ozZ
Given a tweet like the one above, I want to remove the link (http://t.co/qIiiSc4ozZ) and get output like this, for all the strings:
RT #AmDiabetesAssn: Know what’s scary? These #diabetes statistics. Spread awareness this November for #DiabetesMonth!
I have seen many examples and tried those but couldn't get the desired result. Please help. Thanks in advance.
I tried this, which should work for any links that don't have spaces in them:
import re

for tweet in tweets:
    print(re.sub(r'http://\S+\s?', '', tweet))
I assume here that you've got a bunch of strings in the tweets array that represent the first column that you described above (also that you want them printed). You should be able to modify to suit the iteration pattern you're using.

Comparing two documents

I have two very large lists. They were both originally in Excel. The larger one is a list of about 160,000 emails with other information like name and address, and the smaller one is a list of just 18,000 emails.
My question is: what would be the easiest way to get rid of the 18,000 rows in the first document that contain the email addresses from the second?
I was thinking regex, or maybe there is another application I can use? I have tried searching online, but it seems like there isn't much specific to this. I also tried Notepad++, but it freezes when I try to compare these large files.
-Thank You in Advance!!
Good question. One way I would tackle this is making a C++ program [you could extrapolate the idea to the language of your choice; you never mentioned which languages you were proficient in] that reads each item of the smaller file into a vector of strings. First, of course, use Excel to save the files as CSV instead of XLS or XLSX, which comma-separates the values so you can work with them more easily. For the larger list, "Save As" a copy of just the email addresses, deleting the other columns for now.
Then, you could open the larger list and use a nested loop to check if you should output to an output file. Something like:
bool foundMatch = false;
for (size_t y = 0; y < LargeListVector.size(); y++) {
    for (size_t x = 0; x < SmallListVector.size(); x++) {
        if (SmallListVector[x] == LargeListVector[y]) foundMatch = true;
    }
    if (!foundMatch) OutputVector.push_back(LargeListVector[y]);
    foundMatch = false;
}
That might be partially pseudo-code, but do you get the idea?
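For what it's worth, the nested loop makes 160,000 × 18,000 (roughly 2.9 billion) comparisons, which will be slow; loading the small list into a hash set needs only one lookup per row. Here is the same idea sketched in Python, assuming (these are assumptions, adjust as needed) the CSV exports are named small.csv and large.csv and the email address sits in the first column of each:

import csv

# Load the 18,000 addresses to exclude into a set for O(1) lookups.
with open('small.csv', newline='') as f:
    exclude = {row[0].strip().lower() for row in csv.reader(f) if row}

# Keep only the rows of the large file whose email isn't in the set.
with open('large.csv', newline='') as src, \
        open('filtered.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row and row[0].strip().lower() not in exclude:
            writer.writerow(row)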
So I read a forum post at: Here
=MATCH(B1,$A$1:$A$3,0)>0
Column B is the large list with the 160,000 inputs, and column A is my list of the 18,000 I needed to delete.
I pasted this formula in a separate column to match everything. It prints either an error or TRUE; if the value is in both columns, it prints TRUE.
Then, because I suck with Excel, I threw the text into Notepad++ and searched for all lines that contained TRUE (match case, because in my case some of the data had the word "true" in it without caps). I marked those lines, then under Search, Bookmarks, I removed all bookmarked lines. Pasted that back into Excel and voila.
I would like to thank you guys for helping and pointing me in the right direction :)

Search a list of terms from this website, and do not stop even if any one of the terms is missing

I am trying to use the RCurl package to get data from the GeneCards database:
http://www-bimas.cit.nih.gov/cards//
I read a wonderful solution in a previously posted question:
How can I use R (Rcurl/XML packages ?!) to scrape this webpage?
However, my problem is different enough that I need further support from the experts. Instead of extracting all the links from the webpage, I have a list of ~1000 genes in mind. They are in the form of gene symbols (some of the gene symbols can be found on the webpage, some of them are new to the database). Here is part of my list of genes:
TP53
SOD1
EGFR
C2d
AKT2
NFKB1
C2d is not in the database, so when I do the search manually I see:
"Sorry, there is no GeneCard for C2d".
When I use the solution posted in the previous question for my analysis:
How can I use R (Rcurl/XML packages ?!) to scrape this webpage?
(1) I first read in the list.
(2) I then use the get_structs function from the previous solution to substitute each gene symbol in the list into the following URL:
http://www-bimas.cit.nih.gov/cgi-bin/cards/carddisp.pl?gene=genesymbol
(3) I scrape the information I need for each gene in the list, using the get_data_url function from the previous message.
It works for TP53, SOD1, and EGFR, but when the search comes to C2d, the process stops.
As I have ~1000 genes, I am sure some of them are missing from the webpage.
How can I automatically get a modified gene list telling me which of the ~1000 genes are missing, so that I can use the same approach as in the previous question to get all the data I need, based on a new gene list of only the genes EXISTING on the webpage?
Or is there any way to ask R to skip the missing items and continue scraping until the end of the list, but mark those missing items in the final results?
To facilitate the discussion, I have made a pseudo input file, using the scripts from the previous question, for the same webpage they used:
u <- c("Aero_pern", "Ppate", "didnotexist", "Sbico")
library(RCurl)
base_url <- "http://gtrnadb.ucsc.edu/"
base_html <- getURLContent(base_url)[[1]]
links <- strsplit(base_html, "a href=")[[1]]
get_structs <- function(u) {
  struct_url <- paste(base_url, u, "/", u, "-structs.html", sep = "")
  raw_data <- getURLContent(struct_url)
  s_split1 <- strsplit(raw_data, "<PRE>")[[1]]
  all_data <- s_split1[seq(3, length(s_split1))]
  # parse_genomes() is defined in the linked question
  data_list <- lapply(all_data, parse_genomes)
  for (d in 1:length(data_list)) {
    data_list[[d]] <- append(data_list[[d]], u)
  }
  return(data_list)
}
I guess the problem can be solved by modifying the get_structs script above, or the ifelse function may help, but I cannot figure out how to modify it further. Please comment.
You can enclose your function call inside try() so that the process won't break if you get errors. This lets you loop over the problematic cases; it returns an error message instead of breaking your process. For example:
dat <- list()
for (i in seq_along(u)) {
  dat[[i]] <- try(get_structs(u[i]))
}
# failed lookups have class "try-error", so the missing genes
# can be flagged afterwards:
failed <- sapply(dat, inherits, what = "try-error")
u[failed]