Comparing two documents - regex

I have two very large lists. Both were originally in Excel. The larger one is a list of about 160,000 emails with other information such as name and address. The smaller one is a list of just 18,000 emails.
My question is: what would be the easiest way to get rid of all 18,000 rows in the first document that contain the email addresses from the second?
I was thinking regex, or maybe there is another application I can use? I have tried searching online, but there doesn't seem to be much specific to this. I also tried Notepad++, but it freezes when I try to compare these large files.
Thank you in advance!

Good question. One way I would tackle this is to write a C++ program (you could carry the idea over to the language of your choice; you never mentioned which languages you are proficient in) that reads each item of the smaller file into a vector of strings. First, of course, use Excel to save the files as CSV instead of XLS or XLSX, which will comma-separate the values so you can work with them more easily. For the larger list, "Save As" a copy of just the email addresses, deleting the other columns for now.
Then you could open the larger list and use a nested loop to decide whether each entry should be written to an output file. Something like:
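// Copy every entry of the large list that does not appear in the small list.
for (std::size_t y = 0; y < LargeListVector.size(); y++) {
    bool foundMatch = false;
    for (std::size_t x = 0; x < SmallListVector.size(); x++) {
        if (SmallListVector[x] == LargeListVector[y]) {
            foundMatch = true;
            break;  // no need to keep scanning once a match is found
        }
    }
    // std::vector has no append(); push_back() adds the element at the end.
    if (!foundMatch) OutputVector.push_back(LargeListVector[y]);
}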
That might be partially pseudo-code, but do you get the idea?
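The nested loop above makes up to 160,000 x 18,000 (about 2.9 billion) string comparisons. If you'd rather keep the whole rows and avoid the inner loop, here is a rough Python sketch of the same idea using a set for the lookup. It assumes the email sits in the first column of both CSV exports and that matching should ignore case and surrounding spaces; the function name and the sample data are made up for illustration.

```python
import csv
import io


def remove_listed_rows(large_csv, small_csv, email_column=0):
    """Return the rows of large_csv whose email is not listed in small_csv."""
    # Build a set of normalized emails to remove; set lookups are O(1) on average.
    to_remove = {
        row[0].strip().lower()
        for row in csv.reader(small_csv)
        if row and row[0].strip()
    }
    kept = []
    for row in csv.reader(large_csv):
        if not row:
            continue  # skip blank lines
        if row[email_column].strip().lower() not in to_remove:
            kept.append(row)
    return kept


# Tiny in-memory stand-ins for the two exported CSV files (hypothetical data).
large = io.StringIO(
    "alice@example.com,Alice,1 Main St\n"
    "bob@example.com,Bob,2 Oak Ave\n"
    "carol@example.com,Carol,3 Pine Rd\n"
)
small = io.StringIO("BOB@example.com\n")

result = remove_listed_rows(large, small)
print(result)
```

With real files you would pass open file objects instead of the StringIO stand-ins and write the kept rows back out with csv.writer.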

So I read a forum post that suggested this formula:
=MATCH(B1,$A$1:$A$3,0)>0
Column B was the large list with the 160,000 entries, and column A was my list of the 18,000 entries I needed to delete (the $A$1:$A$3 range needs to be extended to cover all of them).
I pasted this formula into a separate column and used it to match everything. It printed either an error or TRUE; if the value was in both columns, it printed TRUE.
Then, because I suck with Excel, I threw this text into Notepad++ and searched for all lines that contained TRUE (with match case on, because some of my data had the word true in it without caps). I marked those lines, then under Search > Bookmark I removed all the bookmarked lines. Pasted that back into Excel and voila.
I would like to thank you guys for helping and pointing me in the right direction :)

Related

Excel Alternative to nested IF

I have a couple of rather large nested if functions in my spreadsheet. It sure would be nice to have an alternative method. Problem is I'm using a wildcard (*) in my lookup because the source text has slight variations (date for example).
For example, if my list of data contains:
VENMO PAYMENT 220828 1022093447487 BRENDA HOSPY
VENMO PAYMENT 220813 1031323447487 BRENDA HOSPY
I want these to show in an adjacent column of cells as just Venmo.
Currently my IF function in that second column of cells is:
=IF(COUNTIF($F10,"*APPLE.COM/BILL*"),"AP",
IF(COUNTIF($F10,"IIA VOYA*"),"VOYA",
IF(COUNTIF($F10,"VENMO PAYMENT*"),"Venmo",
IF(COUNTIF($F10,etc...
This works fine but quickly gets unruly as more things get added.
I've spent a great deal of time searching for functions and processes that would make this easier, or at least more compact, but I can't find a way with typical functions like vlookup or index/match.
If I've explained this in a comprehensible fashion perhaps you've seen or experienced a similar situation and could offer a suggestion. It would be appreciated!
I'm not opposed to using a programming function.
I've looked at, and for, various Excel functions or combinations with no luck on my own or online.
I have created a structure with the source text in column A, the strings to search for in E2:E9 and the labels to return in F2:F9.
The formula in B2 is as below:
=IFERROR(INDEX($F$2:$F$9,MIN(IF(COUNTIF(A2,"*"&$E$2:$E$9&"*")>0,ROW($E$2:$E$9),9999999)-1)),"---")
Enter it as an array formula using Ctrl+Shift+Enter.
It searches A2 for each of the strings in column E and returns the row numbers of column E where there is a match. I then use MIN to get the first one; if nothing is found, it returns 9999999. Because the data starts in row 2, I subtract 1 to turn the row number into an index into the data. INDEX then looks up the value at that index in column F, and IFERROR shows --- where no match was found and 9999999 was returned.
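Since the question says a programming function would be acceptable, the same "first matching keyword wins" lookup can be sketched in Python. This is only an illustration: the RULES table and the categorize name are hypothetical, and the three entries are taken from the nested IF in the question.

```python
# Hypothetical lookup table: (substring to search for, label to return).
# Order matters: the first matching entry wins, like MIN(ROW(...)) in the formula.
RULES = [
    ("APPLE.COM/BILL", "AP"),
    ("IIA VOYA", "VOYA"),
    ("VENMO PAYMENT", "Venmo"),
]


def categorize(description, rules=RULES, default="---"):
    """Return the label of the first rule whose substring occurs in description."""
    text = description.upper()  # COUNTIF wildcard matching ignores case
    for needle, label in rules:
        if needle.upper() in text:
            return label
    return default


print(categorize("VENMO PAYMENT 220828 1022093447487 BRENDA HOSPY"))
print(categorize("GROCERY STORE 1234"))
```

Adding a new category is then one more line in the table rather than another level of nesting.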

How can I resolve INDEX MATCH errors caused by discrepancies in the spelling of names across multiple data sources?

I've set up a Google Sheets workbook that synthesizes data from a few different sources via manual input, IMPORTHTML and IMPORTRANGE. Once the data is populated, I'm using INDEX MATCH to filter and compare the information and to RANK each data set.
Since I have multiple data inputs, I'm running into a persistent issue of names not being written exactly the same between sources, even though they're the same person. First names are the primary culprit (e.g. Mary Lou vs Marylou vs Mary-Lou vs Mary Louise), but some last names with special symbols (umlauts, accents, tildes) are also causing errors. When Sheets can't recognize a match, the INDEX MATCH and RANK functions both break down.
I'm wondering how to better unify the data automatically so my Sheet understands that each occurrence is actually the same person (or "value").
Since you can't edit the results of an IMPORTHTML directly, I've set up "helper columns" and used functions like TRIM and SPLIT to try and fix instances as I go, but it seems like there must be a simpler path.
It feels like IFS could work but I can't figure how to integrate it. Also thinking this may require a script, which I'm just beginning to study.
Here's a simplified example of what I'm trying to achieve and the corresponding errors: Sample Spreadsheet
The first tab is attempting to pull and RANK data from tabs 2 and 3. Sample formulas from the Summary tab, row 3 (Amelia Rose):
Cell B3: =INDEX('Q1 Sales'!B:B, MATCH(A3,'Q1 Sales'!A:A,0))
Cell C3: =RANK(B3,$B$2:B,1)
Cell D3: =INDEX('Q2 Sales'!B:B, MATCH(A3,'Q2 Sales'!A:A,0))
Cell E3: =RANK(D3,$D$2:D,1)
I'd be grateful for any insight on how to best index 'Q2 Sales'!B3 as the correct value for 'Summary'!D3. Thanks in advance - the thoughtful answers on Stack Overflow have gotten me this far!
To cope with differences in case, hyphens and spacing (and longer variants such as Mary Louise), do it like this:
=ARRAYFORMULA(IFERROR(VLOOKUP(LOWER(REGEXREPLACE(A2:A, "-|\s", )),
{REGEXEXTRACT(LOWER(REGEXREPLACE('Q2 Sales'!A2:A, "-|\s", )),
TEXTJOIN("|", 1, LOWER(REGEXREPLACE(A2:A, "-|\s", )))), 'Q2 Sales'!B2:B}, 2, 0)))
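Since the question mentions a script might be needed, here is a rough Python sketch of what the formula does: normalize both name lists (lowercase, no hyphens or whitespace), then look for each summary name inside the normalized source names. The strip_accents option is an extra step that is not in the formula; it is added here only as an assumption about how the umlaut and accent cases could be handled. All function names and sample values are hypothetical.

```python
import re
import unicodedata


def normalize(name, strip_accents=False):
    """Lowercase a name and drop hyphens and whitespace, like the REGEXREPLACE step."""
    if strip_accents:
        # Extra step not in the formula: decompose and drop combining marks.
        name = unicodedata.normalize("NFKD", name)
        name = "".join(ch for ch in name if not unicodedata.combining(ch))
    return re.sub(r"-|\s", "", name).lower()


def lookup_by_name(summary_names, source_rows, strip_accents=False):
    """Map each summary name to the value of the first source row that matches it."""
    keys = [k for k in (normalize(n, strip_accents) for n in summary_names) if k]
    if not keys:
        return {n: None for n in summary_names}
    # Same idea as TEXTJOIN("|", ...): one alternation of all normalized names.
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    found = {}
    for source_name, value in source_rows:
        match = pattern.search(normalize(source_name, strip_accents))
        if match:
            found.setdefault(match.group(0), value)  # keep the first hit, like VLOOKUP
    return {n: found.get(normalize(n, strip_accents)) for n in summary_names}


# Hypothetical sample data standing in for the Summary and 'Q2 Sales' tabs.
summary = ["Mary Lou", "José García"]
q2 = [("Mary-Louise", 120), ("Jose Garcia", 95)]

result_plain = lookup_by_name(summary, q2)
result_folded = lookup_by_name(summary, q2, strip_accents=True)
print(result_plain)
print(result_folded)
```

One caveat with this kind of substring matching, in the formula as well as in the sketch: a short name can match inside a longer unrelated one, so it is worth spot-checking the results.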

Merging cells in openoffice deleting specific whitespace

Ok, so what I have is 3 cells of data that I need to merge into a link to pictures in my store. What I am looking for is an easy way to do this without double-clicking and Ctrl+V pasting 4x at 100+ lines per page...
Cell 1: Assets/
Cell 2: name
Cell 3: .jpg
Needs to be: assets/name.jpg
This seems simple, but the problem is that most of the names are 2 words, and even the single-word names come out wrong when merged. The result looks like this: assets/ name name .jpg
That gives me a space after the / and a space after the second name. If the "name" I am merging has 2 or more parts, I still need to keep the spaces between them intact or the link will not work the way it's set up currently. I may need to rename the pictures into 1 solid word just for linking purposes, but I'm hoping to avoid an extra step.
Is there a way to merge and remove the spaces I need gone to create the link? I have done a couple pages the hard way, not fun when I have 200+ pages to do.
Any help is appreciated.
Thank you.
Jerry
It seems to me possible that an answer to a completely different Q may be of interest to you:
=TRIM(LOWER(A1))&TRIM(A2)&TRIM(A3)
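For comparison, the same trim-then-concatenate idea can be sketched in Python. The trim helper only approximates the spreadsheet TRIM function (it also treats tabs as spaces), and both function names are made up for illustration.

```python
def trim(text):
    """Approximate spreadsheet TRIM: drop outer spaces, collapse inner runs to one."""
    return " ".join(text.split())


def build_image_path(folder, name, extension):
    """Join three cell values into one path, keeping single spaces inside the name."""
    return trim(folder).lower() + trim(name) + trim(extension)


print(build_image_path("Assets/ ", " name name ", " .jpg"))
```

Each part is trimmed on its own before joining, so the spaces between the words of a two-part name survive while the spaces at the joins disappear.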

what are sufficient approaches for manipulating a list of thousands of individual words

I'm trying to design an application that manipulates a list of thousands of individual words stored in a txt file, for the following tasks:
1- Randomly picking some words.
2- Checking whether words entered by the user are actually in the list.
3- Retrieving the entire list from a txt file and storing it temporarily for subsequent manipulation.
I'm not asking for an implementation or for pseudocode. I'm looking for a suitable approach to dealing with a massive list of words. For the time being I might go with a vector of strings; however, searching thousands of words will take some time. Of course there must be strategies for coping with this kind of task, but since my background is not Computer Science, I don't know which direction to go in. Any suggestions are welcome.
A vector of strings is fine for this problem. Just sort them, and then you can use binary search to find a string in the list.
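The question did not ask for code, but a short sketch may make the sort-and-binary-search suggestion concrete. This one is in Python rather than C++, with a sorted list standing in for the sorted vector; the function names and the sample words are hypothetical.

```python
import bisect
import random


def load_words(lines):
    """Return a sorted list of the unique, non-empty words in an iterable of lines."""
    return sorted({line.strip() for line in lines if line.strip()})


def contains(sorted_words, word):
    """Binary search: O(log n) comparisons instead of scanning the whole list."""
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


# In practice the lines would come from an open text file; this list is a stand-in.
words = load_words(["pear\n", "apple\n", "fig\n", "\n", "apple\n"])
print(contains(words, "fig"), contains(words, "kiwi"))
print(random.sample(words, 2))  # two distinct words picked at random
```

For a few thousand words, the sort is a one-time cost at load time and each lookup afterwards takes only a dozen or so comparisons.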
Radix trees are a good solution for searching through word lists for matches. They reduce the storage space, but you'll need some custom code for getting and putting words in the list. And the text file won't necessarily be easy to read unless you build the tree anew each time you load from a text file. Here's an implementation I committed to GitHub (I can't even remember the source material at this point) that might be of assistance to you.
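To give a feel for the tree idea, here is a minimal Python sketch of a plain (uncompressed) trie. It is not the linked implementation and not a true radix tree: a radix tree would additionally merge chains of single-child nodes into one edge, which is where the space saving comes from. The class and method names are chosen just for this example.

```python
class Trie:
    """Plain (uncompressed) trie; a radix tree would merge single-child chains."""

    def __init__(self):
        self.children = {}    # maps one character to the child node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, text):
        """Follow text character by character; return the final node or None."""
        node = self
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        """Return True if any stored word begins with prefix."""
        return self._walk(prefix) is not None


trie = Trie()
for w in ["car", "card", "care"]:
    trie.insert(w)
print(trie.contains("car"), trie.contains("ca"), trie.starts_with("ca"))
```

Lookup time depends on the length of the word rather than on the number of words stored, and prefix queries come for free.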

yahoo pipes trimming all item titles

After a lot of hard work, I have created two Yahoo Pipes I will be using.
One of them has a minor problem, however. I am trimming the title length down to leave enough room for a ... and a link to fit within a tweet.
It trims the first post correctly, but it trims all of the posts after that to 0 length (before adding a bit of extra text to the end).
The problem is that I'm not using a loop for all items after a certain point. The reason is that the output of a loop is always items, and I need the output to be a number at a certain point so that I can feed that number in as a variable to trim the length by. The pipe can be found here: http://pipes.yahoo.com/pipes/pipe.info?_id=3e6c3c6b2d23d8ce0cf66cb3efc5fb56
Typically, I am inserting any RSS feed in the top box, something like "new blog post:" in the middle and "#bussiness #hashtags" in the last box.
If you can see any way I can have this Yahoo Pipe work for all posts rather than just the top one, please let me know. It's not a big deal, as for the moment I'm only ever posting the top post to Twitter. However, there may come a point where I need all of them looking the same.
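For reference, the per-item logic I am after can be written out in Python. This is not a Pipes solution, only a sketch of the intended behavior: the room left for the title is worked out from the fixed parts and then applied to every item. The 140-character limit and the link length of 23 are assumptions, and the function name is made up.

```python
def fit_title(title, prefix, suffix, link_length, limit=140):
    """Trim title to the room left after the prefix, suffix, link and separators."""
    ellipsis = "..."
    # Fixed characters: prefix, suffix, link and the three separating spaces.
    fixed = len(prefix) + len(suffix) + link_length + 3
    room = max(limit - fixed, 0)
    if len(title) <= room:
        return title  # short titles are left untouched
    return title[: max(room - len(ellipsis), 0)].rstrip() + ellipsis


prefix = "new blog post:"
suffix = "#bussiness #hashtags"
titles = ["Short title", "A very long title " * 10]

# The same trimming is applied to every item, not just the first one.
trimmed = [fit_title(t, prefix, suffix, link_length=23) for t in titles]
for t in trimmed:
    print(len(t), t)
```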