Using RegEx in SSIS

I currently have a package pulling data from an Excel file, but when pulling the data out I get rows I do not want. I need to pick out every row whose 'ID' field contains any sort of letter.
I need to be able to run a RegEx pattern such as "%[a-zA-Z]%" to pull out that data, but the current limitations of the Conditional Split won't let me do that. Any ideas on how this can be done?

At the core of the logic, you would use a Script Transformation, as that's the only place you can access regular expressions.
You could simply add a second column to your data flow, IDCleaned, and that column would only contain cleaned values or NULL. You could then use the Conditional Split to filter good rows from bad. See also: System.Text.RegularExpressions.Regex.Replace error in C# for SSIS
If you don't want to add another column, you can set your current ID column to ReadWrite for the Script and update it in place. Perhaps adding a boolean column would make the Conditional Split logic easier at this point.
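For reference, the core test is just a regex match for any letter. Here is a minimal sketch of that logic in Python for illustration; inside the SSIS Script Component you would express the same check in C# with System.Text.RegularExpressions, but the pattern is identical:
    import re

    # Same idea as the "%[a-zA-Z]%" expression: match any letter anywhere in the ID.
    LETTER = re.compile(r"[a-zA-Z]")

    def has_letter(raw_id):
        """Return True when the ID contains a letter and should be routed out."""
        return raw_id is not None and LETTER.search(str(raw_id)) is not None

    # Hypothetical sample of the ID column.
    ids = ["1234", "AB123", "99X", None]
    print([has_letter(i) for i in ids])  # [False, True, True, False]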

Related

Scenarios possible in Apache NiFi

I am trying to understand Apache NiFi inside and out, keeping files in HDFS, and I have various scenarios to work through. Please let me know the feasibility of each, with explanations. I am adding my current understanding to each scenario.
Can we check whether a null value is present within a single column? I have checked different processors and found a notNull property, but I think this works on file names, not on columns present within the file.
Can we drop a column from a file in HDFS using NiFi transformations?
Can we change column values, i.e. replace one text with another? I have checked the ReplaceText processor for this.
Can we delete a row from a file in the file system?
Please suggest the possibilities and how to achieve the goal.
Try with this:
1. Can we check whether a null value is present within a single column?
Yes. Using the ReplaceText processor you can check and replace the value if you want to replace it, or use the RouteOnAttribute processor if you want to route based on a null-value condition.
2. Can we drop a column present in HDFS using NiFi transformations?
Yes. Using the same ReplaceText processor you can keep just the desired fields with a delimiter. For example, I needed only a current-date field and some mandatory fields in my data, comma separated, so I provided the replacement value as
"${'userID'}","${'appID'}","${sitename}","${now():format("yyyy-MM-dd")}"
3. To change a column value, likewise use the ReplaceText processor.
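To make the replacement value above concrete, here is a small Python sketch of what it evaluates to for one flowfile; the attribute names and values are hypothetical, and in NiFi they would come from the flowfile's attributes via the Expression Language:
    from datetime import date

    # Hypothetical flowfile attributes.
    attrs = {"userID": "u42", "appID": "app7", "sitename": "example.com"}

    # What the ReplaceText replacement value above would produce for this flowfile.
    replacement = '"{userID}","{appID}","{sitename}","{today}"'.format(
        today=date.today().strftime("%Y-%m-%d"), **attrs
    )
    print(replacement)  # "u42","app7","example.com","<today's date>"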

Power Query: replacing a cell's content with the previous one under conditions

On a regular basis, I have to clean a CSV file.
I'd like to automate this task, which is what I'm trying to achieve with Power Query.
But I'm stuck on one step: in this file, some dates are wrongly filled (always with the same value, so I can easily identify them), and I'd like to replace them with the previous row's value.
Is there an M language function allowing this kind of operation?
Thanks!
Right now, Table.FillDown only fills in null values. Maybe you can first replace those wrong values with null, then call Table.FillDown on those columns?
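The same replace-then-fill-down idea, sketched in Python with pandas purely for illustration (in Power Query this would be something like Table.ReplaceValue followed by Table.FillDown; the sentinel date below is an assumption):
    import pandas as pd

    # Hypothetical column where "1900-01-01" marks the wrongly filled dates.
    df = pd.DataFrame({"Date": ["2023-01-02", "1900-01-01", "2023-01-04", "1900-01-01"]})

    # Step 1: turn the known-bad value into a missing value.
    df["Date"] = df["Date"].replace("1900-01-01", pd.NA)

    # Step 2: fill each missing value from the previous row, like Table.FillDown.
    df["Date"] = df["Date"].ffill()

    print(df["Date"].tolist())  # ['2023-01-02', '2023-01-02', '2023-01-04', '2023-01-04']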

SP 2013 - Quick Edit with Managed Metadata columns, copy and paste from Excel

I'm trying to migrate metadata from an Excel spreadsheet to a SP 2013 document library. The columns are managed metadata columns with pre-defined terms matching the data in the Excel spreadsheet.
However, I cannot copy and paste data from Excel via Quick Edit in the document library without getting the following error: "The data returned from the tagging UI was not formatted correctly".
This happens even when I remove all formatting or paste into Notepad first.
Are there any simple solutions to this issue?
http://i.imgur.com/1bqpMPA.jpg
Thanks,
Any metadata fields are in fact foreign keys, as it were, into a dynamic, hidden table (or 'list', whatever you want to call it) within SharePoint. To paste a value into a metadata column, you need to know your element's GUID (as in, within the term set) and then append it to each metadata element you're pasting in, as a <name>|<guid> pair.
Getting the GUID for an element within your term set
Browse to [site-root]/TaxonomyHiddenList/AllItems.aspx and create a new view (or edit the default one) to display the field 'IdForTerm'.
Where you have a term 'apple', your IdForTerm may look like '1288beaf-82e0-4d81-b9de-ad5ad8382938'. Take note of the GUID for each term that appears within your input data.
Edit your input to correctly reference each term
Let's say you're importing your data from an Excel spreadsheet. Or from a CSV. It doesn't really matter. What you need to do is, basically, a find and replace down each managed metadata column, replacing 'term' with 'term|guid'. So our example from earlier, with the apple, would become 'apple|1288beaf-82e0-4d81-b9de-ad5ad8382938'.
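A small Python sketch of that find-and-replace step, assuming a CSV export and a hypothetical managed metadata column called 'Fruit'; the term-to-GUID map would come from the TaxonomyHiddenList view described above:
    import csv

    # Hypothetical mapping harvested from TaxonomyHiddenList (term -> IdForTerm GUID).
    TERM_GUIDS = {
        "apple": "1288beaf-82e0-4d81-b9de-ad5ad8382938",
    }

    def to_term_guid(value):
        """Rewrite a plain term as the 'term|guid' pair that Quick Edit expects."""
        return f"{value}|{TERM_GUIDS[value]}" if value in TERM_GUIDS else value

    # Rewrite the metadata column in a CSV export (file names are placeholders).
    with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["Fruit"] = to_term_guid(row["Fruit"])  # hypothetical MMS column
            writer.writerow(row)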
Finally, assuming your view is set up in exactly the same order as your input data, you should be able to 'edit list' from within the browser, hit the leftmost side of your first input row (to select the entire row) and CTRL+V all of your data at the same time.
Note that there appears to be a limit to the number of entries you can paste at the same time; it seems to sit at around 5,000 elements.
Adding on to #rmacd's answer, you can also get the GUID for a given MMS term by first manually entering the value(s) you need in a Quick Edit cell, then copy and paste the same value(s) from SharePoint to Excel. The pasted value will appear with the full term|guid that you need to complete the bulk copy/paste.

Efficiently processing all data in a Cassandra Column Family with a MapReduce job

I want to process all of the data in a column family in a MapReduce job. Ordering is not important.
One approach is to iterate over all the row keys of the column family and use them as the input. This could potentially be a bottleneck and could be replaced with a parallel method.
I'm open to other suggestions, or for someone to tell me I'm wasting my time with this idea. I'm currently investigating the following:
A potentially more efficient way is to assign ranges to the input instead of iterating over all row keys (before the mapper starts). Since I am using RandomPartitioner, is there a way to specify a range to query based on the MD5?
For example, I want to split the task into 16 jobs. Since RandomPartitioner is MD5 based (from what I have read), I'd like to query everything starting with 'a' for the first range. In other words, how would I do a get_range on the MD5 that starts at 'a' and ends before 'b', e.g. a0000000000000000000000000000000 - afffffffffffffffffffffffffffffff?
I'm using the pycassa API (Python) but I'm happy to see Java examples.
I'd cheat a little:
Create new rows job_(n), with each column representing a row key in the range you want.
Pull all of the columns from that specific row to indicate which rows you should pull from the CF.
I do this with users. Users from a particular country get a column in the country-specific row. Users of a particular age are also added to a specific row.
This allows me to quickly pull the rows I need based on the criteria I want, and it's a little more efficient than pulling everything.
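A rough pycassa sketch of that pattern, with hypothetical keyspace, column family, and row names; the index row stores the target row keys as column names:
    import pycassa

    # Hypothetical keyspace and column families.
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    data_cf = pycassa.ColumnFamily(pool, 'Users')
    index_cf = pycassa.ColumnFamily(pool, 'JobIndex')

    # Each job_<n> row holds the keys of the data rows that job should process,
    # stored as column names (the column values can be empty placeholders).
    keys_for_job = list(index_cf.get('job_0', column_count=10000).keys())

    # Pull only those rows from the data column family and hand them to the job.
    for key, columns in data_cf.multiget(keys_for_job).items():
        print(key, dict(columns))  # hand these off to your MR job / mapper logic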
This is how the Mahout CassandraDataModel example functions:
https://github.com/apache/mahout/blob/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
Once you have the data and can pull the rows you are interested in, you can hand it off to your MR job(s).
Alternatively, if speed isn't an issue, look into using Pig: How to use Cassandra's Map Reduce with or w/o Pig?

Read CSV using JMeter (starting from x)

I'm writing a JMeter script and I have a huge CSV file with a bunch of data which I use in my requests. Is it possible to start not from the first entry but from the 5th or nth entry?
Looking at the CSVDataSet, it doesn't seem to directly support skipping to a given row. However, you can emulate the same effect by first executing N loops where you just read from the data set and do nothing with the data. This is then followed by a loop containing your actual tests. It's been a while since I've used JMeter - for this approach to work, you must share the same CSVDataSet between both loops.
If that's not possible, then there is an alternative. In your main test loop, use a Counter and an If Controller. The Counter counts up from 1. The If Controller contains your tests, with the condition ${Counter}>N, where N is the number to skip. ("Counter" in the expression is whatever you set the "Reference Name" property to in the Counter.)
mdma's second idea is a clean way to do it, but here are two other options that are simple but annoying to do:
Easiest:
Create separate CSV files for where you want to start the file, deleting the rows you don't need. I would create a separate CSV data config element for each CSV file, and then just disable the ones you don't want to run.
Less Easy:
Create a new column in your CSV file called "ignore". In the rows you want to skip, enter the value "True". In your test plan, create an If Controller that is the parent of your requests. Make the If condition: "${ignore}"!="True" (include the quotes and note that 'True' is case sensitive). This will skip the requests if the 'ignore' column has a value of 'True'.
Both methods require modifying the CSV file, but method two has other applications (like excluding a header row) and can be fast if you're using Open Office, Excel, etc.
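If you'd rather not edit the file by hand, either preparation can be scripted. A minimal Python sketch, assuming the CSV has a header row; the file names and skip count are placeholders:
    import csv

    SKIP = 5  # number of data rows to skip (placeholder)

    # Option 1 ("Easiest"): write a trimmed copy that starts at the (SKIP+1)-th data row.
    with open("data.csv", newline="") as src, open("data_trimmed.csv", "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        writer.writerow(next(reader))               # keep the header
        for i, row in enumerate(reader):
            if i >= SKIP:
                writer.writerow(row)

    # Option 2 ("Less Easy"): keep every row, but add the 'ignore' flag for the If Controller.
    with open("data.csv", newline="") as src, open("data_flagged.csv", "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        writer.writerow(next(reader) + ["ignore"])  # extend the header
        for i, row in enumerate(reader):
            writer.writerow(row + ["True" if i < SKIP else "False"])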