I'm trying to turn XML files into SQL statements within Eclipse, to save the manual work of converting large files.
A line such as:
<TABLE_NAME COL_1="value1" COL_2="value2"/>
should be converted to:
insert into TABLE_NAME (COL_1, COL_2) values ("value1", "value2");
So far, I have managed to match and capture the table name and the first column/value pair with:
<(\w+)( \w+)=(".+?").*/>
The .* near the end is just there to test the first part of the pattern and should be removed once it is complete.
Using the following replace pattern:
insert into $1 ($2) values ($3);
yields this result:
insert into TABLE_NAME ( COL_1) values ("value1");
The problem I'm having is that the number of columns is different for different tables, so I would like a generic pattern that would match n column/value pairs and repeatedly use the captured groups in the replace pattern. I haven't managed to understand how to do this yet although \G seems to be a good candidate.
Ideally, this would be solved in a single regexp statement, though I also wouldn't be against multiple statements that have to be run in sequence (though I really don't want to have to force the developer to execute once for every column/value pair).
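If a single replace pattern turns out not to be possible, I imagine the fallback would be a small helper class run over the file rather than the Eclipse find/replace dialog. A rough, untested sketch using Java's Pattern/Matcher (the class and method names are just made up for illustration):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlToInsert {

    // Captures the element name, then all attribute="value" pairs as one blob.
    private static final Pattern ELEMENT =
            Pattern.compile("<(\\w+)((?:\\s+\\w+=\".*?\")*)\\s*/>");
    // Picks the pairs out of that blob one at a time.
    private static final Pattern ATTRIBUTE = Pattern.compile("(\\w+)=\"(.*?)\"");

    public static String toInsert(String line) {
        Matcher element = ELEMENT.matcher(line);
        if (!element.find()) {
            return null; // not a single-element line, leave it alone
        }
        StringBuilder cols = new StringBuilder();
        StringBuilder vals = new StringBuilder();
        Matcher attribute = ATTRIBUTE.matcher(element.group(2));
        while (attribute.find()) {
            if (cols.length() > 0) {
                cols.append(", ");
                vals.append(", ");
            }
            cols.append(attribute.group(1));
            vals.append("\"").append(attribute.group(2)).append("\"");
        }
        return "insert into " + element.group(1) + " (" + cols + ") values (" + vals + ");";
    }
}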
Anyone have any ideas?
In the end I did not solve this with regular expressions as originally envisaged. Taking alternative approaches into account, it was solved by refactoring test code to make it reusable and then co-opting dbunit to use tests to write data to the different stages of the development databases.
I know that this is a misuse of tests, but it makes the whole thing much simpler, since the user gets green/red feedback on whether the data was inserted, and the same test data can be inserted for manual acceptance tests as is used for the automated component/integration tests (which used dbunit with an in-memory database). As long as these fake test classes are not added to the test suite, they are only executed by hand.
The resulting fake tests ended up being extremely simple and lightweight, for instance:
public class MyComponentTestDataInserter extends DataInserter {

    @Override
    protected String getDeleteSql() {
        return "./input/component/delete.sql";
    }

    @Override
    protected String getInsertXml() {
        return "./input/component/TestData.xml";
    }

    @Test
    public void insertIntoDevStage() {
        insertData(Environment.DEV);
    }

    @Test
    public void insertIntoTestStage() {
        insertData(Environment.TEST);
    }
}
This had the additional advantage that a developer can insert the data into a single environment by simply executing a single test method via the context menu in the IDE. At the same time, by running the whole test class, the exact same data can be deployed to all environments simultaneously.
Also, a failure at any point during data cleanup or insertion causes the whole transaction to be rolled back, preventing an inconsistent state.
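For reference, the DataInserter base class boils down to something like the sketch below. The dbunit calls are the standard flat-XML ones, but the Environment accessors and the delete-script handling shown here are assumptions, so treat it as an outline rather than the actual implementation:

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.dbunit.database.DatabaseConnection;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import org.dbunit.operation.DatabaseOperation;

public abstract class DataInserter {

    // Path to a hand-written SQL script that clears the affected tables.
    protected abstract String getDeleteSql();

    // Path to the flat-XML test data (same format as the automated dbunit tests).
    protected abstract String getInsertXml();

    protected void insertData(Environment env) throws Exception {
        // Environment.getJdbcUrl()/getUser()/getPassword() are assumed accessors.
        try (Connection jdbc = DriverManager.getConnection(
                env.getJdbcUrl(), env.getUser(), env.getPassword())) {
            jdbc.setAutoCommit(false);
            try {
                // Clean up first using the delete script...
                String deleteScript = new String(Files.readAllBytes(Paths.get(getDeleteSql())));
                try (Statement statement = jdbc.createStatement()) {
                    for (String sql : deleteScript.split(";")) {
                        if (!sql.trim().isEmpty()) {
                            statement.execute(sql);
                        }
                    }
                }
                // ...then let dbunit insert the flat-XML data set.
                IDatabaseConnection connection = new DatabaseConnection(jdbc);
                IDataSet dataSet = new FlatXmlDataSetBuilder().build(new File(getInsertXml()));
                DatabaseOperation.INSERT.execute(connection, dataSet);
                jdbc.commit();
            } catch (Exception e) {
                jdbc.rollback(); // any failure during cleanup or insertion rolls everything back
                throw e;
            }
        }
    }
}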
After going through the documentation of utPLSQL 3.0.2, I couldn't find any references to the assertion API that was available in the older versions. Please let me know whether there is an equivalent assertion, like utassert.eqtable, available in newer versions.
I have just recently gone through the same pain. Most utPLSQL examples out there are for utPLSQL v2. It appears that the assertions have been deprecated and replaced by "Expects". I found a great blog post by Jacek Gebal that describes this. I've tried to put this and other useful links on a page about how unit testing fits into Redgate's Oracle DevOps pipeline (I work for Redgate and we often get asked how best to implement automated unit testing for Oracle).
I don't think you can compare tables straight away, but you can compare cursors, which is quite flexible because you can, for instance, set up a cursor with test data based on a query against dual, and then check that against the actual data in the table, something like this:
procedure TestCursorExample is
    v_Expected sys_refcursor;
    v_Actual   sys_refcursor;
begin
    -- Arrange (nothing really to arrange, except setting the expectation).
    open v_Expected for
        select 'me@example.com' as Email
        from dual;

    -- Act
    SomeUpsertProc('me', 'me@example.com');

    -- Assert
    open v_Actual for
        select Email
        from Tbl_User
        where UserName = 'me';

    ut.expect(v_Actual).to_equal(v_Expected);
end;
Also, the example above works in Oracle 11, but if you're in 12c, apparently things got even easier, because you can use the table operator with locally defined types.
I've used a similar solution to verify that certain columns of a row were updated, while others were not. You can easily open a cursor on the original data, with some columns replaced by the new fixed values. Then do the update. Then open a cursor with the new actual data of all columns. You still have to write the queries, but it's much more compact than querying everything into variables and comparing those individually.
And, because you can open the 'expected' cursor before doing the actual 'act' step of the test, you can be sure that the query with 'expected' data is not affected by the test itself, and can even base that cursor on the data you are going to modify.
For comparing the data, the cursors are serialized to XML. This may have some side effects. In the test example above, my act step didn't actually do anything, so I got the difference below, showing the row count as well as the missing data.
If your cursors have more columns and multiple differences, it can sometimes take a few seconds to spot the differences between the XML tags. Also, there are currently some edge-case issues with this, I think because of how trimming works in XML.
1) testcursorexample
Actual: refcursor [ count = 0 ] was expected to equal: refcursor [ count = 1 ]
Diff:
Rows: [ 1 differences ]
Row No. 1 - Missing: <EMAIL>me@example.com</EMAIL>
at "MySchema.MyTestPackage", line 410 ut.expect(v_Actual).to_equal(v_Expected);
See also: 'comparing cursors' from utPLSQL 3 concepts
I am new to the Storm framework (https://storm.incubator.apache.org/about/integrates.html).
I have tested my code locally, and I think it would perform well if I removed stop words, but I have searched online and can't find any example of removing stop words in Storm.
If the size of the stop word list is small enough to fit in memory, the most straightforward approach would be to simply filter the tuples with an implementation of a Storm Filter that knows that list. This Filter could poll the DB every so often to get the latest list of stop words if the list evolves over time.
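A minimal sketch of that in-memory approach, assuming you're on Trident (which the stateQuery suggestion below also assumes) and that a hypothetical StopWordDao loads the list from your database; package names are from the 0.9.x storm-core line, so adjust to your version:

import java.util.Map;
import java.util.Set;

import storm.trident.operation.BaseFilter;
import storm.trident.operation.TridentOperationContext;
import storm.trident.tuple.TridentTuple;

public class StopWordFilter extends BaseFilter {

    private transient Set<String> stopWords;

    @Override
    public void prepare(Map conf, TridentOperationContext context) {
        // Load (or periodically reload) the stop word list from your store.
        stopWords = StopWordDao.loadAll(); // hypothetical DAO, not a Storm class
    }

    @Override
    public boolean isKeep(TridentTuple tuple) {
        // Keep the tuple only if its word is not in the stop word list.
        return !stopWords.contains(tuple.getString(0).toLowerCase());
    }
}

You would then attach it to the stream with something like each(new Fields("word"), new StopWordFilter()).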
If the size of the stop word list is bigger, then you can use a QueryFunction, called from your topology with the stateQuery function, which would:
receive a batch of tuples to check (say 10,000 at a time)
build a single query from their content and look up the corresponding stop words in persistence
attach a boolean to each tuple specifying what to do with each one
Then add a Filter right after that to filter based on that boolean (a rough sketch of the lookup part is below).
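Something along these lines, where the StopWordState interface and its bulk lookup are assumptions standing in for whatever persistence you use (again, 0.9.x Trident package names):

import java.util.ArrayList;
import java.util.List;

import backtype.storm.tuple.Values;
import storm.trident.operation.TridentCollector;
import storm.trident.state.BaseQueryFunction;
import storm.trident.state.State;
import storm.trident.tuple.TridentTuple;

// Hypothetical state wrapping your stop word store; the bulk lookup is an assumption.
interface StopWordState extends State {
    List<Boolean> areStopWords(List<String> words);
}

public class StopWordLookup extends BaseQueryFunction<StopWordState, Boolean> {

    @Override
    public List<Boolean> batchRetrieve(StopWordState state, List<TridentTuple> inputs) {
        // Build one lookup for the whole batch instead of one query per tuple.
        List<String> words = new ArrayList<String>();
        for (TridentTuple tuple : inputs) {
            words.add(tuple.getString(0));
        }
        return state.areStopWords(words);
    }

    @Override
    public void execute(TridentTuple tuple, Boolean isStopWord, TridentCollector collector) {
        // Attach the boolean so a downstream Filter can drop the stop words.
        collector.emit(new Values(isStopWord));
    }
}

In the topology you would wire it up with stateQuery(stopWordState, new Fields("word"), new StopWordLookup(), new Fields("isStopWord")) and then the boolean-based Filter.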
And if you feel adventurous:
Another, faster approach would be to use a Bloom filter approximation. I've heard that Algebird is meant to provide this kind of functionality and targets both Scalding and Storm (how cool is that?), but I don't know how stable it is, nor do I have any experience in practically plugging it into Storm (maybe Sunday if it's rainy...).
Also, Cascading (which is not directly related to Storm but has a very similar set of primitive abstractions on top of map reduce) suggests in this tutorial a method based on left joins. Such joins exist in Storm, and the right branch could possibly be fed with a FixedBatchSpout emitting all stop words every time, or even a custom spout that reads the latest version of the stop word list from persistence every time, so maybe that would work too. This also assumes the size of the stop word list is relatively small, though.
I am trying to create a Pig script which changes the order of the columns. This is what I have come up with so far:
inputdata = LOAD 'path/to/file/on/hdfs' USING PigStorage() AS (param1:chararray, param2:chararray, param3:chararray);
outputdata = FOREACH inputdata GENERATE param1, param3, param2;
DUMP outputdata;
I've not tried this yet on HDFS but I figured I'd go ahead and write the unit test first. Unfortunately it doesn't work.
Unit test code:
PigTest test = new PigTest("path_to_script.pig");
FixHadoopOnWindows.runFix();
String[] input = {
"valueparam1\tvalueparam2\tvalueparam3"
};
String[] output = {
"valueparam1\tvalueparam3\tvalueparam2"
};
test.assertOutput("inputdata", input, "outputdata", output);
The FixHadoopOnWindows bit is a fix so I can run my unit tests on a windows machine easily. I found it in some blog and it helped resolve the permission issues I was having.
So now my tests run, but the problem is that the assertOutput fails. When I check the difference, I get this:
Expected:
valueparam1 valueparam3 valueparam2
Actual:
(valueparam1,valueparam3,valueparam2)
So I'm getting these brackets and commas I never asked for. Now I'm not sure whether this is a bug in my unit testing code or in my actual script, so any advice to get me started would be great. Thanks.
Your code looks OK. The brackets mean that your relation outputdata consists of a tuple with three values.
If you later want to store your data separated by tabs, just do STORE outputdata INTO 'dest' USING PigStorage('\t');
http://pig.apache.org/docs/r0.11.0/basic.html#Data+Types+and+More
Figured it out. PigUnit reads out the outputdata value, which is a Tuple in Pig. It's not until I store it to a file that the tuples are converted to tab-separated records.
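In other words, if you would rather assert against the relation directly instead of STOREing it first, writing the expected strings in tuple syntax should make the existing assertion pass, for example:

String[] output = {
    "(valueparam1,valueparam3,valueparam2)"
};
test.assertOutput("inputdata", input, "outputdata", output);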
When we fire a SQL query like
SELECT * FROM SOME_TABLE_NAME
in Oracle, what exactly happens internally? Is there a parser at work? Is it in C/C++?
Can anybody please explain?
Thanks in advance to all.
Short answer is yes, of course there is a parser module inside Oracle that interprets the statement text. My understanding is that the bulk of Oracle's source code is in C.
For general reference:
Any SQL statement potentially goes through three steps when Oracle is asked to execute it. Often, control is returned to the client between each of these steps, although the details can depend on the specific client being used and the manner in which calls are made.
(1) Parse -- I believe the first action is actually to check whether Oracle has a cached copy of the exact statement text. If so, it can save the work of parsing your statement again. If not, it must of course parse the text, then determine an execution plan that Oracle thinks is optimal for the statement. So conceptually at least there are two entities at work in this phase -- the parser and the optimizer.
(2) Execute -- For a SELECT statement this step would generally run just enough of the execution plan to be ready to return some rows to the client. Depending on the details of the plan, that might mean running the whole thing, or might mean doing just a very small fraction of the work. For any other kind of statement, the execute phase is when all of the work is actually done.
(3) Fetch -- This is when rows are actually returned to the client. Generally the client has a predetermined fetch array size which sets the maximum number of rows that will be returned by a single fetch call. So there may be many fetches made for a single statement. Of course if the statement is one that cannot return rows, then there is no fetch step necessary.
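Loosely speaking, you can watch these three phases from a client program. The JDBC sketch below is only an illustration (the connection details are placeholders, and the driver is free to combine or pipeline the underlying calls):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ParseExecuteFetchDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/service", "user", "password")) {

            // (1) Parse: the statement text is sent to Oracle, which either reuses a
            // cached plan for this exact text or parses and optimizes it afresh.
            PreparedStatement stmt = conn.prepareStatement("SELECT * FROM SOME_TABLE_NAME");
            stmt.setFetchSize(100); // fetch array size: max rows per fetch round trip

            // (2) Execute: runs just enough of the plan to be ready to return rows.
            ResultSet rs = stmt.executeQuery();

            // (3) Fetch: rows come back in batches of up to 100 as the client iterates.
            while (rs.next()) {
                // process the row
            }
        }
    }
}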
Manasi,
I think internally Oracle would have its own parser, which parses and tries to compile the query. I don't think it is related to C or C++, but I would need to confirm.
-Justin Samuel.
Is there a way to unit test data flow in an SSIS package?
Ex: testing the sort - verify that the sort is properly done.
There is a unit testing framework for SSIS - see SSISUnit.
This is worth looking at, but it may not solve your problem. It is possible to unit test individual components at the control flow level using this framework, but it is not possible to isolate individual Data Flow Transformations - you can only test the whole Data Flow component.
One approach you could take is to redesign your package and break down your DataFlow component into multiple DataFlow components that can be individually tested. However that will impact the performance of your package, because you will have to persist the data somewhere in between each data flow task.
You can also adapt this approach by using NUnit or a similar framework, using the SSIS API to load a package and execute an individual task.
SSISTester can tap the data flow between two components and save the data to a file. The output can then be accessed in a unit test. For more information, look at ssistester.bytesoftwo.com. An example of how to use SSISTester to achieve this is given below:
[UnitTest("DEMO", "CopyCustomers.dtsx", DisableLogging=true)]
[DataTap(#"\[CopyCustomers]\[DFT Convert customer names]\[RCNT Count customers]", #"\[CopyCustomers]\[DFT Convert customer names]\[DER Convert names to upper string]")]
[DataTap(#"\[CopyCustomers]\[DFT Convert customer names]\[DER Convert names to upper string]", #"\[CopyCustomers]\[DFT Convert customer names]\[FFD Customers converted]")]
public class CopyCustomersFileAll : BaseUnitTest
{
...
protected override void Verify(VerificationContext context)
{
ReadOnlyCollection<DataTap> dataTaps = context.DataTaps;
DataTap dataTap = dataTaps[0];
foreach (DataTapSnapshot snapshot in dataTap.Snapshots)
{
string data = snapshot.LoadData();
}
DataTap dataTap1 = dataTaps[1];
foreach (DataTapSnapshot snapshot in dataTap1.Snapshots)
{
string data = snapshot.LoadData();
}
}
}
Short answer - not easily. Longer answer: yes, but you'll need lots of external tools to do it. One potential test would be to take a small sample of the data set, run it through your sort, and dump it to an Excel file. Take the same data set, copy it to an Excel spreadsheet, and manually sort it. Run a binary diff tool on the SSIS dump and your hand-sorted example. If everything checks out, it's right.
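If you don't have a diff tool handy, even a trivial byte-for-byte comparison covers that last step. A sketch (the file names are placeholders, and it assumes both dumps are written deterministically, so identical data produces identical bytes):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class CompareSortedDumps {
    public static void main(String[] args) throws Exception {
        // Placeholder file names: the dump produced by the SSIS sort vs. the hand-sorted copy.
        byte[] actual = Files.readAllBytes(Paths.get("sorted_by_ssis.xls"));
        byte[] expected = Files.readAllBytes(Paths.get("sorted_by_hand.xls"));
        System.out.println(Arrays.equals(actual, expected) ? "Sort matches" : "Sort differs");
    }
}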
OTOH, unit testing the Sort in SSIS shouldn't be necessary, unless what you're really testing is the sort criteria selection. The sort should have been tested by MS before it was shipped.
I would automate the testing by keeping a known-good output file for appropriate inputs and doing a binary comparison against the actual output with an external program.
I like to use data viewers when I need to see the data moving from component to component.