14 July 2013

Getting Started with IBM PureData System for Analytics

PureData System for Analytics is a data warehouse appliance offered by IBM and powered by IBM Netezza technology. It runs on hardware that you are not likely to have at home, but if you want to test how it works, there is a Netezza emulator that runs as a virtual machine under Windows. Here I’m going to summarize the steps to run the emulator on your machine:

  1. Join the Netezza Developer Network (NDN) by sending an email to the group admins using the link on the page. After joining the group you will be able to download the emulator. The group has a wiki page where you can find the login credentials for the emulator and other useful information.
  2. The Netezza emulator runs as a VMware virtual machine. You will need VMware Player (free) or VMware Workstation (paid). VMware Player also requires installing the VIX API – more info can be found in the wiki section of the NDN.
    The emulator can run on VMware vSphere, but this is not officially supported and you will need to manually tweak the virtual machines to run there.
  3. Download the Windows client software for Netezza:
    • Go to IBM Fix Central
    • Select “Information Management” for “Product Group”
    • Then select “IBM Netezza NPS Software and Clients”
    • Select the latest version. Currently, this is 7.0.3
    • Select “Linux” as a “Platform”
    • Select “Browse for Fixes” and hit “Continue”
    • Locate the NPS version entry from which you can download “nz-winclient-v7.0.3.zip”
    Inside the zip file you will find the Netezza documentation; the JDBC, ODBC and OLE DB drivers; and the Administrative Tools.
  4. The Windows client software does not include a developer IDE, so a third-party one should be used. My personal preference is Aginity Workbench for Netezza. It can be downloaded for free, and after a trial period you are required to complete a free registration to continue using it – no payment is required.

05 July 2013

DataStage Anti-Patterns–Aggregator and Lookup

Today I would like to start a new series of posts dedicated to some patterns and anti-patterns commonly found in DataStage jobs.

Requirement: Build a job that reads invoice items (each invoice has multiple items) and calculates the average item sale price for each invoice. The newly calculated value should be added as a new column.
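Stripped of the DataStage stages, the required transformation can be sketched in plain Python (an illustrative simplification of the job’s logic; the column names invoice_id and sale_price are made up for the example):

```python
from collections import defaultdict

def add_avg_price(rows):
    """rows: list of dicts with 'invoice_id' and 'sale_price' keys.
    Returns new rows with an added 'avg_sale_price' column, i.e. the
    aggregate is computed per invoice and joined back onto every item."""
    totals = defaultdict(lambda: [0.0, 0])  # invoice_id -> [sum, count]
    for r in rows:
        t = totals[r["invoice_id"]]
        t[0] += r["sale_price"]
        t[1] += 1
    return [
        {**r, "avg_sale_price": totals[r["invoice_id"]][0] / totals[r["invoice_id"]][1]}
        for r in rows
    ]

items = [
    {"invoice_id": 1, "sale_price": 10.0},
    {"invoice_id": 1, "sale_price": 20.0},
    {"invoice_id": 2, "sale_price": 5.0},
]
result = add_avg_price(items)
```

Note that the original rows are preserved and only gain an extra column – exactly what the job has to produce.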

Solution 1 (not recommended): The straightforward way to implement the requirement above is to use a Copy stage to split the data, followed by Sort and Aggregator stages to calculate the average sale price, and then a Lookup stage to put the newly calculated value back on the invoice item. The job design looks as follows:


But what is the downside of this? Since the input before the Copy stage is not sorted, we need to use Entire partitioning for the Lookup stage – this requires repartitioning and puts even more pressure on the memory used by the Lookup stage. A simple workaround would be to move the Sort stage before the Copy stage. The job design will then look like this:


But what is the problem with this design? It’s not obvious, but this design also puts pressure on the memory and delays the processing – the Lookup stage does not output data until all the records from the reference link are available. In other words, the lookup above will start matching records only when the Aggregator stage has processed all of its input records. To overcome this, we can redesign the job by replacing the Lookup stage with a Join stage.

Solution 2 (recommended):

By replacing the Lookup stage with a Join stage we solve the problems of the previous designs: records are joined as soon as they are output from the Aggregator stage. The job design looks like this:
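The streaming behaviour that makes the Join stage preferable can be illustrated with a small merge-join sketch in Python (an illustrative simplification, not how DataStage actually implements the stage): because both inputs are sorted on the join key, a matched row can be emitted as soon as the corresponding aggregated row arrives, instead of first materializing the whole reference input the way the lookup does.

```python
def merge_join(items, averages):
    """Merge-join two inputs that are both sorted on invoice_id.

    items:    iterable of (invoice_id, sale_price) rows
    averages: iterable of (invoice_id, avg_price) rows, one per invoice
              (think of it as the Aggregator stage output)

    Yields (invoice_id, sale_price, avg_price) as soon as the matching
    average row is available - nothing waits for the full reference input.
    """
    avg_iter = iter(averages)
    current = next(avg_iter, None)
    for invoice_id, sale_price in items:
        # Advance the (sorted) aggregated side until it catches up.
        while current is not None and current[0] < invoice_id:
            current = next(avg_iter, None)
        if current is not None and current[0] == invoice_id:
            yield (invoice_id, sale_price, current[1])

items = [(1, 10.0), (1, 20.0), (2, 5.0)]
averages = [(1, 15.0), (2, 5.0)]
joined = list(merge_join(items, averages))
```

Because merge_join is a generator, each matched row is available to downstream processing immediately – the same reason the Join stage starts producing output before the Aggregator has seen its last record.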