05 July 2013

DataStage Anti-Patterns–Aggregator and Lookup

Today I would like to start a new series of posts dedicated to some patterns and anti-patterns commonly found in a DataStage jobs.

Requirement: Build a job that reads invoice items (each invoice has multiple items) and needs to calculate the average invoice item sale price for each invoice. The newly calculated value should be added as a new column.

Solution 1 (not recommended): The straight-forward way to implement the requirement above is to use a copy stage to split the data followed by sort and aggregator stages to calculate the average sale price. Then use a lookup to put the newly calculated value back to the invoice item. The job design looks as follow:


But what is the downside with this? Since the input before the cp stage is not sorted we need to use the Entire partitioning for the lookup stage – this requires repartitioning and puts even more pressure to the memory for the lookup stage. A simple work-around would be to move the sort stage before the copy stage. The job design the will look like this:


But what seems to be the problem with this design? It’s not obvious but this design puts also pressure to the memory and delays the processing – the lookup stage does not output data unless all the records from the reference link are available. In other words – the above lookup will start matching records only when the aggregator stage has calculated all the input records. To overcome this we can design the job by replacing the lookup stage with a join stage.

Solution 2 (recommended):

By replacing the lookup stage with a join stage we solve the problems with the previous designs: the records are joined as soon as they are outputted from the aggregator stage. The job design looks like this:


No comments: