2023-03-21
Several important RevoScaleR functions include provisions for transforming data within the function itself rather than requiring separate steps in addition to the function call. This is advantageous since it means that large datasets can be read once instead of having to be read repeatedly by several functions.
rxImport, rxDataStep, and rxSplit support data transformation. In addition, most of the analytic and data mining functions also support data transformation. These include rxLinMod, rxLogit, rxGLM, rxDTree and others. Machine learning functions in the MicrosoftML library, such as rxNeuralNet, though not strictly speaking RevoScaleR functions, also provide transformation support.
To make effective use of RevoScaleR transformations, there are a few basic principles of which you must be aware.
The Closure of the Transformation
Transformations do not have direct access to variables defined in the calling environment. Suppose we want to scale a calculation performed in a transformation by a variable amount, defined in this example as "rate". We can make the rate variable available to the transformation by listing it in transformObjects. Note that a different name may be assigned there; in this case, the value called "rate" in the calling environment is known as "adjustment" inside the transform.
This example makes use of the sample dataset DJIAdaily.xdf distributed with Microsoft R Client. Note that it creates a new column called "adjSpread" from the existing columns High and Low.
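library(RevoScaleR)
# Path to the sample file; assumes it sits in the RevoScaleR sample data directory
djiXdf_filename <- file.path(rxGetOption("sampleDataDir"), "DJIAdaily.xdf")
rate <- 0.05 # a local variable in the calling environment that will be used in the transformation
dji <- rxImport(djiXdf_filename,
  transformObjects = list(adjustment = rate),
  transforms = list(adjSpread = (High - Low) * (1 + adjustment))
)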
head(dji)
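One way to picture why "rate" has to be handed over explicitly is to evaluate the transform expression in an environment that holds only the data columns and the declared objects. The base R sketch below illustrates that scoping rule under this assumption; it is not a description of how RevoScaleR is implemented, and both the chunk values and the run_transform helper are hypothetical.

```r
# Hypothetical stand-in for one chunk of the data (values are made up).
chunk <- list(High = c(105, 110), Low = c(100, 104))

# Hypothetical helper: evaluate an expression where only the chunk's columns
# and the declared transform objects are visible (plus base R functions).
run_transform <- function(expr, data, transformObjects = list()) {
  scope <- list2env(c(data, transformObjects), parent = baseenv())
  eval(expr, envir = scope)
}

rate <- 0.05  # lives in the calling environment only

# Fails: "rate" was never passed in, so the expression cannot find it.
undeclared <- tryCatch(
  run_transform(quote((High - Low) * (1 + rate)), chunk),
  error = function(e) "object not found"
)

# Works: the value of rate is passed in under the name "adjustment".
adjSpread <- run_transform(
  quote((High - Low) * (1 + adjustment)),
  chunk,
  transformObjects = list(adjustment = rate)
)
```

The first call ends in the error branch, while the second returns the adjusted spread for each row of the chunk.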
Using a Custom Transformation Function
A function may be created defining the required transformations. Not only does this simplify the function call invoking the transformation, but it also makes the transformation available to any RevoScaleR function that might require it. As with a direct transformation, any variable to be used in your transformation must be included in a list passed to the transformObjects parameter.
A transformation function receives a single named list of vectors of equal length, one vector per column, holding the rows of the chunk currently being processed. It is a plain list, not a dataframe, and the function is called once per chunk rather than once for the whole dataset, so each call sees only part of the data. Within a call, the column vectors are processed with ordinary vectorized R code.
The transformation also returns a data list that may have some columns added, some removed, and some altered. Columns may be removed in a transformation simply by setting them to NULL.
This example function performs the same operation as the inline code above. Note that, just as in the code above, the function can access a variable in the calling environment if that variable is declared as one of the transformObjects in the RevoScaleR function.
rate <- 0.05
transform_func <- function(dataList)
{
  # Debugging aid: shows that the argument is a plain list
  print(class(dataList))
  dataList$spread <- dataList$High - dataList$Low
  # Same calculation as the inline example: scale the spread up by the adjustment
  dataList$adjSpread <- (dataList$High - dataList$Low) * (1 + adjustment)
  # Return the adjusted data list
  dataList
}
dji <- rxImport(djiXdf_filename,
  varsToDrop = "High", # this will have no effect, since High is required by the transformation
  transformObjects = list(adjustment = rate), # a warning message will appear if there is
                                              # an object named "adjustment" in the calling environment
  transformFunc = transform_func
)
head(dji)
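Because a transformation function is an ordinary R function that takes and returns a list, it can be tried out in base R before it is handed to RevoScaleR. The sketch below assumes a small made-up chunk and a hypothetical function named spread_only; it also shows a column being removed by setting it to NULL.

```r
# Hypothetical chunk: a named list of equal-length vectors, one per column.
chunk <- list(
  High = c(105, 110, 98),
  Low  = c(100, 104, 95)
)

# In RevoScaleR this value would arrive through transformObjects;
# here it is defined directly so the sketch runs on its own.
adjustment <- 0.05

spread_only <- function(dataList) {
  # Vectorized: each line works on whole columns of the chunk at once.
  dataList$adjSpread <- (dataList$High - dataList$Low) * (1 + adjustment)
  # Assigning NULL removes a column from the returned list.
  dataList$High <- NULL
  dataList$Low <- NULL
  dataList
}

result <- spread_only(chunk)
names(result)  # only adjSpread remains
```

Testing the function on a hand-built list like this is a cheap way to catch mistakes before running it over a large file.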
Best Practices for RevoScaleR Transformations
Do Everything at Once
Create a single transformation to make all the column changes you need. When working with a large dataset, the time it takes to read the data is the bottleneck, so splitting the work into several transformations, each of which requires its own scan of the dataset, is rarely worthwhile.
Be Careful with Factors
If your transformation takes place in Microsoft Machine Learning Server, the data is processed in chunks, and you run the risk that not every value of a column is present in the first chunk read. This can result in the creation of incorrect factor definitions. It is always better to take command of factor creation yourself using rxFactors().
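The chunking problem can be reproduced in base R without RevoScaleR. The sketch below uses two made-up chunks of a hypothetical "exchange" column; it only illustrates why levels inferred from one chunk are unreliable and does not call rxFactors().

```r
# Two hypothetical chunks of the same column, read one after the other.
chunk1 <- c("NYSE", "NYSE", "NASDAQ")
chunk2 <- c("NYSE", "LSE")

# Levels inferred from the first chunk alone miss "LSE".
inferred <- factor(chunk1)
levels(inferred)

# Applying those levels to the second chunk silently turns "LSE" into NA.
bad <- factor(chunk2, levels = levels(inferred))

# Declaring the full level set up front keeps every chunk consistent.
all_levels <- c("LSE", "NASDAQ", "NYSE")
good <- factor(chunk2, levels = all_levels)
```

Stating the levels explicitly is the same idea as defining factors yourself instead of letting them be derived from whatever data happens to arrive first.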
Don't Forget to Check .rxIsTestChunk
The name says it all. RevoScaleR does a preliminary test read of 10 rows before beginning the full read, and your transformation is run on those rows as well. Check .rxIsTestChunk inside your transformation function, and do not accidentally include the rows from the test read in your results.
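The check itself is a one-line guard. The base R simulation below sets .rxIsTestChunk by hand, whereas in a real run RevoScaleR supplies it to the transformation; the state environment, the count_rows function, and the chunk values are all hypothetical, and how results are accumulated across chunks in an actual RevoScaleR job is not shown here.

```r
# Stand-alone simulation: RevoScaleR supplies .rxIsTestChunk itself;
# here it is set by hand so the sketch runs in base R.
state <- new.env()
state$rows_seen <- 0

count_rows <- function(dataList) {
  # Only count rows on real reads, never on the preliminary test read.
  if (!.rxIsTestChunk) {
    state$rows_seen <- state$rows_seen + length(dataList[[1]])
  }
  dataList
}

test_chunk <- list(High = c(105, 110), Low = c(100, 104))
real_chunk <- list(High = c(105, 110, 98), Low = c(100, 104, 95))

.rxIsTestChunk <- TRUE   # simulated test read
invisible(count_rows(test_chunk))

.rxIsTestChunk <- FALSE  # simulated full read
invisible(count_rows(real_chunk))

state$rows_seen  # counts only the rows of the real chunk
```

Without the guard, the rows of the test read would be counted along with the real data.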
Use a Transformation Function
Microsoft suggests that you should always create a function to perform your transformations in preference to including in-line code within the RevoScaleR function call.
Conclusion
Many data tidying tasks that would require multiple steps in base R code can be performed in a single transformation operation in RevoScaleR. This may be done in a call to rxDataStep, but many of the analytic and data mining functions of RevoScaleR, as well as the machine learning functions of MicrosoftML, accept the same transformation arguments. This often provides a substantial advantage when dealing with larger datasets in R.
This piece was originally posted on Aug 13, 2019, and has been refreshed with updated styling.