Handling unknown categorical levels in MOJO and POJO during prediction:
Problem: I have a model that I have exported to MOJO and am pushing data through it.
I understand that there are effectively two options for dealing with unknown categorical levels. By default, an unknown level will throw a PredictUnknownCategoricalLevelException. Alternatively, we can call setConvertUnknownCategoricalLevelsToNa(true), and the unknown level will be set to Double.NaN.
With the first option we can also get the count of unknown levels per column, but no information on what those levels actually are. That information is needed for debugging. Is there a way to get it?
I was thinking I could do an initial pass on my data to test for unknown levels, but I’m not sure whether it’s possible to interrogate the model for a list of known levels. To properly understand and diagnose the arrival of new levels, it would be helpful to get that list from the model (per column index).
Another issue is related to type. If data are read in from a text file for the purpose of predictions, every value is a string, but the model may expect a different type. Is there a way to interrogate the model for the type of each column so that casting can be done correctly before passing the data into the RowData object?
- If you catch PredictUnknownCategoricalLevelException, it has a field, unknownLevel, which lets you handle the unknown level any way you like. You can report the unknownLevel, repair the row, and try again.
- The model has a getDomains() function. This returns the level names for each column. In the MOJO world, “domain” == “level”, so you can interrogate the model for its known levels.
- If the data is read in from a text file as a string, the Easy wrapper is smart enough to try parsing it as a double first, so numeric values can be passed in as strings.
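
The pre-pass idea from the question can be sketched without touching the scoring API at all. The example below is a minimal, self-contained sketch, not H2O code: the class UnknownLevelScan and its methods findUnknownLevels and coerceRow are hypothetical names, and the names and domains arrays are hard-coded stand-ins for what you would read from the model (for example via the getDomains() call mentioned above). It also assumes that a null entry in domains marks a numeric column; check that against your genmodel version before relying on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of H2O: scans text rows against a table of known levels.
class UnknownLevelScan {

    // For each categorical column, collect the values seen in the data that are
    // not in that column's domain. domains[i] == null is ASSUMED to mean "numeric".
    static Map<String, Set<String>> findUnknownLevels(
            String[] names, String[][] domains, List<Map<String, String>> rows) {
        Map<String, Set<String>> unknown = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            if (domains[i] == null) {
                continue; // numeric column: nothing to check
            }
            Set<String> known = new HashSet<>(Arrays.asList(domains[i]));
            for (Map<String, String> row : rows) {
                String value = row.get(names[i]);
                if (value != null && !known.contains(value)) {
                    unknown.computeIfAbsent(names[i], k -> new TreeSet<>()).add(value);
                }
            }
        }
        return unknown;
    }

    // Convert one text row: Double for numeric columns, String for categorical ones.
    // Missing values are simply left out of the result.
    static Map<String, Object> coerceRow(
            String[] names, String[][] domains, Map<String, String> raw) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            String value = raw.get(names[i]);
            if (value == null) {
                continue;
            }
            Object typed = (domains[i] == null) ? (Object) Double.valueOf(value.trim()) : value;
            out.put(names[i], typed);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-ins for the model's column names and domains (made up for illustration).
        String[] names = {"age", "color"};
        String[][] domains = {null, {"blue", "green", "red"}};

        Map<String, String> first = new LinkedHashMap<>();
        first.put("age", "31");
        first.put("color", "red");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("age", "47");
        second.put("color", "purple"); // not a known level

        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(first);
        rows.add(second);

        System.out.println(findUnknownLevels(names, domains, rows)); // prints {color=[purple]}
        System.out.println(coerceRow(names, domains, first));        // prints {age=31.0, color=red}
    }
}
```

The scan reports which levels are new, per column, before any row is scored, which is the debugging information the unknown-level counts do not give you. The coerced map has the column-name-to-value shape that a RowData object is filled with, so its entries could be copied over once the row has been checked.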