I have a dataset (250k rows) that I need to predict labels for. However, whenever I run the Score recipe, it fails with a MemoryError. Is there a way to score datasets in batches?
asked by Simon

1 Answer


Scoring is already done in small batches. How much memory does your machine have? How much of it is free before you run the recipe? How many columns are in the dataset? What kind of preprocessing are you using (e.g. hashing, count vectorization or TF-IDF)?
answered by
I have 8 GB of memory available. 1 GB is still free on the machine; pretty much all of the other 7 GB is used by DSS. The input dataset has 8 columns, but I apply TF-IDF vectorization to a column containing lots of tags. The "algorithm" tab in the model view says there are 1016 columns after preprocessing. Estimated memory usage is 94 MB (for training only, I guess).
The training works perfectly with 25k rows, but the scoring on the 250k rows fails.
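
As a rough sanity check on those numbers, the sketch below estimates how much memory a fully dense matrix of 1016 columns would need at each row count. It assumes 8-byte float64 values and that every row is held in memory at once; since scoring is reportedly batched, the real per-batch figure depends on a batch size that is not stated here. The helper name dense_megabytes is made up for this illustration.

```python
def dense_megabytes(n_rows, n_cols, bytes_per_value=8):
    """Size in MB of a dense n_rows x n_cols matrix.

    bytes_per_value=8 assumes float64, which is an assumption
    about the dtype and not something the thread confirms.
    """
    return n_rows * n_cols * bytes_per_value / float(1024 ** 2)


N_COLS = 1016  # column count after preprocessing, as reported above

training_mb = dense_megabytes(25000, N_COLS)
scoring_mb = dense_megabytes(250000, N_COLS)

print("25k rows:  %.0f MB" % training_mb)
print("250k rows: %.0f MB" % scoring_mb)
```

Under these assumptions the 250k-row matrix needs about ten times the memory of the 25k-row one, close to 2 GB, and any operation that copies it needs that much again while the copy exists. That would not fit in 1 GB of free memory.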

Here's the traceback as well:
[11:57:18] [INFO] [dku.utils]  - Traceback (most recent call last):
[11:57:18] [INFO] [dku.utils]  -   File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main
[11:57:18] [INFO] [dku.utils]  -     "__main__", fname, loader, pkg_name)
[11:57:18] [INFO] [dku.utils]  -   File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
[11:57:18] [INFO] [dku.utils]  -     exec code in run_globals
[11:57:18] [INFO] [dku.utils]  -   File "/home/dataiku/dataiku-dss-4.0.3/python/dataiku/doctor/prediction/reg_scoring_recipe.py", line 146, in <module>
[11:57:18] [INFO] [dku.utils]  -     json.load_from_filepath(sys.argv[7]))
[11:57:18] [INFO] [dku.utils]  -   File "/home/dataiku/dataiku-dss-4.0.3/python/dataiku/doctor/prediction/reg_scoring_recipe.py", line 133, in main
[11:57:18] [INFO] [dku.utils]  -     for output_df in output_generator():
[11:57:18] [INFO] [dku.utils]  -   File "/home/dataiku/dataiku-dss-4.0.3/python/dataiku/doctor/prediction/reg_scoring_recipe.py", line 78, in output_generator
[11:57:18] [INFO] [dku.utils]  -     output_probas=recipe_desc["outputProbabilities"])
[11:57:18] [INFO] [dku.utils]  -   File "/home/dataiku/dataiku-dss-4.0.3/python/dataiku/doctor/prediction/classification_scoring.py", line 197, in binary_classification_predict
[11:57:18] [INFO] [dku.utils]  -     (pred_df, proba_df) = binary_classification_predict_ex(clf, modeling_params, target_map, threshold, transformed, output_probas)
[11:57:18] [INFO] [dku.utils]  -   File "/home/dataiku/dataiku-dss-4.0.3/python/dataiku/doctor/prediction/classification_scoring.py", line 148, in binary_classification_predict_ex
[11:57:18] [INFO] [dku.utils]  -     features_X_df = features_X.as_dataframe()
[11:57:18] [INFO] [dku.utils]  -   File "/home/dataiku/dataiku-dss-4.0.3/python/dataiku/doctor/multiframe.py", line 253, in as_dataframe
[11:57:18] [INFO] [dku.utils]  -     return pd.concat(blockvals, axis=1)
[11:57:18] [INFO] [dku.utils]  -   File "/home/dataiku/dss/pyenv/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 846, in concat
[11:57:18] [INFO] [dku.utils]  -     return op.get_result()
[11:57:18] [INFO] [dku.utils]  -   File "/home/dataiku/dss/pyenv/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 1038, in get_result
[11:57:18] [INFO] [dku.utils]  -     copy=self.copy)
[11:57:18] [INFO] [dku.utils]  -   File "/home/dataiku/dss/pyenv/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4545, in concatenate_block_managers
[11:57:18] [INFO] [dku.utils]  -     for placement, join_units in concat_plan]
[11:57:18] [INFO] [dku.utils]  -   File "/home/dataiku/dss/pyenv/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4648, in concatenate_join_units
[11:57:18] [INFO] [dku.utils]  -     concat_values = concat_values.copy()
[11:57:18] [INFO] [dku.utils]  - MemoryError
[11:57:18] [INFO] [dku.flow.activity] - Run thread failed for activity score_Companies_unlabelled_AI_prepared_NP
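
For reference, the traceback ends inside pd.concat, where the feature blocks are combined into one DataFrame and copied. The generic idea behind batch scoring, as asked about in the question, is to keep only a bounded number of rows in memory at a time. The sketch below is plain Python and is not the DSS implementation; iter_batches, score_in_batches and toy_predict are hypothetical names, and toy_predict merely stands in for a real model.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def score_in_batches(rows, predict, batch_size=10000):
    """Apply predict to one batch at a time and yield its predictions.

    Only one batch of rows is materialized at any moment.
    """
    for batch in iter_batches(rows, batch_size):
        for prediction in predict(batch):
            yield prediction


def toy_predict(batch):
    """Stand-in model: flags rows that carry more than two tags."""
    return [len(tags.split(",")) > 2 for tags in batch]


# Rows come from a generator, so they are never all in memory together.
rows = ("a,b,c" if i % 2 else "a" for i in range(25))
predictions = list(score_in_batches(rows, toy_predict, batch_size=10))
print(len(predictions))
```

With this pattern, peak memory depends on batch_size and the number of feature columns, not on the total row count.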