What does exit code 132 mean when training a deep learning model?

Dss28 · ‎11-13-2018

Hey I am doing the HowTo "Deep Learning Image Classification" with the Dataset "Cats and Dogs".

But when training the model I get an error with exit code 132.

I already tried to post this question but it seems my post got stuck in the filter (I posted the logs too in the previous question).

Can somebody help me with this issue?

Nicolas_Servel · ‎11-14-2018

Hello,

Could you please try to attach the logs of your training? Otherwise it will be difficult to investigate.

Regards,

Nicolas

Dss28 · ‎11-14-2018

Hi Nicolas, here is the last part of the logs (I couldn't post the whole thing):

[2018-11-13 17:57:21,791] [7113/MainThread] [INFO] [root] Realign target series = (1598,)
[2018-11-13 17:57:21,792] [7113/MainThread] [INFO] [root] After realign target: (1598,)
[2018-11-13 17:57:21,792] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DropRowsWhereNoTarget
[2018-11-13 17:57:21,793] [7113/MainThread] [INFO] [root] Deleting 0 rows because no target
[2018-11-13 17:57:21,793] [7113/MainThread] [INFO] [root] MF before = (0, 0) target before = (1598,)
[2018-11-13 17:57:21,795] [7113/MainThread] [INFO] [root] MultiFrame, dropping rows: []
[2018-11-13 17:57:21,798] [7113/MainThread] [INFO] [root] After DRWNT input_df=(1598, 2)
[2018-11-13 17:57:21,799] [7113/MainThread] [INFO] [root] MF after = (0, 0) target after = (1598,)
[2018-11-13 17:57:21,799] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DumpPipelineState
[2018-11-13 17:57:21,799] [7113/MainThread] [INFO] [root] ********* Pipieline state (Before feature selection)
[2018-11-13 17:57:21,799] [7113/MainThread] [INFO] [root] input_df= (1598, 2)
[2018-11-13 17:57:21,800] [7113/MainThread] [INFO] [root] current_mf=(0, 0)
[2018-11-13 17:57:21,800] [7113/MainThread] [INFO] [root] PPR:
[2018-11-13 17:57:21,800] [7113/MainThread] [INFO] [root] target = ((1598,))
[2018-11-13 17:57:21,800] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:EmitCurrentMFAsResult
[2018-11-13 17:57:21,801] [7113/MainThread] [INFO] [root] Set MF index len 1598
[2018-11-13 17:57:21,801] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DumpPipelineState
[2018-11-13 17:57:21,801] [7113/MainThread] [INFO] [root] ********* Pipieline state (At end)
[2018-11-13 17:57:21,801] [7113/MainThread] [INFO] [root] input_df= (1598, 2)
[2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] current_mf=(0, 0)
[2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] PPR:
[2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] UNPROCESSED = ((1598, 2))
[2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] TRAIN = ((0, 0))
[2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] target = ((1598,))
[2018-11-13 17:57:21,804] [7113/MainThread] [INFO] [root] END - Fitting preprocessors
[2018-11-13 17:57:21,804] [7113/MainThread] [INFO] [root] START - Preprocessing train set
[2018-11-13 17:57:21,805] [7113/MainThread] [INFO] [root] END - Preprocessing train set
[2018-11-13 17:57:21,805] [7113/MainThread] [INFO] [root] START - Preprocessing test set
[2018-11-13 17:57:21,811] [7113/MainThread] [INFO] [root] END - Preprocessing test set
[2018-11-13 17:57:21,818] [7113/MainThread] [INFO] [root] START - Fitting model
/home/dataiku/dss/code-envs/python/Python/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
[2018/11/13-17:57:22.081] [KNL-python-single-command-kernel-monitor-18717] [INFO] [dku.kernels] - Process done with code 132
[2018/11/13-17:57:22.082] [KNL-python-single-command-kernel-monitor-18717] [INFO] [dip.tickets] - Destroying API ticket for analysis-ml-CATS_DOGS-SrUtxam on behalf of Dataiku28
[2018/11/13-17:57:22.083] [MRT-18713] [INFO] [dku.kernels] - Getting kernel tail
[2018/11/13-17:57:22.084] [MRT-18713] [INFO] [dku.kernels] - Trying to enrich exception: com.dataiku.dip.io.SocketBlockLinkIOException: Failed to get result from kernel from kernel com.dataiku.dip.analysis.coreservices.AnalysisMLKernel@4842e569 process=null pid=?? retcode=132
[2018/11/13-17:57:22.184] [MRT-18713] [INFO] [dku.kernels] - Getting kernel tail
[2018/11/13-17:57:22.186] [MRT-18713] [WARN] [dku.analysis.ml.python] - Training failed
com.dataiku.dip.exceptions.ProcessDiedException: Process died (exit code: 132)
at com.dataiku.dip.kernels.DSSKernelBase.maybeRethrowAsProcessDied(DSSKernelBase.java:219)
at com.dataiku.dip.analysis.ml.prediction.PredictionTrainAdditionalThread.process(PredictionTrainAdditionalThread.java:78)
at com.dataiku.dip.analysis.ml.shared.PRNSTrainThread.run(PRNSTrainThread.java:130)
[2018/11/13-17:57:22.193] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.ml.python] T-I1AGUr2h - Processing thread joined ...
[2018/11/13-17:57:22.193] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.ml.python] T-I1AGUr2h - Joining processing thread ...
[2018/11/13-17:57:22.194] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.ml.python] T-I1AGUr2h - Processing thread joined ...
[2018/11/13-17:57:22.195] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.prediction] T-I1AGUr2h - Train done
[2018/11/13-17:57:22.195] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.prediction] T-I1AGUr2h - Train done
[2018/11/13-17:57:22.202] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.prediction] T-I1AGUr2h - Publishing mltask-train-done reflected event

Dss28 · ‎11-15-2018

Hey Nicolas I attached the Logs.
Are they not enough or is there simply no solution to the problem?

Nicolas_Servel · ‎11-15-2018

Hello,

The logs do not give much more information. Exit code 132 corresponds to a SIGILL signal (128 + 4), which a process receives when the CPU is asked to execute an illegal instruction. So it probably comes from a bug inside Keras or TensorFlow.
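
As a quick way to check what such an exit code means, here is a minimal sketch (not part of DSS). It assumes Python 3.5+ on a Unix-like system and the usual convention that a process killed by signal n is reported with exit code 128 + n. `describe_exit_code` is a hypothetical helper name chosen for illustration.

```python
import signal


def describe_exit_code(code):
    """Return the signal name for a shell-style exit code above 128, else None."""
    if code <= 128:
        # Codes up to 128 are ordinary exit statuses, not signals.
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # The number does not match a signal known on this platform.
        return None


print(describe_exit_code(132))  # prints SIGILL on Linux
```

Running this on the server only confirms which signal ended the process. It does not say why the illegal instruction was executed.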

To try to reproduce the error, could you please attach, if it's possible:
- The architecture that you used (content of the "Architecture" tab)
- The definition of the code-env that you used to run the model
- a sample of the data (or at least what it looks like) on which the model is trained

Thanks in advance,

Nicolas

Dss28 · ‎11-15-2018

I am going to split this.
The code in the feature handling tab:

from keras.preprocessing.image import img_to_array, load_img

# Custom image preprocessing function.
# Must return a numpy ndarray representing the image.
#  - image_file is a file like object
def preprocess_image(image_file):
    img = load_img(image_file,target_size=(299, 299, 3))
    array = img_to_array(img)

    # Normalize image between 0 and 1.
    array /= 255

    return array

Dss28 · ‎11-15-2018

The Code in Architecture

from keras.layers import Input, Dense, Flatten, GlobalAveragePooling2D
from keras.models import Model
from keras.applications import Xception
import os
import dataiku

def build_model(input_shapes, n_classes=None):

    #### DEFINING INPUT AND BASE ARCHITECTURE
    # You need to modify the name and shape of the "image_input"
    # according to the preprocessing and name of your
    # initial feature.
    # This feature should to be preprocessed as an "Image", with a
    # custom preprocessing.
    image_shape = (299, 299, 3)
    image_input_name = "path_preprocessed"
    image_input = Input(shape=image_shape, name=image_input_name)

    base_model = Xception(include_top=False, weights=None, input_tensor=image_input)

    #### LOADING WEIGHTS OF PRE TRAINED MODEL
    # To leverage this architecture, it is better to use weights
    # computed on a previous training on a large dataset (Imagenet).
    # To do so, you need to download the file containing the weights
    # and load them into your model.
    # You can do it by using the macro "Download pre-trained model"
    # of the "Deep Learning image" plugin (CPU or GPU version depending
    # on your setup) available in the plugin store. For this architecture,
    # you need to select:
    #    "Xception trained on Imagenet"
    # This will download the weights and put them into a managed folder
    folder = dataiku.Folder("xception_weights")
    weights_path = "xception_imagenet_weights_notop.h5"

    base_model.load_weights(os.path.join(folder.get_path(), weights_path))

    for layer in base_model.layers:
        layer.trainable = False

    #### ADDING FULLY CONNECTED CLASSIFICATION LAYER
    x = base_model.layers[-1].output
    x = Flatten()(x)
    predictions = Dense(n_classes, activation="softmax")(x)

    model = Model(input=base_model.input, output=predictions)
    return model

def compile_model(model):
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy"
    )
    return model

Dss28 · ‎11-15-2018

The Environment is a Python environment. I am not sure what definition means in this case.
The input data are images "Cats_Dogs" in the Transfer learning section of this HowTo:

https://www.dataiku.com/learn/guide/visual/machine-learning/deep-learning-images.html

Thanks for your help. I really don't know what I am doing wrong.

Nicolas_Servel · ‎11-15-2018

The code-env is the Python environment that you had to set up to be able to run Keras/TensorFlow code. The steps to create it are mentioned in the "Prerequisites" of the tutorial.

To access it afterwards, you can go to Administration > Code Envs, select it and go to installed packages. You can then send us the list of installed packages.

What type of server is your DSS instance installed on?

It seems that the issue does not come from your code and we've never seen a similar error when we run the tutorial on our side.

Here is what you can try, although there is no certainty that it will change anything:
- re-install the code-env, i.e. go to the page of the code-env, select "rebuild env" and click on update
- decrease the batch size, in case this is a hidden out-of-memory error. To do so, go to your model page, to the "Training" tab, and select for example 10 as the batch size.

In any case, the bug seems to come from Keras/Tensorflow and how it interacts with your server, not from DSS.

Regards,

Nicolas

Dss28 · ‎11-15-2018

I sent the list of installed packages, but the post is stuck in the spam filter. I don't know which server this is on. The server is hosted by my university, since this is a project I am doing there. I tried to lower the batch size, but it doesn't work. I even tried to copy the finished model project of this tutorial step by step. That didn't work either. Maybe it's a problem with the training code?

The Code:

from dataiku.doctor.deep_learning.sequences import DataAugmentationSequence
from keras.preprocessing.image import ImageDataGenerator
from keras import callbacks

# A function that builds train and validation sequences.
# You can define your custom data augmentation based on the original train and validation sequences

#   build_train_sequence_with_batch_size        - function that returns train data sequence depending on
#                                                 batch size
#   build_validation_sequence_with_batch_size   - function that returns validation data sequence depending on
#
def build_sequences(build_train_sequence_with_batch_size, build_validation_sequence_with_batch_size):

    # The actual batch size of the train sequence will be (batch_size * n_augmentation)
    batch_size = 32
    n_augmentation = 1 # Number of augmentation per batch, lower means better learning but also slower

    train_sequence = build_train_sequence_with_batch_size(batch_size)
    validation_sequence = build_validation_sequence_with_batch_size(batch_size)

    augmentator = ImageDataGenerator(
        zoom_range=0.2,
        shear_range=0.2,
        rotation_range=20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        horizontal_flip=True
    )
    augmented_sequence = DataAugmentationSequence(
        train_sequence,
        'path_preprocessed',
        augmentator,
        n_augmentation
    )

    return augmented_sequence, validation_sequence

# A function that contains a call to fit a model.

#   model                 - compiled model
#   train_sequence        - train data sequence, returned in build_sequence
#   validation_sequence   - validation data sequence, returned in build_sequence
#   base_callbacks        - a list of Dataiku callbacks, that are not to be removed. User callbacks can be added to this list
def fit_model(model, train_sequence, validation_sequence, base_callbacks):
    epochs = 5

    # Adding a callback that will reduce the 'learning rate' when the model
    # has difficulty to improve itself on the validation data.
    callback = callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.2,
        patience=5
    )

    base_callbacks.append(callback)

    model.fit_generator(train_sequence,
                        validation_data=validation_sequence,
                        epochs=epochs,
                        callbacks=base_callbacks,
                        shuffle=True)

Dss28 · ‎11-15-2018

List of packages:

absl-py==0.6.1
astor==0.7.1
backports-abc==0.5
backports.shutil-get-terminal-size==1.0.0
backports.ssl-match-hostname==3.5.0.1
backports.weakref==1.0.post1
bleach==1.5.0
certifi==2018.10.15
chardet==3.0.4
Click==7.0
decorator==4.2.1
enum34==1.1.6
Flask==0.12.4
funcsigs==1.0.2
futures==3.2.0
gast==0.2.0
grpcio==1.16.0
h5py==2.7.1
html5lib==0.9999999
idna==2.6
ipykernel==4.8.2
ipython==5.8.0
ipython-genutils==0.2.0
itsdangerous==1.1.0
Jinja2==2.10
jupyter-client==5.2.2
jupyter-core==4.4.0
Keras==2.1.5
Markdown==3.0.1
MarkupSafe==1.1.0
mock==2.0.0
numpy==1.15.4
pandas==0.20.3
pathlib2==2.3.2
patsy==0.5.1
pbr==5.1.1
pexpect==4.4.0
pickleshare==0.7.4
Pillow==5.1.0
prompt-toolkit==1.0.15
protobuf==3.6.1
ptyprocess==0.5.2
Pygments==2.2.0
python-dateutil==2.6.1
pytz==2018.3
PyYAML==3.13
pyzmq==16.0.4
requests==2.18.4
scandir==1.9.0
scikit-learn==0.19.2
scipy==1.1.0
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.11.0
statsmodels==0.8.0
tensorboard==1.8.0
tensorflow==1.8.0
termcolor==1.1.0
tornado==4.5.3
traitlets==4.3.2
urllib3==1.22
wcwidth==0.1.7
Werkzeug==0.14.1
xgboost==0.71

Nicolas_Servel · ‎11-15-2018

Hello again,

The issue does not come from the code of the tutorial.

After a quick search on the internet, it seems that the prebuilt tensorflow packages from version 1.6 onward do not work on servers whose CPUs do not support the AVX instruction set, which can result in exit code 132 errors (more info here: https://github.com/tensorflow/tensorflow/issues/19584). This may be the case for your server.
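
To check whether a server's CPU advertises AVX, one option is to look at the flags in /proc/cpuinfo. The sketch below assumes a Linux x86 server, where that file has a "flags" line per core; it is not something DSS provides. `cpu_flags` and `supports_avx` are hypothetical helper names.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx(cpuinfo_text):
    """True if the exact flag 'avx' is present (not just 'avx2' or similar)."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AVX supported:", supports_avx(f.read()))
    except IOError:
        # Assumption: the file only exists on Linux-like systems.
        print("/proc/cpuinfo is not available on this system")
```

If this reports that AVX is not supported, the SIGILL is consistent with the issue linked above.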

Can you try to downgrade your tensorflow version to 1.4.0 and see if this works?

To do that, go to Administration > Code Envs > your code-env > Packages to install, replace "tensorflow==1.8.0" with "tensorflow==1.4.0", and click on "Save and update".

Regards,

Nicolas

Labels

Deep Learning

Image data

Machine Learning
