
I have a filesystem datasource that contains thousands of folders, each of which holds a set of comma-separated (CSV) files. Each file has a different schema, and the file name is used to create partitioned datasources with the following pattern:

/%{DIR_NAME}/KEY_%{DIR_NAME}.csv
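For concreteness, here is a rough sketch of how that pattern maps paths to partitions (the directory names are made up, and Python's re module stands in for DSS's own matcher):

import re

# %{DIR_NAME} appears twice, so the directory name and the file name must
# carry the same partition value; the backreference \1 mimics that.
pattern = re.compile(r"^/([^/]+)/KEY_\1\.csv$")

for path in ["/2017-01/KEY_2017-01.csv",    # partition "2017-01"
             "/2017-01/OTHER_2017-01.csv",  # excluded: wrong file prefix
             "/2017-02/KEY_2017-01.csv"]:   # excluded: dir/file mismatch
    m = pattern.match(path)
    print(path, "->", m.group(1) if m else "excluded")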

That pattern creates a datasource from all the files whose names start with KEY, and that part works as expected. My problem is that I can't run any recipe against that datasource: I tried Python, shell, and sync recipes, and all of them failed with the same error:

	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at com.dataiku.dip.security.process.RegularProcess.start(RegularProcess.java:47)
	at com.dataiku.dip.security.process.InsecureProcessesLaunchService.launch(InsecureProcessesLaunchService.java:34)
	at com.dataiku.dip.dataflow.exec.AbstractCodeBasedActivityRunner.execute(AbstractCodeBasedActivityRunner.java:263)
	at com.dataiku.dip.dataflow.exec.AbstractCodeBasedActivityRunner.execute(AbstractCodeBasedActivityRunner.java:231)
	at com.dataiku.dip.dataflow.exec.AbstractPythonRecipeRunner.executeScript(AbstractPythonRecipeRunner.java:37)
	at com.dataiku.dip.recipes.code.python.PythonRecipeRunner.run(PythonRecipeRunner.java:49)
	at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:353)
Caused by: java.io.IOException: error=7, Argument list too long
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
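For what it's worth, error=7 is the kernel's E2BIG: the combined size of the argument list and environment handed to execve() exceeds ARG_MAX, so the child process is rejected before it ever runs. A minimal sketch (run outside DSS) that provokes the same failure by handing a child process an oversized argument list:

import os
import subprocess

print("ARG_MAX:", os.sysconf("SC_ARG_MAX"))  # kernel limit on argv + environ

# ~20 MB of arguments, far past the default ARG_MAX on a typical Linux box
huge_args = ["x" * 2048] * 10000

try:
    subprocess.call(["/bin/true"] + huge_args)
except OSError as e:
    print(e)  # [Errno 7] Argument list too long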

My current recipe is written in Python; the code is:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

print("Here")  # debug marker: never printed, see below

# Recipe inputs
events_CSV = dataiku.Dataset("KEY_CSV")
events_CSV_df = events_CSV.get_dataframe()

# Recipe outputs
events_ORC = dataiku.Dataset("KEY_ORC")
events_ORC.write_with_schema(events_CSV_df)

The job fails before "Here" is ever printed, so the Python process itself never starts.
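Since the failure happens in ProcessBuilder.start(), nothing inside the recipe body can fix it: whatever DSS puts on the child's command line or environment is already too long before Python is launched. With thousands of partitions, my guess is that the culprit is the partition list itself. A back-of-the-envelope sketch (all counts and lengths are assumptions) of how quickly such a list hits the kernel's limits:

# Hypothetical sizing of a comma-separated partition identifier list.
n_partitions = 5000   # assumption: "thousands of folders", one partition each
avg_name_len = 30     # assumption: average directory-name length
list_bytes = n_partitions * (avg_name_len + 1)  # +1 per separator
print("partition list: ~%d KB" % (list_bytes // 1024))  # ~151 KB
# Linux caps a single argv/environment string at MAX_ARG_STRLEN
# (32 pages, i.e. 128 KB with 4 KB pages) and the whole argv + environ
# at ARG_MAX, so a list of this size can already trip error=7.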

These are the DSS instance settings:

{
  "dipInstanceId": "8bu1n1os-203c299d56c99ef078a53a1a81b6ea23-c60f6bab8e57ecd615a8ec240207f819",
  "features": {"TWITTER": {}, "HADOOP": {}, "HIVE": {}, "PIG": {}, "R": {}, "SPARK": {}},
  "devInstance": false,
  "distribVersion": "7.3",
  "debug": false,
  "version": {"product_commitid": "", "conf_version": "16", "product_version": "4.0.5"},
  "distrib": "redhat"
}


asked by serqql
