I have a connection to a blob storage and I would like to build a dataset from xlsx files. Some of them are empty (0-byte files) but there's no option to ignore these in the GUI...
How are you loading your files? Where do you actually get the exception?
Thank you for your response @Turribeach.
The files are in an Azure blob storage containing Excel files. I created a dataset in Dataiku by explicitly selecting them, then ran a recipe to store the amalgamated dataset. Any recipe fails the same way.
The error I'm receiving is this:
HTTP code: , type: java.io.IOException
java.io.IOException: Failed to open Excel file
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:422)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:349)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:184)
    at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:224)
    at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:378)
Caused by: org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)
    at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:111)
    at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:206)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:143)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:201)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:140)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:123)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
    at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:416)
    ... 7 more
[09:13:25] [INFO] [dku.flow.activity] running compute_newar_dataset_stacked_prep_NP - activity is finished
[09:13:25] [ERROR] [dku.flow.activity] running compute_newar_dataset_stacked_prep_NP - Activity failed
java.io.IOException: Failed to open Excel file
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:422)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:349)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:184)
    at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:224)
    at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:378)
Caused by: org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)
    at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:111)
    at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:206)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:143)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:201)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:140)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:123)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
    at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:416)
Thanks, it's a lot easier to understand and reproduce the error now. I can confirm that I get the same error with empty Excel files. I also don't see any way of filtering out the empty files or avoiding the exception in a visual, no-code way.
Personally I think this is a bug for two reasons:
You should raise it with Dataiku Support, but I suspect that even if it is accepted as a bug it will not get much priority for fixing, since this is a problem of your own making (i.e. why do you have empty files in the blob storage?). And even if it were fixed, you would probably not be able to upgrade to the latest version that quickly.
So I think you have 2 options:
1) Fix your writer process so that it doesn't leave 0-byte files
2) Pre-process the blob storage and remove any 0-byte files before you attempt to load them. This will certainly require some Python code, so it is not going to be possible with a visual recipe.
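For option 2, here is a minimal sketch of the filtering logic, not a tested Dataiku recipe. It is written against a local directory so it can run anywhere; the function names (`split_by_size`, `list_local_entries`, `remove_empty_xlsx`) are made up for illustration. On Azure blob storage the (name, size) pairs would have to come from the storage listing instead (as far as I know, the `azure-storage-blob` package exposes a name and size for each listed blob), and the delete step would call the storage client rather than `unlink`.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def split_by_size(entries: Iterable[Tuple[str, int]]) -> Tuple[List[str], List[str]]:
    """Split (name, size_in_bytes) pairs into (non_empty, empty) .xlsx names."""
    non_empty: List[str] = []
    empty: List[str] = []
    for name, size in entries:
        if not name.lower().endswith(".xlsx"):
            continue  # ignore anything that is not an Excel file
        (empty if size == 0 else non_empty).append(name)
    return non_empty, empty


def list_local_entries(root: str) -> List[Tuple[str, int]]:
    """Hypothetical stand-in for a blob listing: walk a local directory."""
    return [
        (str(p), p.stat().st_size)
        for p in sorted(Path(root).rglob("*"))
        if p.is_file()
    ]


def remove_empty_xlsx(root: str) -> List[str]:
    """Delete 0-byte .xlsx files under root and return the deleted paths."""
    _, empty = split_by_size(list_local_entries(root))
    for path in empty:
        Path(path).unlink()
    return empty
```

If deleting is too aggressive, the first list returned by `split_by_size` can be used instead to select only the non-empty files when building the dataset.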
Which option are you able to go for?
Went for another option. Asked the person responsible for the file sync to only deal with files over 0 bytes.
I agree with your points though; the reader should avoid empty files indeed.
That's option 1) for me.