How to ignore EmptyFileException

gjoseph · ‎10-01-2023

I have a connection to a blob storage and I would like to build a dataset from xlsx files. Some of them are empty (0-byte files), but there's no option to ignore these in the GUI...

Turribeach · ‎10-01-2023

How are you loading your files? Where do you actually get the exception?

gjoseph · ‎10-02-2023

Thank you for your response @Turribeach.

The Excel files are loaded from an Azure blob storage. I created a dataset in Dataiku by explicitly selecting them, then ran a recipe to store the amalgamated dataset.

The error I'm receiving is this:

Oops: an unexpected error occurred

Failed to open Excel file, caused by: EmptyFileException: The supplied file was empty (zero bytes long)

Please see our options for getting help

HTTP code: , type: java.io.IOException

java.io.IOException: Failed to open Excel file
	at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:422)
	at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:349)
	at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
	at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
	at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:184)
	at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:224)
	at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
	at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:378)
Caused by: org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)
	at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:111)
	at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:206)
	at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:143)
	at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:201)
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:140)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
	at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:123)
	at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
	at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
	at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:416)
	... 7 more
[09:13:25] [INFO] [dku.flow.activity] running compute_newar_dataset_stacked_prep_NP - activity is finished
[09:13:25] [ERROR] [dku.flow.activity] running compute_newar_dataset_stacked_prep_NP - Activity failed
java.io.IOException: Failed to open Excel file
	at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:422)
	at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:349)
	at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
	at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
	at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:184)
	at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:224)
	at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
	at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:378)
Caused by: org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)
	at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:111)
	at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:206)
	at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:143)
	at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:201)
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:140)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
	at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:123)
	at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
	at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
	at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:416)

Turribeach · ‎10-04-2023

Thanks, it's a lot easier to understand and reproduce the error now. I can confirm that I get the same error with empty Excel files. I also don't see any way of filtering out the empty files or avoiding the exception in a visual, no-code way.

Personally I think this is a bug for two reasons:

1) The File for Test and preview is clever enough to look for non-empty files, so why wouldn't the loader process do the same? It seems silly to do it for the tester and not the loader.
2) Why would the loader attempt to load an empty file? There's obviously nothing to load.

You should raise it with Dataiku Support, but I suspect that even if it is accepted as a bug it will not get much priority for fixing, since this is a problem of your own making (i.e. why do you have empty files in the blob storage?). And even if it were fixed, you would probably not be able to upgrade to the latest version that quickly.

So I think you have 2 options:

1) Fix your writer process so that it doesn't leave 0-byte files.

2) Pre-process the blob storage and remove any 0-byte files before you attempt to load them. This will certainly require some Python code, so it's not going to be possible with a visual recipe.
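
For option 2, a minimal sketch of the clean-up step could look like the following. It is deliberately storage-agnostic: `remove_empty_files`, `list_files` and `delete_file` are hypothetical names, not part of any Dataiku or Azure API, and you would need to wire the two callbacks to whatever client you use to reach the container. The in-memory dict at the bottom only stands in for the real storage.

```python
from typing import Callable, Iterable, List, Tuple


def remove_empty_files(
    list_files: Callable[[], Iterable[Tuple[str, int]]],
    delete_file: Callable[[str], object],
    suffix: str = ".xlsx",
    dry_run: bool = True,
) -> List[str]:
    """Report (and optionally delete) zero-byte files.

    list_files  -- hypothetical callback returning (path, size_in_bytes) pairs
    delete_file -- hypothetical callback that deletes a single path
    suffix      -- only files ending with this suffix are considered
    dry_run     -- when True, nothing is deleted; the paths are only returned
    """
    # Collect the full list first so deleting never interferes with listing.
    empty = [
        path
        for path, size in list_files()
        if path.lower().endswith(suffix) and size == 0
    ]
    if not dry_run:
        for path in empty:
            delete_file(path)
    return empty


if __name__ == "__main__":
    # Stand-in for the blob container: path -> size in bytes.
    fake_storage = {"a.xlsx": 1024, "b.xlsx": 0, "notes.txt": 0}
    removed = remove_empty_files(
        list_files=lambda: list(fake_storage.items()),
        delete_file=fake_storage.pop,
        dry_run=False,
    )
    print(removed)       # ['b.xlsx']
    print(fake_storage)  # {'a.xlsx': 1024, 'notes.txt': 0}
```

Running it with `dry_run=True` first is probably the safer habit, since it only lists what would be deleted. The step would have to run before the dataset is built, for example in a Python recipe or a scenario step.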

Which option are you able to go for?

gjoseph · ‎10-04-2023

Went for another option. Asked the person responsible for the file sync to only deal with files over 0 bytes.

I agree with your points though; the reader should avoid empty files indeed.

Turribeach · ‎10-04-2023

That's option 1) for me. 😉
