I am trying to download multiple CSV files into a single data source. The data is in a consistent format and can be stacked; I am already doing that in another pipeline, but with an individual data source for each file.

Any advice on how I can pull multiple similar files into one stacked dataset? My flow view is very busy, and I was hoping I could use the "add another source" button, but I cannot find the constraints on how to use it.

Once it processes the first file, I get this error:

[2017/04/15-07:57:00.679] [Thread-709] [INFO] [dku.remotefiles]  - Writing in /home/dataiku/dss/managed_datasets/E55V2.EFAST_SCH_C_P1_I3
[2017/04/15-07:57:00.679] [Thread-709] [INFO] [dku.remotefiles]  - outputPartition = NP substituted URL https://www.askebsa.dol.gov/FOIA%20Files/2015/Latest/F_SCH_C_PART1_ITEM3_2015_Latest.zip
[2017/04/15-07:57:00.679] [Thread-709] [WARN] [com.dataiku.dip.ApplicationConfigurator]  - GeneralSettings: create a temporary read transaction
[2017/04/15-07:57:00.738] [qtp1545237089-1453] [DEBUG] [dku.tracing]  - [ct: 2] Start call: /api/datasets/remote-files/get-fetch-status user=admin
[2017/04/15-07:57:00.742] [qtp1545237089-1453] [DEBUG] [dku.tracing]  - [ct: 6] Done call: /api/datasets/remote-files/get-fetch-status time=6ms user=admin
[2017/04/15-07:57:01.754] [Thread-709] [INFO] [dku.remotefiles]  - Copied = 7890
[2017/04/15-07:57:02.811] [qtp1545237089-1453] [DEBUG] [dku.tracing]  - [ct: 1] Start call: /api/datasets/remote-files/get-fetch-status user=admin
[2017/04/15-07:57:02.815] [qtp1545237089-1453] [DEBUG] [dku.tracing]  - [ct: 5] Done call: /api/datasets/remote-files/get-fetch-status time=5ms user=admin
[2017/04/15-07:57:04.341] [Thread-709] [INFO] [dku.remotefiles]  - outputPartition = NP substituted URL https://www.askebsa.dol.gov/FOIA%20Files/2014/Latest/F_SCH_C_PART1_ITEM3_2014_Latest.zip
[2017/04/15-07:57:04.341] [Thread-709] [WARN] [com.dataiku.dip.ApplicationConfigurator]  - GeneralSettings: create a temporary read transaction
[2017/04/15-07:57:04.342] [Thread-709] [ERROR] [dku.remotefiles]  - Download failed
java.lang.IllegalStateException: Connection pool shut down
	at org.apache.http.util.Asserts.check(Asserts.java:34)
	at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
	at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
	at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
	at com.dataiku.dip.input.remote.RemoteFilesSynchronizer.fetchHTTP(RemoteFilesSynchronizer.java:354)
	at com.dataiku.dip.input.remote.RemoteFilesSynchronizer.runFetch(RemoteFilesSynchronizer.java:267)
	at com.dataiku.dip.server.datasets.RemoteFilesDatasetTestService$FetchThread.run(RemoteFilesDatasetTestService.java:279)
[2017/04/15-07:57:04.342] [Thread-709] [ERROR] [dku.datasets]  - Fetch failed
java.io.IOException: Download failed for https://www.askebsa.dol.gov/FOIA%20Files/2014/Latest/F_SCH_C_PART1_ITEM3_2014_Latest.zip
	at com.dataiku.dip.input.remote.RemoteFilesSynchronizer.fetchHTTP(RemoteFilesSynchronizer.java:408)
	at com.dataiku.dip.input.remote.RemoteFilesSynchronizer.runFetch(RemoteFilesSynchronizer.java:267)
	at com.dataiku.dip.server.datasets.RemoteFilesDatasetTestService$FetchThread.run(RemoteFilesDatasetTestService.java:279)
Caused by: java.lang.IllegalStateException: Connection pool shut down
	at org.apache.http.util.Asserts.check(Asserts.java:34)
	at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
	at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
	at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
	at com.dataiku.dip.input.remote.RemoteFilesSynchronizer.fetchHTTP(RemoteFilesSynchronizer.java:354)
	... 2 more
[2017/04/15-07:57:04.342] [Thread-709] [INFO] [dku.datasets]  - Fetch finished, final status:
{
  "running": false,
  "error": true,
  "filesTotal": 1,
  "sizeTotal": 7655819,
  "filesToFetch": 1,
  "sizeToFetch": 7655819,
  "filesFailed": 1,
  "filesFetched": 1,
  "sizeFetched": 7655819,
  "filesDeleted": 0,
  "perSource": [
    {
      "error": false,
      "filesTotal": 1,
      "sizeTotal": 7655819,
      "filesToFetch": 1,
      "sizeToFetch": 7655819,
      "filesFailed": 0,
      "filesFetched": 1,
      "sizeFetched": 7655819
    },
    {
      "error": true,
      "filesTotal": 0,
      "sizeTotal": 0,
      "filesToFetch": 0,
      "sizeToFetch": 0,
      "filesFailed": 1,
      "filesFetched": 0,
      "sizeFetched": 0
    }
  ],
  "errorMessages": [
    "Download failed for https://www.askebsa.dol.gov/FOIA%20Files/2014/Latest/F_SCH_C_PART1_ITEM3_2014_Latest.zip: Connection pool shut down"
  ],
  "master": {
    "running": true,
    "error": true,
    "filesTotal": 1,
    "sizeTotal": 7655819,
    "filesToFetch": 1,
    "sizeToFetch": 7655819,
    "filesFailed": 1,
    "filesFetched": 1,
    "sizeFetched": 7655819,
    "filesDeleted": 0,
    "perSource": [
      {
        "error": false,
        "filesTotal": 1,
        "sizeTotal": 7655819,
        "filesToFetch": 1,
        "sizeToFetch": 7655819,
        "filesFailed": 0,
        "filesFetched": 1,
        "sizeFetched": 7655819
      },
      {
        "error": true,
        "filesTotal": 0,
        "sizeTotal": 0,
        "filesToFetch": 0,
        "sizeToFetch": 0,
        "filesFailed": 1,
        "filesFetched": 0,
        "sizeFetched": 0
      }
    ],
    "errorMessages": [
      "Download failed for https://www.askebsa.dol.gov/FOIA%20Files/2014/Latest/F_SCH_C_PART1_ITEM3_2014_Latest.zip: Connection pool shut down"
    ]
  }
}
[2017/04/15-07:57:04.885] [qtp1545237089-1453] [DEBUG] [dku.tracing]  - [ct: 1] Start call: /api/datasets/remote-files/get-fetch-status user=admin
[2017/04/15-07:57:04.889] [qtp1545237089-1453] [DEBUG] [dku.tracing]  - [ct: 5] Done call: /api/datasets/remote-files/get-fetch-status time=5ms user=admin
[2017/04/15-07:57:04.972] [qtp1545237089-1453] [DEBUG] [dku.tracing]  - [ct: 1] Start call: /api/datasets/test-and-detect-format user=admin [projectKey=E55V2]
[2017/04/15-07:57:04.977] [qtp1545237089-1453] [DEBUG] [com.dataiku.dip.connections.FilesBasedConnectionsDAO] test-RemoteFiles - ConnectionsDAO: create a temporary read transaction
[2017/04/15-07:57:04.978] [qtp1545237089-1453] [WARN] [dku.dataset.inspector] test-RemoteFiles - DatasetInspector: create a temporary read transaction
[2017/04/15-07:57:04.983] [qtp1545237089-1453] [INFO] [dku.datasets] test-RemoteFiles - Got it, closing
[2017/04/15-07:57:04.983] [qtp1545237089-1453] [INFO] [dku.datasets] test-RemoteFiles - Close done
asked by matthew

1 Answer

Hi,

This is a known issue in the current HTTP dataset with multiple sources. At the moment, the only workaround is to make several HTTP datasets and use a stack recipe.

This will be fixed in version 4.1 of DSS (end of summer).
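
If a code-based workaround is acceptable in the meantime, the same stacking can be sketched in plain Python outside the HTTP dataset. This is not the stack recipe or any DSS API. It is a minimal standard-library sketch, assuming each archive holds one UTF-8 CSV with the same header. The helper names (`build_urls`, `fetch`, `read_csv_from_zip`, `stack_archives`) are made up for illustration, and the URL pattern is inferred from the two URLs in the log above.

```python
import csv
import io
import urllib.request
import zipfile


def build_urls(years):
    """Build one URL per year (pattern inferred from the log, an assumption)."""
    template = ("https://www.askebsa.dol.gov/FOIA%20Files/{year}/Latest/"
                "F_SCH_C_PART1_ITEM3_{year}_Latest.zip")
    return [template.format(year=year) for year in years]


def fetch(url):
    """Download one archive; each call opens and closes its own connection."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_csv_from_zip(payload):
    """Return (header, rows) for the first .csv member of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not names:
            raise ValueError("archive contains no CSV file")
        with archive.open(names[0]) as raw:
            # Assumption: the CSV is UTF-8 encoded.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            header = next(reader, None)
            if header is None:
                raise ValueError("CSV file is empty")
            return header, list(reader)


def stack_archives(payloads):
    """Stack the CSVs from several archives, requiring identical headers."""
    stacked_header, stacked_rows = None, []
    for payload in payloads:
        header, rows = read_csv_from_zip(payload)
        if stacked_header is None:
            stacked_header = header
        elif header != stacked_header:
            raise ValueError(
                "header mismatch: %r != %r" % (header, stacked_header))
        stacked_rows.extend(rows)
    return stacked_header, stacked_rows
```

To use it, something like `header, rows = stack_archives(fetch(u) for u in build_urls([2015, 2014]))` would download both years and return one stacked table. The header check is deliberate: it fails loudly if one year's file has a different layout rather than silently misaligning columns.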