
I've gotten this error in many scenarios across many versions of DSS, and always assumed it must be a known bug that would eventually be fixed. By this point I'm starting to wonder whether I'm doing something wrong. Using async=True seems pretty straightforward, but I often end up with the error below. Any ideas?

Thanks,
John

 

[2018/12/16-17:15:27.415] [Exec-279] [INFO] [process]  - Traceback (most recent call last):
[2018/12/16-17:15:27.415] [Exec-279] [INFO] [process]  -   File "/home/dataiku/dss_data/scenarios/ERM_MASTER_PORTFOLIO/BUILD_MODEL_OUTPUTS/2018-12-16-17-10-38-992/custom-scenario/script.py", line 27, in <module>
[2018/12/16-17:15:27.416] [Exec-279] [INFO] [process]  -     while not sev_br.is_done():
[2018/12/16-17:15:27.416] [Exec-279] [INFO] [process]  -   File "/home/dataiku/dataiku-dss-5.0.4/python/dataiku/scenario/step.py", line 98, in is_done
[2018/12/16-17:15:27.416] [Exec-279] [INFO] [process]  -     }, err_msg="Failed to track step future")
[2018/12/16-17:15:27.416] [Exec-279] [INFO] [process]  -   File "/home/dataiku/dataiku-dss-5.0.4/python/dataiku/core/intercom.py", line 82, in backend_json_call
[2018/12/16-17:15:27.417] [Exec-279] [INFO] [process]  -     return _handle_json_resp(backend_api_post_call(path, data, **kwargs), err_msg = err_msg)
[2018/12/16-17:15:27.417] [Exec-279] [INFO] [process]  -   File "/home/dataiku/dataiku-dss-5.0.4/python/dataiku/core/intercom.py", line 148, in _handle_json_resp
[2018/12/16-17:15:27.417] [Exec-279] [INFO] [process]  -     raise Exception("%s: %s" % (err_msg, _get_error_message(err_data).encode("utf8")))
[2018/12/16-17:15:27.417] [Exec-279] [INFO] [process]  - Exception: Failed to track step future: JobID not found : OwNLUtij. Running jobs:  ["LZ7ni3cz","KHyjzJ2a","GtxHfmlh","1h3oQSvT","cd9JhCAF","JViR3tLW","z5gJUDu5"]

 

Alex,

This has been happening for me over numerous versions, probably going all the way back to 3.x. I am currently on the latest release (5.0.4, I believe). It is always triggered while waiting for an async call to finish, e.g.:

br = scenario.run_scenario('FOO', async=True)
while not br.is_done():
    time.sleep(1)

For some reason it seems to happen more often if I check both branches in the same statement, e.g.:
while not (br.is_done() and br2.is_done()):
    time.sleep(1)

instead of
while not br.is_done():
    while not br2.is_done():
        time.sleep(1)

It looks to me like overzealous cleanup: the job IDs seem to be discarded before I am done referencing them. Nothing I have tried reliably prevents the error.

Best,
John
Thanks for the details. Is that happening when using a Custom Python scenario or a script to manipulate an existing scenario from outside Dataiku? Could you explain the context of these code snippets so we can try and reproduce the issue?
Hi, let us know if you have some time to explain more, we would like to dig into this issue further. Thanks, Alex
Alex,

I will send you some example code as soon as I have a chance. I am a bit surprised that this has not come up before, as it is not any single scenario or function call that has been an issue for me: it has happened across many Python scenarios using async in build_dataset, in run_scenario, and run_step, and on both GCP and AWS.

Best,
John
Thanks John, indeed this is the first time this has been reported. This is specifically for custom Python scenarios, correct? It would be great if you could send us a project export, or at least the code needed to reproduce the issue. Cheers, Alex
Alex,

I do not see a way to direct message you. Could you please let me know how to do that? I will send you code as soon as I am able, but I will not be able to provide you with a project export.

Best,
John
Hi John, thanks, I will try to reproduce the issue with the code. I assume it is code for a custom Python scenario? Knowing the context in which the code runs would be helpful. Could you add the code to a comment on this thread? Otherwise you can send it to Alexandre dot Combessie at Dataiku dot com.
I will send you an email. Thank you.
Thanks, I have been able to reproduce the issue and reported it to our R&D team. From my initial tests, I have noted that the same code does work if all the scenarios are in the same project. We will investigate more.
Great, thank you. I believe I have also experienced this when building datasets within the same project, but that was on a previous DSS version and I do not believe I still have the code to reproduce it.

1 Answer

Best answer

Hi John,

We had a deeper look at this issue. The explanation is that once .is_done() has returned True, it cannot be called again on the same step. So ideally you should write something like:

scenario_1_done = False
scenario_2_done = False

while not scenario_1_done or not scenario_2_done:
    if not scenario_1_done:
        scenario_1_done = scenario_1.is_done()
    if not scenario_2_done:
        scenario_2_done = scenario_2.is_done()
    time.sleep(1)

Hope it helps,

Alex
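
The pattern in the answer above can be generalized to any number of async steps. The following is only a minimal sketch, not part of the DSS API: wait_for_all and FakeHandle are hypothetical names introduced here for illustration, and FakeHandle merely imitates the behaviour described in this thread (is_done() raises once it has already returned True).

```python
import time


def wait_for_all(handles, poll_interval=1.0, sleep=time.sleep):
    """Block until every handle reports done, never re-polling a finished one.

    Hypothetical helper: `handles` is any iterable of objects exposing
    is_done(), such as the objects returned by async scenario calls.
    """
    pending = list(handles)
    while pending:
        # Keep only the handles that are still running. A handle that has
        # returned True is dropped, so is_done() is never called on it again.
        pending = [h for h in pending if not h.is_done()]
        if pending:
            sleep(poll_interval)


class FakeHandle(object):
    """Hypothetical stand-in for a step handle, for illustration only."""

    def __init__(self, polls_until_done):
        self.remaining = polls_until_done
        self.finished = False

    def is_done(self):
        # Imitates the behaviour reported in this thread: polling again
        # after True has been returned raises an error.
        if self.finished:
            raise Exception("Failed to track step future: JobID not found")
        self.remaining -= 1
        self.finished = self.remaining <= 0
        return self.finished
```

In a custom Python scenario this would be called as wait_for_all([scenario_1, scenario_2]), with the two variables being the handles from the answer above.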

Thanks, this appears to be working! This also explains why it seemed to always fail when I did it the way suggested in the example code:

while not scenario_1.is_done() or not scenario_2.is_done():
    time.sleep(1)

while the nested loops I used only failed sometimes. With nested loops, it should only fail if the scenario checked by the inner loop finishes first, because the outer loop then re-enters the inner loop and calls is_done() on that scenario again.
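
The difference between the two loop shapes can be checked with a small simulation. This is a sketch under stated assumptions, not DSS code: Clock, OneShotHandle, and fails are hypothetical names, and OneShotHandle only imitates the behaviour described in the answer (is_done() raises once it has already returned True). Finish times are measured in loop iterations on a simulated clock.

```python
class Clock(object):
    """Simulated time, advanced by one tick per loop iteration."""

    def __init__(self):
        self.now = 0

    def sleep(self):
        self.now += 1


class OneShotHandle(object):
    """Hypothetical handle: is_done() raises once it has returned True."""

    def __init__(self, clock, finish_time):
        self.clock = clock
        self.finish_time = finish_time
        self.reported = False

    def is_done(self):
        if self.reported:
            raise RuntimeError("JobID not found")
        self.reported = self.clock.now >= self.finish_time
        return self.reported


def single_condition_loop(first, second, clock):
    # `or` short-circuits: `second` is polled only once `first` is done,
    # and `first` is polled again on the next iteration.
    while not first.is_done() or not second.is_done():
        clock.sleep()


def nested_loops(outer, inner, clock):
    while not outer.is_done():
        while not inner.is_done():
            clock.sleep()


def fails(loop, finish_a, finish_b):
    """Return True if the loop polls a handle that has already finished."""
    clock = Clock()
    try:
        loop(OneShotHandle(clock, finish_a), OneShotHandle(clock, finish_b), clock)
        return False
    except RuntimeError:
        return True
```

In this model the single-condition loop raises when the first scenario finishes before the second, and the nested loops raise when the inner scenario finishes before the outer one, which matches the behaviour reported above.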