Bug #1041

busy cluster test failure: Although job status is marked "Completed", the job did not appear to run and no results are generated

Added by Mary Laser over 1 year ago. Updated about 1 year ago.

Status: Rejected
Start date: 02/15/2012
Priority: High
Due date:
Assignee: Alex Norton
% Done: 0%
Category: Scheduler
Target version: 2.0.0
Rank:
Tester:

Description

It's not clear whether this is a scheduler bug or an agent bug. Will assign to Alex for further investigation.
Four large uploads were started in the test cluster; uploads 2 and 3 are the failures reported here. Details follow:

  1. One failed legitimately due to a typo.
  2. One upload was paused during wget and removed before it completed, yet it was marked "Completed". Subsequent jobs were also marked "Completed", but did not appear to run (how could they? There was no upload to operate on) and no results were generated.
    2012-02-14 17:34:18 scheduler [18137] :: JOB[110].wget_agent[18793.griphook]: agent status change: AG_RUNNING -> AG_PAUSED
    2012-02-14 17:34:18 scheduler [18137] :: HOST[griphook] load decreased to 4
    2012-02-14 17:34:18 scheduler [18137] :: JOB[110]: job status changed: JOB_STARTED => JOB_COMPLETE
    2012-02-14 17:34:18 scheduler [18137] :: SIGNALS: received sigchld for pid 18793
    2012-02-14 17:34:18 scheduler [18137] :: JOB[110].wget_agent[18793.griphook]: successfully remove from the system
    2012-02-14 17:34:18 scheduler [18137] :: JOB[110]: job removed from system
  3. Upload 3 was paused, restarted, paused again, and killed. wget was marked "Completed" even though it was killed (maybe the status should be "Aborted"?). All subsequent jobs were marked "Completed", but did not appear to run and no results were generated.
    2012-02-14 17:35:10 scheduler [18137] :: JOB[124].wget_agent[18814.griphook]: agent updated correctly, processed 1 items: 2
    2012-02-14 17:35:45 scheduler [18137] :: INTERFACE: received "pause 124"
    2012-02-14 17:35:45 scheduler [18137] :: JOB[124]: job status changed: JOB_STARTED => JOB_CLI_PAUSED
    2012-02-14 17:35:45 scheduler [18137] :: JOB[124].wget_agent[18814.griphook]: agent status change: AG_RUNNING -> AG_PAUSED
    2012-02-14 17:35:45 scheduler [18137] :: HOST[griphook] load decreased to 3
    2012-02-14 17:36:08 scheduler [18137] :: INTERFACE: received "restart 124"
    2012-02-14 17:36:08 scheduler [18137] :: JOB[124].wget_agent[18814.griphook]: agent status change: AG_PAUSED -> AG_RUNNING
    2012-02-14 17:36:08 scheduler [18137] :: HOST[griphook] load increased to 4
    2012-02-14 17:36:08 scheduler [18137] :: JOB[124]: job status changed: JOB_CLI_PAUSED => JOB_STARTED
    2012-02-14 17:36:08 scheduler [18137] :: JOB[124].wget_agent[18814.griphook]: agent status change: AG_RUNNING -> AG_PAUSED
    2012-02-14 17:36:08 scheduler [18137] :: HOST[griphook] load decreased to 3
    2012-02-14 17:36:08 scheduler [18137] :: JOB[124]: job status changed: JOB_STARTED => JOB_COMPLETE
    2012-02-14 17:36:08 scheduler [18137] :: SIGNALS: received sigchld for pid 18814
    2012-02-14 17:36:08 scheduler [18137] :: JOB[124].wget_agent[18814.griphook]: successfully remove from the system
    2012-02-14 17:36:08 scheduler [18137] :: JOB[124]: job removed from system
  4. The final upload was not tampered with, and wget, unpack, and adj2nest all completed. nomos and copyright are currently running.
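
Both log excerpts above show the same pattern: the last agent state recorded for the job is AG_PAUSED, and the job then goes straight from JOB_STARTED to JOB_COMPLETE. A minimal Python sketch for scanning a scheduler log for that pattern follows. The regular expressions are inferred only from the lines quoted in this report, the function name is hypothetical, and the check is a heuristic rather than a statement of how the scheduler decides completion.

```python
import re

# Line layout inferred from the log excerpts in this report (an assumption).
LINE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) scheduler \[\d+\] :: (?P<msg>.*)$"
)
AGENT = re.compile(r"JOB\[(\d+)\]\.\w+\[[^\]]+\]: agent status change: (\w+) -> (\w+)")
JOB = re.compile(r"JOB\[(\d+)\]: job status changed: (\w+) => (\w+)")


def suspicious_completions(lines):
    """Return (job_id, timestamp) pairs for jobs marked JOB_COMPLETE
    while the last agent state seen for that job was AG_PAUSED."""
    last_agent_state = {}
    hits = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # not a scheduler log line
        msg = m.group("msg")
        agent = AGENT.match(msg)
        if agent:
            # Remember the most recent state of this job's agent.
            last_agent_state[agent.group(1)] = agent.group(3)
            continue
        job = JOB.match(msg)
        if job and job.group(3) == "JOB_COMPLETE":
            if last_agent_state.get(job.group(1)) == "AG_PAUSED":
                hits.append((job.group(1), m.group("ts")))
    return hits


# Three lines copied from the excerpt for upload 2.
SAMPLE = [
    "2012-02-14 17:34:18 scheduler [18137] :: JOB[110].wget_agent[18793.griphook]: agent status change: AG_RUNNING -> AG_PAUSED",
    "2012-02-14 17:34:18 scheduler [18137] :: HOST[griphook] load decreased to 4",
    "2012-02-14 17:34:18 scheduler [18137] :: JOB[110]: job status changed: JOB_STARTED => JOB_COMPLETE",
]

if __name__ == "__main__":
    print(suspicious_completions(SAMPLE))  # [('110', '2012-02-14 17:34:18')]
```

Run against the full scheduler log, this would list every job that was marked complete immediately after its agent was paused, which may help show whether agents other than wget are affected.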

History

Updated by Alex Norton over 1 year ago

  • Status changed from New to In Progress

I still don't know whether this is a scheduler bug or an agent bug. For both failed wget agents, when I repeated the procedure that caused the failure, the agents finished correctly.

Possible causes:

  • There could be a race condition in the scheduler related to pausing/killing agents. Until I see this reproduced with an agent other than wget, I'm inclined to believe the cause is something else.
  • Wget could be reporting something incorrectly when it is paused/killed. I don't know whether the child process is paused correctly, or whether it will run to completion while wget_agent is paused. I see this as more likely, since wget seems to be the only agent this has been reported on.
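
One way to narrow down the second possibility is to check how the child actually ended, independently of whatever the agent reports. The sketch below is a small Python illustration, not scheduler code: it starts a throwaway child, kills it, and classifies the result from the exit status. The function names and the status labels ("Completed", "Aborted", "Failed") are illustrative assumptions, and the negative return code for death by signal applies to POSIX systems.

```python
import subprocess
import sys


def classify_exit(returncode):
    """Map a subprocess return code to an illustrative job status.

    On POSIX, subprocess reports termination by signal N as -N.
    The status strings are hypothetical labels, not scheduler constants.
    """
    if returncode == 0:
        return "Completed"
    if returncode < 0:
        return "Aborted"  # terminated by a signal, e.g. killed
    return "Failed"


def run_and_kill():
    """Start a long-running child, kill it, and classify how it ended."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.kill()   # SIGKILL on POSIX
    child.wait()   # reap the child so returncode is set
    return classify_exit(child.returncode)


if __name__ == "__main__":
    print(run_and_kill())
```

If a killed agent were classified this way when its exit is reaped, a kill would not be recorded as a normal completion, regardless of what the agent itself reported beforehand.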

Updated by Mary Laser about 1 year ago

  • Status changed from In Progress to Rejected

The two uploads that failed are related to changes in job status and how they are reflected in the UI (as is #795). Bob's rewrite of v2.0 job scheduling will render these issues moot. Rejecting.
