Bug #1041
busy cluster test failure: Although job status is marked "Completed", the job did not appear to run and no results are generated
| Status: | Rejected | Start date: | 02/15/2012 |
|---|---|---|---|
| Priority: | High | Due date: | |
| Assignee: | | % Done: | 0% |
| Category: | Scheduler | | |
| Target version: | 2.0.0 | | |
| Rank: | | Tester: | |
Description
It's not clear if this is a scheduler or agent bug. Will assign to Alex for further investigation.
Four large uploads were started in the test cluster; #2 and #3 failed. Details follow:
- One failed legitimately due to a typo.
- One upload was paused during wget and removed before it completed, yet it was marked "Completed". Subsequent jobs were also marked "Completed", but did not appear to run (how could they? There was no upload to operate on) and no results were generated.
2012-02-14 17:34:18 scheduler [18137] :: JOB[110].wget_agent[18793.griphook]: agent status change: AG_RUNNING -> AG_PAUSED
2012-02-14 17:34:18 scheduler [18137] :: HOST[griphook] load decreased to 4
2012-02-14 17:34:18 scheduler [18137] :: JOB[110]: job status changed: JOB_STARTED => JOB_COMPLETE
2012-02-14 17:34:18 scheduler [18137] :: SIGNALS: received sigchld for pid 18793
2012-02-14 17:34:18 scheduler [18137] :: JOB[110].wget_agent[18793.griphook]: successfully remove from the system
2012-02-14 17:34:18 scheduler [18137] :: JOB[110]: job removed from system
- Upload 3 was paused, restarted, paused and killed. wget was marked "Completed" even though it was killed (maybe the status should be "Aborted"?). All subsequent jobs were marked "Completed", but did not appear to run and no results were generated.
2012-02-14 17:35:10 scheduler [18137] :: JOB[124].wget_agent[18814.griphook]: agent updated correctly, processed 1 items: 2
2012-02-14 17:35:45 scheduler [18137] :: INTERFACE: received "pause 124"
2012-02-14 17:35:45 scheduler [18137] :: JOB[124]: job status changed: JOB_STARTED => JOB_CLI_PAUSED
2012-02-14 17:35:45 scheduler [18137] :: JOB[124].wget_agent[18814.griphook]: agent status change: AG_RUNNING -> AG_PAUSED
2012-02-14 17:35:45 scheduler [18137] :: HOST[griphook] load decreased to 3
2012-02-14 17:36:08 scheduler [18137] :: INTERFACE: received "restart 124"
2012-02-14 17:36:08 scheduler [18137] :: JOB[124].wget_agent[18814.griphook]: agent status change: AG_PAUSED -> AG_RUNNING
2012-02-14 17:36:08 scheduler [18137] :: HOST[griphook] load increased to 4
2012-02-14 17:36:08 scheduler [18137] :: JOB[124]: job status changed: JOB_CLI_PAUSED => JOB_STARTED
2012-02-14 17:36:08 scheduler [18137] :: JOB[124].wget_agent[18814.griphook]: agent status change: AG_RUNNING -> AG_PAUSED
2012-02-14 17:36:08 scheduler [18137] :: HOST[griphook] load decreased to 3
2012-02-14 17:36:08 scheduler [18137] :: JOB[124]: job status changed: JOB_STARTED => JOB_COMPLETE
2012-02-14 17:36:08 scheduler [18137] :: SIGNALS: received sigchld for pid 18814
2012-02-14 17:36:08 scheduler [18137] :: JOB[124].wget_agent[18814.griphook]: successfully remove from the system
2012-02-14 17:36:08 scheduler [18137] :: JOB[124]: job removed from system
- The final upload was not tampered with, and wget, unpack and adj2nest all completed. nomos and copyright are currently running.
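On the question of whether a killed wget should be reported as "Aborted" rather than "Completed": a minimal sketch of how a parent process can tell the two cases apart on POSIX, where Python's `subprocess` reports a child that died from a signal with a negative return code. This is not the scheduler's code; `classify_exit` and the status strings are hypothetical names used only for illustration.

```python
import subprocess
import sys


def classify_exit(returncode):
    """Map a child's return code to a hypothetical job status string."""
    if returncode is None:
        return "Running"      # child has not been reaped yet
    if returncode < 0:
        # POSIX only: Popen reports death by signal N as -N.
        return "Aborted"
    if returncode == 0:
        return "Completed"
    return "Failed"           # child exited on its own with an error code


def run_and_kill():
    """Start a long-running child, kill it, and return its return code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    child.kill()              # SIGKILL on POSIX
    child.wait()
    return child.returncode


def run_to_completion():
    """Start a child that exits normally and return its return code."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    return child.returncode
```

With this mapping, a killed child would be classified as "Aborted" and only a child that exits with status 0 as "Completed".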
History
Updated by Alex Norton over 1 year ago
- Status changed from New to In Progress
I still don't know if this is a scheduler or agent bug. For both failed wget agents, when I repeated the procedure that caused the failure, the agents finished correctly.
Possible causes:
- There could be a race condition in the scheduler relating to pausing/killing agents. Until I see this duplicated with an agent other than wget, I'm inclined to believe the cause is something else.
- Wget could be reporting something incorrectly when it is paused/killed. I don't know if the child process is paused correctly, or if it will run to completion while wget_agent is paused. I see this as more likely, since wget seems to be the only agent that this is reported on.
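To check whether agents other than wget show the same pattern, the scheduler log could be scanned for jobs that move to JOB_COMPLETE while the last recorded state of their agent was AG_PAUSED. The sketch below assumes the log line format shown in the description. It also assumes that this sequence is what distinguishes the failed jobs, which nothing in this ticket confirms; it may be a normal shutdown path. `find_suspect_completions` is a hypothetical helper, not part of the scheduler.

```python
import re

# Patterns follow the log lines quoted in the description.
AGENT_RE = re.compile(
    r"JOB\[(\d+)\]\.(\w+)\[[^\]]*\]: agent status change: (\w+) -> (\w+)"
)
JOB_RE = re.compile(r"JOB\[(\d+)\]: job status changed: (\w+) => (\w+)")


def find_suspect_completions(lines):
    """Return (job_id, agent_name) pairs for jobs that were marked
    JOB_COMPLETE while their agent's last recorded state was AG_PAUSED."""
    last_agent_state = {}  # job id -> (agent name, latest agent state)
    suspects = []
    for line in lines:
        match = AGENT_RE.search(line)
        if match:
            job, agent, _old, new = match.groups()
            last_agent_state[job] = (agent, new)
            continue
        match = JOB_RE.search(line)
        if match:
            job, _old, new = match.groups()
            agent, state = last_agent_state.get(job, (None, None))
            if new == "JOB_COMPLETE" and state == "AG_PAUSED":
                suspects.append((job, agent))
    return suspects
```

Running it over a full scheduler log and grouping the results by agent name would show whether wget_agent is really the only agent involved.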
Updated by Mary Laser about 1 year ago
- Status changed from In Progress to Rejected
The two uploads that failed are related to changing job status and how it is reflected in the UI (as is #795). Bob's rewrite of v2.0 job scheduling will render these issues moot. Rejecting.