
Has anyone else seen the K1000 / SMA OfflineScheduler Service create an infinite reboot loop?

We have a daily scheduled reboot for all our desktop systems.  This script runs via the offline scheduler so that it runs regardless of the desktop being on network/VPN.  We created the script in 2018 and it has essentially run successfully each day since that time.

However, on Saturday morning all agents running the script went into an infinite loop.  The system would restart, but the script would run again.  The script was timed to run so soon after startup that there was no way to intervene.

We finally used a remote command tool and sc config OFFLINESCHEDULER start=disabled to kill the service.  We currently have the offline scheduler disabled company-wide until Quest helps us complete a root-cause analysis.

Our KACE systems management appliance was updated to the latest version (12.0.149) on 1/25 at 10 PM.

Our KACE systems management clients were updated to the latest client version (12.0.38) on 2/1.  We have no clients in our system not on this version.

The Restart Script has been in place for so long that I cannot find the last time it was changed before yesterday's events in the KACE logs.



Answers (2)

Posted by: jmendo 2 years ago
White Belt
1

KACE published a KB article with a statement about this:

Offline KScript Scheduler Infinite Loop (337572)

The issue was related to the daylight saving time change and it won't affect the scheduler anymore.

The KScheduler service can be started again and the Offline KScripts can be enabled.

This issue will not occur again.

Posted by: KyleHoran 2 years ago
White Belt
1

We experienced this exact same event.  We too have a reboot offline script that runs daily, and on Saturday morning our computers started to reboot in a loop.  I wonder if it had anything to do with daylight saving time, but that change was not set to happen until Sunday morning.  However, the KACE server looks to time.kace.com.  I don't know who manages that source... maybe that time source was not set correctly, so the server time was wrong and it kept pushing out this script?

Or, if the offline scripts are run based on the client computer's time and the offline script is stored on the local computer agent, then maybe the computers had the wrong system time and kept thinking it was time to run that script.  This is all speculation.  We are in the same boat as you and are still trying to figure out what could cause this behavior.

I do see a couple of Windows patches being deployed prior to this event, including the Windows Malicious Software Removal Tool.  I wonder if that could have impacted this script?


If you find any answers, would you mind updating this thread?  I will certainly do the same if we find the answer.
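
For illustration only, here is a minimal Python sketch of the kind of "is it time to run?" logic being speculated about above. It is not KACE's code; the function name `next_daily_run` and the 04:00 run time are hypothetical. It shows why a daily job needs its next run time to land strictly in the future: if the stored next run is still "today at the scheduled time" after the job has fired, the job looks due again on every startup.

```python
from datetime import datetime, time, timedelta


def next_daily_run(now: datetime, run_at: time) -> datetime:
    """Return the next occurrence of run_at that is strictly after now.

    Hypothetical helper, not part of any KACE component. It uses naive
    local datetimes and does not account for daylight saving time shifts.
    """
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed (or is happening right now),
        # so the next run must be tomorrow. Skipping this check would
        # leave the job permanently "due".
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    run_at = time(4, 0)  # assumed daily run time, for illustration
    just_ran = datetime(2022, 3, 12, 4, 0, 30)
    print(next_daily_run(just_ran, run_at))
```

With the guard in place, a machine that restarts shortly after the scheduled time gets a next run on the following day instead of immediately.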


Comments:
  • I'll follow up here.

    Did you open a support ticket? - Jon_at_GLS 2 years ago
    • I just opened one a couple of hours ago. I am about to gather logs from a couple of computers to provide to Quest. Have you gotten any feedback from them yet?

      I have two reboot jobs, applied to different targets. I tried applying the other job to a computer and it won't apply at all. In the kagent log it says "KBotScriptManager::GetScript - Failed to get kbot script".

      For some reason it is not downloading this other script. - KyleHoran 2 years ago
      • They asked us to do the following:
        1. Create a new Offline KScript from scratch with the same instructions as the one you're reporting as affected (don't use the "DUPLICATE" button).
        2. Gather 1 or 2 affected devices, and assign them into the Offline KScript schedule.
        3. Trigger the script from the appliance.
        4. From the "Run Now Status" screen, take a screenshot to take a look over the timestamp.
        5. Get a kapture state from the affected device/s.
        - https://support.quest.com/kace-systems-management-appliance/kb/276850/kace-agent-toolkit-kat-now-available-in-the-sma
        - https://support.quest.com/kace-systems-management-appliance/kb/263376/using-the-kace-agent-toolkit-kat-kapture
        6. Get a copy from the appliance logs.
        - https://support.quest.com/kace-systems-management-appliance/kb/134230/obtaining-appliance-server-logs-on-a-kace-appliance - Jon_at_GLS 2 years ago
    • I found that in the "kagent.log" files you typically see a field of data that shows the next scheduled date being set. If you search the kagent.log file for the text "Setting Next" (without the quotes), you will find what I am talking about. This should be the NEXT run date and time. When the script ran correctly, that field reported the following day's date. However, on Saturday the date in the "Setting Next" field was being set to the current date/time of the script running. I think this is what caused the computers to reboot and then run this script again and again, in a never-ending loop. We just don't know WHY it did not set the right date in Setting Next. I have sent that to Quest for further review. - KyleHoran 2 years ago
    • We performed steps 1-3, and we are still having the issue on systems running the offline scheduler. We are being told they are investigating, but if it is a bug, it won't be resolved until the next minor release. - Jon_at_GLS 2 years ago
      • So you are still getting stuck in a boot loop when you try to use the offline scheduler? I was able to disable the script in Kace, force an inventory to the affected computers, re-enable the script in Kace, force another Inventory on the affected computers...then it started working correctly again. - KyleHoran 2 years ago
      • Good to know. We're a little "Once bit, twice shy." This issue cost us a lot in lost opportunity and wages that produced no value.

        Quest has their product team working on this and will release a fix in the future. - Jon_at_GLS 2 years ago
  • So what was the actual root cause if I may ask. If this is a kace defect this is absolutely terrible. How many devices were affected?

    I'd like to avoid this as it could mean my damn job.

    BTW I just did a restart script for adhoc after hrs server restarts.. this seems to work fine.

    Launch “$(KACE_SYS_DIR)\shutdown.exe” with params “ /r /t 90 /d P:0:0”.
    Log “DOS Restart Command Issued Successfully.” to “status”. - barchetta 2 years ago
    • The root cause (so far) communicated by Quest in our ticket is that it is a bug, so not really an actual root cause. Our reboot job looks pretty much like yours, with the reboot being set to 30, not 90. I know 90 is recommended though. We also pass a comment to the script. That job has been running for years with no issue, and all of a sudden on 3/12/22 it got messed up. I don't know whether it was the job exactly or the kschedulesvs.exe that runs on each computer and handles offline scripts. From the logs I reviewed, I see that in the section where it sets the NEXT time it should run, it was not properly setting the next time to the following day. It kept setting it to 4:00 of the current day, which means that when the computer restarted it would see that it had a script scheduled to run... endless loop. For us it affected over 100 computers. All of our back office staff were affected for hours until we could determine that it was this job in KACE that was causing it. At first all we knew was that the computers would keep rebooting. It took a bit of trial and error, killing different services, to eventually find it was KACE. Then we had to figure out what in KACE, and after some time (hours) we determined it was this offline KScript. Good times. - KyleHoran 2 years ago
      • I'm sorry this happened to you. I don't do any offline scripting and won't until I find out this is resolved. If you get the defect #, please post it so we can refer to it. What a pain and unneeded stress on you. I would hope KACE would do the right thing and immediately notify all of its customers with specific details on how to avoid it. We run Dell and Windows patching... sure hope those aren't involved. We patch our on-prem servers. - barchetta 2 years ago
      • I agree that they should be notifying their customers if they feel that this is a bug that could potentially cause this issue again. We do have the offline script enabled again. If the same problem happens, I believe we can just disable it and do a forced inventory, and that will stop the rebooting. The bug # that Quest has assigned to this is K1A-3906 (product defect ID). - KyleHoran 2 years ago
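
For anyone wanting to repeat the "Setting Next" check described in the comments above across several machines, here is a minimal Python sketch. It assumes only what the thread states: that kagent.log contains lines with the text "Setting Next". The rest of the line format is not known here, so the script prints matching lines as-is rather than trying to parse dates. The function names are hypothetical, and the log path is something you would supply yourself.

```python
import sys
from typing import Iterable, List, Tuple

# Search text taken from the thread; everything else on the line is unknown.
MARKER = "Setting Next"


def find_marker_lines(lines: Iterable[str], marker: str = MARKER) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that contains the marker."""
    return [
        (number, line.rstrip("\r\n"))
        for number, line in enumerate(lines, start=1)
        if marker in line
    ]


def scan_log(path: str) -> List[Tuple[int, str]]:
    """Scan one log file, tolerating undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_marker_lines(handle)


if __name__ == "__main__":
    # Usage: python scan_setting_next.py <path-to-kagent.log> [more logs...]
    for log_path in sys.argv[1:]:
        for number, text in scan_log(log_path):
            print(f"{log_path}:{number}: {text}")
```

Comparing the printed lines against the time each line was logged should show whether the next run was being set to the following day or to the current one.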
 