Monitor and Recover Stopped Automatic Services with OMS – Part 1

I was working with a customer recently, and one of the asks was to configure OMS to monitor for stopped automatic services on servers throughout the environment. My first thought was that we could easily use the data collected by the Configuration Tracking solution and configure queries to alert when a service is stopped. Unfortunately, although Configuration Tracking is a great solution, it did not meet the requirements for this purpose due to its 1-hour data collection interval. We needed to be notified of a critical service stopping as close to real time as possible. Plan B was to utilize Event ID 7024 and custom fields, as we were already collecting the Application log. However, during my testing on Windows Server 2012 R2, the only event logged to the Application log when a service was stopped manually was Event ID 1. Further, what if a service just doesn’t start after a reboot? Once again, there may be no events logged, as technically there could be no error.

SO…although technically both of the other options could work in certain scenarios, in this particular case we needed something a bit more granular.  Time for some fun with PowerShell, Azure Automation and the Data Collector API!

Step 1 was to create a PowerShell script to poll for services set to automatic and in a stopped state.

The first part of the script simply gets the necessary credential assets, variables, and logs into Azure.  Nothing fancy here, but you will need to follow the prerequisites at the bottom of this post to ensure that your variables and credential assets are in place.

Write-Output "Getting Azure credentials...."
#Get Creds
$AzureUser="OMSAASvc"
$AzureCred = Get-AutomationPSCredential -Name $AzureUser
Write-Output $AzureCred

Write-Output "Logging into Azure...."
#Login to Azure Subscription
Login-AzureRmAccount -Credential $AzureCred
Select-AzureRmSubscription -SubscriptionName "Microsoft Azure Sponsorship"

Write-Output "Getting Local credentials...."
#Get Domain Creds to run local workflows
$DomainUser="DomainCred"
$DomainCred = Get-AutomationPSCredential -Name $DomainUser
Write-Output $DomainCred

#Update customer Id to your OMS workspace ID
$CustomerID = Get-AutomationVariable -Name 'OMSWSID'

#For shared key use either the primary or secondary Connected Sources client authentication key
$SharedKey = Get-AutomationVariable -Name 'OMSWSPK'

#Get Workspace name and Resource Group name for the OMS Search API query
$WorkspaceName = Get-AutomationVariable -Name 'OMSWSName'
$ResourceGroupName = Get-AutomationVariable -Name 'OMSResourceGroup'

The next part of the script utilizes the OMS Search API to collect a list of OMS managed computers to poll for stopped services.  This allows me to avoid using text files or querying AD for a list of computers and avoids collecting data from non-production servers.

#Query OMS for computers registering heartbeats in the last 1 hour
Import-Module AzureRm.OperationalInsights
$dynamicQuery = 'Type=Heartbeat TimeGenerated>NOW-1HOUR | Measure count() by Computer | select Computer'
$Result = Get-AzureRmOperationalInsightsSearchResults `
 -ResourceGroupName $ResourceGroupName `
 -WorkspaceName $WorkspaceName `
 -Query $dynamicQuery
$OMSComputers=$Result.Value | ConvertFrom-Json
$OMSComputers | out-null

Now that I have my list of computers, I can loop through and query each computer using WMI for services that are both stopped and set to automatic. I’ve also added logic to exclude services set to “Automatic Delayed” to avoid false alarms. Any services that meet these criteria are then passed to the next section of the script, where a custom PS Object is created for each service with the properties that will be passed to the Data Collector API using the Send-OMSAPIIngestionFile cmdlet (from the OMSIngestionAPI PowerShell module).

Note: For larger environments I’ve provided an example using PowerShell Workflow, as we can utilize the ForEach -Parallel construct to iterate through a collection of objects in parallel rather than waiting for each loop iteration to finish before moving on to the next. This can save quite a bit of execution time. We could of course use jobs as well, but during my testing jobs and Workflow took the same amount of time, so I will provide the Workflow version as an example for those that haven’t used PowerShell Workflow in the past. See the link at the bottom of the post for both runbook examples.
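As a rough illustration of the ForEach -Parallel pattern mentioned in the note, here is a minimal sketch. It is not the Workflow runbook linked at the bottom of the post: the workflow name Get-StoppedServicesParallel and its parameters are hypothetical, it omits the delayed-start check and the OMS upload, and it assumes Windows PowerShell 5.1 or earlier, since workflows are not available in PowerShell 7.

```powershell
# Hypothetical workflow: polls each computer in parallel for automatic
# services that are not running. Names and parameters are illustrative only.
workflow Get-StoppedServicesParallel
{
    param(
        [string[]]$ComputerNames,
        [System.Management.Automation.PSCredential]$Credential
    )

    # Each computer is processed concurrently instead of one after another.
    ForEach -Parallel ($ComputerName in $ComputerNames)
    {
        # InlineScript runs ordinary PowerShell inside the workflow;
        # the $Using: prefix passes workflow variables into it.
        InlineScript
        {
            Get-WmiObject -Class Win32_Service `
                -Filter "State != 'Running' and StartMode = 'Auto'" `
                -ComputerName $Using:ComputerName `
                -Credential $Using:Credential `
                -ErrorAction Continue
        }
    }
}
```

With the variables from the script in this post, it could be called along the lines of Get-StoppedServicesParallel -ComputerNames $OMSComputers.Computer -Credential $DomainCred.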

#Define the custom log type and timestamp field for the Data Collector API
$logtype="ServiceStatus"
$Timestampfield = " " 

Try{
ForEach ($Computer in $OMSComputers)
    {
    #Exclude the Hybrid Worker
    If ($Computer.Computer -ne "AAHybrid01.demo.local")
        {
        $ComputerName=$Computer.Computer
        Write-Output "Getting services on $ComputerName..."

        $Array = @()
        $StoppedSvcs = @()

        $StoppedSvcs = Invoke-Command -ScriptBlock {
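            #Query the remote computer over WMI for automatic services that are not running
            $Services = Get-WmiObject -Class Win32_Service -Filter "State != 'Running' and StartMode = 'Auto'" -Credential $DomainCred -ComputerName $ComputerName -ea Continue

            #Exclude delayed start services
            #Note: this Invoke-Command has no -ComputerName, so the registry lookup below reads the machine running the runbook, not $ComputerName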
            ForEach ($Service in $Services)
                {
                    $DelayCheckSvc = $Service.Name
                    $DelayCheckReg = Get-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\$DelayCheckSvc" -ErrorAction SilentlyContinue
                    $DelayCheck = $DelayCheckReg| Where-Object {$_.Start -eq 2 -and $_.DelayedAutoStart -eq 1}
                    If (!$DelayCheck)
                        {
                        $Service
                        }
                }
            }
        $StoppedSvcs| out-null

        If ($StoppedSvcs)
            {
            Foreach($Svc in $StoppedSvcs)
                {
                    #Format OMS schema
                    $array=$null
                    $sx = New-Object PSObject -Property ([ordered]@{
                        Computer=$ComputerName
                        SvcDisplay=$Svc.DisplayName
                        SvcName=$Svc.Name
                        SvcState=$Svc.State
                        SvcStartMode=$Svc.StartMode
                        })
                    $array+=$sx
                    $jsonTable = ConvertTo-Json -InputObject $array
                    $jsonTable
                    Send-OMSAPIIngestionFile -customerId $CustomerID -sharedKey $SharedKey -body $jsonTable -logType $logtype -TimeStampField $Timestampfield
                }
            }
        }
    }
}

Catch{
    $ErrorMessage = "Exception Message: $($_.Exception.Message)"
    }
"Exceptions...."
$ErrorMessage

Once the code was fully tested (this is sample code and should be tested thoroughly before using in a production environment), the next step was to copy the code into a new Azure Automation PowerShell runbook called Get-StoppedServices.  Once we validated the functionality, the runbook was published and ready to go!  When executing the runbook in the Azure Automation Test Pane, the output should look similar to below:

[Screenshot: runbook output in the Azure Automation Test Pane]

And now to see if the data is showing up in OMS…

[Screenshot: stopped services data in OMS]
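The same check can be sketched from PowerShell, reusing the search cmdlet from earlier in the script. This assumes $ResourceGroupName and $WorkspaceName are already populated as above and that you are logged into Azure. The Data Collector API appends _CL to the custom log type, so the "ServiceStatus" log type is searched as ServiceStatus_CL. The variable name $verifyQuery is only illustrative.

```powershell
# Assumes $ResourceGroupName and $WorkspaceName are set as in the script above.
Import-Module AzureRm.OperationalInsights

# Custom logs sent through the Data Collector API get a _CL suffix,
# so the "ServiceStatus" log type is queried as ServiceStatus_CL.
$verifyQuery = 'Type=ServiceStatus_CL TimeGenerated>NOW-1HOUR'

$Result = Get-AzureRmOperationalInsightsSearchResults `
    -ResourceGroupName $ResourceGroupName `
    -WorkspaceName $WorkspaceName `
    -Query $verifyQuery

# Each returned value is a JSON string; convert to objects for display.
$Result.Value | ConvertFrom-Json
```

If nothing comes back right away, allow some time for the first records of a new custom log type to be indexed before assuming the upload failed.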

Looking good!  The last step is to schedule the runbook so that we are collecting this data regularly.  I am running the Get-StoppedServices runbook every 10 minutes, but you can schedule the frequency for what works best in your environment.  

Note: To schedule runbooks at intervals of less than 1 hour using Azure Scheduler, see my post here. Additional options include configuring a runbook to handle the scheduling intervals, or even creating an hourly recurring schedule for each minute interval (see below).

[Screenshot: hourly recurring schedules, one per minute interval]
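The "one hourly schedule per minute interval" option can be sketched as follows. This is only an illustration under assumptions: the schedule names, the start date, and the $AutomationAccountName variable are hypothetical, and the AzureRM cmdlet calls are left as comments so the block runs without an Azure login.

```powershell
# Build six hourly schedule definitions offset by 10 minutes each
# (:00, :10, ... :50). Schedule names and start date are hypothetical.
$IntervalMinutes = 10
$FirstStart = (Get-Date).Date.AddDays(1)   # midnight tomorrow, local time

$Schedules = ForEach ($Offset in 0..((60 / $IntervalMinutes) - 1))
{
    $Minute = $Offset * $IntervalMinutes
    [pscustomobject]@{
        Name      = 'Get-StoppedServices-{0:d2}' -f $Minute
        StartTime = $FirstStart.AddMinutes($Minute)
    }
}
$Schedules

# Each definition could then be created and linked to the runbook, e.g.:
#   New-AzureRmAutomationSchedule -ResourceGroupName $ResourceGroupName `
#       -AutomationAccountName $AutomationAccountName `
#       -Name $Schedule.Name -StartTime $Schedule.StartTime -HourInterval 1
#   Register-AzureRmAutomationScheduledRunbook -ResourceGroupName $ResourceGroupName `
#       -AutomationAccountName $AutomationAccountName `
#       -RunbookName 'Get-StoppedServices' -ScheduleName $Schedule.Name
```

Six schedules that each recur hourly, staggered by 10 minutes, give the same effect as a single 10-minute schedule.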

Now that we have our runbook scheduled and the stopped services data is populating in OMS, we can create queries, alerts, and even use the data in custom solutions and views. Let’s take a look at what an alert might look like.

[Screenshot: OMS alert configuration for stopped services]

Notice that I’ve filtered my query to only alert when specific services are returned. Because this alert is tied to a remediation runbook, you may want to filter the services to avoid restarting services that are not critical or should not be started.  Another option would be to filter these services in the script.  
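For the second option, filtering in the script, a minimal sketch might look like the following. The service names in $CriticalServices are hypothetical examples, and the sample objects only stand in for the Win32_Service results that the runbook holds in $StoppedSvcs.

```powershell
# Hypothetical allow list of critical service names; replace with your own.
$CriticalServices = @('Spooler', 'W32Time', 'MSSQLSERVER')

# Sample objects standing in for the Win32_Service results in $StoppedSvcs.
$StoppedSvcs = @(
    [pscustomobject]@{ Name = 'Spooler';        State = 'Stopped'; StartMode = 'Auto' }
    [pscustomobject]@{ Name = 'RemoteRegistry'; State = 'Stopped'; StartMode = 'Auto' }
)

# Keep only services on the allow list before sending them to OMS.
$FilteredSvcs = @($StoppedSvcs | Where-Object { $CriticalServices -contains $_.Name })
$FilteredSvcs | Select-Object Name, State
```

Filtering in the script keeps non-critical services out of OMS entirely, whereas filtering in the alert query keeps the full data set available for dashboards while limiting what triggers remediation.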

We can also use the data collected to create a stopped services blade in View Designer or My Dashboard. The blade below is a reproduction of part of an application monitoring solution that I am working on with a customer.

[Screenshot: stopped services blade]

Additionally, you may have noticed that I’ve linked the alert to a runbook called Restart-StoppedServices. The Restart-StoppedServices runbook will be the topic of part 2 of this blog mini-series, which I will be releasing soon. Until then, happy testing!

NOTE:  The code provided is for testing purposes only and should not be used in production without thorough testing.

Get-StoppedServices Prerequisites:

  • Configure a Hybrid Runbook Worker (for on-premises servers) – if one does not already exist.
  • Configure an Azure Automation Account and link the account to the OMS workspace where the data will be collected.
  • Import the Get-StoppedServices runbook to Azure Automation.
  • Create a Variable Asset for the OMS Workspace ID called ‘OMSWSID’.
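  • Create a Variable Asset for the OMS Primary Key called ‘OMSWSPK’.
  • Create Variable Assets for the OMS workspace name and its resource group called ‘OMSWSName’ and ‘OMSResourceGroup’ (the script reads both for the OMS Search API query).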
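  • Create a Credential Asset called ‘AzureCred’ with rights to log into Azure and write to OMS. Note that the sample script above looks this asset up by the name stored in $AzureUser (‘OMSAASvc’), so make sure the asset name and that value match.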
  • Create a Credential Asset called ‘DomainCred’ with rights to execute Get-WMIObject queries against the on-premises servers.

Get the sample Get-StoppedServices runbooks here.
