Performance Monitoring and Analysis with OMS

There are many great posts out there that show the performance monitoring and analysis features in OMS. What I want to do in this post is put it all together and show in detail how we can use OMS not only to monitor and identify performance issues, but also to drill down and use features such as NRT Performance Data for root cause analysis. For the purpose of this demo, I will be looking at the Processor\% Processor Time counter as the first step in identifying issues on my servers.

Let’s get started!

To get a high level overview of the Processor(_Total)\% Processor Time on the servers in my environment, I will use two search queries to create a performance chart and performance graph for all computers.

Performance Chart:

  1. Navigate to the log search query page
  2. Enter the following query in the search field:
     Type=Perf ObjectName=Processor CounterName="% Processor Time" TimeGenerated>NOW-1HOUR | measure avg(CounterValue) as ProcTime by Computer
  3. Save the query using a distinct name and category so you can easily access your queries when needed.
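If the measure clause is new to you, here is a minimal Python sketch of what "measure avg(CounterValue) as ProcTime by Computer" works out to: one average per computer. This is only an illustration, not how OMS implements it. The records, computer names and the avg_by_computer function are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample records shaped like Type=Perf results (values are invented).
records = [
    {"Computer": "SCCM01", "CounterValue": 80.0},
    {"Computer": "SCCM01", "CounterValue": 60.0},
    {"Computer": "DC01", "CounterValue": 10.0},
    {"Computer": "DC01", "CounterValue": 20.0},
]


def avg_by_computer(rows):
    """Average CounterValue per Computer, mirroring 'measure avg(...) by Computer'."""
    totals = defaultdict(lambda: [0.0, 0])  # computer -> [running sum, sample count]
    for row in rows:
        entry = totals[row["Computer"]]
        entry[0] += row["CounterValue"]
        entry[1] += 1
    return {computer: total / count for computer, (total, count) in totals.items()}


proc_time = avg_by_computer(records)
print(proc_time)
```

Each key in the result corresponds to one bar in the performance chart.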

[Screenshot: saved searches]

[Screenshot: % Proc Time performance chart]

Performance Graph:

I’m a big fan of the new graphical functionality in OMS. From a performance monitoring and analysis perspective, it’s a great tool for identifying issues at a high level, and it lets you drill down into a specific instance to access the raw NRT Performance Data for root cause analysis, with data points as close together as 10 seconds.

  1. Navigate to the log search query page
  2. Enter the following query in the search field. (NOTE: This is the same query we entered to create the performance chart, except that the TimeGenerated filter is removed and “Interval 1HOUR” is added. The Interval clause is what activates the graphical feature.)
     Type=Perf ObjectName=Processor CounterName="% Processor Time" | measure avg(CounterValue) as ProcTime by Computer Interval 1HOUR
  3. Save the query using a distinct name and category so you can easily access your queries when needed.
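To show what the Interval clause adds, here is a small Python sketch that averages samples per computer per one-hour bucket. That is my reading of how "by Computer Interval 1HOUR" produces one line per computer on the graph. The timestamps, values and function name are invented for the example, and OMS may align its buckets differently.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (computer, timestamp, % Processor Time) samples; all values are invented.
samples = [
    ("SCCM01", datetime(2016, 1, 1, 9, 5), 40.0),
    ("SCCM01", datetime(2016, 1, 1, 9, 45), 60.0),
    ("SCCM01", datetime(2016, 1, 1, 10, 15), 90.0),
    ("DC01", datetime(2016, 1, 1, 9, 30), 10.0),
]


def avg_by_computer_and_hour(rows):
    """Average values per (computer, hour bucket), similar to 'by Computer Interval 1HOUR'."""
    sums = defaultdict(lambda: [0.0, 0])  # (computer, hour) -> [running sum, sample count]
    for computer, timestamp, value in rows:
        # Truncate the timestamp to the start of its hour to get the bucket.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        entry = sums[(computer, hour)]
        entry[0] += value
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


series = avg_by_computer_and_hour(samples)
for (computer, hour), value in sorted(series.items()):
    print(computer, hour.isoformat(), value)
```

Each (computer, hour) pair becomes one point on that computer's line in the graph.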

[Screenshot: % Proc Time performance graph]

At this point we have two very helpful and simple tools to identify instances where % Processor Time is an issue in my environment.  Additionally, we can easily access these tools by clicking on the queries saved on the Saved Queries page.  Excellent!

So I’ve identified that I have a server with high CPU…what next?  We have a few options here.  For the purpose of this demo, I will first show how we can drill down using the performance graph, and then I will manipulate the query a bit to allow us to identify the exact process that is causing the CPU to spike on the identified server.

Performance Graph Drill Down:

One of the really cool things about the performance graph is the ability to drill down into each instance to access “raw” data.  Let’s take a look.

  1. Navigate to your saved performance graph
  2. Identify a computer with high CPU.  In this case I will choose the SCCM server as it has the highest total CPU. [Screenshot: drill down on SCCM01]
  3. Once we click the instance, we are taken to the NRT Performance Data page where we find our filtered instance counter.  We have two types of data available here.  The Results data set consists of 30 minute aggregated data, while the Metrics data set consists of “raw” data, with the interval depending on what was specified during configuration.  I will choose Metrics to access the raw data.  For more on the NRT Performance Data feature, see my blog post on the topic.
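Why bother with the raw Metrics data? Here is a rough Python sketch of the difference. It assumes the 30 minute aggregate is a plain average, which is my assumption rather than anything documented here, and it uses invented values with a 10 second sample interval.

```python
# Hypothetical 30-minute window sampled every 10 seconds (180 samples, values invented):
# the CPU idles at 5% except for a one-minute spike to 100%.
raw_samples = [5.0] * 174 + [100.0] * 6

# Assumption: the aggregated "Results" value behaves like an average over the window.
aggregated = sum(raw_samples) / len(raw_samples)

# The raw "Metrics" view keeps every sample, so the spike stays visible.
peak = max(raw_samples)

print(f"30-minute average: {aggregated:.2f}%")
print(f"raw peak: {peak:.2f}%")
```

A short spike all but vanishes in the averaged value, while the raw samples still show it clearly.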

[Screenshot: % Proc Time NRT Performance Data]

We now have access to near real-time data, which gives us a much clearer picture of when the CPU spike occurred, whether it is ongoing, etc.  This is great, but it really doesn’t tell me what is causing the issue.  Let’s take this solution a bit further and create some views to identify which process or processes are taking up the most CPU.

First, I will create a performance graph view to show Process\% Processor Time data for all processes running on SCCM01.  To create this graph, simply edit the query used above as follows:

Type=Perf ObjectName=Process CounterName="% Processor Time" Computer=SCCM01* InstanceName!=_Total AND InstanceName!=Idle | measure avg(CounterValue) as ProcTime by InstanceName Interval 1HOUR | sort ProcTime desc

As you can see, I added a filter to remove the “_Total” and “Idle” process instances from my query results.
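For anyone who thinks better in code, the filtering and ranking in that query can be sketched in Python as follows. It ignores the Interval part to keep things short. The rows, process values and the top_processes function are invented for illustration, and fnmatchcase is only a stand-in for the OMS wildcard match.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

# Hypothetical Process\% Processor Time records (all values are invented).
rows = [
    {"Computer": "SCCM01", "InstanceName": "smsexec", "CounterValue": 70.0},
    {"Computer": "SCCM01", "InstanceName": "smsexec", "CounterValue": 50.0},
    {"Computer": "SCCM01", "InstanceName": "sqlservr", "CounterValue": 20.0},
    {"Computer": "SCCM01", "InstanceName": "_Total", "CounterValue": 190.0},
    {"Computer": "SCCM01", "InstanceName": "Idle", "CounterValue": 100.0},
    {"Computer": "DC01", "InstanceName": "lsass", "CounterValue": 5.0},
]

# Instances that would otherwise drown out the real processes.
EXCLUDED_INSTANCES = {"_Total", "Idle"}


def top_processes(records, computer_pattern):
    """Average CPU per process on matching computers, highest first."""
    grouped = defaultdict(list)
    for record in records:
        if not fnmatchcase(record["Computer"], computer_pattern):
            continue  # corresponds to Computer=SCCM01*
        if record["InstanceName"] in EXCLUDED_INSTANCES:
            continue  # corresponds to InstanceName!=_Total AND InstanceName!=Idle
        grouped[record["InstanceName"]].append(record["CounterValue"])
    averages = {name: sum(values) / len(values) for name, values in grouped.items()}
    # corresponds to '| sort ProcTime desc'
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranking = top_processes(rows, "SCCM01*")
print(ranking)
```

Without the exclusion set, _Total and Idle would sit at the top of the ranking and hide the real offenders.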

[Screenshot: Process % Processor Time graph for SCCM01]

Note: I am using a wildcard on the server name.  If your naming convention allows, you can use wildcards to group by technology, AD Site, etc. For example, if I want to see all performance data for my domain controllers, I can use the following filter in my query:

Type=Perf ObjectName=Processor CounterName="% Processor Time" TimeGenerated>NOW-1HOUR Computer=*DC* | measure avg(CounterValue) as ProcTime by Computer
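To check whether your naming convention works with a pattern like *DC* before you put it in a query, you can try it out in Python. The computer names below are made up. Note that fnmatchcase is case-sensitive and OMS may treat case differently, so take this only as a rough check.

```python
from fnmatch import fnmatchcase

# Hypothetical computer names following a site-role-number convention.
computers = ["NYC-DC01", "LON-DC02", "SCCM01", "SQL01"]

# Keep only names containing "DC" anywhere, like Computer=*DC* in the query.
domain_controllers = [name for name in computers if fnmatchcase(name, "*DC*")]

print(domain_controllers)
```

A pattern this loose will also match any non-DC server that happens to have "DC" in its name, so it is worth testing against your full server list.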

Note: The ability to group by more than one field, for example by both server and instance, is not yet available.  So if I want to show the Process\% Processor Time for each process on each computer in a group, I cannot.  Such a query would return aggregated values for each process across all computers.
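Here is a short Python sketch of what this limitation means in practice. It compares grouping by process only, which is what the query gives you across a group of computers, with grouping by computer and process, which is what you would need to do yourself on exported data. The rows and the average helper are invented for the example.

```python
from collections import defaultdict

# Hypothetical (computer, process, % Processor Time) rows; all values are invented.
rows = [
    ("SCCM01", "smsexec", 80.0),
    ("SCCM02", "smsexec", 20.0),
    ("SCCM01", "sqlservr", 30.0),
]


def average(records, key):
    """Average the value column after grouping records by key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return {group: sum(values) / len(values) for group, values in groups.items()}


# Single group-by: smsexec on both servers collapses into one number.
by_process = average(rows, key=lambda r: r[1])

# Two-field group-by: one value per (computer, process) pair.
by_computer_and_process = average(rows, key=lambda r: (r[0], r[1]))

print(by_process)
print(by_computer_and_process)
```

In the single group-by result, the busy smsexec on SCCM01 and the quiet one on SCCM02 are averaged into one value, so you can no longer tell which server has the problem.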

Ok, back on track.  Looking at my performance graph above, it’s clear that the smsexec process is my top offender.  To drill down and take a more granular look at this particular process, we once again hover over the smsexec instance and select the hourglass icon, which brings us to the NRT Performance Data view.

[Screenshot: drill down on the smsexec process]

On the NRT Performance Data view we can now take a more granular look at the smsexec performance both historically and in real time to identify patterns and assist with root cause analysis.

[Screenshot: NRT Performance Data for the smsexec process]

Additionally, we can easily export this data to Excel and, more recently, Power BI. The exported data comes from the aggregated data set (30 minute aggregate values).

[Screenshot: export options]

We now have several saved queries which can be used to troubleshoot CPU issues in our environment.  This same process can be used for any performance KPIs configured to be collected in your OMS workspace.  Additionally, we can take this a step further and create dashboards, alerts and even a custom solution based on specified thresholds.  My next post will take the examples outlined here and detail how to create an alert, dashboard views, and a custom solution to enable easy monitoring and visualization of your performance data.

 

 
