Thursday, February 24, 2011

Exchange 2010 SP1: The Troubleshooters

When managing an Exchange 2010 organization in a production environment, an Exchange administrator often wants to keep track of things like database latency, disk space problems, search issues, user needs, and more. PowerShell was created to give us more power in our admin efforts, and many books have already been written to help us master the ins and outs of PowerShell programming. What if I told you that you already had some examples of PowerShell programming, right on your Exchange server? And they are pretty helpful on top of that. Beginning with RTM, Exchange 2010 has shipped with three PowerShell troubleshooters out of the box. Written in the PowerShell scripting language, they are ready for you to use:
  • Content Index Troubleshooter (Troubleshoot-CI.ps1)
  • Database Latency Troubleshooter (Troubleshoot-DatabaseLatency.ps1)
  • Database Disk Space Troubleshooter (Troubleshoot-DatabaseSpace.ps1)
These PowerShell scripts are located in the \Program Files\Microsoft\Exchange Server\V14\Scripts folder.

Let's go over them.

The Content Index Troubleshooter:
The CI Troubleshooter (Troubleshoot-CI.ps1) is designed to monitor and perform troubleshooting on Content Index catalogs. It detects and resolves the following symptoms:
  • Deadlock: Exchange Search deadlocks waiting on threads from MSSearch
  • Corruption: One or more search indices are corrupted
  • Stall: Similar to deadlock in that the indices are not getting updated
  • Backlog: The Search catalog is backlogged resulting in missing index searches
This troubleshooter can be used against the server or just a database, and can be used to detect as well as resolve problems. It can also be used in just a monitoring context to log warnings and failure events in the app log.
 
Parameters

Server: (optional) The NetBIOS name of the mailbox server on which troubleshooting should be attempted for CI catalogs. If this optional parameter is not specified, the local server is assumed.

Database: (optional) The name of the database to troubleshoot. If this parameter is not specified, catalogs for all databases on the server specified by the Server parameter are used.

Symptom: (optional) Specifies the symptom to detect and troubleshoot.
Possible values are:
  • Deadlock
  • Corruption
  • Stall
  • Backlog
  • All(default)
When 'All' is specified, all symptoms are tested.

Action: (optional) This specifies the action to be performed to resolve a symptom. Possible values are:
  • Detect (default)
  • DetectAndResolve
  • Resolve

MonitoringContext: (optional) This specifies that the troubleshooter is being run in a monitoring context. The possible values are $true and $false. Default is $false. If the value is $true, warning/failure events are logged to the application event log.

FailureCountBeforeAlert: (optional) This specifies the number of failures the troubleshooter will allow before raising an Error in the event log, leading to a SCOM alert. The allowed range for this parameter is [1,100], default is 3. No alerts are raised if MonitoringContext is $false.

FailureTimeSpanMinutes: (optional) This specifies the number of minutes in the time span during which the troubleshooter will check the history of failures to count the failures and alert. If the failure count during this time span exceeds the value for FailureCountBeforeAlert, an alert is raised. No alerts are raised if MonitoringContext is $false. The default value for this parameter is 600 minutes.
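
To make the interplay of the last three parameters concrete, here is a minimal sketch of the alerting rule as the descriptions above word it. The function name Test-FailureAlert and its FailureTimes and Now parameters are made up for illustration, and the real script may count or compare failures differently; only the defaults (3 failures, 600 minutes) come from the text above.

```powershell
# Hypothetical helper, not taken from Troubleshoot-CI.ps1: it mirrors the
# parameter descriptions above to show when an alert would be raised.
function Test-FailureAlert {
    param(
        [datetime[]]$FailureTimes = @(),      # when earlier failures were recorded
        [datetime]$Now = (Get-Date),
        [int]$FailureCountBeforeAlert = 3,    # documented default
        [int]$FailureTimeSpanMinutes = 600,   # documented default
        [bool]$MonitoringContext = $false
    )

    # No alerts are raised outside a monitoring context.
    if (-not $MonitoringContext) { return $false }

    # Count only the failures that fall inside the look-back window.
    $windowStart = $Now.AddMinutes(-$FailureTimeSpanMinutes)
    $recent = @($FailureTimes | Where-Object { $_ -ge $windowStart -and $_ -le $Now })

    # Alert once the count exceeds the allowed number of failures.
    return ($recent.Count -gt $FailureCountBeforeAlert)
}
```

With the defaults, four failures inside the last ten hours would trip the alert, while three would not, and a failure older than 600 minutes is ignored entirely.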

Examples
C:\PS> .\Troubleshoot-CI.ps1 -database DB01

Detects and reports if there's any problem with catalog for database DB01. Does not attempt any Resolution.

C:\PS> .\Troubleshoot-CI.ps1 -database DB01 -symptom Stall

Detects if indexing on catalog for database DB01 is stalled. Does not attempt any Resolution.

C:\PS> .\Troubleshoot-CI.ps1 -Server S001

Detects and reports problems with all catalogs on server S001, if any. Does not attempt any Resolution.

C:\PS> .\Troubleshoot-CI.ps1 -database DB01 -Action DetectAndResolve

Detects and reports if there's any problem with catalog for database DB01. Attempts a Resolution of the problem.

Event Log
Events logged to the Microsoft-Exchange-Troubleshooters/Operational event log:

5000 Informational The troubleshooter started successfully
5001 Informational The troubleshooter finished successfully
5002 Informational The troubleshooter didn't find any issues for any catalog
5003 Informational The troubleshooter didn't find any catalog issues for the specified database
5004 Informational Restart of search services succeeded
5005 Informational Reseeding succeeded for the catalog of the specified database
5300 Warning Detected search service deadlock
5301 Warning Detected catalog corruption for the specified database
5302 Warning Detected indexing stall for the specified database
5600 Error The troubleshooter failed with the specified exception
5601 Error The troubleshooter detected the symptom %1 %2 times in the past %3 hours for catalog %4. This exceeded the allowed limit for failures.
5602 Error Search services failed to restart.
5603 Error Reseeding failed for the content index catalog of mailbox database %1. Reason: %2
5604 Error Indexing backlog reached a critical limit of %2 hours or more for the specified database
5605 Error Another instance of the troubleshooter is already running on this machine. Two or more instances cannot be run simultaneously.
6000 Informational The troubleshooter started detection.
6001 Informational The troubleshooter finished detection
6002 Informational The troubleshooter started resolution.
6003 Informational The troubleshooter finished resolution.
6600 Error The troubleshooter failed during detection.
6601 Error The troubleshooter failed during resolution.
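
If you would rather pull these events from the shell than click through Event Viewer, something along these lines should work. It is only a sketch: the function name Get-TroubleshooterProblemEvent is made up, while Get-WinEvent and its -FilterHashtable parameter are standard PowerShell, and the log name is the one quoted above.

```powershell
# Hypothetical wrapper around Get-WinEvent that returns only warnings and
# errors from the troubleshooter log named in this article.
function Get-TroubleshooterProblemEvent {
    param([int]$MaxEvents = 50)

    # Level values used by the Windows event log: 2 = Error, 3 = Warning.
    $filter = @{
        LogName = 'Microsoft-Exchange-Troubleshooters/Operational'
        Level   = 2, 3
    }

    # Get-WinEvent reports an error when nothing matches, so suppress that case.
    Get-WinEvent -FilterHashtable $filter -MaxEvents $MaxEvents -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, Id, LevelDisplayName, Message
}
```

Calling Get-TroubleshooterProblemEvent -MaxEvents 20 on the mailbox server would list the 20 most recent warnings and errors, newest first.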

The Database Latency Troubleshooter:
The Database Latency Troubleshooter (Troubleshoot-DatabaseLatency.ps1) is designed to monitor and perform troubleshooting on database latency. It detects and resolves the following symptoms:
  • Disk latency
  • Active Directory Latency
  • RPC Latency
  • Top CPU user
This troubleshooter will run against the local mailbox database only.
 
Checks Disk Latency:
The troubleshooter detects disk latency by checking the following counters on the mailbox server database it is running against:

"\MSExchange Database ==> Instances($database)\I/O Database Reads Average Latency"
"\MSExchange Database ==> Instances($database)\I/O Database Reads/sec"
"\MSExchange Database ==> Instances($database)\I/O Database Writes Average Latency"
"\MSExchange Database ==> Instances($database)\I/O Database Writes/sec"

Note: The troubleshooter will take into account that the disk latencies are not caused by a heavily loaded disk subsystem.

It uses the following default thresholds:
  • Maximum latency threshold for disk read before it is deemed as bad is 200
  • Minimum read rate for disk before it is deemed as bad is 20
  • Minimum write rate for disk before it is deemed as bad is 20

Checks RPC Latency:
The troubleshooter detects the RPC latency by checking the following performance counters on the mailbox server for the database it is running against:

"\MSExchangeIS Mailbox($database)\rpc average latency"
"\MSExchangeIS Mailbox($database)\RPC Operations/sec"
It uses the following default thresholds:
  • The maximum RPC average latency is 70
  • The maximum RPC operations/sec threshold is 50
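
If you want to look at the same counters by hand, a small helper like the one below can build the counter paths for a database. This is a sketch under assumptions: the function name Get-LatencyCounterPath is made up, and it assumes the counter instance name is simply the database name, as the paths above suggest. Check the instance names on your own server before relying on it.

```powershell
# Hypothetical helper: builds the counter paths listed above for one database.
function Get-LatencyCounterPath {
    param([Parameter(Mandatory = $true)][string]$Database)

    # Assumption: the counter instance is named after the database.
    $dbObject = '\MSExchange Database ==> Instances({0})' -f $Database
    $isObject = '\MSExchangeIS Mailbox({0})' -f $Database

    # Disk latency counters
    $dbObject + '\I/O Database Reads Average Latency'
    $dbObject + '\I/O Database Reads/sec'
    $dbObject + '\I/O Database Writes Average Latency'
    $dbObject + '\I/O Database Writes/sec'

    # RPC latency counters
    $isObject + '\rpc average latency'
    $isObject + '\RPC Operations/sec'
}
```

On the mailbox server itself, the result can be handed to the standard Get-Counter cmdlet, for example Get-Counter -Counter (Get-LatencyCounterPath -Database 'DB01') -SampleInterval 5 -MaxSamples 3, where DB01 is a placeholder database name.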

Checks the Top CPU User:
The troubleshooter detects the top CPU user by generating a descending list of the users that are using up the most time on server for a given database. It uses the output of Get-StoreUsageStatistics to get the MailboxGuid and the time in the server used during the sampling period (10 min). If Quarantine has been enabled, the troubleshooter will log an event and quarantine the top CPU user.
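
Here is a rough sketch of that kind of ranking. The function name Select-TopTimeInServer is made up, and summing TimeInServer per MailboxGuid across the samples is my assumption about how the numbers are combined; the troubleshooter itself may do it differently. The default of 60000 is the TimeInServerThreshold default described under Parameters.

```powershell
# Hypothetical helper: ranks mailboxes by total TimeInServer, highest first.
function Select-TopTimeInServer {
    param(
        [object[]]$Statistics = @(),          # e.g. output of Get-StoreUsageStatistics
        [int]$TimeInServerThreshold = 60000,  # documented default
        [int]$First = 5
    )

    $Statistics |
        Group-Object -Property MailboxGuid |
        ForEach-Object {
            # Assumption: add up every sample recorded for the same mailbox.
            $total = ($_.Group | Measure-Object -Property TimeInServer -Sum).Sum
            New-Object PSObject -Property @{ MailboxGuid = $_.Name; TimeInServer = $total }
        } |
        Where-Object { $_.TimeInServer -ge $TimeInServerThreshold } |
        Sort-Object -Property TimeInServer -Descending |
        Select-Object -First $First
}
```

In the Exchange Management Shell you could feed it real data with Select-TopTimeInServer -Statistics (Get-StoreUsageStatistics -Database 'DB01'), where DB01 is a placeholder database name.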

Parameters:
The following parameters are provided as a part of the Database Latency troubleshooter:
MailboxDatabaseName: (required) The Mailbox database the troubleshooter will run against.

Possible values are:
  • GUID
  • Distinguished name (DN)
  • Database name

LatencyThreshold: (optional) The maximum RPC average latency the server should be experiencing. The valid range for LatencyThreshold is from 1 to 200 with default as 70.
 
MonitoringContext: (optional) Specifies whether or not the troubleshooter is to write monitoring events to the Application and the Operations log in the Event Viewer.

TimeInServerThreshold: (optional) The TimeInServerThreshold sets the threshold for the top users that are causing the CPU starvation. The valid range is from 1 to 600000 with default as 60000.

Quarantine: (optional) Specifies whether or not to quarantine heavy users. By default it does not quarantine the heaviest user.

Example
Troubleshoot-DatabaseLatency.ps1 -MailboxDatabaseName <DatabaseID> [-LatencyThreshold <1-200>] [-TimeInServerThreshold <1-600000>] [-Quarantine <switch>] [-MonitoringContext <switch>]

Event Log
Events logged to the Microsoft-Exchange-Troubleshooters/Operational event log:
5110 Informational The troubleshooter started successfully
5111 Informational The troubleshooter detected latency
5411 Warning The troubleshooter quarantined a user
5412 Warning Unusual activity found in a mailbox, but quarantine was not specified
5710 Error Disk latencies are abnormal for the specified database
5711 Error DSAccess latencies are abnormal for the specified database
5712 Error High RPC Average Latencies were detected for the specified database, but the troubleshooter was unable to determine the cause.

The Database Disk Space Troubleshooter:
The Database Disk Space Troubleshooter (Troubleshoot-DatabaseSpace.ps1) is designed to monitor and perform troubleshooting on database disk space issues. It detects the amount of database disk space, and it detects the cause for log generation:
  • It tracks the top users that are generating logs
  • It tracks the available disk space for both logs and database
Based on a disk space and time threshold, the troubleshooter has the option to Quarantine the top users.

Parameters:
The following parameters are provided as a part of the Database Disk Space Troubleshooter:
Server: (required unless MailboxDatabaseName is used) Specifies the mailbox server on which you are monitoring the log growth for all mailbox databases.

Note: You can't use this parameter in conjunction with the MailboxDatabaseName parameter.

MailboxDatabaseName: (required unless Server is used) Specifies the mailbox database on which you are monitoring the log growth.

Possible values are:
  • GUID
  • Distinguished name (DN)
  • Database name
Note:   You can't use this parameter in conjunction with the Server parameter.

PercentEdbFreeSpaceThreshold: (optional) Specifies the percentage of disk space for the EDB file at which Exchange should begin quarantining users. The valid range is 1 to 99 and the default is 25 percent.

PercentLogFreeSpaceThreshold: (optional) Specifies the percentage of disk space for the log files at which Exchange should begin quarantining users. The valid range is 1 to 99 and the default is 25 percent.

HourThreshold: (optional) Specifies the number of hours that you can wait until running out of space. The default value is 12 hours. The valid range is 1 to 1000000000.

MonitoringContext: (optional) Specifies whether the results of the command include monitoring events to be written in the regular application logs in Event Viewer and in the Operations log.

Quarantine: (optional) Specifies that heavy users will be quarantined.

Examples
Troubleshoot-DatabaseSpace.ps1 -MailboxDatabaseName <DatabaseID> [-PercentEdbFreeSpaceThreshold <1-99>] [-PercentLogFreeSpaceThreshold <1-99>] [-HourThreshold <1-1000000000>] [-Quarantine <switch>] [-MonitoringContext <switch>]
Troubleshoot-DatabaseSpace.ps1 -Server <ServerID> [-PercentEdbFreeSpaceThreshold <1-99>] [-PercentLogFreeSpaceThreshold <1-99>] [-HourThreshold <1-1000000000>] [-Quarantine <switch>] [-MonitoringContext <switch>]
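
To get a feel for how a percentage threshold and an hour threshold can work together, here is a back-of-the-envelope sketch. Both function names are made up, and the rule itself (free space below the percentage AND projected time to full below HourThreshold) is my assumption rather than the script's actual logic; only the defaults of 25 percent and 12 hours come from the parameter descriptions above.

```powershell
# Hypothetical helper: hours left before the disk fills at the current growth rate.
function Get-HoursUntilFull {
    param(
        [Parameter(Mandatory = $true)][double]$FreeBytes,
        [Parameter(Mandatory = $true)][double]$GrowthBytesPerHour
    )

    # A disk that is not growing never fills up.
    if ($GrowthBytesPerHour -le 0) { return [double]::PositiveInfinity }
    return $FreeBytes / $GrowthBytesPerHour
}

# Hypothetical helper: combines the two thresholds (assumed rule, see lead-in).
function Test-DatabaseSpaceRisk {
    param(
        [double]$FreeBytes,
        [double]$TotalBytes,
        [double]$GrowthBytesPerHour,
        [int]$PercentFreeSpaceThreshold = 25,   # documented default
        [int]$HourThreshold = 12                # documented default
    )

    $percentFree = 100 * $FreeBytes / $TotalBytes
    $hoursLeft   = Get-HoursUntilFull -FreeBytes $FreeBytes -GrowthBytesPerHour $GrowthBytesPerHour

    return (($percentFree -lt $PercentFreeSpaceThreshold) -and ($hoursLeft -lt $HourThreshold))
}
```

For example, a 100 GB volume with 10 GB free that is growing by 2 GB per hour has 10 percent free and about 5 hours left, so it would be flagged under this assumed rule. The same volume with no growth would not be flagged.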

Event Log
Events logged to the Microsoft-Exchange-Troubleshooters/Operational event log:
5100 Informational The troubleshooter started successfully
5101 Informational The troubleshooter finished. No problems detected.
5400 Warning The database is over the expected threshold.
5401 Warning Over the threshold, but the database is not growing. No action taken.
5410 Warning The troubleshooter quarantined the specified mailbox.
5700 Error The database is over the expected threshold and continues to grow. Manual intervention is required.

To watch for:
Even though you start them from the Exchange Management Shell, for the most part these troubleshooters give very little to no output to the window itself. You will need to go to the event log to see the reports. The exception is if you run the troubleshooters in Verbose Mode.

On the plus side, you can filter the event log for warnings and errors, so you can check the server periodically and take a quick glance at the event log to see if there are any problems. And speaking of periodically checking the server, that brings me to...

Bonus:
You can call the troubleshooters from within other PowerShell scripts as part of an overall server health monitoring plan. Here's a simple example:

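# Get every mailbox server in the organization once, up front.
$servers = Get-MailboxServer
while ($true)
{
     foreach ($server in $servers)
     {
        # Detect and try to resolve indexing stalls on this server's catalogs.
        .\Troubleshoot-CI.ps1 -Verbose -Server $server.Name -Action:DetectAndResolve -Symptom:Stall
     }
     # Wait about four hours (14440 seconds) before the next pass.
     Start-Sleep -Seconds 14440
}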

Calling Get-MailboxServer with no parameters will return a complete list of the mailbox servers in the entire Exchange organization. The above script will poll each mailbox server in the org, sleep, and then poll them again.
