Service Impact

  Zenoss

Service Impact

To Do

  • Understand Process Sets
  • Monitor II

Resources

What is Service Impact (Impact for short)

  • Resource Manager tells you what happened
  • Impact tells you why you should care
    • By building an Impact Service, you can tell what effect an event is going to have on a service downstream.

Benefits

  • Constantly updated service events /  status
  • Automated root cause analysis
  • Enhanced event management
    • If a member of a service has an event, an event is also generated for the service.
      • /Service/ event class
    • By creating triggers and notifications only on Dynamic Services, you will reduce the amount of event “noise”

How it works

  • Based on devices in Resource Manager and events generated by these devices.
  • Adds 2 new services
    • Impact:
      • State propagation
      • provides service alerting and root cause analysis
    • zenimpactstate:
      • Performs state calculations and event filtering for the Impact service.
    • Add Graph Database
      • All devices in the Object Database are copied/replicated into the Graph database.
      • Add relationships between service elements and the state of each element
      • Built for speed.  Does not bog down the object database.

How the Graph database works

  • Services are built on Members (aka Nodes in the database).
    • Nodes have properties
      • Node type : Service, Device, etc.
      • Availability State
      • Performance State
      • Production State …
  • Relationships bubble up.  Underlying events bubble to higher (Parent Nodes) to the top node.

Service States (aka Impact Types) and State Mappings

  • Two Impact Types – Availability and Performance
    • Availability Health
      • Up (Clear, Debug, Info)
      • At Risk (Warning)
      • Degraded (Error)
      • Down (Critical)
    • Performance Health
      • Acceptable (Clear)
      • Degraded (Warning, Error)
      • Unacceptable (Critical)
  • Type of Impact is determined by an event’s Event Class
    • Anything with /Status/ will be categorized under Availability
    • Anything with /Perf/ will be categorized under Performance
  • The State of the impacted element is derive3d from the Event Severity
    • It is Important that template Thresholds are configured accurately.
    • If Thresholds are too broad or too narrow, the Impact State will be inaccurate.
  • By default, a Service’s state is the most severe state of any of its constituent members.

Building Services

  • A Dynamic Service is a collection of Members.
  • Service members may consist of
    • Devices
    • Components
      • When adding a component, you get the rest of the device as a dependent part / related member of that service
    • Other Services
    • Logical Nodes

Events

  • When an event occurs on a Member, an event will also be generated against the Service’s Availability or Performance.
  • Events are ‘scored’ with a Confidence value which will show which event is most likely to cause the problem.
    • Generally, the farther down in the chain an event occurs, the higher the confidence score will be.
    • If a database is down, and that database is built on a VM that is also down, we should probably spend our time looking into the VM.
  • 58:00

Server Component Monitoring

1:07:00

Process Monitoring

1:19:00

Application Monitoring

1:39:00

Other Dynamic Services

1:46:00

Building services out of services.

Monitoring WebSites

2:21:00

Creating “dummy devices”

  • Do not have an IP Address
  • Do not have a default monitoring template
  • Names do not resolve in DNS
    • site-blog.domain.tld
  • Use Custom Properties to add the IP address
  • Add required local copies of the required monitoring templates
  • Edit the template to use the cProp vs. zProp.

Navigating Impact View

  • Right click on a node to see Child Services
  • Right click on the white space for Show All

Root Cause Analysis & Confidence Scores

2:39:00

  • Visible only in the Impact Events view
  • Use the Confidence scores.  Highest score likely your root cause.
  • Rating auto-magically applies.
    • Down (Critical) will rank higher than Degraded or At Risk
      • This can be altered by creating a Service Policy.  See below
    • The lower in the chain (closer to the hardware, sort of) will often rank higher than events higher up.

Synthetic Web Transactions using Twill

2:42:00

Logical Nodes

2:46:00

  • Logical Nodes are similar to Triggers
    • You define the criteria that sets them.
  • Often used for events that do not fall under the /Status/ or /Perf/ event classes.
    • Hint: Look at Event Classes to see all of the classes that are not /Status/ and /Perf/
    • Example: Event class /Capacity/ for a Predictive Threshold event for a hard disk.
    • When setting the Availability or Performance states, use the class (Availability or Performance) the event would fall under.
      • In the example above, running out of disk space would eventually become a /Perf/ event.
    • This can be thought of as mapping non-/Status/ and non-/Perf/ events to /Status/ or /Perf/
  • These can then be added as Members to any Dynamic Service

 

Service Policies

  • By default, all Dynamic Services fall under the Default Service Policy
  • Global Policies apply to all nodes above the affected node
  • Contextual Policies apply to ONLY the Parent node and are not propagated up.

Service Policies Overview

  • Collection of State Triggers.  The conditions under which the state of a Service or Member element changes.
    • Example:
      • Critical events cause the Dynamic Services state to be critical…
      • Don’t mark degraded unless 50% or more of (Member Set) are degraded
  • Applied to a service or any of its elements
  • Assist in root cause identification
  • Provide multiple availability and performance states
  • Provide event storm filtering, roll-up and windowing
  • Can be applied globally or contextually
  • State of impacting elements rolled up into state of impacted element, worst resulting state takes precedence.

How to Configure a Service Policy: Cluster Example

  • Cluster 1 Servers > Impact View
  • Ensure Aspect = Availability
  • Cluster 1 Servers > Right Click > Edit Impact Policies

Policy Applied

 

Events Applied

25% of servers down = No Change

50% of servers down = At Risk

75% of servers down = Degraded

100% of servers down = Down

 

Dashboard Portlet

  • Quick view of Dynamic Services’ States
  • Can be narrowed down by selecting an Organizer

Edit View

Dashboard View

 

Impact Triggers and Notifications

Triggers

  • Triggers can test for events in the Service event class
  • One use case for this are Atos Collector Failovers
    • When all delegate hosts in a collector are down, a trigger enables a Command notification that runs a script to move all devices to the backup collector.

Notifications

Notification type: SNMP Trap – Impact

  • Impact also includes a new Impact MIB
  • These notifications will send SNMP Traps that can be decoded with this MIB
  • Usage is very rare, but available

 

Tips and Tricks

Add Devices by Device Organizer

  • If you have automation that automatically adds/creates VMs that also adds them to Zenoss Monitoring and assigns them to a Device Organizer (Example: Systems > Clusters > Web), you can simply add the Organizer as a Member of a Dynamic Service.
  • This prevents the need to continually add/remove devices manually.

 

Best Practices

  • Create reusable Dynamic Services often.

LEAVE A COMMENT