Network performance can deteriorate because of internal or external problems, leading to faults that affect service performance, availability and ultimately the customer experience. Performance management lets you detect a deteriorating trend in advance and address potential threats proactively, so that faults can be prevented.

Read on to learn the main functions of performance management, how these functions create two big challenges for the Network Management Systems (NMS) that implement them, namely massive data handling and data unification, and which techniques can be used to address these challenges.

 

Functions of Performance Management

Network performance is typically monitored from one or more network management systems (NMS) located at the Network Operations Center (NOC). Through communication protocols (SNMP, NetFlow, FTP etc.), databases (PostgreSQL, MySQL, Oracle, Sybase, Hadoop etc.) and reporting engines, these systems provide the following functions:

  • Query the current and historical performance of a single network element, of the whole network or of a domain/sub-network.
  • Analyze the performance with predefined advanced tabular reports and graphs created specifically for each network element type to identify trends and predict faults.
  • Customize the pre-defined reports and graphs to cover new requirements for network performance monitoring.
  • Save these customized network performance monitoring reports so that they can be accessed again later.
  • Create completely new network performance monitoring reports and graphs based on the stored data.

 

All these operations rely on data, i.e. measurements of the performance of the devices and of the network as a whole. Therefore the Network Management System (NMS) selected to monitor the network must be able to process, store and administer performance data collected from the whole network using communication protocols such as the Simple Network Management Protocol (SNMP), NetFlow or even plain File Transfer Protocol (FTP).

 

The Two Big Challenges of Historical Performance Management

The first big challenge of Network Management Systems (NMS) focusing on Network Performance is:

“How to efficiently administer, i.e. collect, store, process and aggregate, the massive volume of performance measurement data collected over time, so that the relational database remains healthy, its size stays under control, and the responsiveness of the User Interface and Reports remains acceptable.”

The capability to handle large amounts of historical performance measurements is an advanced feature that general-purpose, entry-level network monitoring applications usually do not offer.

The second big challenge of Network Management Systems (NMS) focusing on Network Performance is:

“How to handle performance measurements that are not unified in structure or content. Performance measurements tend to differ for each network device type, as manufacturers frequently use proprietary protocols and data structures in their devices to gauge performance, utilization and availability. Even devices from the same vendor may have differences in the implementation of performance management capabilities.”

Although complex and demanding, the capability to monitor and analyse the performance of the network over time is a valuable tool for operating and expanding the managed network. It allows network administrators to proactively discover and address performance issues in their assigned networks through maintenance and troubleshooting, traffic re-routing or network expansion. Therefore, it is important to address the above challenges effectively.

 

Handling Massive Network Performance Data

The following techniques are important for Network Management Systems (NMS) designed for Performance Management, as they address the challenges of managing a massive volume of data records without requiring a large data centre like Apple’s:

  • Network data collection using multiple pollers, threads and probes: collecting massive data from a big network is made possible by advanced software techniques (multi-threaded pollers) and hardware architecture techniques (probes).
  • Aggregation of records: this is done according to flexible aggregation rules, using what are called round-robin database structures.
  • Scheduled archiving of old records to flat files (e.g. CSV files): the records are exported to a CSV file after the predefined retention period expires. Then, they are permanently deleted from the database.
  • Scheduled permanent deletion of old records: the records are permanently deleted from the database automatically after a user-predefined retention period.

Let’s examine each one separately.

 

Collection using Pollers

Pollers are used to proactively collect KPI (key performance indicator) information directly from a device on a scheduled basis. Using more advanced techniques, pollers can even be clustered and provide intelligent health monitoring data. Once collected, the metric information is stored in the database of the Network Management System and can then be provided to the user via the web interface, an emailed report, or other methods.

Collecting data can become a difficult challenge for pollers, especially in large distributed networks. Performance measurements are normally collected every 15 minutes, but some “instant” parameters are meaningful only when collected more frequently. Examples are radio performance measurements such as RSSI and CINR. These are updated at least every second, so collecting them every 15 minutes loses the detail needed to monitor effectively. But what happens when data is collected more frequently?

Let’s take for example a scenario where we want to collect sixty (60) parameters from a network of six thousand (6000) devices every second. Simple math shows that the Network Manager pollers must be able to collect 360,000 parameters per second! As this is beyond what a single simple process can handle, what solutions are available for this problem?
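The arithmetic for this scenario can be checked with a few lines of Python. This is only a sketch: the function and constant names are made up for illustration, and the 15-minute figure is added for comparison.

```python
# Polling load for the scenario above: 60 parameters from 6000 devices.
PARAMETERS_PER_DEVICE = 60
DEVICES = 6000

def samples_per_second(interval_seconds: int) -> float:
    """Average number of parameter values the pollers must collect per second."""
    return PARAMETERS_PER_DEVICE * DEVICES / interval_seconds

every_second = samples_per_second(1)            # polling every second
every_15_minutes = samples_per_second(15 * 60)  # polling every 15 minutes

print(f"1-second polling:  {every_second:,.0f} values/s")
print(f"15-minute polling: {every_15_minutes:,.0f} values/s")
```

Moving from 15-minute to 1-second polling multiplies the load by 900, which is why the collection architecture matters.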

 

There are three common solutions to this problem:

  • Use multiple threads (threaded pollers)
  • Use remote probes for better performance.
  • Use protocols that are more efficient with large data, such as FTP, NetFlow etc. instead of SNMP.

 

Multi-threaded Pollers can increase performance

A Performance Monitoring application is multi-threaded when its work can be split into several threads that run concurrently, allowing multiple CPUs to work on a single process.

Threading is very useful when collecting performance measurements. Instead of running performance collection on a single thread, multiple threads can be used so that collection jobs run concurrently and therefore complete sooner. While a single processor can run a multi-threaded application, the process may not run as quickly as it would on a multi-CPU system.
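A minimal sketch of a threaded poller, using Python's standard thread pool, might look like the following. The `poll_device` function is hypothetical: it only simulates a device request with a short sleep, and the returned RSSI value is made-up data.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def poll_device(device_id: int) -> dict:
    """Hypothetical poller: stands in for an SNMP or FTP request to one device."""
    time.sleep(0.01)  # simulated network round trip
    return {"device": device_id, "rssi_dbm": -60 - device_id % 20}

def poll_all(device_ids, workers: int = 20) -> list:
    """Poll all devices concurrently using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input ids
        return list(pool.map(poll_device, device_ids))

start = time.perf_counter()
results = poll_all(range(100))
elapsed = time.perf_counter() - start
print(f"Polled {len(results)} devices in {elapsed:.2f} s")
```

Polling spends most of its time waiting for devices to answer, so the threads mainly overlap that waiting time. The worker count of 20 is an arbitrary value for the sketch and would need tuning for a real network.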

 

Using probes to increase efficiency

A probe is a program or other device inserted at a key juncture in a network for monitoring or collecting data about network activity. In large, distributed network architectures, probes can provide the mechanism needed to gather data essential to the construction of network history and trends and be more efficient than central servers.

 

Probe features include:

  • Aggregate information across network sites near and far, extending your reach without sacrificing monitoring integrity.
  • Monitor network activity across multiple topologies.
  • Troubleshoot network performance.
  • Collect massive volumes of real-time NetFlow data.
  • Proactively monitor the performance of the network.

As you can’t manage what you can’t see, network probes can provide that visibility with accurate statistics concerning a network’s operation, such as traffic analysis and top users, applications and protocol usage. Using this information, you can manage the network efficiently, ensuring peak operation and performance.

 

Aggregation of Records

Once the challenge of collecting massive data is addressed, we face the next problem. How much data can be stored without affecting the performance of the NMS server, database or user interface? Is it useful to store all of this data in full detail (“raw data”) indefinitely?

Raw data are performance measurements that are kept exactly as they were collected and stored in the database. For example, if we collect measurements every 15 minutes, the raw data will consist of these 15-minute records for the period we have chosen to keep raw data. For obvious reasons there is a limit to the amount of raw data that can be kept in any database, especially for large networks. So we introduce a time limit for raw data, e.g. 1 week, and then we aggregate.

 

Usually when aggregating data, we define multiple aggregation time windows. The Aggregation Resolution, Aggregation period and Aggregation function are the parameters that define each time window.

For example, for the 1st aggregation time window, a compressing or statistical process is applied to the raw data: once the 15-minute records (raw data) become 1 week old, they are replaced by 60-minute (Aggregation Resolution) records that are kept, e.g., for 1 month (Aggregation Period). We have just defined the following parameters:

  • Raw data are the 15-minute performance measurements we collect, e.g. a 15-minute traffic parameter with value 200 Mbit indicates that 200 Mbit of data passed during the last 15 minutes.
  • Raw data time period is 1 week, i.e. after 1 week the raw data will be replaced by aggregated data.
  • 1st Aggregation Resolution is 60 minutes. In the previous example, four records of 15-minute raw data will be replaced by one record of 60-minute aggregated data.
  • 1st Aggregation Period is 1 month. That means that for one month the database will store 60-minute performance measurements. This is a well-defined amount of data that can be calculated in terms of number of rows and hard disk space.
  • 1st Aggregation Function is specific to the measurement. We need to be able to define the statistical function to be applied, e.g. average, min, max, etc., per counter.

Multiple Aggregation Time windows are used for flexibility.
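As a rough illustration of how the row count of each time window can be calculated, here is a small Python sketch. The `TimeWindow` class is a made-up helper, and "1 month" is assumed to be 30 days.

```python
from dataclasses import dataclass

@dataclass
class TimeWindow:
    name: str
    resolution_minutes: int   # one record per this many minutes
    period_days: int          # how long records stay in this window

    def rows_per_counter(self) -> int:
        """Number of records stored per counter while this window is full."""
        return self.period_days * 24 * 60 // self.resolution_minutes

# Values from the example above; "1 month" is assumed to be 30 days.
windows = [
    TimeWindow("raw", resolution_minutes=15, period_days=7),
    TimeWindow("1st aggregation", resolution_minutes=60, period_days=30),
]

for w in windows:
    print(f"{w.name}: {w.rows_per_counter()} rows per counter")

total_rows = sum(w.rows_per_counter() for w in windows)
print(f"total: {total_rows} rows per counter")
```

Multiplying the total by the number of counters and devices gives an upper bound on the table size, which is what keeps the database size predictable.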

 

Using the Appropriate Aggregation Function

For every subsequent aggregation time window, an aggregation function (avg, min, max, etc.) is applied to the already aggregated data. For example, twenty-four 60-minute records are replaced by a single 24-hour (Aggregation Resolution) record that is kept for 6 months (Aggregation Period), applying an average function. The Aggregation Function is selected per counter and, for consistency, is the same as in the 1st aggregation time window.

Each time we execute the Aggregation Action, a statistical function is applied to the stored measurements. This statistical function may differ from counter to counter and is applied according to the Aggregation Parameters (i.e. the Aggregation Period and the Aggregation Resolution). For example, during the 1st Aggregation, every four 15-minute records of the “EthernetTraffic” counter may be replaced by one 60-minute record (their sum) that is kept for 1 month.
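The per-counter choice of function can be sketched in Python as follows. Only “EthernetTraffic” with a sum comes from the example above; the other counter names, their functions and all sample values are illustrative assumptions.

```python
from statistics import mean

# Hypothetical per-counter rules: apart from EthernetTraffic (sum),
# the counter names and function choices are illustrative assumptions.
AGGREGATION_FUNCTIONS = {
    "EthernetTraffic": sum,   # volumes add up
    "CpuLoad": mean,          # levels are averaged
    "PeakLatency": max,       # keep the worst case
}

def aggregate(counter: str, values: list, group_size: int) -> list:
    """Replace every `group_size` consecutive records with one aggregated record."""
    func = AGGREGATION_FUNCTIONS[counter]
    return [
        func(values[i:i + group_size])
        for i in range(0, len(values), group_size)
    ]

# Eight 15-minute records (two hours) become two 60-minute records.
traffic = [200, 180, 220, 210, 190, 205, 195, 210]
hourly_traffic = aggregate("EthernetTraffic", traffic, group_size=4)
hourly_cpu = aggregate("CpuLoad", [10, 20, 30, 40, 50, 50, 50, 50], group_size=4)
print(hourly_traffic)
print(hourly_cpu)
```

Choosing the wrong function changes the meaning of the data: averaging a traffic volume, or summing a load percentage, would produce records that no longer describe the hour they cover.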

 

An Aggregation Example

In this example, we’ll aggregate the performance measurements collected through a schedule with time intervals of 15 minutes. The Aggregation Resolutions and Periods are shown in the following table:

 

Aggregation Resolution                     | Aggregation Period | Time Window
-------------------------------------------+--------------------+------------
Raw Data (1 record per 15 minutes)         | 1 day              | #1
Aggregated Data (1 record per 1 hour)      | 1 week             | #2
Aggregated Data (1 record per 12 hours)    | 1 month            | #3

 

The new measurement records that are collected and stored in the database continually traverse the aforementioned time windows.

Let’s see how this aggregation scenario works, step-by-step (as time passes by…):

Data collected during the 8th of June enter Time Window #1, so they remain raw (15-minute records).

Data collected during the 9th of June enter Time Window #1, so they remain raw (15-minute records). What happened to the 8th of June data? These data have shifted to Time Window #2, so they are now aggregated and replaced by 1 record per hour.

Data collected during the 15th of June enter Time Window #1, so they remain raw (15-minute records). Data stored in the database for the previous week (8th to 14th of June) belong to Time Window #2 and are therefore aggregated and replaced by 1 record per hour.

Data collected during the 16th of June enter Time Window #1, so they remain raw (15-minute records). Data stored in the database for the previous week (9th to 15th of June) belong to Time Window #2, with 1-hour resolution. Data stored in the database for the month before that (roughly the 8th of May to the 8th of June) belong to Time Window #3, with 12-hour resolution.

The following figure shows the shifting of the Aggregation Periods and what happens to the data records as time passes.

The Aggregation Period, Aggregation Resolution and Aggregation Function are three rules you set to define the aggregation mechanism.
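To make the shifting of records between time windows concrete, here is a small Python sketch that returns the window a record belongs to on a given day. It assumes the three windows of the example (1 day, 1 week, 1 month), takes "1 month" as 30 days, and uses an arbitrary year, since the example gives none.

```python
from datetime import date

# Time windows from the example: (window number, length in days).
# "1 month" is assumed to be 30 days.
WINDOWS = [(1, 1), (2, 7), (3, 30)]

def time_window(collected: date, today: date):
    """Return the time window a record belongs to, or None once it is older than all windows."""
    age_days = (today - collected).days
    upper = 0
    for number, length_days in WINDOWS:
        upper += length_days
        if age_days < upper:
            return number
    return None

# The year is arbitrary; only the day differences matter.
print(time_window(date(2024, 6, 8), date(2024, 6, 8)))   # collected today
print(time_window(date(2024, 6, 8), date(2024, 6, 9)))   # one day old
print(time_window(date(2024, 6, 8), date(2024, 6, 16)))  # eight days old
```

A scheduled job could run such a check once a day and aggregate every record whose window number has changed since the previous run.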

Archiving & Deletion of Records

The Network Management application must permit you to disable aggregation of old records and work only with raw data.

But as raw data can make the database grow very quickly, it is important to be able to set up a scheduled process that permanently deletes old records or exports them to flat files, e.g. comma-delimited files (.csv).
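A minimal sketch of such an archive-then-delete job is shown below. It uses Python's built-in sqlite3 module as a stand-in for whichever relational database the NMS actually uses; the `measurements` table, its columns and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3

def archive_and_delete(conn, cutoff: str, out) -> int:
    """Export records older than `cutoff` to CSV, then delete them.

    Returns the number of records archived.
    """
    rows = conn.execute(
        "SELECT ts, device, counter, value FROM measurements "
        "WHERE ts < ? ORDER BY ts",
        (cutoff,),
    ).fetchall()
    writer = csv.writer(out)
    writer.writerow(["ts", "device", "counter", "value"])
    writer.writerows(rows)
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM measurements WHERE ts < ?", (cutoff,))
    return len(rows)

# In-memory demo with a hypothetical `measurements` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (ts TEXT, device TEXT, counter TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [
        ("2024-05-01T00:00", "router-1", "EthernetTraffic", 200.0),
        ("2024-05-02T00:00", "router-1", "EthernetTraffic", 180.0),
        ("2024-06-15T00:00", "router-1", "EthernetTraffic", 220.0),
    ],
)
archive = io.StringIO()  # a real job would write to a .csv file instead
archived = archive_and_delete(conn, cutoff="2024-06-01T00:00", out=archive)
remaining = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
print(f"archived {archived}, remaining {remaining}")
```

Writing the CSV before deleting means a failed export never loses data. For deletion without archiving, the export step can simply be skipped.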

 

Summarising

I know this is a long article, but it has described in reasonable detail the main functions of performance management, how these functions create two big challenges for the Network Management Systems (NMS) that implement them, namely massive data handling and data unification, and the techniques that can be used to address these challenges.