Usm checker EN

Revision as of 11:04, 5 September 2024

This module is a replacement for the deprecated usm_ping module

Description

The module is designed to determine the availability of network nodes (including the availability of user devices).

To determine availability, various methods of verification are used, which give more reliable results depending on the specifics of particular networks or devices.

The module is analytical and cannot determine whether a node is active with 100% reliability. For example, if the tested node blocks ICMP requests, the only way to determine its availability may be to check for its entry in the ARP table of its router. Even that data may be unreliable: an entry remains in the ARP table for some time regardless of the actual availability of the node, and it is absent if no IP packets have been exchanged with the node for a long time. Therefore, choose the check methods that give the most accurate result for the device being checked, or configure the device to respond to ICMP ECHO.

For example, a ping check is not always a reasonable way to determine the availability of user equipment, since ICMP traffic is blocked on most users' equipment. For this purpose it is better to use a less accurate method, reading the ARP table; it introduces some error, but the error is negligible for determining the availability of users.

At the same time, reading the ARP table is not a reasonable way to determine the availability of equipment: after equipment becomes unavailable, its entry stays in the router's ARP table for some time (depending on the router settings, this can be quite a long period). For equipment it is better to use only ICMP-ECHO.

The module replaces the obsolete usm_ping module, which checked node availability only by ping.

Requirements

Python 3.5
USERSIDE 3.12.41+ (for version 2.0.0-alpha)
USERSIDE 3.12.69+ (for version 2.0.0-beta)
USERSIDE 3.13+ (for version 2.0.0)

Installation

Install Python and the pip package manager

sudo apt install -y python3 python3-dev python3-pip libsnmp-dev

Download the archive and extract the directory with the module files from it (do this when upgrading as well). You can use a Python virtual environment, adjusting the commands below accordingly, or install everything globally according to these instructions. Navigate to this directory and install the dependencies by entering the following command:

sudo pip3 install --upgrade -r requirements.txt

Then, if this is an initial installation, copy the settings.yml-example file to settings.yml:

sudo cp settings.yml-example settings.yml

Edit this file (familiarise yourself with the yaml format beforehand, if necessary):

Specify the correct values for the api section - URL of your USERSIDE and API key.

Configure the required detection methods (checks) as described in the next section, Strategies for checking availability. The example configuration file contains three checks of different types; the last two are commented out. Review the availability check strategies first and configure the necessary checks (one is sufficient for a test run) before proceeding.

Strategies for checking availability

There are a total of three strategies (methods) for checking node availability: ping, cmd, snmp

Each method contains mandatory fields such as:

method - method name
networks - list of subnets for which to apply this method.

And non-mandatory fields:

group - name of the group in which this check is included (the use of groups will be discussed later).
host_type - type of hosts to be selected from the specified subnets (networks parameter). Can take three values:
- any - any node types (default);
- customer - only nodes that are users;
- equipment - only nodes that are equipment.

This filter can help separate checks by host type if you mix both equipment and users on the same subnet. This parameter was introduced in version 3.12.69.

These fields apply to all checks, regardless of method.

Any number of checks can be configured in the configuration, if necessary. The methods in different checks may be repeated.
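
As an illustration of the common fields described above, the following Python sketch inspects a single check entry after it has been loaded from settings.yml into a dictionary. The function name validate_check and the messages are hypothetical, and rejecting subnets with host bits set is a choice of this sketch; it is not the module's own validation code.

```python
import ipaddress

KNOWN_METHODS = {"ping", "cmd", "snmp"}
HOST_TYPES = {"any", "customer", "equipment"}


def validate_check(check):
    """Return a list of problems found in one check entry (empty if none)."""
    problems = []
    if check.get("method") not in KNOWN_METHODS:
        problems.append("method must be one of: ping, cmd, snmp")
    networks = check.get("networks") or []
    if not networks:
        problems.append("networks must list at least one subnet")
    for net in networks:
        try:
            # strict=True rejects entries with host bits set, e.g. 192.168.1.5/24
            ipaddress.ip_network(net, strict=True)
        except ValueError:
            problems.append("invalid subnet: {}".format(net))
    # host_type is optional and defaults to "any"
    if check.get("host_type", "any") not in HOST_TYPES:
        problems.append("host_type must be any, customer or equipment")
    return problems


if __name__ == "__main__":
    # A well-formed entry produces an empty list of problems
    print(validate_check({"method": "ping", "networks": ["192.168.1.128/25"]}))
```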

Check methods

ping

This method detects active nodes by asynchronously flooding them with ICMP-ECHO requests and checking for responses. If a node responds, it is considered active; otherwise, it is considered inactive. The method works only on Linux.

This method is based on low-level interaction with asynchronous Linux network sockets and is a very powerful and fast way to detect active nodes in the network. All IP addresses to be polled are divided into iterations (the number of addresses in an iteration is determined by the in_iteration parameter), and each iteration is then started asynchronously: an avalanche of ICMP-ECHO-REQUEST traffic is sent to the network for all nodes in the iteration. After the time specified in timeout, the next iteration is started. If retry is set to 0, the nodes that did not respond during the previous iteration are considered unavailable; otherwise, the previous iteration is repeated up to retry times, or until all nodes in it have responded.
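
The division into iterations can be pictured with a small Python sketch. It only shows the batching step; the function name split_into_iterations is hypothetical, and the module's actual socket code is not reproduced here.

```python
import ipaddress


def split_into_iterations(addresses, in_iteration=10):
    """Split a list of addresses into consecutive batches of at most in_iteration items."""
    if in_iteration < 1:
        raise ValueError("in_iteration must be at least 1")
    return [addresses[i:i + in_iteration]
            for i in range(0, len(addresses), in_iteration)]


if __name__ == "__main__":
    # All usable host addresses of a /27 network (30 addresses), as strings
    hosts = [str(ip) for ip in ipaddress.ip_network("192.168.1.0/27").hosts()]
    batches = split_into_iterations(hosts, in_iteration=10)
    print(len(batches))                        # 3
    print(batches[0][0], batches[-1][-1])      # 192.168.1.1 192.168.1.30
```

Each batch corresponds to one iteration: all of its addresses are polled at once, and the next batch starts after the timeout.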

Because of the high polling rate, this method puts a heavy load on the network, consuming all free resources. The network subsystem of the server may fail to cope with the load, especially if it is a virtual machine (hypervisors specifically filter this type of flooding). The problem can also occur on intermediate or destination nodes, and some switches and routers filter flooding as well. All this can lead to ICMP-ECHO-REQUEST packets being dropped on the way from the module to the target node, or to the responses being dropped on the way back. If this happens (clearly active nodes do not respond and are marked as inactive), use smaller iterations (reduce in_iteration) and increase the timeout (the gap between iterations) to create less load. Find experimentally a value that your hardware is guaranteed to pass without loss.

This method may include additional optional parameters, such as:

in_iteration - the number of asynchronous ICMP-ECHO requests in one iteration. The default value is 10. If your network infrastructure does not let this much ICMP traffic through, reduce the value until network equipment no longer filters the traffic, down to 1 if the infrastructure cannot handle asynchronous polling at all. Conversely, it is worth increasing this value (up to 65535) if your network infrastructure is a monster capable of chewing through rebar.
timeout - time to wait for responses, in seconds (time per iteration). The default value is 1 second. Specifying a timeout of less than a second is not recommended. If you need to increase the time between iterations, increase this value. If your network infrastructure cannot handle asynchronous polling and in_iteration is set to a very low value, you can use values of less than a second to speed up polling.
retry - number of additional attempts. The default is 0: no additional attempts are made. If you have network segments with frequent losses, you can increase this value to 1 or 2, but no more, because the retry is performed for all nodes within one iteration (asynchronous batch).
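
To get a feeling for how these three parameters interact, the rough arithmetic can be sketched in Python. This is an estimate under stated assumptions, not a measurement: it assumes every iteration takes exactly timeout seconds, that in the worst case every iteration is repeated retry times, and it ignores the time spent sending packets. The function name estimate_ping_duration is hypothetical.

```python
import math


def estimate_ping_duration(node_count, in_iteration=10, timeout=1.0, retry=0):
    """Return (best_case, worst_case) duration of one ping check, in seconds."""
    iterations = math.ceil(node_count / in_iteration)
    best = iterations * timeout                  # no iteration is repeated
    worst = iterations * timeout * (1 + retry)   # every iteration repeated retry times
    return best, worst


if __name__ == "__main__":
    # 500 nodes with the default settings: 50 iterations of 1 second each
    print(estimate_ping_duration(500))                              # (50.0, 50.0)
    # Larger iterations shorten the run; retries lengthen the worst case
    print(estimate_ping_duration(500, in_iteration=50, retry=2))    # (10.0, 30.0)
```

An estimate like this can help decide whether a check is likely to exceed the 30-second guideline mentioned in the Automated startup section.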

cmd

This method is designed to detect active nodes by executing an arbitrary command. This can be useful for reading the ARP table using the ip neigh command, or using fping if the built-in asynchronous pinger is not suitable for some reason, or any other commands that output a list of IP addresses of active hosts one per line.

Your command must print to standard output a list of IP addresses of active hosts, one per line. Before using it in the configuration, make sure that the command works and produces exactly this output.
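
One possible way to verify a command before putting it into the configuration is a short Python script that runs it and reports every output line that is not a valid IP address. This is a sketch only: the helper name invalid_lines is hypothetical, blank lines are skipped by choice, and the command shown is the ip neigh example from this section.

```python
import ipaddress
import subprocess


def invalid_lines(output):
    """Return the non-empty lines of output that are not valid IP addresses."""
    bad = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            ipaddress.ip_address(line)
        except ValueError:
            bad.append(line)
    return bad


if __name__ == "__main__":
    # Replace with the command you intend to put into settings.yml
    command = "ip neigh show | awk '{print $1}'"
    result = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, universal_newlines=True)
    print("lines that are not IP addresses:", invalid_lines(result.stdout))
```

An empty list means every output line parsed as an IP address.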

ARP. An example would be the command ip neigh show | awk '{print $1}'

If your server uses the legacy net-tools instead of iproute2, you should use the arp tool instead of ip neigh.

For example, FreeBSD's arp -an command outputs a list of ARP records with the IP addresses inside parentheses. To extract them into a simple list of IP addresses, you can use the command:

arp -an | grep eth0 | awk -F '[()]' '{print $2}'
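
For readers less familiar with awk, the same extraction can be written as a short Python sketch: keep the lines that mention the interface and take the text between the first pair of parentheses. The sample lines are made up for illustration, and the real output format depends on the operating system; the function name ips_from_arp_output is hypothetical.

```python
import re

# Text between the first "(" and the following ")"
PAREN_IP = re.compile(r"\(([^()]+)\)")


def ips_from_arp_output(output, interface=None):
    """Extract the parenthesised address from each line of arp-style output."""
    ips = []
    for line in output.splitlines():
        if interface is not None and interface not in line:
            continue  # same effect as piping through grep <interface>
        match = PAREN_IP.search(line)
        if match:
            ips.append(match.group(1))
    return ips


# Illustrative sample only, not captured from a real system
SAMPLE = """\
? (192.168.2.10) at 00:11:22:33:44:55 on eth0
? (192.168.2.11) at 00:11:22:33:44:66 on eth1
"""

if __name__ == "__main__":
    print(ips_from_arp_output(SAMPLE, interface="eth0"))   # ['192.168.2.10']
```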

SSH. You can also execute a command on a remote host via ssh, for example: ssh gw.network.net -i /path/to/id_rsa 'ip neigh show dev eth0' 2>/dev/null | awk '{print $1}'

fping. You can use a call to the third-party utility fping if you are more comfortable and used to using its capabilities to check the availability of nodes. To do this, your command should be something like the following: /usr/bin/fping -r 1 -t 20 -a -A -q %hosts%. See man fping for more details on the parameters. Also for fping to work it is necessary to set ignore_status_code to yes.

The command is specified in the value of the command parameter. A variable %hosts% can be used inside the command, and during execution it will be replaced by IP-addresses to be checked, separated by spaces.
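
The substitution described above amounts to a simple string replacement. The following Python sketch illustrates the idea with the fping command from this section; build_command is a hypothetical name, not a function of the module.

```python
def build_command(template, hosts):
    """Replace %hosts% in the template with the space-separated list of addresses."""
    return template.replace("%hosts%", " ".join(hosts))


if __name__ == "__main__":
    cmd = build_command("/usr/bin/fping -r 1 -t 20 -a -A -q %hosts%",
                        ["192.168.2.1", "192.168.2.2"])
    print(cmd)   # /usr/bin/fping -r 1 -t 20 -a -A -q 192.168.2.1 192.168.2.2
```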

This method may include an additional optional parameter:

ignore_status_code - can take the values yes or no (default). Determines whether a nonzero exit status of the command is ignored. If you are using fping, be sure to set this parameter to yes.
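
The reason fping needs this setting is that it exits with a nonzero status whenever some of the hosts are unreachable, even though it still prints the reachable ones. The Python sketch below illustrates the difference. It is an assumption that a nonzero status is treated as an error when the option is off; run_check_command is a hypothetical helper, and a shell one-liner stands in for fping.

```python
import subprocess


def run_check_command(command, ignore_status_code=False):
    """Run command in a shell and return its non-empty output lines.

    Raises RuntimeError on a nonzero exit status unless ignore_status_code is True.
    """
    result = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, universal_newlines=True)
    if result.returncode != 0 and not ignore_status_code:
        raise RuntimeError("command exited with status {}".format(result.returncode))
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Stand-in for fping: prints one address, then exits with status 1
    fake = "echo 192.0.2.1; exit 1"
    print(run_check_command(fake, ignore_status_code=True))   # ['192.0.2.1']
```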

snmp

This method is designed to identify active nodes by reading the ARP table or any other SNMP table containing the IP addresses of active nodes using the SNMP GET-BULK request. By default, it reads from table .1.3.6.1.2.1.4.22.1.3 (ipNetToMediaNetAddress).

This method includes the following mandatory parameters:

community - read-only community name
host - SNMP agent host address

And one non-mandatory parameter:

value_from - can take the value value or index (default) and specifies from where to read the IP address, from the string value or from the index (suffix) of the OID.

You can also specify an alternate ARP table OID (if your device does not have a standard OID). The optional parameter oid is used for this purpose. You can use any other SNMP table containing IP addresses of active users. For example, the table of active BRAS sessions.

The IP address of a node can be contained in the value of each table row (value) or in the OID suffix of each table row (index). The module extracts the IP address of a node from the OID and uses it only if the value of this OID is not empty (this is convenient when the IP address is extracted from the index rather than from the value).
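
To illustrate what reading the address from the index means: in the standard ARP table the OID of each row ends with the interface index followed by the four octets of the IP address. A minimal Python sketch, assuming the address is always the last four sub-identifiers (the function name ip_from_oid_index is hypothetical and this is not the module's own parser):

```python
import ipaddress


def ip_from_oid_index(oid):
    """Return the IPv4 address formed by the last four sub-identifiers of an OID."""
    parts = oid.strip(".").split(".")
    if len(parts) < 4:
        raise ValueError("OID too short: {}".format(oid))
    # IPv4Address validates that every octet is in the range 0..255
    return str(ipaddress.IPv4Address(".".join(parts[-4:])))


if __name__ == "__main__":
    # Row of ipNetToMediaNetAddress for interface index 2 and address 192.168.2.10
    print(ip_from_oid_index(".1.3.6.1.2.1.4.22.1.3.2.192.168.2.10"))   # 192.168.2.10
```

Tables with a different index layout, such as a vendor-specific session table, may need a different rule.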

Initial test run

Perform a test run of the module with superuser rights:

 sudo python3 usm_checker.py

Messages about its operation will be displayed in the console. Make sure that the module has worked correctly. When the module finishes, it passes data about active hosts to USERSIDE and displays the number of hosts in USERSIDE that have changed their state as a result of the module. Make sure everything is working as it should before putting the module on automatic startup.

Automated startup

Edit the settings.yml file by changing the to_console value in the log section to no and the level value to 2.

Add a line to the system cron that periodically runs the module.

If you have a small network (up to 500 nodes), it is enough to run the module once every few minutes to detect the activity of both equipment and users.

 */2 * * * *    root    python3 /path/to/usm_checker.py > /dev/null 2>&1

In this case, all checks from the configuration file will be launched.

However, if the number of nodes is large or the module runs for a long time (more than 30 seconds), you should use a grouped launch. To do this, use the group parameter in the settings.yml file to divide the checks by groups. For example, separate the group for equipment by calling it equipment and the group for subscribers by calling it customers.

Note! In module versions before 2.1, only one group that uses the ping method may run at a time. Since version 2.1 the algorithm has been changed and parallel launching works normally. That is, before version 2.1, if you have only two groups (equipment = ping and customers = cmd), you can run them simultaneously via cron. But if you have two different groups that use the ping method (e.g. equipment_radio = ping, equipment_cable = ping), launching them simultaneously is undesirable. In that case, configure cron so that the two processes start with a time shift relative to each other.

Groups

Checks can be organised into groups. A group can contain any number of checks, but a check can only belong to one group. A configuration file might look like this (here the ping check is assigned to the equipment group, and the cmd check that reads the ARP table is assigned to the customers group).

checks:
  - method: ping
    group: equipment
    networks:
      - 192.168.0.0/24
      - 192.168.1.128/25
  - method: cmd
    group: customers
    command: "ip neigh show | awk '{print $1}'"
    networks:
      - 192.168.2.0/24

Now you can run the module separately for each group of checks by specifying the name of the required groups as arguments.

For example, to run checks for equipment once a minute, and checks for users once every 10 minutes, the cron would look like this:

 */1  * * * *    root    python3 /path/to/usm_checker.py equipment > /dev/null 2>&1
 */10 * * * *    root    python3 /path/to/usm_checker.py customers > /dev/null 2>&1

If you run the module without arguments, all checks from the settings file are executed, regardless of their group. To execute checks from several groups, list all the group names separated by spaces.
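
The selection rule can be summarised in a few lines of Python. This is only an illustration of the behaviour described here, using the example groups from this section; select_checks is a hypothetical name and not the module's code.

```python
def select_checks(checks, groups):
    """Return the checks to run for the given group names (all checks if none given)."""
    if not groups:
        return list(checks)
    wanted = set(groups)
    return [c for c in checks if c.get("group") in wanted]


CHECKS = [
    {"method": "ping", "group": "equipment", "networks": ["192.168.1.128/25"]},
    {"method": "cmd", "group": "customers", "networks": ["192.168.2.0/24"]},
]

if __name__ == "__main__":
    print([c["method"] for c in select_checks(CHECKS, ["equipment"])])   # ['ping']
    print(len(select_checks(CHECKS, [])))                                # 2
```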

In FreeBSD, you must either write the full path to python3, /usr/local/bin/python3, or (better) add /usr/local/bin to the PATH variable in the /etc/crontab file.

Log file rotation

The log files created by the module need to be rotated. To do this, the standard log rotation tools in the operating system should be used.

The following configuration will rotate all *.log files every day and store 7 archive copies (for 7 days)

In Linux, the logrotate utility is responsible for the rotation. Create the file /etc/logrotate.d/userside with the following content (remember to adjust the path if you have changed it):

/var/log/userside/*.log {
    rotate 7
    daily
    compress
    delaycompress
    missingok
    notifempty
}

Running the module in Docker

Instructions on how to use the usm_checker module in a Docker environment are available at: https://github.com/userside/usm_checker-docker-env.

F.A.Q.

Q: Nodes that are known to be active are identified as inactive when pinged.

A: Your server's network subsystem, or some device on the path of the ICMP traffic, rate-limits ICMP packets: it takes them for flooding and discards all packets over the limit. The module polls nodes asynchronously, that is, it sends a large number of ICMP-ECHO requests to the network at once, which looks like a strong instantaneous burst of ICMP traffic. To mitigate this, the module's ping method implements so-called iterations: blocks that each hold a certain number of IP addresses for asynchronous polling. Each iteration has its own timeout (1 second by default). This means that iterations are started at one-second intervals: first the ICMP-ECHO requests are sent to the network, then responses are awaited for one second, and then the next iteration is started. This helps to avoid shaping of ICMP traffic.

If nodes that are known to be active and available are identified by the module as unavailable, try reducing the number of IP addresses in one iteration, which is set by the in_iteration parameter of the ping method. You can reduce this value down to one. Keep in mind, however, that the more nodes are pinged asynchronously (simultaneously), the faster the polling is and the more up to date the information. Try to find the maximum value of in_iteration at which ICMP traffic is guaranteed not to be lost. On virtual machines with a simple, rudimentary network adapter shared with other virtual machines, this parameter may need values even lower than 10. It depends on how well your network hardware can handle ICMP traffic.

Q: PING error [Errno 97] Address family not supported by protocol

A: Most likely the IPv6 protocol is disabled on your server. In this case the module does not work correctly due to operating system limitations. Check the /etc/default/grub file for a line like GRUB_CMDLINE_LINUX="ipv6.disable=1"; if it is there, remove the ipv6.disable=1 parameter, rebuild the grub configuration with the sudo update-grub command, and reboot the server so that the change takes effect.
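
A quick way to check whether the operating system currently allows IPv6 sockets is to try to create one from Python. The sketch below is a diagnostic aid only, with a hypothetical function name; it does not reproduce what the module itself does with the socket.

```python
import socket


def ipv6_sockets_available():
    """Return True if the OS lets this process create an IPv6 socket."""
    try:
        sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    except OSError:
        # Typically Errno 97 (EAFNOSUPPORT) on Linux when IPv6 is disabled
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("IPv6 sockets available:", ipv6_sockets_available())
```

If this prints False, the error from the question above is to be expected until IPv6 is enabled again.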