Can AI speed up root cause analysis in networks?

Mobile networks are designed and built with security in mind, and it takes a lot for attackers to do harm. Yet security incidents can still happen, and when they do, quick countermeasures are essential. Incident investigation is equally important: finding the root cause makes it possible to take appropriate action so that the real problem is resolved quickly and prevented from happening again. These capabilities become even more critical as new applications, new use cases, and new industries connect to the network, raising the requirements for high resiliency, minimal downtime, and rapid recovery.

Root cause analysis is central to meeting these requirements. Recent high-profile network and cloud outages have led R&D communities to intensify their efforts to develop solutions that restore services much faster and minimize damage and loss. Today, there are also compliance regulations that require service providers to give timely information about the root cause of incidents, which reinforces the importance of root cause analysis.

In the 5G era, networks rely on virtualization for increased flexibility and performance. To realize these benefits, network function virtualization introduces several layers. When an incident occurs, symptoms detected at one level may very well have their true root cause at another level. It is important to link the symptoms to the true root cause effectively and efficiently even if they belong to different levels.

What is Network Functions Virtualization?

Network functions virtualization is the migration of network functionality from custom physical network nodes to software that runs on a generic hardware computing platform. It enables communications service providers (CSPs) to manage, move, and expand their network capabilities on demand using virtual software applications on distributed hardware resources.

In collaboration with researchers from Concordia University, we explored the possibilities for improving incident investigation in virtualized environments, particularly with regard to 5G. This research has resulted in a new solution that combines well-established graphical provenance analysis with AI-based techniques for effective root cause analysis.

Let’s walk through the thinking behind this research work and its applicability in mobile networks.

What is Provenance-Based Incident Investigation? And what are the challenges?

The provenance graph is a well-known tool for capturing the causal relationships between events that occur in a system. Provenance graph analysis can help identify the root cause of security incidents by tracing events in reverse causal order, from the last recorded event related to the incident (the symptom) back to the source event that caused it (the root cause).
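
As a minimal sketch of this backward tracing, assume a toy provenance graph stored as a dictionary that maps each event to the events that directly caused it. The event names and the `trace_back` helper are hypothetical and only illustrate the idea; they are not taken from any real system.

```python
from collections import deque

# Hypothetical toy provenance graph: each event maps to the events that
# directly caused it. All event names are invented for illustration.
CAUSED_BY = {
    "alert:port_exposed": ["update_security_group"],
    "update_security_group": ["login_admin_api"],
    "login_admin_api": ["stolen_credentials_used"],
    "stolen_credentials_used": [],
    "routine_backup": [],
}


def trace_back(symptom, caused_by):
    """Walk causal edges backwards from the symptom (breadth-first).

    Returns (visited events in visit order, candidate root causes), where a
    candidate root cause is a reached event with no recorded cause.
    """
    visited = []
    seen = {symptom}
    queue = deque([symptom])
    while queue:
        event = queue.popleft()
        visited.append(event)
        for cause in caused_by.get(event, []):
            if cause not in seen:
                seen.add(cause)
                queue.append(cause)
    roots = [e for e in visited if not caused_by.get(e)]
    return visited, roots


path, roots = trace_back("alert:port_exposed", CAUSED_BY)
print(roots)
```

Events that are not causally connected to the symptom, such as `routine_backup` here, are never visited, which is what makes the trace a useful starting point for an investigation.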

However, in a virtualized environment, this process can become difficult and expensive, for two main reasons:

  • With an increased number of logged events, the effectiveness and scalability of existing provenance-based solutions can significantly decrease if applied as is.
  • The layered aspect, introduced by the Network Functions Virtualization (NFV) environment, makes provenance capture and analysis very difficult and error-prone without proper models and processes.

For example, identifying the causal dependency and semantic relationship between events that occurred at different levels requires deep domain knowledge and most likely human expertise. However, the task of the human analyst could still be made faster and easier with the support of the right tools.

How to Tackle Incident Investigation in Multilevel NFV

In this context, we embarked on a research journey that resulted in what we call ProvTalk – a provenance analysis system designed to handle the unique multilevel nature of NFV. It is based on our previous root cause analysis research prototype, DominoCatcher.

Our solution was developed in collaboration with experts from Concordia University and does the following:

  • Connects provenance graphs at different levels of the NFV stack by capturing dependencies between levels.
  • Helps the human analyst identify the root cause of security incidents. To this end, it uses graph pruning techniques and frequent-pattern mining (of system-related or user-related patterns) to hide the complexity of graph analysis through aggregation, while preserving the details that are valuable for effective root cause analysis.
  • Uses a rule-based approach to automatically translate the details of a provenance graph (or a subset thereof) into an incident report that can be interpreted by human analysts.

Figure 1: Provenance Analysis Solution Overview

Provenance analysis and data mining: how does it work?

Let’s take a closer look at the technical features of the new provenance analysis solution.

To enable provenance analysis, we first defined a platform-independent provenance model based on the PROV-DM standard specification of the World Wide Web Consortium (W3C), which allows us to organize different levels of the NFV stack into different layers in the provenance graph. This model captures virtual resources (at different levels of abstraction) as nodes and operations on those resources as edges connecting the nodes. To define the dependencies between levels, we used specifically labeled edges to connect virtual resources of different levels.
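
The shape of such a layered model can be sketched in a few lines. This is an illustrative data structure only, not a PROV-DM implementation and not ProvTalk's actual model; the class names, level names ("service", "infrastructure"), and edge labels are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    """A virtual resource (graph node) at one level of the NFV stack."""
    name: str
    level: str  # hypothetical level names, e.g. "service", "infrastructure"


@dataclass(frozen=True)
class Edge:
    """A labeled edge: an operation, or a cross-level dependency."""
    source: Resource
    target: Resource
    label: str
    cross_level: bool = False


@dataclass
class ProvenanceGraph:
    edges: list = field(default_factory=list)

    def add_operation(self, source, target, operation):
        """Record an operation as an edge between two resources."""
        self.edges.append(Edge(source, target, operation))

    def add_dependency(self, upper, lower, label="depends_on"):
        """Record a specially labeled edge linking two levels."""
        self.edges.append(Edge(upper, lower, label, cross_level=True))

    def layer(self, level):
        """Operation edges whose endpoints both sit in the given level."""
        return [e for e in self.edges
                if not e.cross_level
                and e.source.level == level and e.target.level == level]

    def cross_level_edges(self):
        return [e for e in self.edges if e.cross_level]


vnf = Resource("vnf_firewall", "service")
vnf_v2 = Resource("vnf_firewall_v2", "service")
vm = Resource("vm_17", "infrastructure")

graph = ProvenanceGraph()
graph.add_operation(vnf, vnf_v2, "update_rules")
graph.add_dependency(vnf_v2, vm)
print(len(graph.layer("service")), len(graph.cross_level_edges()))
```

Keeping cross-level dependencies as ordinary edges with a flag means a single traversal can move both within a layer and between layers.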

Once the model is defined and verified, we use it to automatically capture all virtual resources at the different levels, together with the management operations that modify them. Event interception mechanisms deployed as middleware trace these changes as they happen at runtime at each level.
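
One common way to intercept operations without changing their logic is to wrap them. The sketch below uses a Python decorator as a stand-in for such middleware; the `intercept` decorator, the in-memory `PROVENANCE_LOG`, and the `create_vm` operation are all hypothetical and do not describe how ProvTalk or any NFV platform actually hooks its events.

```python
import functools

# In-memory stand-in for a provenance store (assumption for this sketch).
PROVENANCE_LOG = []


def intercept(level):
    """Wrap a management operation so each successful call is recorded."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(actor, resource, *args, **kwargs):
            result = func(actor, resource, *args, **kwargs)
            # Record who did what, to which resource, at which level.
            PROVENANCE_LOG.append({
                "level": level,
                "operation": func.__name__,
                "actor": actor,
                "resource": resource,
            })
            return result
        return wrapper
    return decorator


@intercept(level="infrastructure")
def create_vm(actor, resource):
    """Hypothetical management operation."""
    return f"{resource} created"


create_vm("orchestrator", "vm_17")
print(PROVENANCE_LOG)
```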

But what should be done when an incident occurs, for example a security breach of virtualized resources at a given level? First, the multilevel provenance graph should be searched for the root cause, starting from the point where the alert was first raised. Since human involvement is fundamental in this process and such a provenance model usually includes too much low-level information to process manually, we have developed a set of tools that simplify the provenance information and make it easier for human analysts to interpret and understand what happened. All of these tools run automatically, and their output is then provided to the analyst, who can adjust and analyze it by applying their own expertise.

These tools are executed in three steps:

Step 1: Multilevel pruning

The first tool is multilevel pruning, which uses the meta-information in the incident alert to filter irrelevant information out of the provenance graph by following cross-level dependencies. This means that human analysts can identify potentially irrelevant parts of the graph at different levels much more efficiently, which is not otherwise possible today. This tool can significantly reduce the search space for root causes.
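
To make the idea concrete, here is a simplified sketch that prunes by reachability: starting from the resource named in the alert, it keeps only the edges that lie on some path leading to that resource, following cross-level edges like any other. The actual pruning criteria used by ProvTalk are not described in this article, and the edge data and `prune` function are hypothetical.

```python
# Hypothetical edges as (source, target, label); "deployed_on" plays the
# role of a cross-level dependency between a VNF and its VM.
EDGES = [
    ("tenant_admin", "vnf_firewall", "update_rules"),
    ("vnf_firewall", "vm_17", "deployed_on"),
    ("vm_17", "port_22", "open_port"),
    ("tenant_b", "vnf_cache", "scale_out"),
    ("vnf_cache", "vm_42", "deployed_on"),
]


def prune(edges, alerted_resource):
    """Keep only edges on a causal path leading to the alerted resource."""
    incoming = {}
    for src, dst, _label in edges:
        incoming.setdefault(dst, []).append(src)

    relevant = {alerted_resource}
    stack = [alerted_resource]
    while stack:
        node = stack.pop()
        for src in incoming.get(node, []):
            if src not in relevant:
                relevant.add(src)
                stack.append(src)

    return [e for e in edges if e[0] in relevant and e[1] in relevant]


# The alert meta-information is assumed to name the affected resource.
pruned = prune(EDGES, "port_22")
for edge in pruned:
    print(edge)
```

In this toy example the two edges involving `vnf_cache` are dropped, leaving the analyst with three edges that span both levels.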

Step 2: Aggregation based on mining

The second tool is data mining-based aggregation, which allows parts of the graph to be grouped reversibly to reduce redundancy in the graph and add high-level semantics to low-level operations. This can make the provenance graph much easier to understand. Specifically, this aggregation targets the most frequent sequences of lower-level operations that are automatically triggered after a higher-level operation in the NFV stack. It also targets routine administrative operations (for example, maintenance tasks) that regularly appear in the provenance graph. Mining-based aggregation gives human analysts a condensed view of what is happening at the low level, allowing them to focus on the main task: finding root causes.
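
The sketch below illustrates the principle with a deliberately simple stand-in for frequent-pattern mining: it counts exact low-level sequences observed after each high-level operation and collapses the most frequent one, while keeping the original steps so the grouping can be undone. The trace data, the `min_support` threshold, and both functions are assumptions for illustration; a real miner would handle partial and interleaved sequences.

```python
from collections import Counter

# Hypothetical traces: (high-level operation, low-level operations it triggered).
TRACES = [
    ("create_vnf", ("create_vm", "create_port", "attach_port")),
    ("create_vnf", ("create_vm", "create_port", "attach_port")),
    ("create_vnf", ("create_vm", "create_port", "attach_port", "open_port_22")),
    ("delete_vnf", ("detach_port", "delete_vm")),
]


def mine_patterns(traces, min_support=2):
    """Most frequent low-level sequence per high-level operation."""
    counts = Counter(traces)
    patterns = {}
    for (high, seq), n in counts.items():
        if n < min_support:
            continue
        best = patterns.get(high)
        if best is None or n > counts[(high, best)]:
            patterns[high] = seq
    return patterns


def aggregate(traces, patterns):
    """Mark routine sequences as collapsed; originals are kept (reversible)."""
    return [
        {"operation": high,
         "collapsed": patterns.get(high) == seq,
         "low_level": seq}
        for high, seq in traces
    ]


patterns = mine_patterns(TRACES)
view = aggregate(TRACES, patterns)
for entry in view:
    detail = ("(routine steps hidden)" if entry["collapsed"]
              else " -> ".join(entry["low_level"]))
    print(entry["operation"], detail)
```

The third trace deviates from the routine pattern, so it stays expanded, and the unusual `open_port_22` step remains visible to the analyst.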

Step 3: Translate the graph into human-readable text

Finally, when certain paths have been identified in the provenance graph, the third tool can be used to translate those parts into text that is easily readable by human analysts and provides additional useful guidance in the investigation process. This functionality can also be used to generate a report describing the result of the analyst’s investigations. The generated report explains in natural language (in our case English) what happened and how the symptoms of the incident relate to the root cause. It does this by describing which virtual resources and suspicious operations were involved, when the operations happened, and which parties performed them.
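
A rule-based translation of this kind can be as simple as one sentence template per operation type. The following sketch assumes each step on a path carries a time, an actor, an operation, and a target; the templates, field names, and `describe` function are hypothetical and are not the rules used in ProvTalk.

```python
# Hypothetical rules: one sentence template per operation label.
TEMPLATES = {
    "update_rules": "{actor} updated the rules of {target}",
    "open_port": "{actor} opened {target}",
}
# Fallback rule for operations without a dedicated template.
DEFAULT_TEMPLATE = "{actor} performed {operation} on {target}"


def describe(path):
    """Turn a list of provenance steps into plain English sentences."""
    sentences = []
    for step in path:
        template = TEMPLATES.get(step["operation"], DEFAULT_TEMPLATE)
        sentences.append(f"At {step['time']}, {template.format(**step)}.")
    return " ".join(sentences)


PATH = [
    {"time": "10:02", "actor": "tenant_admin",
     "operation": "update_rules", "target": "vnf_firewall"},
    {"time": "10:03", "actor": "vnf_firewall",
     "operation": "open_port", "target": "port_22"},
    {"time": "10:05", "actor": "unknown_host",
     "operation": "connect", "target": "port_22"},
]

report = describe(PATH)
print(report)
```

Each sentence covers the three elements named above: the resources and operations involved, when the operation happened, and which party performed it.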

Towards effective incident management in 5G and future 6G

The main benefit of this research work is the ability to construct a concise and interpretable provenance graph from data comprising a large number of events that have taken place across multiple levels of NFV. This makes it easier for the human analyst to find the root cause of a security incident. It also significantly reduces the size of the provenance graphs without losing information vital to the investigation, and with lower latency and computational load. For CSPs, this offers the obvious and substantial benefits of reduced incident investigation costs, as well as a significant improvement in incident response time.

It is expected that our solution can be smoothly adapted as a basis for efficient incident analysis also in 6G. Many aspects of future 6G will evolve from 5G, with virtualization and cloud-native technologies as key drivers, and we have specifically designed ProvTalk to deal with the specificity and complexity of virtualization environments.

Further reading

Read the research paper behind this work, published in the Network and Distributed System Security Symposium (NDSS).

Learn about other Ericsson cybersecurity initiatives developed in collaboration with Concordia University.

Learn more about Ericsson’s vision for future network security.

Learn more about Network Functions Virtualization (NFV) and its role in improving 5G reliability.

This work was carried out as part of the Industrial Research Chair between Ericsson and Concordia University with funding from the Natural Sciences and Engineering Research Council of Canada (NSERC). Learn more here.
