Computate and Red Hat OpenShift Observability use case

Author: Christopher Tate

The beginning of Observability in the Mass Open Cloud

The Red Hat OpenShift Observability use case began in July 2022, after I switched teams at Red Hat from Financial Services Industry Consulting to the Red Hat Research team. In Red Hat Research, in addition to working on the Smarta Byar Smart Village research project, I was asked to deploy Red Hat ACM Observability and Loki Logging to the New England Research Cloud OpenShift environment.

I made my first pull request for the New England Research Cloud to add ACM Observability resources to the infrastructure OpenShift cluster. I then configured the new Loki Stack for multi-cluster logging as a centralized log search solution, replacing the Elasticsearch-based log solution OpenShift used in the past, which is no longer open source. This also involved deploying the OpenShift Logging Operator and forwarding logs from each of the clusters to the central log solution on the infrastructure cluster. I also deployed the free version of the Grafana Operator and integrated the ACM Observability data source into a developer instance of Grafana with edit and create access to Grafana dashboards, which is unavailable in the read-only Grafana instance provided by the Red Hat ACM Observability product. Then I deployed the desired alerts in ACM Alert Manager to send updates to Slack when certain metric events occurred. To wrap up 2022, I wrote monitoring and logging documentation for the New England Research Cloud covering most of its data management acceptance tests.
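
As a sketch of the log forwarding piece, the OpenShift Logging Operator's ClusterLogForwarder resource on each cluster pointed application, infrastructure, and audit logs at the central Loki stack; the gateway URL and secret name below are hypothetical, not NERC's actual values:

```yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: central-loki
      type: loki
      # Hypothetical URL for the central Loki gateway on the infra cluster
      url: https://loki-gateway-central.apps.infra.example.org
      secret:
        name: loki-forwarder-tls   # hypothetical secret holding client TLS certs
  pipelines:
    - name: forward-all-logs
      inputRefs:
        - application
        - infrastructure
        - audit
      outputRefs:
        - central-loki
```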

Troubleshooting storage

After the holidays at the start of 2023, there was still trouble with object storage for logging. The NooBaa object storage provided by the Red Hat OpenShift Data Foundation product, connected to NERC's external Ceph cluster NESI, was unstable for an unknown reason. Since both ACM Observability and Loki logs require object storage, we used the existing OpenShift Data Foundation NooBaa object storage in the cluster. As Loki ingested logs from our OpenShift clusters, the NooBaa BackingStores would become degraded and stop working. Even with Red Hat support, nothing we tried resolved the object storage issue. We continued to onboard new metrics for IPMI and Ceph, but we could never query more than 1-2 weeks of data because of the broken object storage.
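
Both Loki and ACM Observability consumed S3-compatible buckets from NooBaa through ObjectBucketClaims along the lines of the following sketch, where the claim name and namespace are assumptions:

```yaml
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: loki-bucket           # hypothetical claim name
  namespace: openshift-logging
spec:
  generateBucketName: loki-bucket
  # The NooBaa object bucket storage class provided by OpenShift Data Foundation
  storageClassName: openshift-storage.noobaa.io
```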

Onboarding operators

Near the end of 2023, our new NERC OpenShift production cluster became available. Because of my work with the Smart Village project, I saw the need for several OpenShift Operators that would be useful now and in the future. In addition to Red Hat ACM, the Red Hat OpenShift Elasticsearch Operator, the Red Hat Loki Operator, and the Red Hat OpenShift Logging Operator required for metrics and logging, I installed the Red Hat OpenShift Serverless and Red Hat Integration - Camel K Operators, which turned out to be a requirement of OpenShift AI model serving as well. I installed the Red Hat Integration - AMQ Broker for RHEL 8 (Multiarch) Operator and the Red Hat Integration - AMQ Streams Operator (Kafka) to prod, which are useful for event-driven messaging. I installed the Ansible Automation Platform Operator to prod, which is the Red Hat way of doing automation. I installed the Apache Solr Operator Helm Chart Repository and CRDs to prod; Solr is used in applications like Smarta Byar Smart Village and AI Telemetry as a fast, filterable search engine with pagination, faceting, statistics, and pivoting on data, and for building REST APIs.
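
Each of these operators is installed the same way through Operator Lifecycle Manager; a minimal Subscription sketch for one of them follows, where the channel and namespace are assumptions that vary by operator and release:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: amq-streams
  namespace: openshift-operators
spec:
  # Channel names vary by operator and release; "stable" is an assumption here
  channel: stable
  name: amq-streams
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```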

Red Hat Research Observability Gig #1

Because we were stuck with storage-related issues in ACM Observability and Loki logs, we decided to reach out to other Red Hat associates who would like to participate in a 6-month gig with us starting in October 2023. We had an excellent response and were able to bring in 3 other Red Hat associates to work part-time, up to 10 hours per week, with Red Hat Research and the Mass Open Cloud to find solutions to observability issues. We had a senior engineer in Germany who provided Loki RBAC, storage, and backup strategy work and bug fixes during that time. We had an architect in the US who fixed our ACM 2.8 upgrade issues and helped us with our ACM Observability configuration. We had another architect in the US who implemented our external Grafana and new observability cluster infrastructure. Thorsten Schwesig, who brought an extensive background in observability, also joined our Red Hat Research team and began contributing full time to observability in NERC. Together, we accomplished a very successful gig, and the most important part was the complete installation of a brand new OpenShift cluster for Observability.

NERC Observability OpenShift cluster

The NERC team as a group decided a new OpenShift cluster was required to relieve some of the memory, compute, and network bottlenecks from running an ACM infrastructure management cluster in addition to logs and metrics. Other reasons included these:

  1. We wish to make our metrics available to researchers.
  2. Our metrics and observability currently run on the infra cluster.
  3. The infra cluster is not on a public network.
  4. The infra cluster is strained by the load of infrastructure processes as well as logging, metrics, and observability.
  5. A new cluster that manages metrics, logs, and observability would solve many of these problems.

We were able to accomplish the entire cluster install and reconfiguration of metrics and logs as part of our gig.

Updates to NERC infrastructure enable fine-grained resource permissions for observability data.

This section of the use case is quoted directly from the article "Observability cluster added to the MOC Alliance’s New England Research Cloud" in the Red Hat Research Quarterly, by Thorsten Schwesig and me.

Observability data provides essential insights for optimizing performance, troubleshooting, and using resources sustainably. For users of the New England Research Cloud (NERC), part of the Mass Open Cloud (MOC) Alliance, this data also provides critical information for innovative research projects. Until recently, access to this data was restricted for most users.

A standalone cluster

NERC container infrastructure is based on OpenShift and includes several clusters (e.g., an infra cluster, prod cluster, and test cluster) operated within a VPN. Access to these clusters is therefore limited. This restriction especially affects observability data, such as metrics, logs, and traces. As the amount of observability data continues to grow, it becomes increasingly useful for research and teaching, independent of the applications, models, and data that generate it.

Initially, the observability data and systems in NERC, such as Thanos, Prometheus, Grafana, and Loki, ran on the infra cluster, placing higher demands on that cluster that could affect its operation in extreme cases. To enable access to observability data outside the VPN—and to relieve the infra cluster and separate tasks—we developed and implemented the idea of a standalone observability cluster.

Since March 2024, the NERC Observability Cluster has been running in its base version and has already successfully met several requirements. The cluster captures and stores metrics and logs with an increased retention rate and is accessible outside the VPN, which makes it much easier for researchers and educators to use. Additionally, we have made static dashboards for NERC data available in Grafana, providing a first basic visualization of the collected data to support analysis and monitoring, along with the ability to develop new dashboards.
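
The increased retention is a setting on ACM's MultiClusterObservability resource; a minimal sketch follows, where the retention values and object storage secret name are illustrative assumptions rather than NERC's actual configuration:

```yaml
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  advanced:
    retentionConfig:
      # Illustrative retention values, not NERC's actual settings
      retentionResolutionRaw: 30d
      retentionResolution5m: 180d
      retentionResolution1h: 365d
  storageConfig:
    metricObjectStorage:
      name: thanos-object-storage   # secret containing the S3 bucket configuration
      key: thanos.yaml
```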

Controlling data access

With the NERC Observability Cluster in place, our next step was implementing fine-grained access control. With multiple research projects and classes hosted on NERC, maintaining data privacy compliance is essential. We needed to ensure that specific user groups, such as admins, researchers, professors, students, and apps (via API access), can access the data they need, and only the data they need.

Our primary challenges were ensuring seamless integration and maintaining high security standards. We accomplished this in May 2024 by introducing a new keycloak-permissions-operator to both operatorhub.io and Red Hat OpenShift, automating a previously missing feature of the Red Hat Build of Keycloak Operator. It exposes an advanced authorization feature of Keycloak and makes it easy to configure user, group, and application access to resources. We configure Keycloak with resource definitions, scopes, and permissions and set up a secure proxy to validate access tokens. We initially built these resources for the AI for Cloud Ops project team to give them access to certain metrics only on the prod OpenShift cluster. However, the operator is highly reusable for other customers and projects as well.
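
As an illustration only, a custom resource for this operator might declare a Keycloak authorization resource, scope, and group permission along these lines; the API group, kind, and field names here are hypothetical sketches, not the operator's actual CRD:

```yaml
# Hypothetical sketch only: the real keycloak-permissions-operator CRD
# names and fields may differ from what is shown here.
apiVersion: keycloak.example.org/v1   # hypothetical API group
kind: KeycloakFineGrainedPermission   # hypothetical kind
metadata:
  name: ai-for-cloud-ops-prod-metrics
spec:
  realm: NERC
  client: nerc
  resource:
    name: prod-cluster-metrics        # Keycloak authorization resource
    scopes: [GET]                     # scopes granted on the resource
  groupPolicy: ai-for-cloud-ops       # group allowed to access the resource
```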

The next step was to deploy a reverse-proxy (prom-keycloak-proxy) with fine-grained resource permissions authentication and authorization between applications on NERC and Red Hat Advanced Cluster Management (ACM) observability metrics. We’ve also shared this work with the ACM Observability team, which has features for fine-grained access to metrics on its roadmap.

Future enhancements and long-term goals

In the next phase of this project, we will develop and implement mechanisms for data anonymization to ensure both privacy and usability of the data for research. We also plan to implement traces and develop interactive, dynamic dashboards that allow personalized and detailed data analysis. Additionally, the retention rate will be further optimized to support long-term analyses.

In time, we aim to introduce a proactive alerting and optimization system that captures event-based logs and provides targeted recommendations and optimizations. Additionally, we will continuously optimize the cluster’s scalability and performance. We plan to promote use of the cluster by more research projects and institutions and integrate additional observability systems and data sources for a more comprehensive analysis of system performance.

The NERC Observability Cluster represents a significant improvement in the accessibility and usability of observability data for research and education. With ongoing development, it will meet growing demands and provide a solid foundation for innovative research projects. The key ideas and tools we’ve used can also be applied to other kinds of data that require fine-grained access control. Keep up with our work on NERC Observability on GitHub.

AI Telemetry

The New England Research Cloud is a perfect environment for AI/ML-related research. We wish to help researchers and managers of GPU devices in the New England Research Cloud (NERC) shared computing environment understand how the valuable and limited GPU resources are being utilized. We will provide researchers and managers access, through Keycloak with GitHub as the identity provider, to the AI cluster data and research project GPU allocation telemetry through a modern open source platform that will be built for the MOC and usable by other organizations wishing to collect the same telemetry. Other Red Hat associates outside of Red Hat Research have also been invited to participate in building the platform through the Red Hat Internal Gig Program.

  1. The objective of the AI Telemetry application is performance introspection/telemetry for AI optimization.
  2. The initiative is to build a distributed, optimized AI/ML platform from the MOC baseline.
  3. The use case is to develop a platform for reporting on AI/ML cluster and GPU device usage data and performing telemetry on AI/ML workloads through a secure dashboard and API.

Required components to be in place

The AI Telemetry architecture comprises several components that are already in place and several that will be developed as part of this project:

  1. GPU enabled clusters
  2. ACM Hub on Infra cluster
  3. Prom Keycloak Proxy
  4. GPUs in the New England Research Cloud have already been deployed for the following use cases, and more GPUs are coming later in the year:
    1. OpenShift AI GPU Workbenches
    2. OpenShift AI Model Servers
    3. OpenShift Deployments with a GPU allocation
    4. RHEL AI OpenShift VMs
    5. RHEL AI bare metal machines
    6. InstructLab on OpenShift
  5. The ACM Hub on the Infra cluster is already deployed in the New England Research Cloud. We have also built a new Observability OpenShift cluster to improve ACM Observability performance, and we are migrating logs and metrics to it.
  6. The Prom Keycloak Proxy and Keycloak Permissions Operator were a separate project completed earlier in 2024 that gives New England Research Cloud researchers access to metrics, including GPU metrics, backed by Keycloak fine-grained authorization to specific clusters and namespaces.

Components being developed

  1. OpenShift OpenTelemetry is a collection of 3 operators (Red Hat Build of OpenTelemetry, Red Hat OpenShift Distributed Tracing Platform, Tempo Operator), which we have already deployed to the NERC Observability cluster. We now need to configure and utilize these same components to enable tracing and metrics for GPU-related research activity (see the collector sketch after this list).
  2. The AI Telemetry Worker component will be a Quarkus/Vert.x reactive, asynchronous, event bus driven background worker application that will receive messages from AlertManager about starting OpenTelemetry traces. It can also run scheduled cron jobs that collect metrics from APIs on a regular schedule.
  3. The AI Telemetry Site is a similar Quarkus/Vert.x reactive, asynchronous, event bus driven application, except this one is a front-end application. It's a similar platform to the Smarta Byar Smart Village research platform, as well as 2 other Red Hat Social Innovation projects running in NERC (rerc.southerncoalition.org and opendatapolicingnc.com) built with the computate.org open source platform. Thanks to the computate.org code generation technology demonstrated in the Smart Device API Code Generation during the Red Hat AI Combinator Hackathon, very useful dashboards and OpenAPIs for our models (like AI Cluster, AI Node, GPU Device, GPU, GPU Slice, Research Project) come together fast. These dashboards will be useful to researchers and managers of research projects utilizing GPUs to observe and review telemetry of GPU usage within the clusters and projects.
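
A minimal sketch of the collector configuration from item 1, assuming a hypothetical collector name, namespace, and Tempo endpoint:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: ai-telemetry            # hypothetical collector name
  namespace: observability      # hypothetical namespace
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      batch: {}
    exporters:
      otlp/tempo:
        # Hypothetical Tempo distributor endpoint on the obs cluster
        endpoint: tempo-tempostack-distributor.observability.svc.cluster.local:4317
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/tempo]
```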

How the AI Telemetry project works

The Red Hat Research team, together with the Red Hat Gig participants, will build the applications approved to do AI Telemetry with access to the NERC OpenShift cluster metrics, and integrate these components:

  1. The NVIDIA Operator is in place in NERC to collect metrics about GPUs and which cluster, node, project, container, and pod they are assigned to.
  2. The AI Telemetry Worker will be built to query ACM Observability GPU metrics on NERC on a regular schedule to populate our base models of what is available in each AI Cluster, OpenShift Node, GPU Device, GPU, GPU Slice, and each Research Project allocation.
  3. This aggregated data for each model at each level will be available in a dashboard in the AI Telemetry Site.
  4. AlertManager will be configured to detect when GPU usage starts and stops, triggering alerts directly to the AI Telemetry Worker using the http_config, oauth2, and tls_config provided by the Prometheus Alertmanager configuration (see the configuration sketch after this list).
  5. The AI Telemetry Worker will be built to take AlertManager GPU usage events and start traces in the Red Hat Build of Open Telemetry Operator.
  6. The AI Telemetry Worker will be built with open source standards for IoT device data like NGSI-LD and @Context data already supported by existing Red Hat Research projects and the FIWARE open source community, so that researchers have documentation and insights into every data point available in the AI Telemetry system.
  7. The AI Telemetry Site will be built to connect researchers and managers directly to the projects and GPU metrics and traces that they observe.
  8. The Keycloak service deployed on the obs cluster has been configured with research team policies and permissions to approved metrics, applied by our new Keycloak Permissions Operator.
  9. The NERC Observability admin team will be able to grant permissions to approved managers and research teams to access metrics and telemetry for their GPU enabled AI projects on the obs cluster through the AI Telemetry Site dashboards.
  10. The NERC Observability admin team can share a Client ID and Client Secret in the OpenShift project with approved research teams to access metrics for their GPU enabled AI projects.
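
A sketch of the Alertmanager configuration described in step 4 follows; it is a fragment of the full configuration, and the worker URL, Keycloak token URL, secret file paths, and alert matcher are all assumptions:

```yaml
# Fragment of an Alertmanager configuration routing hypothetical GPU usage
# alerts to the AI Telemetry Worker over an OAuth2-secured webhook.
route:
  routes:
    - receiver: ai-telemetry-worker
      matchers:
        - alertname =~ "GPUUsage.*"   # hypothetical GPU usage alerts
receivers:
  - name: ai-telemetry-worker
    webhook_configs:
      - url: https://ai-telemetry-worker.apps.obs.example.org/alerts
        http_config:
          oauth2:
            client_id: ai-telemetry-worker
            client_secret_file: /etc/alertmanager/secrets/client-secret
            token_url: https://keycloak.apps.obs.example.org/realms/NERC/protocol/openid-connect/token
          tls_config:
            ca_file: /etc/alertmanager/secrets/ca.crt
```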

The technology

  1. NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labelling using GFD (GPU Feature Discovery), DCGM-based monitoring, and others.
  2. Red Hat Advanced Cluster Management Observability provides a centralized hub for metrics, alerting, and monitoring of platforms in a multi-cluster environment. In addition, the observability component focuses on displaying cluster health metrics, which describe control plane health, cluster optimization, and resource utilization. The service gets deployed automatically to each cluster when Observability is enabled in RHACM.
  3. Red Hat Build of Keycloak Operator is a cloud-native Identity and Access Management solution based on the popular open source Keycloak project. We configure a realm called NERC and a main client called nerc where permissions to all clients are granted. We create a new client for each approved research team requiring access to metrics with the Red Hat Build of Keycloak Operator (see the realm import sketch after this list).
  4. Keycloak Permissions Operator is an OpenShift Operator for managing Keycloak resources, scopes, policies, and permissions for fine-grained resource permissions. This operator is built by the NERC software engineers. It's available as an OpenShift Operator and a Kubernetes Community Operator.
  5. Prometheus Keycloak Proxy is a proxy for Observatorium and Prometheus on OpenShift, secured by Keycloak Fine-Grained Resource Permissions. This application is built by the NERC software engineers.
  6. Red Hat Build of OpenTelemetry, also including Red Hat OpenShift Distributed Tracing Platform and Tempo Operator, is for collecting unified, standardized, and vendor-neutral telemetry data for cloud-native software in OpenShift Container Platform.
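
A sketch of the realm and client setup from item 3, using the Red Hat Build of Keycloak Operator's KeycloakRealmImport resource; the namespace, Keycloak CR name, and client flags are assumptions:

```yaml
apiVersion: k8s.keycloak.org/v2alpha1
kind: KeycloakRealmImport
metadata:
  name: nerc-realm
  namespace: keycloak           # hypothetical namespace
spec:
  keycloakCRName: keycloak      # name of the Keycloak CR instance (assumption)
  realm:
    realm: NERC
    enabled: true
    clients:
      - clientId: nerc
        serviceAccountsEnabled: true
        # Enables Keycloak Authorization Services (fine-grained permissions)
        authorizationServicesEnabled: true
```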

NERC OpenShift clusters involved

  1. The NERC infra cluster is where the Red Hat Advanced Cluster Management Observability service is installed. For more information, see our NERC observability architecture documentation. The Observability service provides a centralized hub for metrics, alerting, and monitoring of platforms for a multi-cluster environment. The Observability service exposes the Observatorium API as a secured route, which requires a specific TLS certificate, private key, and CA certificate to connect. The Observatorium API is also secured behind a Harvard VPN, which prevents even approved researchers from building approved applications that query and report on our NERC OpenShift metrics directly; instead, the Observatorium metrics query APIs will be queried by services deployed on the obs cluster.
  2. The NERC obs cluster is where we deploy 2 new services to authenticate applications and users wishing to query NERC metrics. We configure the clusters, namespaces, and metrics they wish to connect to and grant them permissions to approved resources with the new Keycloak Permissions Operator we built for this purpose, together with the Red Hat Build of Keycloak Operator. Our new Prometheus Keycloak Proxy application checks their authorizations to metrics resources before querying any Observatorium API metrics they have requested. We will deploy the AI Telemetry Worker, OpenAPI, and front-end dashboard to the Observability cluster and grant access to managers and research teams to project-specific telemetry data, backed by Keycloak Fine-Grained Authorization (the credential handoff is sketched after this list).
  3. The NERC prod cluster is where most of the research projects with GPU workloads are in development.
  4. The NERC test clusters are where other AI/ML GPU research projects and Red Hat projects are taking place, like RHEL AI and InstructLab.
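
As a sketch of the credential handoff mentioned above (all names and values are hypothetical placeholders), the shared Client ID and Client Secret could be delivered as a Secret in the research team's OpenShift project:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ai-telemetry-client         # hypothetical secret name
  namespace: research-team-project  # hypothetical research team namespace
type: Opaque
stringData:
  CLIENT_ID: research-team-client
  CLIENT_SECRET: changeme           # placeholder; the real value is issued by NERC obs admins
```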