Observability

Instabase provides observability tools for measuring the internal state of its infrastructure and applications.

Log aggregation tooling

Instabase’s log aggregation features collect logs across Instabase services and store them for querying, debugging, activity analysis, and alerting.

Log aggregation infrastructure

Instabase’s log aggregation solution involves two components working together: a Fluent Bit sidecar and Grafana Loki.

The Fluent Bit sidecar collects logs from the Instabase services inside a pod and sends them to Loki. The sidecar also rotates the log files created by Instabase services. Grafana Loki is a log aggregation and storage tool responsible for aggregating, indexing, and storing log data. Loki also provides a powerful query language, LogQL, for querying logs.
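For illustration, LogQL expressions can be used both for ad hoc queries and in Loki ruler alert rules. The following sketch is hypothetical: the app="ib-service" label, the alert name, and the threshold are assumptions rather than values from an actual Instabase deployment.

groups:
  - name: ib-log-alerts              # hypothetical rule group
    rules:
      - alert: HighErrorLogRate      # hypothetical alert name
        # LogQL: rate of log lines containing "error" over 5 minutes.
        expr: sum(rate({app="ib-service"} |= "error" [5m])) > 10
        for: 10m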

Each container inside an Instabase services pod writes its logs to disk (an emptyDir volume). The volume is mounted into the Fluent Bit container running inside the same pod. Fluent Bit tails all log files written by Instabase services on the mounted volume and sends them to Loki.
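A minimal sketch of this pod layout, assuming hypothetical names for the pod, images, and log directory (none are taken from an actual Instabase manifest):

apiVersion: v1
kind: Pod
metadata:
  name: ib-service-pod               # hypothetical pod name
spec:
  volumes:
    - name: logs
      emptyDir: {}                   # shared scratch volume for log files
  containers:
    - name: ib-service               # the service container writes logs here
      image: ib-service:latest       # assumed image
      volumeMounts:
        - name: logs
          mountPath: /var/log/ib     # assumed log directory
    - name: fluent-bit               # the sidecar tails the same volume
      image: fluent/fluent-bit:2.2   # assumed image version
      volumeMounts:
        - name: logs
          mountPath: /var/log/ib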

Grafana Loki has two types of replicas: read and write. The write replicas accept the streams of logs coming from Fluent Bit; the read replicas serve read queries. The write replicas persist the logs to a storage system shared by both replica types, either a network file system (NFS) or an Amazon S3 bucket. You can then query and visualize the stored logs from your Instabase Grafana dashboard.

Diagram of the log aggregation infrastructure

Resource requirements

System requirements

  • Grafana Loki requirements: We recommend running 2 pods of Grafana Loki read replicas and 2 pods of Grafana Loki write replicas (4 pods total). This setup requires approximately 2 CPU cores and 2 GB of RAM.

  • Fluent Bit sidecar requirements: Each Fluent Bit sidecar uses an additional 30 millicores and 35 MB of RAM, which should be factored into the CPU and memory resource calculations (see the sketch below).
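These sidecar figures translate directly into container resource settings. A minimal sketch, assuming requests are set to the stated usage (the limits are illustrative assumptions, not Instabase-specified values):

# Hypothetical resources block for the Fluent Bit sidecar container.
resources:
  requests:
    cpu: 30m         # 30 millicores, per the figure above
    memory: 35Mi     # 35 MB of RAM, per the figure above
  limits:
    cpu: 100m        # illustrative headroom
    memory: 128Mi    # illustrative headroom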

Cloud storage

Cloud storage shared between all Grafana Loki replicas is required; it stores 14 days’ worth of logs. The following cloud storage options are supported:

  • NFS volume (~64 GB)
  • Amazon S3 bucket
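As an illustration of how these options map to Loki configuration, the following sketch shows an S3-backed setup with 14-day retention. The bucket name and region are assumptions; the configuration shipped with Instabase may differ.

# Hypothetical fragment of a Loki configuration file.
storage_config:
  aws:
    region: us-east-1                # assumed region
    bucketnames: ib-loki-logs        # assumed bucket name
limits_config:
  retention_period: 336h             # 14 days, matching the stated retention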

Access requirements

To attach metadata to logs, the Fluent Bit sidecar calls the Kubernetes API server to get pod-related information. This requires that the following role and role binding are granted to the Kubernetes service account that runs Instabase services.

Querying the Kubernetes API server is a one-time operation that happens during pod startup.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: {{namespace}}
  name: ib-sa-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get"]

---

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ib-sa-role-binding
  namespace: {{namespace}}
subjects:
  - kind: ServiceAccount
    name: {{IB_SERVICE_ACCOUNT}}
    namespace: {{namespace}}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ib-sa-role

Stats tooling

Instabase provides statistics tooling for observing Instabase’s internal components. This includes measuring aspects of:

  • Infrastructure and application traffic, including request and error rates.

  • Performance, including latencies.

  • Specific internal states, such as worker count in Celery.

  • Saturation of CPU, memory, file system, or open connection handles.

This data is represented as numerical time series with attributes and labels (such as cluster, service, or container) for filtering and aggregation.

Stats infrastructure

The Instabase stats architecture is divided into two parts:

  1. Instabase stats infrastructure: this part of the stack is highlighted in the green box in the figure below. Its components collect stats telemetry from Instabase services and can also receive telemetry data (for example, CPU, memory, and network usage) from the cluster control plane.

  2. Cluster stats infrastructure: this part of the stack is highlighted in the red box in the figure below. This part of the cluster is usually already installed and managed by the platform admins for the Kubernetes cluster; ask them for details on connecting with the platform Prometheus. If a platform Prometheus is not present, Instabase can install the required components or help connect the Instabase stats infrastructure (in #1) directly to the Kubernetes control plane.

Overview of stats infrastructure

Detail of stats infrastructure

Instabase services (green) are instrumented with telemetry agents (blue) that provide stats data. The stats infrastructure components (yellow) collect, store, present, and alert on this information. Internally, they use the Kubernetes API server for service discovery.
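As one common convention for marking instrumented pods as scrape targets, Prometheus deployments often rely on pod annotations. Whether Instabase uses these exact annotations is an assumption; the sketch below only illustrates the pattern.

# Hypothetical pod annotations; the prometheus.io/* keys are a widely used
# community convention, not a built-in Kubernetes feature.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"       # assumed metrics port
    prometheus.io/path: "/metrics"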

We also create a federated Prometheus connection to the Prometheus agent provided by the Kubernetes cluster to collect container-specific CPU, memory, and disk information from the kubelet.
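A minimal sketch of such a federation scrape job, assuming a placeholder platform Prometheus address and match expression (both would come from your platform admins):

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubelet"}'          # assumed selector for kubelet metrics
    static_configs:
      - targets:
          - 'platform-prometheus.monitoring.svc:9090'   # placeholder URL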

Resource requirements

System requirements

  • Compute and memory: We recommend allocating ~3.5 cores and ~21 GB of RAM for a typical setup. This can vary with deployment size and the number of monitored workloads.

  • Persistent volume: 128 GB+ for durable stats storage, used for storing and querying historical stats data. This is optional to set up; as an alternative, the data can be stored on ephemeral pod storage, but data on ephemeral storage is lost if the Prometheus pod is recycled.
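A minimal sketch of a matching PersistentVolumeClaim, assuming a hypothetical claim name and the ib-namespace used elsewhere in this section (the storage class is left to the cluster default):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ib-prometheus-data           # hypothetical claim name
  namespace: ib-namespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 128Gi                 # matches the 128 GB+ recommendation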

Access requirements

  • Access to the namespace to apply the required Kubernetes objects, such as Role, RoleBinding, Deployment, StatefulSet, Service, and ConfigMap.

  • Collecting CPU and memory information requires stats from the kubelet. To enable this, obtain the federation connection details (the platform Prometheus URL and access details) from the design above and add them to the Instabase Prometheus configuration.

  • Service discovery is used to identify the targets for stats collection. The following roles and bindings (refer to Role-based Access Control) must be granted to the service account that runs the IB-Prometheus service:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app: prometheus
  namespace: ib-namespace
  name: role-ib-prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
      - pods
      - configmaps
    verbs:
      - get
      - list
      - watch

---

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app: prometheus
  name: rolebinding-ib-prometheus
  namespace: ib-namespace
subjects:
  - kind: ServiceAccount
    name: ibprom
    namespace: ib-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: role-ib-prometheus
  • To enable service discovery, IB-Prometheus makes HTTP GET calls to the Kubernetes API server, using the following files mounted in the pod for authentication (a sketch of the corresponding discovery configuration follows):
    • /var/run/secrets/kubernetes.io/serviceaccount/namespace
    • /var/run/secrets/kubernetes.io/serviceaccount/token
    • /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
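A minimal sketch of what such a service discovery block looks like in the Prometheus configuration; the job name is an assumption (when Prometheus runs in-cluster, the mounted token and CA certificate listed above are used automatically):

scrape_configs:
  - job_name: 'ib-services'          # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                    # discover scrape targets from pods
        namespaces:
          names:
            - ib-namespace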
  • Additionally, kube-state-metrics needs the following roles and bindings (refer to Role-based Access Control) in order to read Kubernetes object states:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app: kube-state-metrics
  name: instabase-kube-state-metrics
  namespace: ib-namespace
rules:

- apiGroups: [""]
  resources:
  - configmaps
  - endpoints
  - limitranges
  - persistentvolumeclaims
  - pods
  - replicationcontrollers
  - secrets
  - resourcequotas
  - services
  verbs: ["list", "watch"]

- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]

- apiGroups: ["extensions", "apps"]
  resources:
  - daemonsets
  - deployments
  verbs: ["list", "watch"]

- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]

- apiGroups: ["extensions", "networking.k8s.io"]
  resources:
  - ingresses
  verbs: ["list", "watch"]

- apiGroups: ["networking.k8s.io"]
  resources:
  - networkpolicies
  verbs: ["list", "watch"]

- apiGroups: ["policy"]
  resources:
    - poddisruptionbudgets
  verbs: ["list", "watch"]

- apiGroups: ["extensions", "apps"]
  resources:
  - replicasets
  verbs: ["list", "watch"]

- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]

- apiGroups: ["storage.k8s.io"]
  resources:
    - volumeattachments
  verbs: ["list", "watch"]

---

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app: kube-state-metrics
  name: instabase-kube-state-metrics
  namespace: ib-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: instabase-kube-state-metrics
subjects:
- kind: ServiceAccount
  name: {{service_account}}
  namespace: ib-namespace

Install stats

  1. Create the service accounts, roles, and role bindings for Prometheus based on the files provided to you. (A minimal ServiceAccount sketch appears after these steps.)

  2. Apply the ConfigMap, Deployment, and Service files shared with you. To set up federation, add the URL and authentication details (if applicable) to the Prometheus ConfigMap file under the job_name: 'federate' scrape job.

  3. Confirm that the following links are accessible via the Instabase base URLs. The login credentials for these are shared separately.

    • Grafana: <base_url>/grafana
    • Prometheus: <base_url>/prometheus/graph
    • Alertmanager: <base_url>/alertmanager
  4. Confirm that the data sources are available in Prometheus by navigating to Status > Targets in the Prometheus UI.

  5. Confirm that alerts are available in the Alertmanager UI by checking for the dummy Watchdog alert.

  6. Confirm that the data is available in the Grafana dashboards.
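For reference, a minimal sketch of the ServiceAccount from step 1, using the ibprom name and ib-namespace that appear in the IB-Prometheus RoleBinding above (the files provided to you may differ):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ibprom                       # matches the RoleBinding subject above
  namespace: ib-namespace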