LogoLogo
  • Technical Direction
  • Technical overview
    • Technical Implementation
    • API Design Guide
      • Data Definitions and Standards
      • Data Transfer Objects
      • Documentation
      • Environments
      • Error Handling
      • Example API Service
      • GraphQL Naming Conventions
      • Methods
      • Naming Conventions
      • Once Only Principle
      • Pagination
      • Resource Oriented Design
      • REST Request
      • REST Response
      • Security
      • Versioning
    • Ísland.is Public Web Data Flow
    • Code Reviews
    • Code Standards
    • Monorepo
    • Project Management
    • Teamwork
    • Architectural Decision Records
      • Use Markdown Architectural Decision Records
      • Use NX
      • Continuous Integration
      • CSS
      • Branching and Release Strategy
      • Error Tracking and Monitoring
      • What API Management Tool to Consider
      • Viskuausan Static Site Generator
      • Use OAuth 2.0 and OpenID Connect As Protocols for Authentication and Authorization
      • Unified Naming Strategy for Files and Directories
      • CMS
      • Open Source License
      • What Chart Library Should We Use Across Island.is?
      • What Feature Flag Service/application Should We Use at Island.is?
      • Logging, Monitoring and APM Platform
      • ADR Template
    • Log Management Policy
  • Products
    • Island.is Authentication Service
      • Terminology
      • Integration Options
      • Authentication Flows
      • Authorising API Endpoints
      • Session Lifecycle
      • Scopes and Tokens
      • Delegations
      • Configuration
      • Tools and Examples
      • Environments
      • Test IAS with Postman
      • Using the IAS admin portal
    • Notifications / Hnipp
      • New Notification Setup Guide
      • Notifications service workflow overview
      • Email notifications
    • Pósthólfið
      • Security Checklist
      • Introduction
      • Skjalatilkynning API
      • Skjalaveita API
      • Sequence Diagram
      • Interfaces
    • Straumurinn (X-Road)
      • Architecture Guidelines for Service Providers and Consumers
      • Setting up an X-Road Security Server
        • Network Configuration
      • X-Road - Uppfærsla á öryggisþjónum
      • Straumurinn - Notkun og umsýsla
      • X-Road Central - current version
  • Development
    • Getting Started
    • Generating a New Project
    • Definition of done
    • Devops
      • Continuous Delivery
      • Database
      • Dockerizing
      • Environment Setup
      • Logging
      • Metrics
      • NextJS Custom Server
      • Observability
      • Operations Base Principles
      • Security
      • Service Configuration
      • Support
    • AWS Secrets
    • Feature Flags
    • Documentation Contributions
    • Defining Monorepo Boundaries With Tags
    • OpenAPI
    • Code Generation
    • Workspace Settings (Deprecated)
    • External Contributions
  • REFERENCE
    • Problems
      • 400 Validation Failed
      • 400 Attempt Failed
      • 403 Bad Subject
      • 400 500 Template API Error
    • Glossary
  • Misc
    • Guide: Adding a Payment Step to an Application
    • Guide: Enable Organisations to Make Requests to an Application
    • README Template
Powered by GitBook
On this page
  • Intro
  • What is observability
  • Provide observability into the applications
  • Define the boundaries of expected behaviour
  • Document applications' components and their maintenance operations
  • Application recent change history
  • Contact information for dev team emergency contact

Was this helpful?

  1. Development
  2. Devops

Observability

PreviousNextJS Custom ServerNextOperations Base Principles

Last updated 1 year ago

Was this helpful?

Intro

We need to ensure that the applications stay available and performant once they have been deployed to Production. For that to happen we need all of these things to happen:

  • Provide observability into the applications

  • Define the boundaries of expected behaviour

  • Document applications' components and their maintenance operations

  • Application recent change history

  • Contact information for dev team emergency contact

What is observability

Here is a that talks about the three pillars of observability - logs, metrics and traces. We are starting with these basics and we are planning to add more - stack-trace collecting, deployment events, etc.

Provide observability into the applications

Logs

There is a clear separation of strategy where we keep the logs:

  • all AWS services send their logs to AWS CloudWatch

  • all services running in Kubernetes send their logs to DataDog

There are a few reasons we keep the AWS services logs in CloudWatch - price, native integration, target audience (ops), etc. For developers is enough to have access to DataDog, since those logs are most relevant to their apps and services.

Metrics

For metrics we use DataDog. We collect metrics from both AWS services as well as our custom built Kubernets services. These metrics are useful for observability - runtime behaviour over longer periods of time, as well as for alerting about abnormal behaviour.

Tracing

We use DataDog for effortless Application Performance Monitoring (APM). With this capability, we get end-to-end visibility into our services - from the web browser, all the way to the database. In the context of a microservice architecture, this is an essential tool to undertand behaviour during interactions and troubleshoot individual issues.

Accessing logs, metrics and traces

Define the boundaries of expected behaviour

When it comes to metrics we need to baseline the correct behaviour. That usually takes a few days, maybe even weeks. Together, the Ops and Dev team can decide the allowed deviation from the baseline behaviour.

Document applications' components and their maintenance operations

  • component type - web service, web frontend, message processor, etc.

  • description - the purpose and responsibilities of this component.

  • authentication - the auth model for the service - user-token, service-token, none, etc

  • dependencies

    • local

    • external

  • operations (routes/paths/etc) - the different entry points into this component. For each operation:

    • description - a short description of the operation

    • path - path, topic, etc. that identify the operation

    • parameters - a list of the parameters

      • name - string

      • type - string

      • required - boolean

    • type - query, modify or query-modify behaviour

    • verb - HTTP verb if applicable

    • examples - could be really helpful to have a few curl examples

Application recent change history

Ops need to have access to the recent history of significant changes for each application. That can provide important clues, helpful for the resolution of an issue.

Contact information for dev team emergency contact

In case Ops cannot resolve the issue in a timely manner with the available tools and information, they will seek for assistance from the development team that has worked on the application experiencing the issue. The process of handling an issue is described in Handling production issues (TODO missing document)

Every developer has access to logs, metrics and tracing (APM) information through the .

need to know each application's components, how they interact together as well as interaction with any external services. For example the following information would be really useful:

DataDog portal
Ops
sample chapter from the book Distributed Systems Observability by Cindy Sridharan
Monitoring at SÍ
DataDog APM