We need to ensure that applications stay available and performant once they have been deployed to Production. For that, we need all of the following:
- Provide observability into the applications
- Define the boundaries of expected behaviour
- Document applications' components and their maintenance operations
- Keep a recent change history for each application
- Maintain emergency contact information for the dev team
Here is a sample chapter from the book Distributed Systems Observability by Cindy Sridharan that talks about the three pillars of observability: logs, metrics and traces. We are starting with these basics and plan to add more, such as stack-trace collection and deployment events.
Monitoring at SÍ
We keep logs in one of two places, depending on where the service runs:
- all AWS services send their logs to AWS CloudWatch
- all services running in Kubernetes send their logs to DataDog
There are a few reasons we keep the AWS service logs in CloudWatch: price, native integration, target audience (ops), etc. For developers, access to DataDog is enough, since those logs are the most relevant to their apps and services.
For metrics we use DataDog. We collect metrics from both AWS services and our custom-built Kubernetes services. These metrics are useful for observability (runtime behaviour over longer periods of time) as well as for alerting on abnormal behaviour.
We also use DataDog for Application Performance Monitoring (APM). With this capability, we get end-to-end visibility into our services, from the web browser all the way to the database. In a microservice architecture, this is an essential tool for understanding behaviour across service interactions and for troubleshooting individual issues.
For metrics, we first need to baseline the correct behaviour. That usually takes a few days, maybe even weeks. Together, the Ops and Dev teams can then decide the allowed deviation from the baseline behaviour.
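As a rough illustration of what "allowed deviation from the baseline" could mean in practice, here is a minimal sketch. It assumes the baseline is summarised as a mean and standard deviation and that the agreed tolerance is a number of standard deviations; the function names, the sample values and the `allowed_sigmas` parameter are all hypothetical and not part of our DataDog setup.

```python
from statistics import mean, stdev


def baseline(samples):
    """Summarise baseline samples as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_abnormal(value, base_mean, base_std, allowed_sigmas=3.0):
    """Return True when value deviates from the baseline by more than the agreed tolerance."""
    return abs(value - base_mean) > allowed_sigmas * base_std


# Hypothetical latency samples (ms) collected during the baselining period.
latency_ms = [120, 118, 125, 122, 119, 121, 123, 120]
base_mean, base_std = baseline(latency_ms)

print(is_abnormal(124, base_mean, base_std))  # within the tolerance
print(is_abnormal(180, base_mean, base_std))  # far outside the tolerance
```

In a real setup the threshold would more likely live in a DataDog monitor than in application code; the sketch only shows the kind of rule the Ops and Dev teams need to agree on.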
- component type - web service, web frontend, message processor, etc.
- description - the purpose and responsibilities of this component.
- authentication - the auth model for the service -
- operations (routes/paths/etc) - the different entry points into this component. For each operation:
  - description - a short description of the operation
  - path - path, topic, etc. that identify the operation
  - parameters - a list of the parameters
    - name - string
    - type - string
    - required - boolean
  - type -
  - verb - HTTP verb if applicable
  - examples - could be really helpful to have a few
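The documentation fields listed above could be captured as a small machine-readable structure. The following is only a sketch under assumptions: the class names, the example component and the `operations_without_examples` helper are hypothetical, and the operation-level `type` field is omitted because its meaning is not defined above.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Parameter:
    name: str
    type: str
    required: bool


@dataclass
class Operation:
    description: str
    path: str  # path, topic, etc. that identifies the operation
    parameters: List[Parameter] = field(default_factory=list)
    verb: Optional[str] = None  # HTTP verb, if applicable
    examples: List[str] = field(default_factory=list)


@dataclass
class Component:
    component_type: str  # web service, web frontend, message processor, ...
    description: str
    authentication: str
    operations: List[Operation] = field(default_factory=list)


def operations_without_examples(component: Component) -> List[str]:
    """Return the paths of operations that have no documented examples."""
    return [op.path for op in component.operations if not op.examples]


# Hypothetical component, invented for illustration only.
search = Component(
    component_type="web service",
    description="Looks up users by name.",
    authentication="bearer token",
    operations=[
        Operation(
            description="Search users",
            path="/users",
            parameters=[Parameter(name="q", type="string", required=True)],
            verb="GET",
            examples=["GET /users?q=anna"],
        ),
        Operation(description="Health check", path="/health", verb="GET"),
    ],
)

print(operations_without_examples(search))  # operations still missing examples
```

Keeping the description structured like this would make it possible to check it for gaps automatically; `asdict(search)` turns it into plain dictionaries that can be serialised to JSON or YAML.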
Ops need access to the recent history of significant changes for each application. This history can provide important clues when resolving an issue.
If Ops cannot resolve an issue in a timely manner with the available tools and information, they will seek assistance from the development team that worked on the affected application. The process of handling an issue is described in Handling production issues (TODO missing document).