Hey there, fellow Prometheus enthusiasts!
As part of my preparation, I created a list of sample exam questions. These not only gauged my progress but also highlighted areas where I needed to improve. I made it a point to review my answers thoroughly, learning from my mistakes and solidifying my understanding of the Prometheus ecosystem. It was especially helpful to use AI chats to ask mock questions or questions on similar topics and to dig deeper into each topic.
Furthermore, I highly recommend the book “Prometheus: Up & Running” as an essential resource to comprehend all the vital topics of Prometheus. Gaining proficiency in PromQL and queries was greatly aided by reading the informative posts from “PromLabs” and referring to the “promql-cheat-sheet”.
Good luck to everyone, happy learning and happy exam-taking!🤞
Exam Info
- Duration: 90 min
- Questions: 60
- Pass: x ≥ 75% (min 45/60)
💬 120 Sample Exam Questions 💬
Observability Concepts (18%)
- What is the preferred approach used by Prometheus to collect metrics from a target?
- What is observability?
- What is the RED Method?
- What are the distinctions between SLO, SLI, and SLA?
- In the context of tracing, what is the meaning or representation of a span?
- In which scenarios is distributed tracing less beneficial or NOT as applicable?
- What are typically tracked within a span of a trace?
- What makes a metric good or bad?
- Which type of data is monitored by Prometheus?
- In the context of monitoring and observability, what type of data is typically used to define a SLI?
- What is the meaning or purpose of an error budget policy?
- What is one advantage of the push model for recording metrics compared to pull models?
- How do Prometheus, ELK stack, and InfluxDB differ in terms of their functionalities and use cases?
- What is the definition of a metric?
- What are the Prometheus exemplars?
- What is one of the main purposes or goals of logging?
- What are the 3 core components of observability?
- What is monitoring?
- What is telemetry?
- What are the challenges of observability?
Prometheus Fundamentals (20%)
- What is the CLI utility tool for Prometheus called?
- What are the limitations of Prometheus?
- What is Service Discovery and which categories are there?
- Which property configures the timing to scrape metrics from targets?
- Which section in the Prometheus configuration file governs the selection of targets to be scraped?
- Which action in the label configuration is used to delete a specific target?
- How is data retention managed in Prometheus?
- What are the essential 3 components of Prometheus?
- What is required to be able to reload Prometheus?
- What are 3 methods to restart the Prometheus server?
- What HTTP method does Prometheus employ for performing scrapes?
- Which SD configuration is recommended for scraping EC2 instances?
- Which SD configuration is recommended for nodes of Elastic Kubernetes Service on AWS?
- What is the purpose of the scrape_interval configuration in Prometheus?
- Which type of database does Prometheus utilize?
- What component is responsible for collecting metrics from an instance and exposing them in a format that Prometheus expects?
- Which component is suitable for collecting metrics from batch/cron jobs?
- When is the configuration option honor_labels: true used?
- What is the purpose of ports 9090/9093/9100/9091/9115 in Prometheus?
- What are the 2 default metric labels?
- Which of the file systems is recommended/supported by Prometheus?
- How can you configure a Blackbox Exporter probe to check the successful response of your servers to PING?
- How do you configure the targets that Prometheus should scrape?
- What is the agent deployment mode of Prometheus?
- Which CLI command is suitable for unit testing Prometheus rules?
- Which CLI command is suitable for checking validity of the config files?
- How do you define the targets with SD that Prometheus should collect metrics from?
- How can you delete specific time series from Prometheus?
- How can you delete all time series from Prometheus?
- Which format does file-based SD provide?
PromQL (28%)
- What is PromQL?
- What is a histogram metric in Prometheus?
- Which 4 data types are used in PromQL?
- What is the name of the vector in Prometheus that stores a single sample value?
- Which PromQL function is used to estimate the value of a time series at a future time, t seconds from the current time, based on the range vector v?
- Between what type of expressions can logical operators be defined?
- Which function can be used to calculate the average of a range vector in Prometheus?
- What is the difference between avg(...) and avg_over_time(...)?
- With which type of metrics is the rate(...) function primarily used in Prometheus?
- What does the term “offset” refer to in Prometheus?
- What distinguishes the rate(...) and irate(...) query functions in Prometheus?
- What distinguishes the rate(...) and deriv(...) query functions in Prometheus?
- Which type of metric is suitable for measuring the internal temperature of a server?
- What is the data type of Prometheus metric values?
- How many unique series are generated by a histogram metric type?
- What are the 4 components of the Prometheus metrics data model?
- What is the difference between the ceil and floor functions?
- Which query function among the following returns a result of 1 in case the specified time series does not exist?
- What are the logical/arithmetic/comparison binary operators?
- What is vector matching?
- What are the group modifiers?
- Which function does NOT use counter metrics? irate(), increase(), resets(), idelta(), avg(), rate()
- How do you calculate the time in days until the LAST certificate expiration?
- What is dimensional aggregation?
- What is the significance of the double underscore “__” before a label name?
Instrumentation and Exporters (16%)
- Which HTTP header does Prometheus set on each scrape?
- Which 2 query parameters are required when configuring a Blackbox Exporter probe?
- What is the exposition format of Prometheus?
- Does Prometheus need to perform any format conversion on the metrics returned by a monitored Linux machine?
- What is the default endpoint that Prometheus uses to scrape the metrics from the target?
- Where is the version of the Prometheus exporter typically defined?
- What is the most suitable exporter for monitoring an HTTP web server endpoint to verify that it returns a 200 status code?
- Which Prometheus exporter is recommended for monitoring network devices?
- Which networking protocol does Prometheus utilize for performing scrapes?
- What is the purpose of a Prometheus metrics registry?
- What is the purpose or definition of a Prometheus exporter?
- In what scenarios would you use the Blackbox Exporter?
- How does Prometheus identify the scrape path for its targets?
- Which endpoints allow blackbox probing?
- In a scenario where you have a dynamic etcd database containing scrape targets for Prometheus, how should you configure service discovery?
- What are the 2 types of attributes that can be present in the /metrics endpoint?
- Which exporter is the most suitable for monitoring Scala metrics among the following options?
- How do you keep Pushgateway job labels, which are normally overwritten?
- How does Prometheus scrape the last batch job push time?
- What are the 3 types of service systems?
Recording & Alerting & Dashboarding (18%)
- Is there a way to deactivate a specific route in Alertmanager for a specific time frame?
- What is considered a best practice when it comes to alerting in monitoring systems: focusing on alerting based on symptoms or alerting based on causes?
- What is the meaning of “alert symptoms” and “alert causes” in the context of monitoring systems?
- Which aspect, symptoms or causes, is more visible to customers in the context of an issue?
- What is a good naming convention for recording rules?
- What is acknowledge-based throttling and what is time-based throttling?
- What are the 3 statuses of a Prometheus alert?
- How can I use a PromQL query to retrieve the currently active alerts in Alertmanager?
- What are recording rules in Prometheus?
- How do you define recording rules?
- What is alert fatigue?
- Which feature of Alertmanager is responsible for formatting and customizing the alerts?
- How can you configure Alertmanager to disable the grouping of alerts for a specific route effectively?
- Which software is commonly used for visualizing Prometheus metrics?
- What does the term “inhibiting” refer to in the context of Alertmanager?
- What is the format used for defining alerting rules?
- What is the significance of the for attribute in a Prometheus alert rule?
- How can you temporarily mute/snooze/suppress an alert during maintenance in Prometheus?
- What is the name of Prometheus native dashboarding and visualization feature?
- How can you coordinate the simultaneous sending of multiple alerts with similar label sets in Prometheus?
- Which feature of Alertmanager is responsible for sending alerts to the right receiver?
- What is the purpose of the repeat_interval/continue/group_wait/group_interval attribute in an Alertmanager route configuration?
- Which 2 attributes of an alerting rule can be used to include extra metadata?
- What is required for a high-availability configuration of Alertmanager?
- What are the 3 statuses of Alertmanager Silences?
💡 Answers to the Questions 💡
Observability Concepts (18%)
- pull-based
- Observability: understand what’s happening inside a system and predict how it will behave in the future
- RED Method consists of: (Request) Rate + (Request) Errors + (Request) Duration
- SLO: Service Level Objective (Goal), SLA: Service Level Agreement (Contract), SLI: Service Level Indicator (Metrics)
- Span is a single operation/unit of work within a distributed system and captures the start and end times, duration, and associated metadata of a specific operation
- For monolithic systems
- Operation Name, Trace ID and Span ID, Start and End Timestamps, Duration, Parent Span ID
- Bad: a metric with a lot of variance and poor correlation with user experience. Good: a metric that makes it easy to set a threshold because the values for good and bad behavior do not overlap at all.
- Metrics (numeric value)
- SLI is typically derived from metrics
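- An error budget is a concept used in the context of SLOs and SLAs: the acceptable level of errors or service disruptions that a system or service can experience within a given time period (1 - SLO). An error budget policy defines what the team does when that budget is exhausted or close to it, for example pausing feature releases in favor of reliability work.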
- timely and proactive data collection (real-time or near real-time) / pushing into the centralized data system
- InfluxDB is a push-based time-series database designed to handle high volumes of time-stamped data (IoT, Sensor, Analytics).
- ELK stack is a push-based system, used for collecting, processing, storing, and visualizing log data.
- Prometheus is a pull-based time-series database and monitoring system specifically designed for monitoring dynamic cloud-native environments.
- numeric time-series data point
- An exemplar is a specific trace representative of a measurement taken in a given time interval; it provides additional information about a specific data point.
- To gather and aggregate textual event data from a service for troubleshooting
- Logs, Traces, and Metrics
- Monitoring: continuous observation of a system to detect and alert on abnormal behavior.
- Telemetry: automated collection and transmission of data from remote sources.
- Data silos, Volume, velocity, variety, and complexity of data, Lack of pre-production
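To make the SLI/SLO/error-budget answers above a bit more concrete, here is a minimal Python sketch of the arithmetic. The function name error_budget_report and the event counts are my own illustration (assumptions, not part of any Prometheus API). It treats the SLI as good events divided by total events and the error budget as the share of events the SLO allows to fail.

```python
def error_budget_report(good_events: int, total_events: int, slo_target: float) -> dict:
    """Summarize an SLI against an SLO target (hypothetical helper, for illustration only)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events                 # SLI: measured share of good events
    allowed_bad = (1.0 - slo_target) * total_events  # error budget expressed in events
    actual_bad = total_events - good_events          # budget consumed so far
    return {
        "sli": sli,
        "allowed_bad": allowed_bad,
        "actual_bad": actual_bad,
        # Fraction of the budget still unspent; negative means the budget is blown.
        "budget_remaining": 1.0 - actual_bad / allowed_bad,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Assumed numbers: 10,000 requests, 5 failures, SLO of 99.9 %.
    print(error_budget_report(good_events=9995, total_events=10000, slo_target=0.999))
```

With these assumed numbers the SLO allows about 10 failed requests, so 5 failures use up roughly half of the budget. What happens once the remainder drops below zero is what an error budget policy is meant to spell out.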
Prometheus Fundamentals (20%)
- promtool
- scalability for large-scale deployments with millions of TS, Long-term storage, High cardinality, HA and Replication
- SD is a mechanism that automatically discovers targets and services to monitor. There are 2 categories: top-down (e.g. EC2) and bottom-up (e.g. Consul) mechanisms.
- scrape_interval
- scrape_configs
- scrape_configs -> relabel_configs -> action: drop or action: keep
- with the flags --storage.tsdb.retention.time and --storage.tsdb.retention.size
- Retrieval, TSDB, HTTP Server
- with the flag --web.enable-lifecycle
- Sending a SIGHUP signal to the Prometheus process, using the Prometheus API (POST or PUT to /-/reload), or using a service manager (systemctl) or orchestration tool (k8s)
- HTTP GET method
- ec2_sd_configs
- ec2_sd_configs
- how frequently Prometheus collects and updates the metrics
- time-series database
- Prometheus exporter
- Pushgateway
- honor_labels: true tells Prometheus to keep the labels exposed by the scraped target (for example job and instance labels pushed to a Pushgateway) instead of overwriting them with its own server-side labels
- 9090: prometheus-server, 9093: alertmanager, 9100: node-exporter, 9091: pushgateway, 9115: blackbox-exporter
- instance and job
- ext4, XFS, and NTFS
- Internet Control Message Protocol (ICMP) -> prober: icmp
- scrape_configs > static_configs -> targets:xxx
- Agent mode is a lightweight Prometheus mode focused on service discovery, scraping, and remote write (remote storage), especially for edge computing/IoT, at the cost of querying, alerting, and local storage
- ./promtool test rules test.yml
- ./promtool check config prometheus.yml (rule files are checked with ./promtool check rules test.yml)
- scrape_configs and *_sd_configs on a per-job basis
- starting the server with the flag --web.enable-admin-api + curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={xxxx="yyy"}'
- starting the server with the flag --web.enable-admin-api, deleting with a selector that matches every series, e.g. curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}', then freeing the disk space with curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
- YAML and JSON
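As a small illustration of the last answer, here is a hedged Python sketch that produces the JSON layout file-based SD reads: a list of target groups, each with a targets list and an optional labels map. The helper name build_file_sd, the host names, and the label values are made up for this example. You would write the output to a file that a file_sd_configs entry points at.

```python
import json


def build_file_sd(groups):
    """Render (targets, labels) pairs as JSON text for file-based service discovery.

    Hypothetical helper: each group becomes {"targets": [...], "labels": {...}}.
    """
    payload = [
        {"targets": list(targets), "labels": dict(labels)}
        for targets, labels in groups
    ]
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    # Assumed hosts and labels, purely for illustration.
    print(build_file_sd([
        (["node1.example.com:9100", "node2.example.com:9100"], {"env": "prod"}),
        (["node3.example.com:9100"], {"env": "staging"}),
    ]))
```

Prometheus watches the referenced files for changes, so regenerating such a file from a script is a simple way to feed in targets from a system without a built-in SD mechanism.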
PromQL & Metrics (28%)
- Query Language for Prometheus
- A histogram samples observations (e.g. request durations or response sizes) and counts them in configurable buckets
- Scalar, String, Instant Vector, Range Vector
- Instant Vector
- predict_linear()
- instant vectors
- avg_over_time(metrics[x])
- avg_over_time(...) takes a range vector and returns an instant vector holding the average over time of each series; avg(...) takes an instant vector and aggregates across series into a single value per group
- rate(...) needs COUNTER type metrics
- offset shifts the evaluation time of a selector into the past by the given duration
- rate(...) calculates the per-second average rate of increase of a time series over the specified time range; irate(...) calculates the instant rate from the last 2 data points in the range
- deriv(...) operates on gauges and rate(...) operates on counters
- gauge
- float64.
- <basename>_bucket, <basename>_sum and <basename>_count
- metric name, metrics label, timestamp, value
- floor(...) = round a number down, ceil(...) = round a number up
- absent(...)
- logical => and, or, unless; arithmetic => + - * / % ^; comparison => ==, !=, >, <, >=, <=
- on, ignoring
- a part of vector matching: on, ignoring + group_left, group_right
- idelta()
max(cert_expiry - time()) / 86400
sum(), min(), max(), avg(), count()
- Label names starting with “__” are reserved for internal use
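The rate(...) vs. irate(...) answer above is easier to remember with numbers. Below is a simplified Python sketch of the idea, not how Prometheus implements it: the real rate() also extrapolates to the window boundaries, which I leave out here. The function names simple_rate and simple_irate and the sample values are assumptions for illustration. Samples are (timestamp_seconds, counter_value) pairs inside one range window.

```python
def _increase(samples):
    """Total counter increase across the samples, treating any drop as a counter reset."""
    total = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # After a reset the counter restarts at zero, so the new value is the increase.
        total += value if value < prev else value - prev
        prev = value
    return total


def simple_rate(samples):
    """Average per-second increase over the whole window (no extrapolation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return _increase(samples) / (samples[-1][0] - samples[0][0])


def simple_irate(samples):
    """Per-second increase computed from the last two samples only."""
    return simple_rate(samples[-2:])


if __name__ == "__main__":
    window = [(0, 0), (15, 30), (30, 60), (45, 120), (60, 300)]
    print(simple_rate(window))   # smooth average over the full minute
    print(simple_irate(window))  # reacts to the spike between the last two scrapes
```

The averaged version smooths out short spikes, which is usually what you want for alerting, while the last-two-samples version follows fast-moving counters more closely on graphs.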
Instrumentation and Exporters (16%)
- X-Prometheus-Scrape-Timeout-Seconds
- target + module
- text-based format for exposing metrics
- No
- /metrics
- build_info
- Blackbox Exporter
- SNMP exporter
- HTTP protocol
- Registry serves as a central repository for collecting, storing, and managing metrics
- Exporter is responsible for collecting metrics from a specific system, application, or service and exposing them for Prometheus
- Network Service Monitoring, Health Check, External Monitoring
- scrape_configs > metrics_path: /metrics
- Blackbox Exporter allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP, ICMP, gRPC
- file_sd_configs
- HELP, TYPE
- JMX Exporter
- honor_labels: true
- PromQL > job_last_success_unixtime
- online-serving, offline-processing, and batch jobs
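Since several answers above touch the text-based exposition format and its HELP and TYPE lines, here is a minimal Python sketch that renders one metric family by hand. In practice a client library does this for you. The function name render_metric and the sample metric are my own assumptions, and the sketch skips the escaping of special characters in label values and help text.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the text exposition format.

    samples is a list of (labels_dict, value) pairs. Hypothetical helper,
    with no escaping of label values or help text.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Sort the labels so the output is stable between runs.
            label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "http_requests_total",
        "counter",
        "Total HTTP requests.",
        [({"method": "get", "code": "200"}, 1027), ({"method": "post", "code": "400"}, 3)],
    ), end="")
```

An exporter essentially serves text like this on /metrics over HTTP, which is why Prometheus needs no format conversion when it scrapes it.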
Recording & Alerting & Dashboarding (18%)
- Yes: with the route attribute mute_time_intervals, which refers to named entries defined under the top-level time_intervals, ex. mute_time_intervals: [holidays, offhours]. The top-level mute_time_intervals definition is DEPRECATED in favor of time_intervals.
- symptom-based and NOT cause-based
- symptom: the “what’s broken”, cause: the “why it’s broken”
- the symptom is the customer-visible error
- level:metric:operations, e.g. job:node_cpu_seconds:avg_idle
- acknowledge-based = notifications for an alert are sent to the recipient only once until the alert is acknowledged or resolved
- time-based = limiting the rate of notifications based on a specific time interval (ex. group_interval, repeat_interval)
- firing, pending, inactive
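- Menu > Alerts > Query > ALERTS, e.g. ALERTS{alertstate="firing"}
- Recording rules precompute frequently used or expensive PromQL expressions (aggregations, filters) and store the results as new time series in the Prometheus DB
- rules -> record: xxx, expr: xxx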
- Alert fatigue refers to a situation where individuals or teams become overwhelmed or desensitized by a large volume of alerts
- notification templates
- group_by: ['...']
- Grafana
- Inhibiting refers to a feature that suppresses notifications for certain alerts while certain other alerts are already firing
- YAML
- for sets how long the alert condition must hold continuously before the alert moves from pending to firing, helping to prevent false positives and reduce noise in alerting systems
- Silence
- Prometheus Console
- Grouping
- Routing
- repeat_interval: how long to wait before re-sending a notification for a firing alert that has already been sent successfully; continue: specifies whether an alert keeps matching subsequent sibling routes after it matched this one; group_wait: sets how long to initially wait before sending a notification for a new group of alerts; group_interval: dictates how long to wait before sending notifications about new alerts added to a group that has already been notified
- annotations + labels
- This can be configured using the --cluster-* flags
- Active, Pending, Expired
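The three alert statuses and the for attribute from the answers above fit together as a small state machine. Here is a simplified Python sketch, assuming one alert series and evenly spaced rule evaluations. The function name alert_states and the sample timeline are hypothetical, and real Prometheus tracks this per label set.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations is a list of (timestamp_seconds, condition_is_true) pairs.
    Simplified illustration of inactive -> pending -> firing.
    """
    states = []
    active_since = None  # when the condition first became true in the current streak
    for timestamp, condition_is_true in evaluations:
        if not condition_is_true:
            active_since = None  # streak broken, the alert resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Assumed timeline: evaluated every 60 s, rule uses "for: 2m".
    timeline = [(0, True), (60, True), (120, True), (180, False), (240, True)]
    print(alert_states(timeline, for_seconds=120))
```

In this assumed timeline the alert stays pending for the first two evaluations, fires on the third, goes inactive when the condition clears, and has to wait out the full for duration again afterwards.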