Hey there, fellow Prometheus enthusiasts!

As part of my preparation, I created a list of sample exam questions. These not only gauged my progress but also highlighted areas where I needed to improve. I made it a point to review my answers thoroughly, learning from my mistakes and solidifying my understanding of the Prometheus ecosystem. It was especially helpful to use AI chats to ask mock questions or questions on similar topics and to delve deeper into each topic.

Furthermore, I highly recommend the book “Prometheus: Up & Running” as an essential resource to comprehend all the vital topics of Prometheus. Gaining proficiency in PromQL and queries was greatly aided by reading the informative posts from “PromLabs” and referring to the “promql-cheat-sheet”.

Good luck to everyone, happy learning and happy exam-taking!🤞

Exam Info

  • Duration: 90 min
  • Questions: 60
  • Pass: x ≥ 75% (min 45/60)

💬 120 Sample Exam Questions 💬

Observability Concepts (18%)

  1. What is the preferred approach used by Prometheus to collect metrics from a target?
  2. What is observability?
  3. What is the RED Method?
  4. What are the distinctions between SLO, SLI, and SLA?
  5. In the context of tracing, what is the meaning or representation of a span?
  6. In which scenarios is distributed tracing less beneficial or NOT as applicable?
  7. What are typically tracked within a span of a trace?
  8. What distinguishes a good metric from a bad one?
  9. Which type of data is monitored by Prometheus?
  10. In the context of monitoring and observability, what type of data is typically used to define a SLI?
  11. What is the meaning or purpose of an error budget policy?
  12. What is one advantage of the push model for recording metrics compared to pull models?
  13. How do Prometheus, ELK stack, and InfluxDB differ in terms of their functionalities and use cases?
  14. What is the definition of a metric?
  15. What are the Prometheus exemplars?
  16. What is one of the main purposes or goals of logging?
  17. What are the 3 core components of observability?
  18. What is monitoring?
  19. What is telemetry?
  20. What are the challenges of observability?

Prometheus Fundamentals (20%)

  1. What is the CLI utility tool for Prometheus called?
  2. What are the limitations of Prometheus?
  3. What is Service Discovery and which categories are there?
  4. Which property configures the timing to scrape metrics from targets?
  5. Which section in the Prometheus configuration file governs the selection of targets to be scraped?
  6. Which action in the label configuration is used to delete a specific target?
  7. How is data retention managed in Prometheus?
  8. What are the essential 3 components of Prometheus?
  9. What is required to be able to reload Prometheus?
  10. What are 3 methods to reload the Prometheus server configuration?
  11. What HTTP method does Prometheus employ for performing scrapes?
  12. Which SD configuration is recommended for scraping EC2 instances?
  13. Which SD configuration is recommended for nodes of Elastic Kubernetes Service on AWS?
  14. What is the purpose of the scrape_interval configuration in Prometheus?
  15. Which type of database does Prometheus utilize?
  16. What component is responsible for collecting metrics from an instance and exposing them in a format that Prometheus expects?
  17. Which component is suitable for collecting metrics from batch/cron jobs?
  18. When is the configuration option honor_labels:true used?
  19. What is the purpose of port 9090/9093/9100/9091/9115 in Prometheus?
  20. What are the 2 default metric labels?
  21. Which of the file systems is recommended/supported by Prometheus?
  22. How can you configure a Blackbox Exporter probe to check the successful response of your servers to PING?
  23. How do you configure the targets that Prometheus should scrape?
  24. What is the agent deployment mode of Prometheus?
  25. Which CLI command is suitable for unit testing Prometheus rules?
  26. Which CLI command is suitable for checking validity of the config files?
  27. How do you define the targets with SD that Prometheus should collect metrics from?
  28. How can you delete specific time series in Prometheus?
  29. How can you delete all time series in Prometheus?
  30. Which format does file-based SD provide?

PromQL (28%)

  1. What is PromQL?
  2. What is histogram metric in Prometheus?
  3. Which 4 data types are used in PromQL?
  4. What is the name of the vector in Prometheus that stores a single sample value?
  5. Which PromQL function is used to estimate the value of a time series at a future time, t seconds from the current time, based on the range vector v?
  6. Between what type of expressions can logical operators be defined?
  7. Which function can be used to calculate the average of a range vector in Prometheus?
  8. What is the difference between avg(...) and avg_over_time(...)?
  9. With which type of metrics is the rate(...) function primarily used in Prometheus?
  10. What does the term “offset” refer to in Prometheus?
  11. What distinguishes the rate(...) and irate(...) query functions in Prometheus?
  12. What distinguishes the rate(...) and deriv(...) query functions in Prometheus?
  13. Which type of metric is suitable for measuring the internal temperature of a server?
  14. What is the data type of Prometheus metric values?
  15. How many unique series are generated by a histogram metric type?
  16. What are the 4 components of the Prometheus metrics data model?
  17. What is the difference between the ceil and floor functions?
  18. Which query function among the following returns a result of 1 in case the specified time series does not exist?
  19. What are the logical, arithmetic, and comparison binary operators?
  20. What is vector matching?
  21. What are the group modifiers?
  22. Which function is NOT meant for counter metrics? irate(), increase(), resets(), idelta(), avg(), rate()
  23. How do you calculate the time in days until the LAST certificate expires?
  24. What is dimensional aggregation?
  25. What is the significance of the double underscore “__” before a label name?

Instrumentation and Exporters (16%)

  1. Which HTTP header does Prometheus set on each scrape?
  2. Which 2 query parameters are required when configuring a Blackbox Exporter probe?
  3. What is the exposition format of Prometheus?
  4. Does Prometheus need to perform any format conversion on the metrics returned by a monitored Linux machine?
  5. What is the default endpoint that Prometheus uses to scrape the metrics from the target?
  6. Where is the version of the Prometheus exporter typically defined?
  7. What is the most suitable exporter for monitoring an HTTP web server endpoint to verify that it returns a 200 status code?
  8. Which Prometheus exporter is recommended for monitoring network devices?
  9. Which networking protocol does Prometheus utilize for performing scrapes?
  10. What is the purpose of a Prometheus metrics registry?
  11. What is the purpose or definition of a Prometheus exporter?
  12. In what scenarios would you use the Blackbox Exporter?
  13. How does Prometheus identify the scrape path for its targets?
  14. Which types of endpoints allow blackbox probing?
  15. In a scenario where you have a dynamic etcd database containing scrape targets for Prometheus, how should you configure service discovery?
  16. What are the 2 types of attributes that can be present in the /metrics endpoint?
  17. Which exporter is the most suitable for monitoring Scala metrics among the following options?
  18. How do you keep Pushgateway job labels, which are normally overwritten?
  19. How does Prometheus scrape the last batch job push time?
  20. What are the 3 types of service systems?

Recording & Alerting & Dashboarding (18%)

  1. Is there a way to deactivate a specific route in Alertmanager for a specific time frame?
  2. What is considered a best practice when it comes to alerting in monitoring systems: focusing on alerting based on symptoms or alerting based on causes?
  3. What is the meaning of “alert symptoms” and “alert causes” in the context of monitoring systems?
  4. Which aspect, symptoms or causes, is more visible to customers in the context of an issue?
  5. What is a good naming convention for recording rules?
  6. What is acknowledge-based throttling and what is time-based throttling?
  7. What are the 3 statuses of a Prometheus alert?
  8. How can I use a PromQL query to retrieve the currently active alerts in Alertmanager?
  9. What are recording rules in Prometheus?
  10. How do you define recording rules?
  11. What is alert fatigue?
  12. Which feature of Alertmanager is responsible for formatting and customizing the alerts?
  13. How can you configure Alertmanager to disable the grouping of alerts for a specific route effectively?
  14. Which software is commonly used for visualizing Prometheus metrics?
  15. What does the term “inhibiting” refer to in the context of Alertmanager?
  16. What is the format used for defining alerting rules?
  17. What is the significance of the for attribute in a Prometheus alert rule?
  18. How can you temporarily mute/snooze/suppress an alert during maintenance in Prometheus?
  19. What is the name of Prometheus native dashboarding and visualization feature?
  20. How can you coordinate the simultaneous sending of multiple alerts with similar label sets in Prometheus?
  21. Which feature of Alertmanager is responsible for sending alerts to the right receiver?
  22. What is the purpose of the repeat_interval/continue/group_wait/group_interval attributes in an Alertmanager route configuration?
  23. Which 2 attributes of an alerting rule can be used to include extra metadata?
  24. What is required for a high-availability configuration of Alertmanager?
  25. What are the 3 statuses of Alertmanager Silences?

💡 Answer of Questions 💡

Observability Concepts (18%)

  1. pull-based
  2. Observability: understand what’s happening inside a system and predict how it will behave in the future
  3. RED Method consists of: (Request) Rate + (Request) Errors + (Request) Duration
  4. SLO: Service Level Objective (Goal), SLA: Service Level Agreement (Contract), SLI: Service Level Indicator (Metrics)
  5. Span is a single operation/unit of work within a distributed system and captures the start and end times, duration, and associated metadata of a specific operation
  6. In monolithic systems
  7. Operation Name, Trace ID and Span ID, Start and End Timestamps, Duration, Parent Span ID
  8. Bad: a metric with a lot of variance and poor correlation with user experience. Good: a metric for which a threshold is easy to set, because the values for healthy and unhealthy behavior do not overlap at all.
  9. Metrics (numeric value)
  10. SLI is typically derived from metrics
  11. An error budget is the amount of errors or service disruption that a system may accumulate within a given time period while still meeting its SLO. An error budget policy defines what the team does once that budget is exhausted or close to exhausted.
  12. Data is delivered as soon as it is produced (real-time or near real-time) by pushing it to the centralized data system
  13. Prometheus is a pull-based time-series database and monitoring system designed specifically for monitoring dynamic cloud-native environments. The ELK stack is a push-based system used for collecting, processing, storing, and visualizing log data. InfluxDB is a push-based time-series database designed to handle high volumes of time-stamped data (IoT, sensors, analytics).
  14. A numeric time-series data point
  15. An exemplar is a specific trace representative of measurements taken in a given time interval; it provides additional information about a specific data point.
  16. To gather and aggregate textual event data from a service for troubleshooting
  17. Logs, traces, and metrics
  18. Monitoring: continuous observation of a system to detect and alert on abnormal behavior.
  19. Telemetry: automated collection and transmission of data from remote sources.
  20. Data silos; volume, velocity, variety, and complexity of data; lack of pre-production
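
The SLO and error budget answers above can be made concrete with a little arithmetic. The following is a minimal Python sketch, not an official formula from any tool; the function name and the 99.9% / 30-day figures are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of full unavailability allowed in the window for a given SLO.

    slo_target is a fraction (0.999 means 99.9%). This treats the SLI as
    pure time-based availability, which is a simplification.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


# Illustrative numbers: a 99.9% availability SLO over a 30-day window.
budget = error_budget_minutes(0.999, 30)
print(f"{budget:.1f} minutes")  # prints "43.2 minutes"
```

Once the measured SLI has consumed that budget, the error budget policy decides what happens next (for example, pausing feature releases).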

Prometheus Fundamentals (20%)

  1. promtool
  2. Scalability for very large deployments (millions of time series), long-term storage, high cardinality, HA and replication
  3. SD is a mechanism that lets Prometheus automatically discover and monitor targets and services. Besides static configuration, there are 2 categories of mechanisms: top-down (e.g. EC2) and bottom-up (e.g. Consul)
  4. scrape_interval
  5. scrape_configs
  6. scrape_configs -> relabel_configs -> action: drop or action: keep
  7. with the flags --storage.tsdb.retention.time and --storage.tsdb.retention.size
  8. Retrieval, TSDB, HTTP Server
  9. with the flag --web.enable-lifecycle
  10. Sending a SIGHUP signal to the Prometheus process, Using the Prometheus API POST or PUT + /-/reload, Using a service manager (systemctl) or orchestration tool (k8s)
  11. HTTP GET method
  12. ec2_sd_configs
  13. ec2_sd_configs
  14. how frequently Prometheus collects and updates the metrics
  15. time-series database
  16. Prometheus exporter
  17. Pushgateway
  18. When labels in the scraped data should take precedence over the labels Prometheus attaches on the server side (e.g. job and instance), as with the Pushgateway or federation. This lets you differentiate between metrics coming from various sources or probe targets
  19. 9090:prometheus-server, 9093:alertmanager, 9100:node-exporter, 9091:pushgateway, 9115:blackbox-exporter
  20. instance and job
  21. ext4, XFS, and NTFS
  22. Internet Control Message Protocol (ICMP) -> prober:icmp
  23. scrape_configs > static_configs -> targets:xxx
  24. Agent mode is a lightweight Prometheus mode focused on service discovery, scraping, and remote write (remote storage), aimed especially at edge computing/IoT; querying, alerting, and local long-term storage are disabled
  25. ./promtool test rules test.yml
  26. ./promtool check config prometheus.yml (for rule files: ./promtool check rules test.yml)
  27. scrape_configs and *_sd_configs on per-job basis
  28. starting the server with the flag --web.enable-admin-api + curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={xxxx="yyy"}'
  29. starting the server with the flag --web.enable-admin-api + curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}', followed by curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones' to remove the deleted data from disk
  30. YAML and JSON
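
To illustrate the file-based SD answer above, here is a minimal Python sketch that produces the JSON layout read by file_sd_configs: a list of groups, each with a "targets" list and an optional "labels" map. The function name, host addresses, and label values are assumptions made up for this example.

```python
import json


def render_file_sd(groups):
    """Serialize (targets, labels) pairs into file-based SD JSON."""
    payload = [
        {"targets": list(targets), "labels": dict(labels)}
        for targets, labels in groups
    ]
    return json.dumps(payload, indent=2)


# Hypothetical node-exporter hosts, for illustration only.
print(render_file_sd([
    (["10.0.0.1:9100", "10.0.0.2:9100"], {"env": "prod"}),
]))
```

Writing this output to a file that a file_sd_configs entry points at is one way to feed targets from any external system into Prometheus.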

PromQL & Metrics (28%)

  1. Query Language for Prometheus
  2. A histogram samples observations (e.g. request durations or response sizes) and counts them in configurable buckets
  3. Scalar, String, Instant Vector, Range Vector
  4. Instant Vector
  5. predict_linear()
  6. instant vectors (logical/set operators are only defined between instant vectors)
  7. avg_over_time(metrics[x])
  8. avg_over_time(...) takes a range vector as input and returns an instant vector holding the average over time of each series. avg(...) is an aggregation operator that takes an instant vector as input and averages across series.
  9. rate(...) needs COUNTER type metrics
  10. offset shifts the evaluation time of a selector into the past by the given duration
  11. rate(...) calculates the average per-second rate of increase over the specified time range; irate(...) calculates the instant per-second rate from only the last 2 data points in the range
  12. deriv(...) operates on gauges and rate(...) operates on counters
  13. gauge
  14. float64.
  15. <basename>_bucket, <basename>_sum and <basename>_count
  16. metric name, labels, timestamp, value
  17. floor(...) = round a number down, ceil(...) = round a number up
  18. absent(...)
  19. logical => and, or, unless; arithmetic => + - * / % ^; comparison => ==, !=, >, <, >=, <=
  20. on, ignoring
  21. a part of vector matching. on, ignoring + group_left, group_right
  22. idelta()
  23. max(cert_expiry - time()) / 86400
  24. sum(), min(), max(), avg(), count()
  25. The label is reserved for internal use
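
The difference between rate(...) and irate(...) in the answers above can be sketched in a few lines of Python. This is a deliberately simplified model and not Prometheus's actual implementation: it ignores the extrapolation to the window boundaries that the real rate() performs, and the function names and sample values are assumptions.

```python
def simple_rate(samples):
    """Average per-second increase over all samples in the window.

    samples is a list of (timestamp_seconds, counter_value) pairs.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value is treated as a counter reset: count from zero.
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


def simple_irate(samples):
    """Per-second increase using only the last two samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / (t1 - t0)


# Hypothetical counter scraped every 15s, with a burst in the last interval.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 400.0)]
print(simple_rate(window))
print(simple_irate(window))  # prints 16.0
```

The burst dominates the irate-style result, while the rate-style result smooths it over the whole window, which is why rate(...) is usually preferred for alerting and irate(...) for fast-moving graphs.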

Instrumentation and Exporters (16%)

  1. X-Prometheus-Scrape-Timeout-Seconds
  2. target + module
  3. text-based format for exposing metrics
  4. No
  5. /metrics
  6. build_info
  7. Blackbox Exporter
  8. SNMP exporter
  9. HTTP protocol
  10. Registry serves as a central repository for collecting, storing, and managing metrics
  11. Exporter is responsible for collecting metrics from a specific system, application, or service and exposing them for Prometheus
  12. Network service monitoring, health checks, external monitoring
  13. scrape_configs > metrics_path: /metrics
  14. Blackbox Exporter allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP, ICMP, gRPC
  15. file_sd_configs
  16. HELP, TYPE
  17. JMX Exporter
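  18. honor_labels: true
  19. By scraping a timestamp metric that the job pushes, e.g. job_last_success_unixtime; the Pushgateway itself also exposes push_time_seconds for each group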
  20. online-serving, offline-processing, and batch jobs
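
As an illustration of the text-based exposition format and the HELP and TYPE attributes mentioned above, the following Python sketch renders one metric family the way a /metrics endpoint would present it. It is a simplified assumption-laden example: it does not escape special characters in label values or help text, and the function and metric names are invented for this sketch.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter with two labels.
print(render_metric(
    "http_requests_total", "counter", "Total HTTP requests.",
    [({"method": "get", "code": "200"}, 1027)],
), end="")
```

In practice a client library's registry generates this output, so hand-rendering like this is only useful for understanding what Prometheus receives on a scrape.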

Recording & Alerting & Dashboarding (18%)

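  1. Yes: define named intervals under the top-level time_intervals and reference them from the route with mute_time_intervals, e.g. mute_time_intervals: [holidays, offhours]. Defining the intervals under a top-level mute_time_intervals block is DEPRECATED in favor of time_intervals.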
  2. symptom-based and NOT causes-based
  3. symptom: the “what’s broken”, cause: the “why it’s broken”
  4. symptom is customer visible error
  5. level:metric:operations, e.g. job:node_cpu_seconds:avg_idle
  6. Acknowledge-based: notifications for an alert are sent to the recipient only once until the alert is acknowledged or resolved. Time-based: the rate of notifications is limited by a specific time interval (e.g. group_interval, repeat_interval)
  7. firing, pending, inactive
  8. Query the ALERTS time series in PromQL, e.g. ALERTS{alertstate="firing"}
  9. Recording rules precompute frequently used or expensive PromQL expressions (aggregations, filters) and store the results as new time series in the Prometheus DB
  10. In a rule file: groups -> rules -> record: xxx, expr: xxx
  11. Alert fatigue refers to a situation where individuals or teams become overwhelmed or desensitized by a large volume of alerts
  12. notification templates
  13. group_by: ['...']
  14. Grafana
  15. Inhibiting refers to a feature that suppresses notifications for certain alerts while other, specified alerts are already firing
  16. YAML
  17. for sets how long the alert condition must hold continuously before the alert fires (until then the alert is pending), helping to prevent false positives and reduce noise in alerting systems
  18. Silence (created in Alertmanager)
  19. Console templates
  20. Grouping
  21. Routing
  22. repeat_interval: how long to wait before sending a notification again for a firing alert that has already been sent successfully. continue: specifies whether to continue processing subsequent routes after an alert has matched this route. group_wait: sets how long to initially wait before sending a notification for a new group of alerts. group_interval: dictates how long to wait before sending a notification about new alerts added to a group that has already been notified
  23. annotations + labels
  24. Several Alertmanager instances configured to communicate with each other using the --cluster.* flags
  25. Active, Pending, Expired
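
The three alert states and the role of the for attribute from the answers above can be summarized as a small state function. This Python sketch is a simplified model under stated assumptions, not Prometheus's rule evaluation code: it ignores evaluation intervals and features such as keep_firing_for, and all names are hypothetical.

```python
def alert_state(condition_true_since, now, for_seconds):
    """Return the state of an alert rule at time `now` (seconds).

    condition_true_since is the time the expression started returning
    results, or None if it currently returns nothing.
    """
    if condition_true_since is None:
        return "inactive"
    if now - condition_true_since < for_seconds:
        # The condition holds, but not yet for the full `for` duration.
        return "pending"
    return "firing"


# Hypothetical rule with `for: 5m` (300 seconds).
print(alert_state(None, 100, 300))  # prints "inactive"
print(alert_state(0, 120, 300))     # prints "pending"
print(alert_state(0, 300, 300))     # prints "firing"
```

Only firing alerts are sent to Alertmanager, where routing, grouping, inhibition, and silences then decide whether and when a notification goes out.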