This guide walks through setting up a Docker-based monitoring and logging stack with Grafana, Prometheus, and Loki, targeting Windows Server 2019 and 2022. Each component is deployed with Docker, with provisioning files supplied for Grafana and Prometheus. Loki handles log aggregation, Prometheus handles metrics collection, and Grafana serves as the visualization layer. We’ll use the Grafana Agent for metrics and log collection on each Windows server.
1. Prerequisites
– Docker installed on the machine where Grafana, Prometheus, and Loki will run.
– Windows Server 2019/2022 machines on which the Grafana Agent will be installed natively (Docker is not required on these servers).
2. Docker Setup for Grafana, Prometheus, and Loki
Create a `docker-compose.yml` file to deploy the Grafana, Prometheus, Loki, and Promtail containers. Loki and Promtail are pinned to 2.9.x here, because the Loki configuration later in this guide uses fields that were removed in Loki 3.x:

```yaml
version: '3.7'

services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - ./grafana:/var/lib/grafana
      - ./grafana_provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - "9090:9090"

  loki:
    image: grafana/loki:2.9.2
    container_name: loki
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - ./loki:/etc/loki

  promtail:
    image: grafana/promtail:2.9.2
    container_name: promtail
    volumes:
      - ./promtail:/etc/promtail
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped
```
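The `promtail` service above expects a configuration at `./promtail/config.yml`, which this guide does not otherwise define. A minimal sketch that tails logs on the Docker host itself and pushes them to the Loki container (the `/var/log` path is an assumption for a Linux host):

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: ['localhost']
        labels:
          job: varlogs
          __path__: /var/log/*log
```

If you use this, also bind-mount the host logs into the container by adding `- /var/log:/var/log:ro` to the promtail service’s volumes.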
3. Prometheus Configuration
Prometheus needs to be configured to scrape the metrics endpoint exposed on each of your Windows servers.
Create a `prometheus.yml` file in the `./prometheus/` directory:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'windows-servers'
    static_configs:
      - targets: ['windows-server-1:9182', 'windows-server-2:9182']
```
Here, `windows-server-1` and `windows-server-2` are placeholders for your Windows servers' IP addresses or hostnames. Port `9182` is the default port of the `windows_exporter`, whose metrics the Grafana Agent can also expose through its `windows_exporter` integration.
4. Loki Configuration
Loki needs to be configured to receive log pushes from the Grafana Agent running on your Windows servers (and from the Promtail container on the Docker host).
Create a `local-config.yaml` file in the `./loki/` directory. This configuration targets Loki 2.x; fields such as `max_transfer_retries`, `shared_store`, and the `table_manager` were removed in Loki 3.x:
```yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  max_chunk_age: 1h
  chunk_target_size: 1048576
  chunk_retain_period: 30s
  max_transfer_retries: 0

schema_config:
  configs:
    - from: 2022-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    cache_location: /tmp/loki/boltdb-shipper-cache
    shared_store: filesystem
  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
```
5. Grafana Configuration
Grafana requires provisioning to automatically load dashboards, data sources, and notification channels.
Create the following structure inside `./grafana_provisioning/`:
5.1 Data Sources
Create `datasources.yaml` inside `./grafana_provisioning/datasources/`:
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: false
```
5.2 Dashboards
Create `dashboards.yaml` inside `./grafana_provisioning/dashboards/`:
```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /var/lib/grafana/dashboards
```
You can place dashboard JSON files in the `./grafana/dashboards/` directory on the host (mounted into the container at `/var/lib/grafana/dashboards`); Grafana will load them automatically.
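As a starting point, here is a minimal dashboard sketch the file provider above can load (save it as, e.g., `./grafana/dashboards/windows-overview.json`; the filename, `uid`, and query are illustrative assumptions):

```json
{
  "uid": "windows-overview",
  "title": "Windows Overview",
  "schemaVersion": 36,
  "panels": [
    {
      "type": "timeseries",
      "title": "CPU time by mode",
      "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
      "targets": [
        { "expr": "rate(windows_cpu_time_total[5m])", "refId": "A" }
      ]
    }
  ]
}
```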
6. Windows Server Configuration
On each Windows server, install the Grafana Agent to collect metrics and logs.
6.1 Installing Grafana Agent
Download and install the Grafana Agent on each server from the [official release page](https://github.com/grafana/agent/releases). After installation, configure the agent.
6.2 Grafana Agent Configuration
Create a configuration file (e.g., `agent-config.yaml`). The sketch below uses the agent’s static mode; the file paths, host names, and log path are placeholders to adapt to your environment:

```yaml
server:
  log_level: info

metrics:
  # Use a Windows path for the write-ahead log.
  wal_directory: 'C:\ProgramData\grafana-agent\wal'
  global:
    scrape_interval: 15s
  configs:
    - name: windows-metrics
      scrape_configs:
        - job_name: 'windows-server'
          static_configs:
            - targets: ['localhost:9182']
      # Optional: push metrics to Prometheus instead of having Prometheus
      # scrape port 9182 directly. This requires Prometheus to be started
      # with --web.enable-remote-write-receiver.
      remote_write:
        - url: http://prometheus-host:9090/api/v1/write

logs:
  configs:
    - name: windows-logs
      positions:
        filename: 'C:\ProgramData\grafana-agent\positions.yaml'
      # Push collected logs to Loki; replace 'loki' with the address of
      # the Docker host running the Loki container.
      clients:
        - url: http://loki:3100/loki/api/v1/push
      scrape_configs:
        - job_name: 'windows-logs'
          static_configs:
            - targets: ['localhost']
              labels:
                job: windows
                host: 'windows-server-1'
                # Example path; point this at the log files you want to tail.
                __path__: 'C:\logs\*.log'
          pipeline_stages:
            - json:
                expressions:
                  level: level
                  msg: message
```
6.3 Running Grafana Agent
Run the agent from PowerShell, adjusting the binary and configuration file paths to match your installation:

```shell
.\agent-windows-amd64.exe -config.file=agent-config.yaml
```
7. Advanced Configuration
7.1 Alerting with Prometheus
Add alerting rules to your Prometheus configuration (`prometheus.yml`). Note that the `alertmanagers` target assumes an Alertmanager instance is reachable on port 9093:

```yaml
rule_files:
  - "alert.rules"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```
Create an `alert.rules` file. `windows_exporter` does not expose a ready-made CPU-percentage gauge, so usage is derived from the `windows_cpu_time_total` counter:

```yaml
groups:
  - name: Windows Alert Group
    rules:
      - alert: HighCPUUsage
        # CPU usage (%) derived from the idle-mode counter
        expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 80
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage on {{ $labels.instance }} has exceeded 80%."
```
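The compose file in section 2 does not include an Alertmanager, so one must be added for these alerts to be delivered. A minimal sketch of an `alertmanager.yml` with a single webhook receiver (the URL is a placeholder):

```yaml
route:
  receiver: default

receivers:
  - name: default
    webhook_configs:
      # Placeholder endpoint; point this at your notification integration.
      - url: http://example.internal/alert-hook
```

You could run Alertmanager as an additional `prom/alertmanager` service in the compose file and change the Prometheus `alertmanagers` target from `localhost:9093` to `alertmanager:9093`.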
7.2 Grafana Alerting
Grafana can be used for alerting by creating alert rules within the UI or provisioning them via configuration files.
7.3 Securing the Setup
Consider adding authentication and TLS to secure communication between Prometheus, Loki, and Grafana. This typically involves placing the services behind a reverse proxy or using their built-in authentication and certificate-management options.
8. Starting the Stack
Run the following command in the directory where your `docker-compose.yml` file is located:
```shell
docker-compose up -d
```
This will start all the services.
9. Accessing the Services
– Grafana: Accessible at `http://localhost:3000`
– Prometheus: Accessible at `http://localhost:9090`
– Loki: Loki does not have a direct UI, but logs can be queried from Grafana.
10. Quick Validation Checklist
Ensure the following:
1. Grafana dashboards are populated with metrics and logs from your Windows servers.
2. Prometheus is scraping metrics and displaying them correctly.
3. Loki is receiving logs and they are accessible through Grafana’s explore feature.
11. Testing and Validation
Once your Docker-based Grafana, Prometheus, and Loki setup is running, it’s crucial to verify that each component is functioning as expected:
11.1 Grafana Dashboards
– Access Grafana: Open your browser and navigate to `http://localhost:3000` (or your server’s IP if running remotely). Log in using the default credentials (`admin/admin` unless changed).
– Add Dashboards: If you’ve provisioned dashboards as per the `dashboards.yaml` configuration, they should already be visible. Otherwise, import dashboards manually via `Dashboards -> Import`, either from JSON files or from dashboards published on the Grafana website.
11.2 Prometheus Metrics
– Prometheus UI: Navigate to `http://localhost:9090` to access the Prometheus UI.
– Check Targets: Under the `Status -> Targets` page, ensure that all your Windows servers appear and that Prometheus is successfully scraping metrics.
– Run Queries: Test some basic queries, such as `up`, `windows_cpu_time_total`, or `windows_os_physical_memory_free_bytes`, to verify that Prometheus is collecting data as expected.
11.3 Loki Logs
– Grafana Log Querying: In Grafana, go to the `Explore` section. Select `Loki` as the data source and run queries to see if logs are being ingested correctly. An example query might be `{job="windows"}` to see logs tagged with the `windows` job label.
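A few more LogQL queries to try in Explore, assuming the `job=windows` label and the JSON-formatted log lines from the agent configuration:

```logql
# all logs carrying the windows job label
{job="windows"}

# only lines containing the string "error"
{job="windows"} |= "error"

# parse JSON lines and filter on the extracted level field
{job="windows"} | json | level = "error"
```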
12. Optimization
After the initial setup, you can further optimize and fine-tune your stack.
12.1 Prometheus Tuning
– Scrape Intervals: Depending on your use case, adjust the `scrape_interval` and `evaluation_interval` in the `prometheus.yml` to balance performance with the level of detail you need.
– Retention Policies: Set appropriate data retention policies in Prometheus to manage disk space usage by configuring `--storage.tsdb.retention.time` in the Prometheus command line arguments.
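For example, retention can be set by extending the `prometheus` service’s `command` in the compose file (the 15-day window is an arbitrary illustration):

```yaml
  prometheus:
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
```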
12.2 Loki Optimization
– Log Retention: Configure log retention in Loki so it does not exhaust disk space. In Loki 2.x this can be set via `retention_period` under `table_manager`, or preferably via the compactor together with `limits_config.retention_period`; Loki 3.x handles retention through the compactor only.
– Pipeline Stages: Optimize Loki’s `pipeline_stages` for parsing and filtering logs. This reduces the volume of logs stored and improves query performance.
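For the Loki 2.x configuration used earlier, compactor-based retention can be sketched as follows (744h, i.e. 31 days, is an arbitrary example value; merge the `limits_config` entry with the existing block in `local-config.yaml`):

```yaml
compactor:
  working_directory: /tmp/loki/compactor
  shared_store: filesystem
  retention_enabled: true

limits_config:
  retention_period: 744h
```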
12.3 Grafana Dashboards
– Custom Dashboards: Create custom dashboards tailored to your specific infrastructure needs, combining metrics from Prometheus and logs from Loki. This allows you to correlate events and monitor performance more effectively.
– Alerting: Set up alerts directly within Grafana (or through Prometheus rules) to get notified about critical issues such as high CPU usage or low disk space.
13. Troubleshooting
Even with careful setup, you may encounter issues. Here are some common troubleshooting steps:
13.1 Grafana Issues
– Data Source Connection: If Grafana cannot connect to Prometheus or Loki, verify that the `datasources.yaml` configuration is correct and that the services are accessible at the specified URLs.
– Dashboard Import: If a dashboard does not load correctly, ensure the JSON is formatted correctly and that all necessary metrics or logs are available in the data sources.
13.2 Prometheus Issues
– Target Down: If a target is shown as `DOWN`, verify network connectivity and firewall settings between the Prometheus server and the Windows machines.
– High Memory Usage: Prometheus can consume significant memory. Consider reducing the number of targets or adjusting the scrape interval to manage memory usage.
13.3 Loki Issues
– No Logs in Grafana: If logs are not appearing in Grafana, check Loki’s configuration files for errors, and ensure that the Grafana Agent is running correctly on your Windows servers and can reach the Loki instance.
– High Disk Usage: If Loki is consuming too much disk space, review your log retention policies and consider enabling compression or reducing log verbosity.
14. Expansion
As your monitoring needs grow, you might want to expand the setup:
14.1 Adding More Targets
– New Windows Servers: Add new Windows servers to your environment by updating the `prometheus.yml` and deploying the Grafana Agent on the new servers.
– Other Platforms: If you need to monitor Linux servers or Docker containers, you can expand your Prometheus and Loki configurations to include these targets.
14.2 Scaling the Stack
– Horizontal Scaling: Consider setting up a more scalable infrastructure by running Prometheus, Grafana, and Loki in a Kubernetes cluster. This allows for better resource management and high availability.
– Sharding and Federation: For very large environments, implement Prometheus federation or sharding to distribute the load across multiple Prometheus servers.
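As a sketch, a global Prometheus can federate selected series from shard-level servers via the standard `/federate` endpoint (`prometheus-shard-1` is a placeholder hostname):

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="windows-servers"}'
    static_configs:
      - targets: ['prometheus-shard-1:9090']
```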
14.3 Integrating with Other Tools
– Alertmanager: Integrate Prometheus with Alertmanager to handle alerts and send notifications to various channels like Slack, email, or PagerDuty.
– Tempo: Add Tempo to the stack for distributed tracing, allowing you to trace requests across microservices and correlate with logs and metrics.
15. Security Considerations
Securing your monitoring stack is crucial to prevent unauthorized access and data breaches:
15.1 Authentication and Authorization
– Grafana: Enable and configure user authentication in Grafana, using options like OAuth, LDAP, or built-in users and roles.
– Prometheus and Loki: Secure these services using reverse proxies with TLS certificates and HTTP basic auth or OAuth proxies.
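For Grafana specifically, several hardening options can be set through environment variables in the compose file; a sketch (the values are illustrative):

```yaml
  grafana:
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=change-me
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=false
```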
15.2 Network Security
– Firewall Rules: Ensure that your firewall rules allow only trusted sources to access Prometheus, Loki, and Grafana.
– VPNs and Private Networks: If possible, keep these services on a private network accessible only via VPN or secure tunnels.
16. Backup and Disaster Recovery
Implementing a backup and disaster recovery strategy ensures your monitoring setup is resilient:
16.1 Backing Up Configuration
– Version Control: Store your configuration files (Docker Compose, Prometheus, Loki, Grafana provisioning) in a version control system like Git.
– Automated Backups: Set up automated backups of the Grafana database and Loki’s data store.
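As a sketch, the bind-mounted data directories from the compose file can be archived with a small script (it assumes you run it from the compose directory; adapt the paths to your layout):

```shell
#!/bin/sh
# Archive the bind-mounted data directories used by the compose file above.
BACKUP_DIR=${BACKUP_DIR:-./backups}
STAMP=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"

# Only archive directories that actually exist on this host.
SOURCES=""
for d in ./grafana ./prometheus ./loki; do
  [ -d "$d" ] && SOURCES="$SOURCES $d"
done

if [ -n "$SOURCES" ]; then
  tar -czf "$BACKUP_DIR/monitoring-$STAMP.tar.gz" $SOURCES
  echo "wrote $BACKUP_DIR/monitoring-$STAMP.tar.gz"
else
  echo "no data directories found; nothing to back up"
fi
```

Run it from cron (or a scheduled task) and copy the resulting tarballs off-host for real disaster-recovery coverage.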
16.2 Restoring Services
– Disaster Recovery Plan: Document the steps required to restore your monitoring setup, including re-deploying Docker containers and restoring data from backups.
This setup provides a comprehensive monitoring and logging solution using Docker-based Grafana, Prometheus, and Loki, with Windows Server 2019 and 2022 as targets. By carefully configuring each component, provisioning dashboards and data sources, and optimizing the stack, you can achieve effective infrastructure monitoring. Advanced configurations like alerting, scaling, and security further enhance the reliability and usability of your monitoring system.
Remember that monitoring is an ongoing process, and your setup will need to evolve as your infrastructure grows or changes. Regularly review and adjust configurations to keep your monitoring effective and efficient.