
Elasticsearch vs Time-Series DB: Key Differences Explained

Should you store logs in Elasticsearch or InfluxDB? We compare search engines and time-series databases.

Abstract Algorithms · 13 min read

AI-assisted content.

TLDR: Elasticsearch is built for search — full-text log queries, fuzzy matching, and relevance ranking via an inverted index. InfluxDB and Prometheus are built for metrics — numeric time series with aggressive compression. Picking the wrong one can waste 10× the storage or make queries 100× slower.


📖 Logs vs Metrics: Two Different Storage Problems

A log is a sentence: 2024-01-15 ERROR: failed to connect to database host=db1.

A metric is a number at a timestamp: cpu.usage{host=web1} = 87.3 @ 1705312800.

These look similar (both are time-ordered data) but demand fundamentally different storage strategies:

| Property | Log data | Metric data |
|---|---|---|
| Structure | Semi-structured text | Strictly typed numbers |
| Query pattern | Full-text search, grep, aggregation | Range queries, rate calculations, aggregation |
| Cardinality | Unbounded keys | Bounded label/tag sets |
| Update frequency | Write-once streams | Regular intervals (e.g. every 15s) |
| Retention | Days to months (expensive) | Months to years (cheap with downsampling) |

If you try to store metrics in Elasticsearch you pay for an inverted index on data that never needs text search. If you push structured logs into Prometheus you lose the ability to query individual events. The mismatch matters from day one.
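The two shapes are easy to make concrete with a pair of plain Java records (hypothetical types for illustration, not from any library):

```java
import java.time.Instant;
import java.util.Map;

// A log is a sentence: free-form text plus whatever fields the parser extracts.
record LogEvent(Instant timestamp, String level, String message,
                Map<String, String> fields) {}

// A metric is a number at a clock tick: fixed labels, one numeric value.
record MetricSample(String name, Map<String, String> labels,
                    long epochSeconds, double value) {}

public class Shapes {
    public static void main(String[] args) {
        var log = new LogEvent(Instant.now(), "ERROR",
                "failed to connect to database", Map.of("host", "db1"));
        var metric = new MetricSample("cpu.usage",
                Map.of("host", "web1"), 1705312800L, 87.3);

        // The log carries searchable text; the metric carries only numbers + labels.
        System.out.println(log.message().contains("database")); // a text query
        System.out.println(metric.value() > 80.0);              // a numeric query
    }
}
```

The natural query against each type is already different: substring/term matching on the log, numeric comparison on the metric — which is exactly the split the two storage engines optimize for.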


🔍 The Basics: Search Engines vs Time-Series Databases

Before diving into how each system is built, it helps to understand the fundamental contract each one offers.

Elasticsearch is a search engine that happens to be used for logs. It inherits its design from Apache Lucene: every field in every document is tokenized and added to an inverted index. An inverted index is just a look-up table that maps each unique word to the list of documents that contain it — the same idea as the index at the back of a textbook. Because every term is indexed, Elasticsearch can answer "which documents mention both 'payment' and 'failure'?" in milliseconds regardless of how many billions of documents you have.

Time-series databases (TSDBs) such as InfluxDB, Prometheus, and TimescaleDB take the opposite bet. They assume your data is a stream of numeric measurements arriving at regular intervals: CPU utilization every 15 seconds, request latency on every HTTP response, bytes sent per network interface per minute. TSDBs optimize for this shape: they store values in time-ordered columnar blocks, apply delta encoding and XOR compression, and pre-build aggregations (rollups) so that range queries like "average CPU over the last six hours" return instantly.

| Concept | Elasticsearch | TSDB (Prometheus / InfluxDB) |
|---|---|---|
| Core index type | Inverted index (term → document IDs) | Columnar time blocks (timestamp + value) |
| Native query | Full-text search, aggregation | Rate, sum, avg over time ranges |
| Schema | Flexible (dynamic mapping) | Strict (labels must be pre-planned) |
| Compression | ~50–100 bytes per log event | 1–2 bytes per numeric data point |
| Best data shape | Semi-structured text events | Regular numeric measurements |

The simplest heuristic: if your data is a sentence (with words you want to search), use Elasticsearch. If your data is a number at a clock tick, use a TSDB.


🔍 Elasticsearch: The Inverted Index

Elasticsearch is built on Apache Lucene. Its core data structure is the inverted index: a map from every term (word) to the list of documents that contain it.

"failed" → [doc_3, doc_7, doc_12]
"database" → [doc_3, doc_9]
"connection" → [doc_7, doc_12, doc_20]

This lets Elasticsearch answer "find all logs containing 'database' AND 'connection'" in milliseconds, even across billions of log lines.
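That lookup can be sketched in a few lines of Java — a toy model of posting lists and AND-query intersection, not Lucene's actual implementation:

```java
import java.util.*;

public class InvertedIndex {
    // term -> sorted set of document IDs containing it (the "posting list")
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Tokenize a document and record its id against every term it contains.
    void index(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // AND query: intersect the posting lists of all terms.
    Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        var idx = new InvertedIndex();
        idx.index(3, "ERROR failed to connect to database");
        idx.index(7, "WARN connection failed");
        idx.index(9, "INFO database ready");
        System.out.println(idx.search("database"));            // [3, 9]
        System.out.println(idx.search("failed", "database"));  // [3]
    }
}
```

The key property: query cost depends on the length of the posting lists for the queried terms, not on the total number of documents — which is why the real thing stays fast at billions of log lines.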

Strengths:

  • Full-text search with stemming, fuzzy matching, synonyms
  • Relevance ranking (BM25)
  • Aggregation pipelines (histograms, top-N, date histograms)
  • Schema flexibility (dynamic mappings)

Weaknesses:

  • High storage overhead — inverted index per field duplicates data
  • Poor at range math on numeric series (no delta encoding)
  • High cardinality is expensive: each unique label value adds index memory

⚙️ Time-Series DBs: Delta Encoding and Columnar Compression

TSDBs (InfluxDB, Prometheus, TimescaleDB, VictoriaMetrics) are optimized for the fact that metric values change slowly.

Delta encoding example:

Raw:     100, 101, 102, 103
Encoded: 100, +1, +1, +1

Storing deltas instead of absolute values shrinks the integers dramatically: a run of identical deltas can collapse to roughly one bit per sample. With additional compression (Gorilla XOR encoding, Snappy), modern TSDBs achieve 1–2 bytes per data point versus Elasticsearch's 50–100 bytes per log document.
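A minimal delta encoder/decoder in Java makes the transformation concrete (illustrative only — real TSDBs layer variable-length and XOR bit-packing on top of this):

```java
import java.util.Arrays;

public class DeltaCodec {
    // Store the first value raw, then only the difference to the previous value.
    static long[] encode(long[] raw) {
        long[] out = new long[raw.length];
        out[0] = raw[0];
        for (int i = 1; i < raw.length; i++) out[i] = raw[i] - raw[i - 1];
        return out;
    }

    // Decoding is a running sum over the deltas — lossless round trip.
    static long[] decode(long[] deltas) {
        long[] out = new long[deltas.length];
        out[0] = deltas[0];
        for (int i = 1; i < deltas.length; i++) out[i] = out[i - 1] + deltas[i];
        return out;
    }

    public static void main(String[] args) {
        long[] raw = {100, 101, 102, 103};
        long[] enc = encode(raw);
        System.out.println(Arrays.toString(enc));            // [100, 1, 1, 1]
        System.out.println(Arrays.equals(raw, decode(enc))); // true
    }
}
```

After encoding, the stream is dominated by tiny repeated values — exactly the shape that bit-packing and general-purpose compressors handle best.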

flowchart LR
    Sensor[Sensor 87.3 87.4 87.5] --> Delta[Delta Encoding 87.3 +0.1 +0.1]
    Delta --> Compress[Gorilla XOR Compression]
    Compress --> TSDB[(TSDB Block 1-2 bytes/point)]
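The XOR step works because consecutive floating-point samples share their sign, exponent, and most mantissa bits, so XORing them yields mostly zeros with only a narrow window of meaningful bits. A quick Java check of that intuition (the bit-window bookkeeping of real Gorilla encoding is omitted):

```java
public class XorIntuition {
    public static void main(String[] args) {
        long prev = Double.doubleToRawLongBits(87.3);
        long curr = Double.doubleToRawLongBits(87.4);
        long xor  = prev ^ curr;

        // Identical high bits XOR to zero; only a short run of bits differs.
        int leading    = Long.numberOfLeadingZeros(xor);
        int trailing   = Long.numberOfTrailingZeros(xor);
        int meaningful = 64 - leading - trailing;

        System.out.println("leading zeros:   " + leading);
        System.out.println("meaningful bits: " + meaningful);
        // Gorilla stores just (leading-zero count, window length, window bits) —
        // far fewer than 64 bits per sample when values change slowly.
    }
}
```

Since 87.3 and 87.4 share the same sign and exponent, at least the top 12 bits of the XOR are guaranteed to be zero before the mantissa overlap even starts.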

Strengths:

  • Efficient storage (10–50× smaller than Elastic for pure metrics)
  • Fast range queries and time aggregations (SUM, AVG, RATE)
  • Built-in downsampling and retention policies
  • Cardinality-efficient label model (Prometheus label sets)

Weaknesses:

  • Poor at full-text search (no inverted index)
  • Limited schema flexibility (labels must be pre-planned for cardinality control)

📊 Data Flow: From Ingestion to Query

In real production systems these two database types live side by side. The ELK stack (Elasticsearch + Logstash + Kibana) handles unstructured log events. The Prometheus + Grafana stack handles numeric metrics. Each path is optimized for its data shape:

flowchart LR
    App[Application] -->|log events| LS[Logstash / Filebeat]
    LS --> ES[(Elasticsearch)]
    ES --> Kibana[Kibana Dashboard]

    App -->|numeric metrics| Prom[Prometheus Scraper]
    Prom --> InfluxDB[(InfluxDB / TSDB)]
    InfluxDB --> Grafana[Grafana Dashboard]

The log pipeline parses unstructured text, indexes every field, and stores full events so engineers can run ad-hoc text searches during an incident. The metrics pipeline collects numeric samples, compresses them aggressively into time blocks, and pre-computes rollups so dashboards render instantly even for 30-day queries.

Notice that the same application feeds both pipelines simultaneously. This is the normal production pattern — not a choice of one or the other, but deliberate separation of concerns. Logs answer what happened; metrics answer how often and how much.

📊 Choosing Between Elasticsearch and TSDB

flowchart TD
    Start[What type of data?] --> Text{Is the data text or events?}
    Text -->|Yes: log lines, audit trails| ES[Use Elasticsearch Inverted index search]
    Text -->|No: numeric measurements| Num{Regular interval samples?}
    Num -->|Yes: every 15s, counters| TSDB[Use Prometheus / InfluxDB Delta encoding + rollups]
    Num -->|No: irregular events| Mixed{Need full-text search too?}
    Mixed -->|Yes| ES
    Mixed -->|No| TSDB
    ES --> Kibana[Visualize with Kibana KQL queries]
    TSDB --> Grafana[Visualize with Grafana PromQL queries]

This decision flowchart routes any incoming data type to the right storage engine based on two questions: is the data text or events, and does it arrive at regular numeric intervals? Text and event data — log lines, audit trails — belong in Elasticsearch for full-text inverted-index search; regular numeric samples belong in Prometheus or InfluxDB where delta encoding and time-based rollups make range queries orders of magnitude cheaper. The bottom nodes show the matching visualization layer: Kibana with KQL for Elasticsearch, and Grafana with PromQL for TSDB.

📊 Log Query Flow Through Elasticsearch

sequenceDiagram
    participant App as Application
    participant FB as Filebeat Agent
    participant LS as Logstash
    participant ES as Elasticsearch
    participant KB as Kibana

    App->>FB: Emit log line (ERROR + stack trace)
    FB->>LS: Forward log event
    LS->>LS: Parse + enrich + filter
    LS->>ES: Index document (inverted index)
    ES-->>LS: Indexed OK

    KB->>ES: KQL: status:500 AND path:/checkout
    ES->>ES: Inverted index lookup
    ES-->>KB: Matching log documents
    KB-->>KB: Render in dashboard

This sequence diagram traces a single error log from application emission to Kibana visualization. The log passes through Filebeat (the lightweight agent), Logstash (enrichment and field parsing), and Elasticsearch (inverted-index storage) before Kibana issues a KQL query and renders the results. The round-trip from KB->>ES to ES-->>KB represents the actual search: Kibana sends a structured query, Elasticsearch executes a fast inverted-index lookup, and returns only the matching documents — not a full sequential file scan.

| Pipeline | Ingest tool | Storage | Visualization | Query style |
|---|---|---|---|---|
| Logs | Logstash, Filebeat, Fluentd | Elasticsearch | Kibana | KQL / Lucene syntax |
| Metrics | Prometheus, Telegraf | InfluxDB, VictoriaMetrics | Grafana | PromQL / Flux |

🌍 Real-World Application: Which One to Use and When

| Situation | Use |
|---|---|
| "Find all error logs containing 'timeout'" | Elasticsearch |
| "What was the p99 latency over the last 6 hours?" | Prometheus / InfluxDB |
| "Show me all logs where user_id=12345 performed a payment" | Elasticsearch |
| "Alert when CPU > 90% for 5 minutes" | Prometheus |
| "Audit trail: who changed what and when" | Elasticsearch |
| "How many requests per second to /api/v1/order over 30 days?" | TimescaleDB / InfluxDB |

In practice: Production observability stacks often use both. The ELK stack (Elasticsearch + Logstash + Kibana) handles logs; Prometheus + Grafana handles metrics.


⚖️ Trade-offs & Failure Modes: Cardinality and TSDB Limits

The biggest operational risk in TSDBs is high-cardinality labels.

Prometheus memory usage scales with the number of active time series — roughly the product of each label's distinct-value count. A common trap: using user_id or session_id as a Prometheus label. One million users = one million separate time series = OOM crash.

Rule: TSDBs track populations (per-service, per-host, per-endpoint). Elasticsearch searches individuals (this log, this request, this user).

High-cardinality data can be stored in Elasticsearch because it does not pre-aggregate — it indexes individual events and queries them at read time. The trade-off is storage cost: Elasticsearch will happily index one billion log lines with one billion unique request IDs, but it will charge you disk and memory for every one of them.
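The series-count arithmetic is worth spelling out: active series = the product of each label's distinct-value count, so a single high-cardinality label multiplies everything. A back-of-envelope sketch (the ~4 KB per-series memory figure is an assumption for illustration, order-of-magnitude only):

```java
public class CardinalityMath {
    // Active series = product of distinct values per label.
    static long series(long... labelCardinalities) {
        long total = 1;
        for (long c : labelCardinalities) total *= c;
        return total;
    }

    public static void main(String[] args) {
        // Safe label set: service (3) x host (20) x endpoint (50)
        long safe = series(3, 20, 50);
        System.out.println("safe label set: " + safe + " series"); // 3000

        // Trap: the same metric with a user_id label (1M users) -> 3 billion series
        long trap = series(3, 20, 50, 1_000_000);
        System.out.println("with user_id:   " + trap + " series");

        // At a hypothetical ~4 KB of memory per active series:
        System.out.printf("approx memory:  %.1f TB%n", trap * 4096.0 / 1e12);
    }
}
```

Three thousand series is trivial; three billion is terabytes of index memory — the arithmetic, not the software, is what crashes the server.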


🧪 Practical: Setting Up Your Observability Stack

A beginner-friendly starting point is the ELK + Prometheus/Grafana split. Here is the decision process for each type of signal:

Step 1 — Classify your signal. Ask: Is this a sentence or a number? If your application writes structured log lines (ERROR user=42 action=checkout latency_ms=1200), those go to Elasticsearch. If your application exposes a /metrics endpoint with Prometheus counters and gauges, those go to a TSDB.

Step 2 — Design your Prometheus labels carefully. Every label combination creates a separate time series. Safe labels are low-cardinality: environment=prod, service=api, endpoint=/checkout. Dangerous labels are high-cardinality: user_id, request_id, session_token. If you need per-user analytics, store that in Elasticsearch (log the event), not in Prometheus.

Step 3 — Use Elasticsearch index lifecycle management (ILM). Logs are expensive to store long-term because there is no compression equivalent to delta encoding for text. Configure hot → warm → cold → delete tiers: keep the last 7 days on fast SSD (hot), move 8–30 days to cheaper hardware (warm), archive 31–90 days to object storage (cold), then delete. This can cut Elasticsearch storage costs by 60–80%.
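The hot → warm → cold → delete schedule maps directly onto an ILM policy. A sketch in Kibana Dev Tools console syntax — the repository name s3-archive and the data: warm node attribute are assumptions about your cluster setup, and action availability (e.g. searchable snapshots) varies by Elasticsearch version and license tier:

```
PUT _ilm/policy/app-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0d",
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "require": { "data": "warm" } }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "s3-archive" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Attach the policy to an index template so every rolled-over index inherits the lifecycle automatically.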

Step 4 — Set Prometheus retention and downsampling. Raw Prometheus data at 15-second resolution costs about 2 bytes per sample. After 15 days, downsample to 5-minute resolution; after 90 days, downsample to 1-hour. This retains trend visibility for capacity planning while drastically shrinking storage. Tools like Thanos and Cortex handle long-term TSDB retention with object storage backends.
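Downsampling itself is just bucketed averaging. A Java sketch of the idea (Thanos and Cortex perform this inside compacted storage blocks, not in application code):

```java
import java.util.Arrays;

public class Downsample {
    // Collapse fine-grained samples into fixed-size bucket averages,
    // e.g. 20 raw 15-second samples -> one 5-minute average.
    static double[] downsample(double[] samples, int bucketSize) {
        int buckets = (samples.length + bucketSize - 1) / bucketSize;
        double[] out = new double[buckets];
        for (int b = 0; b < buckets; b++) {
            int from = b * bucketSize;
            int to = Math.min(from + bucketSize, samples.length);
            double sum = 0;
            for (int i = from; i < to; i++) sum += samples[i];
            out[b] = sum / (to - from);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] raw = {10, 20, 30, 40, 50, 60};  // six raw 15s samples
        // Three samples per bucket -> two averaged points
        System.out.println(Arrays.toString(downsample(raw, 3))); // [20.0, 50.0]
    }
}
```

Note that averaging is lossy by design: spikes shorter than the bucket width disappear, which is why production downsamplers typically also keep min/max/count per bucket.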

A common mistake to avoid: routing application logs directly into Prometheus by converting each log line into a Prometheus counter. This is tempting because it is "one fewer system," but it explodes cardinality if the log contains any unique identifiers. Keep logs in Elasticsearch and metrics in a TSDB — the operational cost of running both is lower than the cost of debugging a cardinality explosion in production.


🛠️ Spring Data Elasticsearch & InfluxDB Java Client: Querying Logs and Metrics from Java

Spring Data Elasticsearch brings the familiar Spring Data @Repository abstraction to Elasticsearch — replacing hand-crafted JSON queries with method-name conventions and @Query annotations. InfluxDB's Java client (influxdb-client-java) lets you write timestamped Point objects and run Flux aggregation queries without leaving Java — mirroring the two-pipeline architecture (ES for logs, TSDB for metrics) described in this post.

// ─── Spring Data Elasticsearch: search log events ─────────────────────────────
// build.gradle: org.springframework.boot:spring-boot-starter-data-elasticsearch

import java.time.Instant;
import java.util.List;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

@Document(indexName = "nginx-logs")
public record NginxLog(
    @Id String id,
    int status,
    String path,
    Instant timestamp
) {}

// The method name generates the ES query automatically — no JSON needed.
// Under the hood, Elasticsearch runs an indexed range query and returns
// matches in milliseconds, even across billions of log documents.
public interface NginxLogRepository
        extends ElasticsearchRepository<NginxLog, String> {
    List<NginxLog> findByStatusGreaterThanEqualAndTimestampBetween(
            int status, Instant from, Instant to);
    // ↑ reads as: "find all HTTP 5xx errors in a given time window",
    //   e.g. findByStatusGreaterThanEqualAndTimestampBetween(500, from, to)
}
// ─── InfluxDB Java client: write a CPU metric and query averages ──────────────
// build.gradle: com.influxdb:influxdb-client-java:7.1.0

import java.time.Instant;
import com.influxdb.client.InfluxDBClient;
import com.influxdb.client.InfluxDBClientFactory;
import com.influxdb.client.domain.WritePrecision;
import com.influxdb.client.write.Point;

// Inside a method — token is the API token (char[]); org and bucket are Strings.
InfluxDBClient client = InfluxDBClientFactory.create(
    "http://localhost:8086", token, org, bucket);

// Write a data point — a delta-encodable numeric stream (see the ⚙️ section)
client.getWriteApiBlocking().writePoint(
    Point.measurement("cpu_usage")
         .addTag("host", "web-01")
         .addField("percent", 87.3)
         .time(Instant.now(), WritePrecision.MS)
);

// Query: average CPU over the last hour using Flux
String flux = """
    from(bucket: "metrics")
      |> range(start: -1h)
      |> filter(fn: (r) => r._measurement == "cpu_usage")
      |> mean()
    """;

client.getQueryApi()
      .query(flux)
      .forEach(table -> table.getRecords()
               .forEach(r -> System.out.println("Avg CPU: " + r.getValue())));

client.close();

| Stack | Java library | Query style | Use for |
|---|---|---|---|
| Elasticsearch | spring-data-elasticsearch | findByStatusGreaterThan... / KQL | Full-text log search, ad-hoc event queries |
| InfluxDB | influxdb-client-java | Flux — range → filter → mean | Numeric time-range aggregations, alerting |
| TimescaleDB | spring-data-jpa + PostgreSQL JDBC | SQL + time_bucket() | Time-range queries with full SQL expressiveness |

For a full deep-dive on Spring Data Elasticsearch, InfluxDB, and TimescaleDB Java integration, a dedicated follow-up post is planned.


📚 Production Lessons from Running Both Systems

  • Separate early, not late. Teams that start by dumping everything into Elasticsearch regret it when they try to add alerting: Elasticsearch has no native alerting engine comparable to Prometheus Alertmanager. Separate your log and metric pipelines from the first service you instrument.
  • Delta encoding is not magic — it requires uniform intervals. If your metrics arrive irregularly (event-driven spikes), TSDB compression ratios drop significantly. Schedule regular scrapes (15s or 30s) to get the full benefit.
  • Elasticsearch mapping explosions are silent until they are catastrophic. Dynamic mappings will index every new field automatically. An application that starts logging JSON with variable keys can create thousands of unmapped fields overnight, exhausting the Elasticsearch field limit (1000 by default) and halting indexing.
  • Prometheus is pull-based; InfluxDB is push-based. This matters for firewall rules and service discovery. Prometheus needs network access to scrape each target; InfluxDB and VictoriaMetrics accept pushed writes. In containerized environments Prometheus with service discovery is often simpler; in serverless environments push-based TSDBs win.
  • Do not skip index lifecycle management. Logs without ILM grow unbounded. A busy API service generating 10 GB of logs per day will exhaust a 1 TB Elasticsearch node in 100 days. Set up ILM on day one, not when the disk is at 90%.

📌 TLDR: Summary & Key Takeaways

  • Elasticsearch is for text search; TSDBs are for numeric time series.
  • Elasticsearch uses an inverted index — fast for full-text, expensive for pure numbers.
  • TSDBs use delta encoding + compression — 10–50× smaller for regular numeric streams.
  • Use both in production: ELK for logs, Prometheus/Grafana for metrics.
  • Watch out for high-cardinality labels in Prometheus — they cause OOM crashes.
  • Set up Elasticsearch ILM and Prometheus downsampling from day one to control long-term storage costs.
