Postgres can do that? No surprise Postgres is everywhere


PostgreSQL is widely regarded as one of the most popular SQL databases and is often the preferred choice for projects involving Online Transaction Processing systems. However, its versatility extends far beyond that, enabling it to effectively manage less common SQL use cases as well as workflows that don’t rely on SQL. In this blog post, we’ll explore various scenarios where PostgreSQL excels and provide insights into how to leverage it in these situations.

Not Only SQL

The past decade has seen dramatic changes in the tech landscape. On one side, we’ve made significant strides in OLAP solutions by leveraging big data, data lakes, parallel processing frameworks like Spark, and low-code tools for business intelligence. On the other side, OLTP transactions have grown increasingly complex, with data becoming less relational and hosting infrastructures evolving significantly.

Nowadays, it’s common to use a single database for both OLTP and OLAP tasks, a strategy known as Hybrid Transactional/Analytical Processing (HTAP). This approach can eliminate the need to transfer data between databases, saving time and allowing for more frequent execution of complex queries (e.g., every 15 minutes). However, running both workflows in the same database can place a heavy load on it, as OLAP transactions might lock tables for extended periods, slowing down OLTP operations.

Another significant shift involves the type of data we handle. Increasingly, we’re dealing with non-relational formats like JSON or XML, machine learning data such as embeddings, spatial data, or time series data. The data landscape is often non-SQL.

Additionally, our tasks have evolved. We’re no longer just calculating aggregates; now we’re identifying similar documents, training large language models, and processing millions of metrics from IoT devices.

Thankfully, PostgreSQL’s extensibility makes it well-suited for these evolving workflows. Whether handling traditional relational data or more complex structures, PostgreSQL offers a wide range of extensions to boost performance. Let’s dive into these workflows and see how PostgreSQL can help.

Full-Text Search

Full-text search (FTS) scans every word within a document to match it against a query. It goes beyond merely finding documents that contain a specific phrase: it also accounts for variations such as similar phrases, typographical errors, patterns, wildcards, and synonyms. This makes FTS more challenging to execute, as each query becomes more intricate and may produce more false positives. Additionally, instead of simply scanning each document at query time, the data set needs to be transformed and pre-processed into search structures that are then utilized during the search.

Limitations of Postgres in FTS

PostgreSQL supports FTS through the tsvector data type and the pg_trgm extension. The tsvector type lets us extract lexemes from documents and build GiST or GIN indexes on them. The pg_trgm extension lets us extract trigrams from the text and perform fuzzy-match searches on them.
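For illustration, here’s a minimal sketch of both approaches, assuming a hypothetical articles(title, body) table:

    -- tsvector: extract lexemes into a generated column and index it with GIN
    ALTER TABLE articles
      ADD COLUMN search_vector tsvector
      GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
      ) STORED;

    CREATE INDEX articles_fts_idx ON articles USING GIN (search_vector);

    SELECT title
    FROM articles
    WHERE search_vector @@ to_tsquery('english', 'postgres & search');

    -- pg_trgm: index trigrams for fuzzy matching
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX articles_title_trgm_idx ON articles USING GIN (title gin_trgm_ops);

    -- % is the trigram similarity operator; it tolerates small typos
    SELECT title FROM articles WHERE title % 'postgress';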

While this works, the built-in Postgres mechanisms may not be enough. As various benchmarks show, building an index on a big (100+ GB) corpus can take more than 24 hours and can easily fail with out-of-memory errors.

Another limitation is that these solutions do not support semantic search. We can easily search for specific words, but we can’t search for synonyms. Similarly, it’s harder to find texts that contain one word but not another. Fortunately, all these drawbacks can be overcome.

pg_search For Better FTS

Even though Postgres supports FTS natively, other extensions do the job even better. For example, pg_search, a component of ParadeDB, can be 20 times faster than the tsvector solution. This extension is built on the Okapi BM25 algorithm, widely used by search engines to assess the relevance of documents. It utilizes the Inverse Document Frequency (IDF) formula, which applies probabilistic methods to identify matches.

pg_search:

  • Is built natively in Postgres, with no dependencies on external search services

  • Uses a search engine design similar to Apache Lucene

  • Supports fuzzy search, aggregations, relevance tuning, and highlighting

  • Uses the same scoring algorithm (BM25) as Elasticsearch

  • Doesn’t require manual reindexing

This makes Postgres very competitive with Elasticsearch.
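As a sketch of what this looks like in practice (the exact API evolves between ParadeDB releases, so treat the hypothetical mock_items table and the syntax below as illustrative and check the current docs):

    CREATE EXTENSION IF NOT EXISTS pg_search;

    -- Build a BM25 index over the columns we want to search
    CREATE INDEX search_idx ON mock_items
    USING bm25 (id, description)
    WITH (key_field = 'id');

    -- The @@@ operator runs a BM25-ranked full-text query
    SELECT id, description
    FROM mock_items
    WHERE description @@@ 'running shoes'
    LIMIT 5;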

Analytics

OLAP (Online Analytical Processing) is a category of software technology that enables users to perform complex analytical queries on large amounts of data. OLAP is primarily used for data mining, business reporting, and analysis, allowing users to gain insights from their data in ways that support decision-making.

OLAP systems are designed for queries that involve aggregation and computation over large datasets, typically involving historical data rather than real-time information. These systems are optimized for read-heavy operations and are often used in scenarios where users want to analyze trends, identify patterns, or track key performance indicators (KPIs).

Limitations of Postgres in OLAP

Postgres focuses on OLTP workloads. It supports locks, isolation levels, and concurrent access. However, while these features greatly enhance OLTP solutions, they are detrimental to OLAP queries. This is because:

  • Normalized tables force analytical queries through complex, expensive joins

  • Isolation levels and locks degrade the performance of long-running queries

  • OLTP relies on cached hot rows, while OLAP’s large scans evict those caches

  • Postgres must enforce row-level security and other per-row checks that slow down OLAP queries

As benchmarks show, Postgres is very versatile but can be orders of magnitude slower than specialized OLAP solutions; in one benchmark on video data, ClickHouse is as much as 20 times faster. Fortunately, Postgres has many extensions built specifically for OLAP.

How DuckDB Improves OLAP in Postgres

One solution that supercharges OLAP in Postgres is ParadeDB’s pg_lakehouse extension, which runs queries through DuckDB, an in-process OLAP database. Thanks to its columnar storage, DuckDB is highly optimized and can outperform native PostgreSQL by orders of magnitude: ParadeDB claims a 94x speedup, and it can be even more depending on your use case.

DuckDB is an example of a vectorized query processing engine that processes data in chunks small enough to fit into cache memory, helping to avoid the performance hit from costly I/O operations. Additionally, by utilizing columnar storage, DuckDB can leverage SIMD (Single Instruction, Multiple Data) instructions to execute the same operation across multiple values simultaneously, further boosting performance.
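To get a feel for the engine pg_lakehouse embeds, here is plain DuckDB SQL aggregating a Parquet file directly (events.parquet is a hypothetical file):

    -- DuckDB scans only the referenced columns thanks to columnar storage
    SELECT user_id,
           count(*)    AS events,
           avg(amount) AS avg_amount
    FROM read_parquet('events.parquet')
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10;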

Hydra To The Rescue

In addition to columnar storage, we can also refine how OLAP results are computed. OLAP workloads frequently calculate aggregates such as means and medians over large datasets. When the underlying data changes, we don’t have to recompute these values from scratch; we can look for strategies that update them incrementally instead.

Hydra enhances aggregate query processing through pg_ivm, which provides incrementally maintained materialized views for storing computed analytical results. Instead of recalculating a view from scratch whenever the base tables change, only the affected rows are updated, so queries against precomputed aggregates avoid redundant computation. You can see how much it improves performance in this benchmark.
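A minimal sketch of pg_ivm’s API, assuming a hypothetical sales(region, amount) table:

    CREATE EXTENSION IF NOT EXISTS pg_ivm;

    -- create_immv defines an incrementally maintained materialized view
    SELECT create_immv(
      'sales_by_region',
      'SELECT region, SUM(amount) AS total FROM sales GROUP BY region'
    );

    -- Later writes update the view incrementally; no full REFRESH needed
    INSERT INTO sales VALUES ('emea', 100);
    SELECT * FROM sales_by_region;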

Time Series

With the proliferation of Internet-of-Things (IoT) devices in our lives, new challenges emerge. One of them is efficiently handling the colossal volumes of sensor signals generated every day.

The data coming from these sensors is known as a time series: a chronological sequence of measurements, real-valued or discrete, recorded at regular intervals, for instance, temperature readings within one’s home taken every minute. These data points are useful not only for immediate monitoring but also for predictive analytics, such as forecasting future states based on past trends or recognizing anomalies and taking corrective measures. Time series appear across diverse domains wherever values, continuous or discrete, are recorded with time as a significant component of the dataset’s structure.

How Postgres Supports Time Series

Postgres supports arrays that can be used for time series analytics. Arrays support transparent compression and can be handled much like regular tables. Postgres also provides the generate_series function for generating and handling such data.
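For example, generate_series can produce one row per minute, which is handy for building or densifying time series (the random() temperature below is just a stand-in for real sensor values):

    SELECT ts, random() * 10 + 15 AS temperature
    FROM generate_series(
      now() - interval '1 hour',
      now(),
      interval '1 minute'
    ) AS ts;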

While arrays can be queried much like regular tables, they lack dedicated support for the operations we often need with time series. Fortunately, there are other Postgres extensions for that.

How Timescale Improves Time Series in Postgres

Timescale is an extension for PostgreSQL that transforms it into an efficient time series database. It optimizes time series processing with features like intelligent chunking, hypertables, and incrementally maintained materialized views with aggregates. Hypertables automatically partition the data by time, but to the user a hypertable still appears as a regular table.

By partitioning data based on time, Timescale significantly enhances query performance by avoiding unnecessary scans across the entire table. It also offers continuous aggregates, allowing real-time updates of aggregate calculations as new data comes in. Additionally, Timescale provides specialized hyperfunctions designed for time series analysis, enabling easy data bucketing, aggregation, windowing, and pipeline creation for streamlined processing. This can bring as much as a 100x performance improvement.
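A minimal sketch, assuming the timescaledb extension is installed and a hypothetical stream of device readings:

    CREATE TABLE conditions (
      time        timestamptz NOT NULL,
      device_id   text,
      temperature double precision
    );

    -- Turn the table into a hypertable, automatically partitioned by time
    SELECT create_hypertable('conditions', 'time');

    -- time_bucket is a hyperfunction for grouping readings into fixed windows
    SELECT time_bucket('15 minutes', time) AS bucket,
           device_id,
           avg(temperature) AS avg_temp
    FROM conditions
    GROUP BY bucket, device_id
    ORDER BY bucket;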

Timescale has many other features that improve performance and reduce costs. For instance, it can automatically move old, rarely accessed data to S3 to save on storage.

Document Store

A document database is a type of NoSQL database that stores, retrieves, and manages data as documents, typically in formats like JSON, BSON (binary JSON), or XML. Each document represents a single data entity and contains fields and values, which can be a wide range of types, including strings, numbers, arrays, and nested documents.

Unlike relational databases, where data is structured in tables with rows and columns, document databases allow for a more flexible schema. Each document can have a different structure, making it well-suited for use cases where data models may evolve or have varying attributes. This flexibility makes document databases ideal for managing semi-structured or unstructured data, such as content management systems, user profiles, product catalogs, and more.

How Postgres Handles Documents

Postgres supports multiple non-SQL types natively. It can deal with XML, JSON (including its binary jsonb form), spatial data, vectors, time ranges, and much more. It also provides hstore for storing arbitrary key-value pairs. This makes Postgres capable of processing almost any document.
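For example, jsonb columns can be indexed and queried by structure (the profiles table below is illustrative):

    CREATE TABLE profiles (
      id   serial PRIMARY KEY,
      data jsonb
    );

    INSERT INTO profiles (data)
    VALUES ('{"name": "Ada", "tags": ["admin", "ops"], "address": {"city": "London"}}');

    -- A GIN index accelerates containment queries
    CREATE INDEX profiles_data_idx ON profiles USING GIN (data);

    -- Find every profile tagged "admin"
    SELECT data->>'name' FROM profiles WHERE data @> '{"tags": ["admin"]}';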

If that’s not enough, Postgres can even be made compatible with MongoDB.

FerretDB Turns Postgres Into Mongo

FerretDB is an open-source database designed as a lightweight alternative to MongoDB. It acts as a translation layer between MongoDB queries and a traditional SQL database, such as PostgreSQL. This allows users to interact with a document database API (like MongoDB) while storing the data in PostgreSQL. In essence, FerretDB combines the strengths of MongoDB's developer-friendly API with the reliability and features of SQL databases, offering a flexible solution for users who prefer or need to work within the open-source ecosystem. You can see some benchmarks in this article.

Artificial Intelligence

We aim to store various types of data in an SQL database, but there isn't a direct method for storing complex items like movies, songs, actors, PDFs, images, or videos. This makes it difficult to identify similarities or perform tasks like clustering or finding neighboring objects. To enable such comparisons, we need to convert these objects into numerical representations—essentially a list of numbers (a vector or embedding) that captures key characteristics of the object. For example, a movie could be represented by traits such as its star rating, runtime in minutes, number of actors, or the number of songs used in it. Let’s see how Postgres supports that.

Vectors and Embeddings in Postgres

PostgreSQL enables the use of embeddings through the pgvector extension. This extension introduces a new column type and operators specifically designed for storing and processing embeddings. With pgvector, we can perform element-wise addition and other arithmetic operations, calculate Euclidean or cosine distances between vectors, compute inner products, and determine the Euclidean norm. It also supports a variety of additional operations for handling vectors.
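A minimal sketch of pgvector in action (three-dimensional vectors keep the example short; real embeddings typically have hundreds or thousands of dimensions):

    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE items (
      id        bigserial PRIMARY KEY,
      embedding vector(3)
    );

    INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

    -- <=> is cosine distance; <-> is Euclidean and <#> is negative inner product
    SELECT id, embedding <=> '[3,1,2]' AS cosine_distance
    FROM items
    ORDER BY embedding <=> '[3,1,2]'
    LIMIT 5;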

pgai To Build RAG In Postgres

If pgvector is not enough, pgai streamlines the development of search and Retrieval Augmented Generation (RAG) AI applications directly within PostgreSQL. With pgai, embedding and AI generation models are more tightly integrated into the database. This allows you to perform several tasks directly through SQL queries, including:

  • Creating vector embeddings for your data.

  • Retrieving LLM chat completions from models such as Claude 3.5 Sonnet, OpenAI GPT-4, Cohere Command, and Llama 3 (via Ollama).

  • Reasoning over your data, enabling use cases like classification, summarization, and data enrichment on your existing relational data in PostgreSQL.
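As a hedged sketch (pgai’s function names and signatures change between releases, an OpenAI API key must be configured, and the articles table is the hypothetical one from the FTS example, so verify against the current pgai docs):

    CREATE EXTENSION IF NOT EXISTS ai CASCADE;

    -- Create a vector embedding for a piece of text from SQL
    SELECT ai.openai_embed('text-embedding-3-small', 'PostgreSQL is everywhere');

    -- Ask a chat model to summarize a row, straight from SQL
    SELECT ai.openai_chat_complete(
      'gpt-4o',
      jsonb_build_array(
        jsonb_build_object('role', 'user', 'content', 'Summarize: ' || body)
      )
    )
    FROM articles
    LIMIT 1;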

You can learn more in this article comparing various solutions.

Summary

PostgreSQL is one of the most widely used SQL databases, but it's more than just an SQL engine. With a wide range of extensions, it can now manage non-relational data, perform full-text searches, support analytical workflows, handle time series data, and much more. The traditional distinction between OLAP and OLTP is no longer necessary, as PostgreSQL enables HTAP workflows to be executed within a single database. This versatility makes PostgreSQL capable of meeting a broad range of data processing needs.