Postgres substring() function
Extract a substring from a string
The substring()
function in Postgres is used to extract a portion of a string based on specified start and end positions, or a regular expression pattern.
It's useful for data cleaning and transformation where you might need to extract relevant parts of a string. For example, when working with semi-structured data like an address, where you want to extract the zip code. Or, to extract the timestamp of events when working with machine-generated data like logs.
Function signature
The substring()
function has two forms:
string
: The input string to extract the substring from.from
(optional): The starting position for the substring (1-based index). If omitted, it defaults to 1.for
(optional): The length of the substring to extract. If omitted, the substring extends to the end of the string.
string
: The input string to extract the substring from.pattern
: A POSIX regular expression pattern. The substring matching this pattern is returned.
Example usage
Consider a table users
with a user_id
column that contains IDs in the format "user_123". We can use substring()
to extract just the numeric part of the ID.
This query extracts the substring starting from the 6th character of user_id
(1-based index) and returns it as numeric_id
.
You can also use a regular expression pattern to find and extract a substring.
This query extracts the 5-digit zip code from the address
column using the regular expression pattern \d{5}
, which matches exactly 5 consecutive digits.
Advanced examples
Extract a substring of a specific length
You can specify both the starting position and the length of the substring to extract.
This query extracts the timestamp portion from the log_entry
column. It assumes that the timestamp always appears at the beginning of the log entry and has a fixed length of 23 characters
Extract a substring matching a regex pattern with capture groups
The substring()
function extracts the first part of the string that matches the regular expression pattern. However, if the pattern contains capture groups (specified using parentheses), it returns the substring matched by the first parenthesized subexpression.
This query extracts the order number and order amount from the order_info
column using regular expressions with capture groups.
- The pattern
Order #(\d+)
matches the string "Order #" followed by one or more digits. The parentheses around\d+
create a capture group that extracts just the order number. - The pattern
\$(\d+\.\d+)
matches a dollar sign followed by a decimal number. The parentheses around\d+\.\d+
create a capture group that extracts just the order amount.
substring()
in a WHERE
clause
Use You can use substring()
in a WHERE
clause to filter rows based on a substring condition.
This query selects all rows from the users
table where the email address has the domain name example
. The regular expression pattern .*@(.*)\.
extracts the domain part of the email address.
Additional considerations
Performance implications
When working with large datasets, using substring()
in a WHERE
clause may impact query performance since it requires scanning the entire string column to extract substrings and compare them.
If you frequently filter based on substrings, consider creating a functional index on the relevant column using the substring expression, to improve query performance.
Alternative functions
left
- Extracts the specified number of characters from the start of a string.right
- Extracts the specified number of characters from the end of a string.split_part
- Splits a string on the specified delimiter and returns the nth substring.regexp_match
- Extracts the first substring matching a regular expression pattern. Unlikesubstring()
, it returns an array of all the captured substrings when the regex pattern contains multiple parentheses.