Aggregation operations in MongoDB involve processing data records or documents and producing computed results. The operations imply collecting values from multiple documents, grouping them, and performing various computations on the grouped data afterward. This process is comparable to the aggregate function in SQL but has some peculiarities.
Aggregation pipeline
The aggregation pipeline in MongoDB is made up of stages, each one of which modifies the document in some way. Each stage performs a specific operation on the input documents and passes the transformed output to the next stage.
The main components of a pipeline are listed below:
stages – process the input documents.
expressions – manipulate the data within each stage.
accumulators – perform calculations on groups of documents.
Below you can see a schematic representation of an aggregation pipeline:
Input Documents -> [Stage 1 -> Expression 1 -> Accumulator 1] -> [Stage 2 -> Expression 2 -> Accumulator 2] -> ... -> [Stage N -> Expression N -> Accumulator N] -> Output Documents
Pipeline stages
In the MongoDB aggregation pipeline, each stage performs a specific operation on the input documents and passes the transformed output to the next stage. The pipeline includes stages for filtering data-like queries, transforming documents, grouping, and sorting them.
Please see below a table with the most frequent stages of the aggregation pipeline:
| Stage | Description |
|---|---|
$match |
Filters the input data based on specified conditions. |
$group |
Groups the filtered data based on a specified key and performs calculations on the grouped data. |
$project |
Selects a subset of fields from the input documents and outputs them as the result of the pipeline. |
$sort |
Sorts the output of the pipeline based on a specified field or fields. |
$limit |
Limits the number of documents returned by the pipeline. |
$skip |
Skips a specified number of documents in the output of the pipeline. |
$unwind |
Deconstructs an array field from the input documents and outputs a separate document for each element in the array. |
$facet |
Allows to compute multiple aggregations on the same set of documents. |
Pipeline accumulators
Accumulators in the MongoDB aggregation pipeline are used to perform calculations on groups of documents. They can be used in stages such as $group, $project, and $facet. Accumulators provide a way to compute different metrics, such as the sum, average, maximum, and minimum of a particular field across a group of documents.
Please find below a table with common accumulators:
| Accumulator | Description |
|---|---|
$sum |
Adds up numeric values for each group of documents. |
$avg |
Calculates the average value of numeric values for each group of documents. |
$max |
Returns the highest value of a field for each group of documents. |
$min |
Returns the lowest value of a field for each group of documents. |
$push |
Returns an array of values for each group of documents. |
$addToSet |
Returns a set of unique values for each group of documents. |
$first |
Returns the first value in a group of documents. |
$last |
Returns the last value in a group of documents. |
Aggregation pipeline examples
The syntax of the aggregation request is the following:
db.collection.aggregate([pipeline])
Let's look in detail at what could be inside the pipeline section:
Suppose we have a collection called orders that contains documents with the following fields: order_id, customer_id, order_date, order_total. We want to analyze the data in this collection to answer the following questions:
- What is the total revenue generated by each customer?
- What is the average order value for each customer?
- Who are the top 5 customers by revenue?
To answer these questions, we can use the following aggregation pipeline:
db.orders.aggregate([
{
$group: {
_id: "$customer_id",
total_revenue: { $sum: "$order_total" },
order_count: { $sum: 1 },
average_order_value: { $avg: "$order_total" }
}
},
{
$sort: {
total_revenue: -1
}
},
{
$limit: 5
},
{
$project: {
_id: 0,
customer_id: "$_id",
total_revenue: 1,
order_count: 1,
average_order_value: 1
}
}
])
This pipeline consists of the following stages:
- $group: This stage groups the orders collection by "customer_id" and calculates the total revenue, order count, and average order value for each customer.
- $sort: This stage sorts the output by total revenue in descending order.
- $limit: This stage limits the output to the top 5 customers by revenue.
- $project: This stage projects only the fields we want in the output and renames "_id" to "customer_id".
The output of this pipeline will be a list of the top 5 customers by revenue, with their total revenue, order count, and average order value.
Elaborate aggregation pipeline example
This time we are going to see an example with different stages and again we will make a pipeline, based on the collection orders .
The given collection contains documents representing orders made by customers. Each document has fields like customer_name, order_date, products, and total_price.
We want to generate a report showing the top 3 customers by the total amount spent on orders, as well as the top 3 most frequently ordered products in the year 2022.
Here's the pipeline we can use:
db.orders.aggregate([
{
$match: {
order_date: { $gte: ISODate("2022-01-01"), $lte: ISODate("2022-12-31") }
}
},
{
$unwind: "$products"
},
{
$skip: 5
},
{
$facet: {
top_customers: [
{
$group: {
_id: "$customer_name",
total_spent: { $sum: "$total_price" }
}
},
{
$sort: { total_spent: -1 }
},
{
$limit: 3
}
],
top_products: [
{
$group: {
_id: "$products.name",
total_ordered: { $sum: "$products.quantity" }
}
},
{
$sort: { total_ordered: -1 }
},
{
$limit: 3
}
]
}
}
])
Here's what each stage in the pipeline does:
$matchstage: filters the documents to only include orders made in the year 2022.$unwindstage: deconstructs the "products" array field so that each product in an order gets its own document.$skipstage: skips the first 5 documents (i.e. the 5 oldest orders).$facetstage: divides the pipeline into multiple sub-pipelines, each of which generates a separate result set.top_customerssub-pipeline: groups the orders by customer name, calculates the total amount spent by each customer, sorts the results by the total amount spent in descending order, and limits the results to the top 3.top_productssub-pipeline: groups the orders by product name, calculates the total quantity ordered for each product, sorts the results by the total quantity ordered in descending order, and limits the results to the top 3.
Conclusion
In this topic, we covered aggregation operations in MongoDB, which are implied for processing and analyzing data in collections of documents. Pipeline construction, consisting of stages, expressions, and accumulators, allows us to group and manipulate data in a variety of ways. The most common stags include $match, $group, $project, $sort and $limit. Accumulators like $sum, $avg and $count allow for calculations on groups of documents. Aggregation pipeline examples demonstrated how to use the pipeline to answer questions and generate reports. Overall, the aggregation pipeline is a valuable feature of MongoDB that provides us with flexibility and efficiency in working with data. Now, when we learned something new on aggregation – it's time to practice!