Sizing Streaming runners🔗

This guide provides best practices for:

  • Streaming runner sizing: Determining the optimal resources for Streaming runners based on pipeline complexity, table and column count, and data volume.
  • Streaming runner balancing: Distributing workloads effectively across multiple Streaming runners to enhance performance and avoid bottlenecks.
  • Scaling for growth: Reviewing Streaming runner distribution periodically, and adjusting assignments to ensure your system scales to meet increasing demands.

Following these recommendations will help ensure efficient, scalable, and reliable streaming pipelines, even as data demands grow.


Streaming runner sizing🔗

Correctly sizing Streaming runners is essential to ensure smooth performance of streaming pipelines. When a Streaming runner has the appropriate resources, it can handle data changes efficiently, minimising latency and avoiding resource bottlenecks.

When determining the optimal Streaming runner size, consider the following factors:

  • Table count and data volume. Streaming runners managing high numbers of tables or high-volume data streams need greater capacity to prevent performance issues.
  • Data change frequency. Higher change rates increase Streaming runner workload, so pipelines with frequent updates may need more robust sizing.
  • Cloud platform and proximity to source. AWS ECS offers high performance when the Streaming runner is co-located with the database in the same region and account. Configurations further from the source database may experience slower throughput and higher latency.

General sizing recommendations🔗

To help size Streaming runners effectively, use these general guidelines.

| Configuration type | Number of tables | Snapshot volume (records) | Changes per second | CPU | Memory |
|---|---|---|---|---|---|
| Small workloads | 100 | 100,000 – 100 million | < 5,000 | 1 | 2 GB |
| Medium workloads | 200 | 100,000 – 10 billion | 5,000 – 10,000 | 2 | 4 GB |
| Large workloads | 400 | 100,000 – 10 billion | 10,000 – 20,000 | 4 | 8 GB |
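
The guideline tiers above can be sketched as a simple lookup. The `pick_runner_size` helper, the tier names, and the exact thresholds below are illustrative assumptions for this sketch, not a product API:

```python
# Illustrative sizing helper based on the guideline table above.
# The RunnerSize names and tier thresholds are assumptions for this
# sketch, not a product API.
from dataclasses import dataclass


@dataclass
class RunnerSize:
    label: str
    cpu: int
    memory_gb: int


# (max tables, max changes/sec, guideline size), smallest tier first.
SIZES = [
    (100, 5_000, RunnerSize("Small", 1, 2)),
    (200, 10_000, RunnerSize("Medium", 2, 4)),
    (400, 20_000, RunnerSize("Large", 4, 8)),
]


def pick_runner_size(table_count: int, changes_per_sec: int) -> RunnerSize:
    """Return the smallest guideline tier that covers both dimensions."""
    for max_tables, max_cps, size in SIZES:
        if table_count <= max_tables and changes_per_sec <= max_cps:
            return size
    # Beyond the largest single-runner guideline: split the workload
    # across runners instead (see Streaming runner balancing below).
    raise ValueError("Workload exceeds single-runner guidelines")


print(pick_runner_size(150, 7_000).label)  # → Medium
```

A workload of 150 tables at 7,000 changes per second exceeds the Small tier on both dimensions, so the helper lands on Medium (2 CPU, 4 GB).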

In Azure ACI and other remote configurations, smaller setups may experience a CPU bottleneck during streaming, leading to lower throughput.

Some configurations require nuanced resourcing adjustments. For example, if there is a high table count with a low record count per table, memory requirements may need to be adjusted to avoid over-allocation:

  • High table count, low record count per table: Use lower memory if snapshot volume is < 100,000 records.
  • High volume, top-five streaming distribution: For scenarios where the top five tables by change frequency drive most updates, consider targeted resource allocation.

| Outlier configuration | Number of tables | Snapshot volume (records) | Changes per second | CPU | Memory |
|---|---|---|---|---|---|
| High table count, low snapshot volume | > 500 | < 100,000 | 5,000 – 10,000 | 2 | 4 GB – 8 GB |
| Targeted streaming (top 5 tables) | > 500 | 100,000 – 10 billion | 5,000 – 10,000 | 2 | 4 GB |

Streaming runner balancing🔗

For high-volume or complex streaming workflows, distributing workloads across multiple Streaming runners can improve performance, reduce latency, and prevent resource strain. Streaming runner balancing can help to ensure that data moves efficiently, even in environments with large table counts or high data-change frequency.

Best practices for Streaming runner balancing🔗

  • Separate high-volume tables: Distribute high-transaction or high-volume tables across different Streaming runners to avoid bottlenecks.
  • Group by data source or table type: Where feasible, group similar tables or data sources on the same Streaming runner. This approach can help streamline data flow and reduce the load on individual Streaming runners.
  • Monitor for load imbalance: Regularly monitor Streaming runner performance to spot any overloaded Streaming runners. Adjust assignments as needed to keep data moving smoothly across all Streaming runners. Note that remote Streaming runners may show more CPU and RAM headroom, but with slower throughput due to latency.
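
The separation and distribution advice above can be sketched as a greedy assignment: place each table, heaviest first by change rate, on the currently least-loaded runner. The table names, rates, and `balance_tables` helper below are illustrative, not a product API:

```python
# Sketch of greedy Streaming runner balancing: assign each table
# (heaviest first, by changes/sec) to the least-loaded runner so that
# high-volume tables end up spread across runners.
from heapq import heappop, heappush


def balance_tables(tables: dict[str, int], runner_count: int) -> list[list[str]]:
    """tables maps table name -> changes/sec; returns one table list per runner."""
    # Min-heap of (total load, runner index): we always top up the lightest runner.
    heap = [(0, i) for i in range(runner_count)]
    assignments: list[list[str]] = [[] for _ in range(runner_count)]
    for name, rate in sorted(tables.items(), key=lambda kv: -kv[1]):
        load, idx = heappop(heap)
        assignments[idx].append(name)
        heappush(heap, (load + rate, idx))
    return assignments


# Illustrative workload: two hot tables and two quiet ones, two runners.
tables = {"orders": 8_000, "events": 6_000, "users": 500, "audit": 400}
print(balance_tables(tables, 2))  # → [['orders'], ['events', 'users', 'audit']]
```

Note how the two high-volume tables land on different runners, while the quiet tables pile onto the lighter one, keeping total load roughly even.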

Scaling for growth🔗

As data demands increase, it may be necessary to scale Streaming runners to maintain performance.

Signs that scaling is needed🔗

  • High CPU or memory utilisation: Persistent high utilisation on a Streaming runner can indicate it's nearing capacity.
  • Event lag: Delays in data changes reaching target systems suggest a Streaming runner may be overloaded. If lag persists, evaluate scaling options.
  • Increased data volume or change frequency: If new tables or higher change frequencies are added to your pipeline, scaling is often necessary to prevent performance impacts.
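
The signs above can be turned into a simple periodic check against runner metrics. The metric names and thresholds in this sketch are assumptions, not product defaults:

```python
# Illustrative scale-out check against Streaming runner metrics.
# The thresholds (80% utilisation, 60 s lag) are assumptions for this
# sketch; tune them to your own environment.
def needs_scaling(cpu_pct: float, mem_pct: float, lag_seconds: float,
                  cpu_limit: float = 80.0, mem_limit: float = 80.0,
                  lag_limit: float = 60.0) -> list[str]:
    """Return the reasons, if any, why a Streaming runner may need scaling."""
    reasons = []
    if cpu_pct >= cpu_limit:
        reasons.append("high CPU utilisation")
    if mem_pct >= mem_limit:
        reasons.append("high memory utilisation")
    if lag_seconds >= lag_limit:
        reasons.append("persistent event lag")
    return reasons


print(needs_scaling(cpu_pct=92.0, mem_pct=40.0, lag_seconds=120.0))
```

An empty list means no action; one or more reasons suggest evaluating the scaling approaches below.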

Approaches to scaling🔗

  • Horizontal scaling (Streaming runner balancing): For workloads with high table counts or diverse data sources, consider adding additional Streaming runners to distribute the load. Horizontal scaling can be particularly effective for spreading out high-frequency tables and improving throughput.
  • Vertical scaling (Streaming runner sizing): If adding Streaming runners isn't an option or the workload isn't complex enough to split, you can increase CPU, memory, or storage on existing Streaming runners to handle higher demands.

Additional considerations🔗

  • Column count: Anecdotally, 10,000 columns across all tables is considered a soft limit before potential performance degradation. If your pipeline approaches this limit, monitor closely for signs of latency or resource strain.
  • Table distribution and change frequency:

    • Consider the distribution of table sizes across your workload, as highly uneven distributions can impact performance differently than even loads.
    • Workloads with tables that change frequently (i.e. result in continuous streaming updates) require more resources. It's helpful to monitor both the frequency of changes per table and the overall rate of source changes to determine if adjustments in sizing or balancing are necessary.
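
One way to monitor per-table change frequency against the overall rate, and to spot the "targeted streaming (top 5 tables)" outlier pattern from the sizing tables above, is a simple concentration metric. The table names, rates, and threshold below are illustrative:

```python
# Illustrative check for the "targeted streaming" pattern: do the top
# five tables by change rate account for most source changes?
def top5_share(changes_per_table: dict[str, int]) -> float:
    """Fraction of total changes/sec produced by the five busiest tables."""
    rates = sorted(changes_per_table.values(), reverse=True)
    total = sum(rates)
    return sum(rates[:5]) / total if total else 0.0


# Illustrative per-table change rates (changes/sec).
rates = {"t1": 4_000, "t2": 2_000, "t3": 1_000, "t4": 500,
         "t5": 300, "t6": 100, "t7": 100}
share = top5_share(rates)
print(f"top-5 share: {share:.0%}")
if share > 0.9:  # threshold is an assumption; tune to your workload
    print("consider targeted resource allocation for the top tables")
```

A high share suggests the targeted-streaming outlier sizing applies; a flat distribution points back toward the general guidelines and balancing across runners.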