If your Python program runs slower than you expected, multiprocessing may be the solution. In this topic, we will explore the fundamentals of multiprocessing in Python, focusing on accelerating CPU-intensive tasks. We will introduce the concept of a process and discuss how to manage it. We will also examine the different process start methods, specifically the spawn and fork techniques. Lastly, we will look at inter-process communication tools such as Queue and Pipe.
The multiprocessing Process class
Python's multiprocessing library is a lifesaver when dealing with tasks that consume a lot of CPU time. The cornerstone of this library is the Process class. Think of a Process as an independent unit of work that runs separately from the main program. Each Process has its own Python interpreter and memory space.
Before diving into the benefits of multiprocessing, let's first look at a code example that performs two different, calculation-heavy tasks without multiprocessing.
```python
import os

def calculate_square(numbers):
    result = 0
    for n in numbers:
        result += n * n
    print(f"Sum of squares: {result}")

def calculate_cube(numbers):
    result = 0
    for n in numbers:
        result += n * n * n
    print(f"Sum of cubes: {result}")

if __name__ == "__main__":
    numbers = [1, 2, 3, 4]

    # Show the Process ID (PID) of the main program
    print(f"Main program PID: {os.getpid()}")

    # Perform calculations
    calculate_square(numbers)
    calculate_cube(numbers)
```
This code calculates the sum of squares and the sum of cubes for a list of numbers. It completes the sum of squares before proceeding with the sum of cubes. Remember that the required time would significantly increase for larger or more complex calculations.
The displayed Process ID (PID) belongs to the main program. In a single-process program like this, both functions run one after the other inside that one process, sharing its PID. This is where multiprocessing becomes beneficial: it lets you run the tasks in parallel, each in its own process with its own PID, making your program more efficient.
The following section will demonstrate how two separate processes can make these calculations faster.
Different methods of starting
Python offers two main strategies for starting a new process: fork and spawn. These methods differ in how the new process is created and which resources it shares with the parent, so understanding the nuances can help you pick the right approach for your project.
The fork method creates a new process by duplicating the current process. The parent and child processes continue to execute from the point where the fork was called. This means that the child process inherits the memory state and resources of the parent process.
Since the child process starts with a copy of the parent's memory, the fork method is generally faster to start up and avoids re-importing your modules. The child can also read the parent's variables and state directly, which simplifies sharing data at startup. However, because the child inherits the parent's resources (open files, locks, and so on), fork can occasionally cause subtle issues, such as a lock held at fork time remaining locked in the child. Note also that fork is only available on POSIX systems; it does not exist on Windows.
Here's how you can modify the previous code to use multiprocessing with the fork method:
```python
from multiprocessing import Process
import os

def calculate_square(numbers):
    print(f"Square process PID: {os.getpid()}")
    result = 0
    for n in numbers:
        result += n * n
    print(f"Sum of squares: {result}")

def calculate_cube(numbers):
    print(f"Cube process PID: {os.getpid()}")
    result = 0
    for n in numbers:
        result += n * n * n
    print(f"Sum of cubes: {result}")

if __name__ == "__main__":
    numbers = [1, 2, 3, 4]

    # Create two processes
    p1 = Process(target=calculate_square, args=(numbers,))
    p2 = Process(target=calculate_cube, args=(numbers,))

    # Start the processes
    p1.start()
    p2.start()

    # Wait for both processes to finish
    p1.join()
    p2.join()

    print(f"Main program PID: {os.getpid()}")
```
In the code example above, we use Python's multiprocessing library to create two separate processes: one for calculating the sum of squares (calculate_square) and another for calculating the sum of cubes (calculate_cube).
Each process runs independently and has its own Process ID (PID), which we print during the calculation. This lets you confirm that two different processes are indeed running the calculations.
The Process class from the multiprocessing library makes it easy to create these separate processes. We define the function each process will run (target) and any arguments it needs (args). After defining the processes, we start them with start() and wait for them to complete with join().
The spawn method creates an entirely new Python interpreter for each child process. This ensures that each process is fully independent and starts with a clean state, which is useful when you want to guarantee your processes do not share data or state. Spawn is the default start method on Windows and macOS.
Here's how you can modify the previous code to use multiprocessing with the spawn method:
```python
from multiprocessing import Process, set_start_method

if __name__ == "__main__":
    # Use the spawn method for starting new processes
    set_start_method('spawn')

    # Rest of your code remains the same
    # ...
```
In this modified code, we call set_start_method('spawn') to specify the spawn method for starting new processes. Note that set_start_method() should be called at most once in a program, and only inside the if __name__ == "__main__" block.
With multiprocessing, both calculations run simultaneously, making the program more efficient. This is especially beneficial for larger numbers or more complex calculations, where the time required would be significantly higher if run sequentially.
So, when to fork and when to spawn? Fork is likely the better choice if you need the child to start from the parent's state or data. Spawn is your go-to option if you want fully independent processes, or if your code must run on Windows, where fork is unavailable.
Interprocess communication
Inter-process communication is often a necessary part of multiprocessing. You might need to pass data between processes or even send a signal to let another process know that a task is complete. Two popular ways to handle this in Python's multiprocessing module are through Queue and Pipe.
A Queue is a simple way to send and receive messages between processes. It's a First-In-First-Out (FIFO) data structure, meaning the first item you put in is the first to come out. It's thread-safe and process-safe, making it an excellent choice for most use cases.
Suppose you have a list of numbers and want to calculate the sum using multiple processes.
```python
from multiprocessing import Process, Queue

def calc_sum(numbers, q):
    q.put(sum(numbers))

if __name__ == "__main__":
    numbers = [0, 1, 2, 3, 4]
    q = Queue()

    p = Process(target=calc_sum, args=(numbers, q))
    p.start()

    # Retrieve the result before joining: joining a process that still
    # has buffered items on the queue can deadlock for larger payloads
    result = q.get()
    p.join()

    print(f"The sum is: {result}")
```
In this example, a child process calculates the sum of a list of numbers and puts the result into the queue. The parent process then retrieves the sum from the queue and prints it.
While Queue is based on a pipe and some locking mechanisms, Pipe is a lower-level communication method. It can be faster but less safe. Pipes can be used for bidirectional communication between two processes.
Let's say you want to calculate the factorial of a number in a child process and then send the result back to the parent process for further manipulation or output.
```python
from multiprocessing import Process, Pipe

def calculate_factorial(n, conn):
    factorial = 1
    for i in range(1, n + 1):
        factorial *= i
    conn.send(factorial)
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    number = 5  # Calculate the factorial of 5

    p = Process(target=calculate_factorial, args=(number, child_conn))
    p.start()

    # Receive before joining, so a large result can't fill the pipe
    # buffer and block the child from exiting
    result = parent_conn.recv()
    p.join()

    print(f"The factorial of {number} is {result}.")
```
In this example, the child process calculates the factorial of a given number. It then sends this result back to the parent process through the Pipe. The parent process receives the result using parent_conn.recv() and prints it out.
When choosing between the two, Queue is generally easier to use and safer, making it suitable for most applications. On the other hand, if you need faster, more direct communication and are willing to handle the safety aspects yourself, Pipe might be the better choice. Base your selection on your specific needs for data safety, speed, and the complexity of the data being exchanged.
Conclusion
This topic outlined essential aspects of Python's multiprocessing. The main points to remember are:
- Multiprocessing basics: The Process class is foundational for initiating and controlling parallel tasks.
- Spawn vs. fork: fork enables quick startups by sharing state, while spawn offers process isolation.
- Inter-process communication: Queue is user-friendly and safe, whereas Pipe provides faster but less secure data exchange.