Introduction to Protocol Buffers

Modern software applications are often built using a service-oriented architecture (SOA), where different services communicate with each other over a network to implement business functionalities. For example, an e-commerce website may implement separate services for order management, payment processing, customer management, and product management. To ensure seamless communication between different services, efficient data serialization is essential.

What are protocol buffers?

Protocol buffers (or Protobuf) are a language-agnostic, platform-agnostic, and extensible mechanism for serializing structured data in a forward-compatible and backward-compatible way. Protobuf includes an interface definition language (IDL) that allows developers to define the structure of the data exchanged between different services. This data can be anything from simple strings to complex objects.

A protocol buffer service interface defines the methods and data types that are exposed by a service. The schema defines the structure of the data that is exchanged between services. Both the service interface and schema are specified in a proto file, which is an ordinary text file with a .proto extension.

Why are protocol buffers useful?

The benefits of using protocol buffers include:

  • Efficiency: Protocol buffers use a compact binary format that is typically much smaller than equivalent text-based formats like JSON or XML, because field names are not repeated in the payload and numbers are stored in binary. This makes them suitable for cases where storage and bandwidth are limited.
  • Extensibility: The schema for protocol buffers can be extended over time without breaking existing applications. This makes it easy to add new features or change existing features without having to rewrite all of the code that uses protocol buffers.
  • Flexibility: Protocol buffers support a wide range of data types, including strings, numbers, boolean values, and nested objects. This makes them a versatile tool for representing a wide variety of data.
  • Built-in support for multiple languages: The protocol buffers compiler, protoc, has built-in support for multiple languages. This makes them a great choice for projects that need to be interoperable with different languages. For example, there are protocol buffer libraries available for Kotlin, Java, Python, C++, and Go.
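To make the efficiency point concrete, here is a minimal sketch that hand-encodes two fields in the protobuf wire format and compares the result with the same data as JSON. It uses only the Python standard library rather than the official protobuf package, and the helper names (encode_varint, encode_string_field, encode_int_field) and the field numbers 1 and 2 are assumptions made for this illustration.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Length-delimited field (wire type 2): key, length, UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


def encode_int_field(field_number: int, value: int) -> bytes:
    """Varint field (wire type 0): key, then the value as a varint."""
    return encode_varint(field_number << 3) + encode_varint(value)


# Hypothetical record with two fields, numbered 1 and 2 for this sketch.
binary = encode_string_field(1, "Greg Perry") + encode_int_field(2, 2017)
text = json.dumps({"author": "Greg Perry", "yearPublished": 2017}).encode("utf-8")

print(len(binary), "bytes in protobuf wire format")
print(len(text), "bytes as JSON")
```

The binary payload carries only small numeric keys and the values, while the JSON payload repeats the field names and spells the number out as text. Real savings depend on the data, so treat this as an illustration rather than a benchmark.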

Defining a simple protocol buffer message

If you want to create an application to manage books in your library, you can use protocol buffer messages. The first thing you should do is define the structure of your data with a schema. The schema defines the names, types, and optionality of the fields in the message.

To represent a book using protocol buffers, define a message with the following fields: title, subtitle, author, yearPublished, ISBN, and price. Each field has a type: yearPublished uses int32, price uses float, and the remaining fields (title, subtitle, author, and ISBN) use string. Each field also gets a unique identifier called a field tag, here ranging from 1 to 6:

// book.proto

syntax = "proto3";

message Book {
  string title = 1;
  optional string subtitle = 2;
  string author = 3;
  int32 yearPublished = 4;
  string ISBN = 5;
  float price = 6;
}

For example, you can write the data about a book as a Book message in a text file (book.txt) in the following way:

title: "C Programming"
subtitle: "Absolute Beginner's Guide"
author: "Greg Perry"
yearPublished: 2017
ISBN: "9781098117253"
price: 17.99
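In the binary encoding, a field is identified by its field tag and a wire type rather than by its name. The following sketch shows how the two combine into the key that precedes each value for the Book fields defined above. The constant and function names are assumptions made for this illustration; the wire type numbers themselves come from the protobuf encoding.

```python
# Wire types defined by the protobuf encoding.
VARINT, I64, LEN, I32 = 0, 1, 2, 5

# Field tags and wire types for the Book message above.
BOOK_FIELDS = {
    "title": (1, LEN),
    "subtitle": (2, LEN),
    "author": (3, LEN),
    "yearPublished": (4, VARINT),  # int32 is stored as a varint
    "ISBN": (5, LEN),
    "price": (6, I32),  # float is stored as 4 fixed bytes
}


def field_key(field_number: int, wire_type: int) -> int:
    """Combine a field tag and wire type into the key written before a value."""
    return (field_number << 3) | wire_type


for name, (number, wire_type) in BOOK_FIELDS.items():
    print(f"{name}: key 0x{field_key(number, wire_type):02x}")
```

Because the key depends on the field tag and not on the field name, renaming a field does not change the encoded data, while changing its tag does.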

Integration with gRPC and HTTP/2

Protocol buffers are used to define both the schema and the service interface. Now that you have defined the schema for your book, you can define the service interface methods (AddBook and GetBook) in the book.proto file:

// book.proto

   ...

service BookService {
  rpc AddBook(Book) returns (Book);
  rpc GetBook(BookID) returns (Book);
}

message BookID {
  string id = 1;
}

To use these methods in your application, you need to generate code for the client and server sides of the service with the protoc compiler.

How protocol buffers work with gRPC services

Before you go any further, let's discuss gRPC. gRPC is a high-performance framework for Remote Procedure Call (RPC), enabling client applications to call methods on remote server applications as if they were local objects, simplifying the development of distributed applications and services.

To utilize gRPC, you need the generated code on both the client and server ends. The client code specifies the message to be sent to the server, while the server code determines how to process the received message. Communication between the client and server occurs through method calls using the service interface, with the server implementing the defined service logic.

Client-server communication takes place over the HTTP/2 transport protocol, which improves performance through features such as multiplexing and header compression. The client encodes each message and sends it in an HTTP POST request, the server decodes it, and the server then encodes its response and sends it back to the client for further processing.
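As a rough illustration of the encoding step, gRPC places each serialized message in the HTTP/2 request or response body behind a 5-byte prefix: a 1-byte compression flag followed by a 4-byte big-endian length. The sketch below frames and unframes a payload with the standard library only. The function names are assumptions made for this example, and a real application would rely on a gRPC library instead of doing this by hand.

```python
import struct


def frame_grpc_message(payload: bytes, compressed: bool = False) -> bytes:
    """Prefix a serialized message with the 5-byte gRPC message header."""
    return struct.pack(">BI", int(compressed), len(payload)) + payload


def unframe_grpc_message(frame: bytes) -> bytes:
    """Strip the 5-byte header and return the serialized message."""
    flag, length = struct.unpack(">BI", frame[:5])
    if flag:
        raise ValueError("compressed messages are not handled in this sketch")
    return frame[5:5 + length]


# A BookID message with id = "42", already serialized in protobuf wire format.
book_id = b"\x0a\x0242"
frame = frame_grpc_message(book_id)
print(frame.hex())
print(unframe_grpc_message(frame) == book_id)
```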

Backward and forward compatibility of protobufs

Suppose you already record book information with the Book message, but you now want to capture an additional detail: whether a book is a new release. You can extend the existing schema by adding a new field, isNewRelease:

// book.proto

syntax = "proto3";


message Book {
   ...
  // New field
  bool isNewRelease = 7;
}

When you introduce the new field to the data, you get the following text file:

title: "C Programming"
subtitle: "Absolute Beginner's Guide"
author: "Greg Perry"
yearPublished: 2017
ISBN: "9781098117253"
price: 17.99
isNewRelease: true

Backward compatibility in protocol buffers means that code built with a newer version of a schema can still read data written with an older version of the schema.

When data is serialized using a schema that lacks the isNewRelease field, the serialized data will not contain the isNewRelease field. During deserialization with the newer schema, the absent field is assigned its default value (false for a bool).

Forward compatibility means that code built with an older version of a schema can still read data written with a newer version of the schema.

When data containing the isNewRelease field is serialized using the newer schema, code using the older schema ignores the isNewRelease field during deserialization because the field does not exist in the older schema.

By assigning default values to missing fields and ignoring unknown ones, proto3 supports both backward and forward compatibility.
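The two behaviors described here, a reader skipping a field it does not know and a reader falling back to a default for a field that is absent, can be sketched with a small hand-written decoder. This is an illustration under simplifying assumptions and not the real protobuf runtime: it handles only varint and length-delimited fields, the schemas are plain dictionaries, and all names (read_varint, decode, OLD_SCHEMA, NEW_SCHEMA) are invented for the example.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf: bytes, schema: dict[int, str], defaults: dict) -> dict:
    """Decode known fields, skip unknown ones, keep defaults for absent ones."""
    message = dict(defaults)
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if number in schema:
            name = schema[number]
            message[name] = type(defaults[name])(value)
        # Otherwise the field is unknown: its bytes were consumed and are dropped.
    return message


OLD_SCHEMA = {1: "title", 4: "yearPublished"}
NEW_SCHEMA = {1: "title", 4: "yearPublished", 7: "isNewRelease"}
OLD_DEFAULTS = {"title": "", "yearPublished": 0}
NEW_DEFAULTS = {"title": "", "yearPublished": 0, "isNewRelease": False}

# title = "C Programming", yearPublished = 2017, isNewRelease = true
new_data = b"\x0a\x0dC Programming\x20\xe1\x0f\x38\x01"
# The same book, written before isNewRelease existed.
old_data = b"\x0a\x0dC Programming\x20\xe1\x0f"

print(decode(new_data, OLD_SCHEMA, OLD_DEFAULTS))  # older reader, newer data
print(decode(old_data, NEW_SCHEMA, NEW_DEFAULTS))  # newer reader, older data
```

The older reader returns only title and yearPublished, and the newer reader reports isNewRelease as False because the field was never written.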

Comparison between protobufs and other serialization formats

There is no one-size-fits-all data serialization format. Different applications have different needs, and the best serialization format for one application may not be the best serialization format for another application.

For example, protobufs are a fast and efficient serialization format that is well-suited for applications that need to process large amounts of data quickly. However, they are not human-readable, which can make them difficult to debug and troubleshoot.

If your application is human-facing, you may want to choose a serialization format that is human-readable, such as JSON or XML. JSON and XML are both human-readable formats that are easy to debug and troubleshoot. However, they are not as efficient as protobufs in terms of space usage or speed.

The table compares three popular data serialization formats:

Criteria              Protobufs     XML            JSON
Ease of use           ✓             ✓              ✓
Performance           ✓             ✗ (slowest)    ✗ (slower)
Payload size          ✓             ✗ (largest)    ✗ (larger)
Schema readability    ✓             ✓ (verbose)    ✓
Data readability      ✗ (binary)    ✓ (text)       ✓ (text)
Typing                Static        Dynamic        Dynamic
Schema evolution      ✓             ✗              ✗

Conclusion

Protocol buffers are an efficient and flexible data serialization format that can be used in a wide variety of modern software applications. Their binary encoding is typically smaller and faster to process than text formats like JSON and XML, and they integrate well with HTTP/2 and gRPC, a high-performance cross-platform infrastructure for connecting distributed applications or microservices.

Protocol buffers are not the only data serialization format out there, and they have some disadvantages. For example, they are not human-readable, which can make them difficult to debug and troubleshoot. If you are looking for a human-readable data serialization format, you may want to consider JSON or XML instead. You can learn more about protocol buffers from the official documentation.
