
Syntax of Proto Files


In this topic, you will learn about the syntax and data types that protocol buffers use. Starting with the syntax and primitive types, such as integers, floats, and strings, you will then look at special types, such as enumerations, oneofs, and maps. You will also learn to handle field labels and write nested messages.

Understanding protobufs syntax

A protocol buffers schema and its service interfaces are defined in a .proto file, which is a plain text file. Although the syntax of protocol buffers is relatively straightforward, it can appear overwhelming when a single proto file defines many messages and fields.

Let's discuss the basic syntax of protocol buffers with the following proto file:

syntax = "proto3";

package project.library;

/*
 This is the library proto file.
 It contains three messages: Book, Subject, and Library.
*/

// An empty message
message Book { }  

message Subject {
  // Location on shelf
  string location = 1;

  // List of books
  repeated Book books = 2;
}

message Library {
  // Map of subject names to books on the Subject
  map<string, Subject> subjects = 1;
}

The syntax declaration specifies the version of the protocol buffers language that the proto file uses, in this case proto3. It must be the first non-empty, non-comment line of the file. If the syntax is not specified, the compiler will assume that the proto file uses proto2.

Packages in protocol buffers provide a way to organize types and prevent naming conflicts. By declaring a package at the top of a proto file, types within that file become part of that package.

Using packages helps avoid clashes when multiple proto files define types with the same name. For instance, if two files have a Book message, the compiler can differentiate them by their package name. Each message has a fully qualified name, including the package name, like project.library.Book and school.items.Book. These fully qualified names distinguish types across packages and facilitate referencing types from other packages.

Protocol buffers support single-line comments using two forward slashes (//) and multi-line comments using /* and */. Both types of comments serve as annotations or explanations within the code, and the compiler ignores them when it processes the file.

Primitive types in protocol buffers

Primitive types in protocol buffers represent basic data types which include integers, floating-point numbers, boolean values, strings, and bytes.

In protocol buffers, integer types can be either variable-size or fixed-size. Variable-size integers (varints) take fewer bytes for small values and more bytes for large ones, while fixed-size integers always occupy a specific number of bytes. Fixed-size 32-bit integers use 4 bytes and fixed-size 64-bit integers use 8 bytes.
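To make the size difference concrete, here is a minimal Python sketch of the two layouts. It is not the protobuf library: the helper names encode_varint and encode_fixed32 are made up for this illustration, and the sketch only handles non-negative values.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (illustrative helper)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F              # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)      # more bytes follow: set the continuation bit
        else:
            out.append(low7)             # last byte: continuation bit stays clear
            return bytes(out)


def encode_fixed32(value: int) -> bytes:
    """Encode an unsigned 32-bit integer as exactly 4 little-endian bytes."""
    return struct.pack("<I", value)


# Print each value with its varint size and its fixed32 size in bytes.
for n in (1, 300, 2**32 - 1):
    print(n, len(encode_varint(n)), len(encode_fixed32(n)))
```

Small values are cheaper as varints, while values close to the 32-bit limit need 5 varint bytes, one more than the fixed-size layout.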

Integer types can be signed or unsigned. The 32-bit integer types include int32, uint32, sint32, fixed32, and sfixed32. Signed types are prefixed with "s" and unsigned types with "u". However, int32 is signed, while fixed32 is unsigned, despite lacking a prefix.
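One practical difference between int32 and sint32 is how negative numbers are encoded: sint32 first applies a ZigZag mapping so that values with a small magnitude, negative or positive, become small unsigned numbers. The Python sketch below shows that mapping; the function names are invented for this example.

```python
def zigzag_encode32(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned one: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    if not -2**31 <= n <= 2**31 - 1:
        raise ValueError("value does not fit in 32 signed bits")
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag_decode32(z: int) -> int:
    """Invert zigzag_encode32."""
    return (z >> 1) ^ -(z & 1)


# Each pair is (signed value, ZigZag-encoded value).
for n in (0, -1, 1, -2, 2):
    print(n, zigzag_encode32(n))
```

Because the encoded value stays small for small negative numbers, sint32 is usually the better choice when a field often holds negative values.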

The signed integer types take both positive and negative values, while their unsigned counterparts take only non-negative values. For example, int64 can hold a value in the range $-2^{63}$ to $+2^{63} - 1$, while uint64 can hold a value from 0 to $2^{64} - 1$.

In protocol buffers, integer types come in 32-bit and 64-bit widths. Signed values in the range $-2^{31}$ to $+2^{31} - 1$ fit the signed 32-bit types, and unsigned values up to $2^{32} - 1$ fit uint32 and fixed32. Values that fall outside these ranges need the 64-bit integer types.
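The ranges can be checked with a few lines of Python. This is only an arithmetic sketch; int_range and smallest_signed_type are hypothetical helpers, not part of any protobuf API.

```python
def int_range(bits: int, signed: bool) -> tuple:
    """Return the (min, max) values representable in the given width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


RANGES = {
    "int32": int_range(32, signed=True),
    "uint32": int_range(32, signed=False),
    "int64": int_range(64, signed=True),
    "uint64": int_range(64, signed=False),
}


def smallest_signed_type(value: int) -> str:
    """Pick int32 when the value fits, otherwise int64."""
    for name in ("int32", "int64"):
        low, high = RANGES[name]
        if low <= value <= high:
            return name
    raise ValueError("value does not fit in 64 signed bits")


print(RANGES["uint64"])
print(smallest_signed_type(2**31))
```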

In addition to integer types, protocol buffers also have floating-point types for numeric representation. The float type is a 32-bit floating-point number, while the double type is a 64-bit floating-point number.

Other primitive types include boolean, string, and bytes. The bool type represents true or false values. The string type holds text that must be UTF-8 encoded (or 7-bit ASCII), and the bytes type is an arbitrary sequence of raw bytes.

The reserved keyword

Protocol buffers serialize each value together with a small key that holds the field's tag (its field number) and a wire type. Field names and declared types are not written to the serialized data; they live only in the schema, which the reader uses to interpret the bytes.

These keys are essential for proper deserialization. Changing a field's type or tag in the schema, or reusing the tag of a deleted field for a different purpose, can cause compatibility issues, because a reader built from one version of the schema may misinterpret data written with another. This can break forward and backward compatibility.
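As a rough illustration of why tags matter, the sketch below hand-encodes a single string field the way the protobuf wire format lays it out: a key built from the field number and wire type, then the length, then the UTF-8 bytes. The function names are invented for this example, and real code should use the generated classes instead.

```python
WIRE_TYPE_LEN = 2  # length-delimited wire type, used for strings, bytes, and nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (illustrative helper)."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # 7 payload bits plus the continuation bit
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field as key + length + payload."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | WIRE_TYPE_LEN
    return encode_varint(key) + encode_varint(len(payload)) + payload


print(encode_string_field(1, "Hi").hex())  # 0a024869
print(encode_string_field(2, "Hi").hex())  # 12024869
```

The field name never appears in the output. Only the tag distinguishes title from subtitle, so a reader that maps tag 2 to a different field would silently misread the data.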

Let's illustrate this with the Book message below:

syntax = "proto3";

package book.v1;

message Book {
  string title = 1;
  string subtitle = 2;
  repeated string authors = 3;
  uint32 yearPublished = 4;
  optional uint32 pages = 5;
  float price = 6;
  bool isNewRelease = 7;
}

In the newer version of the Book message, the subtitle field is removed and the type of the isNewRelease field changes from bool to string. Because changing a type in place is unsafe, the string version gets a fresh tag, 8. The reserved keyword protects the retired tags and name as follows:

syntax = "proto3";

package book.v2;

message Book {
  reserved 2, 7;
  reserved "subtitle";

  string title = 1;
  repeated string authors = 3;
  uint32 yearPublished = 4;
  optional uint32 pages = 5;
  float price = 6;
  string isNewRelease = 8;
}

Field tags 2 and 7, along with the field name subtitle, are added to the reserved list. The protocol buffer compiler reports an error if developers attempt to reuse these reserved field tags or names. Reserving retired tags and names keeps them from being reused by accident and preserves compatibility during schema changes.
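The check that the compiler performs can be pictured with a small Python sketch. This is a simplified model and not the compiler's code: the Field class and find_reserved_conflicts function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    number: int


def find_reserved_conflicts(fields, reserved_numbers, reserved_names):
    """Return one message per field that reuses a reserved tag or name."""
    problems = []
    for field in fields:
        if field.number in reserved_numbers:
            problems.append(f"{field.name}: tag {field.number} is reserved")
        if field.name in reserved_names:
            problems.append(f"{field.name}: name is reserved")
    return problems


reserved_numbers = {2, 7}
reserved_names = {"subtitle"}

ok_fields = [Field("title", 1), Field("isNewRelease", 8)]
bad_fields = [Field("edition", 2), Field("subtitle", 9)]

print(find_reserved_conflicts(ok_fields, reserved_numbers, reserved_names))
print(find_reserved_conflicts(bad_fields, reserved_numbers, reserved_names))
```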

The repeated and optional labels

The repeated field label signifies that a field can have zero or more occurrences within a message. The label helps define lists or arrays of values for a specific field. In the previous example, the repeated field authors indicates that a book can have multiple authors.

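The optional field label denotes that a field is not mandatory in a message and may or may not have a value assigned to it. In proto3, a field without a value reads as its type's default (0 for numbers, an empty string for strings, false for booleans), and the optional label additionally lets the generated code check whether the field was explicitly set. An example is the optional field pages in the Book message.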
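The difference between "unset" and "set to the default" can be modeled in plain Python. The class below is only a sketch of the idea of presence tracking for the pages field; it is not generated protobuf code, and the method names has_pages and clear_pages are assumptions made for this example.

```python
from typing import Optional


class Book:
    """Plain-Python model of a message with an optional uint32 pages field."""

    def __init__(self) -> None:
        self._pages: Optional[int] = None  # None means "never set"

    @property
    def pages(self) -> int:
        # Reading an unset field yields the type's default value, 0.
        return 0 if self._pages is None else self._pages

    @pages.setter
    def pages(self, value: int) -> None:
        self._pages = value

    def has_pages(self) -> bool:
        return self._pages is not None

    def clear_pages(self) -> None:
        self._pages = None


book = Book()
print(book.pages, book.has_pages())
book.pages = 0  # explicitly set to the default value
print(book.pages, book.has_pages())
```

Both reads return 0, but only the second one reports the field as set, which is the information the optional label preserves.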

Enum composite type

Enums are a composite data type in protocol buffers that represents a fixed list of mutually exclusive named values. You can define enums using the enum keyword. Each value has a numeric tag that is unique within the enum. In proto3, the first value must have tag 0, and it serves as the default value. By convention, the value names are written in UPPER_SNAKE_CASE.

Suppose you have some books about Math and Python in a library. You can create the following enum to represent them:

syntax = "proto3";

package book.enum.v1;

enum BookType {
  BOOK_UNSPECIFIED = 0;
  BOOK_MATH = 1;
  BOOK_PYTHON = 2;
}

message Book {
  string title = 1;
  repeated string authors = 2;
  BookType book_type = 3;
}

The BookType enum stays exhaustive thanks to the BOOK_UNSPECIFIED value. If a book does not belong to the Math or Python category, or the field is never set, it falls into the default BOOK_UNSPECIFIED category.
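The default behavior can be mirrored with Python's standard IntEnum. This is a model of the schema above, not generated code; read_book_type is a hypothetical helper, and the sketch deliberately does not model how real implementations treat unknown numbers.

```python
from enum import IntEnum
from typing import Optional


class BookType(IntEnum):
    BOOK_UNSPECIFIED = 0
    BOOK_MATH = 1
    BOOK_PYTHON = 2


def read_book_type(wire_value: Optional[int] = None) -> BookType:
    """Return the enum value for a number, falling back to tag 0 when nothing was sent."""
    return BookType(0 if wire_value is None else wire_value)


print(read_book_type().name)
print(read_book_type(2).name)
```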

Also, enums can be nested in a message or specified by themselves. Let's look at an example of an enum nested inside a message:

syntax = "proto3";

package book.enum.v2;

message Book {
  enum Type {
    BOOK_UNSPECIFIED = 0;
    BOOK_MATH = 1;
    BOOK_PYTHON = 2;
  }

  string title = 1;
  repeated string authors = 2;
  Type type = 3;
}

Notice that the enum is renamed from BookType to Type when it is nested. This keeps the fully qualified names readable: the standalone enum is book.enum.v1.BookType, and the nested enum is book.enum.v2.Book.Type.

Oneof composite type

Oneof can group a set of fields together so that only one of the fields can be set at a time. This is useful for situations where you need to represent a value that can be one of a few different types.

A oneof has no default member. Because the fields are mutually exclusive, setting one of them automatically clears any other that was set before, and if none of them is set, the oneof as a whole is unset. Unlike enums, a oneof can only be declared inside a message.

Let's see how we can represent books in our library using oneof:

syntax = "proto3";

package book.oneof;

message Math { }

message Python { }

message Others { }

message Book {
  string title = 1;
  repeated string authors = 2;
  oneof type {
    Math math = 3;
    Python python = 4;
    Others others = 5;
  }
}

Here, oneof represents a book that can be either Math, Python, or Others. Unlike enums, you can use oneofs to specify more complex data types.
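The "only one at a time" rule can be sketched in plain Python. The class below is a simplified model of the type oneof and not generated protobuf code; the names BookTypeOneof, set, get, and which are made up for this illustration.

```python
class BookTypeOneof:
    """Model of a oneof with the members math, python, and others."""

    MEMBERS = ("math", "python", "others")

    def __init__(self) -> None:
        self._case = None   # name of the member that is currently set, if any
        self._value = None

    def set(self, member: str, value) -> None:
        if member not in self.MEMBERS:
            raise KeyError(f"unknown oneof member: {member}")
        # Storing a single (case, value) pair means the previous member is cleared.
        self._case = member
        self._value = value

    def get(self, member: str):
        return self._value if self._case == member else None

    def which(self):
        return self._case


book_type = BookTypeOneof()
print(book_type.which())
book_type.set("math", {"level": "beginner"})
book_type.set("python", {"version": 3})
print(book_type.which(), book_type.get("math"))
```

After the second call to set, the math member no longer holds a value, which mirrors how setting one oneof field clears the others.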

Map composite type

In protocol buffers, the map field is a feature that allows you to represent key-value pairs within a message. It provides a convenient way to store and access data in a structured manner. Maps are written as:

map<key_type, value_type> field_name = field_tag;

Keep in mind that a key can be a string, a bool, or any integer type. Floating-point types, bytes, enums, and messages are not allowed as keys. A value can be any scalar type, enum, or message type, but not another map.
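The key restriction is easy to express as a lookup. The sketch below lists the scalar types that the proto3 language guide allows as map keys (integer types, bool, and string); the constant and function names are hypothetical and exist only for this example.

```python
VALID_MAP_KEY_TYPES = frozenset({
    "int32", "int64", "uint32", "uint64",
    "sint32", "sint64",
    "fixed32", "fixed64", "sfixed32", "sfixed64",
    "bool", "string",
})


def is_valid_map_key_type(type_name: str) -> bool:
    """Return True if the scalar type may be used as a map key."""
    return type_name in VALID_MAP_KEY_TYPES


# Check a few candidate key types.
for candidate in ("string", "uint64", "bool", "float", "bytes"):
    print(candidate, is_valid_map_key_type(candidate))
```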

The library example demonstrates the use of maps in protocol buffers to associate books with subjects. The subject name serves as the key to retrieve the corresponding Subject message, which includes the subject's location and a list of books.

     ...

message Library {
  // Map of subject names to books on the Subject
  map<string, Subject> subjects = 1;
}

The repeated label cannot be applied to a map field or to a oneof, and a oneof cannot directly contain repeated or map fields. Enum and message fields, on the other hand, can be repeated, as the repeated Book books field in the Subject message shows. If you need a list inside a oneof, wrap it in a message.

Conclusion

In summary:

Syntax declaration and package names ensure proper compilation and prevent naming conflicts
Protocol buffers support primitive and composite data types
Primitive data types include signed and unsigned integers, floating-point numbers, boolean values, strings, and bytes
The enum, oneof, and map composite types are useful for representing complex objects
Composite data types help you handle complex data beyond the capabilities of the primitive types
You can use nested messages to further enhance the flexibility and organization of protocol buffers

By mastering protocol buffers syntax and data types, you gain knowledge to achieve efficient data serialization and streamline communication within your distributed applications. The possibilities are vast. Consult the protocol buffers documentation and discover more features and functionalities.
