Multi-Modal Voice Applications in the Food & Beverage DC

GS1 and traceability demands are driving adoption.

The traditional distinction between voice and RF applications is disappearing as old-school voice-directed warehouse applications give way to multi-modal voice applications that combine voice direction, speech recognition, barcode scanning, and device displays. This transition started almost 10 years ago, but it has accelerated in recent years as a byproduct of GS1 data standardization initiatives and the drive for better product tracking and traceability within and beyond the four walls of the DC. Multi-modal voice applications provide greater flexibility and efficiency in capturing product-level data, helping DCs meet traceability objectives without adding to costs.

Just as no two DCs are alike, not all “multi-modal” voice solutions are created equal. For example, some solutions offer voice direction and scanning, but do not include speech recognition. Others may offer complete voice capabilities with barcode scanning, but require the use of a special-purpose hardware device that does not include a screen and keyboard (for log-in or to display product images on the screen). True multi-modal solutions don’t require DCs to purchase dedicated hardware or to limit their data capture and display options. Rather, they permit the interchangeable use of speech recognition and barcode scanning, and the selective use of device displays, in addition to key and touch screen entry, using a broad choice of standard multi-purpose hardware platforms.

As described in this article, the genesis of multi-modal voice applications was tied to hardware technology developments in the early 2000s, but the compelling business driver for multi-modal processes is related to product tracking and traceability initiatives.


From voice-only or scan-only to multi-modal

Foodservice and grocery distributors were among the early adopters of voice-directed warehouse applications as a highly accurate, efficient, and ergonomic means for order selection and other warehouse processes. In the typical voice process, selectors are directed by voice and confirm their activity by speaking a location- or item-based check string (typically a two- or three-digit number) as they grab items from a location. The voice system recognizes the user’s speech using advanced recognition technology running on a mobile, belt-worn computing device.

Early voice applications used single-purpose, voice-only “appliances” that were purpose-built for speech recognition applications and did not have a screen or scanner. Those voice-only solutions were often adopted as a replacement or alternative to RF systems running on multi-purpose wearable or handheld devices.

The clean separation between voice and RF solutions started to break down in the middle of the last decade when the major manufacturers of rugged mobile computers used for RF upgraded and optimized their hardware to support speech recognition. Since then, standard, multi-modal mobile computers have steadily taken a larger share of the total hardware market for voice applications.

More and more voice solutions have been delivered on multi-modal hardware, but not all of these applications utilize scanning, screens or keypads. Similarly, some RF/scanning systems have incorporated voice direction in their solutions, but they do not utilize speech recognition; instead they rely on barcode scanning (and key entry) to confirm activity and capture information.

Although voice-only or scan-only approaches work for capturing item-level information, neither technology is best for every situation. As a result, any DC that relies on one technology to the exclusion of the other may be settling for a sub-optimal solution. The operational benefits of a multi-modal approach using both voice and scanning are accentuated as data capture requirements increase, due in large part to new product traceability and GS1 data standards initiatives.


GS1 standards and multi-modal software

GS1 is a non-profit association that develops and promotes the implementation of data, barcode, electronic product code (RFID), and data synchronization standards to improve supply chain efficiency. In addition to cross-industry standards efforts, GS1 has a number of industry-specific groups, including the GS1 Foodservice Initiative that addresses the unique challenges of the foodservice supply chain. (For more information about the GS1 standards and the Foodservice Initiative visit

The purpose of GS1 standards is to provide a common language and method for sharing and tracking product information throughout the supply chain. For example, the GS1 Foodservice Initiative has defined standard data formats for identifying producers, distributors, and products (Global Trade Item Numbers, or GTINs), as well as lot, date and other product attributes (in addition to means for sharing and synchronizing data). Common formats make it easier and less costly for distributors to manage data from multiple producers, and they improve item-level tracking as products flow through the DC. In the long run, adoption of the standards will reduce DC costs, including the costs for managing recalls, but the savings may require changes both in the back-end systems that manage and share data, and in material handling processes within the DC.

To get maximum benefit from the standards efforts, DCs need to track product data as pallets received from manufacturers are broken down into cases and individual items that are shipped to end customers. To achieve this level of tracking often requires additional data entry steps that have the potential to increase operating costs. This is encouraging many DCs to rethink their material handling processes and technology choices. Rather than framing their choices as voice OR scanning, they are thinking anew about voice AND scanning.


Understanding the benefits of voice and scanning

The relative merits of voice or scanning for any task depend both on how products or locations are labeled (whether information is printed or barcoded) and on the specific requirements of the business process. In instances where voice and scanning are both practical, the question becomes which technology will result in a more efficient, effective process.

In comparison to a scan- and screen-based RF process, the combination of voice direction with voice confirmation of a location check digit creates a highly fluid, efficient workflow for selectors. With voice processes, the selector will move as he or she listens and speaks, without the extraneous motion of stopping to scan or stopping to read a screen for the next instruction. Likewise, voice systems allow users to speak quantities and units of measure—“three cases”—as they work, rather than stopping to key enter or confirm quantities. Positively entering the quantity (by voice or key entry) is a check on accuracy; entering by voice is more efficient and accurate than key entry.

Capturing product level detail at the point of activity—which is at the heart of GS1 and related traceability initiatives—is a more complex matter. Most DCs today have products with mixed labeling. While some products may include GS1-128 barcodes that encode GTIN, lot, date, weight and other product level data, other products may have item data encoded in barcodes that do not conform to the GS1 standards. In addition, other products may only include printed lot or date information.
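To make the labeling discussion concrete, here is a minimal sketch of how a scanned GS1-128 element string might be split into product-level fields. It assumes the scanner delivers the decoded data with FNC1 separators transmitted as the ASCII group separator character and with no symbology prefix. It handles only a handful of Application Identifiers (01 for GTIN, 10 for lot, 11, 15 and 17 for dates, and 310n/320n for net weight in kilograms or pounds). The function names and the sample values are hypothetical and are not taken from any particular voice or WMS product.

```python
GS = "\x1d"  # group separator commonly sent in place of FNC1 (assumption)

# Subset of GS1 Application Identifiers with fixed-length data: AI -> (name, length)
FIXED_AIS = {
    "01": ("gtin", 14),
    "11": ("production_date", 6),   # YYMMDD
    "15": ("best_before_date", 6),  # YYMMDD
    "17": ("expiration_date", 6),   # YYMMDD
}
# Variable-length data runs until a separator or the end of the string
VARIABLE_AIS = {"10": "lot"}
# 310n / 320n: net weight, where n is the number of decimal places
WEIGHT_AIS = {"310": "net_weight_kg", "320": "net_weight_lb"}


def gtin_check_digit_ok(gtin: str) -> bool:
    """Validate the GTIN mod-10 check digit (weights 3,1,3,... from the right)."""
    digits = [int(c) for c in gtin]
    body, check = digits[:-1], digits[-1]
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


def parse_gs1_128(data: str) -> dict:
    """Split an element string into named fields for the supported AIs."""
    fields = {}
    pos = 0
    while pos < len(data):
        if data[pos] == GS:
            pos += 1
            continue
        ai2, ai4 = data[pos:pos + 2], data[pos:pos + 4]
        if ai2 in FIXED_AIS:
            name, length = FIXED_AIS[ai2]
            fields[name] = data[pos + 2:pos + 2 + length]
            pos += 2 + length
        elif ai2 in VARIABLE_AIS:
            end = data.find(GS, pos)
            if end == -1:
                end = len(data)
            fields[VARIABLE_AIS[ai2]] = data[pos + 2:end]
            pos = end
        elif len(ai4) == 4 and ai4[:3] in WEIGHT_AIS and ai4[3].isdigit():
            decimals = int(ai4[3])
            fields[WEIGHT_AIS[ai4[:3]]] = int(data[pos + 4:pos + 10]) / 10 ** decimals
            pos += 10
        else:
            raise ValueError(f"unsupported Application Identifier at position {pos}")
    return fields


# Made-up case label: GTIN, best-before date, net weight in pounds, lot
sample = "01" + "00012345678905" + "15" + "250630" + "3202" + "001234" + "10" + "LOT42A"
parsed = parse_gs1_128(sample)
```

One scan of a label like this yields the GTIN, date, catch weight and lot together. Products with non-conforming barcodes or printed-only data would still need another entry path, such as voice.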

The prevalence of mixed labeling illustrates the importance of giving users the flexibility to enter product data by a variety of means. In an ideal world in which all products have product level data printed and barcoded in consistent formats, there would still be the question of which technology is best for the task from a process efficiency view. The important point is that neither speech recognition nor scanning is optimal for all tasks.

For example, when entering data by voice, there are some situations where you would want the voice system to repeat the spoken value back to the user for confirmation. This is common when entering the weights of variable weight products, to account for the fact that users will sometimes transpose digits or misread numbers. In those instances, scanning may save seconds per weight entered and improve data accuracy by eliminating human error.

In other situations, the system can verify the value entered without asking for a secondary user confirmation, for example, if lot numbers are unique to each location. In that case, it may be faster for a user to voice-enter the last three or four digits of the lot than to scan the number. A good rule of thumb is that scanning will generally be faster than voice if the string of numbers being entered exceeds five digits. And as noted in the catch weight example, scan-entry eliminates human error (users misspeaking a number), which improves overall process efficiency.
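The rules of thumb above can be written as a simple decision function. This is only an illustrative sketch: the function name, its parameters, and the decision to treat a required read-back as a reason to prefer scanning are assumptions drawn from the examples in this section, not a feature of any specific product.

```python
SCAN_THRESHOLD = 5  # rule of thumb: scanning tends to win above five digits


def choose_entry_mode(has_barcode: bool, digits_to_enter: int, needs_readback: bool) -> str:
    """Suggest 'scan' or 'voice' for capturing a single value."""
    if not has_barcode:
        # Printed-only data cannot be scanned, so the user must speak it
        return "voice"
    if digits_to_enter > SCAN_THRESHOLD:
        return "scan"
    if needs_readback:
        # A spoken read-back adds time, so scanning is likely faster
        return "scan"
    return "voice"


# Last four digits of a lot that the system can verify on its own
lot_mode = choose_entry_mode(has_barcode=True, digits_to_enter=4, needs_readback=False)
# Catch weight that would otherwise be read back to the user
weight_mode = choose_entry_mode(has_barcode=True, digits_to_enter=4, needs_readback=True)
```

In practice a DC would tune the threshold and conditions to its own labeling mix and processes.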


Multi-modal beyond voice and scanning

While much of the current interest in multi-modal applications is focused on the need for more efficient data capture using a combination of speech recognition and barcode scanning, multi-modal is about more than voice and scanning. Notwithstanding the advantages of voice direction, there are places where a device display can be used to supplement voice—for example, in presenting lists of items and/or item images.

The important point is that true multi-modal applications give DCs new flexibility to design their processes to suit evolving business needs, rather than requiring them to adapt to the limitations of a single technology. In the food and beverage industry this means more efficient and effective warehouse operations that simultaneously enhance product traceability and reduce costs.