Architecture Exploration

Auto-PDT
Voice-driven hands-free automation

Leveraging Android accessibility services to give Kmart team members natural voice control over their existing portable data terminals.

Date: 24 March 2026 · Author: Fabio Oliveira · Status: Concept Exploration
01

Executive Summary

A lightweight Android app that sits on top of existing PDT apps and automates them via natural voice commands, without requiring any changes to the current applications or backend systems. Team members speak naturally, and the system interprets their intent, reads the current screen, and performs the taps and navigation they would otherwise do manually.

The concept is inspired by Blue (heyblue.ai), a Y Combinator startup that uses a hardware dongle to enable voice control of iPhone apps. The key insight: Android, which our Zebra PDTs run, is significantly more open than iOS. What Blue needs custom hardware to achieve on iPhone, Android can do in pure software.

02

The Problem

Kmart team members use handheld Zebra PDTs throughout their shifts to perform operational tasks: checking stock, verifying prices, managing task lists, processing markdowns, receiving deliveries, and flagging replenishment needs. These tasks require physical interaction with the device, creating friction across several scenarios.

Hands are full

Team members are frequently carrying product, working with trolleys, stocking shelves, or handling deliveries. Stopping to use the PDT interrupts physical work.

Repetitive navigation

Many tasks require the same multi-tap flow dozens of times per shift. Stock check alone: open app, tap search, enter code, tap go, read result, go back. 50+ times per day.

Context switching

Moving between apps for different functions breaks flow and slows down the work.

Speed of service

When a customer asks "do you have more of these?", the team member currently has to stop, pull out the device, navigate, enter the product, and wait. A voice command could answer in seconds.

03

Inspiration: Blue (heyblue.ai)

Blue is a Y Combinator S25 startup building "the first voice assistant that can control every app on your phone." Their approach reveals both the opportunity and why Android is the right platform for us.

Blue on iOS

Needs custom hardware

  • USB-C hardware dongle ("Bud") required
  • Bud acts as USB HID device to inject touches
  • Screen capture provides visual context to LLM
  • LLM determines actions, Bud executes them
  • Priced at ~$299/year
  • iOS blocks third-party apps from reading UI or injecting events

Auto-PDT on Android

Pure software solution

  • No hardware add-ons needed
  • AccessibilityService reads any app's UI natively
  • dispatchGesture() performs taps and swipes in software
  • LLM interprets natural language commands
  • Software licensing + low API costs only
  • Android provides all required APIs out of the box

Blue validates the concept. Android makes it dramatically simpler and cheaper to implement. What they solve with custom hardware, we solve with software on existing devices.

04

Target Device Fleet

Model | Android Version | Status
Zebra TC51 | Android 8.x (Oreo) | Legacy
Zebra TC52 | Android 10-11 | Current
Zebra TC53 | Android 13 | Newest

All models support the required AccessibilityService APIs. Zebra devices also offer programmable hardware buttons (ideal for voice trigger), the Zebra EMDK for barcode scanner integration, MDM support for controlled deployment, and built-in barcode scanners that can complement voice input.

05

System Architecture

Auto-PDT is a single Android application with five core components that translate natural voice commands into automated actions within existing PDT apps.

Auto-PDT Application
01
Voice Listener

SpeechRecognizer API. Captures natural speech via hardware button trigger.

02
Command Interpreter

LLM-powered NLU. Speech + screen context to structured intent.

03
Screen Reader

AccessibilityService. Reads UI tree of any foreground app.

04
Action Engine

LLM-driven step-by-step navigation using the accessibility tree.

05
Overlay UI

Floating mic icon, listening state, confirmations, result display.

Auto-PDT automates the existing PDT apps beneath it:

  • Existing PDT App A — Stock / Price
  • Existing PDT App B — Task Management

Component Detail

1 Voice Listener
SpeechRecognizer API

Captures the team member's spoken command and converts it to text.

  • Recommended trigger: Programmable hardware button on the Zebra device (most reliable in noisy environments)
  • Alternative triggers: Tap on floating overlay icon, wake word (future, less reliable in store noise)
  • Processing: On-device speech-to-text where supported (Android 13+ on TC53), cloud fallback for older devices
  • Output: Raw text transcript of the spoken command

Store environments are noisy. A physical button press is unambiguous, fast, and avoids false activations. Zebra devices already have configurable side buttons that are easy to reach.
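
On the trigger itself, one practical detail is debouncing: a held or double-pressed side button should start only one capture session. A minimal pure-Kotlin sketch of that logic is below; `TriggerDebouncer` is an illustrative name, and in the real app the press would arrive via the accessibility service's key-event callback before starting `SpeechRecognizer`.

```kotlin
// Hypothetical sketch: debounce the Zebra side-button trigger so one
// physical press starts exactly one voice capture session.
// In the real app, presses would arrive via AccessibilityService key
// events and a "true" return would start SpeechRecognizer.
class TriggerDebouncer(private val windowMs: Long = 500) {
    // Start "in the past" so the very first press always fires.
    private var lastTriggerMs = -windowMs

    // Returns true if this press should start a new capture session.
    fun onButtonPress(nowMs: Long): Boolean {
        if (nowMs - lastTriggerMs < windowMs) return false
        lastTriggerMs = nowMs
        return true
    }
}
```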

2 Command Interpreter (NLU-powered)
Cloud LLM (Haiku / Gemini Flash)

The brain of the system. Team members speak naturally. "How many of these have we got," "check stock on this," "what's our count," and "do we have more out back" all resolve to the same action. No training, no command sheets, no memorisation. The same cloud LLM handles both initial intent resolution and step-by-step screen navigation during action execution.

  • Input: Speech transcript + current screen context (app, screen, visible data)
  • Processing: LLM resolves intent from a finite, well-defined set of actions
  • Output: Structured intent, e.g. {action: "stock_lookup", product: "12345678"}

LLM Options

Option | Latency | Cost | Connectivity | Capability
Cloud LLM (recommended) | 200-500ms | Fractions of a cent | Requires store WiFi | High accuracy, easy to update
On-device model | 50-150ms | Zero marginal | Works offline | Less capable, harder to update
Hybrid | 50-500ms | Low | Graceful degradation | Best of both, more complex

Available Action Set

Action | Category | Type
stock_lookup | Stock | Read
price_check | Price | Read
apply_markdown | Price | Write
view_tasks | Tasks | Read
complete_task | Tasks | Write
next_task | Tasks | Read
start_receiving | Receiving | Write
confirm_delivery | Receiving | Write
flag_empty_shelf | Replenishment | Write
request_fill | Replenishment | Write
navigate_back | Navigation | Read
go_home | Navigation | Read
open_app | Navigation | Read

The LLM handles both classification/parameter extraction (intent resolution) and step-by-step screen navigation during execution. Multiple calls per flow, roughly 300ms each.
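
Because the action set is finite, the app can reject anything the LLM emits outside it before the Action Engine runs. A minimal sketch of that guard, assuming a `ResolvedIntent` type of our own invention (the real wire format would be the JSON shown above):

```kotlin
// Hypothetical sketch: validate an LLM-resolved intent against the
// bounded action set before it reaches the Action Engine.
// ResolvedIntent and ACTION_SET are illustrative names, not from a real app.
data class ResolvedIntent(
    val action: String,
    val params: Map<String, String> = emptyMap()
)

// The bounded action set from the table above.
val ACTION_SET = setOf(
    "stock_lookup", "price_check", "apply_markdown",
    "view_tasks", "complete_task", "next_task",
    "start_receiving", "confirm_delivery",
    "flag_empty_shelf", "request_fill",
    "navigate_back", "go_home", "open_app"
)

// Anything outside the whitelist is refused, limiting the error space.
fun validate(intent: ResolvedIntent): Boolean = intent.action in ACTION_SET
```

This is the "bounded action set limits error space" mitigation from the risk table made concrete: a hallucinated action name simply never executes.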

3 Screen Reader
AccessibilityService.getRootInActiveWindow()

Reads the current state of whatever app is on screen. This is what makes "check stock for this item" work without the user saying the product code, because the Screen Reader already knows what product is on screen.

  • Which app is in the foreground (package name)
  • The screen structure (buttons, text fields, labels, lists)
  • Visible text content (product codes, descriptions, quantities, prices)
  • Which elements are interactive (tappable, editable)
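
To feed this to the LLM, the accessibility tree must be serialised into something compact. A sketch of one plausible flattening, where `UiNode` is a stand-in for `android.view.accessibility.AccessibilityNodeInfo` (the real service would walk the tree returned by `getRootInActiveWindow()`):

```kotlin
// Hypothetical sketch: flatten a UI tree into compact indented text
// for the LLM prompt. UiNode is a pure-Kotlin stand-in for
// AccessibilityNodeInfo so the logic can be shown without Android.
data class UiNode(
    val className: String,
    val text: String? = null,
    val clickable: Boolean = false,
    val editable: Boolean = false,
    val children: List<UiNode> = emptyList()
)

fun flatten(node: UiNode, depth: Int = 0): String {
    val flags = mutableListOf<String>()
    if (node.clickable) flags.add("clickable")
    if (node.editable) flags.add("editable")
    val sb = StringBuilder()
    sb.append("  ".repeat(depth))                       // indent = depth in tree
    sb.append(node.className.substringAfterLast('.'))   // short class name
    node.text?.let { sb.append(" \"$it\"") }            // visible text, if any
    if (flags.isNotEmpty()) sb.append(" [${flags.joinToString(",")}]")
    val lines = mutableListOf(sb.toString())
    node.children.forEach { lines.add(flatten(it, depth + 1)) }
    return lines.joinToString("\n")
}
```

A product-detail screen would render as a handful of short lines (class, text, interactivity flags), which is far cheaper to send per step than a screenshot.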
4 Action Engine
dispatchGesture() / performAction()

Uses LLM-driven navigation to complete each action. At every step, the engine sends the current screen's accessibility tree to the LLM. The LLM reads the structured UI, decides the next action (tap, type, scroll), and the engine executes it. The new screen state is read and sent back. This loop repeats until the goal is achieved. No pre-coded flows needed. The LLM adapts to whatever is on screen, just like a human would.

  • Receives structured action from the Command Interpreter
  • Reads the current screen via the accessibility tree
  • Sends screen state + goal to the LLM, which decides the next UI action
  • Executes that action (tap, type, scroll), then reads the updated screen
  • Repeats until the goal is reached, then captures the result for the Overlay UI

Graceful failure: if the LLM cannot determine a valid next step, or a timeout is reached, the engine stops immediately, announces "Sorry, couldn't complete that", and leaves the app as-is. No forced actions, no random tapping. Critical for trust.
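
The observe-decide-act loop can be sketched in a few lines. This is an illustrative skeleton, not the real implementation: `Llm`, `ScreenSource`, and `Executor` are invented interfaces that the actual app would back with the cloud LLM, AccessibilityService, and `dispatchGesture()`/`performAction()` respectively.

```kotlin
// Hypothetical sketch of the Action Engine loop: read screen, ask LLM
// for the next step, execute, repeat until Done, GiveUp, or step cap.
sealed interface Step
data class Tap(val elementId: String) : Step
data class Type(val elementId: String, val text: String) : Step
data class Done(val result: String) : Step   // goal reached, carry result
object GiveUp : Step                         // LLM is uncertain: stop cleanly

fun interface Llm { fun nextStep(goal: String, screenState: String): Step }
fun interface ScreenSource { fun read(): String }
fun interface Executor { fun execute(step: Step) }

// Returns the captured result, or null on graceful failure / timeout.
fun runAction(
    goal: String,
    llm: Llm,
    screen: ScreenSource,
    exec: Executor,
    maxSteps: Int = 10
): String? {
    repeat(maxSteps) {
        when (val step = llm.nextStep(goal, screen.read())) {
            is Done -> return step.result
            GiveUp -> return null            // no forced actions
            else -> exec.execute(step)       // tap/type, then loop re-reads
        }
    }
    return null                              // step cap hit: treat as failure
}
```

The `maxSteps` cap is the timeout in disguise: if the LLM wanders, the engine stops rather than tapping blindly.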

5 Overlay UI
TYPE_ACCESSIBILITY_OVERLAY

A minimal, always-visible floating interface. Small footprint, stays out of the way, expands briefly to show results, auto-collapses.

  • Floating microphone icon (tap or hardware button to activate)
  • Listening indicator (pulsing animation when capturing speech)
  • Command confirmation ("Checking stock for item 12345678...")
  • Result display ("Stock on hand: 24. Backroom: 12.")
  • Confirmation prompt for write actions
06

Data Flow: End-to-End

A team member is stocking shelves, holding product. They glance at the PDT screen showing a product detail page and want to know stock availability.

Stock Lookup Flow
1
Wake + Capture

Team member presses side button. Says: "How many of these have we got?"

SpeechRecognizer returns: "how many of these have we got"
2
Screen Context Capture

Screen Reader snapshots the current accessibility tree.

App: StockSmart · Screen: Product Detail
Product: 40987654 · "Anko 1.7L Kettle"
3
Command Resolution

LLM receives transcript + screen context. Resolves to structured intent.

{action: "stock_lookup", product: "40987654"}
4
Action Execution

Action Engine sends screen state to LLM. LLM: "I see a Stock tab, tap it." Screen updates. LLM: "I see stock values, read them." Done.

5
Result Capture

Screen Reader captures stock values from the result screen.

On hand: 24 · Backroom: 12 · On order: 48 (arriving Thursday)
6
Spoken Confirmation

Overlay speaks and displays the result. Auto-collapses after 5 seconds.

"24 on the floor, 12 out back. 48 on order, arriving Thursday."
Estimated total: 2-4 seconds from button press to spoken result
07

Write Action Safety

Any action that changes data requires an explicit voice confirmation before execution. Read-only actions (stock lookups, price checks, viewing tasks) execute immediately.

Team member: "Mark this down to five dollars"
Auto-PDT: Resolves {action: "apply_markdown", product: "40987654", price: 5.00}
Auto-PDT: "Mark down Anko 1.7L Kettle to $5.00. Say 'confirm' or 'cancel'."
Team member: "Confirm"
Auto-PDT: "Done. Marked down to $5.00."
08

Key Risks and Mitigations

Risk | Impact | Likelihood | Mitigation
LLM takes wrong navigation path on unfamiliar screens | Action fails or reaches an unexpected state | Low | LLM reads structured accessibility data, making navigation reliable. Graceful failure stops execution if the LLM is uncertain. Prompt tuning and testing across screen variants further reduce risk.
Voice recognition in noisy stores | Commands misheard or not captured | Medium | Hardware button trigger avoids false activations. LLM interprets intent, not exact words. Confirmation on write actions prevents costly errors.
Store WiFi reliability | Cloud LLM calls fail | Medium | Design for on-device fallback. Read-only actions could use simpler on-device NLU.
AccessibilityService permissions | Can't deploy the app | Low | Zebra MDM can whitelist specific accessibility services. IT controls the permitted list.
Android version fragmentation | API differences across TC51/52/53 | Low | AccessibilityService API has been stable since Android 8. Core APIs are consistent across all models.
Team member trust and adoption | Low usage, resistance to new interaction | Medium | Start with read-only commands (zero risk). Let team members see value before enabling writes. Optional, not mandatory.
LLM misinterprets command | Wrong action executed | Low | Bounded action set limits error space. Confirmation on write actions. "Did you mean...?" for ambiguous commands.
Privacy and data concerns | Voice data transmitted to cloud | Medium | On-device STT where possible. Only transcript (not audio) sent to cloud. No personal data in the voice flow.
Android tightening restrictions | Future OS limits on AccessibilityService | Low | Enterprise-managed devices are exempt from consumer restrictions. MDM whitelist overrides limitations.
09

Rollout Strategy

1 Read-Only

Prove the concept

Scope: Stock lookups, price checks, view task list

Goal: Prove voice-to-action works reliably in a real store. Measure adoption and time savings. Build trust.

Deployment: 2-3 pilot stores, volunteer team members

Zero risk No data modification possible

2 Task Mgmt

Low-risk writes

Scope: Task completion, next task navigation

Goal: Validate the confirmation model for write actions. Measure productivity impact on task throughput.

Deployment: Expand to pilot store fleet

Low risk Task completion is reversible

3 Transactional

Higher-stakes writes

Scope: Markdowns, receiving, replenishment flags

Goal: Full hands-free operational workflow. Measure end-to-end impact.

Deployment: Broader rollout based on Phase 1-2 learnings

Medium risk Financial impact on markdowns

10

What This Doesn't Do

  • Does not replace existing apps. Auto-PDT automates them. If removed, everything works exactly as before.
  • Does not require changes to existing apps or backends. Interacts through the same UI that team members use.
  • Does not require new hardware. Pure software on existing Zebra devices.
  • Does not handle communications. Not a walkie-talkie replacement or messaging tool.
  • Does not use cameras or computer vision. Reads the screen via accessibility APIs, not by taking photos.
  • Does not require internet for basic operation. On-device STT is possible on TC53. Cloud LLM is preferred but not the only path.
11

Technology Summary

Component | Technology | Maturity
Voice capture | Android SpeechRecognizer API | Stable
Natural language understanding | Cloud LLM (Haiku / Gemini Flash) | Production
Screen reading | Android AccessibilityService | Mature
UI automation | dispatchGesture / performAction | Stable
Overlay UI | TYPE_ACCESSIBILITY_OVERLAY | Stable
Device management | Zebra EMDK + MDM | Enterprise
Trigger mechanism | Zebra programmable buttons | Built-in

All required APIs are available on Android 8+ (TC51 minimum). No experimental or beta APIs needed.

12

Auto-PDT vs Blue

Dimension | Blue (iOS) | Auto-PDT (Android)
Hardware required | USB-C dongle ($299/yr) | None (software only)
Screen reading | Screen capture + computer vision | AccessibilityService (structured UI tree)
Action execution | USB HID touch injection | dispatchGesture() / performAction()
Target user | Consumer (personal phone) | Enterprise (managed store devices)
Deployment | App store + hardware | MDM-managed, sideloaded
Voice understanding | LLM-powered (free-form) | LLM-powered (free-form)
Cost per device | ~$299/year | Software + low API costs
Robustness | Visual screen parsing | LLM-driven navigation over structured accessibility tree; adapts to UI changes without code updates
13

Open Questions for Discussion

Which PDT apps are highest priority?

We need to identify the specific apps used for stock, price, and task management so we can test LLM navigation against their screens.

Store WiFi reliability?

The cloud LLM dependency needs reliable connectivity. What's the current state of WiFi on the shop floor and in backrooms?

MDM policy for accessibility services?

Can our current MDM setup whitelist a custom accessibility service? Who manages the Zebra fleet and MDM policies?

Appetite for a prototype?

If the team agrees it's viable, the next step would be a small prototype targeting stock lookups only, on a single device, in a controlled environment.

IT and Security engagement?

Deploying a custom accessibility service on managed devices will require IT and InfoSec review. When and how should we engage them?

Barcode scanner integration?

Could a voice command like "scan and check stock" trigger the scanner, capture the barcode, and perform the stock lookup? Combining physical scan with voice-driven follow-up.

Accessibility (the other kind)?

Auto-PDT could improve accessibility for team members with mobility limitations. Worth considering as part of the value proposition.

14

Next Steps

If the team agrees to proceed:

Identify the target app(s) for stock lookup and test LLM-driven navigation against their screens

Build a minimal proof-of-concept on a single TC53: voice trigger, screen read, one stock lookup action

Test in a controlled environment (not store floor) to validate the end-to-end flow

Engage IT/InfoSec for accessibility service deployment review

Assess WiFi readiness across target pilot stores