Auto-PDT
Voice-driven hands-free automation
Leveraging Android accessibility services to give Kmart team members natural voice control over their existing portable data terminals.
Executive Summary
A lightweight Android app that sits on top of existing PDT apps and automates them via natural voice commands, without requiring any changes to the current applications or backend systems. Team members speak naturally, and the system interprets their intent, reads the current screen, and performs the taps and navigation they would otherwise do manually.
The concept is inspired by Blue (heyblue.ai), a Y Combinator startup that uses a hardware dongle to enable voice control of iPhone apps. The key insight: Android, which our Zebra PDTs run, is significantly more open than iOS. What Blue needs custom hardware to achieve on iPhone, Android can do in pure software.
The Problem
Kmart team members use handheld Zebra PDTs throughout their shifts to perform operational tasks: checking stock, verifying prices, managing task lists, processing markdowns, receiving deliveries, and flagging replenishment needs. These tasks require physical interaction with the device, creating friction across several scenarios.
Hands are full
Team members are frequently carrying product, working with trolleys, stocking shelves, or handling deliveries. Stopping to use the PDT interrupts physical work.
Repetitive navigation
Many tasks require the same multi-tap flow dozens of times per shift. Stock check alone: open app, tap search, enter code, tap go, read result, go back. 50+ times per day.
Context switching
Moving between apps for different functions breaks flow and slows down the work.
Speed of service
When a customer asks "do you have more of these?", the team member currently has to stop, pull out the device, navigate, enter the product, and wait. A voice command could answer in seconds.
Inspiration: Blue (heyblue.ai)
Blue is a Y Combinator S25 startup building "the first voice assistant that can control every app on your phone." Their approach reveals both the opportunity and why Android is the right platform for us.
Blue (iOS): needs custom hardware
- USB-C hardware dongle ("Bud") required
- Bud acts as USB HID device to inject touches
- Screen capture provides visual context to LLM
- LLM determines actions, Bud executes them
- Priced at ~$299/year
- iOS blocks third-party apps from reading UI or injecting events
Auto-PDT (Android): pure software
- No hardware add-ons needed
- AccessibilityService reads any app's UI natively
- dispatchGesture() performs taps and swipes in software
- LLM interprets natural language commands
- Software licensing + low API costs only
- Android provides all required APIs out of the box
Blue validates the concept. Android makes it dramatically simpler and cheaper to implement. What they solve with custom hardware, we solve with software on existing devices.
Target Device Fleet
| Model | Android Version | Status |
|---|---|---|
| Zebra TC51 | Android 8.x (Oreo) | Legacy |
| Zebra TC52 | Android 10-11 | Current |
| Zebra TC53 | Android 13 | Newest |
All models support the required AccessibilityService APIs. Zebra devices also offer programmable hardware buttons (an ideal voice trigger), the Zebra EMDK for barcode scanner integration, MDM management for controlled deployment, and built-in barcode scanners that can complement voice input.
System Architecture
Auto-PDT is a single Android application with five core components that translate natural voice commands into automated actions within existing PDT apps.
Voice Listener
SpeechRecognizer API. Captures natural speech via hardware button trigger.
Command Interpreter
LLM-powered NLU. Speech + screen context to structured intent.
Screen Reader
AccessibilityService. Reads UI tree of any foreground app.
Action Engine
LLM-driven step-by-step navigation using the accessibility tree.
Overlay UI
Minimal floating interface for confirmations and results.
These components drive the existing PDT apps, e.g. App A (Stock / Price) and App B (Task Management), through the same UI team members use today.
Component Detail
Voice Listener
Captures the team member's spoken command and converts it to text.
- Recommended trigger: Programmable hardware button on the Zebra device (most reliable in noisy environments)
- Alternative triggers: Tap on floating overlay icon, wake word (future, less reliable in store noise)
- Processing: On-device speech-to-text where supported (Android 13+ on TC53), cloud fallback for older devices
- Output: Raw text transcript of the spoken command
Store environments are noisy. A physical button press is unambiguous, fast, and avoids false activations. Zebra devices already have configurable side buttons that are easy to reach.
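The processing rule above (on-device STT where supported, cloud fallback for older devices) can be sketched as a small routing function. This is an illustrative Python model, not the shipped code: the function name, the offline behaviour, and the threshold constant are assumptions; on device this logic would live in the Android client.

```python
# Illustrative sketch: choosing on-device vs cloud speech-to-text by the
# device's Android API level. Android 13 = API 33 (TC53 in the fleet table).
ON_DEVICE_STT_MIN_API = 33

def pick_stt_backend(android_api_level: int, wifi_available: bool) -> str:
    """Return which speech-to-text backend to use for this device."""
    if android_api_level >= ON_DEVICE_STT_MIN_API:
        return "on_device"      # TC53: private, no marginal cost, works offline
    if wifi_available:
        return "cloud"          # TC51/TC52: fall back to cloud STT
    return "unavailable"        # no viable backend: fail gracefully, no guess

# Fleet examples: TC53 (API 33), TC52 (API ~30), TC51 (API 26)
assert pick_stt_backend(33, wifi_available=False) == "on_device"
assert pick_stt_backend(30, wifi_available=True) == "cloud"
assert pick_stt_backend(26, wifi_available=False) == "unavailable"
```

The "unavailable" branch matters: rather than silently failing, the app can tell the team member that voice is offline and leave the PDT usable as before.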
Command Interpreter
The brain of the system. Team members speak naturally. "How many of these have we got," "check stock on this," "what's our count," and "do we have more out back" all resolve to the same action. No training, no command sheets, no memorisation. The same cloud LLM handles both initial intent resolution and step-by-step screen navigation during action execution.
- Input: Speech transcript + current screen context (app, screen, visible data)
- Processing: LLM resolves intent from a finite, well-defined set of actions
- Output: Structured intent, e.g.
{action: "stock_lookup", product: "12345678"}
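Because the action set is finite, the interpreter can reject anything the LLM returns that falls outside it before acting. A minimal sketch, assuming the LLM replies with JSON; the helper name and its fail-to-None error handling are illustrative, not the production code:

```python
import json

# The bounded action set (matches the Available Action Set table).
VALID_ACTIONS = {
    "stock_lookup", "price_check", "apply_markdown", "view_tasks",
    "complete_task", "next_task", "start_receiving", "confirm_delivery",
    "flag_empty_shelf", "request_fill", "navigate_back", "go_home", "open_app",
}

def parse_intent(llm_reply: str):
    """Parse the LLM's JSON reply; return None (graceful failure) if invalid."""
    try:
        intent = json.loads(llm_reply)
    except json.JSONDecodeError:
        return None
    if intent.get("action") not in VALID_ACTIONS:
        return None  # anything outside the bounded set is refused outright
    return intent

assert parse_intent('{"action": "stock_lookup", "product": "12345678"}') == \
    {"action": "stock_lookup", "product": "12345678"}
assert parse_intent('{"action": "delete_everything"}') is None
assert parse_intent("not json") is None
```

Validating against a closed list is what keeps a misheard or misinterpreted command from ever producing an action the system doesn't know.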
LLM Options
| Option | Latency | Cost | Connectivity | Capability |
|---|---|---|---|---|
| Cloud LLM (recommended) | 200-500ms | Fractions of a cent per call | Requires store WiFi | High accuracy, easy to update |
| On-device model | 50-150ms | Zero marginal | Works offline | Less capable, harder to update |
| Hybrid | 50-500ms | Low | Graceful degradation | Best of both, more complex |
Available Action Set
| Action | Category | Type |
|---|---|---|
| stock_lookup | Stock | Read |
| price_check | Price | Read |
| apply_markdown | Price | Write |
| view_tasks | Tasks | Read |
| complete_task | Tasks | Write |
| next_task | Tasks | Read |
| start_receiving | Receiving | Write |
| confirm_delivery | Receiving | Write |
| flag_empty_shelf | Replenishment | Write |
| request_fill | Replenishment | Write |
| navigate_back | Navigation | Read |
| go_home | Navigation | Read |
| open_app | Navigation | Read |
The LLM handles both classification/parameter extraction (intent resolution) and step-by-step screen navigation during execution. Multiple calls per flow, roughly 300ms each.
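The per-flow latency budget can be sanity-checked with simple arithmetic, using the roughly 300 ms per-call figure above. The 500 ms speech-to-text figure and the step counts here are assumptions for illustration only:

```python
# Back-of-envelope latency for one voice-driven flow:
# 1 intent-resolution LLM call + one navigation LLM call per UI step,
# plus speech-to-text time. All figures are rough assumptions.
LLM_CALL_MS = 300   # per-call figure quoted above

def flow_latency_ms(nav_steps: int, stt_ms: int = 500) -> int:
    """Total added latency from end of speech to result, in milliseconds."""
    return stt_ms + LLM_CALL_MS * (1 + nav_steps)

# A stock lookup needing ~2 navigation steps lands under 1.5 seconds,
# well inside what the manual multi-tap flow takes today.
assert flow_latency_ms(2) == 1400
assert flow_latency_ms(0) == 800   # zero-navigation flows (e.g. read screen)
```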
Screen Reader
Reads the current state of whatever app is on screen. This is what makes "check stock for this item" work without the user saying the product code: the Screen Reader already knows what product is on screen. From the accessibility tree it can determine:
- Which app is in the foreground (package name)
- The screen structure (buttons, text fields, labels, lists)
- Visible text content (product codes, descriptions, quantities, prices)
- Which elements are interactive (tappable, editable)
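To show what the LLM actually receives, here is a toy model of flattening a UI tree into compact text. The `Node` class is a stand-in for Android's AccessibilityNodeInfo; the field names and the output format are assumptions for illustration, not the real API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Simplified stand-in for one accessibility-tree node."""
    role: str                   # e.g. "Button", "TextView", "EditText"
    text: str = ""
    clickable: bool = False
    editable: bool = False
    children: list = field(default_factory=list)

def flatten(node: Node, depth: int = 0) -> list:
    """Render the UI tree as indented lines the LLM can read."""
    flags = [f for f, on in [("clickable", node.clickable),
                             ("editable", node.editable)] if on]
    suffix = (", " + ", ".join(flags)) if flags else ""
    lines = ["  " * depth + f"{node.role}({node.text!r}{suffix})"]
    for child in node.children:
        lines.extend(flatten(child, depth + 1))
    return lines

# The product-detail screen from the data-flow walkthrough, as a toy tree:
screen = Node("Frame", children=[
    Node("TextView", "Anko 1.7L Kettle"),
    Node("TextView", "40987654"),
    Node("Button", "Stock", clickable=True),
])
assert flatten(screen)[3] == "  Button('Stock', clickable)"
```

Because this is structured data rather than pixels, there is no OCR step and no computer vision: labels, values, and tappable elements arrive already separated.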
Action Engine
Uses LLM-driven navigation to complete each action. At every step, the engine sends the current screen's accessibility tree to the LLM. The LLM reads the structured UI, decides the next action (tap, type, scroll), and the engine executes it. The new screen state is read and sent back. This loop repeats until the goal is achieved. No pre-coded flows needed. The LLM adapts to whatever is on screen, just like a human would.
- Receives structured action from the Command Interpreter
- Reads the current screen via the accessibility tree
- Sends screen state + goal to the LLM, which decides the next UI action
- Executes that action (tap, type, scroll), then reads the updated screen
- Repeats until the goal is reached, then captures the result for the Overlay UI
Graceful failure: if the LLM cannot determine a valid next step, or a timeout is reached, the engine stops immediately, announces "Sorry, couldn't complete that", and leaves the app as-is. No forced actions, no random tapping. This is critical for trust.
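The observe-decide-act loop above can be sketched with a stubbed LLM and a fake screen sequence. Every name here is hypothetical: on a real device, `read_screen` would snapshot the accessibility tree and `execute` would call dispatchGesture() or performAction().

```python
MAX_STEPS = 10  # hard cap: stop gracefully rather than tap at random

def run_action(goal, read_screen, ask_llm, execute):
    """Read screen -> ask LLM for next step -> execute, until done or failed."""
    for _ in range(MAX_STEPS):
        screen = read_screen()
        step = ask_llm(goal, screen)     # e.g. {"op": "tap", "target": "Stock"}
        if step["op"] == "done":
            return ("ok", step.get("result"))
        if step["op"] == "fail":         # LLM uncertain: graceful failure
            return ("failed", "Sorry, couldn't complete that")
        execute(step)                    # tap / type / scroll
    return ("failed", "Sorry, couldn't complete that")

# Toy walkthrough of the stock-lookup flow from the data-flow example:
screens = iter(["product_detail", "stock_tab"])
def read_screen(): return next(screens)
def ask_llm(goal, screen):
    if screen == "product_detail":
        return {"op": "tap", "target": "Stock"}
    return {"op": "done", "result": "Stock on hand: 24. Backroom: 12."}

taps = []
status, result = run_action("stock_lookup", read_screen, ask_llm, taps.append)
assert status == "ok" and result == "Stock on hand: 24. Backroom: 12."
assert taps == [{"op": "tap", "target": "Stock"}]
```

Note that the step cap and the explicit "fail" op are the only two exits besides success: the engine can never loop forever or invent an action the LLM didn't propose.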
Overlay UI
A minimal, always-visible floating interface. Small footprint, stays out of the way, expands briefly to show results, auto-collapses.
- Floating microphone icon (tap or hardware button to activate)
- Listening indicator (pulsing animation when capturing speech)
- Command confirmation ("Checking stock for item 12345678...")
- Result display ("Stock on hand: 24. Backroom: 12.")
- Confirmation prompt for write actions
Data Flow: End-to-End
A team member is stocking shelves, holding product. They glance at the PDT screen showing a product detail page and want to know stock availability.
Wake + Capture
Team member presses side button. Says: "How many of these have we got?"
Screen Context Capture
Screen Reader snapshots the current accessibility tree.
Product: 40987654 · "Anko 1.7L Kettle"
Command Resolution
LLM receives transcript + screen context. Resolves to structured intent.
Action Execution
Action Engine sends screen state to LLM. LLM: "I see a Stock tab, tap it." Screen updates. LLM: "I see stock values, read them." Done.
Result Capture
Screen Reader captures stock values from the result screen.
Spoken Confirmation
Overlay speaks and displays the result. Auto-collapses after 5 seconds.
Write Action Safety
Any action that changes data requires an explicit voice confirmation before execution. Read-only actions (stock lookups, price checks, viewing tasks) execute immediately.
Example write intent awaiting confirmation:
{action: "apply_markdown", product: "40987654", price: 5.00}
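A sketch of the confirmation gate, assuming a helper that speaks a prompt and returns the team member's reply. The write-action list mirrors the Write-type rows in the action table; the yes-phrase matching here is deliberately simplistic and purely illustrative:

```python
# Write actions (from the action table) that must be confirmed before running.
WRITE_ACTIONS = {"apply_markdown", "complete_task", "start_receiving",
                 "confirm_delivery", "flag_empty_shelf", "request_fill"}
YES = {"yes", "yep", "confirm", "do it"}   # toy match set for illustration

def gate(intent, hear_reply):
    """Return True if the action may execute."""
    if intent["action"] not in WRITE_ACTIONS:
        return True                        # read-only: execute immediately
    prompt = f"Apply {intent['action']} to {intent.get('product', 'this item')}?"
    reply = hear_reply(prompt).strip().lower()
    return reply in YES                    # anything else cancels the write

assert gate({"action": "stock_lookup"}, hear_reply=None)
assert gate({"action": "apply_markdown", "product": "40987654"},
            hear_reply=lambda p: "Yes")
assert not gate({"action": "apply_markdown", "product": "40987654"},
                hear_reply=lambda p: "no, cancel")
```

Defaulting to "cancel" on anything that isn't a clear yes keeps a mumbled or misheard reply from ever committing a write.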
Key Risks and Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| LLM takes wrong navigation path on unfamiliar screens | Action fails or reaches an unexpected state | Low | LLM reads structured accessibility data, making navigation reliable. Graceful failure stops execution if the LLM is uncertain. Prompt tuning and testing across screen variants further reduce risk. |
| Voice recognition in noisy stores | Commands misheard or not captured | Medium | Hardware button trigger avoids false activations. LLM interprets intent, not exact words. Confirmation on write actions prevents costly errors. |
| Store WiFi reliability | Cloud LLM calls fail | Medium | Design for on-device fallback. Read-only actions could use simpler on-device NLU. |
| AccessibilityService permissions | Can't deploy the app | Low | Zebra MDM can whitelist specific accessibility services. IT controls the permitted list. |
| Android version fragmentation | API differences across TC51/52/53 | Low | AccessibilityService API has been stable since Android 8. Core APIs are consistent across all models. |
| Team member trust and adoption | Low usage, resistance to a new interaction model | Medium | Start with read-only commands (zero risk). Let team members see value before enabling writes. Optional, not mandatory. |
| LLM misinterprets command | Wrong action executed | Low | Bounded action set limits the error space. Confirmation on write actions. "Did you mean...?" for ambiguous commands. |
| Privacy and data concerns | Voice data transmitted to cloud | Medium | On-device STT where possible. Only the transcript (not audio) is sent to the cloud. No personal data in the voice flow. |
| Android tightening restrictions | Future OS limits on AccessibilityService | Low | Enterprise-managed devices are exempt from consumer restrictions. MDM whitelist overrides limitations. |
Rollout Strategy
Phase 1: Prove the concept
Scope: Stock lookups, price checks, view task list
Goal: Prove voice-to-action works reliably in a real store. Measure adoption and time savings. Build trust.
Deployment: 2-3 pilot stores, volunteer team members
Zero risk: no data modification possible.
Phase 2: Low-risk writes
Scope: Task completion, next task navigation
Goal: Validate the confirmation model for write actions. Measure productivity impact on task throughput.
Deployment: Expand to pilot store fleet
Low risk: task completion is reversible.
Phase 3: Higher-stakes writes
Scope: Markdowns, receiving, replenishment flags
Goal: Full hands-free operational workflow. Measure end-to-end impact.
Deployment: Broader rollout based on Phase 1-2 learnings
Medium risk: markdowns carry financial impact.
What This Doesn't Do
- Does not replace existing apps. Auto-PDT automates them; if removed, everything works exactly as before.
- Does not require changes to existing apps or backends. It interacts through the same UI that team members use.
- Does not require new hardware. Pure software on existing Zebra devices.
- Does not handle communications. It is not a walkie-talkie replacement or messaging tool.
- Does not use cameras or computer vision. It reads the screen via accessibility APIs, not by taking photos.
- Does not require internet for basic operation. On-device STT is possible on the TC53; the cloud LLM is preferred but not the only path.
Technology Summary
| Component | Technology | Maturity |
|---|---|---|
| Voice capture | Android SpeechRecognizer API | Stable |
| Natural language understanding | Cloud LLM (Haiku / Gemini Flash) | Production |
| Screen reading | Android AccessibilityService | Mature |
| UI automation | dispatchGesture / performAction | Stable |
| Overlay UI | TYPE_ACCESSIBILITY_OVERLAY | Stable |
| Device management | Zebra EMDK + MDM | Enterprise |
| Trigger mechanism | Zebra programmable buttons | Built-in |
All required APIs are available on Android 8+ (TC51 minimum). No experimental or beta APIs needed.
Auto-PDT vs Blue
| Dimension | Blue (iOS) | Auto-PDT (Android) |
|---|---|---|
| Hardware required | USB-C dongle ($299/yr) | None (software only) |
| Screen reading | Screen capture + computer vision | AccessibilityService (structured UI tree) |
| Action execution | USB HID touch injection | dispatchGesture() / performAction() |
| Target user | Consumer (personal phone) | Enterprise (managed store devices) |
| Deployment | App store + hardware | MDM-managed, sideloaded |
| Voice understanding | LLM-powered (free-form) | LLM-powered (free-form) |
| Cost per device | ~$299/year | Software + low API costs |
| Robustness | Visual screen parsing | LLM-driven navigation over structured accessibility tree. Adapts to UI changes without code updates. |
Open Questions for Discussion
- Which apps? We need to identify the specific apps used for stock, price, and task management so we can test LLM navigation against their screens.
- WiFi coverage: the cloud LLM dependency needs reliable connectivity. What is the current state of WiFi on the shop floor and in backrooms?
- MDM control: can our current MDM setup whitelist a custom accessibility service? Who manages the Zebra fleet and MDM policies?
- IT and InfoSec: deploying a custom accessibility service on managed devices will require IT and InfoSec review. When and how should we engage them?
- Scanner integration: could a voice command like "scan and check stock" trigger the scanner, capture the barcode, and perform the stock lookup? Combining a physical scan with voice-driven follow-up.
- Accessibility: Auto-PDT could improve access for team members with mobility limitations. Worth considering as part of the value proposition.
Next Steps
If the team agrees it's viable, the next step is a small prototype targeting stock lookups only, on a single device, in a controlled environment.