Auto-PDT
Voice-driven hands-free automation
Leveraging Android accessibility services to give Kmart team members natural voice control over their existing portable data terminals.
Executive Summary
A lightweight Android app that sits on top of existing PDT apps and automates them via natural voice commands, without requiring any changes to the current applications or backend systems. Team members speak naturally, and the system interprets their intent, reads the current screen, and performs the taps and navigation they would otherwise do manually.
The concept is inspired by Blue (heyblue.ai), a Y Combinator startup that uses a hardware dongle to enable voice control of iPhone apps. The key insight: Android, which our Zebra PDTs run, is significantly more open than iOS. What Blue needs custom hardware to achieve on iPhone, Android can do in pure software.
The Problem
Kmart team members use handheld Zebra PDTs throughout their shifts to perform operational tasks: checking stock, verifying prices, managing task lists, processing markdowns, receiving deliveries, and flagging replenishment needs. These tasks require physical interaction with the device, creating friction across several scenarios.
Hands are full
Team members are frequently carrying product, working with trolleys, stocking shelves, or handling deliveries. Stopping to use the PDT interrupts physical work.
Repetitive navigation
Many tasks require the same multi-tap flow dozens of times per shift. Stock check alone: open app, tap search, enter code, tap go, read result, go back. 50+ times per day.
Context switching
Moving between apps for different functions breaks flow and slows down the work.
Speed of service
When a customer asks "do you have more of these?", the team member currently has to stop, pull out the device, navigate, enter the product, and wait. A voice command could answer in seconds.
Inspiration: Blue (heyblue.ai)
Blue is a Y Combinator S25 startup building "the first voice assistant that can control every app on your phone." Their approach reveals both the opportunity and why Android is the right platform for us.
Blue (iOS): needs custom hardware
- USB-C hardware dongle ("Bud") required
- Bud acts as USB HID device to inject touches
- Screen capture provides visual context to LLM
- LLM determines actions, Bud executes them
- Priced at ~$299/year
- iOS blocks third-party apps from reading UI or injecting events
Auto-PDT (Android): pure software
- No hardware add-ons needed
- AccessibilityService reads any app's UI natively
- dispatchGesture() performs taps and swipes in software
- LLM interprets natural language commands
- Software licensing + low API costs only
- Android provides all required APIs out of the box
Blue validates the concept. Android makes it dramatically simpler and cheaper to implement. What they solve with custom hardware, we solve with software on existing devices.
Target Device Fleet
| Model | Android Version | Status |
|---|---|---|
| Zebra TC51 | Android 8.x (Oreo) | Legacy |
| Zebra TC52 | Android 10-11 | Current |
| Zebra TC53 | Android 13 | Newest |
All models support the required AccessibilityService APIs. Zebra devices also offer programmable hardware buttons (an ideal voice trigger), the Zebra EMDK for barcode scanner integration, MDM management for controlled deployment, and built-in barcode scanners that can complement voice input.
System Architecture
Auto-PDT is a single Android application with five core components that translate natural voice commands into automated actions within existing PDT apps.
Voice Listener
SpeechRecognizer API. Captures natural speech via hardware button trigger.
Command Interpreter
LLM-powered NLU. Speech + screen context to structured intent.
Screen Reader
AccessibilityService. Reads UI tree of any foreground app.
Action Engine
LLM-driven step-by-step navigation using the accessibility tree.
Overlay UI
Minimal floating interface for confirmations and results.
These components drive the existing PDT apps, e.g. App A (Stock / Price) and App B (Task Management), through the same UI team members use today.
Component Detail
Voice Listener
Captures the team member's spoken command and converts it to text.
- Recommended trigger: Programmable hardware button on the Zebra device (most reliable in noisy environments)
- Alternative triggers: Tap on floating overlay icon, wake word (future, less reliable in store noise)
- Processing: On-device speech-to-text where supported (Android 13+ on TC53), cloud fallback for older devices
- Output: Raw text transcript of the spoken command
Store environments are noisy. A physical button press is unambiguous, fast, and avoids false activations. Zebra devices already have configurable side buttons that are easy to reach.
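The processing rule above (on-device STT where supported, cloud fallback for older devices) can be sketched as a small routing function. This is an illustrative Python model, not the shipped code: the function name, the offline behaviour, and the threshold constant are assumptions; on device this logic would live in the Android client.

```python
# Illustrative sketch: choosing on-device vs cloud speech-to-text by the
# device's Android API level. Android 13 = API 33 (TC53 in the fleet table).
ON_DEVICE_STT_MIN_API = 33

def pick_stt_backend(android_api_level: int, wifi_available: bool) -> str:
    """Return which speech-to-text backend to use for this device."""
    if android_api_level >= ON_DEVICE_STT_MIN_API:
        return "on_device"      # TC53: private, no marginal cost, works offline
    if wifi_available:
        return "cloud"          # TC51/TC52: fall back to cloud STT
    return "unavailable"        # no viable backend: fail gracefully, no guess

# Fleet examples: TC53 (API 33), TC52 (API ~30), TC51 (API 26)
assert pick_stt_backend(33, wifi_available=False) == "on_device"
assert pick_stt_backend(30, wifi_available=True) == "cloud"
assert pick_stt_backend(26, wifi_available=False) == "unavailable"
```

The "unavailable" branch matters: rather than silently failing, the app can tell the team member that voice is offline and leave the PDT usable as before.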
Command Interpreter
The brain of the system. Team members speak naturally. "How many of these have we got," "check stock on this," "what's our count," and "do we have more out back" all resolve to the same action. No training, no command sheets, no memorisation. The same cloud LLM handles both initial intent resolution and step-by-step screen navigation during action execution.
- Input: Speech transcript + current screen context (app, screen, visible data)
- Processing: LLM resolves intent from a finite, well-defined set of actions
- Output: Structured intent, e.g.
{action: "stock_lookup", product: "12345678"}
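Because the action set is finite, the interpreter can reject anything the LLM returns that falls outside it before acting. A minimal sketch, assuming the LLM replies with JSON; the helper name and its fail-to-None error handling are illustrative, not the production code:

```python
import json

# The bounded action set (matches the Available Action Set table).
VALID_ACTIONS = {
    "stock_lookup", "price_check", "apply_markdown", "view_tasks",
    "complete_task", "next_task", "start_receiving", "confirm_delivery",
    "flag_empty_shelf", "request_fill", "navigate_back", "go_home", "open_app",
}

def parse_intent(llm_reply: str):
    """Parse the LLM's JSON reply; return None (graceful failure) if invalid."""
    try:
        intent = json.loads(llm_reply)
    except json.JSONDecodeError:
        return None
    if intent.get("action") not in VALID_ACTIONS:
        return None  # anything outside the bounded set is refused outright
    return intent

assert parse_intent('{"action": "stock_lookup", "product": "12345678"}') == \
    {"action": "stock_lookup", "product": "12345678"}
assert parse_intent('{"action": "delete_everything"}') is None
assert parse_intent("not json") is None
```

Validating against a closed list is what keeps a misheard or misinterpreted command from ever producing an action the system doesn't know.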
LLM Options
| Option | Latency | Cost | Connectivity | Capability |
|---|---|---|---|---|
| Cloud LLM (recommended) | 200-500ms | Fractions of a cent per call | Requires store WiFi | High accuracy, easy to update |
| On-device model | 50-150ms | Zero marginal | Works offline | Less capable, harder to update |
| Hybrid | 50-500ms | Low | Graceful degradation | Best of both, more complex |
Available Action Set
| Action | Category | Type |
|---|---|---|
| stock_lookup | Stock | Read |
| price_check | Price | Read |
| apply_markdown | Price | Write |
| view_tasks | Tasks | Read |
| complete_task | Tasks | Write |
| next_task | Tasks | Read |
| start_receiving | Receiving | Write |
| confirm_delivery | Receiving | Write |
| flag_empty_shelf | Replenishment | Write |
| request_fill | Replenishment | Write |
| navigate_back | Navigation | Read |
| go_home | Navigation | Read |
| open_app | Navigation | Read |
The LLM handles both classification/parameter extraction (intent resolution) and step-by-step screen navigation during execution. Multiple calls per flow, roughly 300ms each.
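The per-flow latency budget can be sanity-checked with simple arithmetic, using the roughly 300 ms per-call figure above. The 500 ms speech-to-text figure and the step counts here are assumptions for illustration only:

```python
# Back-of-envelope latency for one voice-driven flow:
# 1 intent-resolution LLM call + one navigation LLM call per UI step,
# plus speech-to-text time. All figures are rough assumptions.
LLM_CALL_MS = 300   # per-call figure quoted above

def flow_latency_ms(nav_steps: int, stt_ms: int = 500) -> int:
    """Total added latency from end of speech to result, in milliseconds."""
    return stt_ms + LLM_CALL_MS * (1 + nav_steps)

# A stock lookup needing ~2 navigation steps lands under 1.5 seconds,
# well inside what the manual multi-tap flow takes today.
assert flow_latency_ms(2) == 1400
assert flow_latency_ms(0) == 800   # zero-navigation flows (e.g. read screen)
```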
Screen Reader
Reads the current state of whatever app is on screen. This is what makes "check stock for this item" work without the user saying the product code: the Screen Reader already knows what product is on screen. From the accessibility tree it can determine:
- Which app is in the foreground (package name)
- The screen structure (buttons, text fields, labels, lists)
- Visible text content (product codes, descriptions, quantities, prices)
- Which elements are interactive (tappable, editable)
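To show what the LLM actually receives, here is a toy model of flattening a UI tree into compact text. The `Node` class is a stand-in for Android's AccessibilityNodeInfo; the field names and the output format are assumptions for illustration, not the real API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Simplified stand-in for one accessibility-tree node."""
    role: str                   # e.g. "Button", "TextView", "EditText"
    text: str = ""
    clickable: bool = False
    editable: bool = False
    children: list = field(default_factory=list)

def flatten(node: Node, depth: int = 0) -> list:
    """Render the UI tree as indented lines the LLM can read."""
    flags = [f for f, on in [("clickable", node.clickable),
                             ("editable", node.editable)] if on]
    suffix = (", " + ", ".join(flags)) if flags else ""
    lines = ["  " * depth + f"{node.role}({node.text!r}{suffix})"]
    for child in node.children:
        lines.extend(flatten(child, depth + 1))
    return lines

# The product-detail screen from the data-flow walkthrough, as a toy tree:
screen = Node("Frame", children=[
    Node("TextView", "Anko 1.7L Kettle"),
    Node("TextView", "40987654"),
    Node("Button", "Stock", clickable=True),
])
assert flatten(screen)[3] == "  Button('Stock', clickable)"
```

Because this is structured data rather than pixels, there is no OCR step and no computer vision: labels, values, and tappable elements arrive already separated.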
Action Engine
Uses LLM-driven navigation to complete each action. At every step, the engine sends the current screen's accessibility tree to the LLM. The LLM reads the structured UI, decides the next action (tap, type, scroll), and the engine executes it. The new screen state is read and sent back. This loop repeats until the goal is achieved. No pre-coded flows needed. The LLM adapts to whatever is on screen, just like a human would.
- Receives structured action from the Command Interpreter
- Reads the current screen via the accessibility tree
- Sends screen state + goal to the LLM, which decides the next UI action
- Executes that action (tap, type, scroll), then reads the updated screen
- Repeats until the goal is reached, then captures the result for the Overlay UI
Graceful failure: if the LLM cannot determine a valid next step, or a timeout is reached, the engine stops immediately, announces "Sorry, couldn't complete that", and leaves the app as-is. No forced actions, no random tapping. This is critical for trust.
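The observe-decide-act loop above can be sketched with a stubbed LLM and a fake screen sequence. Every name here is hypothetical: on a real device, `read_screen` would snapshot the accessibility tree and `execute` would call dispatchGesture() or performAction().

```python
MAX_STEPS = 10  # hard cap: stop gracefully rather than tap at random

def run_action(goal, read_screen, ask_llm, execute):
    """Read screen -> ask LLM for next step -> execute, until done or failed."""
    for _ in range(MAX_STEPS):
        screen = read_screen()
        step = ask_llm(goal, screen)     # e.g. {"op": "tap", "target": "Stock"}
        if step["op"] == "done":
            return ("ok", step.get("result"))
        if step["op"] == "fail":         # LLM uncertain: graceful failure
            return ("failed", "Sorry, couldn't complete that")
        execute(step)                    # tap / type / scroll
    return ("failed", "Sorry, couldn't complete that")

# Toy walkthrough of the stock-lookup flow from the data-flow example:
screens = iter(["product_detail", "stock_tab"])
def read_screen(): return next(screens)
def ask_llm(goal, screen):
    if screen == "product_detail":
        return {"op": "tap", "target": "Stock"}
    return {"op": "done", "result": "Stock on hand: 24. Backroom: 12."}

taps = []
status, result = run_action("stock_lookup", read_screen, ask_llm, taps.append)
assert status == "ok" and result == "Stock on hand: 24. Backroom: 12."
assert taps == [{"op": "tap", "target": "Stock"}]
```

Note that the step cap and the explicit "fail" op are the only two exits besides success: the engine can never loop forever or invent an action the LLM didn't propose.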
Overlay UI
A minimal, always-visible floating interface. Small footprint, stays out of the way, expands briefly to show results, auto-collapses.
- Floating microphone icon (tap or hardware button to activate)
- Listening indicator (pulsing animation when capturing speech)
- Command confirmation ("Checking stock for item 12345678...")
- Result display ("Stock on hand: 24. Backroom: 12.")
- Confirmation prompt for write actions
Data Flow: End-to-End
A team member is stocking shelves, holding product. They glance at the PDT screen showing a product detail page and want to know stock availability.
Wake + Capture
Team member presses side button. Says: "How many of these have we got?"
Screen Context Capture
Screen Reader snapshots the current accessibility tree.
Product: 40987654 · "Anko 1.7L Kettle"
Command Resolution
LLM receives transcript + screen context. Resolves to structured intent.
Action Execution
Action Engine sends screen state to LLM. LLM: "I see a Stock tab, tap it." Screen updates. LLM: "I see stock values, read them." Done.
Result Capture
Screen Reader captures stock values from the result screen.
Spoken Confirmation
Overlay speaks and displays the result. Auto-collapses after 5 seconds.
Write Action Safety
Any action that changes data requires an explicit voice confirmation before execution. Read-only actions (stock lookups, price checks, viewing tasks) execute immediately.
Example write intent awaiting confirmation:
{action: "apply_markdown", product: "40987654", price: 5.00}
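A sketch of the confirmation gate, assuming a helper that speaks a prompt and returns the team member's reply. The write-action list mirrors the Write-type rows in the action table; the yes-phrase matching here is deliberately simplistic and purely illustrative:

```python
# Write actions (from the action table) that must be confirmed before running.
WRITE_ACTIONS = {"apply_markdown", "complete_task", "start_receiving",
                 "confirm_delivery", "flag_empty_shelf", "request_fill"}
YES = {"yes", "yep", "confirm", "do it"}   # toy match set for illustration

def gate(intent, hear_reply):
    """Return True if the action may execute."""
    if intent["action"] not in WRITE_ACTIONS:
        return True                        # read-only: execute immediately
    prompt = f"Apply {intent['action']} to {intent.get('product', 'this item')}?"
    reply = hear_reply(prompt).strip().lower()
    return reply in YES                    # anything else cancels the write

assert gate({"action": "stock_lookup"}, hear_reply=None)
assert gate({"action": "apply_markdown", "product": "40987654"},
            hear_reply=lambda p: "Yes")
assert not gate({"action": "apply_markdown", "product": "40987654"},
                hear_reply=lambda p: "no, cancel")
```

Defaulting to "cancel" on anything that isn't a clear yes keeps a mumbled or misheard reply from ever committing a write.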
Key Risks and Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| LLM takes wrong navigation path on unfamiliar screens | Action fails or reaches an unexpected state | Low | LLM reads structured accessibility data, making navigation reliable. Graceful failure stops execution if the LLM is uncertain. Prompt tuning and testing across screen variants further reduce risk. |
| Voice recognition in noisy stores | Commands misheard or not captured | Medium | Hardware button trigger avoids false activations. LLM interprets intent, not exact words. Confirmation on write actions prevents costly errors. |
| Store WiFi reliability | Cloud LLM calls fail | Medium | Design for on-device fallback. Read-only actions could use simpler on-device NLU. |
| AccessibilityService permissions | Can't deploy the app | Low | Zebra MDM can whitelist specific accessibility services. IT controls the permitted list. |
| Android version fragmentation | API differences across TC51/52/53 | Low | AccessibilityService API has been stable since Android 8. Core APIs are consistent across all models. |
| Team member trust and adoption | Low usage, resistance to a new interaction model | Medium | Start with read-only commands (zero risk). Let team members see value before enabling writes. Optional, not mandatory. |
| LLM misinterprets command | Wrong action executed | Low | Bounded action set limits the error space. Confirmation on write actions. "Did you mean...?" for ambiguous commands. |
| Privacy and data concerns | Voice data transmitted to cloud | Medium | On-device STT where possible. Only the transcript (not audio) is sent to the cloud. No personal data in the voice flow. |
| Android tightening restrictions | Future OS limits on AccessibilityService | Low | Enterprise-managed devices are exempt from consumer restrictions. MDM whitelist overrides limitations. |
Rollout Strategy
Phase 1: Prove the concept
Scope: Stock lookups, price checks, view task list
Goal: Prove voice-to-action works reliably in a real store. Measure adoption and time savings. Build trust.
Deployment: 2-3 pilot stores, volunteer team members
Zero risk: no data modification possible.
Phase 2: Low-risk writes
Scope: Task completion, next task navigation
Goal: Validate the confirmation model for write actions. Measure productivity impact on task throughput.
Deployment: Expand to pilot store fleet
Low risk: task completion is reversible.
Phase 3: Higher-stakes writes
Scope: Markdowns, receiving, replenishment flags
Goal: Full hands-free operational workflow. Measure end-to-end impact.
Deployment: Broader rollout based on Phase 1-2 learnings
Medium risk: markdowns carry financial impact.
What This Doesn't Do
- Does not replace existing apps. Auto-PDT automates them; if removed, everything works exactly as before.
- Does not require changes to existing apps or backends. It interacts through the same UI that team members use.
- Does not require new hardware. Pure software on existing Zebra devices.
- Does not handle communications. It is not a walkie-talkie replacement or messaging tool.
- Does not use cameras or computer vision. It reads the screen via accessibility APIs, not by taking photos.
- Does not require internet for basic operation. On-device STT is possible on the TC53; the cloud LLM is preferred but not the only path.
Technology Summary
| Component | Technology | Maturity |
|---|---|---|
| Voice capture | Android SpeechRecognizer API | Stable |
| Natural language understanding | Cloud LLM (Haiku / Gemini Flash) | Production |
| Screen reading | Android AccessibilityService | Mature |
| UI automation | dispatchGesture / performAction | Stable |
| Overlay UI | TYPE_ACCESSIBILITY_OVERLAY | Stable |
| Device management | Zebra EMDK + MDM | Enterprise |
| Trigger mechanism | Zebra programmable buttons | Built-in |
All required APIs are available on Android 8+ (TC51 minimum). No experimental or beta APIs needed.
Auto-PDT vs Blue
| Dimension | Blue (iOS) | Auto-PDT (Android) |
|---|---|---|
| Hardware required | USB-C dongle ($299/yr) | None (software only) |
| Screen reading | Screen capture + computer vision | AccessibilityService (structured UI tree) |
| Action execution | USB HID touch injection | dispatchGesture() / performAction() |
| Target user | Consumer (personal phone) | Enterprise (managed store devices) |
| Deployment | App store + hardware | MDM-managed, sideloaded |
| Voice understanding | LLM-powered (free-form) | LLM-powered (free-form) |
| Cost per device | ~$299/year | Software + low API costs |
| Robustness | Visual screen parsing | LLM-driven navigation over structured accessibility tree. Adapts to UI changes without code updates. |
Open Questions for Discussion
- Which apps? We need to identify the specific apps used for stock, price, and task management so we can test LLM navigation against their screens.
- WiFi coverage: the cloud LLM dependency needs reliable connectivity. What is the current state of WiFi on the shop floor and in backrooms?
- MDM control: can our current MDM setup whitelist a custom accessibility service? Who manages the Zebra fleet and MDM policies?
- IT and InfoSec: deploying a custom accessibility service on managed devices will require IT and InfoSec review. When and how should we engage them?
- Scanner integration: could a voice command like "scan and check stock" trigger the scanner, capture the barcode, and perform the stock lookup? Combining a physical scan with voice-driven follow-up.
- Accessibility: Auto-PDT could improve access for team members with mobility limitations. Worth considering as part of the value proposition.
Next Steps
If the team agrees it's viable, the next step is a small prototype targeting stock lookups only, on a single device, in a controlled environment.