
From Pilot to P&L: Making AI Pay Off in Retail

  • Writer: Aria Irizarry
  • 4 days ago
  • 7 min read

Updated: 1 hour ago



Panelists: Shirley Gao, Chief Digital & Information Officer, PacSun  •  Divyangna Singh, Director of Product Management, BJ’s Wholesale Club  •  Roberto Croce, Former SVP International, American Eagle 

Moderated by: Christian Floerkemeier, VP of Product, CTO & Co-Founder, Scandit 

Sponsored by Scandit. 



Nearly every retailer has run an AI pilot. Far fewer have seen one show up on a P&L. That gap is where the real work happens, and it is exactly what this conversation was about.


Shirley Gao, Divyangna Singh, and Roberto Croce have all been doing this long enough to have failed at it, learned from it, and built things that actually worked. Christian Floerkemeier, whose company Scandit has deployed vision AI across hundreds of retailers, kept the conversation grounded and added his own hard-won perspective throughout. If you are trying to move AI off the slide deck and onto the balance sheet, this one is worth your time.


How to Measure AI Returns: The Framework

Gao opened with the structure she uses to evaluate any AI investment. She splits returns into three buckets, a framework Singh and Croce recognized immediately and began filling in with their own examples.


Hard returns are the numbers you can put in a business case: revenue lift tracked through conversion rate, cart value, and purchase propensity, and cost reduction across customer service labor, headcount, and data entry hours. These are the ones your CFO will ask about first. 


Soft returns are real but harder to pin down. Brand discoverability across social and AI platforms. Strategic readiness: being present in a channel before it is proven. Fraud and account takeover prevention. Gao’s take on fraud was pragmatic: the value is impossible to quantify until something goes wrong, at which point it is worth hundreds of millions. You are betting on a number that never appears in a spreadsheet, and hoping it stays that way.


Long-term value is the third bucket and the one people tend to skip. It is about how deeply a solution integrates into your existing tech stack and how broadly it reaches across the business. A tool that improves one paid channel is a different investment than one that changes how the whole brand makes decisions. 



The AI stylist on PacSun’s website illustrates where those categories get complicated. A customer planning a Hawaii trip opens a chat, describes what they need, and gets outfit recommendations. PacSun compared buyers who used it against those who did not and saw what annualized to around $10 million in revenue lift. But Gao was clear about what that number actually proves: you cannot run a true counterfactual. There is no way to know whether those customers would have bought anyway, and the real point is simply having the door open when customers are ready.
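As a rough sketch of the kind of cohort comparison described above, the arithmetic amounts to taking the spend gap between users and non-users and scaling it up. All numbers and names below are invented for illustration, and, as Gao notes, this is not a true counterfactual:

```python
# Hypothetical sketch of a cohort-based revenue-lift estimate: compare
# average spend for customers who used a feature against those who did
# not, then annualize across the cohort. The cohorts self-select, so
# the result is a directional signal, not proof of incremental revenue.

def annualized_lift(users_spend, nonusers_spend, users_count, period_days=30):
    """Annualized revenue difference implied by a cohort comparison."""
    avg_user = sum(users_spend) / len(users_spend)
    avg_nonuser = sum(nonusers_spend) / len(nonusers_spend)
    per_customer = avg_user - avg_nonuser
    # Scale the per-period gap across the cohort to a full year.
    return per_customer * users_count * (365 / period_days)

# Toy 30-day cohorts, entirely made up:
lift = annualized_lift(
    users_spend=[120.0, 95.0, 140.0],
    nonusers_spend=[80.0, 70.0, 90.0],
    users_count=50_000,
)
```

Even a sketch this small makes the caveat visible: everything hinges on the assumption that the two cohorts were comparable to begin with.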



The Conversation that Kills Pilots Before They Start 

Gao raised something that most people on the technical side of retail AI have bumped into but rarely say out loud. All three panelists sit on the technical side of their businesses, and she was direct about the problem that creates. 


When a technology leader frames AI to the business as a headcount reduction play, the conversation tends to die. Her actual words: "If we are telling the business: adopt this technology and you are going to look forward to eliminating 10% of your team, that is going to be a very hard conversation." Her approach is to not start there. Get the technology accepted first. Let teams experience the shift from manual data entry to actual decision-making. Let the efficiency story surface on its own. 


Croce came at the same point from a different direction. He is a merchant by background, not a technologist, and the way he frames any AI conversation starts with the business problem, not the tool. "You have got to have a customer problem or a business problem you are trying to solve. That is where the partnership between technology and the business actually comes to fruition." For him it was never about eliminating jobs. It was about giving people better tools so they could do what they were already supposed to be doing. 



The Four-Legged Stool

When Floerkemeier asked Singh how she resources AI rollouts, her answer was a model she has refined across years of building ML systems. 


She called it a four-legged stool: Product, Data Science, Engineering, and the Business. All four need to be genuinely involved from the start, not just kept informed. Pull any one away and the whole thing tips over. 


The Business defines what success looks like and owns the outcome. Product decides how the solution rolls out and what it looks like in practice. Engineering makes it real and keeps it from breaking six months after launch. Data Science certifies the results and, critically, translates them into something the business can actually use. 


That translation piece is where rollouts lose people. A model accuracy percentage means nothing to a merchant. A concrete change in inventory turn or margin is something they can act on. Most teams do not have anyone doing that translation consistently, and it shows up in how fast trust in the AI erodes.


Your Data Has Two Spellings of Large in It

Singh told a story about an early demand forecasting model her team tried to build. The plan was to use product attributes as inputs. It did not work. When they dug into why, they found the same black T-shirt across multiple SKUs with Large entered two different ways. One record had it as LG. Another had it spelled out as L-A-R-G-E. The learning algorithm had no way of knowing these referred to the same product.



Everyone in the room had lived some version of this. The instinct is to clean the data, which helps, but does not fix what created the problem. Build the UIs your teams use every day so that entering inconsistent values is structurally difficult. If the size field is a dropdown, you cannot end up with two variants of Large downstream.
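A minimal sketch of that input-level fix, with hypothetical field names and an assumed alias mapping: reject or canonicalize free-text sizes at write time, so inconsistent values never reach the catalog in the first place.

```python
# Hypothetical sketch: constrain the size field at the point of entry
# so variants like "LG" and "LARGE" can never land in the catalog as
# distinct values. The canonical set and aliases are illustrative.

CANONICAL_SIZES = {"XS", "S", "M", "L", "XL"}

# Known legacy variants mapped onto the canonical set (assumed mapping).
ALIASES = {"LG": "L", "LARGE": "L", "SM": "S", "SMALL": "S", "MED": "M"}

def normalize_size(raw: str) -> str:
    """Return the canonical size, or raise on anything unrecognized."""
    value = raw.strip().upper()
    value = ALIASES.get(value, value)
    if value not in CANONICAL_SIZES:
        raise ValueError(
            f"Unknown size {raw!r}; pick one of {sorted(CANONICAL_SIZES)}"
        )
    return value

# Both legacy spellings collapse to the same SKU attribute:
assert normalize_size("LG") == normalize_size("large") == "L"
```

A dropdown in the UI is the same idea enforced one layer earlier; the validator just guarantees it for every write path, including bulk imports.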


Croce added the organizational layer that makes all of this hard in practice. Product descriptions, customer records, loyalty history, and inventory data are all managed by different people with different workflows. Getting it right requires those teams to work on it together, not just hand the mess to data engineering periodically. "Data science is the quarterback," Croce said. "But the other teams are key stakeholders. If they are not involved, they are just going to be unhappy with the results."



The Feedback Loop: The Piece Nobody Designs In

Singh was more emphatic about this than anything else she said. The feedback loop is the most important component of any ML or AI deployment, and it is consistently the piece that gets underdesigned or skipped entirely. 


Early ML systems asked organizations to trust the output. The model knows. Take the result. What agentic AI changes is that feedback can now happen in plain language. An inventory planner can tell the system: you got this wrong, the weather shifted, there was a competitor opening nearby, none of that was in your data. That correction feeds back into the model.  



Without the loop, models drift. Something that worked in the POC starts producing worse outputs as conditions change. By the time the business notices, you have already burned through goodwill. Singh’s point was blunt: build the feedback mechanism in from the start. Not as a phase two. From day one. 


She also shared a cautionary note about simulations. Early in her career, her team built one that gave results they were confident in. When they ran an A/B test in a smaller production group, the numbers were completely different. The technology has matured and simulations can now incorporate far more variables. But it still requires deliberately building in that complexity rather than assuming the simulation will reflect what actually happens in the field. 



The Honest Number on How Often AI Pilots Succeed 

When Floerkemeier asked the panel what the typical success rate on AI pilots actually is, Gao put a number on it. Roughly 50/50. And after a few cycles at that rate, the business gets tired. Stakeholders who started out curious become skeptical. The expectation underneath a lot of these conversations, that AI will be fully automated, always accurate, and essentially do the work for you, does not match where the technology is today. 


In Gao’s words: "Their expectation is: bring me something fully automated and accurate. But we are not there yet." 


Croce offered a more graduated view that is worth sitting alongside Gao’s. For fast test-and-learn work, things like outfitting recommendations or figuring out the right marketing assets to put in front of customers, a 10 to 15% hit rate is acceptable as long as you are learning quickly and moving on. For initiatives with real P&L exposure, you need to be closer to 50% before you scale. For foundational strategy work, treat it as fail-forward from the start: not expecting perfection on launch day, but committing to keep going until you get there. 


Singh connected it back to change management. The organizations that handle pilot failure well are the ones that set expectations before the first result comes in. Getting alignment upfront on the possibility of failure does not make it easier when it happens. But it stops one bad result from derailing everything that follows.


Before Your Next AI Pilot, Three Things Are Worth Getting Right


Lock in your KPIs before the pilot starts, not after.

Not directional signals. Actual metrics with a baseline, an owner, and a measurement timeline. If you cannot name the metric and who is accountable for it before the work begins, you are running a research project, not a business initiative. 
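That checklist is simple enough to encode directly. A hypothetical sketch of a pilot "charter" object that refuses to exist without a named metric, a baseline, an accountable owner, and a measurement window (all names and values invented):

```python
# Hypothetical sketch: a pilot KPI that cannot be created without the
# four things named above -- metric, baseline, owner, and timeline.

from dataclasses import dataclass

@dataclass(frozen=True)
class PilotKPI:
    metric: str            # e.g. "conversion rate"
    baseline: float        # the value before the pilot starts
    owner: str             # the person accountable for the number
    measurement_days: int  # how long until the result is read

    def __post_init__(self):
        if not self.metric or not self.owner:
            raise ValueError("A KPI needs a named metric and an accountable owner.")
        if self.measurement_days <= 0:
            raise ValueError("Set a measurement timeline before the work begins.")

kpi = PilotKPI(metric="conversion rate", baseline=0.031,
               owner="VP Ecommerce", measurement_days=90)
```

Whether it lives in code or in a kickoff document, the forcing function is the same: if any field is blank, the pilot is a research project, not a business initiative.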


Fix the data before you blame the algorithm.

Every organization has a version of the T-shirt with two spellings of Large. Cleaning it is a start. Redesigning the tool that created the problem so it cannot happen again is the real fix. Data design at the input level saves enormous effort downstream and is still treated as an afterthought in most deployments. 


Design the feedback loop in from day one.

Agentic tools have made this more achievable than ever: plain-language corrections, real-time drift detection, humans actively teaching the model what it missed. But it will not happen by accident. The teams that build it in deliberately will keep improving every quarter. The ones that skip it will keep asking why it worked in the POC and not in production. 
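The drift-detection piece can start very simply: compare recent prediction error against the error the model had at launch, and flag when it degrades past a threshold. A hypothetical sketch, with all thresholds and numbers invented:

```python
# Hypothetical sketch of minimal drift detection: flag the model when
# its rolling error climbs well above the error measured at launch.

from collections import deque

class DriftMonitor:
    def __init__(self, baseline_error: float, window: int = 100,
                 tolerance: float = 1.5):
        self.baseline_error = baseline_error  # mean absolute error at launch
        self.errors = deque(maxlen=window)    # rolling window of recent errors
        self.tolerance = tolerance            # e.g. 1.5x baseline triggers a flag

    def observe(self, predicted: float, actual: float) -> None:
        self.errors.append(abs(predicted - actual))

    def drifting(self) -> bool:
        if not self.errors:
            return False
        rolling = sum(self.errors) / len(self.errors)
        return rolling > self.baseline_error * self.tolerance

# Toy example: launch-time error was ~10 units; recent misses are larger.
monitor = DriftMonitor(baseline_error=10.0)
for predicted, actual in [(100, 92), (110, 85), (95, 70)]:
    monitor.observe(predicted, actual)
```

Nothing here requires an agentic stack; the agentic layer is what lets the humans explain the flag in plain language once it fires.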


Conversations like this one are exactly why the Council Collab exists. Real leaders, real problems, no slides. 



Want to be Part of the Next Conversation? 


Council Collabs are free for Retail AI Council members. Join to get notified of upcoming roundtables, access session recaps, and connect with the retail leaders shaping how AI actually gets built and deployed. 





© 2025 Retail AI Council, LLC.
