Mining Google Maps Reviews for Unmet Customer Demand
A JupyterLab workflow for extracting Google Maps reviews, filtering for causally useful customer experience signals, and turning local competitor reviews into unmet-demand analysis.
Mining Google Maps Reviews for Unmet Customer Demand
The most useful reviews are not always the angriest ones.
Five-star reviews show what already works. One- and two-star reviews show serious failure or customer-business mismatch. But three- and four-star reviews are an interesting case. Fundamentally, they are positive reviews, yet they are still not five stars. Something was missing.
They are the reviews where the customer's demands were met, where they got what they paid for, and received what the business promised, but in spite of this, for some reason, probably won't go back. The customer got what they paid for, but not enough to care.
Three and four star reviews are the unmet demand, that when properly identified, can lend serious advantage in a competitive business landscape.
Most local competitor analysis stops too early.
It can show where competitors are located. It can show how many businesses exist in a suburb, street, or search category. It can scrape names, addresses, websites, ratings, and sometimes review counts. That is useful, but it mostly describes the supply side of the market.
It tells you where businesses are.
It does not automatically tell you where unmet need still exists. This information is invaluable for business positioning, designing new products or services, or simply realising that your own business already meets this demand, but you haven't yet highlighted it properly.
That is the gap this notebook is designed to address.
The workflow uses Google Maps as a local-market discovery layer, then extracts reviews from a small set of businesses, filters those reviews for causally useful information, and uses an LLM to convert customer language into operational signals: journey friction, expectation mismatches, staff interactions, constraints, recovery failures, satisfaction patterns, persona cues, and unmet demand.
The goal is not to produce a generic sentiment summary. The goal is to move from competitor mapping to demand mapping.
competitor locations
→ reviews
→ causal customer-experience signals
→ journey failures
→ unmet demand
→ local positioning opportunities
The business demand this notebook meets
A local business does not only need to know who its competitors are.
It needs to know:
- what customers expected before choosing a business
- what happened during the service experience
- which parts of the journey created friction
- what customers praise when the experience succeeds
- what customers complain about when the experience fails
- whether poor reviews reveal isolated mistakes or repeated unmet demand
- how the local market can still be entered, differentiated, or repositioned
This matters because competitor density alone is ambiguous.
A suburb can be dense with cafés, dentists, tattoo studios, gyms, mechanics, or clinics and still contain unmet demand. The opportunity is not always “there are not enough competitors”. Sometimes the opportunity is “there are enough competitors, but they are failing customers in the same repeated ways”.
That is the strategic difference.
Competitor density shows where supply exists.
Review-derived journey mapping shows where the market is still failing.
This notebook is a small working prototype of that idea.
What the workflow produces
The notebook produces one CSV file:
reviews.csv
The file contains all causally useful reviews, including the rating, review text, extracted causal atoms, confidence, and persona cues.
The ratings are preserved in one file so the analysis can compare patterns across one-star, two-star, three-star, four-star, and five-star reviews inside a single dataset. The useful comparison is:
what exists across all review ratings
what exists only in successful experiences
what exists only in failed or compromised experiences
what journey map is implied by both
what demand remains unmet
Important note before publishing or running this
This is a research notebook. Google Maps pages, selectors, scroll behaviour, and review markup can change. Scraping can also be constrained by site terms, permissions, rate limits, and local rules. Treat the notebook as a workflow demonstration and adapt it responsibly for your own environment.
Full notebook code and workflow
1. Setup: imports, timing, search target, and limits
The notebook starts with imports, a start timestamp, the local search term, and small limits for testing. In the version inspected here, the target query is coffee shop enmore, with three businesses and five reviews per business.
# --- Notebook code cell 0 ---
from playwright.async_api import async_playwright, Page, TimeoutError as PWTimeoutError
import re
import time
from datetime import date, datetime, timedelta
import pandas as pd
from IPython.display import clear_output
from collections import deque
import openai
from openai import OpenAI
import json
from IPython.display import clear_output
from collections import deque
import matplotlib.pyplot as plt
from pathlib import Path
import os
# --- Notebook code cell 1 ---
start_time = datetime.now()
print(datetime.now().strftime('%Y-%m-%d %H:%M'))
# --- Notebook code cell 2 ---
SEARCH_TERM = "coffee shop enmore"
# --- Notebook code cell 3 ---
MAX_BUSINESSES = 3
# --- Notebook code cell 4 ---
MAX_REVIEWS = 5
# --- Notebook code cell 5 ---
SEARCH_URL = f"https://www.google.com/maps/search/{SEARCH_TERM.replace(' ', '+')}/"
2. Progress tracking and the Windows/Jupyter Playwright wrapper
The progress tracker keeps long scraping and model calls visible inside Jupyter. The run_playwright_async wrapper runs Playwright inside a separate event loop/thread, which is important when Jupyter, Windows, and Playwright are fighting over async event-loop behaviour.
# --- Notebook code cell 6 ---
def make_progress_tracker(maxlen=5):
last_time = time.time()
updates = deque(maxlen=maxlen)
def tick(i):
nonlocal last_time
now = time.time()
elapsed = now - last_time
last_time = now
updates.append((i, elapsed))
clear_output(wait=True)
print(f"Recent progress (last {maxlen}):\n")
for idx, t in updates:
print(f"Processed: {idx} | Δt: {t:.1f}s")
return tick
tick = make_progress_tracker(maxlen=5)
# --- Notebook code cell 7 ---
import asyncio, sys, threading
def run_playwright_async(fn, *args, **kwargs):
box = {}
def runner():
loop = asyncio.ProactorEventLoop() if sys.platform.startswith("win") else asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
box["result"] = loop.run_until_complete(fn(*args, **kwargs))
except BaseException as e:
box["error"] = e
finally:
loop.close()
t = threading.Thread(target=runner)
t.start()
t.join()
if "error" in box:
raise box["error"]
return box["result"]
3. Business discovery from Google Maps
This section opens the Google Maps search URL, scrolls the result feed, opens business cards, extracts the place URL, address, and website URL, then deduplicates results by address and website.
# --- Notebook code cell 8 ---
n_businesses = 0
async def collect_business_list(search_url, max_businesses=10, headless=False):
global n_businesses
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
def normalise_website_url(href: str | None) -> str | None:
if not href:
return None
parts = urlsplit(href)
qs = parse_qsl(parts.query, keep_blank_values=True)
drop_keys = {
"utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term",
"gclid", "fbclid", "dclid", "gbraid", "wbraid"
}
qs2 = [(k, v) for (k, v) in qs if k.lower() not in drop_keys]
query2 = urlencode(qs2, doseq=True)
return urlunsplit((parts.scheme, parts.netloc, parts.path, query2, parts.fragment))
async with async_playwright() as p:
browser = await p.chromium.launch(headless=headless)
context = await browser.new_context(viewport={"width": 1280, "height": 900})
page = await context.new_page()
await page.goto(search_url, timeout=90000, wait_until="domcontentloaded")
await page.wait_for_selector('div[role="feed"] div[role="article"]', timeout=90000)
feed = page.locator('div[role="feed"]').first
prev, stagnation = 0, 0
while True:
cards = page.locator('div[role="feed"] div[role="article"]')
count = await cards.count()
if count >= max_businesses:
break
if count == prev:
stagnation += 1
if stagnation >= 8:
break
else:
prev, stagnation = count, 0
await feed.evaluate("(el) => el.scrollBy(0, Math.max(300, el.clientHeight * 0.85))")
await page.wait_for_timeout(900)
async def open_place_from_card(index, max_tries=5):
for _ in range(max_tries):
cards_local = page.locator('div[role="feed"] div[role="article"]')
if await cards_local.count() <= index:
return False
card = cards_local.nth(index)
await card.scroll_into_view_if_needed()
await page.wait_for_timeout(150)
await card.click(timeout=8000)
deadline = time.monotonic() + 25
while time.monotonic() < deadline:
if await page.locator('button[data-item-id="address"], button[aria-label^="Address:"]').count() > 0:
return True
if await page.locator('a[data-item-id="authority"], a[aria-label^="Website:"]').count() > 0:
return True
if await page.locator("h1").first.count() > 0:
if await page.locator(
'button[data-item-id="address"], button[aria-label^="Address:"], a[data-item-id="authority"], a[aria-label^="Website:"]'
).count() > 0:
return True
await page.wait_for_timeout(200)
await page.keyboard.press("Escape")
await page.wait_for_timeout(250)
return False
async def get_place_url_from_panel():
cur = page.url or ""
if "/maps/place/" in cur:
return cur
a = page.locator('a[href*="/maps/place/"]').first
if await a.count() > 0:
href = await a.get_attribute("href")
if href and "/maps/place/" in href:
return href
m = page.locator('meta[itemprop="url"]').first
if await m.count() > 0:
content = await m.get_attribute("content")
if content and "/maps/place/" in content:
return content
canon = page.locator('link[rel="canonical"]').first
if await canon.count() > 0:
href = await canon.get_attribute("href")
if href and "/maps/place/" in href:
return href
og = page.locator('meta[property="og:url"]').first
if await og.count() > 0:
content = await og.get_attribute("content")
if content and "/maps/place/" in content:
return content
raise RuntimeError("FAILED to find canonical /maps/place/ url")
async def get_address_text():
loc = page.locator('button[data-item-id="address"] .Io6YTe').first
if await loc.count() > 0:
txt = (await loc.inner_text()).strip()
if txt:
return txt
b = page.locator('button[data-item-id="address"]').first
if await b.count() > 0:
lab = await b.get_attribute("aria-label")
if lab:
lab = lab.strip()
if lab.lower().startswith("address:"):
lab = lab.split(":", 1)[1].strip()
if lab:
return lab
b1 = page.locator('button[aria-label^="Address"]').first
if await b1.count() > 0:
lab = await b1.get_attribute("aria-label")
if lab:
lab = lab.strip()
if lab.lower().startswith("address:"):
lab = lab.split(":", 1)[1].strip()
return lab or None
b2 = page.locator('button[aria-label^="Copy address"]').first
if await b2.count() > 0:
lab = await b2.get_attribute("aria-label")
if lab:
lab = lab.strip()
if lab.lower().startswith("address:"):
lab = lab.split(":", 1)[1].strip()
return lab or None
return None
business_rows = []
cards = page.locator('div[role="feed"] div[role="article"]')
total = min(await cards.count(), max_businesses)
for i in range(total):
ok = await open_place_from_card(i, max_tries=5)
if not ok:
continue
n_businesses += 1
tick(n_businesses)
place_url = await get_place_url_from_panel()
address = await get_address_text()
website_url = None
w = page.locator('a[data-item-id="authority"]').first
if await w.count() > 0:
website_url = await w.get_attribute("href")
if not website_url:
w2 = page.locator('a[aria-label^="Website:"]').first
if await w2.count() > 0:
website_url = await w2.get_attribute("href")
website_url = normalise_website_url(website_url)
business_rows.append(
{
"place_url": place_url,
"address": address,
"website_url": website_url,
}
)
await page.keyboard.press("Escape")
await page.wait_for_timeout(180)
await page.keyboard.press("Escape")
await page.wait_for_timeout(280)
await context.close()
await browser.close()
df = pd.DataFrame(business_rows)
if not df.empty:
df = df.drop_duplicates(subset=["place_url", "address", "website_url"]).reset_index(drop=True)
return df
# --- Notebook code cell 9 ---
business_df = run_playwright_async(
collect_business_list,
search_url=SEARCH_URL,
max_businesses=MAX_BUSINESSES,
headless=False
)
# --- Notebook code cell 10 ---
before = len(business_df)
business_df = business_df.drop_duplicates(
subset=["address", "website_url"]
).reset_index(drop=True)
after = len(business_df)
print(before - after, "duplicates removed")
# --- Notebook code cell 11 ---
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None, 'display.html.use_mathjax', False):
display(business_df)
4. Review extraction from selected businesses
This is the longest extraction component. It reopens Google Maps, opens each business, navigates to the reviews panel, expands visible truncated reviews where possible, extracts review text and rating, checks truncation, removes duplicate reviews, and displays the working review table.
# --- Notebook code cell 12 ---
n_reviews = 0
async def extract_reviews_from_search(search_url, max_businesses=10, max_reviews_per_business=50, headless=False):
global n_reviews
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
def normalise_website_url(href: str | None) -> str | None:
if not href:
return None
parts = urlsplit(href)
qs = parse_qsl(parts.query, keep_blank_values=True)
drop_keys = {
"utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term",
"gclid", "fbclid", "dclid", "gbraid", "wbraid"
}
qs2 = [(k, v) for (k, v) in qs if k.lower() not in drop_keys]
query2 = urlencode(qs2, doseq=True)
return urlunsplit((parts.scheme, parts.netloc, parts.path, query2, parts.fragment))
async with async_playwright() as p:
rows = []
for i in range(max_businesses):
browser = await p.chromium.launch(headless=headless)
context = await browser.new_context(viewport={"width": 1280, "height": 900})
page = await context.new_page()
await page.goto(search_url, timeout=90000, wait_until="domcontentloaded")
await page.wait_for_selector('div[role="feed"] div[role="article"]', timeout=90000)
feed = page.locator('div[role="feed"]').first
prev, stagnation = 0, 0
while True:
cards = page.locator('div[role="feed"] div[role="article"]')
count = await cards.count()
if count >= max_businesses:
break
if count == prev:
stagnation += 1
if stagnation >= 8:
break
else:
prev, stagnation = count, 0
await feed.evaluate("(el) => el.scrollBy(0, Math.max(300, el.clientHeight * 0.85))")
await page.wait_for_timeout(900)
async def open_place_from_card(index, max_tries=5):
for _ in range(max_tries):
cards_local = page.locator('div[role="feed"] div[role="article"]')
if await cards_local.count() <= index:
return False
card = cards_local.nth(index)
await card.scroll_into_view_if_needed()
await page.wait_for_timeout(150)
await card.click(timeout=8000)
deadline = time.monotonic() + 25
while time.monotonic() < deadline:
if await page.locator('button[data-item-id="address"], button[aria-label^="Address:"]').count() > 0:
return True
if await page.locator('a[data-item-id="authority"], a[aria-label^="Website:"]').count() > 0:
return True
if await page.locator("h1").first.count() > 0:
if await page.locator(
'button[data-item-id="address"], button[aria-label^="Address:"], a[data-item-id="authority"], a[aria-label^="Website:"]'
).count() > 0:
return True
await page.wait_for_timeout(200)
await page.keyboard.press("Escape")
await page.wait_for_timeout(250)
return False
async def get_place_url_from_panel():
cur = page.url or ""
if "/maps/place/" in cur:
return cur
a = page.locator('a[href*="/maps/place/"]').first
if await a.count() > 0:
href = await a.get_attribute("href")
if href and "/maps/place/" in href:
return href
m = page.locator('meta[itemprop="url"]').first
if await m.count() > 0:
content = await m.get_attribute("content")
if content and "/maps/place/" in content:
return content
canon = page.locator('link[rel="canonical"]').first
if await canon.count() > 0:
href = await canon.get_attribute("href")
if href and "/maps/place/" in href:
return href
og = page.locator('meta[property="og:url"]').first
if await og.count() > 0:
content = await og.get_attribute("content")
if content and "/maps/place/" in content:
return content
raise RuntimeError("FAILED to find canonical /maps/place/ url")
async def get_address_text():
loc = page.locator('button[data-item-id="address"] .Io6YTe').first
if await loc.count() > 0:
txt = (await loc.inner_text()).strip()
if txt:
return txt
b = page.locator('button[data-item-id="address"]').first
if await b.count() > 0:
lab = await b.get_attribute("aria-label")
if lab:
lab = lab.strip()
if lab.lower().startswith("address:"):
lab = lab.split(":", 1)[1].strip()
if lab:
return lab
b1 = page.locator('button[aria-label^="Address"]').first
if await b1.count() > 0:
lab = await b1.get_attribute("aria-label")
if lab:
lab = lab.strip()
if lab.lower().startswith("address:"):
lab = lab.split(":", 1)[1].strip()
return lab or None
b2 = page.locator('button[aria-label^="Copy address"]').first
if await b2.count() > 0:
lab = await b2.get_attribute("aria-label")
if lab:
lab = lab.strip()
if lab.lower().startswith("address:"):
lab = lab.split(":", 1)[1].strip()
return lab or None
return None
async def goto_reviews_tab():
tab = page.locator(
'button[role="tab"][aria-label*="Reviews"], button[role="tab"]'
).filter(has_text="Reviews").first
await tab.click(timeout=60000)
await page.wait_for_timeout(3000)
async def pick_reviews_scroller():
containers = page.locator("div.m6QErb.XiKgde")
n = min(await containers.count(), 50)
for k in range(n):
c = containers.nth(k)
if await c.locator("span.wiI7pd").count() > 0:
return c
return containers.first
async def click_visible_review_expanders(scroller, max_clicks=200):
return await scroller.evaluate(
"""(root, maxClicks) => {
function pickViewport(root) {
const candidates = [root, ...root.querySelectorAll('div')];
let best = root;
for (const el of candidates) {
const sh = el.scrollHeight || 0;
const ch = el.clientHeight || 0;
if (sh <= ch + 5) continue;
if (!el.querySelector('span.wiI7pd')) continue;
if ((best.scrollHeight || 0) < sh) best = el;
}
return best;
}
const viewport = pickViewport(root);
const vRect = viewport.getBoundingClientRect();
const nodes = Array.from(root.querySelectorAll('button, div[role="button"], span[role="button"], a[role="button"]'));
const isExpander = (el) => {
const al = (el.getAttribute('aria-label') || '').toLowerCase();
const txt = (el.innerText || '').trim().toLowerCase();
const js = (el.getAttribute('jsaction') || '').toLowerCase();
if (al === 'see more' || al === 'more' || al === 'read more') return true;
if (al.includes('see more') || al.includes('read more')) return true;
if (txt === 'more' || txt === 'see more' || txt === 'read more') return true;
if (js.includes('review') && (js.includes('expand') || js.includes('more'))) return true;
if (el.classList && (el.classList.contains('w8nwRe') || el.classList.contains('kyuRq'))) return true;
return false;
};
const clickable = nodes.filter(isExpander);
let clicked = 0;
for (const el of clickable) {
if (clicked >= maxClicks) break;
const ariaExpanded = el.getAttribute('aria-expanded');
if (ariaExpanded === 'true') continue;
const r = el.getBoundingClientRect();
const inView =
r.width > 0 && r.height > 0 &&
r.bottom > vRect.top + 4 && r.top < vRect.bottom - 4;
if (!inView) continue;
el.scrollIntoView({ block: 'center' });
el.click();
clicked++;
}
return { clicked, candidates: clickable.length };
}""",
max_clicks
)
async def settle_expansions(scroller, max_rounds=12):
for _ in range(max_rounds):
res = await click_visible_review_expanders(scroller, max_clicks=220)
await page.wait_for_timeout(260)
if not res or res.get("clicked", 0) == 0:
break
async def scroll_reviews_panel(scroller, frac=0.35):
return await scroller.evaluate(
"""(root, frac) => {
const candidates = [root, ...root.querySelectorAll('div')];
let best = null;
for (const el of candidates) {
const sh = el.scrollHeight || 0;
const ch = el.clientHeight || 0;
if (sh <= ch + 5) continue;
if (!el.querySelector('span.wiI7pd')) continue;
if (!best || el.scrollHeight > best.scrollHeight) best = el;
}
const target = best || root;
const before = target.scrollTop;
target.scrollTop = before + target.clientHeight * frac;
return target.scrollTop !== before;
}""",
frac
)
async def extract_visible_reviews_js(scroller):
return await scroller.evaluate(
"""(root) => {
const blocks = Array.from(root.querySelectorAll('div.MyEned, div[data-review-id], div[jscontroller]'));
const out = [];
for (const b of blocks) {
const textEl = b.querySelector('span.wiI7pd');
const text = textEl ? textEl.innerText.trim() : '';
if (!text) continue;
let key = null;
const ridBtn = b.querySelector('button[data-review-id]');
if (ridBtn) key = ridBtn.getAttribute('data-review-id');
if (!key) {
const direct = b.getAttribute && b.getAttribute('data-review-id');
if (direct) key = direct;
}
if (!key) {
const ctrlBtn = b.querySelector('button[aria-controls]');
if (ctrlBtn) key = ctrlBtn.getAttribute('aria-controls');
}
if (!key && b.id) key = b.id;
if (!key) key = text.slice(0, 160);
let rating = null;
let n = b;
for (let i = 0; i < 18 && n; i++) {
const star = n.querySelector('span.kvMYJc[role="img"][aria-label]');
if (star) {
const lab = star.getAttribute('aria-label') || '';
const m = lab.match(/(\\d+)/);
rating = m ? parseInt(m[1], 10) : null;
break;
}
n = n.parentElement;
}
const hasMore =
!!b.querySelector('button[aria-label="See more"], button[jsaction*="review.expandReview"], button.w8nwRe.kyuRq') ||
Array.from(b.querySelectorAll('button')).some(x => ((x.innerText || '').trim().toLowerCase() === 'more'));
out.push({ key, text, rating, has_more: hasMore });
}
return out;
}"""
)
async def force_expand_if_truncated(review_container):
text_loc = review_container.locator("span.wiI7pd").first
if await text_loc.count() == 0:
return False
txt = (await text_loc.inner_text()).strip()
if not txt or not txt.endswith("…"):
return False
btn = review_container.locator(
'button[aria-label="See more"], button[jsaction*="review.expandReview"], button.w8nwRe.kyuRq'
).first
if await btn.count() == 0:
btn = review_container.locator("button").filter(has_text="More").first
if await btn.count() == 0:
return False
for _ in range(4):
await btn.scroll_into_view_if_needed(timeout=2500)
await page.wait_for_timeout(120)
await btn.click(timeout=2500)
await page.wait_for_timeout(380)
txt2 = (await text_loc.inner_text()).strip()
if txt2 and not txt2.endswith("…"):
return True
return False
ok = await open_place_from_card(i, max_tries=5)
if ok:
place_url = await get_place_url_from_panel()
address = await get_address_text()
website_url = None
w = page.locator('a[data-item-id="authority"]').first
if await w.count() > 0:
website_url = await w.get_attribute("href")
if not website_url:
w2 = page.locator('a[aria-label^="Website:"]').first
if await w2.count() > 0:
website_url = await w2.get_attribute("href")
website_url = normalise_website_url(website_url)
try:
await goto_reviews_tab()
scroller = await pick_reviews_scroller()
seen = set()
deadline = time.monotonic() + 200
stagnation_r = 0
while len(seen) < max_reviews_per_business and time.monotonic() < deadline:
await settle_expansions(scroller, max_rounds=12)
batch = await extract_visible_reviews_js(scroller)
added = 0
for item in batch:
key = item.get("key")
if not key or key in seen:
continue
text = (item.get("text") or "").strip()
if text.endswith("…") and item.get("has_more"):
conts = scroller.locator("div.MyEned")
n_conts = await conts.count()
fixed = False
for idx in range(min(n_conts, 60)):
cont = conts.nth(idx)
tloc = cont.locator("span.wiI7pd").first
if await tloc.count() == 0:
continue
t = (await tloc.inner_text()).strip()
if t == text:
fixed = await force_expand_if_truncated(cont)
break
if fixed:
batch2 = await extract_visible_reviews_js(scroller)
match = next((x for x in batch2 if x.get("key") == key), None)
if match and (match.get("text") or "").strip():
text = (match.get("text") or "").strip()
if text.endswith("…"):
continue
seen.add(key)
rows.append(
{
"url": place_url,
"address": address,
"website_url": website_url,
"review": text,
"rating": item.get("rating"),
}
)
n_reviews += 1
tick(n_reviews)
added += 1
if len(seen) >= max_reviews_per_business:
break
if added == 0:
stagnation_r += 1
if stagnation_r >= 14:
break
else:
stagnation_r = 0
moved = await scroll_reviews_panel(scroller, frac=0.35)
await page.wait_for_timeout(1500)
if not moved:
await settle_expansions(scroller, max_rounds=8)
break
except Exception:
pass
await context.close()
await browser.close()
df = pd.DataFrame(rows)
if not df.empty:
df = df.drop_duplicates(subset=["url", "address", "website_url", "review", "rating"]).reset_index(drop=True)
return df
# --- Notebook code cell 13 ---
reviews_df = run_playwright_async(
extract_reviews_from_search,
search_url=SEARCH_URL,
max_businesses=MAX_BUSINESSES,
max_reviews_per_business=MAX_REVIEWS,
headless=False
)
# --- Notebook code cell 14 ---
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None, 'display.html.use_mathjax', False):
display(reviews_df[['address', 'website_url', 'review', 'rating']])
# --- Notebook code cell 15 ---
reviews_df.duplicated().sum()
# --- Notebook code cell 16 ---
percentage = (
reviews_df["review"]
.astype(str)
.str.strip()
.str.endswith("…")
.mean() * 100
if not reviews_df.empty else 0
)
percentage
# --- Notebook code cell 17 ---
reviews_df = reviews_df.drop_duplicates(subset=["review"])
# --- Notebook code cell 18 ---
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None, 'display.html.use_mathjax', False):
display(reviews_df)
# --- Notebook code cell 19 ---
len(reviews_df)
5. OpenAI client setup
The notebook then loads an OpenAI API key from a local project folder and creates a client. On a public version, this should be replaced with your own environment-variable or secrets-management method.
# --- Notebook markdown cell 20 ---
# # API key
# --- Notebook code cell 21 ---
# API_KEY = Path("openai_key.txt").read_text().strip()
def load_openai_key():
"""
Locate and load API key from shared file:
D:\Sync\Python Projects\openai\openai_key
Works from any notebook inside 'Python Projects'
"""
# Start from current working directory
current = Path.cwd()
# Traverse upwards until we find "Python Projects"
for parent in [current] + list(current.parents):
if parent.name == "Python Projects":
key_path = parent / "openai" / "openai_key.txt"
return key_path.read_text().strip()
raise FileNotFoundError("Could not locate 'Python Projects' root")
OPENAI_API_KEY = load_openai_key()
client = openai.OpenAI(api_key=OPENAI_API_KEY)
6. Filtering reviews for causal structure
Not every review is useful for demand analysis. This classifier keeps reviews that describe a process, interaction, constraint, expectation mismatch, decision point, or recovery/failure sequence. Generic praise and outcome-only complaints are filtered out.
# --- Notebook code cell 22 ---
n_review = 0
def classify_review_has_causal_structure(review_text):
global n_review
n_review += 1
tick(n_review)
"""
Classify whether a review contains causal / structural information.
Returns:
- True: review contains usable signal (process, interaction, constraint, etc.)
- False: review is generic / outcome-only
- "API Error": OpenAI call failed
- "Empty": input text invalid
"""
if not isinstance(review_text, str) or not review_text.strip():
return "Empty"
messages = [
{
"role": "system",
"content": """You are classifying a customer review.
Return ONLY one word: true or false
Return "true" if the review contains ANY of the following:
- a sequence of events or process (e.g. arrival, waiting, during, after)
- an interaction between customer and staff (e.g. told, asked, explained, ignored)
- a decision point or consent moment (e.g. size, placement, timing, pricing, approval)
- a constraint (e.g. time delay, environment, policy, availability)
- an expectation mismatch (e.g. “I expected”, “wasn’t told”, “I assumed”)
- a response or lack of response to an issue (e.g. apology, explanation, no update)
Return "false" if the review is only:
- generic praise or recommendation
- generic complaint with no detail
- outcome-only statement (e.g. “great tattoo”, “bad service”) without describing what happened
If unsure, return "true".
Do not explain. Output only "true" or "false".
"""
},
{
"role": "user",
"content": f"""Review:
"{review_text}"
"""
},
]
try:
response = client.chat.completions.create(
model="gpt-4.1-mini",
messages=messages,
max_tokens=3,
temperature=0.0,
)
except Exception:
return "API Error"
raw = (response.choices[0].message.content or "").strip().lower()
# Strict normalisation
if raw == "true":
return True
elif raw == "false":
return False
else:
return "Non Binary"
# --- Notebook code cell 23 ---
reviews_df = reviews_df.copy()
reviews_df.loc[:, 'causal_structure'] = reviews_df['review'].apply(classify_review_has_causal_structure)
# --- Notebook code cell 24 ---
reviews_df
# --- Notebook code cell 25 ---
reviews_df.causal_structure.value_counts()
7. Rating distribution and causal subset
This section plots the distribution of all ratings, then narrows the dataset to reviews that passed the causal-structure classifier. That matters because the analysis is trying to find repeated mechanisms, not merely count sentiment.
# --- Notebook code cell 26 ---
def plot_rating_distribution(df, rating_col="rating"):
"""
Create a frequency bar plot of ratings (1–5).
Args:
- df: pandas DataFrame
- rating_col: column containing rating values (expected 1–5)
Behaviour:
- Ensures all ratings 1–5 appear (even if count = 0)
- Displays counts as bar heights
"""
# Clean + coerce to numeric
ratings = pd.to_numeric(df[rating_col], errors="coerce").dropna()
# Count frequencies, ensure 1–5 present
counts = ratings.value_counts().reindex([1, 2, 3, 4, 5], fill_value=0).sort_index()
# Plot
plt.figure()
plt.bar(counts.index.astype(int), counts.values)
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.title("Review Rating Distribution")
plt.xticks([1, 2, 3, 4, 5])
plt.show()
return counts
# --- Notebook code cell 27 ---
plot_rating_distribution(reviews_df, rating_col="rating")
# --- Notebook code cell 28 ---
reviews_df.duplicated().sum()
# --- Notebook code cell 29 ---
subset_df = reviews_df.query('causal_structure == True')
# --- Notebook code cell 30 ---
len(subset_df)
# --- Notebook code cell 31 ---
plot_rating_distribution(subset_df, rating_col="rating")
# --- Notebook code cell 32 ---
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None, 'display.html.use_mathjax', False):
display(subset_df.review)
# --- Notebook code cell 33 ---
subset_df
8. Extracting causal atoms from each useful review
This is the core LLM analysis stage. Each causally useful review is converted into short operational atoms: mechanism statements, evidence snippets, tags, persona cues, severity, and confidence.
# --- Notebook code cell 34 ---
n_atoms = 0
def extract_causal_atoms(
review_text: str,
max_atoms: int = 8,
model: str = "gpt-5.2",
reasoning_effort: str = "low",
max_output_tokens: int = 600,
):
global n_atoms
n_atoms += 1
tick(n_atoms)
if not isinstance(review_text, str) or not review_text.strip():
return "Empty"
max_atoms = int(max(1, min(max_atoms, 30)))
max_output_tokens = int(max(16, max_output_tokens))
developer = f"""Extract operationally usable causal structure from ONE review.
Definition of usable:
- process step, constraint + response, decision/confirmation moment, policy/guarantee/time window,
failure + recovery, or explicit expectation management.
Evidence gate:
- Provide "evidence" as a short verbatim snippet from the review where possible.
- If you cannot quote exact words, still include the atom but keep evidence as the closest short phrase.
Return ONLY JSON with EXACT keys (no extras):
{{
"is_high_fidelity": true/false,
"fidelity_reason": "short (<= 16 words)",
"atoms": [
{{
"atom": "replicable mechanism statement (<= 12 words)",
"evidence": "short snippet (<= 12 words)",
"tags": ["optional free-text tags"]
}}
],
"persona_cues": ["short cues grounded in text"],
"severity": 0,
"confidence": 0.0
}}
Limit atoms to {max_atoms}.
No backticks. No commentary.
"""
t0 = time.time()
try:
kwargs = {
"model": model,
"input": [
{"role": "developer", "content": developer},
{"role": "user", "content": f'Review:\n"""{review_text}"""'},
],
"max_output_tokens": max_output_tokens,
}
if "gpt-5" in model:
kwargs["reasoning"] = {"effort": reasoning_effort}
resp = client.responses.create(**kwargs)
except Exception as e:
print(f"API Error after {time.time()-t0:.1f}s:", repr(e))
return "API Error"
raw = (getattr(resp, "output_text", None) or "").strip()
cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw, flags=re.IGNORECASE).strip()
try:
result = json.loads(cleaned)
except json.JSONDecodeError:
print("Non-JSON output (head):", raw[:600])
return "Non JSON"
required_keys = {"is_high_fidelity", "fidelity_reason", "atoms", "persona_cues", "severity", "confidence"}
if not isinstance(result, dict) or set(result.keys()) != required_keys:
return "Non JSON"
txt = review_text
def norm(s: str) -> str:
s = s.lower()
s = re.sub(r"[^\w\s]", " ", s)
s = re.sub(r"\s+", " ", s).strip()
return s
t_norm = norm(txt)
def ev_ok(evidence: str) -> bool:
if not isinstance(evidence, str) or not evidence.strip():
return False
return norm(evidence) in t_norm
def clean_tags(tags):
if not isinstance(tags, list):
return []
out = []
for t in tags:
if isinstance(t, str) and t.strip():
out.append(t.strip()[:40])
return out[:6]
# Keep BOTH:
# - atoms that pass evidence (for high-fidelity use)
# - atoms that fail evidence (for coverage / later inspection)
atoms_out = []
passed = 0
for it in result["atoms"][:max_atoms]:
if not isinstance(it, dict) or set(it.keys()) != {"atom", "evidence", "tags"}:
continue
atom = it.get("atom")
evidence = it.get("evidence")
if not isinstance(atom, str) or not atom.strip():
continue
if not isinstance(evidence, str):
evidence = ""
ok = ev_ok(evidence)
if ok:
passed += 1
atoms_out.append(
{
"atom": atom.strip()[:120],
"evidence": evidence.strip()[:120],
"tags": clean_tags(it.get("tags")),
# NOTE: if you must keep EXACT keys everywhere, remove this field.
# Better: put it in tags like "evidence_ok:true".
}
)
persona = []
for p in result["persona_cues"]:
if isinstance(p, str) and p.strip():
persona.append(p.strip()[:80])
persona = persona[:10]
# Recompute fidelity from evidence pass-rate, but DO NOT delete atoms.
is_high = (passed > 0)
fidelity_reason = result["fidelity_reason"].strip()[:200]
if not is_high:
fidelity_reason = "Atoms returned; evidence match failed (normalisation mismatch likely)"
try:
severity = max(0, min(3, int(result.get("severity", 0))))
except Exception:
severity = 0
try:
confidence = max(0.0, min(1.0, float(result.get("confidence", 0.7))))
except Exception:
confidence = 0.7
return {
"is_high_fidelity": bool(is_high),
"fidelity_reason": fidelity_reason,
"atoms": atoms_out,
"persona_cues": persona,
"severity": severity,
"confidence": confidence,
}
# --- Notebook code cell 35 ---
subset_df = subset_df.copy()
subset_df["causal_analysis"] = subset_df["review"].apply(
lambda x: extract_causal_atoms(x, model="gpt-4.1-mini")
)
# --- Notebook code cell 36 ---
for i, row in subset_df.iterrows():
print(f"\nROW {i}")
print(json.dumps(row["causal_analysis"], indent=2))
9. Flattening the causal analysis and exporting the review CSV
The nested JSON is flattened into ordinary table columns. The notebook then exports the full causally useful review dataset as a single CSV file. The rating column is preserved, so five-star, three-to-four-star, and one-to-two-star reviews can be compared later without splitting the data during export.
# --- Notebook code cell 37 ---
def flatten_causal_analysis(df, col):
def extract(row):
data = row if isinstance(row, dict) else {}
atoms = data.get("atoms", [])
persona = data.get("persona_cues", [])
return {
f"{col}_is_high_fidelity": data.get("is_high_fidelity"),
f"{col}_confidence": data.get("confidence"),
f"{col}_n_atoms": len(atoms),
f"{col}_atoms_text": " | ".join(
[f"{a.get('atom')} → {a.get('evidence')}" for a in atoms]
),
f"{col}_persona": ", ".join(persona),
}
expanded = df[col].apply(extract).apply(pd.Series)
return pd.concat([df, expanded], axis=1)
# --- Notebook code cell 38 ---
subset_df = flatten_causal_analysis(subset_df, "causal_analysis")
# --- Notebook code cell 39 ---
subset_df.columns
# --- Notebook code cell 40 ---
subset_df.isna().sum()
# --- Notebook code cell 41 ---
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None, 'display.html.use_mathjax', False):
display(subset_df[['website_url', 'address', 'rating', 'review', 'causal_analysis_confidence', 'causal_analysis_n_atoms', 'causal_analysis_atoms_text', 'causal_analysis_persona']])
# --- Notebook code cell 42 ---
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None, 'display.html.use_mathjax', False):
display(subset_df.query('rating < 5')[['website_url', 'rating', 'review', 'causal_analysis_atoms_text', 'causal_analysis_persona']])
# --- Notebook code cell 43 ---
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None, 'display.html.use_mathjax', False):
display(subset_df.query('rating == 5')[['website_url', 'rating', 'review', 'causal_analysis_atoms_text', 'causal_analysis_persona']])
# --- Notebook code cell 44 ---
subset_df.to_csv('reviews.csv', index=False)
# --- Notebook code cell 45 ---
end_time = datetime.now()
print(datetime.now().strftime('%Y-%m-%d %H:%M'))
# --- Notebook code cell 46 ---
# compute delta
time_delta = end_time - start_time
print(f"Start: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"End: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Delta: {time_delta}")
Analysing the exported CSV with ChatGPT or another LLM
Once the notebook has produced reviews.csv, the next step is interpretive analysis.
Upload the file:
reviews.csv
Then ask the model to compare the structures across rating groups inside the same dataset.
Use these prompts.
1. What exists across all review ratings?
Using `reviews.csv`, identify what exists across all review ratings.
Focus on repeated service expectations, journey stages, customer needs, emotional cues, operational constraints, and satisfaction/dissatisfaction triggers.
Do not merely summarise the reviews. Infer the stable customer-experience structure that appears across all rating groups.
2. What exists only in five-star ratings?
Using `reviews.csv`, filter to rating == 5 and identify what exists only in five-star ratings.
Focus on what successful experiences consistently contain:
- staff behaviours
- service moments
- product or environment qualities
- trust signals
- expectation management
- journey moments that appear to convert customers into advocates
Compare against the other rating groups and exclude patterns that also appear outside five-star reviews.
3. What exists only in one-star or two-star ratings?
Using `reviews.csv`, filter to rating in [1, 2] and isolate the one-star and two-star reviews.
Identify what exists only in these low-rating reviews.
Focus on severe journey failures:
- broken expectations
- staff conflict
- waiting or timing problems
- poor communication
- refusal, neglect, confusion, or failed recovery
- customer emotions that indicate abandonment or churn risk
Separate isolated incidents from repeated structural failures.
4. What exists only in three-star or four-star ratings?
Using `reviews.csv`, filter to rating in [3, 4] and isolate the three-star and four-star reviews.
Identify what exists only in these middle-rating reviews.
Focus on partial success:
- customers who liked something but still felt friction
- experience trade-offs
- conditional recommendations
- weak points that did not fully destroy the experience
- improvement areas that could convert acceptable experiences into excellent ones
Explain how these differ from both five-star reviews and one-star/two-star reviews.
5. What customer journey map is inferred?
Using `reviews.csv`, infer the customer journey map.
Break the journey into stages, such as:
- search/discovery
- selection
- arrival
- ordering or booking
- service interaction
- waiting
- consumption or delivery
- issue handling
- payment
- post-experience judgement
- review trigger
For each stage, identify:
- customer expectation
- positive signals
- negative signals
- failure modes
- recovery opportunities
- evidence from the review data
6. What unmet customer demand is inferred?
Using `reviews.csv`, infer unmet customer demand.
Do not simply list complaints.
Identify demand that appears structurally underserved by the local market:
- needs customers repeatedly express indirectly
- expectations that are not reliably met
- service moments where competitors fail
- emotional or practical outcomes customers seem to want
- positioning opportunities for a business entering or improving in this local category
Separate:
1. explicit unmet demand
2. implied unmet demand
3. weak signals worth testing
4. likely content/SEO angles
5. operational changes a business could make
What this changes strategically
The notebook turns local SEO and Google Maps competitor analysis into something more useful than a directory scrape.
The usual workflow is:
find competitors
→ map locations
→ compare ratings
→ export data
This workflow adds the missing demand layer:
find competitors
→ extract reviews
→ isolate causally useful reviews
→ extract journey atoms
→ compare satisfaction, dissatisfaction, and near-miss reviews by rating
→ infer unmet demand
→ identify positioning opportunities
That changes the object of analysis.
The market is no longer just a set of businesses on a map. It becomes a set of customer journeys, some successful and some broken, distributed across a local competitive landscape.
That is the useful claim:
Competitor mapping tells you where the market is crowded.
Review journey mapping tells you where the market is still weak.