What's happening
A user clicks a button. The spinner runs for two seconds, then five, then the dashboard flips to a generic error page. You open the network tab. The request to a base44 function or SDK endpoint returns HTTP 500 with a body that is empty, a single-line "Internal Server Error," or a JSON envelope with no useful detail. The base44 function logs confirm a 500 happened. They do not tell you why.
A base44 500 error is almost always one of four root causes, in order of frequency: a function timeout where the SDK request layer trips its per-request budget, an RLS misconfiguration after an AI agent edit, an integration credential that has expired or hit quota, and a scheduled-task collision with peak request load. The dashboard rarely surfaces which one you have — confirm the cause from the network tab and the function logs before patching anything.
In 14 of our last 30 base44 audits, the team's first move was a redeploy and a hopeful retry — which masks the cause when it works and burns a day when it does not. The four causes below cover roughly 95 percent of the 500s we see; the remaining 5 percent are platform-side regressions covered at the end.
The four root causes (ranked by frequency in our client work)
1. Function timeout — the SDK trips its per-request budget
Trigger. A function that used to finish in 800ms now takes 6 seconds because of a new third-party call, an unbounded loop, or a query that lost its index after a schema change.
Signature. The 500 is consistent on a specific function, the logs show work starting but never finishing, and the response time sits within a small window of the per-request limit. No exception is logged — the runtime cut the function off before it could throw.
Why it returns 500. When the SDK request layer hits its per-request budget, it terminates the function and returns 500 with a generic envelope. There is no 408 and no 504 — the SDK normalizes the timeout because the function never produced a real response. Most common cause in our caseload: 14 of the last 30 audits. Cold-start variant: /fix/functions-stop-working-after-hours.
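To make the "started but never finished" signature unambiguous in your own logs, write a start entry and a finish entry around the handler body; a start with no matching finish for the same request id is the timeout. A minimal sketch, not a base44 API; doWork is a stand-in for whatever the function actually does.
// Start/finish instrumentation. A "fn.start" entry with no matching "fn.finish"
// for the same request id means the runtime cut the function off: cause 1.
export default async function handler(req: Request) {
  const reqId = crypto.randomUUID();
  const started = Date.now();
  console.log(JSON.stringify({ kind: "fn.start", reqId }));
  try {
    return Response.json(await doWork(req));
  } finally {
    // If the runtime kills the function, this line never runs and the finish
    // entry never appears. Elapsed values parked just under the budget also
    // point at cause 1.
    console.log(
      JSON.stringify({ kind: "fn.finish", reqId, elapsedMs: Date.now() - started })
    );
  }
}
// Stand-in for the existing handler body; replace with your own logic.
async function doWork(req: Request): Promise<Record<string, unknown>> {
  return { ok: true };
}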
2. RLS misconfiguration after an AI agent edit
Trigger. The AI agent rewrites a function and the new query reads a row an RLS policy now denies. Or the agent rewrites a policy and the existing function cannot read its own writes. Either way, policy and function are out of sync.
Signature. The 500 fires only for some users — the ones whose row state matches the denied condition. Logs show a permission-style error from the driver, wrapped in a generic SDK exception. Reproducing against a different user either succeeds or fails depending on which side of the policy boundary that user lands on.
Why it returns 500. The denied query throws. The SDK normalizes the throw to a 500 and discards the permission code. Full RLS-regression playbook: /fix/base44-rls-out-of-sync-after-ai-edit. We saw this in 8 of the last 30 audits, every one tied to a recent AI edit log entry.
3. Integration credential expiry
Trigger. A Stripe webhook signing secret is rotated and the new value is not in the function's environment. An OAuth token expires and the refresh path is broken. A third-party API key hits its monthly quota. A vendor deprecates an endpoint and the call now returns 410.
Signature. The 500 affects only the calls that touch a specific integration; other functions keep working. The log shows a 401, 403, 410, or 429 from the third-party, then a generic 500 from the catch-all handler. We saw this in 5 of the last 30 audits.
Why it returns 500. The base44 error envelope swallows downstream HTTP details. A 401 from Stripe becomes a generic 500. A 429 becomes a 500. Stripe-specific case: /fix/stripe-integration-breaks-update. Rate-limit variant: /fix/rate-limit-429-production-throttle.
4. Scheduled-task collision with peak request load
Trigger. A cron runs a heavy database operation while real user traffic peaks. Both fight for the same connection pool, the pool runs dry, and the user-facing function fails to acquire a connection inside its timeout window.
Signature. The 500s cluster sharply at predictable times — top of the hour, midnight UTC, the daily digest tick. Outside those windows the same functions work. The log shows a connection-acquisition error or a generic timeout. We saw this in 3 of the last 30 audits, all on apps with high cron density.
Why it returns 500. The function cannot get a connection within its budget. The SDK times out. The user gets a 500. The cron, meanwhile, completes and shows green in the dashboard — which is why this cause hides easily.
The diagnostic checklist
Run these in order. Each step rules out a specific cause. Do not skip — we once spent a full day chasing a credential expiry that turned out to be a timeout because we skipped step 2.
- Capture the failing request from the network tab. Note function name, timestamp, response time, and response body. Preserve-log on.
- Pull the dashboard log at that exact timestamp. Note whether the function started and never finished (timeout), started and threw (RLS or integration), or never started (router or pool exhaustion).
- Compare the response time to the platform's per-request budget. Within a few hundred ms of the limit means cause 1 — skip to the timeout fix.
- Reproduce against two different user records — one known-good, one suspected-bad. If only one fails, you have cause 2 (RLS). If both fail identically, RLS is not it.
- Test every third-party credential the function uses with a direct curl (or the scripted probe after this list). Stripe key, OAuth tokens, API keys, webhook signing secrets. A 401, 403, 410, or 429 here means cause 3.
- Plot the 500 timestamps against your cron schedule. If they align with cron ticks (top of hour, daily, hourly), you have cause 4.
- Check your AI agent edit log for the last 24 hours. A 500 that started within an hour of an AI edit is almost always cause 2 or a code-shape regression — we see this in ~60 percent of post-AI-edit incidents.
- Check your deploy log for the last 24 hours. A 500 that started right after a deploy with no AI edit is a regression in the deployed code — usually a query change or a new dependency.
- Sample at least 20 failing requests across 30 minutes. Constant rate = code-level cause (1, 2, 3). Spikes and recovers = load-level (4) or upstream rate-limit.
- Confirm one root cause before patching. Do not ship until steps 1-9 converge on a single cause. Fixing the wrong one wastes a deploy and obscures the real signal.
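For step 5, the per-credential checks can be scripted once and rerun whenever a 500 cluster appears. A sketch of a fetch-based equivalent of the curl probes; the endpoints are cheap read-only calls, and the environment variable names are examples, so swap in whatever your project uses.
// Direct credential probes that bypass base44 entirely. A 401, 403, 410, or 429
// here confirms cause 3. The env var names are illustrative.
const probes = [
  {
    name: "stripe",
    url: "https://api.stripe.com/v1/balance",
    headers: { Authorization: `Bearer ${process.env.STRIPE_SECRET_KEY}` },
  },
  {
    name: "sendgrid",
    url: "https://api.sendgrid.com/v3/user/profile",
    headers: { Authorization: `Bearer ${process.env.SENDGRID_API_KEY}` },
  },
];
for (const probe of probes) {
  const res = await fetch(probe.url, { headers: probe.headers });
  console.log(`${probe.name}: HTTP ${res.status}`);
}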
If you remain uncertain between two causes, that is itself a signal — usually cause 1 (timeout) is masking cause 3 (integration), because the slow integration call is what pushed the function past its budget.
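The quickest way to break that tie is to time the suspect integration call on its own. A small generic wrapper is enough; this is a sketch, not a base44 helper, and the stripe call in the usage comment is only an example of a slow downstream call.
// Wrap any awaited call so the log records how much of the request budget the
// integration itself consumed. If timedMs accounts for most of the elapsed
// time, the timeout is downstream and cause 3 is the real fix.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const started = Date.now();
  try {
    return await fn();
  } finally {
    console.log(
      JSON.stringify({ kind: "timing", label, timedMs: Date.now() - started })
    );
  }
}
// Usage inside the handler, for example:
// const charge = await timed("stripe.charges.create", () =>
//   stripe.charges.create({ amount: order.total, currency: "usd", source: body.token })
// );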
The fix — by root cause
Each fix below assumes the diagnostic has converged on a single cause. Skipping it is the top reason a 500 fix takes three deploys instead of one.
Fix 1: Function timeout
Move slow work off the request path. If the function makes a third-party call that takes more than a couple of seconds, the call should not be on the user's request — it should be queued and processed out-of-band.
// BEFORE: synchronous third-party call inside the request handler.
// This is the shape that trips the per-request budget under load.
export default async function handler(req: Request) {
const body = await req.json();
const order = await db.orders.create({ data: body });
// Slow path — Stripe call + email send + analytics ping, all serial.
await stripe.charges.create({ amount: order.total, currency: "usd", source: body.token });
await emailProvider.send({ to: body.email, template: "order-confirm" });
await analytics.track("order.created", { id: order.id });
return Response.json({ ok: true, orderId: order.id });
}
// AFTER: write the order, queue the side effects, return immediately.
// The queued job runs out-of-band and has its own timeout budget.
export default async function handler(req: Request) {
const body = await req.json();
const order = await db.orders.create({ data: body });
await db.jobs.create({
data: {
kind: "order.fulfill",
payload: { orderId: order.id, token: body.token, email: body.email },
runAt: new Date(),
},
});
return Response.json({ ok: true, orderId: order.id });
}
The write-then-queue pattern keeps the user-facing request fast and gives slow work its own retry surface. Add an index on any query that scans more than a few thousand rows, and audit for any unbounded forEach over a list that grows with usage.
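For the loop audit, the usual offender is a findMany with no take followed by a forEach over every row. A sketch of the bounded replacement, assuming the same Prisma-style db client used above; processOrder is a placeholder for the per-row work, and if the whole set genuinely must be processed, that belongs in the queued job, not the request.
// Cursor-paginated version of an "all pending rows" loop. Each pass touches at
// most 100 rows, so the request does a predictable amount of work no matter
// how large the table grows.
let cursor: string | undefined;
do {
  const page = await db.orders.findMany({
    where: { status: "pending" },
    orderBy: { id: "asc" },
    take: 100,
    ...(cursor ? { cursor: { id: cursor }, skip: 1 } : {}),
  });
  for (const order of page) {
    await processOrder(order); // placeholder for the per-row work
  }
  cursor = page.length === 100 ? page[page.length - 1].id : undefined;
} while (cursor);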
Fix 2: RLS misconfiguration after an AI agent edit
Patch the policy or function so they agree, then put the agreement under a regression test the agent must keep green.
// tests/rls/orders-policy.test.ts
// One allowed-row case and one denied-row case per policy. Run on every
// deploy and after every AI agent edit.
import { test, expect } from "vitest";
import { runAsUser, runAsOtherUser } from "./helpers";
test("orders: a user can read their own order", async () => {
const order = await runAsUser("alice", async (db) => {
return db.orders.findFirst({ where: { ownerId: "alice" } });
});
expect(order).not.toBeNull();
});
test("orders: a user cannot read another user's order", async () => {
const order = await runAsOtherUser("alice", "bob", async (db) => {
return db.orders.findFirst({ where: { ownerId: "bob" } });
});
// The denied case must return null or throw a typed permission error,
// never a generic 500. If you see a generic 500 here, your function's
// error handler is masking the real signal — fix that first.
expect(order).toBeNull();
});
The full RLS-regression playbook, including the agent-side guard rails that stop the regression from coming back the next time the agent edits, is at /fix/base44-rls-out-of-sync-after-ai-edit.
Fix 3: Integration credential expiry
Add a structured error path that surfaces the underlying integration failure, then add a daily health check that pages you when a credential stops working.
// Wrap every third-party call with a structured error that survives the
// SDK envelope. The function still returns 500 to the caller, but your
// log sink now sees the real cause inline.
async function callStripe<T>(name: string, fn: () => Promise<T>): Promise<T> {
try {
return await fn();
} catch (err: unknown) {
const error = err as { statusCode?: number; code?: string; message?: string };
console.error(
JSON.stringify({
kind: "integration.failure",
integration: "stripe",
operation: name,
statusCode: error.statusCode ?? null,
code: error.code ?? null,
message: error.message ?? String(err),
at: new Date().toISOString(),
})
);
throw err;
}
}
// In the function:
const charge = await callStripe("charges.create", () =>
stripe.charges.create({ amount: order.total, source: body.token })
);
Mirror the same payload to a third-party log sink (Logtail, Datadog, Axiom). The base44 console truncates and rotates fast; a sink with longer retention is the difference between a one-hour fix and a one-week guess. The Stripe-specific recurrence pattern is at /fix/stripe-integration-breaks-update.
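The daily health check mentioned above can be a small scheduled function that makes one cheap, read-only call per integration through the same wrapper, so a dead credential produces the structured integration.failure entry before a user ever hits the 500. A sketch, assuming the callStripe wrapper and stripe client from the block above.
// Daily credential health check. One harmless read per integration; a failure
// reuses the integration.failure payload, so the log sink can alert on it.
export default async function credentialHealthCheck() {
  // Stripe: retrieving the balance exercises nothing but the secret key.
  await callStripe("healthcheck.balance", () => stripe.balance.retrieve());

  // Repeat the pattern for every other integration the app depends on: one
  // cheap read-only call with the stored credential, wrapped the same way.
}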
Fix 4: Scheduled-task collision
Move heavy cron work off the peak window, cap the connection draw per task, and add a circuit breaker that backs the cron off when the user-facing pool falls below a threshold.
// In your scheduled task, pull a small batch and bail if the pool is hot.
export default async function nightlyDigest() {
const poolStats = await db.$queryRaw<{ active: number; max: number }[]>`
SELECT (count(*) FILTER (WHERE state = 'active'))::int AS active,
current_setting('max_connections')::int AS max
FROM pg_stat_activity
`;
const utilization = poolStats[0].active / poolStats[0].max;
if (utilization > 0.6) {
console.warn(
`digest: pool utilization ${utilization.toFixed(2)} too hot, deferring`
);
return; // Cron will run again on the next tick.
}
// Process at most 200 records per tick. Keeps the cron's connection
// draw bounded so user requests can still acquire connections.
const batch = await db.users.findMany({
where: { digestPending: true },
take: 200,
});
for (const user of batch) {
await sendDigest(user);
}
}
Stagger your cron schedule away from the top of the hour. Most 500-cluster patterns we see come from three or four crons all firing at 0 * * * * and fighting for the same pool. Spread them across the hour and the collision usually disappears with no other change. The cold-start variant is at /fix/functions-stop-working-after-hours.
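To make the stagger concrete, here is an illustrative schedule map; the expressions are examples, and where they get registered depends on how your crons are defined.
// Illustrative only: three heavy tasks that used to fire together at minute 0,
// spread across the hour so they never hit the connection pool on the same tick.
const schedules: Record<string, string> = {
  "digest.nightly": "17 2 * * *", // 02:17 UTC, off the midnight spike
  "reports.hourly": "23 * * * *", // minute 23 of every hour
  "cleanup.stale": "41 * * * *", // minute 41 of every hour
};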
When to call us
If you have run the diagnostic and confirmed the cause but the fix touches code or schema you do not want to edit under live-traffic pressure, that is the engagement we run. Our audit is a $497 one-day diagnostic that delivers a written cause analysis and the exact fix path. Our fix-sprint is a fixed-price 48-72 hour engagement that ships the fix, the regression tests, and the instrumentation. The full base44 error reference, including platform-side regressions we exclude from DIY fixes, is at /blog/base44-error-reference.
If a base44 500 keeps recurring after a redeploy, do not ship more code — run the four-cause diagnostic first. The dashboard cannot tell you which root cause you have, but the network tab, function logs, AI edit log, and cron schedule together always can. Confirm the cause, ship the matching fix, then add the regression test that prevents the next one.