Short answer: yes.
Google released Gemini 3 last week. It had been hyped as a major upgrade to Gemini 2.5 and was rumored to outperform OpenAI’s GPT-5.1.
On paper, it is now the leading large language model. It outperforms GPT-5.1, Claude, and Gemini 2.5 Pro on nearly every industry benchmark. The only benchmark on which it trails GPT-5.1 and Claude is agentic coding (see the SWE-bench benchmark) - and only just.

Where it really shines is reasoning - complex, multi-step tasks. On Humanity’s Last Exam (great name), it beats GPT-5.1 by 11 percentage points without tools and by 19 percentage points with tools (e.g. the ability to use search and run code). On MathArena Apex, which measures a model’s ability to solve highly complex math problems as a test of ‘frontier reasoning’, it scored 23.4% versus 1.6% for Claude and 1.0% for GPT-5.1.
For reference, here is an example of a question from Humanity’s Last Exam. If you can solve this without AI, please reach out to me.

These benchmarks are all well and good, but how well does the model handle real estate concepts? I’ve always had my own real estate benchmark for LLM reasoning: the ability of an LLM to correctly calculate the distributions and IRRs in a carried interest model.
Here is my prompt:
Below are cash flows for a real estate development and the IRR hurdles and cash flow splits for the LP and GP.
Calculate the distributions and XIRR for the LP, GP, and Deal.
Waterfall Structure:
Return of capital pari-passu
Remaining cash distributed pari-passu until LP achieves the preferred return
Remaining cash distributed according to the Tier 2 splits until LP receives Tier 2 Hurdle IRR
Remaining cash distributed according to the Tier 3 splits
The waterfall will use daily compounding accruals using the annual IRRs given.
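The tier logic in that prompt can be sketched in code. Below is a minimal Python sketch, not the model my prompt actually used: it assumes a single capital call, a single exit distribution, hurdle accrual keyed to the LP’s capital only, and hypothetical parameters (90/10 capital split, 8% pref, 12% Tier 2 hurdle, 80/20 and 70/30 promote splits). Daily compounding is approximated by compounding the annual rate over an actual/365 day count; a real JVA waterfall with interim distributions needs date-by-date accrual.

```python
from datetime import date

def accrual(principal, annual_rate, d0, d1):
    """Profit accrued on principal at an annual rate over an actual/365 day count."""
    years = (d1 - d0).days / 365
    return principal * ((1 + annual_rate) ** years - 1)

def waterfall(capital, exit_cash, d0, d1, lp_share=0.9, pref=0.08,
              t2_hurdle=0.12, t2_lp=0.8, t3_lp=0.7):
    """Run exit_cash through a simplified three-tier waterfall.

    Tier 1: return of capital pari-passu, then pari-passu until the LP
    earns its preferred return. Tier 2: promote split until the LP hits
    the Tier 2 hurdle balance. Tier 3: residual split.
    """
    lp_cap = capital * lp_share
    lp = gp = 0.0
    cash = exit_cash

    # Return of capital, pari-passu by capital share
    roc = min(cash, capital)
    lp += roc * lp_share
    gp += roc * (1 - lp_share)
    cash -= roc

    # Pari-passu until the LP achieves its preferred return
    t1 = min(cash, accrual(lp_cap, pref, d0, d1) / lp_share)
    lp += t1 * lp_share
    gp += t1 * (1 - lp_share)
    cash -= t1

    # Tier 2 split until the LP reaches capital plus the Tier 2 hurdle accrual
    lp_target = lp_cap + accrual(lp_cap, t2_hurdle, d0, d1)
    t2 = min(cash, max(0.0, lp_target - lp) / t2_lp)
    lp += t2 * t2_lp
    gp += t2 * (1 - t2_lp)
    cash -= t2

    # Tier 3 residual split
    lp += cash * t3_lp
    gp += cash * (1 - t3_lp)
    return lp, gp
```

Even this stripped-down version shows why the task is a good reasoning benchmark: each tier’s distribution depends on the running balances from the tiers above it, so a single slip early in the chain corrupts every number after it.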
I ran this prompt three times each through Gemini 3, GPT-5.1, and Claude Opus 4.1 - the latest and greatest reasoning models from Google, OpenAI, and Anthropic, respectively.
Gemini 3 produced the correct result every time.

GPT-5.1 also produced the correct result each time, but only after I manually turned on ‘Thinking’ mode. Its Auto mode, which is supposed to decide intelligently when to use more or less ‘thinking’, was a disaster. It kept asking numerous follow-up questions, and after getting frustrated I gave up on Auto mode altogether.

Claude Opus 4.1 gave a different wrong answer each time. Disappointing.

I knew the right answer ahead of time but what if I didn’t? Two models agreed on an answer but the third did not. How do I know what’s correct?
To verify the result, I had Gemini 3 produce Excel VBA code to create a dynamic Excel-based carried interest model using the same structure and cash flows, along with various validation checks. It produced excellent code that did just that.
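If you want a check that doesn’t depend on Excel at all, XIRR itself is straightforward to compute from dated cash flows. Here is a minimal sketch (function names are mine) using bisection, which assumes the NPV changes sign exactly once over the search range - true for the typical invest-then-distribute pattern:

```python
from datetime import date

def xnpv(rate, flows):
    """NPV of dated cash flows [(date, amount), ...] at an annual rate,
    discounted on an actual/365 day count from the first flow's date."""
    d0 = flows[0][0]
    return sum(cf / (1 + rate) ** ((d - d0).days / 365) for d, cf in flows)

def xirr(flows, lo=-0.99, hi=10.0):
    """Annualized IRR via bisection; assumes NPV crosses zero once in [lo, hi]."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if xnpv(mid, flows) > 0:
            lo = mid       # NPV still positive: the rate is higher
        else:
            hi = mid
    return (lo + hi) / 2
```

Running each model’s claimed distributions through an independent function like this is a cheap way to break a two-against-one tie between models.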
You can view my own and Gemini 3’s Excel model here: Carried Interest Excel Model

A key benefit of Gemini is its exceedingly large context window - the amount of text you can feed the model. Gemini can handle up to 1 million tokens, or approximately 1,300 single-spaced pages. GPT-5.1 is limited to about 40% of that, while Claude is limited to 20%.
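The page figure follows from two common rules of thumb - roughly 0.75 English words per token and roughly 550 words per single-spaced page (both approximations, not exact conversion rates):

```python
# Rough sanity check of the "1,300 pages" figure.
# Assumptions (rules of thumb, not exact): ~0.75 English words per token,
# ~550 words per single-spaced page.
tokens = 1_000_000
words = tokens * 0.75
pages = words / 550
print(round(pages))  # on the order of 1,300-1,400 pages
```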
The massive context window combined with powerful multi-modal reasoning lends itself well to various real estate use cases:
- Reconciling leases against rent rolls
- Reconciling historical rent rolls against T12 financials
- Abstracting leases with less prompting
- Evaluating Property Condition Assessments and the property images in them
- Analyzing Environmental Reports
- Auditing carried interest calculations using language from JVAs
- Creating a due diligence checklist and identifying missing documents
I am extremely impressed with Gemini 3 so far and will be using it far more often. If you’re a student at NYU, you get Gemini 3 for free.
Back to the drawing board to make harder real estate benchmarks…
-Professor Scott
