radioactivist 4 days ago

The data set quality seems really spotty based on looking at a few random problems (I looked at about a dozen in the "Physics" subcategory). Several problems had no clear question (or answer) and seemed to be clipped from some longer resource, and thus had back references to Sections and Chapters that the models clearly couldn't follow. Worse, the verification of the answer seems to be via an LLM and isn't all that reliable; I saw several where the answer was marked correct when it clearly wasn't, and several that were correct but not in the precise form given as "the" answer and thus were labelled as incorrect.

  • rosstaylor90 4 days ago

    Thanks for the feedback! Yes, we’re looking to improve quality in the coming months. Couple of notes:

    - The initial use of the data is distillation, so we’re less bound by question quality (anything that elicits output diversity is good).

    - But moving on to RL, we’ll need stronger quality. We have much better things planned both on data filtering and verification!

    - Surprisingly, a lot of ML datasets actually look like this when you look under the hood. We’re hoping that having more eyeballs on it will help improve quality in the long run, compared to the less transparent status quo!

    • eternityforest 2 days ago

      I still don't understand why all the datasets have so many general knowledge questions and so much math, when so few people can do any of that stuff.

      It makes sense for ASI research I suppose, but why are we trying to teach small models to do stuff almost no humans even try to do?

      What happens if you train them with RAG context in the prompts and calculator calls in the CoT?

      • rosstaylor90 2 days ago

        Many math questions are easy to verify, and it's a classic benchmark for reasoning -> so it's a good hill to climb.

        I agree with your meta-point that better benchmarks testing more types of task would be good!

westurner 4 days ago

"Can Large Language Models Emulate Judicial Decision-Making? [Paper]" (2025) https://news.ycombinator.com/item?id=42927611 ; awesome-legal-nlp, LexGLUE, FairLex, LegalBench, "Who hath done it?" exercise : {Thing done}, ({Gdo, You, Others, Unknown/Nobody} x {Ignorance, Malice, Motive, Intent}) ... Did nobody do this?

Can LLMs apply a consistent procedure for logic puzzles with logically disjunctive possibilities?

Enter: Philosoraptor the LLM

akomtu 4 days ago

The output of a reasoning model must be an algorithm, formula, or something similar in a formal language that leaves no room for ambiguity. I can "reason" all day about the P=NP problem, but I won't be able to come up with something verifiable. A language model may translate the formal language of a reasoning model into English or Chinese, for example.

Once this stage is reached, once we can throw piles of data onto a reasoning model and get formal algorithms that explain or predict that data, the new era will begin.

emorning3 4 days ago

LLMs cannot reason, they can only say things that sound reasonable, there's a difference. Duh.

  • rosstaylor90 4 days ago

    What's your AIME 2025 score? https://gr.inc/RJT1990/AIME2025/

    • nyrikki 4 days ago

      That is the point of the AIME: it is a 3-hour, closed-book examination in which each answer is an integer from 0 to 999 and should only depend on pre-calc...for a human with no calculator, notes, or internet access.

      The concepts are heavily covered in the training corpus, and if people were allowed to take it more than once, with even a book, let alone access to the internet, it wouldn't be very hard.

      Examples:

      1) Find the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$

      In the corpus: https://www.quora.com/In-what-bases-b-does-b-7-divide-into-9...

      And one more:

      3) https://artofproblemsolving.com/wiki/index.php/2025_AIME_I_P...

      Is just the number of ways to distribute k indistinguishable balls (players) into n distinguishable boxes (flavors), without exclusion, in such a way that no box is empty.

      Thus it's in the corpus for any course that needs to cover combinatorial problems, including physics, discrete math, logistics, etc...

      IMHO these concept classes from a typical AIME are so common that the scores you gave demonstrate those models are doing no "general reasoning" at all and are actually failing at approximate retrieval (both examples are sketched below).
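
      For what it's worth, both examples reduce to a few lines of Python. (A minimal sketch; the k = 9, n = 3 values are illustrative, not the contest's exact setup.)

          # Problem 1: 17_b = b + 7 and 97_b = 9b + 7 = 9(b + 7) - 56,
          # so (b + 7) must divide 56; brute force over the AIME answer range.
          print(sum(b for b in range(10, 1000) if (9 * b + 7) % (b + 7) == 0))  # 21 + 49 = 70

          # Stars and bars, as paraphrased above: k indistinguishable balls into
          # n distinguishable boxes with no box empty is C(k - 1, n - 1).
          from itertools import product
          from math import comb
          k, n = 9, 3  # illustrative values
          brute = sum(1 for c in product(range(1, k + 1), repeat=n) if sum(c) == k)
          print(brute, comb(k - 1, n - 1))  # 28 28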

      • rosstaylor90 4 days ago

        I disagree; 10 years ago, AIs nailing these types of competitions would have been seen as very impressive. The fact that the goalposts can move on this now shows how much AI has progressed.

        (Also, the term “approximate retrieval” is a bad one - reasoning is inherently a process of chaining together associations. What matters is whether the reasoning reaches the right conclusions. Still some way to go, but already very impressive in tasks traditionally considered bastions of human reasoning!)

        • CamperBob2 3 days ago

          > I disagree; 10 years ago, AIs nailing these types of competitions would have been seen as very impressive.

          It would have been seen as witchcraft.

        • bossyTeacher 3 days ago

          > What matters is whether the reasoning reaches the right conclusions

          No, it doesn't. A broken clock is right twice a day; reasoning is about the journey more than the destination.

          • rosstaylor90 3 days ago

            RL has more than two steps...

            • bossyTeacher 2 days ago

              Point is that reasoning is about more than the conclusions. If your steps are wrong, your reasoning is wrong regardless of the conclusion. Poor reasoning is what could make an LLM conclude that 1 + 2 = 3 but that 2 + 1 = [some number other than 3].

  • CamperBob2 4 days ago

    My next-token predictor said you would say that next.

  • nh23423fefe 4 days ago

    Joke's on you, they can't even speak. So obviously your sentence is meaningless. Arguing about definitions is very fruitful!

  • perching_aix 4 days ago

    emorning3 cannot reason, he can only say things that sound reasonable, there's a difference. Duh.

    Good luck. As a reminder, there are people who, with varying degrees of certainty, think their loved ones have been replaced by actors, as well as people who think they're actually the god of the world around them, for it is just their imagination.

    • leptons 4 days ago

      > As a reminder, there are people who, with varying degrees of certainty, think their loved ones have been replaced by actors, as well as people who think they're actually the god of the world around them, for it is just their imagination.

      None of that is any proof at all that LLMs or computers in general can reason.

      "some humans are dumb, so LLMs are smart" is not a valid argument here.

      • fragmede 4 days ago

        How do we test for reasoning? If A -> B and B -> C, then something that can reason could conclude A -> C. If I give A -> B and B -> C to an LLM and ask it about the relationship between A and C, it'll tell me about the transitive property of implication, graph theory, transitivity. If the objection is that there are no qualia behind that, that it doesn't really reason or think or breathe or love, then we have to go back and ask what reasoning is. There are some definitions of reasoning that LLMs can pass, and some they can't. If they're able to outperform dumb humans whom we assume do reason, why does that not mean that LLMs have some capacity to reason?

        • vhantz 4 days ago

          > How do we test for reasoning? If A -> B and B -> C, then something that can reason could conclude A -> C. If I give A -> B and B -> C to an LLM and ask it about the relationship between A and C, it'll tell me about the transitive property of implication, graph theory, transitivity.

          Not true.

          An LLM might give you that answer x% of the time, x being a number less than 100. However, any thinking person answering your question will give you the same answer, no matter how many times you ask it. That's the fundamental difference between thinking and statistically mapping and reproducing the structure of human language.

          • sharemywin 4 days ago

            I'm pretty sure that if you set the temp to 0 it will produce the exact same output every time. It's the sampling that produces the output variation.
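
            Roughly, a minimal sketch of what the sampler does (ignoring top-k/top-p, and the fact that batching and floating-point reduction order can still add nondeterminism on real backends):

                import numpy as np

                def sample_token(logits, temperature):
                    logits = np.asarray(logits, dtype=float)
                    # Temperature 0 = greedy decoding: always the argmax, same token every run.
                    if temperature == 0:
                        return int(np.argmax(logits))
                    # Otherwise scale by 1/temperature, softmax, and sample; higher
                    # temperature flattens the distribution and adds variation.
                    scaled = logits / temperature
                    probs = np.exp(scaled - scaled.max())
                    probs /= probs.sum()
                    return int(np.random.choice(len(logits), p=probs))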

          • perching_aix 4 days ago

            > any thinking person answering your question will give you the same answer, no matter how many times you ask it

            Oh, will they? Will they really?!

            • vhantz 3 days ago

              Yes, 2 + 2 is always 4 if you're not a language model and know basic arithmetic.

              • CamperBob2 3 days ago

                Or if the language model answers the question by writing and running a Python script. Which is exactly what it can do.

                Never mind that the tendency to give the exact same answer to the same question over time is not the exhibition of reasoning power you seem to think it is. Have you actually asked some people to multiply 10-digit numbers in their heads? Did they always get the same result? No? Well, there goes that argument.

                We don't do anything that the LLMs don't do at this point, except adjust our weights (poorly) to move short-term context into long-term memory. Once that capability is added to the models -- which will happen soon enough, because why wouldn't it? -- where will the goalposts go next?

                • vhantz 3 days ago

                  It's not about giving the same answer to the same question. It's about getting the right answer 100% of the time, in some very specific domains. If you know, understand, and are able to use the basic rules of arithmetic, 2 + 2 only has one answer. If you know, understand, and are able to use the basic rules of formal logic, the same premises will lead you to the same conclusion. Two trivial cases for any reasoning person. Two cases that illustrate how fundamentally different LLMs' text generation is from reasoning. Two cases that illustrate some of the challenges that need to be solved to bring AI models closer to the fiction so many on this site are desperately taking them to be.

                  Of course those who don't care about improving those systems also don't care about understanding their limits, which is unsurprisingly the case for a lot of people on this website.

                  • CamperBob2 3 days ago

                    You've failed to explain -- or to understand -- how the models get the right answer at all. The fact is, when you ask what 2+2 is, or what 2342+33222 is, the current ChatGPT model will give you the correct answer, even if you don't tell it to write code to get it. The first answer can simply be regurgitated. The second one, not so much.

                    Heck, let's throw in a square root for the fun of it: https://i.imgur.com/Q9eHAaI.png

                    How'd it do that, if it can't reason? That problem wasn't in its training corpus. Similar ones were, with different numbers, and that was enough.

                    Ask it 100 times, and it will probably get it wrong a certain percentage of the time... just like you would if I asked you to perform the calculation in your head.

                    Notice that the model actually got the least significant digit slightly wrong in this example. 188.58 would be a better estimate. It even screws up the way we do. That, to me, is almost as interesting as the fact that it can deal with the problem at all.
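
                    (If the screenshot is the natural follow-up, sqrt(2342 + 33222) -- my assumption, since the image isn't transcribed here -- the exact value is a two-liner to check:)

                        from math import sqrt
                        print(2342 + 33222)        # 35564
                        print(sqrt(2342 + 33222))  # ~188.5842, so 188.58 is the closer two-decimal estimate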

                    > Of course those who don't care about improving those systems also don't care about understanding their limits

                    The people who do care about improving these systems seem to be doing a pretty awesome job.

                    As for the limits, they frankly don't seem to exist. They certainly aren't where you and your predecessors over the past few years have assured us they are.

              • perching_aix 3 days ago

                So you've never had an encounter at a bar, moderately intoxicated, where you regrettably put the server on the spot by not understanding why you are supposed to pay the amount they're telling you you're supposed to pay?

                'Cause I have, and I doubt it was significantly more complicated math than basic integer addition and subtraction. I also use a calculator even for basic, low-value integer math, because, what do you know, in my perfectness I often had an issue with numbers not ending up as what they were supposed to be.

                There's also the quite accessible and well-documented history of human calculators, and the extensive error-correction strategies they had to employ because they'd keep cocking up calculations.

                Come on...

      • perching_aix 4 days ago

        > None of that is any proof at all that LLMs or computers in general can reason.

        It was never meant to be a proof that LLMs or computers in general can reason [or rather, that they can reason generally]. Instead, it was a demonstration of how their argument looks when mapped to other situations, illustrating that it isn't really bringing anything to the table in the way of proofs, evidence, definitions, or logical arguments, nor does it enable others to do so.

        > some humans are dumb, so LLMs are smart

        That kind of reading loses the point just about completely, so it shouldn't be surprising you can't find an argument in there.

        The point was and is that simply pointing at an AI model and saying "nah it's cappin" is awfully lacking, and is suspiciously similar to how certain people with certain mental conditions view their world. It is not insightful, nor reasonable. It's just an assertion of disbelief, following a - as you can hopefully agree - dubious logic that cannot be disproven or substantially argued against, as it was never designed to enable that in the first place.

carimura 4 days ago

nginx not happy.

  • rosstaylor90 4 days ago

    Happier now, upgraded the backend :) (co-creator here)