# Understanding DeepSeek R1
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model in many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
What makes DeepSeek-R1 particularly interesting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
Until ~GPT-4, the common wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
## The Essentials
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag, before answering with a final summary.
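As an illustration (my own made-up example, not actual model output), a response in this format can be split into its reasoning and its final answer with a few lines of Python:

```python
import re

# A made-up R1-style response: reasoning inside <think> tags, then the final summary.
response = (
    "<think>The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>"
    "17 * 24 = 408."
)

match = re.match(r"<think>(.*?)</think>(.*)", response, re.DOTALL)
thinking, answer = match.group(1).strip(), match.group(2).strip()
print("Reasoning:", thinking)
print("Answer:", answer)
```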
## R1-Zero vs R1
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
## Training Pipeline
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their approach differs from the usual:
- The usual training strategy: pretraining on a big dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
- R1-Zero: Pretrained → RL
- R1: Pretrained → multi-stage training pipeline with multiple SFT and RL stages
1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to begin RL from.
2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples. (A minimal sketch of the rejection-sampling idea follows this list.)
4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
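As referenced in step 3 above, here is a minimal sketch of the rejection-sampling idea; the `generate` and `is_acceptable` helpers are hypothetical placeholders, not DeepSeek's actual tooling:

```python
def rejection_sample(prompts, generate, is_acceptable, n_samples=16):
    """Build SFT data by keeping only sampled completions that pass a quality filter.

    generate(prompt) -> completion sampled from the RL checkpoint
    is_acceptable(prompt, completion) -> bool (correctness / format / readability check)
    """
    sft_data = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        accepted = [c for c in candidates if is_acceptable(prompt, c)]
        # Accepted completions become new supervised fine-tuning examples.
        sft_data.extend({"prompt": prompt, "completion": c} for c in accepted)
    return sft_data
```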
They also performed model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.
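A minimal sketch of that idea, with hypothetical `teacher_generate` and `finetune` helpers standing in for the real machinery:

```python
def distill(prompts, teacher_generate, student_model, finetune):
    """Reasoning-trace distillation: the teacher writes the training data,
    and the student is fine-tuned on it with ordinary SFT.

    teacher_generate(prompt) -> full response, including the chain-of-thought
    finetune(model, dataset) -> fine-tuned model
    """
    # 1. The large teacher (e.g. R1) produces reasoning traces for each prompt.
    dataset = [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]
    # 2. The smaller student (e.g. a Qwen or Llama model) learns to imitate them.
    return finetune(student_model, dataset)
```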
## Group Relative Policy Optimization (GRPO)
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected `<think>`/answer format, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
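To make that concrete, here is a small sketch of what such a rule-based reward could look like. The specific checks, the weights, and the `detect_language` helper are my own illustrative assumptions, not DeepSeek's implementation:

```python
import re

def rule_based_reward(prompt, response, reference_answer, detect_language):
    """Illustrative reward: correctness + format + language consistency."""
    reward = 0.0
    # Format: the response starts with a <think>...</think> block followed by a final answer.
    if re.fullmatch(r"<think>.*?</think>.+", response, flags=re.DOTALL):
        reward += 0.5
    # Accuracy: compare the text after the closing tag against a reference answer.
    final_answer = response.split("</think>")[-1].strip()
    if final_answer == reference_answer:
        reward += 1.0
    # Language consistency: the answer should be in the prompt's language.
    if detect_language(final_answer) == detect_language(prompt):
        reward += 0.5
    return reward
```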
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of different responses.
2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.
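To make step 3 concrete, here is a minimal sketch of the group-relative advantage computation (the normalization follows the DeepSeekMath paper; the reward values are made up):

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize rewards within a group of responses to the same prompt:
    advantage_i = (r_i - mean(r)) / std(r). No separate critic model is needed."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: four sampled responses to one prompt, scored by a rule-based reward.
print(group_relative_advantages([2.0, 0.5, 0.5, 1.5]))
# Responses above the group mean get positive advantages and are reinforced;
# the actual policy update then uses a PPO-style clipped objective plus a KL penalty.
```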
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the expected `<think>` syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those aiming to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
## Is RL on LLMs the path to AGI?
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

"These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities."

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce considerable performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
## Running DeepSeek-R1
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
### 671B via Llama.cpp
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.
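For reference, a rough sketch of what a comparable setup looks like through the llama-cpp-python bindings (the run above used llama.cpp itself; the model path below is a placeholder for the Unsloth GGUF files, and the KV-cache quantization flag is omitted):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder for the Unsloth 1.58-bit quant
    n_gpu_layers=29,  # partial offload: 29 layers on the GPU was the sweet spot here
    n_ctx=8192,       # context length; adjust to available memory
)

out = llm("Why is the sky blue? Think step by step.", max_tokens=512)
print(out["choices"][0]["text"])
```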
Performance:
A r/localllama user described being able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.
### 70B via Ollama
70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
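For completeness, a rough sketch of querying the same model programmatically through the ollama Python package; the `deepseek-r1:70b` tag and the prompt are my own assumptions for illustration:

```python
import ollama

# Assumes the Ollama server is running and `ollama pull deepseek-r1:70b` has completed.
resp = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "Why is the sky blue? Think step by step."}],
)
print(resp["message"]["content"])
```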
## Resources
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandma - YouTube
### DeepSeek
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
## Interesting events
- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
Liked this post? Join the newsletter.