Gemini: An Overview of Its Innovative Features and Models
Gemini is a family of chatbots based on artificial intelligence developed by Google. Right now, Gemini is in third place among all chatbots in terms of market share, behind only ChatGPT and Microsoft Copilot. At the same time, Gemini continues to grow faster than its competitors and is steadily gaining popularity: it ranks 4th in terms of new user inflow, with only Claude growing faster among well-known chatbots. In this article we will look at the history of Gemini, current models, their features and limitations.
A brief history of Google Gemini
Google has been a pioneer in large language model architecture and draws on its extensive research to develop its own artificial intelligence models.
- 2017: Google researchers present the transformer architecture, which underpins many of today’s large language models.
- 2020: The company introduces Meena, a neural network-based chatbot with 2.6 billion parameters, which Google claimed to be superior to all other existing chatbots at the time.
- 2021: Meena is renamed LaMDA (short for Language Model for Dialogue Applications) as its data and computing power increase.
- 2022: A new language model called PaLM (Pathways Language Model) is released, with more advanced capabilities compared to LaMDA.
- 2023: A chatbot called Google Bard is released in the first quarter of the year, backed by a lightweight, optimized version of LaMDA. In the second quarter, Google introduces PaLM 2, featuring improved coding, multilingual capabilities, and enhanced reasoning skills, which Bard then adopts. Finally, in the last quarter, Google announces Gemini 1.0.
- 2024: Google renames Bard as Gemini and upgrades its multimodal AI models to version 1.5. Gemini 2.0 models are introduced in December.
In April 2024, Google DeepMind CEO Demis Hassabis said that over time the company would spend more than $100 billion developing artificial intelligence technology.

Demis Hassabis
Gemini’s distinctive features
Every chatbot has limited knowledge of recent events because its training data covers only a finite period of time. The cutoff date is the point in time up to which the model has been trained on data. For instance, if a chatbot has a cutoff date of October 2023, its knowledge is current only until that date, and any events, developments, or changes after that date will not be reflected in its responses. This limitation is important for users to understand, as it affects the accuracy and relevance of the information provided, especially in fast-changing fields such as technology, politics, or current events, so users may need to verify information against more recent sources. Gemini, however, can work around this limitation by accessing and processing information from Google Search, providing more up-to-date answers.
Sometimes, Gemini shows you sources and related content within and below its response. These include web sources with similar information and links for you to dig deeper. Gemini is designed to generate original content, but if it does directly quote at length from a web page, you'll see a quotation mark with the cited source and a link to that page. Sources and related content may include websites that Gemini quoted or that relate to parts of its response. If Gemini's response includes a thumbnail of an image from the web, it will show the source and provide a link directly to it.

Gemini was designed to be multimodal from the start, meaning it was trained on multiple data types and can work seamlessly with different types of content. As you can see in the picture above, the bot can include images in its responses. Gemini can understand text, audio, video fragments, handwritten notes, graphs, and diagrams, and can identify objects in photos. On top of that, it can generate images using Imagen 3, Google’s most advanced text-to-image model.
The chatbot also has broad multilingual capabilities as it is available in 46 different languages.
Current models, their strengths and capabilities
Gemini offers different models that are optimized for specific use cases. Here's a brief overview of the variants that are available:
| Model | Input | Output | Description |
| Gemini 2.0 Flash | Audio, images, videos, and text | Text, images (coming soon), and audio (coming soon) | Next generation features, speed, and multimodal generation for a diverse variety of tasks |
| Gemini 2.0 Flash Thinking | Text, images | Text | Enhanced reasoning model that excels in science and math |
| Gemini 1.5 Flash | Audio, images, videos, and text | Text | Fast and versatile performance across a diverse variety of tasks |
| Gemini 1.5 Flash-8B | Audio, images, videos, and text | Text | High volume and lower intelligence tasks |
| Gemini 1.5 Pro | Audio, images, videos, and text | Text | Complex reasoning tasks requiring more intelligence |
Gemini 1.5 Flash comes with a 1-million-token context window, and Gemini 1.5 Pro comes with a 2-million-token context window, which was the longest of any large language model at the time of its release.
One token is equivalent to about 4 characters for Gemini models. 100 tokens are about 60-80 English words.
In practice, 1 million tokens would look like:
- 50,000 lines of code (with the standard 80 characters per line).
- Transcripts of over 200 average-length podcast episodes.
- 8 average-length English novels.
- All the text messages you have sent in the last 5 years.
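The rule of thumb above (about 4 characters per token) can be turned into a quick estimate. The following is a minimal sketch, not an official tokenizer: the function names `estimate_tokens` and `fits_in_context` are made up for illustration, and real token counts will differ from this approximation.

```python
# Rough token estimate based on the "about 4 characters per token" rule of thumb.
# This is an approximation for planning purposes, not Gemini's actual tokenizer.

CHARS_PER_TOKEN = 4              # rule of thumb quoted in the text
FLASH_INPUT_LIMIT = 1_048_576    # input token limit listed for Gemini 1.5 Flash


def estimate_tokens(text: str) -> int:
    """Estimate the token count as the character count divided by 4, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, limit: int = FLASH_INPUT_LIMIT) -> bool:
    """Return True if the estimated token count is within the given limit."""
    return estimate_tokens(text) <= limit


# 50,000 lines of 80 characters each, as in the first bullet above
# (line breaks are left out to keep the arithmetic simple).
code_sample = ("x" * 80) * 50_000
print(estimate_tokens(code_sample))   # 1000000
print(fits_in_context(code_sample))   # True
```

The estimate reproduces the first bullet: 50,000 lines of 80 characters come to 4,000,000 characters, or about 1 million tokens.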
| Gemini 1.5 Flash and Flash-8B | |
| Input token limit | 1,048,576 |
| Output token limit | 8,192 |
| Maximum number of images | 3,600 |
| Maximum video length | 1 hour |
| Maximum audio length | Approximately 9.5 hours |
Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, unlocking the ability to accurately process long documents, thousands of lines of code, hours of audio, video, and more.
| Gemini 1.5 Pro | |
| Input token limit | 2,097,152 |
| Output token limit | 8,192 |
| Maximum number of images | 7,200 |
| Maximum video length | 2 hours |
| Maximum audio length | Approximately 19 hours |
Each image is equivalent to 258 tokens. Supported image types:
- PNG
- WEBP
- JPEG
- HEIC
- HEIF
There is no specific limit on the number of pixels in an image beyond the model's context window. Larger images are scaled down to a maximum resolution of 3072x3072 while preserving their original aspect ratio, and smaller images are scaled up to 768x768 pixels.
Vision capabilities:
- Caption and answer questions about images.
- Transcribe and reason over PDFs, including long documents that fit within the 2-million-token context window.
- Describe, segment, and extract information from videos, including both visual frames and audio, up to 90 minutes long.

Gemini is able to correctly recognize all of the handwritten content and verify the reasoning.
Gemini’s audio capabilities:
- Describe, summarize, or answer questions about audio content.
- Provide a transcription of the audio.
- Provide answers or a transcription about a specific segment of the audio.
Supported audio formats:
- WAV
- MP3
- FLAC
- OGG Vorbis
- AIFF
- AAC
Each second of audio is equivalent to 25 tokens; for example, one minute of audio is represented as 1,500 tokens.
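Combining the per-image and per-second figures quoted in this section gives a rough way to check whether a mixed prompt fits a model's input limit. The sketch below assumes those figures (258 tokens per image, 25 tokens per second of audio, about 4 characters per token of text). The function names and the dictionary keys are illustrative labels only, and actual counts reported by the service may differ.

```python
# Back-of-the-envelope budget for a multimodal prompt, using the figures
# quoted in this article. An estimate only, not an official calculation.

TOKENS_PER_IMAGE = 258          # "Each image is equivalent to 258 tokens"
TOKENS_PER_AUDIO_SECOND = 25    # "Each second of audio is equivalent to 25 tokens"
CHARS_PER_TOKEN = 4             # rule of thumb for text

# Input token limits from the tables above; the keys are just labels.
INPUT_TOKEN_LIMITS = {
    "gemini-1.5-flash": 1_048_576,
    "gemini-1.5-pro": 2_097_152,
}


def estimate_prompt_tokens(text_chars: int = 0, images: int = 0,
                           audio_seconds: int = 0) -> int:
    """Add up the estimated tokens for text, images, and audio."""
    text_tokens = -(-text_chars // CHARS_PER_TOKEN)  # ceiling division
    return (text_tokens
            + images * TOKENS_PER_IMAGE
            + audio_seconds * TOKENS_PER_AUDIO_SECOND)


def fits_model(model: str, **prompt_parts: int) -> bool:
    """Return True if the estimated prompt size is within the model's input limit."""
    return estimate_prompt_tokens(**prompt_parts) <= INPUT_TOKEN_LIMITS[model]


print(estimate_prompt_tokens(audio_seconds=60))   # 1500, one minute of audio
print(estimate_prompt_tokens(images=3_600))       # 928800
print(fits_model("gemini-1.5-flash", images=3_600, text_chars=2_000))  # True
```

By this estimate, the 3,600-image maximum listed for Gemini 1.5 Flash comes to 928,800 tokens, which leaves some room in the 1,048,576-token input limit for accompanying text.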
| Gemini 2.0 Flash | |
| Input token limit | 1,048,576 |
| Output token limit | 8,192 |
Gemini 2.0 Flash is the most powerful and versatile model of the Gemini family. It can natively create images and generate speech, and when it comes to performance, it surpasses the other models in almost all key benchmarks. See for yourself.
| Capability | Benchmark | Description | Gemini 1.5 Flash | Gemini 1.5 Pro | Gemini 2.0 Flash |
| General | MMLU-Pro | Evaluates how well machine learning models understand natural language | 67.3% | 75.8% | 76.4% |
| Code | Natural2Code | Code generation across Python, Java, C++, JS, Go | 79.8% | 85.4% | 92.9% |
| Code | Bird-SQL (Dev) | Evaluates converting natural language questions into executable SQL | 45.6% | 54.4% | 56.9% |
| Factuality | FACTS Grounding | Ability to provide factually correct responses given documents and diverse user requests | 82.9% | 80.0% | 83.6% |
| Math | MATH | Challenging math problems (incl. algebra, geometry, pre-calculus, and others) | 77.9% | 86.5% | 89.7% |
| Math | HiddenMath | Competition-level math problems | 47.2% | 52.0% | 63.0% |
| Reasoning | GPQA (diamond) | Challenging dataset of questions written by domain experts in biology, physics, and chemistry | 51.0% | 59.1% | 62.1% |
| Image | MMMU | Multi-discipline college-level multimodal understanding and reasoning problems | 62.3% | 65.9% | 70.7% |
| Audio | CoVoST2 (21 lang) | Automatic speech translation (BLEU score) | 37.4 | 40.1 | 39.2 |
| Video | EgoSchema (test) | Video analysis | 66.8% | 71.2% | 71.5% |
Gemini 2.0 Flash Thinking combines speed and performance, demonstrating remarkable expertise in tackling complex problems in both math and science. A one-million token context window enables deeper analysis of long-form text. Improved thinking provides more consistency between thoughts and answers.
| Gemini 2.0 Flash Thinking | |
| Input token limit | 1,048,576 |
| Output token limit | 65,536 |
Note the very large output token limit. It allows the model not only to process lengthy requests but also to return extensive responses, which can come in handy for generating large chunks of code, for instance.
See how Gemini 2.0 Flash Thinking surpasses Gemini 1.5 Pro and Gemini 2.0 in Math, Science, and Multimodal reasoning. It might not be as versatile as those two models in general, but in these specific domains, Gemini 2.0 Flash Thinking is unmatched.

Math, science, and reasoning

Math and science
Criticism
The Gemini chatbot had a rough start when it was released in 2023. The developers were in too much of a hurry to ship a rival to ChatGPT, so the release version was riddled with bugs, and users complained about a large number of factual errors and inaccuracies in the bot's answers.
One of the most high-profile incidents was the image generation controversy. Gemini tried to present maximum racial diversity even where it was inappropriate. According to the chatbot, this is what German soldiers looked like in 1943:

And this is what U.S. senators from the 1800s looked like:

Amid the user backlash, the company's shares fell by 4.5%, which corresponds to a loss of roughly $90 billion in market value. The developers also had to temporarily block the ability to generate images of people.
Following the controversy surrounding image generation, some users began accusing Gemini's text responses of being biased toward the left. In one such example Gemini stated that it was "difficult to say definitively" whether Elon Musk or the Nazi dictator Adolf Hitler had a greater negative impact on society. Additionally, other users noted that Gemini appeared to favor left-leaning politicians and issues like affirmative action and abortion rights, while being reluctant to support right-wing figures, meat consumption, and fossil fuels.
That said, these difficulties are now mostly behind it, and Gemini is one of the most successful and popular chatbots in the world.