{"id":16020,"date":"2024-02-08T08:44:21","date_gmt":"2024-02-08T08:44:21","guid":{"rendered":"https:\/\/www.stratio.com\/blog\/?p=16020"},"modified":"2025-02-24T12:17:40","modified_gmt":"2025-02-24T12:17:40","slug":"vit-applying-the-genai-core-to-image-recognition","status":"publish","type":"post","link":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/","title":{"rendered":"Vision Transformers &#8211; ViT, applying the GenAI Core to image recognition"},"content":{"rendered":"\n<p>Have you ever wondered how generative AI tools seem to work so well with images? The clever ways maintenance crews can detect pot holes on the road just by flying a drone over them or doctors being able to detect skin cancer from photos? This is possible because of the <strong>Transformers<\/strong> -not \u2018robots in disguise\u2019, but a neural network architecture that implements attention mechanisms.&nbsp; It has been a revolution in Natural Language Processing (NLP) and image\/text generation tasks and allowed a great advance in other tasks related with image recognition: <strong>image classification<\/strong>, <strong>object detection<\/strong>, and <strong>semantic image segmentation<\/strong>. All these advances are possible with the use of Vision Transformers or ViT.<\/p>\n\n\n\n<p>In this article we will explore what are the Vision Transformers? How do they work? 
And how do they improve image recognition tasks compared to Convolutional Neural Networks (CNNs)?<\/p>\n\n\n\n<p class=\"has-text-align-center has-larger-font-size\"><img decoding=\"async\" src=\"https:\/\/lh7-us.googleusercontent.com\/1kR9C6gJfTWWS66uD0fNTpyP5tLoF2Dq2q28Qa6hQcNd9aIh-3zmhXnhxeNQ6PcJF2tMNg1wrUhBV-tEJm_oNy-wK188b1U53lyM3yEhw0wU79XMwpupCO00uL-RaGcBCkk-c5L9d1OAApL0vz6T334\" style=\"width: 800px;\"><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-are-the-vision-transformers\">What are the Vision Transformers?<\/h2>\n\n\n\n<p>The Vision Transformer (ViT) model architecture was introduced in a research paper titled <a href=\"https:\/\/arxiv.org\/pdf\/2010.11929.pdf\" target=\"_blank\" rel=\"noreferrer noopener\"><em><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-cyan-blue-color\">An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale<\/mark><\/em><\/a>, developed by <em>Neil Houlsby, Alexey Dosovitskiy, and a team of authors from the Google Research Brain Team<\/em>. The fine-tuning code and pre-trained models are available in this <a href=\"https:\/\/github.com\/google-research\/vision_transformer\" target=\"_blank\" rel=\"noreferrer noopener\"><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-cyan-blue-color\">GitHub repository<\/mark><\/a>.<\/p>\n\n\n\n<p>A <strong>transformer<\/strong> is a deep learning model that uses attention mechanisms to weigh the importance of each part of an input data sequence (usually text). 
The <strong>Vision Transformer<\/strong> is a special type of transformer that does the same thing, but with images.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-do-the-vision-transformers-work\">How do the Vision Transformers work?<\/h2>\n\n\n\n<p>Unlike CNNs, which process images by detecting features through convolutional filters, ViTs split the image into a series of <strong>small segments or &#8220;patches&#8221;<\/strong> of a fixed size. For example, an image of 224&#215;224 pixels divided into 16&#215;16 patches yields 224\/16 x 224\/16 = 14 x 14 = 196 patches.<\/p>\n\n\n\n<p>Each of these patches is then flattened and transformed into a <strong>one-dimensional sequence of tokens<\/strong>. Each token encodes a part of the image. However, because the tokens are obtained by flattening the original image, information about its spatial layout is lost: the model no longer knows which part of the image comes before or after another. Therefore, information about each patch's original position, called the <strong>positional embedding<\/strong>, is added to these vectors. 
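The patch-splitting arithmetic above can be sketched in a few lines of NumPy (an illustrative sketch with our own function and variable names, not the paper's actual code):

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch tokens."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n_h, n_w = h // patch_size, w // patch_size
    # Reshape into a grid of patches, then flatten each patch into one token.
    patches = image.reshape(n_h, patch_size, n_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_h * n_w, -1)
    return patches

tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): 14 x 14 = 196 tokens of 16*16*3 = 768 values each
```

Each of the 196 rows is one token; in the real model these tokens are linearly projected and the positional embedding is added before they enter the encoder.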
This final sequence is the\u00a0 input of a <strong>transformer encoder<\/strong>.<\/p>\n\n\n\n<figure data-wp-context=\"{&quot;uploadedSrc&quot;:&quot;https:\\\/\\\/www.stratio.com\\\/blog\\\/wp-content\\\/uploads\\\/2024\\\/02\\\/pasted-image-0.png&quot;,&quot;figureClassNames&quot;:&quot;wp-block-image size-full&quot;,&quot;figureStyles&quot;:null,&quot;imgClassNames&quot;:&quot;wp-image-16021&quot;,&quot;imgStyles&quot;:null,&quot;targetWidth&quot;:930,&quot;targetHeight&quot;:485,&quot;scaleAttr&quot;:false,&quot;ariaLabel&quot;:&quot;Enlarge image: Vision Transformer VIT Architecture - Source\\n&quot;,&quot;alt&quot;:&quot;Vision Transformer VIT Architecture - Source\\n&quot;}\" data-wp-interactive=\"core\/image\" class=\"wp-block-image size-full wp-lightbox-container\"><img loading=\"lazy\" decoding=\"async\" width=\"930\" height=\"485\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/pasted-image-0.png\" alt=\"Vision Transformer VIT Architecture - Source\n\" class=\"wp-image-16021\" srcset=\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/pasted-image-0.png 930w, https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/pasted-image-0-300x156.png 300w\" sizes=\"(max-width: 930px) 100vw, 930px\" \/><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\taria-label=\"Enlarge image: Vision Transformer VIT Architecture - Source\n\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"context.imageButtonRight\"\n\t\t\tdata-wp-style--top=\"context.imageButtonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path 
fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><figcaption class=\"wp-element-caption\">Figure 1: <a href=\"https:\/\/github.com\/google-research\/vision_transformer\"><em>Vision Transformer VIT Architecture &#8211; Source<\/em><\/a><\/figcaption><\/figure>\n\n\n\n<p>In this encoder, the self-attention mechanisms analyze each patch of the image, taking into account its relationship with the rest of the parts of the image. These compute a weighted sum of the input data, where the weights are computed based on the similarity between the input features. So, the model has the capability to evaluate the relative importance of each patch in relation to the others, allowing <strong>ViTs to capture long-range relationships between different parts of the image<\/strong> and to attend to different regions of the input data, based on their relevance to the task at hand.&nbsp;<\/p>\n\n\n\n<p>The final output of the ViT architecture is a <strong>class prediction<\/strong>, obtained by passing the output of the last transformer block through a classification head, which typically consists of a single fully connected layer.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-visual-transformers-vs-convolutional-neural-networks\">Visual Transformers vs Convolutional Neural Networks<\/h2>\n\n\n\n<p>According to results obtained in this article <a href=\"https:\/\/arxiv.org\/pdf\/2012.12556.pdf\" target=\"_blank\" rel=\"noreferrer noopener\"><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-cyan-blue-color\">A Survey on Visual Transformer<\/mark><\/a><strong>, the Visual Transformers or the combination of ViTs with CNN allows for better accuracy compared to CNN networks<\/strong>. 
It has also been shown that this new architecture, even with hyperparameter fine-tuning (a compute-intensive search for the best model training parameters), can be lighter than a CNN, consuming fewer computational resources and taking less training time.\u00a0<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-summary\">Summary<\/h2>\n\n\n\n<p>1.&nbsp; The ViT architecture is more robust than CNN networks for images that are noisy or augmented.&nbsp;<\/p>\n\n\n\n<p>2. The ViT architecture performs better than CNNs thanks to its self-attention mechanisms, which make global image information accessible from the highest to the lowest layers.&nbsp;<\/p>\n\n\n\n<p>3. ViTs can learn more from fewer images: because the images are divided into small patches, there is a greater diversity of relationships between them.&nbsp;<\/p>\n\n\n\n<p>4. With smaller datasets, however, CNNs can generalize better and reach higher accuracy than ViTs.\u00a0<\/p>\n\n\n\n<figure data-wp-context=\"{&quot;uploadedSrc&quot;:&quot;https:\\\/\\\/www.stratio.com\\\/blog\\\/wp-content\\\/uploads\\\/2024\\\/02\\\/pasted-image-0-1.png&quot;,&quot;figureClassNames&quot;:&quot;wp-block-image size-full&quot;,&quot;figureStyles&quot;:null,&quot;imgClassNames&quot;:&quot;wp-image-16022&quot;,&quot;imgStyles&quot;:null,&quot;targetWidth&quot;:886,&quot;targetHeight&quot;:671,&quot;scaleAttr&quot;:false,&quot;ariaLabel&quot;:&quot;Enlarge image: Comparison between CNN and Vision Transformer Models - Source\\n&quot;,&quot;alt&quot;:&quot;Comparison between CNN and Vision Transformer Models - Source\\n&quot;}\" data-wp-interactive=\"core\/image\" class=\"wp-block-image size-full wp-lightbox-container\"><img loading=\"lazy\" decoding=\"async\" width=\"886\" height=\"671\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" 
data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/pasted-image-0-1.png\" alt=\"Comparison between CNN and Vision Tansformer Models - Source\n\" class=\"wp-image-16022\" srcset=\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/pasted-image-0-1.png 886w, https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/pasted-image-0-1-300x227.png 300w, https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/pasted-image-0-1-87x67.png 87w\" sizes=\"(max-width: 886px) 100vw, 886px\" \/><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\taria-label=\"Enlarge image: Comparison between CNN and Vision Tansformer Models - Source\n\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"context.imageButtonRight\"\n\t\t\tdata-wp-style--top=\"context.imageButtonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><figcaption class=\"wp-element-caption\">Figure 2: <a href=\"https:\/\/arxiv.org\/pdf\/2012.12556.pdf\"><em>Comparison between CNN and Vision Tansformer Models &#8211; Source<\/em><\/a><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-visual-transformers-in-stratio-platform\">Visual Transformers in Stratio Platform<\/h2>\n\n\n\n<p>All these results and advances obtained make visual transformers the most appropriate elements to do tasks related to image recognition in generative AI tools, such as: <em>Image Detection and Classification,&nbsp; Video Deepfake Detection and 
Anomaly Detection, Autonomous Driving, and Image Segmentation and Cluster Analysis.<\/em><\/p>\n\n\n\n<p>The Stratio Generative AI Data Fabric product not only serves to govern, process, and analyze data; it is also used to <strong>develop and deploy the deep learning models<\/strong> on which Vision Transformers are based.\u00a0 It also includes its own <strong>module for developing and productionizing GenAI models<\/strong> with efficiency and agility.<\/p>\n\n\n\n<p>Request a <a href=\"https:\/\/www.stratio.com\/request-a-demo\" target=\"_blank\" rel=\"noreferrer noopener\"><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-vivid-cyan-blue-color\">demo<\/mark><\/a> now!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-about-the-author\">About the author:<\/h2>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignleft size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"422\" height=\"560\" src=\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/javiermoralo.png\" alt=\"\" class=\"wp-image-16023\" style=\"width:375px;height:auto\" srcset=\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/javiermoralo.png 422w, https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/javiermoralo-226x300.png 226w\" sizes=\"(max-width: 422px) 100vw, 422px\" \/><\/figure><\/div>\n\n\n<p><em>Data Science Lead at Stratio BD. 
He currently works on the design and development of Artificial Intelligence projects, in Big Data environments and Agile teams.<\/em><\/p>\n\n\n\n<p><em>A bioinformatician with a Master in Big Data &amp; Analytics, he has more than 21 years in the data sector and more than 30 AI models developed, including classic machine learning and deep learning models, reinforcement learning models, and generative AI models.<\/em><\/p>\n\n\n\n<p><em>A language enthusiast (in addition to Spanish he speaks English and Russian), he is also fond of history, photography, and Eastern disciplines such as Taoism and Tai Chi.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Have you ever wondered how generative AI tools seem to work so well with images? The clever ways maintenance crews can detect potholes on the road just by flying a drone over them or doctors being able to detect skin cancer from photos? This is possible because of the Transformers -not \u2018robots in disguise\u2019,<\/p>\n","protected":false},"author":257,"featured_media":16027,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[558,686],"tags":[16,649,722],"ppma_author":[797],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v22.9 (Yoast SEO v22.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Image Recognition: How Vision Transformers Work - Stratio<\/title>\n<meta name=\"description\" content=\"Explore image recognition advancements with Vision Transformers. 
Learn how they outperform Convolutional Neural Networks.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Image Recognition and Vision Transformers Explained\" \/>\n<meta property=\"og:description\" content=\"Explore the world of image recognition and learn how Vision Transformers improve accuracy in tasks like object detection.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/\" \/>\n<meta property=\"og:site_name\" content=\"Stratio\" \/>\n<meta property=\"article:published_time\" content=\"2024-02-08T08:44:21+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-02-24T12:17:40+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/Hero-Blog-12.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"670\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Javier Moralo\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Vision Transformers - ViT, applying the GenAI Core to image recognition\" \/>\n<meta name=\"twitter:creator\" content=\"@stratiobd\" \/>\n<meta name=\"twitter:site\" content=\"@stratiobd\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Javier Moralo\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/\"},\"author\":{\"name\":\"Javier Moralo\",\"@id\":\"https:\/\/www.stratio.com\/blog\/#\/schema\/person\/607849d6401902ffabc474ba2e1aab4e\"},\"headline\":\"Vision Transformers &#8211; ViT, applying the GenAI Core to image recognition\",\"datePublished\":\"2024-02-08T08:44:21+00:00\",\"dateModified\":\"2025-02-24T12:17:40+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/\"},\"wordCount\":914,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.stratio.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/Hero-Blog-12.png\",\"keywords\":[\"Artifical Intelligence\",\"Data Fabric\",\"Vision Transformers\"],\"articleSection\":[\"News\",\"Product\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/\",\"url\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/\",\"name\":\"Image Recognition: How Vision Transformers Work - 
Stratio\",\"isPartOf\":{\"@id\":\"https:\/\/www.stratio.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/Hero-Blog-12.png\",\"datePublished\":\"2024-02-08T08:44:21+00:00\",\"dateModified\":\"2025-02-24T12:17:40+00:00\",\"description\":\"Explore image recognition advancements with Vision Transformers. Learn how they outperform Convolutional Neural Networks.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#primaryimage\",\"url\":\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/Hero-Blog-12.png\",\"contentUrl\":\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/Hero-Blog-12.png\",\"width\":1000,\"height\":670},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.stratio.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Vision Transformers &#8211; ViT, applying the GenAI Core to image recognition\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.stratio.com\/blog\/#website\",\"url\":\"https:\/\/www.stratio.com\/blog\/\",\"name\":\"Stratio Blog\",\"description\":\"Corporate 
blog\",\"publisher\":{\"@id\":\"https:\/\/www.stratio.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.stratio.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.stratio.com\/blog\/#organization\",\"name\":\"Stratio\",\"url\":\"https:\/\/www.stratio.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.stratio.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/stratio.com\/blog\/wp-content\/uploads\/2020\/06\/stratio-web-logo-1.png\",\"contentUrl\":\"https:\/\/stratio.com\/blog\/wp-content\/uploads\/2020\/06\/stratio-web-logo-1.png\",\"width\":260,\"height\":55,\"caption\":\"Stratio\"},\"image\":{\"@id\":\"https:\/\/www.stratio.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/stratiobd\",\"https:\/\/es.linkedin.com\/company\/stratiobd\",\"https:\/\/www.youtube.com\/c\/StratioBD\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.stratio.com\/blog\/#\/schema\/person\/607849d6401902ffabc474ba2e1aab4e\",\"name\":\"Javier Moralo\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.stratio.com\/blog\/#\/schema\/person\/image\/fdc2cb1a849ccb5357c29b153482c5a0\",\"url\":\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/unnamed-150x150.png\",\"contentUrl\":\"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/unnamed-150x150.png\",\"caption\":\"Javier Moralo\"}}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Image Recognition: How Vision Transformers Work - Stratio","description":"Explore image recognition advancements with Vision Transformers. 
Learn how they outperform Convolutional Neural Networks.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/","og_locale":"en_US","og_type":"article","og_title":"Image Recognition and Vision Transformers Explained","og_description":"Explore the world of image recognition and learn how Vision Transformers improve accuracy in tasks like object detection.","og_url":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/","og_site_name":"Stratio","article_published_time":"2024-02-08T08:44:21+00:00","article_modified_time":"2025-02-24T12:17:40+00:00","og_image":[{"width":1000,"height":670,"url":"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/Hero-Blog-12.png","type":"image\/png"}],"author":"Javier Moralo","twitter_card":"summary_large_image","twitter_title":"Vision Transformers - ViT, applying the GenAI Core to image recognition","twitter_creator":"@stratiobd","twitter_site":"@stratiobd","twitter_misc":{"Written by":"Javier Moralo","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#article","isPartOf":{"@id":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/"},"author":{"name":"Javier Moralo","@id":"https:\/\/www.stratio.com\/blog\/#\/schema\/person\/607849d6401902ffabc474ba2e1aab4e"},"headline":"Vision Transformers &#8211; ViT, applying the GenAI Core to image recognition","datePublished":"2024-02-08T08:44:21+00:00","dateModified":"2025-02-24T12:17:40+00:00","mainEntityOfPage":{"@id":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/"},"wordCount":914,"commentCount":0,"publisher":{"@id":"https:\/\/www.stratio.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#primaryimage"},"thumbnailUrl":"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/Hero-Blog-12.png","keywords":["Artifical Intelligence","Data Fabric","Vision Transformers"],"articleSection":["News","Product"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/","url":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/","name":"Image Recognition: How Vision Transformers Work - 
Stratio","isPartOf":{"@id":"https:\/\/www.stratio.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#primaryimage"},"image":{"@id":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#primaryimage"},"thumbnailUrl":"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/Hero-Blog-12.png","datePublished":"2024-02-08T08:44:21+00:00","dateModified":"2025-02-24T12:17:40+00:00","description":"Explore image recognition advancements with Vision Transformers. Learn how they outperform Convolutional Neural Networks.","breadcrumb":{"@id":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#primaryimage","url":"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/Hero-Blog-12.png","contentUrl":"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/Hero-Blog-12.png","width":1000,"height":670},{"@type":"BreadcrumbList","@id":"https:\/\/www.stratio.com\/blog\/vit-applying-the-genai-core-to-image-recognition\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.stratio.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Vision Transformers &#8211; ViT, applying the GenAI Core to image recognition"}]},{"@type":"WebSite","@id":"https:\/\/www.stratio.com\/blog\/#website","url":"https:\/\/www.stratio.com\/blog\/","name":"Stratio Blog","description":"Corporate 
blog","publisher":{"@id":"https:\/\/www.stratio.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.stratio.com\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.stratio.com\/blog\/#organization","name":"Stratio","url":"https:\/\/www.stratio.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.stratio.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/stratio.com\/blog\/wp-content\/uploads\/2020\/06\/stratio-web-logo-1.png","contentUrl":"https:\/\/stratio.com\/blog\/wp-content\/uploads\/2020\/06\/stratio-web-logo-1.png","width":260,"height":55,"caption":"Stratio"},"image":{"@id":"https:\/\/www.stratio.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/stratiobd","https:\/\/es.linkedin.com\/company\/stratiobd","https:\/\/www.youtube.com\/c\/StratioBD"]},{"@type":"Person","@id":"https:\/\/www.stratio.com\/blog\/#\/schema\/person\/607849d6401902ffabc474ba2e1aab4e","name":"Javier Moralo","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.stratio.com\/blog\/#\/schema\/person\/image\/fdc2cb1a849ccb5357c29b153482c5a0","url":"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/unnamed-150x150.png","contentUrl":"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/unnamed-150x150.png","caption":"Javier Moralo"}}]}},"authors":[{"term_id":797,"user_id":257,"is_guest":0,"slug":"jmoralo","display_name":"Javier 
Moralo","avatar_url":"https:\/\/www.stratio.com\/blog\/wp-content\/uploads\/2024\/02\/unnamed-150x150.png","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/posts\/16020"}],"collection":[{"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/users\/257"}],"replies":[{"embeddable":true,"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/comments?post=16020"}],"version-history":[{"count":1,"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/posts\/16020\/revisions"}],"predecessor-version":[{"id":16024,"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/posts\/16020\/revisions\/16024"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/media\/16027"}],"wp:attachment":[{"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/media?parent=16020"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/categories?post=16020"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/tags?post=16020"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.stratio.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=16020"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}