Technical SEO Guide: From Crawling to Indexing – Making Search Engines Understand Your Website (Part 1)

Post Time: Mar 29, 2026
Update Time: Apr 12, 2026
Summary

Master technical SEO with this complete guide covering crawlability, indexability, Core Web Vitals, structured data, XML sitemaps, and mobile-first indexing. Learn how to build a search-engine-friendly website infrastructure.

What is Google technical SEO?

Simply put, it is the practice of enabling search engines to find your pages, understand your content, and correctly index your website.


Many people think technical SEO is mysterious. It involves code, configurations, and a host of confusing terms.

In reality, it is not. The core logic of technical SEO is simple: Google is a robot. It needs to crawl your website, understand your content, and store your pages in its database. Your job is to make this process as smooth as possible.

If Google cannot crawl your pages, even the best content is useless. If Google cannot understand your page structure, it will not know where to rank you. If your website is slow or offers a poor mobile experience, Google will directly lower your rankings.

According to research by SEMrush, over 80% of websites have technical SEO issues. Many of these are fundamental – incorrect robots.txt configurations, missing canonical tags, and excessively slow page speeds.

This article will start from the most basic crawling mechanisms and go all the way to the cutting edge of AI search optimization. Whether you are a beginner or an experienced SEO professional, you will find useful information here.

Let us begin.


The Underlying Logic of Google Technical SEO: Crawling, Rendering, Indexing

Before diving into specific operations, it is important to understand how Google works.

Step 1: Crawling

Google has a crawler program called Googlebot. Its job is to constantly visit web pages and fetch their content.

How does Googlebot discover new pages?

  • Through links on known pages
  • Through XML Sitemaps
  • Through manual submission in Google Search Console
  • Through links from third-party websites pointing to your site

Once Googlebot discovers a URL, it adds it to the crawl queue. However, not all URLs are crawled immediately. Google decides the crawl priority based on the page's importance, update frequency, and the website's crawl budget.

Step 2: Rendering

After crawling the HTML, Google needs to render the page – execute JavaScript, load CSS, and generate the final Document Object Model (DOM).

This step is crucial. If your website heavily relies on JavaScript to generate content (for example, React, Vue, or Angular single-page applications), Google needs additional time and resources to render. According to official Google documentation, rendering can be delayed from a few seconds to several days.

Step 3: Indexing

After rendering, Google analyzes the page content, extracts key information (titles, body text, links, structured data, etc.), and then decides whether to include the page in its index.

The index is Google's database. Only pages that are indexed can appear in search results.

The entire process:

```txt
URL Discovery → Added to Crawl Queue → HTML Crawled → Page Rendered → Content Analyzed → Added to Index → Included in Rankings
```

The goal of technical SEO is to ensure every step in this process runs smoothly.


Robots.txt: The First Door for Controlling Crawlers

Robots.txt is a text file placed in your website's root directory (example.com/robots.txt). It tells search engine crawlers which pages can be crawled and which cannot.

Basic Syntax

```txt
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Explanation:

  • User-agent: * — applies to all crawlers
  • Disallow: /admin/ — blocks crawling of all pages under the /admin/ directory
  • Allow: / — allows crawling of all other pages
  • Sitemap: — tells crawlers the location of the Sitemap

Common Robots.txt Errors

| Error | Consequence | Correct Approach |
|---|---|---|
| Disallow: / (blocking all crawling) | Entire website disappears from search results | Only block directories that do not need indexing |
| Blocking CSS/JS files | Google cannot render the page, affecting rankings | Allow crawling of CSS and JS |
| Blocking image directories | Images do not appear in Google Images | Allow crawling of images |
| Leaving the development robots.txt in place | Entire site blocked after going live | Check robots.txt before launch |
| Using robots.txt to prevent indexing | Page may still be indexed (just not crawled) | Use the noindex tag to prevent indexing |

The last point is particularly important: robots.txt can only prevent crawling, not indexing. If another website links to one of your pages, Google might index that URL without crawling it (showing only the URL without a content summary). To truly prevent indexing, you must use the noindex tag.
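For reference, a noindex directive can be set either in the page's HTML or, for non-HTML files such as PDFs, via an HTTP response header. A minimal example (the page must stay crawlable, i.e. not disallowed in robots.txt, or Googlebot will never fetch the page and never see the tag):

```html
<!-- In the <head> of the page: crawlable, but excluded from Google's index.
     "follow" means links on the page can still be followed. -->
<meta name="robots" content="noindex, follow">
```

The server-side equivalent is the `X-Robots-Tag: noindex` HTTP header.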

Robots.txt for Different Platforms

  • WordPress: robots.txt is generated automatically by default and can be customized using plugins like Rank Math or Yoast SEO.
  • Shopify: robots.txt is generated automatically and cannot be edited directly. However, starting in 2021, limited customization is possible through the robots.txt.liquid template.
  • Custom Websites: Create the robots.txt file manually and place it in the website's root directory.

Robots.txt for AI Crawlers

Between 2024 and 2025, AI crawlers have become a new issue. OpenAI's GPTBot, Anthropic's ClaudeBot, and Google's Google-Extended – these AI crawlers scrape your content to train their models.

If you do not want AI crawlers to scrape your content, you can add content to robots.txt as follows:

```txt
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

However, note that blocking AI crawlers may affect your visibility in AI search results. This is a trade-off.


XML Sitemap: Giving Google a Map

An XML Sitemap is a file that lists all the important pages on your website. It helps Google discover and understand your website's structure.

A Sitemap is not a ranking factor. Having a Sitemap will not make you rank higher. However, it helps ensure Google is aware of all your important pages.

1. Basic Sitemap Format

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1/</loc>
    <lastmod>2025-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

Where:

  • loc: Page URL (required)

  • lastmod: Last modification date (recommended; Google references this)

  • changefreq: Update frequency (Google largely ignores this field)

  • priority: Priority (Google also largely ignores this)

In practice, you only need to focus on loc and lastmod.

2. Sitemap Best Practices

| Rule | Explanation |
|---|---|
| Include only pages that need indexing | Do not put noindex pages, redirected pages (301/302), or 404 error pages in the Sitemap. These waste crawl budget and send mixed signals to search engines. |
| URLs in the Sitemap must be canonical URLs | If a page has a canonical tag pointing to another URL, the Sitemap should include the canonical URL, not the duplicate or variant URL. |
| Maximum 50,000 URLs per Sitemap | Each Sitemap file cannot exceed 50,000 URLs. If your site exceeds this limit, split URLs across multiple Sitemap files and use a Sitemap Index file (sitemap_index.xml) to aggregate them. |
| Sitemap file size not exceeding 50MB | The uncompressed file size must stay under 50MB. For large Sitemaps, submit compressed files (.xml.gz) to reduce bandwidth and improve processing speed. |
| lastmod should be accurate | Update the lastmod tag only when the page content actually changes. Do not automatically update all pages daily; this creates unnecessary crawl demand and erodes trust with search engines. |
| Declare the Sitemap location in robots.txt | Add a Sitemap directive to your robots.txt file to help search engines discover your Sitemap. Format: Sitemap: https://example.com/sitemap.xml |
| Submit in Google Search Console | After publishing your Sitemap, submit it via Google Search Console (or Bing Webmaster Tools). Monitor the indexing reports to verify that pages are being discovered and indexed correctly. |

3. Sitemap Strategy for Large Websites

If your website has tens of thousands or even hundreds of thousands of pages, you need to organize them using a Sitemap Index file:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2025-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-categories.xml</loc>
    <lastmod>2025-01-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2025-01-14</lastmod>
  </sitemap>
</sitemapindex>
```

Split the Sitemap by page type (products, categories, blog, static pages) to facilitate management and monitoring.

4. Sitemap for Different Platforms

  • WordPress + Rank Math: Sitemap is generated automatically. You can control which content types are included in the Sitemap within Rank Math settings. The path is usually /sitemap_index.xml.

  • Shopify: Sitemap is generated automatically at the path /sitemap.xml. It cannot be customized, but Shopify's default Sitemap is sufficient.

  • Custom Websites: Generate using tools like Screaming Frog or Sitebulb, or generate dynamically with code.
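As a sketch of the dynamic approach, a generator only needs to emit loc and lastmod for each canonical, indexable URL, since Google largely ignores changefreq and priority. The page list below is a placeholder; in practice it would come from your database or CMS:

```python
from datetime import date
from xml.sax.saxutils import escape

def build_sitemap(pages):
    """Build a minimal XML sitemap from (url, last_modified) pairs.

    Only loc and lastmod are emitted; changefreq and priority are
    omitted because Google largely ignores them.
    """
    items = []
    for url, lastmod in pages:
        items.append(
            "  <url>\n"
            f"    <loc>{escape(url)}</loc>\n"
            f"    <lastmod>{lastmod.isoformat()}</lastmod>\n"
            "  </url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(items)
        + "\n</urlset>"
    )

# Placeholder pages for illustration
xml = build_sitemap([
    ("https://example.com/page-1/", date(2025, 1, 15)),
])
```

The output can be written to sitemap.xml in the site root, or served dynamically at that path.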


Canonical Tags: The Tool for Resolving Duplicate Content

Duplicate content is one of the most common technical SEO issues. When the same content appears across multiple URLs, Google does not know which one to index.

The canonical tag tells Google which version among these duplicate pages is the "master" copy.

1. When Are Canonical Tags Needed?

Common duplicate content scenarios:

  • URL parameters: example.com/product and example.com/product?ref=email are the same page

  • HTTP/HTTPS: http://example.com and https://example.com

  • www/non-www: www.example.com and example.com

  • Trailing slashes: example.com/page and example.com/page/

  • Uppercase/lowercase: example.com/Page and example.com/page

  • Pagination: example.com/blog and example.com/blog?page=1

  • Sorting/filtering: example.com/products?sort=price and example.com/products?sort=name

  • Cross-domain content: Your article republished on other websites
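Many of these duplicates can be prevented at the application level by normalizing URLs before linking to or serving them. A rough sketch, assuming your URLs are genuinely case-insensitive and the tracking parameters listed are the ones your site uses (adjust both to your setup):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of tracking parameters; adjust to your analytics setup
TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}

def canonicalize(url):
    """Normalize a URL to one canonical form: https scheme, lowercase
    host and path, tracking parameters stripped, single trailing slash."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in TRACKING_PARAMS]
    path = parts.path.lower().rstrip("/") + "/"
    return urlunsplit(("https", parts.netloc.lower(), path,
                       urlencode(query), ""))
```

Lowercasing the path is only safe when no two distinct pages differ by case alone; drop that step if your URLs are case-sensitive.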

2. Using Canonical Tags

Add the following inside the page's <head> section:

```html
<link rel="canonical" href="https://example.com/preferred-url/" />
```

Every page should have a canonical tag, including self-referencing canonicals (pointing to itself).

3. Common Canonical Tag Errors

| Error | Consequence | Correct Approach |
|---|---|---|
| All pages' canonical pointing to the homepage | All pages except the homepage disappear from the index | Each page points to its own canonical URL (self-referencing canonical) |
| Canonical URL is a 404 page | Google ignores the canonical tag | Ensure the canonical URL returns a 200 (OK) status code |
| Canonical URL blocked by robots.txt | Google cannot verify the canonical relationship | Ensure the canonical URL can be crawled (not blocked by robots.txt) |
| Canonical chain (A → B → C) | Google may ignore the chain or follow it inconsistently | Point directly to the final (master) URL; avoid chains |
| Canonical and noindex used together | Conflicting signals confuse Google's indexing decision | Choose one strategy: either canonical to the master version or noindex, never both |
| HTTP canonical on an HTTPS page | Protocol mismatch creates confusion and may be ignored | Always use HTTPS canonical URLs when the site uses HTTPS |

Important Note: The canonical tag is a "suggestion," not a "directive." Google may ignore your canonical tag and choose a URL it considers more appropriate as the canonical. If you find Google choosing the wrong canonical, you need to check whether internal links, the Sitemap, and external links all point to the correct URL.

Website Architecture and URL Structure

Website architecture determines how Google understands your website. A good architecture allows Google to crawl all pages easily; a poor one leaves Google lost.

1. Flat Architecture

The ideal website architecture is flat – any page should be reachable from the homepage within three clicks.

```txt
Homepage
├── Category A
│   ├── Product A1
│   ├── Product A2
│   └── Product A3
├── Category B
│   ├── Product B1
│   └── Product B2
└── Blog
    ├── Article 1
    └── Article 2
```

Problems with an overly deep architecture:

  • Google crawlers may not reach deep pages

  • Deep pages receive less internal link equity (PageRank)

  • Users have difficulty finding deep content

2. URL Structure Best Practices

URLs are the foundation of technical SEO. Good URL structure:

| Principle | Good URL | Poor URL |
|---|---|---|
| Short | /ball-valves/ | /products/category/industrial/ball-valves/stainless-steel/ |
| Descriptive | /stainless-steel-ball-valve/ | /product-12345/ |
| Hyphen-separated | /ball-valve/ | /ball_valve/ or /ballvalve/ |
| Lowercase | /ball-valve/ | /Ball-Valve/ |
| No parameters | /ball-valves/ | /products?cat=5&sort=price |
| Contains keywords | /link-building-guide/ | /post-2025-01-15/ |

3. URL Changes and Redirects

Once a URL is established, try not to change it. Each time you change a URL, you need to set up a 301 redirect and may experience short-term ranking fluctuations.

If you must change a URL:

  • Set up a 301 redirect (permanent redirect) from the old URL to the new URL

  • Update all internal links to point to the new URL

  • Update the Sitemap

  • Monitor in Google Search Console

  • Keep the 301 redirect in place for at least one year

301 vs. 302 Redirects:

  • 301: Permanent redirect. Tells Google the old URL has been permanently moved to the new URL; link equity is transferred.

  • 302: Temporary redirect. Tells Google the old URL is only temporarily redirected; link equity is not transferred (or very little is transferred).

In most cases, you should use 301. Only use 302 when the page is genuinely temporary (such as for A/B testing or temporary maintenance).
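As a concrete illustration, here is how both redirect types look in an Nginx server configuration; the paths are placeholders, and on Apache the equivalent would be Redirect 301 rules in .htaccess:

```nginx
# Permanent move: old URL → new URL, link equity is passed
location = /old-page/ {
    return 301 https://example.com/new-page/;
}

# Temporary redirect, e.g. while a page is under maintenance
location = /sale/ {
    return 302 /holding-page/;
}
```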

Page Speed and Core Web Vitals

In 2021, Google officially incorporated Core Web Vitals into its ranking factors. Page speed is no longer just "nice to have"; it is "must-have."

1. The Three Core Metrics

| Metric | What It Measures | Good | Needs Improvement | Poor |
|---|---|---|---|---|
| LCP (Largest Contentful Paint) | Loading time of the largest content element (e.g., hero image, main heading) | ≤ 2.5 s | 2.5–4 s | > 4 s |
| INP (Interaction to Next Paint) | Delay from user interaction (click, tap, keypress) to visual page response | ≤ 200 ms | 200–500 ms | > 500 ms |
| CLS (Cumulative Layout Shift) | Visual stability; unexpected layout shifts during page load | ≤ 0.1 | 0.1–0.25 | > 0.25 |

Note: In March 2024, Google replaced FID (First Input Delay) with INP (Interaction to Next Paint). INP measures the responsiveness to all interactions throughout the page's lifecycle, making it more comprehensive than FID.
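The thresholds above are easy to encode as a small helper, which is handy when classifying CrUX or RUM measurements in bulk. A sketch (LCP and INP in milliseconds, CLS unitless, per the table):

```python
# Core Web Vitals thresholds: (good_max, needs_improvement_max) per metric.
# LCP and INP are in milliseconds; CLS is unitless.
THRESHOLDS = {
    "LCP": (2500, 4000),
    "INP": (200, 500),
    "CLS": (0.1, 0.25),
}

def rate(metric, value):
    """Classify a metric value as 'good', 'needs improvement', or 'poor'."""
    good_max, ni_max = THRESHOLDS[metric]
    if value <= good_max:
        return "good"
    if value <= ni_max:
        return "needs improvement"
    return "poor"
```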

2. Optimizing LCP

LCP is typically the largest image or text block on the page. Methods to optimize LCP:

  • Optimize server response time (TTFB): Use good hosting, enable caching, use a CDN

  • Optimize the largest content element: If the LCP element is an image, compress it, use WebP format, and set appropriate dimensions

  • Preload LCP resources: for example <link rel="preload" as="image" href="hero.webp"> in the <head> (the filename here is illustrative)

  • Reduce render-blocking resources: Inline critical CSS, defer non-critical JavaScript

  • Avoid client-side rendering: If LCP content requires JavaScript to display, consider server-side rendering

3. Optimizing INP

INP measures how quickly a page responds to user interactions. Clicking buttons, typing into fields, selecting dropdown menus – how fast does the page provide visual feedback after these interactions?

Optimizing INP:

  • Reduce main thread blocking: Split long tasks using requestIdleCallback or scheduler.yield()

  • Reduce JavaScript execution time: Remove unnecessary JavaScript, defer loading third-party scripts

  • Optimize event handlers: Avoid complex calculations in event handlers

  • Reduce DOM size: The more DOM nodes, the slower the interaction response
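The "split long tasks" advice can be sketched as follows. scheduler.yield() is not yet available in every browser, so a setTimeout fallback is common; processItems and the 50 ms budget are illustrative, not a standard API:

```javascript
// Yield control back to the main thread so pending input events can run.
// Uses scheduler.yield() where available, setTimeout(0) otherwise.
function yieldToMain() {
  if (globalThis.scheduler && typeof scheduler.yield === "function") {
    return scheduler.yield();
  }
  return new Promise((resolve) => setTimeout(resolve, 0));
}

// Process a large list in chunks, yielding roughly every 50 ms so the
// page stays responsive to clicks and keypresses (better INP).
async function processItems(items, handleItem) {
  let deadline = performance.now() + 50;
  for (const item of items) {
    handleItem(item);
    if (performance.now() >= deadline) {
      await yieldToMain();
      deadline = performance.now() + 50;
    }
  }
}
```

The same pattern applies inside event handlers: do the minimal visual update first, then yield before any heavy work.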

4. Optimizing CLS

CLS measures unexpected movement of elements during page load. You are reading a paragraph when suddenly an ad loads and pushes the text down – that is a layout shift.

Common CLS issues and solutions:

| Issue | Cause | Solution |
|---|---|---|
| Image loading causes shift | Image has no width/height attributes set | Set width and height attributes on img tags |
| Ad loading causes shift | No space reserved for the ad | Set fixed dimensions for ad containers |
| Font loading causes shift | Web font replaces system font with a different size | Use font-display: swap with matching fallback fonts |
| Dynamic content insertion | Content inserted after JavaScript loads | Reserve space or use the CSS contain property |
| Iframe loading causes shift | Iframe has no dimensions set | Set fixed width and height attributes for iframes |

Measurement Tools

Tools for measuring Core Web Vitals:

  • Google PageSpeed Insights: Most commonly used; displays both lab data and field data

  • Google Search Console: Core Web Vitals report showing the site-wide CWV status

  • Chrome DevTools: Performance panel for detailed analysis

  • Web Vitals Chrome Extension: Real-time display of CWV data for the current page

  • Lighthouse: Chrome's built-in auditing tool

Important Distinction: Lab Data vs. Field Data.

  • Lab Data: Measured in a simulated environment; results may vary each time; used for debugging

  • Field Data (CrUX): Data from real Chrome users; this is what Google uses for ranking

If your lab data is good but field data is poor, it means your real users have devices or network conditions worse than your simulated environment. Optimization must target low-end devices and slow networks.

Mobile Optimization and Mobile-First Indexing

Starting in 2023, Google fully transitioned to Mobile-First Indexing. This means Google primarily uses the mobile version of your website to determine rankings.

If your mobile version lacks content, loads slowly, or offers a poor experience, your rankings will suffer – even if the desktop version is perfect.

1. Requirements for Mobile-First Indexing

  • Content consistency: Content on mobile and desktop must be identical. Do not hide content on mobile

  • Structured data consistency: Schema markup on mobile and desktop must be identical

  • Meta tag consistency: Titles, descriptions, and robots tags must be identical on both versions

  • Image consistency: Mobile images must have alt text; format and quality should not be inferior to desktop images

2. Responsive Design vs. Separate Mobile Site

Google recommends responsive design – a single URL that automatically adjusts its layout based on screen size.

Separate mobile sites (m.example.com) are not recommended, because:

  • Two sets of content need maintenance

  • Correct canonical and alternate tags need configuration

  • Content inconsistencies are common

  • Technical complexity increases

If you are still using a separate mobile site, it is strongly recommended to migrate to a responsive design.

Mobile Technical Checklist

  • Viewport meta tag: declare width=device-width with initial-scale=1 so the page renders at the device's width

  • Font size: At least 16px (to avoid requiring users to zoom)

  • Touch targets: At least 48x48 pixels, with at least 8px spacing

  • No Flash: (largely obsolete, but some older sites still use it)

  • No horizontal scrolling layout

  • Form inputs: Use appropriate input types (email, tel, number, etc.) to trigger the correct keyboard
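A few of the checklist items expressed as markup (the attribute values are common defaults, not the only valid choices):

```html
<!-- Viewport: render at device width with no forced zoom -->
<meta name="viewport" content="width=device-width, initial-scale=1">

<!-- Matching input types bring up the right mobile keyboard -->
<input type="email" name="email" autocomplete="email">
<input type="tel" name="phone" autocomplete="tel">
<input type="number" name="quantity" inputmode="numeric">
```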


For more, continue with Part 2 of this Google technical SEO guide.
