Technical SEO Guide: From Crawling to Indexing – Making Search Engines Understand Your Website (Part 1)
Master technical SEO with this complete guide covering crawlability, indexability, Core Web Vitals, structured data, XML sitemaps, and mobile-first indexing. Learn how to build a search-engine-friendly website infrastructure.
Simply put, technical SEO is the practice of enabling search engines to find your pages, understand your content, and correctly index your website.
Many people think technical SEO is mysterious. It involves code, configurations, and a host of confusing terms.
In reality, it is not. The core logic of technical SEO is simple: Google is a robot. It needs to crawl your website, understand your content, and store your pages in its database. Your job is to make this process as smooth as possible.
If Google cannot crawl your pages, even the best content is useless. If Google cannot understand your page structure, it will not know where to rank you. If your website is slow or offers a poor mobile experience, Google will directly lower your rankings.
According to research by SEMrush, over 80% of websites have technical SEO issues. Many of these are fundamental – incorrect robots.txt configurations, missing canonical tags, and excessively slow page speeds.
This article will start from the most basic crawling mechanisms and go all the way to the cutting edge of AI search optimization. Whether you are a beginner or an experienced SEO professional, you will find useful information here.
Let us begin.
Before diving into specific operations, it is important to understand how Google works.
Google has a crawler program called Googlebot. Its job is to constantly visit web pages and fetch their content.
How does Googlebot discover new pages?
- Through links on known pages
- Through XML Sitemaps
- Through manual submission in Google Search Console
- Through links from third-party websites pointing to your site
Once Googlebot discovers a URL, it adds it to the crawl queue. However, not all URLs are crawled immediately. Google decides the crawl priority based on the page's importance, update frequency, and the website's crawl budget.
After crawling the HTML, Google needs to render the page – execute JavaScript, load CSS, and generate the final Document Object Model (DOM).
This step is crucial. If your website heavily relies on JavaScript to generate content (for example, React, Vue, or Angular single-page applications), Google needs additional time and resources to render. According to official Google documentation, rendering can be delayed from a few seconds to several days.
After rendering, Google analyzes the page content, extracts key information (titles, body text, links, structured data, etc.), and then decides whether to include the page in its index.
The index is Google's database. Only pages that are indexed can appear in search results.
The entire process:
```
URL Discovery → Added to Crawl Queue → HTML Crawled → Page Rendered → Content Analyzed → Added to Index → Included in Rankings
```

The goal of technical SEO is to ensure every step in this process runs smoothly.
Robots.txt is a text file placed in your website's root directory (`example.com/robots.txt`). It tells search engine crawlers which pages can be crawled and which cannot.
```
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Explanation:
- User-agent: * — applies to all crawlers
- Disallow: /admin/ — blocks crawling of all pages under the /admin/ directory
- Allow: / — allows crawling of all other pages
- Sitemap: — tells crawlers the location of the Sitemap
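Before deploying a robots.txt change, it is worth checking programmatically which URLs it actually blocks. Here is a minimal sketch using Python's standard-library `urllib.robotparser`, applied to the example rules above (`example.com` is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the example above (example.com is a placeholder).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Public pages are crawlable; the blocked directories are not.
print(parser.can_fetch("*", "https://example.com/products/ball-valve/"))  # True
print(parser.can_fetch("*", "https://example.com/admin/settings/"))       # False
```

Running checks like this against your most important URLs before launch catches the "entire site blocked" mistake from the table below before Google does.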
| Error | Consequence | Correct Approach |
|---|---|---|
| Disallow: / (blocking all crawling) | Entire website disappears from search results | Only block directories that do not need indexing |
| Blocking CSS/JS files | Google cannot render the page, affecting rankings | Allow crawling of CSS and JS |
| Blocking image directories | Images do not appear in Google Images | Allow crawling of images |
| Forgetting to modify development environment | Entire site blocked after going live | Check robots.txt before launch |
| Using robots.txt to prevent indexing | Page may still be indexed (just not crawled) | Use noindex tag to prevent indexing |
The last point is particularly important: robots.txt can only prevent crawling, not indexing. If another website links to one of your pages, Google might index that URL without crawling it (showing only the URL without a content summary). To truly prevent indexing, you must use the noindex tag.
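To verify that a page actually carries the noindex directive, you can scan its HTML for the robots meta tag. A minimal sketch using Python's standard-library `HTMLParser` (the sample HTML string is hypothetical):

```python
from html.parser import HTMLParser

class NoindexDetector(HTMLParser):
    """Scans HTML for <meta name="robots" content="...noindex...">."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots" and \
                "noindex" in attrs.get("content", "").lower():
            self.noindex = True

html = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'
detector = NoindexDetector()
detector.feed(html)
print(detector.noindex)  # True
```

Remember: for this tag to take effect, the page must be crawlable, which is exactly why robots.txt blocking and noindex should not be combined.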
- WordPress: robots.txt is generated automatically by default and can be customized using plugins like Rank Math or Yoast SEO.
- Shopify: robots.txt is generated automatically and cannot be edited directly. However, starting in 2021, limited customization is possible through the robots.txt.liquid template.
- Custom Websites: Create the robots.txt file manually and place it in the website's root directory.
Between 2024 and 2025, AI crawlers have become a new issue. OpenAI's GPTBot, Anthropic's ClaudeBot, and Google's Google-Extended – these AI crawlers scrape your content to train their models.
If you do not want AI crawlers to scrape your content, you can add content to robots.txt as follows:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

However, note that blocking AI crawlers may affect your visibility in AI search results. This is a trade-off.
An XML Sitemap is a file that lists all the important pages on your website. It helps Google discover and understand your website's structure.
A Sitemap is not a ranking factor. Having a Sitemap will not make you rank higher. However, it helps ensure Google is aware of all your important pages.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1/</loc>
    <lastmod>2025-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

Where:
- loc: Page URL (required)
- lastmod: Last modification date (recommended; Google references this)
- changefreq: Update frequency (Google largely ignores this field)
- priority: Priority (Google also largely ignores this)
In practice, you only need to focus on loc and lastmod.
| Rule | Explanation |
|---|---|
| Include only pages that need indexing | Do not put `noindex` pages, redirected pages (301/302), or 404 error pages in the Sitemap. These waste crawl budget and send mixed signals to search engines. |
| URLs in the Sitemap must be canonical URLs | If a page has a canonical tag pointing to another URL, the Sitemap should include the canonical URL—not the duplicate or variant URL. |
| Maximum 50,000 URLs per Sitemap | Each Sitemap file cannot exceed 50,000 URLs. If your site exceeds this limit, split URLs across multiple Sitemap files and use a Sitemap Index file (`sitemap_index.xml`) to aggregate them. |
| Sitemap file size not exceeding 50MB | Uncompressed file size must stay under 50MB. For large Sitemaps, submit compressed files (`.xml.gz`) to reduce bandwidth and improve processing speed. |
| `lastmod` should be accurate | Update the `lastmod` tag only when the page content actually changes. Do not automatically update all pages daily—this creates unnecessary crawl demand and reduces trust signals with search engines. |
| Declare Sitemap location in `robots.txt` | Add a `Sitemap` directive to your `robots.txt` file to help search engines discover your Sitemap location. Format: `Sitemap: https://example.com/sitemap.xml` |
| Submit in Google Search Console | After publishing your Sitemap, submit it via Google Search Console (or Bing Webmaster Tools). Monitor the Index status report to verify that pages are being discovered and indexed correctly. |
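Generating a Sitemap that follows these rules is straightforward to automate. A minimal sketch using Python's standard-library `xml.etree.ElementTree` (the page list is hypothetical; a real generator would pull canonical URLs and modification dates from your CMS or database):

```python
import xml.etree.ElementTree as ET

def build_sitemap(pages):
    """pages: list of (url, lastmod) tuples. Returns sitemap XML as a string."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url, lastmod in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url       # required
        ET.SubElement(entry, "lastmod").text = lastmod  # only when content changed
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap([("https://example.com/page-1/", "2025-01-15")])
print(xml)
```

Note that only `loc` and `lastmod` are emitted, matching the earlier point that Google largely ignores `changefreq` and `priority`. Prepend the `<?xml version="1.0" encoding="UTF-8"?>` declaration when writing the file to disk.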
If your website has tens of thousands or even hundreds of thousands of pages, you need to organize them using a Sitemap Index file:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2025-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-categories.xml</loc>
    <lastmod>2025-01-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2025-01-14</lastmod>
  </sitemap>
</sitemapindex>
```

Split the Sitemap by page type (products, categories, blog, static pages) to facilitate management and monitoring.
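Respecting the 50,000-URL-per-file limit is just a chunking problem. A minimal sketch (the 120,000 product URLs are a hypothetical example):

```python
def split_into_sitemaps(urls, max_per_file=50_000):
    """Chunk a URL list into groups respecting the 50,000-URL-per-sitemap limit."""
    return [urls[i:i + max_per_file] for i in range(0, len(urls), max_per_file)]

# Hypothetical example: 120,000 product URLs become three sitemap files.
urls = [f"https://example.com/product-{i}/" for i in range(120_000)]
chunks = split_into_sitemaps(urls)
print(len(chunks))               # 3
print([len(c) for c in chunks])  # [50000, 50000, 20000]
```

Each chunk would then be written out as its own Sitemap file and listed in the Sitemap Index shown above.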
- WordPress + Rank Math: Sitemap is generated automatically. You can control which content types are included in the Sitemap within Rank Math settings. The path is usually /sitemap_index.xml.
- Shopify: Sitemap is generated automatically at the path /sitemap.xml. It cannot be customized, but Shopify's default Sitemap is sufficient.
- Custom Websites: Generate using tools like Screaming Frog or Sitebulb, or generate dynamically with code.
Duplicate content is one of the most common technical SEO issues. When the same content appears across multiple URLs, Google does not know which one to index.
The canonical tag tells Google which version among these duplicate pages is the "master" copy.
Common duplicate content scenarios:
- URL parameters: example.com/product and example.com/product?ref=email are the same page
- HTTP/HTTPS: http://example.com and https://example.com
- www/non-www: www.example.com and example.com
- Trailing slashes: example.com/page and example.com/page/
- Uppercase/lowercase: example.com/Page and example.com/page
- Pagination: example.com/blog and example.com/blog?page=1
- Sorting/filtering: example.com/products?sort=price and example.com/products?sort=name
- Cross-domain content: Your article republished on other websites
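Several of these variants can be collapsed mechanically before a canonical URL is ever emitted. A minimal normalization sketch using Python's standard-library `urllib.parse` (the tracking-parameter list is an assumption; adjust it to the parameters your site actually uses):

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of tracking parameters that never change page content.
TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}

def normalize(url):
    """Collapse common duplicate-URL variants onto one canonical form."""
    parts = urlsplit(url)
    scheme = "https"                                    # force HTTPS
    host = parts.netloc.lower().removeprefix("www.")    # lowercase, non-www host
    path = parts.path.lower().rstrip("/") + "/"         # lowercase, one trailing slash
    query = "&".join(
        p for p in parts.query.split("&")
        if p and p.split("=")[0] not in TRACKING_PARAMS
    )
    return urlunsplit((scheme, host, path, query, ""))

print(normalize("http://www.example.com/Product?ref=email"))
# → https://example.com/product/
```

Whether you prefer www or non-www, trailing slash or not, matters less than picking one convention and applying it everywhere: internal links, Sitemap, and canonical tags.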
Add the following in the page's `<head>` section:
```html
<link rel="canonical" href="https://example.com/preferred-url/" />
```

Every page should have a canonical tag, including self-referencing canonicals (pointing to itself).
| Error | Consequence | Correct Approach |
|---|---|---|
| All pages canonical pointing to homepage | All pages except the homepage disappear from the index | Each page points to its own canonical URL (self-referential canonical) |
| Canonical URL is a 404 page | Google ignores the canonical tag | Ensure the canonical URL returns a 200 (OK) status code |
| Canonical URL blocked by robots.txt | Google cannot verify the canonical relationship | Ensure the canonical URL can be crawled (not blocked by robots.txt) |
| Canonical chain (A → B → C) | Google may ignore the chain or follow inconsistently | Point directly to the final (master) URL—avoid chains |
| Canonical and noindex used together | Conflicting signals confuse Google's indexing decision | Choose one strategy: either canonical to the master version OR noindex—never both |
| HTTP canonical on an HTTPS page | Protocol mismatch creates confusion and may be ignored | Always use HTTPS for canonical URLs when the site uses HTTPS |
Important Note: The canonical tag is a "suggestion," not a "directive." Google may ignore your canonical tag and choose a URL it considers more appropriate as the canonical. If you find Google choosing the wrong canonical, you need to check whether internal links, the Sitemap, and external links all point to the correct URL.
Website architecture determines how Google understands your website. A good architecture allows Google to crawl all pages easily; a poor one leaves Google lost.
The ideal website architecture is flat – any page should be reachable from the homepage within three clicks.
```
Homepage
├── Category A
│   ├── Product A1
│   ├── Product A2
│   └── Product A3
├── Category B
│   ├── Product B1
│   └── Product B2
└── Blog
    ├── Article 1
    └── Article 2
```

Problems with an overly deep architecture:
- Google crawlers may not reach deep pages
- Deep pages receive less internal link equity (PageRank)
- Users have difficulty finding deep content
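Click depth is easy to audit: model your internal links as a graph and run a breadth-first search from the homepage. A minimal sketch (the link graph below is a hypothetical mirror of the tree above; a real audit would build it from a crawl):

```python
from collections import deque

def click_depths(links, start="home"):
    """BFS over internal links; returns minimum clicks from the homepage to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical internal-link graph.
site = {
    "home": ["category-a", "category-b", "blog"],
    "category-a": ["product-a1", "product-a2"],
    "blog": ["article-1"],
}
print(click_depths(site))
```

Any page with a depth greater than 3, or missing from the result entirely (an orphan page), is a candidate for more internal links.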
URLs are the foundation of technical SEO. Good URL structure:
| Principle | Good URL | Poor URL |
|---|---|---|
| Short | /ball-valves/ | /products/category/industrial/ball-valves/stainless-steel/ |
| Descriptive | /stainless-steel-ball-valve/ | /product-12345/ |
| Hyphen-separated | /ball-valve/ | /ball_valve/ or /ballvalve/ |
| Lowercase | /ball-valve/ | /Ball-Valve/ |
| No parameters | /ball-valves/ | /products?cat=5&sort=price |
| Contains keywords | /link-building-guide/ | /post-2025-01-15/ |
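Generating slugs that satisfy these principles (lowercase, hyphen-separated, no stray characters) is a one-function job. A minimal sketch in Python:

```python
import re

def slugify(title):
    """Turn a page title into a short, lowercase, hyphen-separated URL slug."""
    slug = title.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)  # non-alphanumeric runs become one hyphen
    return slug.strip("-")

print(slugify("Stainless Steel Ball Valve"))  # stainless-steel-ball-valve
```

Most CMSs apply similar logic automatically, but it is worth verifying, since a bad slug is painful to change later.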
Once a URL is established, try not to change it. Each time you change a URL, you need to set up a 301 redirect and may experience short-term ranking fluctuations.
If you must change a URL:
- Set up a 301 redirect (permanent redirect) from the old URL to the new URL
- Update all internal links to point to the new URL
- Update the Sitemap
- Monitor in Google Search Console
- Keep the 301 redirect in place for at least one year
301 vs. 302 Redirects:
- 301: Permanent redirect. Tells Google the old URL has been permanently moved to the new URL; link equity is transferred.
- 302: Temporary redirect. Tells Google the old URL is only temporarily redirected; link equity is not transferred (or very little is transferred).
In most cases, you should use 301. Only use 302 when the page is genuinely temporary (such as for A/B testing or temporary maintenance).
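After several rounds of URL changes, redirect chains (A → B → C) creep in, and those should be flattened so every old URL points directly at its final destination. A minimal sketch that resolves a redirect map (the map is a hypothetical stand-in for your server configuration):

```python
def resolve_redirects(redirects, url, max_hops=10):
    """Follow a 301 redirect map to the final URL, flagging chains and loops.

    redirects: dict mapping old URL -> new URL (stand-in for server config).
    Returns (final_url, hops). hops > 1 means a chain worth flattening.
    """
    seen = set()
    hops = 0
    while url in redirects:
        if url in seen or hops >= max_hops:
            raise ValueError(f"redirect loop or excessive chain at {url}")
        seen.add(url)
        url = redirects[url]
        hops += 1
    return url, hops

# Hypothetical map: /old-page moved to /new-page, which later moved again.
redirects = {"/old-page": "/new-page", "/new-page": "/final-page"}
print(resolve_redirects(redirects, "/old-page"))  # ('/final-page', 2)
```

Here the fix would be updating the first rule to send /old-page straight to /final-page, saving crawlers and users one hop.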
In 2021, Google officially incorporated Core Web Vitals into its ranking factors. Page speed is no longer just "nice to have"; it is "must-have."
| Metric | What It Measures | Good | Needs Improvement | Poor |
|---|---|---|---|---|
| LCP (Largest Contentful Paint) | Loading time of the largest content element (e.g., hero image, main heading) | ≤ 2.5 seconds | 2.5–4 seconds | > 4 seconds |
| INP (Interaction to Next Paint) | Delay from user interaction (click, tap, keypress) to visual page response | ≤ 200 ms | 200–500 ms | > 500 ms |
| CLS (Cumulative Layout Shift) | Visual stability—unexpected layout shifts during page load | ≤ 0.1 | 0.1–0.25 | > 0.25 |
Note: In March 2024, Google replaced FID (First Input Delay) with INP (Interaction to Next Paint). INP measures the responsiveness to all interactions throughout the page's lifecycle, making it more comprehensive than FID.
LCP is typically the largest image or text block on the page. Methods to optimize LCP:
- Optimize server response time (TTFB): Use good hosting, enable caching, use a CDN
- Optimize the largest content element: If the LCP element is an image, compress it, use WebP format, and set appropriate dimensions
- Preload LCP resources: for example, `<link rel="preload" as="image" href="hero.webp">` in the `<head>`
- Reduce render-blocking resources: Inline critical CSS, defer non-critical JavaScript
- Avoid client-side rendering: If LCP content requires JavaScript to display, consider server-side rendering
INP measures how quickly a page responds to user interactions. Clicking buttons, typing into fields, selecting dropdown menus – how fast does the page provide visual feedback after these interactions?
Optimizing INP:
- Reduce main thread blocking: Split long tasks using requestIdleCallback or scheduler.yield()
- Reduce JavaScript execution time: Remove unnecessary JavaScript, defer loading third-party scripts
- Optimize event handlers: Avoid complex calculations in event handlers
- Reduce DOM size: The more DOM nodes, the slower the interaction response
CLS measures unexpected movement of elements during page load. You are reading a paragraph when suddenly an ad loads and pushes the text down – that is a layout shift.
| Issue | Cause | Solution |
|---|---|---|
| Image loading causes shift | Image has no width/height attributes set | Set `width` and `height` attributes on `img` tags |
| Ad loading causes shift | Ad space has no reserved area | Set fixed dimensions for ad containers |
| Font loading causes shift | Web font replaces system font with different size | Use `font-display: swap` with matching fallback fonts |
| Dynamic content insertion | Content inserted after JavaScript loads | Reserve space or use CSS `contain` property |
| Iframe loading | Iframe has no dimensions set | Set fixed `width` and `height` attributes for iframes |
Tools for measuring Core Web Vitals:
- Google PageSpeed Insights: Most commonly used; displays both lab data and field data
- Google Search Console: Core Web Vitals report showing the site-wide CWV status
- Chrome DevTools: Performance panel for detailed analysis
- Web Vitals Chrome Extension: Real-time display of CWV data for the current page
- Lighthouse: Chrome's built-in auditing tool
Important Distinction: Lab Data vs. Field Data.
- Lab Data: Measured in a simulated environment; results may vary each time; used for debugging
- Field Data (CrUX): Data from real Chrome users; this is what Google uses for ranking
If your lab data is good but field data is poor, it means your real users have devices or network conditions worse than your simulated environment. Optimization must target low-end devices and slow networks.
Starting in 2023, Google fully transitioned to Mobile-First Indexing. This means Google primarily uses the mobile version of your website to determine rankings.
If your mobile version lacks content, loads slowly, or offers a poor experience, your rankings will suffer – even if the desktop version is perfect.
- Content consistency: Content on mobile and desktop must be identical. Do not hide content on mobile
- Structured data consistency: Schema markup on mobile and desktop must be identical
- Meta tag consistency: Titles, descriptions, and robots tags must be identical on both versions
- Image consistency: Mobile images must have alt text; format and quality should not be inferior to desktop images
Google recommends responsive design – a single URL that automatically adjusts its layout based on screen size.
Separate mobile sites (m.example.com) are not recommended, because:
- Two sets of content need maintenance
- Correct canonical and alternate tags need configuration
- Content inconsistencies are common
- Technical complexity increases
If you are still using a separate mobile site, it is strongly recommended to migrate to a responsive design.
- Viewport meta tag: `<meta name="viewport" content="width=device-width, initial-scale=1">`
- Font size: At least 16px (to avoid requiring users to zoom)
- Touch targets: At least 48x48 pixels, with at least 8px spacing
- No Flash (largely obsolete, but some older sites still use it)
- No horizontal scrolling layout
- Form inputs: Use appropriate input types (email, tel, number, etc.) to trigger the correct keyboard
For more, continue with Part 2 of this Technical SEO guide.