Technical SEO Guide: From Crawling to Indexing – Making Search Engines Understand Your Website (Part 2)
Master technical SEO: crawlability, indexing, Core Web Vitals, structured data, XML sitemaps, mobile-first. Build a search-friendly site infrastructure.
New here? Start with Part 1 of this Google SEO guide, which covers content fundamentals.
HTTPS is a confirmed ranking factor. Google has used HTTPS as a ranking signal since 2014.
By 2025, if your website does not have HTTPS, you will be behind. Chrome displays a "Not Secure" warning in the address bar, and user trust drops to zero.
HTTPS requires an SSL/TLS certificate. Ways to obtain a certificate:
- Let's Encrypt: Free, automatic renewal; supported with one-click installation by most hosts
- Cloudflare: Free SSL, also provides CDN and DDoS protection
- Paid certificates: DigiCert, Comodo, etc.; suitable for enterprises needing EV certificates
For most websites, Let's Encrypt's free certificate is sufficient.
If your website is still using HTTP, pay attention to the following when migrating to HTTPS:
A. Install the SSL certificate
B. Set up 301 redirects from HTTP to HTTPS
C. Update all internal links to HTTPS
D. Update canonical tags to HTTPS
E. Update the Sitemap with HTTPS URLs
F. Check for mixed content (HTTP resources loading on HTTPS pages)
G. Update Google Search Console (add the HTTPS property)
H. Update the default URL in Google Analytics
I. Update the Sitemap URL in robots.txt
J. Notify external linking sites to update links (if possible)
A. Mixed Content: Loading HTTP resources (images, CSS, JS) on an HTTPS page. Browsers show warnings or block loading.
To check mixed content:
- Chrome DevTools → Console, look for Mixed Content warnings
- Use the "Why No Padlock" tool
- Crawl the site with Screaming Frog and filter for HTTP resources
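The mixed-content check can also be scripted. Below is a minimal sketch using only the Python standard library; the regex is deliberately rough (it also flags plain links and misses URLs assembled in JavaScript), and the sample markup is illustrative:

```python
import re

def find_mixed_content(html: str) -> list[str]:
    """Return http:// URLs referenced in src/href/srcset attributes.

    Rough heuristic only: flags navigation links too, and cannot see
    URLs that are built dynamically in JavaScript.
    """
    pattern = re.compile(
        r'(?:src|href|srcset)\s*=\s*["\'](http://[^"\']+)["\']',
        re.IGNORECASE,
    )
    return pattern.findall(html)

page = (
    '<img src="http://example.com/a.jpg">'
    '<link rel="stylesheet" href="https://example.com/s.css">'
)
print(find_mixed_content(page))  # -> ['http://example.com/a.jpg']
```

Run it over every crawled page; any non-empty result is a page that will trigger browser warnings.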
B. Expired Certificate: SSL certificates have a validity period. After expiration, browsers show security warnings, and users cannot access the site. Set up automatic renewal (Let's Encrypt defaults to 90 days and supports automatic renewal).
C. Redirect Chain: HTTP → HTTPS → www → non-www, with multiple redirects, slows down the process. The redirect should ideally be direct: HTTP non-www → HTTPS non-www (or your chosen final version).
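A quick way to audit chains is to walk them hop by hop. A minimal sketch: the `fetch` function is injected with a simulated server here so the logic runs offline, but in production it could issue a HEAD request and return the status code plus the Location header:

```python
def redirect_chain(url: str, fetch, max_hops: int = 10) -> list[str]:
    """Return the list of URLs visited, starting with `url`.

    `fetch(url)` must return (status_code, location_or_None).
    """
    chain = [url]
    for _ in range(max_hops):  # cap hops to avoid redirect loops
        status, location = fetch(url)
        if status in (301, 302, 307, 308) and location:
            url = location
            chain.append(url)
        else:
            break
    return chain

# Simulated server: a 2-hop chain that should ideally be a single 301.
hops = {
    "http://example.com/": (301, "https://example.com/"),
    "https://example.com/": (301, "https://www.example.com/"),
    "https://www.example.com/": (200, None),
}
print(redirect_chain("http://example.com/", lambda u: hops[u]))
```

Any chain longer than two entries (origin plus final destination) is a candidate for flattening into a single direct 301.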
Structured data uses a standardized format to tell Google what your page content is about. It does not directly affect rankings but can make your search results richer – displaying star ratings, prices, FAQs, breadcrumb navigation, and more.
According to official Google documentation, structured data should use the Schema.org vocabulary, with JSON-LD as the recommended format.
JSON-LD is the structured data format recommended by Google. It is a block of JSON code placed within a <script> tag:
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "2-Inch Stainless Steel Ball Valve",
  "description": "Full port ball valve, SS316, 1000 WOG",
  "image": "https://example.com/images/ball-valve.jpg",
  "brand": {
    "@type": "Brand",
    "name": "YourBrand"
  },
  "offers": {
    "@type": "Offer",
    "price": "45.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.8",
    "reviewCount": "127"
  }
}
</script>
```

| Schema Type | Use Case | Search Result Effect |
|---|---|---|
| Product | Product pages | Displays price, stock status, ratings |
| Article | Blog posts | Displays publication date, author |
| FAQ | Frequently Asked Question pages | Displays expandable Q&A sections |
| HowTo | Tutorial pages | Displays step-by-step instructions |
| BreadcrumbList | All pages | Displays breadcrumb navigation |
| LocalBusiness | Local businesses | Displays address, phone, hours |
| Organization | About Us pages | Displays company info, logo |
| Review | Review pages | Displays star ratings |
| VideoObject | Video pages | Displays video thumbnails |
| Event | Event pages | Displays date, location |
After writing structured data, it must be validated:
- Google Rich Results Test: Checks if it meets Google's requirements for rich results
- Schema.org Validator: Checks the syntax of the Schema markup
Common Errors:
- Missing required fields (e.g., Product Schema missing price)
- Data inconsistent with page content (the price in the Schema differs from the price shown on the page)
- Using Schema types not supported by Google
- JSON syntax errors (missing commas, mismatched quotes)
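The last two errors, broken JSON and missing fields, are easy to catch before deploying. A hedged sketch: the required-field set below is an illustrative subset, and Google's actual required/recommended properties per type are broader, so treat this as a pre-flight check, not a replacement for the Rich Results Test:

```python
import json
import re

# Illustrative subset; Google's requirements for Product are broader.
REQUIRED_PRODUCT_FIELDS = {"name", "image", "offers"}

def check_json_ld(html: str) -> list[str]:
    """Return human-readable problems found in JSON-LD blocks."""
    errors = []
    blocks = re.findall(
        r'<script type="application/ld\+json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    for raw in blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"JSON syntax error: {exc}")
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            missing = REQUIRED_PRODUCT_FIELDS - data.keys()
            if missing:
                errors.append(f"Product missing fields: {sorted(missing)}")
    return errors
```

Wire this into a crawl of your product pages and fail the build when the list is non-empty.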
Google explicitly warns: structured data must reflect the real content of the page. Do not add Review Schema if your page has no reviews. Do not fabricate AggregateRating if your product has no ratings.
Violating this rule can result in a manual action from Google, and your rich results will be removed.
Crawl budget is the crawling resources Google allocates to your website. Google does not crawl your site indefinitely – it needs to crawl the entire internet within its limited resources.
According to the official Google blog, crawl budget is determined by two factors:
- Crawl Rate Limit: Google does not want to overwhelm your server by crawling too quickly. If your server responds slowly, Google automatically reduces the crawl speed.
- Crawl Demand: Google's level of interest in your website's content. Popular pages and frequently updated pages are crawled more often.
Small websites (with a few hundred pages) typically do not need to worry about crawl budget. Google has sufficient resources to crawl all your pages.
Websites that need to pay attention to crawl budget:
- Large e-commerce sites (tens of thousands to millions of product pages)
- News sites (publishing large amounts of content daily)
- UGC sites (user-generated content, uncontrollable page count)
- Sites with many parameterized URLs
| Issue | Explanation | Solution |
|---|---|---|
| Parameterized URLs | Filtering, sorting, and pagination generate many duplicate URLs that waste crawl budget | Block via [robots.txt] + canonical tags |
| Duplicate content | Same content accessible across multiple URLs, causing redundant crawling | Canonical tags + 301 redirects |
| Soft 404s | Page returns 200 status code, but content says "not found," wasting crawl budget | Return a true 404 status code |
| Redirect chains | Multiple redirects (A → B → C → D) before reaching final destination | Redirect directly from A to D |
| Low-quality pages | Empty pages, thin content with little to no value | [noindex] or remove entirely |
| Infinite crawl traps | Calendars, search results, or filters that generate infinite URL combinations | Block via [robots.txt] or use [nofollow] |
| Slow server | Long response times cause Google to reduce crawl speed | Optimize server performance |
A. Clean up useless URLs: Use robots.txt to block URL patterns that do not need crawling
B. Fix redirect chains: All redirects should go directly to the final destination
C. Fix soft 404s: Return true 404s for empty pages
D. Optimize Sitemap: Only include pages that need indexing
E. Improve server speed: Faster responses = more Google crawling
F. Optimize internal links: Ensure important pages have sufficient internal links
G. Use lastmod: Accurately mark the last modification date in the Sitemap
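Point G can be automated. A minimal sketch of generating sitemap entries with accurate lastmod values using only the Python standard library; the URLs and dates are illustrative:

```python
import xml.etree.ElementTree as ET

def build_sitemap(pages: list[tuple[str, str]]) -> str:
    """pages: (absolute URL, ISO-8601 last-modified date) pairs."""
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod should come from real modification data (CMS timestamps,
        # git history), never a blanket "today" -- inaccurate values teach
        # Google to ignore the field.
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap([
    ("https://example.com/", "2025-01-15"),
    ("https://example.com/products/", "2025-01-10"),
])
print(xml)
```

The key design point is feeding lastmod from a trustworthy source; if you cannot, omit the element rather than fake it.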
In Google Search Console, under "Settings → Crawl Stats," you can see:
- Daily crawl requests
- Average response time
- Crawled file types
- Crawl status code distribution
If you find Google heavily crawling useless pages (parameterized URLs, old redirects), your crawl budget is being wasted.
Google Search Console tells you what Google has indexed. Log analysis tells you what Google has actually crawled.
These two can be very different.
Your server records every access request, including visits from Googlebot. By analyzing these logs, you can see:
- Which pages Googlebot crawled
- How often each page is crawled
- Which pages were never crawled
- What errors Googlebot encountered
- Response times for crawl requests
| Tool | Price | Key Features |
|---|---|---|
| Screaming Frog Log Analyzer | $149/year | Professional SEO log analysis tool; integrates with Screaming Frog SEO Spider |
| Botify | Enterprise pricing | Crawler analysis platform for large websites; advanced visualization and recommendations |
| JetOctopus | From $60/month | Cloud-based log analysis; supports large data volumes; no installation required |
| GoAccess | Free, open-source | General-purpose log analysis tool; real-time analytics; requires manual bot filtering |
| ELK Stack (Elasticsearch + Logstash + Kibana) | Free, open-source | Full-stack log management; highly flexible; requires significant technical expertise |
Through log analysis, you may discover:
- Important pages not being crawled: Indicates insufficient internal links or overly deep page hierarchy
- Junk URLs being heavily crawled: Indicates crawl budget is being wasted
- Decrease in crawl frequency: Could be due to a slower server or decreased Google interest in your site
- High number of 5xx errors: Unstable server, affecting crawling
- Ratio of Googlebot to other bots: If spam bots account for most traffic, they need to be blocked
Not all pages should be indexed. Indexing pages that should not be indexed dilutes the overall quality of your website.
Add the following in the page's <head> section:
```html
<meta name="robots" content="noindex">
```

Or via HTTP response header:

```
X-Robots-Tag: noindex
```

Pages That Should Be Noindexed:
- Internal search results pages
- Tag pages (if content duplicates category pages)
- Filtering/sorting results pages
- Thank you pages, confirmation pages
- Privacy policies, terms of service (unless you want them in search results)
- Login/registration pages
- Shopping cart, checkout pages
- Test pages, draft pages
In Google Search Console's "Pages" report, you can see:
- Number of indexed pages
- Number of non-indexed pages and reasons
- Pages excluded by noindex
- Pages excluded by canonical
- Pages with crawl anomalies
Check this report regularly. If you find important pages not being indexed, investigate the cause.
| Issue | Cause | Solution |
|---|---|---|
| "Discovered - currently not indexed" | Google discovered the URL but has not yet crawled it | Increase internal links, submit Sitemap, improve page quality |
| "Crawled - currently not indexed" | Google crawled it but deemed it not worth indexing | Improve content quality, increase external links, enhance user experience |
| "Excluded by noindex tag" | Page has [noindex] meta tag | If mistakenly added, remove [noindex] |
| "Blocked by robots.txt" | [robots.txt] disallows crawling | Modify [robots.txt] to allow crawling |
| "Duplicate, Google chose a different canonical than the user" | Google does not recognize your canonical choice | Check consistency of internal links, external links, and Sitemap |
| "Server error (5xx)" | Server returns 500 error response | Fix server issues (500, 502, 503, 504) |
| "Redirect error" | Redirect loop or chain too long | Fix redirect configuration |
Google provides an Indexing API that allows you to proactively notify Google about page updates. However, this API currently only supports pages of type JobPosting and BroadcastEvent.
For other types of pages, you can use Google Search Console's URL inspection tool to manually request indexing. However, there are daily quotas, making it unsuitable for large volumes of pages.
A more practical approach is to keep your Sitemap updated, ensure a sound internal link structure, and let Google naturally discover and crawl your new pages.
If your website targets multiple countries or languages, hreflang tags are essential.
Hreflang tells Google: this page has different versions for different languages/regions; please display the correct version based on the user's language and location.
What happens without hreflang?
- Google might show your English page to Chinese users
- Different language versions compete against each other in rankings (keyword cannibalization)
- Poor user experience, high bounce rates
Choose one of three methods:

A. HTML <link> tags in the <head>:

```html
<link rel="alternate" hreflang="en" href="https://example.com/product/" />
<link rel="alternate" hreflang="de" href="https://example.com/de/product/" />
<link rel="alternate" hreflang="ja" href="https://example.com/ja/product/" />
<link rel="alternate" hreflang="x-default" href="https://example.com/product/" />
```

B. HTTP Link headers:

```
Link: <https://example.com/product/>; rel="alternate"; hreflang="en",
      <https://example.com/de/product/>; rel="alternate"; hreflang="de"
```

C. XML Sitemap annotations:

```xml
<url>
  <loc>https://example.com/product/</loc>
  <xhtml:link rel="alternate" hreflang="en" href="https://example.com/product/" />
  <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/product/" />
</url>
```

Common Errors:

- Missing x-default: x-default specifies which page to display when no matching language version is found
- Asymmetry: Page A's hreflang points to B, but B does not point back to A. Mutual referencing is required
- Incorrect language codes: Using "en-uk" instead of "en-gb", "zh-cn" instead of "zh-Hans"
- Canonical and hreflang conflicts: The canonical for each language version should point to itself, not to other language versions
- Returning 4xx/5xx: URLs pointed to by hreflang must return 200
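The asymmetry error (A points to B, but B never points back) is mechanical enough to check in code. A minimal sketch over pre-extracted annotations; how you collect each page's hreflang map (a crawler, a sitemap parse) is up to you:

```python
def missing_return_links(
    annotations: dict[str, dict[str, str]],
) -> list[tuple[str, str]]:
    """annotations maps page URL -> {hreflang code: target URL}.

    Returns (source, target) pairs where the target page never
    annotates back to the source, violating mutual referencing.
    """
    problems = []
    for page, langs in annotations.items():
        for target in langs.values():
            if target == page:  # self-reference is expected and fine
                continue
            if page not in annotations.get(target, {}).values():
                problems.append((page, target))
    return problems

site = {
    "https://example.com/": {
        "en": "https://example.com/",
        "de": "https://example.com/de/",
    },
    "https://example.com/de/": {
        "en": "https://example.com/",
        "de": "https://example.com/de/",
    },
}
print(missing_return_links(site))  # -> [] (fully reciprocal)
```

Any pair in the output is a broken return link that Google will likely ignore.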
| Language/Region | Hreflang Code | Description |
|---|---|---|
| English (Global) | [en] | No region specified; targets all English-speaking users |
| English (US) | [en-us] | Targets users in the United States |
| English (UK) | [en-gb] | Targets users in the United Kingdom |
| German | [de] | All German-speaking users (no region specified) |
| German (Austria) | [de-at] | Targets users in Austria |
| Simplified Chinese | [zh-Hans] | Simplified Chinese script users |
| Traditional Chinese | [zh-Hant] | Traditional Chinese script users |
| Japanese | [ja] | Japanese language users |
| Spanish | [es] | All Spanish-speaking users (no region specified) |
| Default | [x-default] | Default version when no language/region match is found |
Modern websites increasingly rely on JavaScript. React, Vue, Angular, Next.js – these frameworks make front-end development more efficient but also introduce SEO challenges.
Google's crawling process is split into two steps:
A. Crawling HTML: Googlebot fetches the page's raw HTML
B. Rendering: Google's Web Rendering Service (WRS) executes JavaScript to generate the final DOM
The issue is that there can be a delay between these two steps. Google requires additional resources to render JavaScript, so the rendering queue can become backlogged. Your page might be crawled, but take days or even weeks to be rendered and indexed.
| Rendering Method | Description | SEO Friendliness | Best Use Case |
|---|---|---|---|
| Server-Side Rendering (SSR) | Server generates complete HTML before sending to browser | Best | Content sites, e-commerce, news platforms |
| Static Site Generation (SSG) | HTML files generated at build time and served as static files | Best | Blogs, documentation, marketing pages, portfolios |
| Client-Side Rendering (CSR) | Browser executes JavaScript to generate content | Worst | Admin panels, dashboards, apps that do not require SEO |
| Hybrid Rendering | Critical content rendered on server, non-critical on client | Good | Large applications, complex SPAs, dynamic content sites |
If your website relies on SEO traffic, SSR or SSG is strongly recommended.
- Content not present in HTML source: If you view the page source (Ctrl+U) and cannot see the main content, the content is rendered by JavaScript. Google may not be able to index it correctly.
- Internal links are JavaScript events instead of <a> tags: Using onclick="window.location='...'" may not be recognized by Googlebot as a link. Use standard <a href="..."> tags.
- Lazy-loaded content: If content requires scrolling or clicking to load, Googlebot may not see it. Critical content should not be lazy-loaded.
- JavaScript errors preventing rendering: If JavaScript throws errors, the page may not render correctly. Regularly check for JavaScript errors.
- Meta tags dynamically set via JavaScript: Titles and descriptions should be generated server-side, not set dynamically with JavaScript.
- Google Search Console URL Inspection Tool: View the rendered page screenshot to confirm content completeness
- View Page Source: Use Ctrl+U to see the raw HTML and confirm critical content is present in the source
- Test with JavaScript Disabled: Disable JavaScript in Chrome DevTools and check whether the page still shows content
- site: Search: Perform a site:example.com search on Google to see if indexed pages have correct titles and descriptions
If you use React or Vue, Next.js (for React) or Nuxt.js (for Vue) are recommended. These frameworks natively support SSR and SSG, greatly simplifying JavaScript SEO.
Next.js SEO Advantages:
- Supports SSR, SSG, and ISR (Incremental Static Regeneration)
- Automatic code splitting, reducing JavaScript size
- Built-in image optimization (next/image)
- Built-in Head component for easy meta tag configuration
- Supports Sitemap generation
Website migration is one of the highest-risk operations in technical SEO. Changing domains, platforms, or URL structures – any mistake in any step can lead to significant traffic drops.
- Domain migration: Moving from http://old-domain.com to http://new-domain.com
- Platform migration: Moving from WordPress to Shopify, or vice versa
- URL structure change: Changing from /products/category/product-name to /product-name
- Protocol migration: Moving from HTTP to HTTPS
- Subdomain migration: Moving from blog.example.com to example.com/blog
Before Migration:
- Crawl the old website with Screaming Frog and record all URLs
- Export ranking and traffic data for all pages (Google Search Console + Analytics)
- Record all external links pointing to the old URLs (Ahrefs or SEMrush)
- Create a complete URL mapping table (old URL → new URL)
- Verify all redirects in a test environment
During Migration:
- Set up all 301 redirects
- Update all internal links
- Update canonical tags
- Update Sitemap
- Update robots.txt
- Update URLs in structured data
After Migration:
- Add the new domain/URL to Google Search Console
- Submit the new Sitemap
- For domain migrations, use Google Search Console's "Change of Address" tool
- Monitor traffic changes (a short-term decline is expected)
- Check for 404 errors
- Verify all redirects are working correctly
- Monitor ranking changes
According to Search Engine Journal, a successful website migration typically takes 3-6 months to fully recover traffic. If the migration is done well, traffic may recover most of its volume within a few weeks.
If traffic continues to decline for more than 3 months after migration, investigate:
- Whether important pages have redirects set up
- Whether redirects are 301 (not 302)
- Whether content quality on new pages has declined
- Whether there are technical issues (noindex, robots.txt blocking, etc.)
Technical SEO is not a one-time task. Websites are constantly changing – new pages, new features, plugin updates, server changes – each change can introduce new technical issues.
A full technical SEO audit is recommended quarterly.
| Tool | Price | Key Features |
|---|---|---|
| Screaming Frog | Free (500 URLs) / $259/year | Most comprehensive crawler tool; desktop-based; extensive customization |
| Ahrefs Site Audit | From $99/month (includes other features) | Cloud-based audit; automatic periodic scanning; integrated with Ahrefs SEO platform |
| SEMrush Site Audit | From $129/month (includes other features) | Cloud-based audit; clear issue categorization; priority-based recommendations |
| Sitebulb | $152/year | Strong visualization; excellent for reporting; user-friendly interface |
| Google Search Console | Free | Official Google data; essential for indexing and performance monitoring |
| Lighthouse | Free | Built into Chrome; performance + accessibility + SEO audits |
Crawling and indexing:

- Is robots.txt correctly configured?
- Does the Sitemap include all important pages?
- Are any important pages noindexed?
- Are there a large number of 404 errors?
- Are there redirect chains or loops?
- Does the number of indexed pages match expectations?
On-page elements:

- Does each page have a unique title and description?
- Are title lengths between 50 and 60 characters?
- Are there duplicate titles or descriptions?
- Does an H1 tag exist, and is it unique?
- Do images have alt text?
- Are canonical tags correct?
Performance:

- Are Core Web Vitals passing?
- Is TTFB within 200ms?
- Is the page size reasonable (recommended within 3MB)?
- Are images compressed and using modern formats?
- Are CSS/JS compressed and combined?
HTTPS and security:

- Is the entire site HTTPS?
- Is there mixed content?
- Is the SSL certificate valid?
- Is the HTTP-to-HTTPS redirect correct?
Structured data:

- Is Schema markup correct?
- Are there validation errors?
- Does it cover all applicable page types?
Mobile:

- Is the design responsive?
- Does mobile content match desktop content?
- Are touch targets large enough?
- Are there any mobile-specific issues?
WordPress is the most widely used CMS globally, powering over 40% of websites. Its SEO flexibility is strong, but it requires correct configuration.
SEO plugins:

- Rank Math: More feature-rich, powerful, even in the free version. Recommended.
- Yoast SEO: Veteran plugin, stable and reliable, but the free version has limited features.

Caching plugins:

- WP Rocket: Paid, but the simplest configuration with the best results
- LiteSpeed Cache: Free, ideal if your server uses LiteSpeed
- W3 Total Cache: Free, feature-rich but complex configuration

Image optimization:

- ShortPixel: Automatically compresses uploaded images, supports WebP conversion
- Imagify: From the same company as WP Rocket, good integration
Settings → Permalinks → Select "Post name" (/%postname%/). This is the cleanest URL structure.
- Enable Sitemap module
- Set global title and description templates
- Enable breadcrumb navigation
- Configure default Schema types (Article for blog posts, Product for products)
- In Sitemap settings, exclude content types that do not need indexing (tags, author archives, etc.)
- Disable WordPress's built-in Emoji script (reduces one HTTP request)
- Disable XML-RPC (security consideration)
- Disable oEmbed (if not needed)
- Limit the number of post revisions (reduces database bloat)
Add to wp-config.php:
```php
define('WP_POST_REVISIONS', 5);
define('DISALLOW_FILE_EDIT', true);
```

WordPress databases expand over time – post revisions, spam comments, expired transients. Clean regularly:
- Use the WP-Optimize plugin for automatic cleaning
- Or manually execute SQL cleaning (requires backup)
A CDN (Content Delivery Network) distributes your website content across servers worldwide. When users visit, they fetch content from the nearest server, resulting in faster speeds.
For international websites, a CDN is almost essential. Your server might be in the US, but customers could be in Europe, Southeast Asia, or the Middle East – without a CDN, access speeds for users in these regions will be slow.
Cloudflare's free version offers:
- Global CDN
- Free SSL certificate
- Basic DDoS protection
- DNS management
- Page rules (3 free)
| Setting | Recommended Value | Reason |
|---|---|---|
| SSL/TLS mode | Full (Strict) | Ensures end-to-end encryption with certificate validation |
| Always Use HTTPS | Enabled | Automatically redirects HTTP to HTTPS for security |
| Auto Minify | Enabled (HTML/CSS/JS) | Reduces file sizes and improves load times |
| Brotli compression | Enabled | Higher compression rate than Gzip |
| Browser Cache TTL | 1 month (or longer) | Reduces repeat requests for returning visitors |
| Rocket Loader | Decide after testing | May speed up JavaScript loading, but can cause issues with some scripts |
| Early Hints | Enabled | Tells browser about needed resources in advance |
Note: Cloudflare's Rocket Loader may conflict with some JavaScript. After enabling, always test whether the website functionality remains normal.
By 2025, AI search is no longer the future – it is the present. Google's AI Overviews, Bing's Copilot, Perplexity, ChatGPT's search functionality – these AI search engines are changing how users access information.
Technical SEO is equally important for AI search. AI search engines still need to crawl and understand your web content.
- Structured data becomes more important: AI needs structured information to generate answers. Schema markup helps AI understand your content
- Content extractability: AI needs to be able to extract key information from your pages. A clear HTML structure (H1-H6, lists, tables) is easier for AI to understand than large blocks of plain text
- E-E-A-T signals: AI search engines tend to cite authoritative sources. Author information, cited sources, professional credentials – these signals help AI determine whether your content is trustworthy
- Page speed still matters: AI crawlers also have crawl budgets. Slow websites will be crawled less frequently
llms.txt is an emerging standard, similar in spirit to robots.txt but aimed at AI crawlers. It tells AI models how to use your content.
This standard is still in its early stages, and there is no evidence yet that it improves visibility in AI search. However, if you want to experiment, you can create an llms.txt file in your website's root directory.
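For reference, the draft llms.txt proposal uses plain Markdown: an H1 with the site name, a short blockquote summary, then sections of annotated links to your most useful pages. A hypothetical example (the format is still a draft and may change; all URLs below are illustrative):

```
# Example Company

> Industrial valve manufacturer. The pages below are the best entry
> points for understanding our products and documentation.

## Products
- [Ball Valves](https://example.com/ball-valves/): Full-port stainless steel ball valves
- [Gate Valves](https://example.com/gate-valves/): Rising-stem gate valves for high-pressure lines

## Guides
- [Valve Sizing Guide](https://example.com/valve-sizing/): How to select the right valve size
```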
Technical SEO is the foundation of SEO. Without a solid foundation, content and backlinks built on top are like castles in the air.
Core Takeaways:
- Ensure crawlability: robots.txt correctly configured, Sitemap complete, internal link structure sound
- Ensure indexability: Canonical tags correct, noindex used only where appropriate, regularly check index status
- Ensure speed: Core Web Vitals passing, images optimized, CDN acceleration
- Ensure mobile-friendliness: Responsive design, content consistency, touch-friendly
- Ensure security: Full site HTTPS, no mixed content
- Ensure clarity of structure: Schema markup, clear URLs, flat architecture
- Regular audits: Full audit quarterly, timely detection and resolution of issues
Technical SEO does not require you to become a programmer. But you need to understand these concepts, know how to check them, and know who to contact to fix issues when they arise.
If you are using WordPress or Shopify, most technical SEO issues have ready-made plugins or apps to solve them. The key is knowing what to check, checking regularly, and fixing issues promptly when found.
When technical SEO is done right, your content and backlinks can deliver their maximum value.