You learn how to create and submit an XML sitemap to Google Search Console. You see “Success.” You assume Google’s crawling everything now. Then you check back in three weeks and half your pages still aren’t indexed. Or worse, Google’s crawling 10,000 pages you never wanted indexed in the first place, burning through crawl budget on pagination, filter, and archive URLs while your best content sits undiscovered.

This happens because most people treat XML sitemaps like a checkbox. Generate it, submit it, forget it. But sitemaps aren’t magic. They’re a crawl efficiency mechanism. And when you don’t understand how search engines actually use them, you end up creating indexation problems instead of solving them.

This guide explains how to create and submit an XML sitemap by breaking down the system underneath it—what sitemaps actually do, how crawlers prioritize them, what to include versus exclude, and how to maintain them as your site evolves. Not just the steps. The structure.


Key Takeaways

  • XML sitemaps are advisory, not mandatory—Google can ignore your sitemap entirely if other signals conflict with it
  • Bad sitemaps waste crawl budget by directing crawlers to low-value pages instead of strategic content
  • Sitemap submission doesn’t guarantee indexation—you need ongoing monitoring to confirm what actually gets crawled
  • Strategic exclusion matters more than inclusion—removing junk URLs from your sitemap improves crawl efficiency
  • Sitemap maintenance is a system, not a one-time task—as your site evolves, your sitemap needs regular audits

What an XML Sitemap Actually Does (And Doesn’t Do)

An XML sitemap is a structured file that lists URLs on your website along with metadata about when they were last updated, how frequently they change, and (theoretically) how important they are relative to other pages.

You’re essentially giving search engines a declared list of URLs you want crawled.

But here’s what most content gets wrong: A sitemap is not a discovery mechanism. Google doesn’t need your sitemap to find your pages. Crawlers follow internal links. They parse your site structure. They monitor your RSS feeds. If a page is properly linked within your site, Google will find it eventually.

So why do sitemaps exist?

They exist to influence crawl prioritization. When you submit a sitemap, you’re telling search engines “these are the URLs I care about, and here’s when they were last updated.” The crawler reads that, compares it to what it already knows about your site, and decides whether those URLs are worth crawling based on your site’s crawl budget allocation.

Crawl budget is the number of pages Google is willing to crawl on your site within a given timeframe. It’s determined by your site’s authority, server performance, and perceived content quality. Low-authority sites get less crawl budget. High-authority sites with fast servers get more.

If you have 10,000 pages but Google only crawls 500 per day, your sitemap helps determine which 500 get prioritized. If your sitemap is full of junk—duplicate pages, parameter URLs, archived content nobody searches for—you’re directing crawl budget away from pages that actually matter.

That’s the tradeoff most people miss. Inclusion feels safer than exclusion. But when you include everything, you dilute crawl focus.

XML Sitemaps vs. HTML Sitemaps

XML sitemaps are machine-readable files designed for search engines. HTML sitemaps are human-readable pages designed for site navigation. They serve different purposes.

An HTML sitemap helps users navigate your site structure. It’s a page on your website with links organized by category. Some SEOs argue it helps with internal linking, and it can—but mostly for sites with poor navigation architecture. If your site structure is clean, HTML sitemaps are optional.

XML sitemaps, on the other hand, are the standard for any site that wants efficient crawling. They’re not displayed to users. They’re read by bots. And they follow a specific format that search engines parse automatically.

You need an XML sitemap. You probably don’t need an HTML sitemap unless your site has 10,000+ pages with weak internal linking.


How Search Engines Actually Use Your Sitemap

When you submit a sitemap to Google Search Console or Bing Webmaster Tools, here’s what happens:

  1. The crawler fetches your sitemap file from the URL you provided (usually yoursite.com/sitemap.xml)
  2. It parses the XML structure and extracts the list of URLs along with their metadata (lastmod, changefreq, priority)
  3. It compares those URLs to what it already knows about your site from previous crawls and internal link discovery
  4. It decides whether to crawl those URLs based on your site’s crawl budget, the perceived importance of each page, and how recently the content changed
  5. It crawls the URLs it deems valuable, indexes the content if it meets quality thresholds, and ignores or deprioritizes everything else

Notice what’s missing: There’s no guarantee.

Google can ignore your sitemap entirely. If your sitemap lists URLs that are blocked by robots.txt, return 404 errors, redirect to other pages, or contain thin content, Google will stop trusting your sitemap. Over time, if your sitemap consistently includes low-quality URLs, crawlers will rely more on internal link signals and less on your declared list.

This is why sitemap accuracy matters. It’s not just about having one. It’s about maintaining one that reflects your actual site priorities.

What Influences Whether Google Crawls Sitemap URLs

Several variables determine whether a search engine actually crawls the URLs in your sitemap:

Site authority – Low-authority sites get smaller crawl budgets, so sitemaps become more important as a crawl efficiency tool. High-authority sites can afford sloppier sitemaps because Google crawls them aggressively anyway.

Internal linking strength – If a URL is deeply linked within your site, Google prioritizes it over sitemap declarations. Strong internal links override sitemaps. Weak internal links make sitemaps more influential.

Content update frequency – If your sitemap’s lastmod dates are accurate and show recent updates, crawlers check those pages more often. If your lastmod dates never change or are clearly wrong, crawlers ignore them.

Server performance – If your sitemap file is slow to load or your server throttles requests, crawlers will reduce how often they fetch it. Fast sitemap delivery improves crawl frequency.

Historical trust – If you’ve submitted sitemaps in the past that were full of 404s, redirects, or blocked URLs, Google reduces its reliance on your sitemaps. Trust degrades over time with repeated inaccuracies.

The pattern here: Sitemaps work best when they reflect reality. They stop working when they become wishful thinking.
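The lastmod point in particular is automatable: tie the date to actual content changes rather than to when a page was last rendered. A minimal sketch in Python—the storage shape of `seen` is an assumption, so adapt it to whatever your CMS or build pipeline persists between runs:

```python
import hashlib
from datetime import date

def lastmod_for(url, content, seen, today=None):
    """Return a lastmod date that only advances when the content actually changes.

    `seen` maps url -> (content_hash, lastmod) carried over from the previous run.
    """
    today = today or date.today().isoformat()
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    prev = seen.get(url)
    if prev and prev[0] == digest:
        return prev[1]           # unchanged content: keep the old date
    seen[url] = (digest, today)  # new or changed content: stamp today
    return today
```

A sitemap built this way only shows a fresh lastmod when the page body changed, which is exactly the signal crawlers are checking against.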


Strategic Sitemap Construction: What to Include and Exclude

Most sitemap guides tell you to include every indexable URL. That’s wrong.

A better principle: Only include URLs you actively want crawled and indexed.

This means excluding:

  • Pagination URLs (unless they contain unique content)
  • Filter and sorting URLs (e.g., ?sort=price-low-to-high)
  • Search result pages (e.g., /search?q=keyword)
  • Duplicate content (if you have multiple URLs for the same content, only include the canonical)
  • Thin or low-quality pages (outdated blog posts, placeholder pages, category archives with no unique value)
  • Admin or user account pages (these should already be blocked by robots.txt)
  • URLs that redirect (sitemaps should only list final destination URLs)
  • URLs with noindex tags (Google explicitly tells you not to include these)

The goal is crawl focus. Every URL in your sitemap competes for crawl budget. If you include 5,000 URLs but only 500 actually matter for your business, you’re diluting the signal.
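The exclusion rules above can be encoded as a pre-filter that runs before sitemap generation. A hedged sketch—the parameter names and path prefixes here are illustrative, not exhaustive, and should be tuned to your own URL patterns:

```python
from urllib.parse import urlparse, parse_qs

EXCLUDED_PARAMS = {"sort", "page", "q", "color", "size"}      # illustrative
EXCLUDED_PREFIXES = ("/search", "/cart", "/wishlist", "/wp-admin")

def should_include(url, *, is_canonical=True, is_noindex=False, status=200):
    """Return True if a URL belongs in the sitemap under the exclusion rules."""
    if is_noindex or not is_canonical:
        return False                  # noindexed and non-canonical URLs stay out
    if status != 200:
        return False                  # 404s and redirects (3xx) stay out
    parts = urlparse(url)
    if parts.path.startswith(EXCLUDED_PREFIXES):
        return False                  # search, cart, admin-style paths
    if EXCLUDED_PARAMS & parse_qs(parts.query).keys():
        return False                  # filter/sort/search parameter URLs
    return True
```

Running every candidate URL through a filter like this is what turns “include everything” into “include what matters.”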

For large sites—ecommerce stores with 50,000 products, content hubs with 10 years of blog archives—this gets more complex. You need to prioritize strategically.

Sitemap Segmentation for Large Sites

If your site has more than 10,000 URLs, you should break your sitemap into multiple files organized by category or priority level.

Instead of one massive sitemap.xml file, create:

  • sitemap-index.xml (the master file that lists all other sitemaps)
  • sitemap-products.xml (active product pages)
  • sitemap-blog.xml (recent blog content)
  • sitemap-categories.xml (high-value category pages)
  • sitemap-archive.xml (older content that still has value but doesn’t need frequent crawling)

This structure gives you granular control. You can update sitemap-products.xml daily while only updating sitemap-archive.xml monthly. You can submit different sitemaps to different search engines if needed. And you can monitor crawl stats per sitemap to see which content types Google prioritizes.

For ecommerce specifically, exclude:

  • Out-of-stock products (unless they’ll restock soon)
  • Product variants that don’t have unique content (e.g., the same shirt in 10 colors shouldn’t be 10 sitemap entries unless each has unique descriptions)
  • Dynamically generated comparison pages
  • Wishlist or cart URLs

For content sites, exclude:

  • Author archive pages (unless they’re heavily optimized)
  • Tag archives (these are almost always thin)
  • Date-based archives (e.g., /2019/03/)
  • Comment pages or paginated comments

The cleaner your sitemap, the more efficiently crawlers navigate your site.


How to Create Your XML Sitemap

There are three main approaches: manual creation, CMS plugins, and programmatic generation. Which one you use depends on your site’s size and technical complexity.

Method 1: Manual Creation (Small Static Sites)

If you have fewer than 50 pages and a static site, you can create an XML sitemap manually.

The basic structure looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1</loc>
    <lastmod>2026-02-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/page-2</loc>
    <lastmod>2026-02-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>

Save this as sitemap.xml and upload it to your site’s root directory.

Important notes on XML tags:

  • <loc> – The full URL (required)
  • <lastmod> – Last modified date in YYYY-MM-DD format (optional but recommended)
  • <changefreq> – How often the page changes: always, hourly, daily, weekly, monthly, yearly, never (optional and mostly ignored by Google)
  • <priority> – A value from 0.0 to 1.0 indicating relative importance (optional and deprecated—Google ignores this)

For small sites, this works. For anything larger, manual management becomes unsustainable.

Method 2: CMS Plugins (WordPress, Shopify, Webflow)

Most modern content management systems have sitemap plugins or built-in sitemap generators.

WordPress: Yoast SEO, Rank Math, and All in One SEO all generate sitemaps automatically. They update dynamically as you publish new content. Configuration options let you exclude post types, taxonomies, or individual pages.

Shopify: Has a built-in sitemap at yourstore.com/sitemap.xml. It includes products, collections, pages, and blog posts automatically. You can’t customize it directly, but you can use apps like Sitemap NoIndex to exclude specific pages.

Webflow: Generates sitemaps automatically for published pages. Limited customization options.

Wix, Squarespace, Weebly: All generate sitemaps automatically. You generally can’t control what’s included, which is fine for small sites but problematic for large ones.

The advantage of plugins: Automation. You don’t have to manually update the sitemap every time you publish content.

The disadvantage: Lack of control. Most plugins include everything by default. If you have thin tag pages or parameter URLs, the plugin will add them to your sitemap unless you explicitly configure exclusions.

If you’re using a plugin, audit it. Don’t assume it’s doing what you want.

Method 3: Programmatic Generation (Custom Sites, Headless CMS, Large-Scale)

For large sites, dynamic sites, or headless setups, you need programmatic sitemap generation.

This typically involves:

  1. Querying your database for all indexable URLs
  2. Filtering out pages that shouldn’t be indexed (duplicates, thin content, noindexed pages)
  3. Generating the XML structure dynamically
  4. Serving the sitemap at a static URL (e.g., /sitemap.xml)
  5. Updating it automatically whenever content changes
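The rendering step (step 3) can be sketched in a few lines of Python. This assumes a `pages` iterable produced by your database query and filtering (steps 1–2); the `(url, lastmod)` tuple shape is an assumption for illustration:

```python
from xml.sax.saxutils import escape

def build_sitemap(pages):
    """Render an iterable of (url, lastmod) pairs as sitemap XML."""
    entries = "".join(
        "  <url>\n"
        f"    <loc>{escape(url)}</loc>\n"      # escape &, <, > in URLs
        f"    <lastmod>{lastmod}</lastmod>\n"
        "  </url>\n"
        for url, lastmod in pages
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}</urlset>\n"
    )
```

In a web framework you would serve this output at /sitemap.xml (step 4) and regenerate it on publish events or a schedule (step 5).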

Most modern frameworks have sitemap libraries:

  • Next.js: Use next-sitemap package
  • Gatsby: gatsby-plugin-sitemap (official plugin)
  • Django: built-in django.contrib.sitemaps framework
  • Ruby on Rails: sitemap_generator gem
  • Laravel: spatie/laravel-sitemap package

For JavaScript-heavy sites (React, Vue, Angular), make sure your sitemap generation happens server-side or during build time. Client-side rendering doesn’t work for sitemaps—crawlers need the XML file immediately accessible.

If your site has more than 50,000 URLs, you’ll need a sitemap index file that links to multiple sitemaps.

Example sitemap index structure:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-02-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-02-14</lastmod>
  </sitemap>
</sitemapindex>

This scales to 50,000 sitemaps × 50,000 URLs = 2.5 billion URLs. If you need more than that, you have bigger problems than sitemap structure.


Sitemap Validation and Error Checking

Before submitting your sitemap, validate it.

Common errors that break sitemaps:

  • Invalid XML formatting (missing closing tags, incorrect encoding)
  • Non-UTF-8 characters (special characters that aren’t properly encoded)
  • URLs that return 404 or 500 errors
  • URLs that redirect (sitemaps should only list final destination URLs)
  • URLs blocked by robots.txt (Google will ignore these and may reduce trust in your sitemap)
  • URLs with noindex tags (Google explicitly warns against including these)
  • File size exceeding 50MB uncompressed (split into multiple sitemaps)
  • More than 50,000 URLs in a single sitemap (use a sitemap index)
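Several of these checks can be automated before submission. A sketch using only Python’s standard library—it covers the structural checks (malformed XML, URL count, file size, non-absolute URLs); live status-code and redirect checks would need an HTTP client layered on top:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def validate_sitemap(xml_text):
    """Return a list of problems found in a sitemap string (empty = OK)."""
    try:
        root = ET.fromstring(xml_text)          # catches malformed XML
    except ET.ParseError as exc:
        return [f"invalid XML: {exc}"]
    problems = []
    urls = [u.findtext(f"{NS}loc", "") for u in root.iter(f"{NS}url")]
    if len(urls) > 50_000:
        problems.append("more than 50,000 URLs: use a sitemap index")
    if len(xml_text.encode("utf-8")) > 50 * 1024 * 1024:
        problems.append("file exceeds 50MB uncompressed: split it")
    for loc in urls:
        if not loc.startswith(("http://", "https://")):
            problems.append(f"non-absolute <loc>: {loc!r}")
    return problems
```

Running a script like this in CI before deployment catches the formatting errors that degrade sitemap trust over time.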

Use a sitemap validator before submitting:

  • XML Sitemap Validator (xml-sitemaps.com/validate-xml-sitemap.html)
  • Google Search Console’s sitemap report (shows errors after submission)
  • Screaming Frog SEO Spider (can crawl your sitemap and identify broken URLs)

If your sitemap has errors, fix them before submitting. Repeated submission of broken sitemaps trains crawlers to ignore you.


How to Submit Your Sitemap to Search Engines

Once your sitemap is generated and validated, submit it to search engines.

Google Search Console

  1. Log in to Google Search Console
  2. Navigate to Sitemaps in the left sidebar
  3. Enter your sitemap URL (e.g., sitemap.xml or https://example.com/sitemap.xml)
  4. Click Submit

Google will fetch your sitemap and start processing the URLs. This doesn’t mean instant indexation—it means Google has queued your URLs for crawling.

Check back in a few days to see the Discovered vs. Indexed count. If you submitted 1,000 URLs but only 200 are indexed, that’s a signal. Either:

  • The other 800 pages are low quality
  • They’re duplicates
  • They’re blocked somewhere else (robots.txt, noindex tags)
  • Google doesn’t see them as valuable

This is where ongoing monitoring matters.

Bing Webmaster Tools

  1. Log in to Bing Webmaster Tools
  2. Go to Sitemaps under the site dashboard
  3. Submit your sitemap URL
  4. Click Submit

Bing typically crawls sitemaps faster than Google for new sites. If you’re targeting international markets where Bing has stronger presence (Russia uses Yandex, China uses Baidu, but Bing powers DuckDuckGo and other privacy-focused engines), this matters.

IndexNow (Bing, Yandex, and Others)

IndexNow is a protocol that lets you instantly notify search engines when content changes instead of waiting for them to crawl your sitemap.

Instead of submitting a static sitemap and hoping crawlers check it, you ping an API endpoint whenever you publish or update a page.

Example API call:

POST https://api.indexnow.org/indexnow
{
  "host": "example.com",
  "key": "your-api-key",
  "keyLocation": "https://example.com/your-api-key.txt",
  "urlList": [
    "https://example.com/new-page"
  ]
}
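The same call is easy to script from a publish hook. A sketch using Python’s standard library—the host, key, and URLs are placeholders, and the key-file location follows the protocol’s `https://{host}/{key}.txt` convention:

```python
import json
import urllib.request

def build_indexnow_payload(host, key, urls):
    """Assemble the JSON body the IndexNow endpoint expects."""
    return {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }

def ping_indexnow(host, key, urls):
    """POST the payload to api.indexnow.org; 200/202 means accepted."""
    body = json.dumps(build_indexnow_payload(host, key, urls)).encode("utf-8")
    req = urllib.request.Request(
        "https://api.indexnow.org/indexnow",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Call `ping_indexnow(...)` whenever a page is published or updated, rather than batching—instant notification is the whole point of the protocol.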

Bing and Yandex support this. Google doesn’t (yet). For high-frequency publishing—news sites, ecommerce stores launching products daily—IndexNow reduces time-to-index significantly.

If you’re on WordPress, plugins like Rank Math support IndexNow integration automatically.

Robots.txt Sitemap Declaration

You can also declare your sitemap in your robots.txt file:

User-agent: *
Sitemap: https://example.com/sitemap.xml

This helps crawlers discover your sitemap automatically, even if you haven’t manually submitted it to search consoles. It’s good hygiene, but not a replacement for GSC submission—manual submission gives you monitoring tools.


Ongoing Sitemap Maintenance: The System You Actually Need

Most people submit their sitemap once and forget about it. That’s where the system breaks down.

Your site evolves. You publish new content. You delete old pages. You restructure categories. You update product listings. If your sitemap doesn’t reflect these changes, it becomes increasingly inaccurate—and crawlers start ignoring it.

Here’s the maintenance system:

Monthly: Audit sitemap accuracy

  • Check Google Search Console’s sitemap report for errors
  • Confirm that submitted URLs match indexed URLs
  • Remove any 404s, redirects, or blocked URLs from the sitemap

Quarterly: Review crawl efficiency

  • Analyze which pages Google is crawling most frequently
  • Compare sitemap priorities to actual crawl behavior
  • Identify low-value pages consuming crawl budget and exclude them

After major site changes: Regenerate and resubmit

  • URL structure changes (e.g., moving from /blog/post-name to /content/post-name)
  • Platform migrations (e.g., WordPress to headless CMS)
  • Large-scale content deletions or consolidations

After content launches: Update immediately

  • For time-sensitive content (news, product launches), make sure your sitemap updates automatically
  • If you’re manually managing your sitemap, add new high-priority URLs within 24 hours

The goal isn’t perfection. It’s sustained accuracy. A sitemap that’s 95% accurate and updated regularly is more valuable than a perfect sitemap that never changes.


How to Measure Sitemap Effectiveness

Submitting a sitemap is easy. Knowing whether it’s actually helping is harder.

Here’s what to track:

Indexation rate – Compare the number of URLs in your sitemap to the number of indexed pages in Google Search Console. If you submitted 1,000 URLs but only 400 are indexed, investigate why.

Crawl frequency – GSC’s Crawl Stats report shows how often Google crawls your site. If you submit an updated sitemap but crawl frequency doesn’t increase, your sitemap might not be influencing crawler behavior.

Time to indexation – For new content, measure how long it takes from publication to Google indexation. If it’s consistently 7+ days, your sitemap might not be getting crawled frequently enough—or your site’s crawl budget is too low.

Sitemap errors – GSC flags errors like 404s, blocked URLs, and redirect chains. If errors persist across multiple crawls, fix them. Repeated errors degrade sitemap trust.

Coverage issues – GSC’s page indexing (coverage) report shows which URLs Google discovered but didn’t index, and why. “Crawled – currently not indexed” usually signals a content quality problem. “Discovered – currently not indexed” means Google found the URL but hasn’t crawled it yet—typically a crawl budget or perceived-value issue rather than a sitemap problem.

If your sitemap isn’t improving indexation velocity or crawl efficiency, the issue is usually one of three things:

  1. Your site has low authority (not enough crawl budget allocated)
  2. Your sitemap includes too many low-value URLs (diluting crawl focus)
  3. Your internal linking is weak (crawlers rely more on links than sitemaps)

Sitemaps amplify efficiency. They don’t create it from nothing.


Common Sitemap Mistakes and Edge Cases

Mistake 1: Including URLs That Are Blocked or Noindexed

If a URL has a noindex tag or is blocked by robots.txt, don’t include it in your sitemap. Google explicitly warns against this. It wastes crawl budget and reduces sitemap trust.

Mistake 2: Using Incorrect lastmod Dates

If your lastmod dates never change or are clearly wrong (e.g., all pages show the same date), Google will ignore them. Only include lastmod if you’re tracking actual content updates.

Mistake 3: Not Using Canonical URLs

Your sitemap should only include canonical URLs. If you have duplicate content with canonical tags pointing elsewhere, don’t include the non-canonical versions in your sitemap. This confuses crawlers about which version to prioritize.

Mistake 4: Submitting a Sitemap Full of Paginated URLs

Pagination (e.g., /blog/page/2, /products?page=3) rarely needs to be in a sitemap unless each page has unique, valuable content. Most paginated pages are thin and waste crawl budget. Note that Google no longer uses rel="next" and rel="prev" as indexing signals—instead, keep paginated pages reachable through normal internal links so deep content stays discoverable, let each page self-canonicalize, and leave the series out of your sitemap.

Edge Case 1: JavaScript-Rendered Sites

If your site relies heavily on JavaScript (React, Vue, Angular), make sure your sitemap URLs are server-side rendered or pre-rendered at build time. Google can render JavaScript, but it’s slower and less reliable. If a URL requires JS execution to display content, it might not get indexed even if it’s in your sitemap.

Edge Case 2: Multilingual and Multi-Regional Sites

If you have multiple language or regional versions of the same content (e.g., example.com/en/page and example.com/es/page), use hreflang annotations inside your sitemap to tell search engines which version to show to which audience.

Example (the xhtml namespace must be declared on the <urlset> element, or the file won’t validate):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/page</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/page"/>
    <xhtml:link rel="alternate" hreflang="es" href="https://example.com/es/page"/>
  </url>
</urlset>

This prevents duplicate content issues and ensures the right version ranks in the right region.
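Writing these blocks by hand is error-prone, because every language version needs its own <url> block listing the full set of alternates (including itself). A small helper sketch—the input shape is an assumption:

```python
from xml.sax.saxutils import escape, quoteattr

def hreflang_entry(loc, alternates):
    """Build one <url> block; `alternates` maps hreflang code -> URL.

    Call once per language variant with the same `alternates` dict, so
    every version lists the complete, reciprocal set of links.
    """
    links = "".join(
        f'    <xhtml:link rel="alternate" hreflang={quoteattr(lang)} '
        f"href={quoteattr(href)}/>\n"
        for lang, href in sorted(alternates.items())
    )
    return f"  <url>\n    <loc>{escape(loc)}</loc>\n{links}  </url>\n"
```

Generating the blocks from one shared mapping guarantees the reciprocity hreflang requires—if the Spanish page points to the English one, the English page points back.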

Edge Case 3: Dynamic Ecommerce Filters

If your ecommerce site has faceted navigation (e.g., /products?color=red&size=large), these URLs should not be in your sitemap unless each filter combination has unique, valuable content. Most filter URLs are thin and create indexation bloat. Use canonical tags to consolidate them, and keep them out of your sitemap.


Structured Summary: The Sitemap System

XML sitemaps are a crawl efficiency tool, not a magic indexation button. They work best when they reflect your site’s actual priority structure and are maintained as your site evolves.

Core principles:

  1. Only include URLs you actively want crawled and indexed—strategic exclusion improves crawl focus
  2. Keep your sitemap accurate—remove 404s, redirects, and blocked URLs immediately
  3. Update it regularly—as your site changes, your sitemap should change with it
  4. Monitor indexation, not just submission—GSC’s coverage reports tell you whether your sitemap is actually working
  5. Segment large sitemaps—break them into categories to maintain granular control and faster crawl processing

Checklist:

  • [ ] Audit existing site structure and determine which URLs should be indexed
  • [ ] Generate sitemap using appropriate method (manual, plugin, or programmatic)
  • [ ] Validate sitemap for XML formatting errors and broken URLs
  • [ ] Submit to Google Search Console and Bing Webmaster Tools
  • [ ] Declare sitemap location in robots.txt
  • [ ] Set up monthly sitemap accuracy audits
  • [ ] Monitor indexation rates and crawl frequency in GSC
  • [ ] Update sitemap after major site changes or new content launches

If your sitemap is 95% accurate, updated monthly, and strategically excludes low-value pages, you’re in the top 10% of sites. Most sitemaps are bloated, outdated, and ignored by crawlers. Yours doesn’t have to be.


Looking to optimize your site’s crawl efficiency and indexation strategy? At MarginsEye, we audit technical SEO infrastructure to identify structural weaknesses that hold sites back. Get a technical SEO audit and see where your site’s crawl budget is actually going.


Frequently Asked Questions

1. Do I need an XML sitemap if my site is small?

If your site has fewer than 10 pages and strong internal linking, technically no. Google will find everything through links. But there’s no downside to having one, and it speeds up discovery for new pages. For any site that publishes content regularly or has more than 50 pages, yes—you need a sitemap.

2. How often should I update my sitemap?

It depends on how often your content changes. News sites and ecommerce stores should update daily or use dynamic sitemap generation. Blogs that publish weekly can update weekly. Static sites can update monthly. The key is consistency—update it whenever your site structure changes significantly.

3. Can I have multiple sitemaps?

Yes. For large sites, breaking your sitemap into multiple files (products, blog, categories) improves organization and crawl efficiency. Use a sitemap index file to link them all together.

4. What’s the difference between sitemap priority and actual crawl priority?

Sitemap priority tags are deprecated—Google ignores them. Actual crawl priority is determined by internal linking, content quality, update frequency, and site authority. Strong internal links override sitemap declarations.

5. Why are some URLs in my sitemap not getting indexed?

Common reasons: low content quality, duplicate content, thin pages, blocked by robots.txt or noindex tags, or your site doesn’t have enough crawl budget. Check Google Search Console’s coverage report for specific reasons.

6. Should I include images in my sitemap?

If images are important for your SEO strategy (e.g., ecommerce product photos, visual content sites), yes. Create a separate image sitemap or add image tags to your main sitemap. For most sites, this is optional.

7. Do I need to submit my sitemap to multiple search engines?

Yes. Submit to Google Search Console and Bing Webmaster Tools at minimum. If you target specific regions, also submit to Yandex (Russia), Baidu (China), or Naver (South Korea).

8. What happens if I submit a sitemap with errors?

Google will flag the errors in Search Console and may ignore affected URLs. Repeated submission of broken sitemaps degrades trust, meaning crawlers rely less on your sitemap over time. Always validate before submitting.

9. Can I block certain pages from my sitemap but still have them indexed?

Yes. Your sitemap is advisory, not mandatory. Pages with strong internal links can still get indexed even if they’re not in your sitemap. Excluding them just deprioritizes them in crawl order.

10. How do I know if my sitemap is actually being used by Google?

Check Google Search Console’s sitemap report. It shows how many URLs were submitted, how many were discovered, and how many were indexed. If discovered and indexed numbers are close to submitted numbers, your sitemap is working. If there’s a large gap, investigate why.

11. Should I use changefreq tags in my sitemap?

They’re optional and mostly ignored by Google. Only include them if you have accurate data on how often pages actually change. Incorrect changefreq tags don’t harm anything, but they don’t help either.

12. What file format should my sitemap use?

XML is standard. You can gzip compress it to .xml.gz to reduce file size and bandwidth. Both formats are accepted by all major search engines.


Next Read: How to Diagnose and Fix Crawl Budget Issues That Are Killing Your Indexation

Understanding crawl budget allocation is the next layer after sitemap optimization—because even a perfect sitemap won’t help if Google isn’t allocating enough crawl budget to your site in the first place.