How to Find All Existing and Archived URLs on a Website

There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you're looking for. For example, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors

In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get that lucky.
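If you do turn up an old sitemap, pulling the URLs out of it takes only a few lines. Here's a minimal sketch, assuming a standard XML sitemap saved locally under the placeholder name sitemap.xml:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace defined by sitemaps.org
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("sitemap.xml")  # placeholder file name; use your saved export
urls = [loc.text.strip() for loc in tree.getroot().findall("sm:url/sm:loc", NS)]

print(f"{len(urls)} URLs recovered from the sitemap")
```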

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.

To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limits mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
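For larger sites, you can also sidestep the web interface entirely: the Wayback Machine exposes a CDX API that returns captured URLs in bulk. A minimal sketch, with example.com as a placeholder domain:

```python
import requests

# Query the Wayback Machine CDX API for unique captured URLs under a domain.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",   # placeholder; swap in the site you're auditing
        "matchType": "domain",  # include subdomains
        "fl": "original",       # return only the original URL field
        "collapse": "urlkey",   # deduplicate repeated captures of the same URL
        "output": "text",
        "limit": "50000",
    },
    timeout=60,
)
resp.raise_for_status()
urls = resp.text.splitlines()
print(f"{len(urls)} archived URLs found")
```

Expect the same quality caveats as the UI: resource files and malformed paths will appear in the output and need filtering later.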

Moz Pro
While you would typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
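Once you have the export, a quick pass in pandas turns it into a clean URL list. A sketch, assuming the CSV has a column named "Target URL" (column names vary between exports, so check yours and adjust):

```python
import pandas as pd

# Load a Moz Pro inbound links export; the file name and the
# "Target URL" column name are assumptions - match them to your export.
df = pd.read_csv("moz_inbound_links.csv")
target_urls = df["Target URL"].dropna().drop_duplicates()
target_urls.to_csv("moz_target_urls.csv", index=False)
print(f"{len(target_urls)} unique target URLs")
```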

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:

Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you may need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:

This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
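Here's a minimal sketch of pulling pages through the Search Console API with the official Python client, paging through results with startRow. The property URL, date range, and service-account file path are placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder path to your credentials
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    response = service.searchanalytics().query(
        siteUrl="https://example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-12-31",
            "dimensions": ["page"],
            "rowLimit": 25000,  # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = response.get("rows", [])
    pages.update(row["keys"][0] for row in rows)
    if len(rows) < 25000:
        break
    start_row += 25000

print(f"{len(pages)} pages with search impressions")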

Indexing → Pages report:

This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."

Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
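If the UI export still isn't enough, the GA4 Data API can pull page paths programmatically. A sketch using the official google-analytics-data client, where the property ID and date range are placeholders:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

# Assumes application default credentials are configured;
# "123456789" is a placeholder GA4 property ID.
client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-12-31")],
    limit=100000,
)
response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths")
```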

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be huge, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process; a short parsing sketch follows below.
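As a starting point, here's a minimal sketch that pulls unique request paths out of a log in the common/combined format (the file name access.log is a placeholder):

```python
import re

# Matches the request portion of common/combined log format lines,
# e.g. "GET /blog/post-1 HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1))

print(f"{len(paths)} unique paths requested")
```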
Combine, and good luck
Once you've gathered URLs from these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
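In a notebook, normalization and deduplication can look something like this sketch, assuming you've concatenated everything into one list of raw URL strings:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase scheme and host, drop fragments, and trim trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

raw_urls = [  # sample inputs; in practice, combine the lists from all sources above
    "https://example.com/blog/",
    "HTTPS://EXAMPLE.COM/blog",
    "https://example.com/blog#intro",
]
unique_urls = sorted({normalize(u) for u in raw_urls if u})
print(f"{len(unique_urls)} unique URLs")  # the three samples collapse to one
```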

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
