by Christoph Schiessl on JavaScript
I previously explained how to use JavaScript to extract all URLs from your website's sitemap. Today, I want to build on top of this and re-use the same browser functions to extract the <title>
and <meta name="description">
texts from the pages behind all of these URLs.
My motivation for doing this is to programmatically check the length of my pages' titles and descriptions. I have been told that titles should be no more than 70 characters and descriptions should be no more than 160 characters for optimal SEO results. Therefore, this project aims to generate a list of page URLs from my /sitemap.xml
that violate these two rules. Furthermore, pages with missing titles or descriptions should also be included in the output.
As a starting point, we will use the async
function that we developed before to fetch, parse, and query your website's /sitemap.xml
file:
async function extractPageURLsFromSitemapXML() {
let rawXMLString = await fetch("/sitemap.xml").then((response) => response.text());
let parsedDocument = (new DOMParser()).parseFromString(rawXMLString, "application/xml");
let locElements = parsedDocument.querySelectorAll("loc");
return locElements.values().map((e) => e.textContent.trim()).toArray();
}
This function returns the list of page URLs that we will be working with, for each one, performing the following steps:
String
into a data structure to work with.Finally, once we have a function to retrieve the metadata from a page, we will iterate over all pages and select the ones that do not follow the SEO rules above to obtain the final result.
As before, to fetch the /sitemap.xml
, we use the asynchronous fetch()
function to download the page from the server. We must await
the Promise
object returned by fetch()
to get the Response
object to which the Promise
resolves.
let response = await fetch(pageURL);
let rawHTMLString = await response.text();
Next, we use the text()
function of the Response
object to get the HTTP response body as a simple String
. We have to await
this function call, too, because it also returns a Promise
, which resolves to the String
we are looking for.
String
Also, the parsing of the raw String
is similar but not the same as that of sitemap.xml
. We can still use the built-in DOMParser
, but we have to use the mime type text/html
. The mime type controls the type of the return value of the parser. With application/xml,
you get an instance XMLDocument
(i.e., parsedDocument instance of XMLDocument === true
), but it would not be correct in this context. Also, generally speaking, HTML documents are not valid XML documents (with the now-abandoned XHTML standard being the exception).
let parsedDocument = (new DOMParser()).parseFromString(rawHTMLString, "text/html");
Instead, we have to use the correct mime type, text/html,
which makes the parsed return an instance of HTMLDocument
(i.e., parsed document instance of HTMLDocument === true
).
Now, things get more interesting. Previously, we used querySelectorAll()
, which takes a CSS selector as a parameter and returns a NodeList
of elements matching the given selector. This is more than we need to extract the title and description because valid pages only ever have a single title and description. Therefore, we can use querySelector()
, the smaller sibling of querySelectorAll()
, which also takes a CSS selector as its sole parameter. However, it returns only the first matching element. If there is no such element, then it returns null
.
let title = null;
let titleElement = parsedDocument.querySelector("head > title");
if (titleElement !== null) { title = titleElement.textContent; }
Once we have an Element
representing the page's <title>
element, we can use the textContent
property to access its text. Lastly, I'm using trim()
to remove any leading or trailing whitespace.
For the description, we can use mostly the same code. We need a different CSS selector, though, and we have to extract the text in a different way because <meta>
tags have no text content but keep their payload in a content
attribute.
let description = null;
let descriptionElement = parsedDocument.querySelector("head > meta[name=description][content]");
if (descriptionElement !== null) { description = descriptionElement.getAttribute("content"); }
To accomplish this, we can use the getAttribute()
method of the Element
object.
At this point, we have everything we need to write a function to extract the title and description from a given page URL.
async function extractPageTitleAndDescription(pageURL) {
let title = null, description = null;
let response = await fetch(pageURL);
let rawHTMLString = await response.text();
let parsedDocument = (new DOMParser()).parseFromString(rawHTMLString, "text/html");
let titleElement = parsedDocument.querySelector("head > title");
if (titleElement !== null) { title = titleElement.textContent; }
let descriptionElement = parsedDocument.querySelector("head > meta[name=description][content]");
if (descriptionElement !== null) { description = descriptionElement.getAttribute("content"); }
return { title, description };
}
To call this function for all page URLs extracted from /sitemap.xml
, we can use a simple loop:
const pagesWithMeta = {};
for (const pageURL of await extractPageURLsFromSitemapXML()) {
pagesWithMeta[pageURL] = await extractPageTitleAndDescription(pageURL);
}
console.table(pagesWithMeta);
Note that we have to await
our own functions because they are both async
— they have to be because they use the browser's asynchronous functionality. In any case, here is the output of all pages with the respective titles and descriptions:
Finally, we can use the same loop again to iterate over all pages and check whether each satisfies our SEO rules. If it doesn't, we append the page to the pagesWithErrors
object. Otherwise, we ignore it and do nothing.
const pagesWithErrors = {};
for (const pageURL of await extractPageURLsFromSitemapXML()) {
const { title, description } = await extractPageTitleAndDescription(pageURL);
const errors = [];
if (title === null) { errors.push('title missing'); }
else if (title.length > 70) { errors.push(`title too long ${title.length}`); }
if (description === null) { errors.push('description missing'); }
else if (description.length > 160) { errors.push(`description too long ${description.length}`); }
if (errors.length > 0) { pagesWithErrors[pageURL] = errors.join(", "); }
}
console.table(pagesWithErrors);
We are also assembling a quick description that explains what's wrong with the page. For instance, we add error messages like title too long
or description missing
. If you run this whole code now, you get the following output:
Now I have a few pages to SEO optimize ;)
Here is the full code again for your convenience:
async function extractPageURLsFromSitemapXML() {
let rawXMLString = await fetch("/sitemap.xml").then((response) => response.text());
let parsedDocument = (new DOMParser()).parseFromString(rawXMLString, "application/xml");
let locElements = parsedDocument.querySelectorAll("loc");
return locElements.values().map((e) => e.textContent.trim()).toArray();
}
async function extractPageTitleAndDescription(pageURL) {
let title = null, description = null;
let response = await fetch(pageURL);
let rawHTMLString = await response.text();
let parsedDocument = (new DOMParser()).parseFromString(rawHTMLString, "text/html");
let titleElement = parsedDocument.querySelector("head > title");
if (titleElement !== null) { title = titleElement.textContent.trim(); }
let descriptionElement = parsedDocument.querySelector("head > meta[name=description][content]");
if (descriptionElement !== null) { description = descriptionElement.getAttribute("content").trim(); }
return { title, description };
}
const pagesWithMeta = {};
for (const pageURL of await extractPageURLsFromSitemapXML()) {
const { title, description } = await extractPageTitleAndDescription(pageURL);
pagesWithMeta[pageURL] = { title, description };
}
console.table(pagesWithMeta);
const pagesWithErrors = {};
for (const pageURL of await extractPageURLsFromSitemapXML()) {
const { title, description } = await extractPageTitleAndDescription(pageURL);
const errors = [];
if (title === null) { errors.push('title missing'); }
else if (title.length > 70) { errors.push(`title too long ${title.length}`); }
if (description === null) { errors.push('description missing'); }
else if (description.length > 160) { errors.push(`description too long ${description.length}`); }
if (errors.length > 0) { pagesWithErrors[pageURL] = errors.join(", "); }
}
console.table(pagesWithErrors);
Anyway, that's everything for today. Thank you very much for reading, and see you again next time!
I send two weekly emails on building performant and resilient Web Applications with Python, JavaScript and PostgreSQL. No spam. Unscubscribe at any time.
The article discusses how to validate the word count of web pages for SEO purposes. It explains how to use JavaScript to count the number of words in HTML documents.
By Christoph Schiessl on JavaScript
sitemap.xml
with JavaScript
Learn JavaScript techniques needed to parse your sitemap.xml
in order to obtain a list of all pages making up your website.
By Christoph Schiessl on JavaScript
ETag
and If-None-Match
Headers
Learn how to use ETag
and If-None-Match
headers to limit your web application's resource consumption by preventing data retransfers.
By Christoph Schiessl on Python and FastAPI