I’ve been working on a script that goes to a URL and scrapes some pieces of data from it, which is pretty much a crawler or spider.
If every page the crawler landed on were valid, my job would be easy. In reality, though, many pages are not valid, and the script has to fall back on regular expressions.
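Just to sketch the idea (the URL, the tag I’m pulling out, and the pattern here are all placeholders, not the real ones from my script), the flow in Python looks roughly like this: try to read the page as XML first, and only fall back to a regular expression when the markup turns out not to be well formed.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

def scrape_titles(url):
    """Fetch a page and pull out <h2> titles, tolerating invalid markup."""
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    try:
        # Happy path: the markup is well formed, so treat it as plain XML data.
        root = ET.fromstring(html)
        return [h2.text for h2 in root.iter("h2")]
    except ET.ParseError:
        # Sad path: invalid markup, so resort to a hand-written regular expression.
        return re.findall(r"<h2[^>]*>(.*?)</h2>", html, re.S)

# Hypothetical usage; example.com just stands in for a real target site.
print(scrape_titles("http://example.com/articles"))
```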
Being crawled like this can be a good or a bad thing for the site’s owner.
However, exposure is essential for marketing a site, and a valid HTML page has a better chance of being picked up by search engines such as google.com, because a valid page gives a search engine’s crawler what it wants more efficiently.
I believe the engineers who work on those crawlers have overcome many difficulties caused by invalid markup. But when the HTML in a page is not valid (treating it as XML), those smart engineers have to come up with some logic to work around it, perhaps using regular expressions. That approach is prone to mistakes and can end up scraping only a few items from an invalid page. After all, engineers are human, and humans make mistakes.
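Here’s a tiny, made-up illustration of what I mean by “only a few”: a regular expression written against one particular shape of markup silently skips everything that deviates from it, where a real parse of valid markup would have found all three links.

```python
import re

snippet = """
<a href="/post/1">First</a>
<a href='/post/2'>Second</a>
<a class="old" href=/post/3>Third</a>
"""

# The pattern only matches double-quoted href values that come right after <a,
# so the second and third links are silently dropped.
links = re.findall(r'<a href="([^"]+)">', snippet)
print(links)  # ['/post/1']
```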
For the same reason, a page written in well-formed, semantic HTML has a higher chance of being surfaced for the keywords users actually type.
That’s just my idea of how an HTML page should be constructed, considering SEO and future use.
So my recommendation is this:
1. Treat markup in a page as data. Forget about presentation and such. Just make sure the data is valid.
2. Use CSS to present the data (= the HTML markup) in a way that appeals to users.
It’s quite simple after all.
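To show what “markup as data” buys you, here is an assumed example of a well-formed, semantic fragment: because every tag is closed and every attribute is quoted, a plain XML parser can read it straight off as data, and how it looks is left entirely to a stylesheet.

```python
import xml.etree.ElementTree as ET

# Well-formed, semantic markup: every tag closed, every attribute quoted.
page = """
<ul class="products">
  <li><span class="name">Keyboard</span> <span class="price">45.00</span></li>
  <li><span class="name">Mouse</span> <span class="price">19.00</span></li>
</ul>
"""

root = ET.fromstring(page.strip())
products = {
    li.find("span[@class='name']").text: li.find("span[@class='price']").text
    for li in root.findall("li")
}
print(products)  # {'Keyboard': '45.00', 'Mouse': '19.00'}
```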