Some lessons Elastic has learned from indexing the web

Indexing the web is hard. There’s a nearly infinite supply of misbehaving sites, misapplied (or ignored) standards, duplicate content, and corner cases to contend with. Building an easy-to-use web crawler that’s thorough and flexible enough to handle all the different content it encounters is a big task.

Read the whole article here:

Building a scalable, easy-to-use web crawler for Elastic Enterprise Search

