A DIY search engine that would be somewhat usable should work like this. I'm basing this on a model of Google I built a few years ago.
Physical
Firstly, you need some servers. The number will depend on how much content you plan to index.
You'll need an internet connection
You should have a firewall
You should have a web application firewall to protect your data (for example, KEMP's LoadMaster-integrated Web Application Firewall).
You'll need a load balancer for your application servers - both the web publisher and possibly your database servers (see: Load Balancers – bringing zen to your holistic SEO strategy).
Applications - the search web application
Note: most people think that search and ranking happen when you do a search. It's actually already done. To make search so fast, the search layer just returns the next 10 results from a list starting at position 1, and the only logic applied is (a) geo-specific pages are removed, (b) QDF (query deserves freshness) checks pull in items based on a date and time range, and (c) real-time data is pulled from a queue which has it ready to go.
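As a rough sketch of that serving step - assuming the ranked list for each query has already been computed and stored somewhere - the query-time logic could be as small as this (all names and structures here are illustrative assumptions, not a real implementation):

```python
from datetime import datetime, timedelta

# Hypothetical precomputed data: query -> ranked list of (doc_id, geo, published)
# tuples, plus a queue of realtime items. All names here are illustrative.
PRECOMPUTED_RANKINGS = {}   # e.g. {"iphone cost": [("doc42", "US", datetime(...)), ...]}
REALTIME_QUEUE = {}         # e.g. {"iphone cost": ["news-doc-7", ...]}

def serve_results(query, start=0, count=10, user_geo="US", qdf_window_days=None):
    """Return the next `count` results from an already-ranked list.

    No ranking happens here: we only (a) drop pages pinned to another geo,
    (b) apply a query-deserves-freshness date window if requested, and
    (c) splice in any realtime items waiting in the queue.
    """
    ranked = PRECOMPUTED_RANKINGS.get(query, [])

    # (a) geo-specific pages for other regions are removed
    ranked = [r for r in ranked if r[1] in (None, user_geo)]

    # (b) QDF check: keep only items inside the date/time range
    if qdf_window_days is not None:
        cutoff = datetime.utcnow() - timedelta(days=qdf_window_days)
        ranked = [r for r in ranked if r[2] >= cutoff]

    # (c) realtime data is pulled from a queue that already has it ready to go
    realtime_ids = REALTIME_QUEUE.get(query, [])
    page = [doc_id for doc_id, _, _ in ranked]
    return (realtime_ids + page)[start:start + count]
```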
Essentially you want to separate the HTML user interface from the business logic and the databases (the index).
I would imagine a 3-tier setup where you have a very basic Apache (or IIS) web application that lets a user enter a search query and then fires it at the search application, which sends the results to a second user interface. That way, should there be peak demand, you can basically disconnect the user-facing servers from the functioning search servers.
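As a sketch of that separation, the front-end tier below does nothing except accept a query and hand it to a separate search service over HTTP; the hostname, ports, and the use of Python's standard-library server are assumptions for illustration, not the actual stack:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs, quote
from urllib.request import urlopen

# Assumed address of the separate search tier; in a real deployment this would
# sit behind the load balancer rather than being a single host.
SEARCH_TIER_URL = "http://search-tier.internal:8081/search?q="

class FrontEndHandler(BaseHTTPRequestHandler):
    """Tier 1: a very thin UI layer. It never touches the index itself."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        if not query:
            body = b"<form><input name='q'><button>Search</button></form>"
        else:
            # Hand the query to the search tier and relay whatever it returns.
            with urlopen(SEARCH_TIER_URL + quote(query)) as resp:
                body = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), FrontEndHandler).serve_forever()
```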
The business logic layer - the "search" - should simply have two functions.
Function A - strip the search string in a query of non-value words (a, the, an, and, it, is), where each remaining word is searched, plus second queries where it's searched at different percentages of completeness. E.g. a search for "how much does an iphone cost" would become [much or does or iphone or cost] and also [much or does or iphone] and [much or does or cost].
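A rough sketch of what Function A might look like; the exact stop-word list and the way partial queries are generated (dropping one keyword at a time) are my assumptions about the "percentages of completeness" idea - "how" is included only so the example matches the one above:

```python
from itertools import combinations

# Non-value words to strip; "how" is added here only so the example below
# matches the one in the text - the real list is an open question.
STOP_WORDS = {"a", "the", "an", "and", "it", "is", "how"}

def build_queries(search_string):
    """Strip non-value words, then emit the full OR-query plus partial
    OR-queries at a lower 'completeness' (here: drop one keyword at a time)."""
    keywords = [w for w in search_string.lower().split() if w not in STOP_WORDS]
    queries = [" or ".join(keywords)]
    if len(keywords) > 2:
        for subset in combinations(keywords, len(keywords) - 1):
            queries.append(" or ".join(subset))
    return queries

print(build_queries("how much does an iphone cost"))
# ['much or does or iphone or cost', 'much or does or iphone',
#  'much or does or cost', 'much or iphone or cost', 'does or iphone or cost']
```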
The query is then handed to the next available search parser. The search parser connects to the next available database server. This is where load balancing comes in - if any single server becomes non-responsive, the load balancer just removes it from the queue and connects to the next available healthy server.
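A toy version of that failover behaviour might look like the following, assuming each backend server exposes some kind of health check; a real load balancer (hardware or software) would do this at the network level rather than in application code:

```python
from itertools import cycle

class SimpleLoadBalancer:
    """Round-robin over backend servers, skipping any that fail a health check."""

    def __init__(self, backends):
        self.backends = backends          # e.g. a list of database server objects
        self._rotation = cycle(backends)

    def next_available(self):
        # Try each backend at most once per call; unhealthy ones are skipped,
        # which is the "remove from the queue" behaviour described above.
        for _ in range(len(self.backends)):
            server = next(self._rotation)
            if server.is_healthy():       # assumed health-check hook
                return server
        raise RuntimeError("no healthy backend servers available")
```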
The parser requests the top 10 results with those words and is given a list of key IDs, each of which matches a URL.
The magic of rank order isn't established
during search - search is the final leg
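One way to picture that lookup, assuming the index stores document IDs in precomputed rank order for each keyword and a separate table maps IDs back to URLs (both structures are assumptions for illustration):

```python
# Assumed structures: an inverted index mapping each keyword to doc IDs stored
# in precomputed rank order, and a table mapping doc IDs back to URLs.
INVERTED_INDEX = {
    "iphone": [101, 205, 317],
    "cost":   [205, 101, 999],
}
URL_TABLE = {101: "https://example.com/iphone-prices",
             205: "https://example.org/cost-guide",
             317: "https://example.net/iphone-review",
             999: "https://example.com/pricing"}

def top_ids(or_query, limit=10):
    """Return up to `limit` doc IDs matching any keyword, walking the
    precomputed per-keyword lists in stored order (no re-ranking here)."""
    seen, results = set(), []
    keywords = [w.strip() for w in or_query.split(" or ")]
    for doc_id in (d for word in keywords for d in INVERTED_INDEX.get(word, [])):
        if doc_id not in seen:
            seen.add(doc_id)
            results.append(doc_id)
        if len(results) == limit:
            break
    return results

print([URL_TABLE[i] for i in top_ids("iphone or cost")])
```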
Applications - Crawl List
You need a URL list, which you would need to mark with last-crawled dates, time to crawl/page speed, and robots and meta-index allowed status.
You would also want to store a rank, a geo-location, and a language identifier.
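Sketched as a record, a crawl-list entry might carry fields along these lines; the field names and types are my assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class CrawlListEntry:
    """One row in the URL list, with the bookkeeping fields described above."""
    url: str
    last_crawled: Optional[datetime] = None  # when we last fetched it
    crawl_time_ms: Optional[int] = None      # time to crawl / page speed
    robots_allowed: bool = True              # robots.txt verdict
    meta_index_allowed: bool = True          # meta robots "index" allowed
    rank: float = 0.0                        # stored rank value
    geo: Optional[str] = None                # geo-location identifier
    language: Optional[str] = None           # language identifier
```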
Your crawl servers do the clever work
The crawl list is made up from a multitude of sources, including: crawling and processing files and pages found on a server, links found inside a file (i.e. links from other pages), and URLs accessed via a browser, a toolbar app, or a submit page on your search app.
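For the "links found inside a page" source, a crawler could pull candidate URLs out of fetched HTML with something as simple as the sketch below (standard library only; a real crawler would also resolve relative links, respect robots.txt, and so on):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags so they can be fed into the crawl list."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<a href="https://example.com/page">a page</a>')
print(extractor.links)   # ['https://example.com/page']
```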
Lastly, you would store a list matrix of the pages that link to this page and a value for those pages, which is in turn based on the number of links they have. You might also build in some kind of penalty/spam feature at this point.
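A very crude sketch of that idea, where a page's value comes from its linkers and each linker counts for more if it has inbound links of its own; this is a simplification I'm assuming for illustration, not a full PageRank-style calculation:

```python
# Assumed link matrix: page -> list of pages that link TO it.
INBOUND_LINKS = {
    "page-a": ["page-b", "page-c"],
    "page-b": ["page-c"],
    "page-c": [],
}

PENALISED = {"spam-site"}   # simple penalty/spam hook

def page_value(page):
    """Value of a page = a small score for each non-penalised linker,
    where a linker counts for more if it has inbound links of its own."""
    value = 0.0
    for linker in INBOUND_LINKS.get(page, []):
        if linker in PENALISED:
            continue
        value += 1.0 + 0.5 * len(INBOUND_LINKS.get(linker, []))
    return value

print(page_value("page-a"))   # 2.5: page-b contributes 1.5, page-c contributes 1.0
```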
You would then have a realtime source list - e.g. news sites - for which you allow a direct URL submit (e.g. look at Ping-o-Matic for WordPress).
Remember, submitted URLs save time - so if you have a trust indicator or protocol established, then trusted URLs would be auto-crawled much faster, because there is less time spent on discovery. Once you crawl a URL, you mark it as crawled; thus if the URL is found during another process, it doesn't need to be recrawled unless its refresh-by date has passed.
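The "don't recrawl until the refresh-by date has passed" check could be as small as this, assuming the crawl-list record carries last_crawled and refresh_by dates (refresh_by is an extra assumed field, not one listed above):

```python
from datetime import datetime

def needs_crawl(entry, now=None):
    """Decide whether a URL found during any process should be fetched.

    `entry` is assumed to carry `last_crawled` and `refresh_by` datetime fields.
    Already-crawled URLs are only recrawled once their refresh-by date has passed.
    """
    now = now or datetime.utcnow()
    if entry.last_crawled is None:
        return True                  # never crawled: fetch it
    return now >= entry.refresh_by   # crawled before: wait for the refresh date
```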
You would then have a scheduler, which would be a view or table of that data showing which pages and domains are prioritised first.
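The scheduler could then be little more than a sorted view over the crawl list; the ordering below (trusted sources first, then rank, then staleness) and the trusted and refresh_by fields are assumptions for illustration:

```python
from datetime import datetime

def crawl_schedule(entries, now=None):
    """A 'view' over the crawl list: URLs that are due, ordered so trusted
    and higher-rank pages come first, with the stalest pages first within that."""
    now = now or datetime.utcnow()
    due = [e for e in entries
           if e.last_crawled is None or now >= e.refresh_by]
    return sorted(
        due,
        key=lambda e: (
            0 if getattr(e, "trusted", False) else 1,   # trusted first
            -e.rank,                                     # then higher rank
            e.last_crawled or datetime.min,              # then stalest
        ),
    )
```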
You would need multiple crawlers if you were trying to index the web or a large piece of it. Some servers would work on a dedicated basis.

