
[3.3.x] Support Forum • Re: Dealing with bot traffic

I am also experiencing this problem.

The requests have these properties:
  • They do not identify themselves as bots (no bot user agent)
  • They do not respect robots.txt
  • They have a sid in the URL
Each request comes from a "new" IP (mainly in Brazil).

My theory is that the problem is caused by a combination of "stupid" web crawlers and impractical forum software.

If you make a request from a new IP with an old (long-gone) sid in the URL, you receive a page in a new session, with the new sid embedded in every link on that page. So you effectively get a page full of "new" links (we know they are not actually new, because the sid should be disregarded).

Now, when the crawler takes that page back to its central register, it extracts some text and a number of "new" links to crawl. These links are put into an already very long queue of links to crawl. Links in this queue are crawled much later than they are entered, and by a completely different client/crawler. So they again appear "unknown" and from a "first-timer", and therefore they again produce pages with new sids in the links.

So this is a never-ending story: new links are invented, crawled later, and then new links are invented again ...
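The feedback loop above can be sketched as a small simulation (a hypothetical illustration, not phpBB code: the URL shapes and the links-per-page figure are made up). Every fetch of a stale-sid URL creates a fresh session, so all links on the returned page carry an unseen sid and re-enter the crawl queue as "new":

```python
def simulate(rounds, links_per_page=3):
    """Model a crawler queue hitting a forum that re-issues sids.

    Each fetched page yields `links_per_page` links, all stamped with a
    brand-new sid, so deduplication by URL never triggers and the queue
    grows by a factor of `links_per_page` every round.
    """
    sid_counter = 0
    seen = set()
    queue = ["/index.php?sid=stale"]
    for _ in range(rounds):
        next_queue = []
        for url in queue:
            if url in seen:
                continue  # dedup would stop the loop -- but sids defeat it
            seen.add(url)
            sid_counter += 1  # server mints a new session for the unknown sid
            next_queue.extend(
                f"/viewtopic.php?t={i}&sid=s{sid_counter}"
                for i in range(links_per_page)
            )
        queue = next_queue
    return len(queue)

print(simulate(4))  # 3 links per page, 4 rounds: the queue holds 3^4 = 81 URLs
```

With 3-100+ links per page, as observed in the post, the queue explodes after only a few rounds, which matches the steady stream of stale-sid requests.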

To deal with this, I have made two changes to the phpBB software:
  • Requests with a sid that cannot be found in the database are redirected to a static HTML file stating that the session has expired (with a link back to the main forum page)
  • Only registered users get links with sids (otherwise you could not switch to administrator mode).
The first change is in session_begin in sessions.php (the last four lines are the actual change):

Code:

// if session id is set
if (!empty($this->session_id))
{
    $sql = 'SELECT u.*, s.*
        FROM ' . SESSIONS_TABLE . ' s, ' . USERS_TABLE . " u
        WHERE s.session_id = '" . $db->sql_escape($this->session_id) . "'
            AND u.user_id = s.session_user_id";
    $result = $db->sql_query($sql);
    $this->data = $db->sql_fetchrow($result);
    $db->sql_freeresult($result);

    // silly bot counter-fit
    if (!isset($this->data['user_id']))
    {
        redirect("/expired.htm");
    }
This change means that, with very little resource load, the "bots" are redirected to a very neutral page that only contains static links.
As a result, the web server now actually has the resources to serve regular users.
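The /expired.htm target of that redirect could be as simple as the following (a hypothetical sketch; the post describes the page as static, stating that the session has expired, with a link back to the main forum page, but does not show the file itself):

```html
<!DOCTYPE html>
<html>
<head><title>Session expired</title></head>
<body>
  <p>Your session has expired.</p>
  <!-- only static links, so bots crawling this page find nothing "new" -->
  <p><a href="/index.php">Back to the forum</a></p>
</body>
</html>
```

Because the file is served straight from disk, each bot hit costs the server no PHP execution and no database query.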

The second change is in append_sid (the root cause of it all) in functions.php.

Code:

// Append session id and parameters (even if they are empty)
// If parameters are empty, the developer can still append his/her parameters without caring about the delimiter
global $user;
if ($session_id && $user->data['is_registered'])
{
    return $url . (($append_url) ? $url_delim . $append_url . $amp_delim : $url_delim) . $params . ((!$session_id) ? '' : $amp_delim . 'sid=' . $session_id) . $anchor;
}
else
{
    return $url . (($append_url) ? $url_delim . $append_url . $amp_delim : $url_delim) . $params . $anchor;
}
This change means that people without a sid in a cookie cannot have a session (in my mind, there should never have been an append_sid function at all).

Now I will wait and see how long it takes before the "sid" requests wear off. If they don't wear off, my theory is wrong. The question is how long I will have to wait for that to happen, because I believe "they" already have a "tricillion" links waiting in their register(s). Each request they made in the past produced 3-100+ new links (i.e. exponential growth with a rather high exponent).

Statistics: Posted by Thomas Linder Puls — Wed Mar 12, 2025 11:53 am


