Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4162020/


How can I improve this PHP/MySQL news feed?

Tags: php, mysql, web-applications, feed

Asked by Josh Smith

Let me start right off the bat by saying that I know this is not the best solution. I know it's kludgy and a hack of a feature. But that's why I'm here!

This question/work builds off some discussion on Quora with Andrew Bosworth, creator of Facebook's news feed.

I'm building a news feed of sorts. It's built solely in PHP and MySQL.

The MySQL

The relational model for the feed is composed of two tables. One table functions as an activity log; in fact, it's named activity_log. The other table is newsfeed. These tables are nearly identical.

The schema for the log is activity_log(uid INT(11), activity ENUM, activity_id INT(11), title TEXT, date TIMESTAMP)

...and the schema for the feed is newsfeed(uid INT(11), poster_uid INT(11), activity ENUM, activity_id INT(11), title TEXT, date TIMESTAMP).

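For concreteness, here is one way those two schemas might look as actual DDL. This is only a sketch: the ENUM values and the (uid, date) index are my guesses, since the post only lists the columns.

CREATE TABLE activity_log (
  uid         INT(11) NOT NULL,      -- user who performed the action
  activity    ENUM('question', 'answer', 'comment') NOT NULL,  -- guessed values; the post doesn't list them
  activity_id INT(11) NOT NULL,      -- id of the row the activity points at
  title       TEXT,
  date        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  KEY idx_uid_date (uid, date)       -- for per-user, recency-ordered reads
);

CREATE TABLE newsfeed (
  uid         INT(11) NOT NULL,      -- user whose feed this row belongs to
  poster_uid  INT(11) NOT NULL,      -- friend who performed the action
  activity    ENUM('question', 'answer', 'comment') NOT NULL,
  activity_id INT(11) NOT NULL,
  title       TEXT,
  date        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  KEY idx_uid_date (uid, date)
);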

Any time a user does something relevant to the news feed, for example asking a question, it will get logged to the activity log immediately.

Generating the news feeds

Then every X minutes (5 minutes at the moment; it will change to 15-30 minutes later), I run a cron job that executes the script below. This script loops through all of the users in the database, finds all the activities of each user's friends, and then writes those activities to the news feed.

At the moment, the SQL that culls the activity (called in ActivityLog::getUsersActivity()) has a LIMIT 100 imposed for performance* reasons. *Not that I know what I'm talking about.

<?php

$user = new User();
$activityLog = new ActivityLog();
$friend = new Friend();
$newsFeed = new NewsFeed();

// Get all the users
$usersArray = $user->getAllUsers();
foreach($usersArray as $userArray) {

  $uid = $userArray['uid'];

  // Get the user's friends
  $friendsJSON = $friend->getFriends($uid);
  $friendsArray = json_decode($friendsJSON, true);

  // Get the activity of each friend
  foreach($friendsArray as $friendArray) {
    $array = $activityLog->getUsersActivity($friendArray['fid2']);

    // Only write if the user has activity
    if(!empty($array)) {

      // Add each piece of activity to the news feed
      foreach($array as $news) {
        $newsFeed->addNews($uid, $friendArray['fid2'], $news['activity'], $news['activity_id'], $news['title'], $news['time']);
      }
    }
  }
}


Displaying the news feeds

In the client code, when fetching the user's news feed, I do something like:

$feedArray = $newsFeed->getUsersFeedWithLimitAndOffset($uid, 25, 0);

foreach($feedArray as $feedItem) {

  // Use a switch to determine the activity type, and display based on type,
  // e.g. "User Name asked A Question" where "A Question" == $feedItem['title'].
  switch($feedItem['activity']) {
    case 'question':  // assumed ENUM value, for illustration only
      echo htmlspecialchars($feedItem['title']);
      break;
    // ... one case per activity type ...
  }
}


Improving the news feed

Now forgive my limited understanding of the best practices for developing a news feed, but I understand the approach I'm using to be a limited version of what's called fan-out on write, limited in the sense that I'm running a cron job as an intermediate step instead of writing to the users' news feeds directly. But this is very different from a pull model, in the sense that the user's news feed is not compiled on load, but rather on a regular basis.

This is a large question that probably deserves a large amount of back and forth, but I think it can serve as a touchstone for many important conversations that new developers like myself need to have. I'm just trying to figure out what I'm doing wrong, how I can improve, or how I should maybe even start from scratch and try a different approach.

One other thing that bugs me about this model is that it works based on recency rather than relevancy. If anyone can suggest how this can be improved to work relevancy in, I would be all ears. I'm using Directed Edge's API for generating recommendations, but it seems that for something like a news feed, recommenders won't work (since nothing's been favorited previously!).

Accepted answer by Dan Spiteri

Really cool question. I'm actually in the middle of implementing something like this myself. So, I'm going to think out loud a bit.

Here are the flaws I see with your current implementation:

  1. You are processing all of the friends for all users, but you will end up processing the same users many times due to the fact that the same groups of people have similar friends.

  2. If one of my friends posts something, it won't show up on my news feed for up to 5 minutes, whereas it should show up immediately, right?

  3. We are reading the entire news feed for a user. Don't we just need to grab the new activities since the last time we crunched the logs?

  4. This doesn't scale that well.

The newsfeed looks like exactly the same data as the activity log, so I would stick with that one activity_log table.

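To make that concrete, the feed could then be pulled straight out of activity_log at read time. Here is a sketch of such a query; it assumes a friends(fid1, fid2) table consistent with the $friendArray['fid2'] usage in the question's script:

SELECT al.uid AS poster_uid, al.activity, al.activity_id, al.title, al.date
FROM activity_log AS al
JOIN friends AS f ON f.fid2 = al.uid  -- activity by any of the viewer's friends
WHERE f.fid1 = ?                      -- the viewer's uid
ORDER BY al.date DESC
LIMIT 25 OFFSET 0;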

If you shard your activity logs across databases, it will allow you to scale more easily. You can shard your users as well if you wish, but even if you have 10 million user records in one table, MySQL should be fine doing reads. So whenever you look up a user, you know which shard to access the user's logs from. If you archive your older logs every so often and only maintain a fresh set of logs, you won't have to shard as much, or maybe even at all. You can manage many millions of records in MySQL if you are tuned even moderately well.

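As a minimal sketch of what uid-based shard routing could look like in PHP (the DSNs, credentials, and modulo scheme below are placeholders, not anything from the answer):

<?php
// Hypothetical shard router: map a uid onto one of N log databases.
function getLogShard($uid, array $shardDsns) {
    $index = $uid % count($shardDsns);   // simple modulo routing
    return new PDO($shardDsns[$index], 'user', 'pass');
}

$shardDsns = array(
    'mysql:host=log-db-0;dbname=logs',
    'mysql:host=log-db-1;dbname=logs',
);
$db = getLogShard(42, $shardDsns);       // all of uid 42's logs live on one shard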

I would leverage memcached for your users table and possibly even the logs themselves. Memcached allows cache entries up to 1MB in size, and if you were smart in organizing your keys you could potentially retrieve all of the most recent logs from the cache.

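Here is a rough sketch of that idea using the PECL Memcached extension, reusing the question's ActivityLog helper as the fallback; the key scheme and the 5-minute TTL are arbitrary choices for illustration:

<?php
$cache = new Memcached();
$cache->addServer('localhost', 11211);

$uid = 42;                            // example viewer
$key = "recent_logs:{$uid}";          // hypothetical per-user cache key
$logs = $cache->get($key);
if ($logs === false) {                // cache miss: fall back to MySQL
    $activityLog = new ActivityLog();
    $logs = $activityLog->getUsersActivity($uid);
    $cache->set($key, $logs, 300);    // serialized array; stays well under the 1MB item limit for modest feeds
}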

This would be more work as far as architecture is concerned, but it will allow you to work in real-time and scale out in the future...especially when you want users to start commenting on each posting. ;)

Did you see this article?

http://bret.appspot.com/entry/how-friendfeed-uses-mysql

Answer by Freeman L

I'm trying to build a Facebook-style news feed on my own. Instead of creating another table to log users' activities, I calculated the 'edge' from the UNION of posts, comments etc.

With a bit of mathematics, I calculate the 'edge' using an exponential decay model, with time elapsed as the independent variable and taking into account the number of comments, likes, etc. each post has to formulate the lambda constant. The edge decreases fast at first but gradually flattens to almost 0 after a few days (though it never reaches 0).

When showing the feed, each edge is multiplied by RAND(). Posts with a higher edge will appear more often.

This way, more popular posts have higher probability to appear in the news feed, for a longer time.

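The answer doesn't spell out the formula, but one plausible reading is edge = e^(-λt), with λ shrinking as engagement grows. A sketch of that interpretation:

<?php
// One possible interpretation of the decay model described above.
// $hoursElapsed: time since posting; $comments, $likes: engagement counts.
function edgeScore($hoursElapsed, $comments, $likes) {
    $lambda = 1.0 / (1 + $comments + $likes);  // more engagement => slower decay (assumed form)
    return exp(-$lambda * $hoursElapsed);      // falls fast at first, flattens toward (never reaching) 0
}

At display time, the RAND() weighting could then be applied with something like ORDER BY edge_score * RAND() DESC, assuming the score is stored in a hypothetical edge_score column.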

Answer by jsh

Instead of running a cron job, consider a post-commit script of some sort. I don't know specifically what the capabilities of PHP and MySQL are in this regard; if I recall correctly, MySQL InnoDB allows more advanced features than the other storage engines, but I don't remember whether there are things like triggers in the latest version.

Anyway, here is a simple variant that doesn't rely on a lot of database magic:

When user X adds content:

1) Do an asynchronous call from your PHP page after the database commit (async, of course, so that the user viewing the page doesn't have to wait for it!).

The call starts an instance of your logical script.

2) The logic script goes only through the list of friends [A, B, C] of the user who committed the new content (as opposed to the list of everyone in the DB!) and appends the action of user X to the feeds of each of these users.

You could just store these feeds as straight-up JSON files and append new data to the end of each. Better of course to keep the feeds in cache with a backup to filesystem or BerkeleyDB or Mongo or whatever you like.

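Here is a bare-bones sketch of step 2's fan-out using JSON-lines files. It reuses the Friend helper from the question, assumes friendships are symmetric, and assumes $posterUid, $activity, $activityId, and $title describe the just-committed action; the file path is invented.

<?php
// Fan one new activity by $posterUid out to each friend's feed file.
$entry = json_encode(array(
    'poster_uid'  => $posterUid,
    'activity'    => $activity,
    'activity_id' => $activityId,
    'title'       => $title,
    'time'        => time(),
)) . "\n";

$friend = new Friend();
$friendsArray = json_decode($friend->getFriends($posterUid), true);
foreach ($friendsArray as $friendArray) {
    // LOCK_EX guards against interleaved writes from concurrent fan-outs.
    file_put_contents("/var/feeds/{$friendArray['fid2']}.jsonl", $entry, FILE_APPEND | LOCK_EX);
}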

This is just a basic idea for feeds based on recency, not relevance. You COULD store the data sequentially in this manner and then do additional parsing on a per-user basis to filter by relevance, but this is a hard problem in any application and probably not one that can be easily addressed by an anonymous web user without detailed knowledge of your requirements ;)

jsh

Answer by Blender

Would you add statistical keywording? I made a (crude) implementation via exploding the body of my document, stripping HTML, removing common words, and counting the most common words. I made that a few years ago just for fun (as with any such project, the source is gone), but it worked for my temporary test-blog/forum setup. Maybe it will work for your news feed...

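A crude version of that pipeline in PHP, roughly as described; the stop-word list here is heavily abbreviated:

<?php
// Crude keyword extraction: strip HTML, drop common words, count the rest.
function topKeywords($html, $limit = 10) {
    $stopWords = array('the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'it');
    $words = str_word_count(strtolower(strip_tags($html)), 1);  // 1 = return an array of words
    $counts = array_count_values(array_diff($words, $stopWords));
    arsort($counts);                                            // most frequent word first
    return array_slice($counts, 0, $limit, true);               // keep word => count pairs
}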

Answer by Akash Sharma

In between, you can use user flags and caching. Let's say you add a new field to each user, last_activity, and update it whenever the user logs any activity. Also keep a flag recording up to what time you have fetched the feeds; call it feed_updated_on.

Now update the function $user->getAllUsers() to return only users whose last_activity time is later than feed_updated_on. This will exclude all the users that don't have any activity logged :). A similar process applies to each user's friends.

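In SQL terms, the updated lookup might be something like this (assuming both columns live on the users table, per the answer's naming):

-- Only users with fresh activity since the last feed build.
SELECT uid
FROM users
WHERE last_activity > feed_updated_on;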

You can also use caching, such as memcache or file-level caching.

Or use some NoSQL DB to store all the feeds as one document.
