当前位置：首页 > news >正文

网上设计网站推软件

news 2025/12/31 2:58:37

网上设计网站,推软件,建站宝盒创业经历,网站建设服务器租用共识算法(Consensus Algorithm) 共识算法即在分布式系统中节点达成共识的算法#xff0c;提高系统在分布式环境下的容错性。依据系统对故障组件的容错能力可分为#xff1a; 崩溃容错协议(Crash Fault Tolerant, CFT) : 无恶意行为#xff0c;如进程崩溃#xff0c;只要…共识算法(Consensus Algorithm) 共识算法即在分布式系统中节点达成共识的算法提高系统在分布式环境下的容错性。依据系统对故障组件的容错能力可分为崩溃容错协议(Crash Fault Tolerant, CFT) : 无恶意行为如进程崩溃只要失败的quorum不过半即可正常提供服务拜占庭容错协议(Byzantine Fault Tolerant, BFT): 有恶意行为只要恶意的quorum不过1/3即可正常提供服务分布式环境下节点之间是没有一个全局时钟和同频时钟故在分布式系统中的通信天然是异步的。而在异步环境下是没有共识算法能够完全保证一致性极端场景下会出现不一致通常是不会出现 In a fully asynchronous message-passing distributed system, in which at least one process may have a crash failure, it has been proven in the famous 1985 FLP impossibility result by Fischer, Lynch and Paterson that a deterministic algorithm for achieving consensus is impossible. 另外网络是否可以未经许可直接加入新节点也是共识算法考虑的一方面未经许可即可加入的网络环境会存在女巫攻击(Sybil attack) 分布式系统中根据共识形成的形式可分为 Voting-based Consensus Algorithms: Practical Byzantine Fault Tolerance、HotStuff、Paxos、 Raft、 ZAB …Proof-based Consensus Algorithms: Proof-of-Work、Proof-of-Stake … Zookeeper模型与架构 A Distributed Coordination Service for Distributed Applications: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Zookeeper(后简称zk)定位于分布式环境下的元数据管理而不是数据库zk中数据存在内存中所以不适合存储大量数据。 zk以形如linux文件系统的树形层级结构管理数据如下图所示每一个节点称为一个znode除了存放用户数据一般为1KB以内还包含变更版本、变更时间、ACL信息等统计数据(Stat Structure) czxid: The zxid of the change that caused this znode to be created.mzxid: The zxid of the change that last modified this znode.pzxid: The zxid of the change that last modified children of this znode.ctime: The time in milliseconds from epoch when this znode was created.mtime: The time in milliseconds from epoch when this znode was last modified.version: The number of changes to the data of this znode.cversion: The number of changes to the children of this znode.aversion: The number of changes to the ACL of this znode.ephemeralOwner: The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.dataLength: The length of the data field of this znode.numChildren: The number of children of this znode. 同时znode节点可设置为以下特性 ephemeral: 和session生命周期相同sequential: 顺序节点比如创建顺序节点/a/b则会生成/a/b0000000001 再次创建/a/b则会生成/a/b0000000002container: 容器节点用于存放其他节点的节点子节点无则它也无了监听container节点需要考虑节点不存在的情况 Zookeeper集群中节点分为三个角色 Leader它负责发起并维护与各 Follower 及 Observer 间的心跳。所有的写操作必须要通过 Leader 完成再由 Leader 将写操作广播给其它服务器。一个 Zookeeper 集群同一时间只会有一个实际工作的 Leader。Follower它会响应 Leader 的心跳。Follower 可直接处理并返回客户端的读请求同时会将写请求转发给 Leader 处理并且负责在 Leader 处理写请求时对请求进行投票。一个 Zookeeper 集群可能同时存在多个 Follower。Observer角色与 Follower 类似但是无投票权。为了保证事务的顺序一致性ZooKeeper 采用了递增的zxid来标识事务zxid(64bit)由epoch(32bit)counter(32bit)组成如果counter溢出会强制重新选主开启新纪元如果epoch满了呢读操作可以在任意一台zk集群节点中进行包括watch操作也是但写操作需要集中转发给Leader节点进行串行化执行保证一致性。 Leader 服务会为每一个 Follower 服务器分配一个单独的队列然后将事务 Proposal 依次放入队列中并根据 FIFO(先进先出) 的策略进行消息发送。Follower 服务在接收到 Proposal 后会将其以事务日志的形式写入本地磁盘中并在写入成功后反馈给 Leader 一个 Ack 响应。当 Leader 接收到超过半数 Follower 的 Ack 响应后就会广播一个 Commit 消息给所有的 Follower 以通知其进行事务提交之后 Leader 自身也会完成对事务的提交。而每一个 Follower 则在接收到 Commit 消息后完成事务的提交。这个流程和二阶段提交很像只是ZAB没有二阶段的回滚操作。 ZAB(Zookeeper Atomic Broadcast) Fast Leader Election(FLE) myid: 即sid有cfg配置文件配置每个节点之间不重复logicalclock: 即electionEpoch选举逻辑时钟 public class FastLeaderElection implements Election {// ...public Vote lookForLeader() throws InterruptedException {self.start_fle Time.currentElapsedTime();try {/** The votes from the current leader election are stored in recvset. In other words, a vote v is in recvset* if v.electionEpoch logicalclock. The current participant uses recvset to deduce on whether a majority* of participants has voted for it.*/// 投票箱, sid : VoteMapLong, Vote recvset new HashMap();/** The votes from previous leader elections, as well as the votes from the current leader election are* stored in outofelection. Note that notifications in a LOOKING state are not stored in outofelection.* Only FOLLOWING or LEADING notifications are stored in outofelection. The current participant could use* outofelection to learn which participant is the leader if it arrives late (i.e., higher logicalclock than* the electionEpoch of the received notifications) in a leader election.*/// 存放上一轮投票和这一轮投票 (FOLLOWING/LEADING)状态的peer的投票// 如果有(FOLLOWING/LEADING)的投票来迟了即已经选出了leader但是当前节点接收ack的notification迟了// 可根据outofelection来判断leader是否被quorum ack了是则跟随该leaderMapLong, Vote outofelection new HashMap();int notTimeout minNotificationInterval;synchronized (this) {// 更新当前electionEpochlogicalclock.incrementAndGet();// 更新投票为自己updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());}// 投当前的leader一票即投自己一票sendNotifications();SyncedLearnerTracker voteSet null;/** Loop in which we exchange notifications until we find a leader*/// 循环直到有leader出现while ((self.getPeerState() ServerState.LOOKING) (!stop)) {/** Remove next notification from queue, times out after 2 times* the termination time*/// 拉取其他peer的投票Notification n recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);/** Sends more notifications if havent received enough.* Otherwise processes new notification.*/if (n null) { // 没收到peer的投票信息if (manager.haveDelivered()) { // 上面的notification都发完了sendNotifications(); // 再发一次} else {manager.connectAll(); // 建立连接}/** 指数退避*/notTimeout Math.min(notTimeout 1, maxNotificationInterval);/** When a leader failure happens on a master, the backup will be supposed to receive the honour from* Oracle and become a leader, but the honour is likely to be delay. We do a re-check once timeout happens** The leader election algorithm does not provide the ability of electing a leader from a single instance* which is in a configuration of 2 instances.* */if (self.getQuorumVerifier() instanceof QuorumOracleMaj self.getQuorumVerifier().revalidateVoteset(voteSet, notTimeout ! minNotificationInterval)) {setPeerState(proposedLeader, voteSet);Vote endVote new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);leaveInstance(endVote);return endVote;}} else if (validVoter(n.sid) validVoter(n.leader)) { // 收到其他peer的投票/** Only proceed if the vote comes from a replica in the current or next* voting view for a replica in the current or next voting view.*/// 判断发送投票的peer当前状态switch (n.state) {case LOOKING: // 选主中if (getInitLastLoggedZxid() -1) {break;}if (n.zxid -1) {break;}if (n.electionEpoch logicalclock.get()) { // peer的electionEpoch大于当前节点的electionEpochlogicalclock.set(n.electionEpoch); // 直接快进到大的epoch即peer的electionEpochrecvset.clear(); // 并清空投票箱// 投票pk// peer投票和当前节点进行投票pk: peerEpoch - zxid - myid(sid)if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {// peer赢了把当前节点的leader zxid peerEpoch设置为peer的updateProposal(n.leader, n.zxid, n.peerEpoch);} else {// 当前节点赢了恢复自己的配置updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());}// 将上面更新后的自己的投票信息广播出去sendNotifications();} else if (n.electionEpoch logicalclock.get()) {// 如果peer的electionEpoch比当前节点的electionEpoch小则直接忽略break;} else// electionEpoch相等进行投票pk: peerEpoch - zxid - myid(sid)if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {// pk是peer赢了跟随peer投票并广播出去updateProposal(n.leader, n.zxid, n.peerEpoch);sendNotifications();}// 将peer的投票放入投票箱recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));voteSet getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));if (voteSet.hasAllQuorums()) {// Verify if there is any change in the proposed leaderwhile ((n recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) ! null) {if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {recvqueue.put(n);break;}}/** This predicate is true once we dont read any new* relevant message from the reception queue*/if (n null) {setPeerState(proposedLeader, voteSet);Vote endVote new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);leaveInstance(endVote);return endVote;}}break;case OBSERVING: // peer是观察者不参与投票直接返回LOG.debug(Notification from observer: {}, n.sid);break;/** In ZOOKEEPER-3922, we separate the behaviors of FOLLOWING and LEADING.* To avoid the duplication of codes, we create a method called followingBehavior which was used to* shared by FOLLOWING and LEADING. This method returns a Vote. When the returned Vote is null, it follows* the original idea to break switch statement; otherwise, a valid returned Vote indicates, a leader* is generated.** The reason why we need to separate these behaviors is to make the algorithm runnable for 2-node* setting. An extra condition for generating leader is needed. Due to the majority rule, only when* there is a majority in the voteset, a leader will be generated. However, in a configuration of 2 nodes,* the number to achieve the majority remains 2, which means a recovered node cannot generate a leader which is* the existed leader. Therefore, we need the Oracle to kick in this situation. In a two-node configuration, the Oracle* only grants the permission to maintain the progress to one node. The oracle either grants the permission to the* remained node and makes it a new leader when there is a faulty machine, which is the case to maintain the progress.* Otherwise, the oracle does not grant the permission to the remained node, which further causes a service down.** In the former case, when a failed server recovers and participate in the leader election, it would not locate a* new leader because there does not exist a majority in the voteset. It fails on the containAllQuorum() infinitely due to* two facts. First one is the fact that it does do not have a majority in the voteset. The other fact is the fact that* the oracle would not give the permission since the oracle already gave the permission to the existed leader, the healthy machine.* Logically, when the oracle replies with negative, it implies the existed leader which is LEADING notification comes from is a valid leader.* To threat this negative replies as a permission to generate the leader is the purpose to separate these two behaviors.*** */case FOLLOWING: // peer正在following leaderVote resultFN receivedFollowingNotification(recvset, outofelection, voteSet, n);if (resultFN null) {break;} else {// 成功选主返回return resultFN;}case LEADING: // peer 是 leader/** In leadingBehavior(), it performs followingBehvior() first. When followingBehavior() returns* a null pointer, ask Oracle whether to follow this leader.* */Vote resultLN receivedLeadingNotification(recvset, outofelection, voteSet, n);if (resultLN null) {break;} else {return resultLN;}default:break;}} else {// 推举的leader或投票的peer不合法直接忽略// ...}}return null;} finally {// ...}}private Vote receivedFollowingNotification(MapLong, Vote recvset, MapLong, Vote outofelection,SyncedLearnerTracker voteSet, Notification n) {/** Consider all notifications from the same epoch* together.*/if (n.electionEpoch logicalclock.get()) { // 同一轮投票// 若对方选票中的electionEpoch等于当前的logicalclock// 说明选举结果已经出来了将它们放入recvset。recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));voteSet getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));// 判断quorum是否满足选主条件if (voteSet.hasAllQuorums() // 判断推举的leader已经被quorum ack了避免leader挂了导致集群一直在选举中checkLeader(recvset, n.leader, n.electionEpoch)) {// leader是自己将自己设置为LEADING否则是FOLLOWING(或OBSERVING)setPeerState(n.leader, voteSet);Vote endVote new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);// 清空消费投票的queueleaveInstance(endVote);return endVote;}}// 到这里是// peer的electionEpoch和logicalclock不一致// 因为peer是FOLLOWING所以在它的electionEpoch里已经选主成功了/** 在跟随peer选出的leader前校验这个leader合法不合法* Before joining an established ensemble, verify that* a majority are following the same leader.** Note that the outofelection map also stores votes from the current leader election.* See ZOOKEEPER-1732 for more information.*/outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));voteSet getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));if (voteSet.hasAllQuorums() checkLeader(outofelection, n.leader, n.electionEpoch)) {synchronized (this) {logicalclock.set(n.electionEpoch);setPeerState(n.leader, voteSet);}Vote endVote new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);leaveInstance(endVote);return endVote;}return null;}private Vote receivedLeadingNotification(MapLong, Vote recvset, MapLong, Vote outofelection, SyncedLearnerTracker voteSet, Notification n) {/** 在两个节点的集群中leaderfollower如果follower挂了recovery之后因为投票无法过半follower会首先投自己一票会找不到leader* In a two-node configuration, a recovery nodes cannot locate a leader because of the lack of the majority in the voteset.* Therefore, it is the time for Oracle to take place as a tight breaker.* */Vote result receivedFollowingNotification(recvset, outofelection, voteSet, n);if (result null) {/** Ask Oracle to see if it is okay to follow this leader.* We dont need the CheckLeader() because itself cannot be a leader candidate* */// needOracle当集群无follower 集群voter2 时if (self.getQuorumVerifier().getNeedOracle() // 且cfg配置中keyoraclePath的文件默认没有askOracle默认false中的值 ! 1 时会走到if里// 这里可参考官网 https://zookeeper.apache.org/doc/current/zookeeperOracleQuorums.html !self.getQuorumVerifier().askOracle()) {LOG.info(Oracle indicates to follow);setPeerState(n.leader, voteSet);Vote endVote new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);leaveInstance(endVote);return endVote;} else {LOG.info(Oracle indicates not to follow);return null;}} else {return result;}}// ... }提案处理所有的提案均通过leader来提follower接受的提案会转发到leader。 zk采用责任链模式对请求进行处理不同的角色leader/follower/observer对应不同的责任链以下是leader的各个Processor的作用 LeaderRequestProcessor: Responsible for performing local session upgrade. Only request submitted directly to the leader should go through this processor.PrepRequestProcessor: It sets up any transactions associated with requests that change the state of the systemProposalRequestProcessor: 调用Leader#propose将proposal加入发送给follower的queue由LeaderHandler异步发送给follower和处理follower的ackSyncRequestProcessor: 将request写磁盘AckRequestProcessor: ack leader自己的requestCommitProcessor: 提交提案。CommitProcessor本身是一个线程上游调用先把request加入队列然后异步消费处理ToBeAppliedRequestProcessor: simply maintains the toBeApplied listFinalRequestProcessor: This Request processor actually applies any transaction associated with a request and services any queries 接收follower的ack并提交走下面的调用 org.apache.zookeeper.server.quorum.Leader.LearnerCnxAcceptor#run org.apache.zookeeper.server.quorum.Leader.LearnerCnxAcceptor.LearnerCnxAcceptorHandler#run org.apache.zookeeper.server.quorum.Leader.LearnerCnxAcceptor.LearnerCnxAcceptorHandler#acceptConnections org.apache.zookeeper.server.quorum.LearnerHandler#run org.apache.jute.BinaryInputArchive#readRecord org.apache.zookeeper.server.quorum.LearnerMaster#processAck 这里如果满足quorum则调用CommitProcessor org.apache.zookeeper.server.quorum.Leader#tryToCommit org.apache.zookeeper.server.quorum.Leader#commit (leader 发送commit消息给follower此时leader还不一定提交了因为异步处理的) org.apache.zookeeper.server.quorum.Leader#inform (leader 发送inform消息给observer此时leader还不一定提交了因为异步处理的)判断是否满足quorum的方法为SyncedLearnerTracker#hasAllQuorums public class SyncedLearnerTracker { // Proposal的父类即每个提案一个Trackerpublic static class QuorumVerifierAcksetPair {private final QuorumVerifier qv; // 每一个zxid就是一个QuorumVerifierprivate final HashSetLong ackset; // ack的sid set...}protected ArrayListQuorumVerifierAcksetPair qvAcksetPairs new ArrayList();...public boolean hasAllQuorums() {for (QuorumVerifierAcksetPair qvAckset : qvAcksetPairs) {if (!qvAckset.getQuorumVerifier().containsQuorum(qvAckset.getAckset())) {return false;}}return true;}... }最终调用QuorumMaj#containsQuorum public class QuorumMaj implements QuorumVerifier {...protected int half votingMembers.size() / 2;/*** Verifies if a set is a majority. Assumes that ackSet contains acks only* from votingMembers*/public boolean containsQuorum(SetLong ackSet) {return (ackSet.size() half);}...参考 Consensus Algorithms in Distributed SystemsFLP Impossibility ResultzookeeperInternals详解分布式协调服务 ZooKeeper再也不怕面试问这个了Lecture 8: ZookeeperZooKeeper: Wait-free coordination for Internet-scale systemsdiff_acceptepoch_currentepochZookeeper(FastLeaderElection选主流程详解)zookeeper-framwork-design-message-processor-leader

查看全文

http://www.dnsts.com.cn/news/120419.html