編程開發中如何實現布隆過濾器

發布時間：2021-10-21 11:30:08 來源：億速云閱讀：121 作者：小新欄目：編程語言

這篇文章給大家分享的是有關編程開發中如何實現布隆過濾器的內容。小編覺得挺實用的，因此分享給大家做個參考，一起跟隨小編過來看看吧。

給40億個不重復的無符號整數，沒排過序。給一個無符號整數，如何快速判斷一個數是否在這40億個數中。

拿到這個題目，我們首先想到的是遍歷這40億的數字，然后一個一個找。顯然是行不通的。因為這40億個數放到內存中，大約需要16G內存。

如果我們把它轉換成位圖處理，那么就好處理多了。我們可以把一個×××再細分一下，一個int類型就可以編程32個位，每一位用0，1表示當前這個位置上是否存有值，同樣是利用哈希存儲的方法。只是這樣存儲的話就可以減少很多的空間了，例如上題使用的內存就可以從16G降到500M的內存。空間的使用率減少了不止一點。

大家可以根據我這個方法實現上面的代碼，今天我主要介紹的是布隆過濾器，因為布隆過濾器也要用到位圖（bitmap），位圖實現思想：

1.把一個int類型變成32個bits。

2.把它們全部初始化為0。

3.如果當前位上有值，把0置成1。

下面我給出位圖的實現代碼：

BitMap.h中

#include <vector>

class BitMap
{
public:
	BitMap(size_t size = 0)
		:_size(0)
	{
		//用resize開辟空間，_a中的值會被初始化為0

		//加1為了讓值全部能放到數組中，假如有36個數，36/32=1余4，而
		//多開的那個空間就保證了這4個數能放下

		//_a.resize(size/32+1);和下面的代碼一個性質，只不過用移位運算符比除法的效率高
		_a.resize((size >> 5) + 1);
	}
	//插入
	void Set(size_t x)
	{
		size_t index = x >> 5;
		size_t num = x % 32;
		//當前位置如果等于0，沒有值，可以插入
		if (!(_a[index] & (1 << num)))
		{
			_a[index] |= (1 << num);//當前位置置1
			++_size;
		}
	}
	//刪除
	void Reset(size_t x)
	{
		size_t index = x >> 5;
		size_t num = x % 32;
		//當前位置為1，有值，可以刪除
		if (_a[index] & (1 << num))
		{
			_a[index] &= ~(1 << num);//當前位置置0
			--_size;
		}
	}
	//判斷是否有值
	bool BitMapTest(size_t x)
	{
		size_t index = x >> 5;
		size_t num = x % 32;
		if (_a[index] & (1 << num))
		{
			return true;
		}
		return false;
	}
	void Resize(size_t size)
	{
		_a.resize(size);
	}
protected:
	vector<size_t> _a;
	size_t _size;
};

如果想判斷一個元素是不是在一個集合里，一般想到的是將所有元素保存起來，然后通過比較確定。鏈表，樹等等數據結構都是這種思路. 但是隨著集合中元素的增加，我們需要的存儲空間越來越大，檢索速度也越來越慢。不過世界上還有一種叫作散列表（又叫哈希表，Hash table）的數據結構。它可以通過一個Hash函數將一個元素映射成一個位陣列（Bit Array）中的一個點。這樣一來，我們只要看看這個點是不是 1 就知道可以集合中有沒有它了。這就是布隆過濾器的基本思想。

Hash面臨的問題就是沖突。假設 Hash 函數是良好的，如果我們的位陣列長度為 m 個點，那么如果我們想將沖突率降低到例如 1%, 這個散列表就只能容納 m/100 個元素。顯然這就不叫空間有效了（Space-efficient）。解決方法也簡單，就是使用多個 Hash，如果它們有一個說元素不在集合中，那肯定就不在。如果它們都說在，雖然也有一定可能性它們在說謊，不過直覺上判斷這種事情的概率是比較低的。

優點

相比于其它的數據結構，布隆過濾器在空間和時間方面都有巨大的優勢。布隆過濾器存儲空間和插入/查詢時間都是常數。另外, Hash 函數相互之間沒有關系，方便由硬件并行實現。布隆過濾器不需要存儲元素本身，在某些對保密要求非常嚴格的場合有優勢。

布隆過濾器可以表示全集，其它任何數據結構都不能；

k 和 m 相同，使用同一組 Hash 函數的兩個布隆過濾器的交并差運算可以使用位操作進行。

缺點

但是布隆過濾器的缺點和優點一樣明顯。誤算率（False Positive）是其中之一。隨著存入的元素數量增加，誤算率隨之增加。但是如果元素數量太少，則使用散列表足矣。

另外，一般情況下不能從布隆過濾器中刪除元素. 我們很容易想到把位列陣變成整數數組，每插入一個元素相應的計數器加1, 這樣刪除元素時將計數器減掉就可以了。然而要保證安全的刪除元素并非如此簡單。首先我們必須保證刪除的元素的確在布隆過濾器里面. 這一點單憑這個過濾器是無法保證的。另外計數器回繞也會造成問題。

下面給出布隆過濾器的實現代碼：

仿函數實現，我用了5個仿函數，它們的實現我是看http://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html這里面實現的，他的實現比較好，能更好的避免哈希沖突。

commom.h中

#pragma once

#include <string>
size_t NewSize(size_t size)
{
	// 使用素數表對齊做哈希表的容量，降低哈希沖突
	const int _PrimeSize = 28;
	static const unsigned long _PrimeList[_PrimeSize] =
	{
		53ul, 97ul, 193ul, 389ul, 769ul,
		1543ul, 3079ul, 6151ul, 12289ul, 24593ul,
		49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
		1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,
		50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul,
		1610612741ul, 3221225473ul, 4294967291ul
	};
	for (int i = 0; i < _PrimeSize; i++)
	{
		if (_PrimeList[i]>size)
		{
			return _PrimeList[i];//按照素數表來設置容量大小
		}
	}
	//當需要的容量超過素數表的最大容量，我們就按照最大的來擴容
	return _PrimeList[_PrimeSize - 1];
}
template <class T>
struct __HashFunc1
{
	size_t BKDRHash(const T *str)
	{
		register size_t hash = 0;
		while (size_t ch = (size_t)*str++)
		{
			hash = hash * 131 + ch;   
		}
		return hash;
	}
	size_t operator()(const T& key)
	{
		return BKDRHash(key.c_str());
	}
};
template <class T>
struct __HashFunc2
{
	size_t SDBMHash(const T *str)
	{
		register size_t hash = 0;
		while (size_t ch = (size_t)*str++)
		{
			hash = 65599 * hash + ch;
			//hash = (size_t)ch + (hash << 6) + (hash << 16) - hash;  
		}
		return hash;
	}
	size_t operator()(const T& key)
	{
		return SDBMHash(key.c_str());
	}
};
template <class T>
struct __HashFunc3
{
	size_t RSHash(const T *str)
	{
		register size_t hash = 0;
		size_t magic = 63689;
		while (size_t ch = (size_t)*str++)
		{
			hash = hash * magic + ch;
			magic *= 378551;
		}
		return hash;
	}
	size_t operator()(const T& key)
	{
		return RSHash(key.c_str());
	}
};

template <class T>
struct __HashFunc4
{
	size_t APHash(const T *str)
	{
		register size_t hash = 0;
		size_t ch;
		for (long i = 0; ch = (size_t)*str++; i++)
		{
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
			}
			else
			{
				hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
			}
		}
		return hash;
	}
	size_t operator()(const T& key)
	{
		return APHash(key.c_str());
	}
};
template <class T>
struct __HashFunc5
{
	size_t JSHash(const T *str)
	{
		if (!*str)        // 這是由本人添加，以保證空字符串返回哈希值0  
			return 0;
		register size_t hash = 1315423911;
		while (size_t ch = (size_t)*str++)
		{
			hash ^= ((hash << 5) + ch + (hash >> 2));
		}
		return hash;
	}
	size_t operator()(const T& key)
	{
		return JSHash(key.c_str());
	}
};

我使用了5個Hash函數，可以降低哈希沖突。大家視情況而定，自己設置哈希函數的個數。

BoolmFilter.h中

#pragma once

#include<string>
#include "BitMap.h"
#include "common.h"
template<class T = string,
	class HashFunc1=__HashFunc1<T>,
	class HashFunc2 = __HashFunc2<T>,
	class HashFunc3 = __HashFunc3<T>,
	class HashFunc4 = __HashFunc4<T>,
	class HashFunc5 = __HashFunc5<T>
>
class BoolmFilter
{
public:
	BoolmFilter(size_t capatity = 0)
	{
		_capatity = NewSize(capatity);
		_bm.Resize(_capatity);
	}
	void Set(const T& key)
	{
		size_t index1 = HashFunc1()(key);
		size_t index2 = HashFunc2()(key);
		size_t index3 = HashFunc3()(key);
		size_t index4 = HashFunc4()(key);
		size_t index5 = HashFunc5()(key);

		_bm.Set(index1%_capatity);
		_bm.Set(index2%_capatity);
		_bm.Set(index3%_capatity);
		_bm.Set(index4%_capatity);
		_bm.Set(index5%_capatity);
	}
	bool Test(const T& key)
	{
		size_t index1 = HashFunc1()(key);
		if (!_bm.BitMapTest(index1%_capatity))
		{
			return false;
		}
		size_t index2 = HashFunc2()(key);
		if (!_bm.BitMapTest(index2%_capatity))
		{
			return false;
		}
		size_t index3 = HashFunc3()(key);
		if (!_bm.BitMapTest(index3%_capatity))
		{
			return false;
		}
		size_t index4 = HashFunc4()(key);
		if (!_bm.BitMapTest(index4%_capatity))
		{
			return false;
		}
		size_t index5 = HashFunc5()(key);
		if (!_bm.BitMapTest(index5%_capatity))
		{
			return false;
		}
		return true;
	}
protected:
	BitMap _bm;
	size_t _capatity;
};

test.cpp中

#include <iostream>
using namespace std;
#include "BoolmFilter.h"

void BoolTest()
{
	BoolmFilter<>bf(100);
	bf.Set("she is girl");
	bf.Set("我是好人");
	bf.Set("chive/2012/05/31/2528153.html");
	cout << bf.Test("she is girl") << endl;
	cout << bf.Test("我是好人") << endl;
	cout << bf.Test("chive/2012/05/31/2528153.html") << endl;
}
int main()
{
	BoolTest();
	system("pause");
	return 0;
}

感謝各位的閱讀！關于“編程開發中如何實現布隆過濾器”這篇文章就分享到這里了，希望以上內容可以對大家有一定的幫助，讓大家可以學到更多知識，如果覺得文章不錯，可以把它分享出去讓更多的人看到吧！

向AI問一下細節

91超碰碰碰碰久久久久久综合_超碰av人澡人澡人澡人澡人掠_国产黄大片在线观看画质优化_txt小说免费全本

編程開發中如何實現布隆過濾器

優點

缺點

猜你喜歡

91超碰碰碰碰久久久久久综合_超碰av人澡人澡人澡人澡人掠_国产黄大片在线观看画质优化_txt小说免费全本

編程開發中如何實現布隆過濾器

優點

缺點

猜你喜歡

最新資訊

相關推薦

相關標簽