Median From Data Stream
Application of Heap Data Structure
Algorithms and Data Structures: TheAlgorist.com
System Design: DistributedComputing.dev
Low Level Design: LowLevelDesign.io
Frontend Engineering: FrontendEngineering.io
Median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value. So the median is the mean of the two middle value.
[2,3,4], the median is 3
[2,3], the median is (2 + 3) / 2 = 2.5
Design a data structure that supports the following two operations:
void addNum(int num) - Add an integer number from the data stream to the data structure.
double findMedian() - Return the median of all elements so far.
findMedian() -> 1.5
findMedian() -> 2
I expect by now you have already gone through below chapters and have gained really good understanding of heap data structures:
- Heap Fundamentals
- Heap: GetMax() / GetMin(), DeleteMax() / DeleteMin() and Insert()
- Heap Sort
- Priority Queue
- Kth Largest Element
- Merge K Sorted Lists
At any point of time, if we sort the data stream then on the left side of the MEDIAN we would have the lower half of the data stream where all the elements are less than or equal to the median, and on the right side of the MEDIAN we would have the higher half of the data stream where all the elements are greater than or equal to the median.
So if we keep two DYNAMIC data structures: one for keeping the HIGHER HALF (consisting of largest elements) of the data stream realtime, and another one for keeping the LOWER HALF (consisting of lowest elements) of the data stream realtime, then:
if the data stream has
ODDnumber of items, then whichever HALF has 1 more element than the other HALF, do below:
- if LOWER HALF has 1 more element than the higher half, then the the item with MAXIMUM value in the LOWER HALF is the median.
- if HIGHER HALF has 1 more element than the lower half, then the the item with MINIMUM value in the HIGHER HALF is the median.
if the data stream has
EVENnumber of elements, then mean of the maximum element of lower half and minimum element of higher half is the median.
Also notice that the elements in the lower half and higher half could change over time as we get more and more items in the data stream.
What data structure fits perfectly for maintaining the lower half and higher half ? Heap. Min Heap GetMin() and Max heap GetMax() are O(1) operations, and insertion of a new item and min / max deletion are O(logN) operations, where N = total number of items in heap.
We would use min heap for higher half, and max heap for lower half. Since, when data stream has odd number of elements the half that contains the median has one more element than the other half, we would impose a condition that
size of higher half (i.e, min heap) >= size of lower half (i.e, max heap)
So that whenever we see data stream has odd number (i.e, size of min heap > size of max heap) of elements, we would know the median is in the higher half. The median would be the min element of the min heap. This would be O(1) operation.
Whenever, size of min heap = size of max heap, i.e, the data stream has even elements, median would be the mean of the min of min heap and max of max heap. This O(1) operation.
Whenever we get a new item from data stream :
- if both min heap and max heap are empty we add the item to the min heap. Remember, size of min heap >= size of max heap.
if size of max heap equals size of min heap:
the new element should go to the min heap since size of max heap cannot be greater than the size of min heap. But wait, min heap is the higher half of the data stream. What if the item to be added does not logically belong to the higher half ? For example, lower half = [2, 4, 6], higher half = [70, 80, 88], item to add = 3.
To handle this situation, we first compare the item to be added with the minimum of the higher half (min heap). If the value of item to be added is greater than or equal to the value of the min element of min heap, then the item to be added does indeed belong to the higher half and is added to the min heap BUT if it is not the case, the max element of lower half needs to be removed from max heap and added to the higher half (min heap), and the new item to be added is added to lower half (max heap). This also maintains the constraint we have imposed: size of higher half (min heap) >= size of lower half (max heap).
if size of higher half (min heap) > size of lower half (max heap):
check if the new item to be added should go to higher half or lower half. If the new item is supposed to go to lower half add the item to lower half (max heap).
But if the new item is supposed to go to higher half, then remove the minimum of the higher half from min heap, and add it to the lower half. Add the new item to the higher half. After this operation size of lower half would equal the size of higher half.
The below code efficiently implements the above elegant algorithm:
This is a Premium content.
Please subscribe to Algorithms course to access the code.
Complexity Analysis:Time Complexity:
- addNum: O(logN), where N = number of elements gotten from data stream so far.
getMedian: O(1). Peek, i.e, getMin/Max is O(1) operation in heap. Min/max happens to
be the root of the complete binary tree and first element in the array representing the heap.
For more details, please read up Heap Fundamentals and Heap: GetMax() / GetMin(), DeleteMax() / DeleteMin() and Insert().
Space Complexity: O(N), to storing lower half in max heap and higher half in min heap. N = total number of elements gotten from data stream so far.
Senior SDE | Chief Architect
Microsoft | University of Florida
If you have any feedback, please use this form: https://thealgorists.com/Feedback.