Walletfox.com

Updating word count with QMutex


This article demonstrates the use of a mutex on an example that counts the total number of occurrences of a specific word (more generally of a specific regular expression) throughout multiple text files. The problem is demonstrated on 5 Shakespeare's sonnets in which we count the number of occurrences of the word "thy" (with case insensitive 't') throughout the text files.

Introduction - Synchronization

Short repetition of synchronization: Imagine a situation in which two threads try to access the same resource - e.g. an account withdrawal from the same account by two different people at the same time. The withdrawal operation has two steps - checking the balance and making the withdrawal itself. If two threads operate on the same balance, we have to make sure that checking the balance and withdrawal are not split apart within one thread of execution, otherwise, we might corrupt the variable representing the balance and cause an account overdraw. In other words, what we require is indivisibility or atomicity of the withdrawal operation (checking the balance and the withdrawal itself have to be executed as one operation). A piece of code that needs to be executed atomically is called a critical section (Sierra & Bates, Sun Certified Programmer for Java 6 Exam 310-065, McGraw-Hill 2008).

A situation in which multiple threads of execution access the same resource and as a result might produce corrupted data is known as a race condition. To protect a shared resource from a race condition, we use a synchronization tool called mutex (short for mutual exclusion). With a mutex, threads agree that only one thread at a time can access the data the mutex protects. The protection is accomplished with an object lock. Since there is only one lock per object, if one thread acquires the lock, no other thread can acquire it before the first thread releases the lock.

Even a single statement might need to be protected by a mutex (as in the example in this article). This is because the compiler might substitute this single statement with a number of machine instructions, i.e. in multithreaded environments at an instruction level, a single statement is no longer atomic (Nichols, Buttlar & Farrell, Pthreads Programming, O’Reilly 1996).

Generally, any time more than one thread is accessing changeable data, you need to synchronize to protect that data. This is to make sure that several threads are not changing the data at the same time or that one thread is not modifying the data that another thread might be reading. What you do not need to worry about are local variables as each thread gets its own copy of local variables.

This article presents two variants of the problem:

Variant 1 - QMutex is used to protect the update of a variable that stores the total number of occurrences of a particular regular expression. QRunnable and QThreadPool are used to execute the operation in multiple threads.

Variant 2 - without a mutex using QtConcurrent::mappedReduced(). This solution makes use of the fact that the reduce function is called by only one thread at the same time.

Variant 1 - with QMutex

In the first variant, we are going to use QRunnable and a QThreadPool object to run the search in multiple threads. Let's look at the variables that we need to introduce to achieve the desired functionality. Firstly, we need a variable that will hold the total number of occurrences of a particular regular expression, i.e. 'totalRegexpCount'. Because multiple threads will be accessing the 'totalRegexpCount', we have to use a mutex 'regexpCountMutex' to protect the update of the 'totaRegexpCount' variable.

int totalRegexpCount;
QMutex regexpCountMutex;

How to subclass QRunnable

A single instance of QRunnable will be responsible for a single text file. Both the fileName and the regular expression that we are trying to match are stored as private variables. The run() method is responsible for counting the occurrences of a particular regular expression in a single text file. The variable 'partialRegexpCount' represents the number of occurrences of a regular expression in a single text file and is found with the help of QString::count(QRegExp). Look at the 'if' block. Whenever we find out that a file contains at least one word that matches the regular expression ( 'thy' or 'Thy' in our case), we update the value of 'totalRegexpCount'. The update of the 'totalRegexpCount' is protected by a mutex, despite the fact that the operation is represented by a single statement. As mentioned in the introduction, this is because a single statement is no longer atomic at an instruction level. We have to lock the mutex before the update and release the lock after the update.

class TextCountRunnable : public QRunnable
{
public:
    TextCountRunnable(const QString& fileName, const QRegExp& regexpToCount):
        QRunnable(), m_fileName(fileName), m_regexpToCount(regexpToCount){
    }
    void run(){
        QFile file(m_fileName);
        file.open(QIODevice::ReadOnly);
        QTextStream in(&file);
        QString wholeText = in.readAll();
        int partialRegexpCount = wholeText.count(m_regexpToCount);
        if (partialRegexpCount) {
            regexpCountMutex.lock();
            totalRegexpCount += partialRegexpCount;
            regexpCountMutex.unlock();
        }
    }

private:
    QString m_fileName;
    QRegExp m_regexpToCount;
};

There is a convenience class called QMutexLocker that simplifies locking and unlocking mutexes. QMutexLocker should be created within a function in which a QMutex needs to be locked. The mutex will always be unlocked when the QMutexLocker object is destroyed, i.e. when the function returns.

In this particular case, to use QMutexLocker we would replace the following section:

regexpCountMutex.lock();
totalRegexpCount += partialRegexpCount;
regexpCountMutex.unlock();

with the following code:

QMutexLocker locker(&regexpCountMutex);
totalRegexpCount += partialRegexpCount;

How to run the concurrent code

Now let's look at the main.cpp. We firstly identify the files in our folder that are Shakespeare sonnets. We also create a regular expression to represent the word that we are trying to find, i.e.the word "thy" (in "\\b[Tt]hy" \b represents the word boundary, [Tt] means that the letter 't' is case insensitive). After that, we create an instance of QRunnable for each document. We wait for all the threads to exit and print the result.

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);
    QStringList filePaths = findFiles(QDir::currentPath(),
                                      "shakespeareSonnet");

    QRegExp regexpToCount("\\b[Tt]hy");

    foreach(QString filePath, filePaths)
        QThreadPool::globalInstance()->start(
                    new TextCountRunnable(filePath, regexpToCount));

    QThreadPool::globalInstance()->waitForDone();
    qDebug() << "totalRegexpCount " << totalRegexpCount;

    return 0;
}

Variant 2 - without mutex with QtConcurrent::mappedReduced()

The second variant bypasses the mutex by using QtConcurrent::mappedReduced(), specifically it makes use of the fact that the reduce() function is called by only one thread at the same time. For introduction into mappedReduced() look at this histogram example (in this example we are using partial reduction in the map function).

How to use the MapReduce in the main()

Let's firstly look at the main() below. We find all the text files that start with 'shakespeareSonnet' just as we did for Variant 1. After that we call mappedReduced(). Qt's mappedReduced requires three parameters - a sequence to which to apply the map function, a map function and a reduce function. The sequence is represented by all the sonnets that were found in the specified folder. The map function, in this case the map 'function object' will be automatically applied to all the files in the sequence. Notice, that we pass the regular expression as a parameter to the map function object. The result of our calculation is represented by an instance of QFuture<int>. After we started mappedReduced(), we wait until the asynchronous computation finishes and print out the result.

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);
    QStringList filePaths = findFiles(QDir::currentPath(),
                                      "shakespeareSonnet");

    QFuture<int> f1 = QtConcurrent::mappedReduced(
                                filePaths, MapFile(QRegExp("\\b[Tt]hy")), reduce);
    f1.waitForFinished();
    qDebug() << "totalRegexpCount " << f1.result();

    return 0;
}

How to write the map and reduce functions

Let's now look at the specifics of the MapReduce implementation. Rather than a map function we use a map function object. A function object or a functor is simply an object that can be called as if it was a function. To achieve this, it defines the operator (). We use the function object to pass a parameter, namely the regular expression that we are trying to match. Notice, that the body of the overloaded operator() is almost identical to the QRunnable::run() function from the Variant 1, with the difference that there is no update of the 'totalRegexpCount' (this will be the responsibility of the reduce function). Last, for our functor to work, we also have to define the result_type of the function call operator (which is an 'int' in this case).

struct MapFile {
    const QRegExp m_regexpToCount;
    MapFile(const QRegExp& regexpToCount): m_regexpToCount(regexpToCount) {}
    typedef int result_type;

    int operator()(const QString& fileName)
    {
        QFile file(fileName);
        file.open(QIODevice::ReadOnly);
        QTextStream in(&file);
        QString wholeText = in.readAll();
        int partialRegexpCount = wholeText.count(m_regexpToCount);
        return partialRegexpCount;
    }
};

The reduce function sums up the partial results from all the threads. Notice that there is no mutex within the reduce function - the reduce function is not called concurrently, only one thread will call the reduce function at the same time (in contrast, the map function is called concurrently).

void reduce(int &totalRegexpCount,
                              int partialRegexpCount) {
    totalRegexpCount += partialRegexpCount;
}

That's it! You can try out the example. You should get the following result:
totalRegexpCount 13

Tagged: Qt