Article

A Contextual Model for Visual Information Processing

School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney 2007, Australia
*
Authors to whom correspondence should be addressed.
Computers 2024, 13(6), 155; https://doi.org/10.3390/computers13060155
Submission received: 28 March 2024 / Revised: 18 May 2024 / Accepted: 18 June 2024 / Published: 20 June 2024
(This article belongs to the Special Issue Feature Papers in Computers 2024)

Abstract

Despite significant achievements in the artificial narrow intelligence sphere, the mechanisms of human-like (general) intelligence remain undeveloped. One theory states that the human brain extracts the meaning of information rather than recognizing the features of a phenomenon. Extracting the meaning means finding a set of transformation rules (a context) and applying them to the incoming information, producing an interpretation, which is then compared with what has already been seen and stored in memory. The same information can have different meanings in different contexts. A mathematical model of a context processor and a differential contextual space which can perform such interpretation is discussed and developed in this paper. This study examines whether the basic principles of differential contextual spaces work in practice. The model is developed in the Rust programming language and trained on black and white images which are rotated and shifted both horizontally and vertically, imitating the saccadic and torsional movements of a human eye. Then, a picture that has never been seen under a particular transformation, but has been seen under another one, is presented to the model. The model considers the image in all known contexts and extracts its meaning. The results show that the program can successfully process black and white images transformed by shifts and rotations. This research lays the groundwork for further investigation of the contextual model principles on which general intelligence might operate.

1. Introduction

It is widely considered that the achievements in artificial intelligence (AI), and especially its applications in image processing, have substantially exceeded the boldest expectations of previous years [1]. However, these AI models belong to the domain of artificial narrow intelligence (ANI) [2]. An open research question is whether such models can be developed to the level of artificial general intelligence (AGI), which is believed to be built on different principles. These principles remain undeveloped because of the complexity and the limitations of experiments on the only known example of general intelligence, the living human brain [3,4].
The human brain can understand the meaning of information, which is nowadays considered an active process. It consists of three successive phases: sensory reflection, processing of the information, and the synthesis of an integral image of the object. In other words, it starts with recognizing the elements of a phenomenon in the external world through the sense organs (perception), followed by understanding them and synthesizing them into an integral objective image. This process is personal and subjective, based on the individual's previous experience [5]. Hence, information that has not met a recipient is an opportunity that has not become reality; it receives meaning only by being interpreted by an individual [6]. This bears some resemblance to a quantum system of qubits, where every qubit's state is a superposition of two possible values and obtains a definite value only upon measurement. Before this act, the system has no value, or has all of them simultaneously. The same is true of information: until a message is interpreted, it has no meaning, or has all possible meanings at the same time. The relations between quantum theory and information processing in biological systems are discussed in [7].
Redozubov and Klepikov [8] state that the process of extracting the meaning of a message can be described as follows: the information is transformed in some way, and the result of this transformation is compared with the objects stored in memory. If there is a match, the information has received an interpretation; in other words, the incoming information has been interpreted as something encountered before.
The same approach was used by the British 'Bombe' machine for cracking the 'Enigma' code with crib-based, or known-text, decryption. The idea is to apply some transformation to the given encrypted message and check whether the result makes sense, that is, whether it matches words we know. For example, suppose the incoming encrypted message is 'TEWWER'. On its own, it is meaningless. However, if a known text shows that this meaningless sequence of symbols transforms into 'WETTER' (German for 'weather'), then we can say that 'TEWWER' is interpreted as 'WETTER'. The next task is to infer the rules of the transformation: in this example, 'T' and 'W' are interchangeable, so they are Stecker partners [9]. As soon as the rules are found for one part of the message, the transformation can be applied to another part of it to obtain its meaning. Thus, interpretation is tightly coupled with the rules of transformation, which are defined as 'context' in [8]. Information does not have meaning on its own, but can receive it by being interpreted within a particular context. This also means that the same information can potentially have many meanings when interpreted in different contexts. Jokes in different cultures are based on this ambiguity [10,11].
Existing object recognition models such as Hmax [12,13], Visnet [14,15], various architectures of artificial neural networks [16], and image pattern recognition models [17] are mostly based on hierarchical principles of organization: they recognize primitive objects at the first level and generalize them into complex ones at higher levels. These systems have shown high effectiveness in research and industry. How to combine them with the contextual approach is the main question for further investigation.
The architecture of hybrid cognitive systems such as Soar [18], ACT-R [19], LIDA [20], or iCub [21] generally consists of two blocks: cognition and action selection. The cognition block is built from a combination of specialized ANI modules that process the incoming information. Depending on its output, the action selection block chooses a particular behavior for the system. These cognitive models represent a theory of the computational structures necessary to support human-level agents. They combine reinforcement learning, semantic memory, episodic memory, mental imagery, and an appraisal-based model of emotion, and consider cognition as the process of recognizing an object with a specialized block and then making the subsequent decisions. Thus, they support the statement that human-like intelligence (AGI) can be built out of more specialized blocks (ANI), which is still under discussion.
Restricting the set of objects that a system works with to a single context is a popular approach [22,23,24,25]. Real-world objects are not isolated, but are part of an entire scene and adhere to certain constraints imposed by their environment. For example, many objects rely on a supporting surface, which restricts their possible image locations. In addition, the typical layout of objects in the 3D world and their subsequent projection onto a 2D image leads to a preference for certain image neighborhoods that can be exploited by a recognition algorithm. Therefore, context plays a crucial role in scene understanding, as has been established in both psychophysical [26,27,28] and computational [29,30] studies [31]. The main difference of the contextual model described in this paper is that it focuses on the set of transformation rules applied to an object rather than on its recognition. This set has been named 'context', and it has been suggested to be the key difference between ANI and AGI by Redozubov and Klepikov in [8]. This investigation, being based on that paper, follows the same concept and naming conventions. Applying the transformation rules to the incoming information restores the original object, which is then matched against the ideal objects stored in memory. If the match is successful, the incoming information has been interpreted in a particular context. Thus, the contextual model suggests a different approach to information processing, focusing on the situation in which a phenomenon appears rather than on its features alone. This distinguishes it from the existing successful object recognition models and cognitive architectures described above.
Thus, despite the fact that current systems have achieved impressive results, they are not able to solve the tasks that AGI is expected to solve. The contextual model described in this paper suggests a new view of the information flow. The main contribution of this research is an experimental demonstration of the basic principles of the model of visual information processing. The paper focuses on the concept of applying transformation rules rather than on performance, object recognition accuracy, or practical application. This work is expected to fill the gap in the practical grounding of the contextual approach. The model will be substantially improved in subsequent work by introducing the latest achievements of AI into it.
The paper consists of the following sections. The 'Materials and Methods' section describes the mathematical grounding, algorithms, datasets, and training approaches used for building and testing the model. The 'Results' section reports how successfully the model performed in testing, with the analysis given in the 'Discussion' section. The 'Conclusions' section summarizes the results and sets the direction for the further development of the model.

2. Materials and Methods

2.1. Contextual Model Overview

The formalization of the contextual information processing model was performed by Redozubov [32]. It is stated that if an arbitrary information message consists of discrete elements (concepts c), then the variety of possible messages can be described by N available concepts, which form a dictionary C = {c_1, c_2, c_3, …, c_N}. Thus, an information message of length k can be defined as a set of concepts I = {c_1, c_2, c_3, …, c_k}, where c_i ∈ C, i ∈ 1…N.
The original message I can be transformed into another message I^int by replacing the concepts c_i of the original message with other concepts c_j from the same dictionary C: I = {c_i} → I^int = {c_j}, where c_i, c_j ∈ C; i, j ∈ 1…N. An array of messages I_i with their known interpretations I_i^int forms subject S's memory M, in which every element m is a pair consisting of the initial message I and its interpretation I^int: m = (I, I^int). Thus, the whole memory is M = {m_i | i ∈ 1…N_M}. In the learning mode, subject S is provided with incoming messages I_i and their correct interpretations I^int to save them into memory M. In this research, a valid memory entry m_i is, for example, a right-shifted picture I_1 paired with the original picture I^int; another pair is the same original picture I^int shifted to the left, I_2. Since the original picture I^int consists of primitive objects c_j^int, they are transformed into other objects c_j; here, c_j^int and c_j are the pictures' pixels turning into each other.
M can be divided into K groups combining different original concepts c_j with the same interpreted concept c^int: R_j = {(c_j, c^int)}, where j ∈ 1…K. In other words, if there are three different original concepts c_1, c_2, and c_3 interpreted into the same c^int, then the memory elements m_i which include the concepts c_j in their information messages I_i can be divided into three groups as well. The frequency of every particular c^int in M gives an appropriate estimate of the interpretation probability. A set of the same transformation rules forms a context Cont. Thus, all the revealed groups, each with the same transformation rules, form a space of contexts {Cont_i} for the subject S.
When the learning is finished, the interpretations I^int without their original messages I form the subset M^int = {I_i^int | i ∈ 1…N_{M^int}}. Depending on the number of interpretations I_i^int in M that coincide with a given I^int, every element in M^int can be given its coherence ρ (1).
$$\rho\left(I^{int}\right) = \sum_i \begin{cases} 1, & I_i^{int} = I^{int} \\ 0, & I_i^{int} \neq I^{int} \end{cases} \qquad (1)$$
Since different messages I can have the same interpretation I^int, the number of unique elements I^int satisfies N_{M^int} ≤ N_M. Applying the rules R_j of the context Cont_j, we can obtain the interpretation I_j^int for any new message I, and then calculate its consistency with the interpretation memory M^int: ρ_j = ρ(I_j^int).
Thus, the computation in a context can be represented by the scheme in Figure 1.
Based on the coherence value, it is possible to calculate the probability of the interpretation in a particular context Cont_j (2).
$$p_j = \begin{cases} 0, & \rho_j = 0 \\ \dfrac{\rho_j}{\sum_i \rho_i}, & \rho_j \neq 0 \end{cases} \qquad (2)$$
As a result, we obtain the interpretation of the information I in each of the K possible contexts, together with the probability of that interpretation: {(I_j^int, p_j) | j = 1…K}.
Redozubov [32] states that the information is not understood by the subject S, and therefore has no meaning for it, if all probabilities of its interpretations I_j^int are zero: Σ_j p_j = 0. On the contrary, the interpretations I_j^int with p_j ≠ 0 form a set of possible meanings of the incoming information I: {(I_k^int, p_k) | k = 1…M, M ≤ K, p_k > 0}. The interpretation with the highest probability is the main one for the subject S: I^int′ = I_l^int, where l = index(max(p_k)).
The original incoming information I is received by the contextual space. Each context Cont_k in the space applies its transformation rules R_k and produces an interpretation I_k^int. The interpretation is compared with the memory, which is identical for all modules. The comparison involves calculating the conformity assessment ρ_k, which can be described as the degree to which I_k^int resembles the items in M^int. The final meaning selection is based on the conformities ρ and the interpretation probabilities p saved in the memory. When the interpretation is received, the memory can be updated, accumulating experience.
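As an illustration, the following minimal Rust sketch computes the coherence of Equation (1) and the normalized probability of Equation (2) for a set of candidate interpretations. The byte-per-pixel image type and the names coherence and pick_meaning are assumptions made for this sketch, not part of the published implementation.

```rust
// A minimal sketch of Equations (1) and (2); not the published implementation.
type BitImage = Vec<u8>; // one byte per pixel, 0 = black, 1 = white

/// Equation (1): the coherence of an interpretation is the number of
/// interpretation-memory entries that match it exactly.
fn coherence(interpretation: &BitImage, memory: &[BitImage]) -> usize {
    memory.iter().filter(|m| *m == interpretation).count()
}

/// Equation (2): normalize the coherences of all K contexts into probabilities
/// and return the index and probability of the most probable interpretation.
fn pick_meaning(interpretations: &[BitImage], memory: &[BitImage]) -> Option<(usize, f64)> {
    let rho: Vec<usize> = interpretations.iter().map(|i| coherence(i, memory)).collect();
    let total: usize = rho.iter().sum();
    if total == 0 {
        return None; // Σ_j p_j = 0: the information has no meaning for the subject
    }
    rho.iter()
        .enumerate()
        .max_by_key(|&(_, r)| *r)
        .map(|(j, r)| (j, *r as f64 / total as f64))
}
```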
The algorithm of the differential context space model was developed; it describes how the model learns and interprets transformations from given examples. It was programmed in the Rust programming language and tested as an executable module on a personal computer with 32 × 32 pixel black and white images placed on a 64 × 64 pixel field. Transformation learning was performed on 242 different 32 × 32 icons. They were normalized to have only black (considered as 0) and white (considered as 1) pixels in the 64 × 64 field (Figure 2). The choice of images was based on the following criterion: they should be different enough to make it possible to learn a transformation for every pixel.
Firstly, the model was tested with xy transformations only; then, rotations were added. As a result, in the first step, the contextual space worked with data in which every incoming bit is related to a bit in the output. With the rotations, this rule was relaxed, making the pixel transformations less obvious, except for α ∈ {0, π/2}, where every pixel in the input image still has a related pixel in the output. This happens due to the nature of raster graphic rotations. The algorithm uses nearest-pixel interpolation to realize the rotation.
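The following short Rust sketch illustrates the kind of shift-plus-rotation transform with nearest-pixel interpolation described above; it assumes a square, row-major binary image and is not the module's actual routine.

```rust
// A sketch of an xy shift combined with a nearest-pixel rotation on a square,
// row-major binary image; illustrative only.
fn transform(src: &[u8], side: usize, dx: i32, dy: i32, alpha: f64) -> Vec<u8> {
    let mut dst = vec![0u8; side * side];
    let c = (side as f64 - 1.0) / 2.0; // rotate around the centre of the field
    let (sin, cos) = alpha.sin_cos();
    for y in 0..side {
        for x in 0..side {
            // Undo the shift, then inverse-rotate to find the nearest source pixel.
            let fx = x as f64 - dx as f64 - c;
            let fy = y as f64 - dy as f64 - c;
            let sx = (cos * fx + sin * fy + c).round() as i64;
            let sy = (-sin * fx + cos * fy + c).round() as i64;
            if sx >= 0 && sy >= 0 && (sx as usize) < side && (sy as usize) < side {
                dst[y * side + x] = src[sy as usize * side + sx as usize];
            }
        }
    }
    dst
}
```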
The images were subjected to 4356 different transformations in total. They include xy shifts from 0 to 16 px to imitate saccades and rotations α ∈ {0, π/4, π/2, 3π/4} to imitate torsions of the eye [33]. This is a subset of the rotations used in [34]: {0, π/8, π/4, 3π/8, π/2, 5π/8, 3π/4, 7π/8}. The number of possible transformations follows from the fact that a 32 × 32 px image located in the center of the 64 × 64 px field can be shifted up to 16 px in the x or y direction, including zero, which gives 2 × 16 + 1 = 33 possible positions for every coordinate and 33 × 33 = 1089 possible xy combinations. Multiplying this number by the four values of α gives 1089 × 4 = 4356 transformations, or 1089 × 8 = 8712 for the full set of eight rotations. The full set of rotations and shifts was later used for the horizontal bar interpretation tests. Every new image in the learning set was also exposed to the model with no shift and rotation, to store it in memory without transformations as the interpretation.
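The transformation counts quoted above can be checked with a few lines of Rust (illustrative only):

```rust
// A quick check of the transformation counts; not part of the module.
fn main() {
    let positions = (-16i32..=16).count(); // 2 × 16 + 1 = 33 positions per axis
    let xy = positions * positions;        // 33 × 33 = 1089 xy combinations
    assert_eq!(xy, 1089);
    assert_eq!(xy * 4, 4356);              // four rotations: 0, π/4, π/2, 3π/4
    assert_eq!(xy * 8, 8712);              // the full set of eight rotations
    println!("{} / {} / {}", xy, xy * 4, xy * 8);
}
```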
The trained model was saved into a file so that it could be restored in the same state for different testing sessions. Then, the model was presented with an image from a set which it had never seen. This set is also normalized to black and white 32 × 32 pixel images in the 64 × 64 field. The objects in the pictures are taken from the 'Galaxy' computer game and have no relation to the training icon set (see Section 3.1). The set was chosen because every image in it is distinct from the rest of the set and from the transformation learning set.
Firstly, the system was shown the interpretation object. Then, it was exposed to one of the transformations learned on the other examples, but one in which it had never seen this object before. The model interpreted the given image, writing to the output file the interpretation, the transformation that was applied, and the accuracy of the result.

2.2. Structures Diagram of the Model

The main structures used in the diff_context_space program and the relations between them are shown in Figure 3.
ContextSpace consists of a set of contexts and has two methods: Learn and Interpret. In the learning mode, the space receives a known Transformation t, incoming Information i, and its interpretation under the given transformation.
In relation to the human brain and visual information, it is known that the eyes constantly perform small (microsaccades) and medium (saccades) movements [35,36]. The muscles which operate the eye can move it up, down, left, and right. It is also possible to rotate the eye by a small angle α (torsions). In other words, the picture on the retina is transformed by an xy shift or a rotation. Since it is the brain that operates the eyes, the transformation of the picture is known. As a result, the brain receives the previous picture (I^int), the new one (I), and what happened to it (the transformation). The same parameters are received by the ContextSpace structure in its Learn method.
The Transformation structure has two fields to describe horizontal and vertical shifts as integer values: Horizontal and Vertical. The structure has methods to calculate the distance to another instance (DistanceTo) and to apply itself to information (ApplyTo).
The Information structure has one field to keep the data (Data). It is a one-dimensional array of unsigned integer values which can be represented in 1, 2, 4, 8, or 16 bytes. This structure has a method to calculate its coherence with other information, returning a float value from 0 to 1.
An instance of the ContextSpace finds in the array of contexts the one with the same transformation and initiates its Learn method, passing the incoming information i and its interpretation. The learning logic of the Context structure is discussed in Section 2.3.
In the interpretation mode (the Interpret method), an instance of the ContextSpace structure receives incoming information i and a float value of the minimum desired accuracy, varying from 0 to 1. The contextual space queries every instance of Context in its array and initiates its Interpret method, passing i to it. The algorithm of the interpretation is explained in detail in Section 2.4.
Having applied its own transformation rules, every Context returns an interpretation and an accuracy. The latter is used to select the winning context with probability-dependent logic: the higher the accuracy, the higher the chance the corresponding context has of becoming the winner.
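A minimal Rust sketch of these structures is given below. The field and method names follow Figure 3 and the text, while the bit-packed pixel representation and the particular coherence formula (the share of the other message's set bits that are also set here) are assumptions made for illustration.

```rust
// A sketch of the main structures from Figure 3; internals are assumptions.
struct Information {
    data: Vec<u64>, // bit-packed black and white pixels (Data in Figure 3)
}

struct Transformation {
    horizontal: i32, // horizontal shift in pixels
    vertical: i32,   // vertical shift in pixels
}

struct Context {
    transformation: Transformation,
    // One transformation-rule hypothesis per input pixel (see Section 2.3).
    rules: Vec<Option<Information>>,
}

struct ContextSpace {
    contexts: Vec<Context>,
    memory: Vec<Information>, // interpretation memory shared by all contexts
}

impl Transformation {
    /// DistanceTo: Euclidean distance to another transformation.
    fn distance_to(&self, other: &Transformation) -> f64 {
        let dx = (self.horizontal - other.horizontal) as f64;
        let dy = (self.vertical - other.vertical) as f64;
        (dx * dx + dy * dy).sqrt()
    }
}

impl Information {
    /// Coherence with another message as a float from 0.0 to 1.0, taken here
    /// as the share of the other message's set bits that are also set in self.
    fn coherence(&self, other: &Information) -> f64 {
        let matching: u32 = self.data.iter().zip(&other.data)
            .map(|(a, b)| (a & b).count_ones())
            .sum();
        let total: u32 = other.data.iter().map(|w| w.count_ones()).sum();
        if total == 0 { 0.0 } else { matching as f64 / total as f64 }
    }
}
```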

2.3. Context Learning Algorithm

The main point of the learning algorithm is to pick up the transformation rules for every context and clean them of extraneous data. A particular context that is responsible for a certain transformation is given incoming information i and its interpretation in the learning mode (Section 2.2). The first example in Figure 4 shows one right-shift position of a vertical line in an 8 × 8 pixel visual field.
We assume that every pixel set to 1 in the incoming information has a corresponding pixel, or group of pixels, in the interpretation. This follows from the nature of visual information. Thus, the context creates a set of rules, one for every set pixel, in which every rule is nothing more than a hypothesis about how that pixel is transformed. The first learning example does not reveal exactly what pixel or group of pixels a given pixel is transformed into; that is why the whole interpretation picture is saved. During the next stage, the context is given another group of pixels under the same transformation (a 1 px shift to the right). Having processed it in the same way, the context can clarify the interpretation rule for the pixels which already received an interpretation in the past. The simplest way to achieve this is to combine the new experience with the existing one using the AND operator, as illustrated in Figure 4 and sketched in the code below. This algorithm is deliberately simple in order to obtain predictable results, which is essential for proving the model's principles in practice. The methods for revealing the transformation rules in the field should be more sophisticated; however, they should learn the rules from examples rather than have them encoded.
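The sketch below is a minimal Rust rendering of this consolidation step, assuming the visual field is stored as a flat array of 0/1 pixels and each rule hypothesis is kept as an optional bitmap; the function name consolidate is illustrative and not taken from the module's actual code.

```rust
// A minimal sketch of the rule-consolidation step from Figure 4.
fn consolidate(rules: &mut [Option<Vec<u8>>], input: &[u8], interpretation: &[u8]) {
    for (pixel, rule) in input.iter().zip(rules.iter_mut()) {
        if *pixel == 0 {
            continue; // rules are created only for set (value 1) pixels
        }
        match rule {
            // First time this pixel is seen set: save the whole interpretation.
            None => *rule = Some(interpretation.to_vec()),
            // Seen before: narrow the hypothesis with a pixel-wise AND.
            Some(hypothesis) => {
                for (h, i) in hypothesis.iter_mut().zip(interpretation) {
                    *h &= *i;
                }
            }
        }
    }
}
```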

2.4. Context Interpretation Algorithm

The interpretation algorithm receives information i, applies the learned transformation rules to it, and calculates the accuracy, as shown in Figure 5. In the first step, the incoming information is disassembled into a set of bits. Then, the context applies its transformation rules to every set bit. Having combined the results with the OR operator, the context obtains an interpretation. The interpretation is compared with the memory of all already known interpretations, which is shared between all contexts, to find the best match. The calculation of the accuracy is based on comparing the number of set bits in the incoming information with the number of rules the context has for these particular bits. The presented logic is simple, but it is enough to prove the basic principles of context interpretation. It can be substantially improved, for example, by adding modern artificial neural networks for selecting the match in the interpretation memory [37].
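The sketch below illustrates this interpretation step under the same assumptions as the learning sketch above: rules are optional bitmaps indexed by input pixel, the interpretation is their pixel-wise OR, and the accuracy is taken here as the share of set input pixels for which the context has a rule. This reading of the accuracy calculation is an assumption consistent with, but not necessarily identical to, the description above.

```rust
// A sketch of the interpretation step from Figure 5; illustrative only.
fn interpret(rules: &[Option<Vec<u8>>], input: &[u8]) -> (Vec<u8>, f64) {
    let mut interpretation = vec![0u8; input.len()];
    let mut set_bits = 0usize;   // set pixels in the incoming information
    let mut known_bits = 0usize; // set pixels for which the context has a rule
    for (pixel, rule) in input.iter().zip(rules) {
        if *pixel == 0 {
            continue;
        }
        set_bits += 1;
        if let Some(hypothesis) = rule {
            known_bits += 1;
            // Summarize the applicable rules with a pixel-wise OR.
            for (out, h) in interpretation.iter_mut().zip(hypothesis) {
                *out |= *h;
            }
        }
    }
    let accuracy = if set_bits == 0 { 0.0 } else { known_bits as f64 / set_bits as f64 };
    (interpretation, accuracy) // the interpretation is then matched against memory
}
```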

3. Results

3.1. XY Transformations

The model and programmed module are described in Section 2. Firstly, the module was tested with 242 16 × 16 pixel black and white images placed on a 32 × 32 pixel field. The images were subjected to 289 different shifts to imitate saccades: from zero to eight pixels in the horizontal and vertical directions. The total number of images consumed by the model to learn the transformation rules was 289 × 242 = 69,938. The icons were normalized to have only black (considered as 0) and white (considered as 1) pixels in the 32 × 32 field. The icons and the results are grouped in Table 1.
The column '# of transformations' shows the number of transformations tested. 'Correctly recognized transformations' shows the proportion of cases in which the model recognized the transformation properly. 'Average interpretation coherence' shows the coherence between the selected interpreted image and the one selected from memory (from 0.0 to 1.0). The results of applying the selected context to find the interpretation are shown in Figure 6; some of them have artifacts (extra pixels).
Then, the model was tested with the same set of images, but with 32 × 32 px sizes in a 64 × 64 field. The number of transformations applied was 1089. The results were similar: all transformations and images were recognized properly.

3.2. XY and 4α Transformations for the Galaxy Images

The model was tested with 32 × 32 pixel black and white images placed on a 64 × 64 pixel field. The images were subjected to 4356 different transformations. As reported in Section 2.1, the transformations included xy shifts from 0 to 16 px to imitate saccades, with rotations α ∈ {0, π/4, π/2, 3π/4} added to imitate torsions. The total number of images consumed by the model to learn was 4356 × 242 = 1,054,152. The interpretation results were assessed by the number of errors the system made in image recognition; an error means that the model was not able to recognize the initial image. Another type of assessed error was transformation recognition: this error occurred when the transformation was not recognized properly, regardless of image recognition. The results are grouped in Table 2.
The distribution of the image recognition errors depending on the applied transformation is shown in Figure 7. In the charts, the x and y axes refer to the x and y values of the transformation at which an image interpretation error occurred. The rotation angle α is not depicted there.
Almost all of the 16 example images have interpretation error rates of 14% or less. Among them, two images (Figure 7d,e) had around 1.3% interpretation errors; one image (Figure 7o) had less than 3%; two images (Figure 7g,i) had around 3.5%; three images (Figure 7a,b,l) had around 6%; three images (Figure 7h,k,p) had around 8%; three images (Figure 7c,f,m) had around 12%; and one image (Figure 7j) had around 14%.

3.3. XY and 8α Transformations for Horizontal Bar

Since the next step of this investigation will address the pinwheel formation of slit or bar rotation activations [38,39], the model was tested only on the interpretation of a 32 × 1 px horizontal bar (Figure 8) instead of the Galaxy set (Table 1). This also allows the training examples to be excluded from the interpretation memory, requiring the model only to recognize the bar with a certain accuracy.
The number of training pictures was reduced to 100 images, but with smoother lines (Figure 9). Based on the results in Section 3.1, the number of learning examples can be decreased without an essential influence on the extraction of transformation rules by the model. This allows the training process to be sped up significantly, since the number of transformations is 8712.
The results revealed the following: in 20 cases (0.23%), no interpretation was found at the 0.7 accuracy level; in 425 cases (4.88%), the transformation was not selected properly.
All transformation errors had an incorrect x coordinate, while y and α were selected properly. A significant number of errors involve a 1 px mistake in the coordinate (Figure 10).
The error distribution with respect to the rotation is shown in Figure 11. Most of the transformation misinterpretations happened for α ∈ {45°, 135°} (α ∈ {π/4, 3π/4}).

4. Discussion

The differential context space model has shown stable results for transformation recognition when working with the xy transformations: all images and their transformations were recognized properly. However, the actual interpretations of some Galaxy images were not transformed exactly back to the original ones and had some extra pixels. These artefacts appear because the learning examples do not allow the model to pick up the transformation rules for every pixel (Figure 6). As a result, the coherence between the interpreted images and the images existing in memory is 1.0 for seven of the sixteen pictures, close to 0.99 for five of them, around 0.94 for two, and around 0.9 for the remaining two. Despite this, the model found the original image properly and selected the context with the right transformation in all cases. Increasing the resolution of the images to 32 × 32 px in a 64 × 64 field leads to the same results.
Adding four rotations to the images increased the number of errors. One image (Figure 7n) was misinterpreted by the system in more than 33% of cases. The analysis of the error distribution reveals a similar picture for all images except Figure 7n: the errors rise noticeably as the xy shifts increase from the central position (x = 0, y = 0). This allows us to assume that the further the image is shifted and rotated from the central position, the harder it is to restore its initial view. Figure 7n has a significantly higher number of errors (33.26%) as well as a different pattern of their distribution: a total of 70.05% of the errors are misinterpretations as playback play.png from the training set. Since there is only one example with such a high level of errors, it can be explained by some individual features of the image.
Adding four more rotations and changing the Galaxy set to a simple horizontal bar image led to 20 cases of error in which no interpretation was found (the accuracy was lower than 0.7). All of them have the extreme possible shift to the right (x = 16 px) and α ∈ {π/4, 3π/4}, while y varies. This can be explained by the fact that rotations by π/4 and 3π/4 have the highest pixel loss for raster graphics. Considering that the errors happened at the position furthest from the horizontal center (x = 16), it can be assumed that this is the transformation in which the maximum number of pixels is lost, which in turn explains the highest level of errors.
The results show that the model works robustly with transformations that do not lead to the loss of pixels. In other words, if every pixel in the original image has a corresponding pixel in the transformed image, then the model can properly establish the transformation rules and find the right context. However, as soon as some pixels are lost, the model makes errors not only in restoring the original image, but also in finding the right context. Furthermore, there is a tendency towards more errors at the extreme allowed transformations. Despite the errors, it is possible to say that the tests have confirmed the basic principles of the contextual model.
The contextual model presented in this research is only a concept and cannot be applied to real-world tasks at this stage, nor can it compete with other AI systems specialized for visual information processing. This paper concentrates only on proving the basic principles of the model in order to lay the foundation for its further development.
The images used in this investigation were only black and white with a low resolution, to make the results of the model predictable. However, the model can be improved by adding edge detection algorithms so that it can work with greyscale or color pictures as well. Also, increasing the image resolution should decrease the influence of pixel loss on rotation transformations by making the edges smoother. Adding uncertainty to the interpretation of pixels with probability, or using the most successful recognition models described in Section 1 for matching interpretations with memory, should substantially improve the model. At the same time, it will become less predictable, making it harder to reproduce and analyze the results. These steps should make an essential contribution towards practical applications.

5. Conclusions

This paper has shown that the differential context space can successfully learn the rules of xyα transformations of black and white visual information. This was proved for 16 × 16 px images in a 32 × 32 field and for 32 × 32 px images in a 64 × 64 field. The model can successfully interpret these images even though they have never been seen with the particular shifts and rotations. It has been demonstrated that the model does not make interpretation mistakes if no pixels are lost during the transformation. The highest number of errors happened for rotations by π/4 and 3π/4 with the furthest shift to the right. This can be explained by the nature of raster image rotations: in these positions, the maximum number of pixels is lost compared to the original. Thus, the results show that the basic principles of contextual information processing have experimental grounding and that continued investigation is warranted. The following step is to improve the contextual model by integrating artificial neural networks into the object recognition steps.

Author Contributions

Conceptualization, I.K. and M.P.; methodology, I.K.; software, I.K.; validation, I.K. and M.P.; formal analysis, I.K.; investigation, I.K.; resources, I.K. and M.P.; data curation, I.K.; writing—original draft preparation, I.K.; writing—review and editing, I.K. and M.P.; visualization, I.K.; supervision, M.P.; project administration, I.K.; funding acquisition, M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the findings of this study are available from the corresponding author I. Khurtin on request. The source code of the model in Rust language can be downloaded at: https://github.com/iluhakhurtin/CombinatorialSpace, commit hash a9f56cfa78, folder diff_context_space.

Acknowledgments

The authors would like to express sincere appreciation to Alexey Redozubov for his invaluable guidance and insightful discussions on interpreting the results for this paper. His support has been instrumental in shaping the direction of this research and enhancing the quality of the findings.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wlodarczak, P. Machine Learning Applications. In Machine Learning and Its Applications, 1st ed.; Wlodarczak, P., Ed.; CRC Press/Taylor & Francis Group: Boca Raton, FL, USA, 2020; pp. 163–173. [Google Scholar]
  2. Shane, J. What is AI? In You Look Like a Thing and I Love You; OCLC: Dublin, OH, USA, 2019; 1128058352; p. 41. [Google Scholar]
  3. Goertzel, B. Artificial General Intelligence; Gabbay, D.M., Siekmann, J., Bundy, A., Carbonell, J.G., Pinkal, M., Uszkoreit, H., Veloso, M., Wahlster, W., Wooldridge, M.J., Eds.; Cognitive Technologies; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
  4. Gupta, A.; Seal, A.; Prasad, M.; Khanna, P. Salient Object Detection Techniques in Computer Vision. A Survey. Entropy 2020, 22, 1174. [Google Scholar] [CrossRef] [PubMed]
  5. Menant, C. Information and Meaning. Entropy 2003, 5, 193–204. [Google Scholar] [CrossRef]
  6. Mosunova, L. Theoretical approaches to defining the concept of the perception of the meaning of information. Sci. Tech. Inf. Process. 2017, 44, 175–183. [Google Scholar] [CrossRef]
  7. Asano, M.; Basieva, I.; Khrennikov, A.; Ohya, M.; Tanaka, Y.; Yamato, I. Quantum Information Biology: From Information Interpretation of Quantum Mechanics to Applications in Molecular Biology and Cognitive Psychology. Found. Phys. 2015, 45, 1362–1378. [Google Scholar] [CrossRef]
  8. Redozubov, A.; Klepikov, D. The Meaning of Things as a Concept in a Strong AI Architecture. In Artificial General Intelligence; Lecture Notes in Computer Science; Goertzel, B., Panov, A.I., Potapov, A., Yampolskiy, R., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12177, pp. 290–300. [Google Scholar]
  9. Singh, S. Cracking the enigma. In The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography; OCLC: 150673425; Anchor Books: New York, NY, USA, 2000; p. 174. [Google Scholar]
  10. Bucaria, C. Lexical and syntactic ambiguity as a source of humor: The case of newspaper headlines. Humor—Int. J. Humor Res. 2004, 17, 279–309. [Google Scholar] [CrossRef]
  11. Attardo, S. Linguistic Theories of Humor; Walter de Gruyter: Berlin, Germany; New York, NY, USA, 2009. [Google Scholar]
  12. Riesenhuber, M.; Poggio, T. Hierarchical models of object recognition in cortex. Nat. Neurosci. 1999, 2, 1019–1025. [Google Scholar] [CrossRef] [PubMed]
  13. Serre, T.; Wolf, L.; Bileschi, S.; Riesenhuber, M.; Poggio, T. Robust Object Recognition with Cortex-Like Mechanisms. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 411–426. [Google Scholar] [CrossRef] [PubMed]
  14. Wallis, G.; Rolls, E.; Foldiak, P. Learning invariant responses to the natural transformations of objects. In Proceedings of the 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan), Nagoya, Japan, 25–29 October 1993; Volume 2, pp. 1087–1090. [Google Scholar]
  15. Robinson, L.; Rolls, E. Invariant visual object recognition: Biologically plausible approaches. Biol. Cybern. 2015, 109, 505–535. [Google Scholar] [CrossRef] [PubMed]
  16. Egmont-Petersen, M.; de Ridder, D.; Handels, H. Image processing with neural networks—A review. Pattern Recognit. 2002, 35, 2279–2301. [Google Scholar] [CrossRef]
  17. Rao, L.K.; Rahman, M.Z.U.; Rohini, P. Features Used for Image Retrieval Systems. In Image Pattern Recognition: Fundamentals and Applications, 1st ed.; CRC Press: Boca Raton, FL, USA, 2021; pp. 9–23. [Google Scholar]
  18. Laird, J. The Soar Cognitive Architecture; The MIT Press: Cambridge, MA, USA, 2012; pp. 1–26. [Google Scholar]
  19. Ritter, F.; Tehranchi, F.; Oury, J. ACT-R: A cognitive architecture for modeling cognition. WIREs Cogn. Sci. 2019, 10, e1488. [Google Scholar] [CrossRef]
  20. Franklin, S.; Madl, T.; D’Mello, S.; Snaider, J. LIDA: A Systems-level Architecture for Cognition, Emotion, and Learning. IEEE Trans. Auton. Ment. Dev. 2014, 6, 19–41. [Google Scholar] [CrossRef]
  21. Vernon, D.; Hofsten, C.; Fadiga, L. The iCub Cognitive Architecture. In A Roadmap for Cognitive Development in Humanoid Robots; 31 Cognitive Systems Monographs; Dillmann, R., Vernon, D., Nakamura, Y., Schaal, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 11, pp. 121–153. [Google Scholar]
  22. Xu, Y.; Li, Y.; Shin, B. Medical image processing with contextual style transfer. Hum.-Centric Comput. Inf. Sci. 2020, 10, 46. [Google Scholar] [CrossRef]
  23. Contextual learning is nearly all you need. Nat. Biomed. Eng. 2022, 6, 1319–1320. [CrossRef]
  24. Rentschler, T.; Bartelheim, M.; Behrens, T.; Bonilla, M.; Teuber, S.; Scholten, T.; Schmidt, K. Contextual spatial modelling in the horizontal and vertical domains. Nat. Sci. Rep. 2022, 12, 9496. [Google Scholar] [CrossRef]
  25. Graph deep learning detects contextual prognostic biomarkers from whole-slide images. Nat. Biomed. Eng. 2022, 6, 1326–1327. [CrossRef] [PubMed]
  26. Biederman, I. On the semantics of a glance at a scene. In Perceptual Organization; Kubovy, M., Pomerantz, J., Eds.; Lawrence Erlbaum: London, UK, 1981; Chapter 8; pp. 213–253. [Google Scholar]
  27. De Graef, P.; Christiaens, D.; d’Ydewalle, G. Perceptual effects of scene context on object identification. Psychol. Res. 1990, 52, 317–329. [Google Scholar] [CrossRef]
  28. Torralba, A.; Oliva, A.; Castelhano, M.; Henderson, J. Contextual guidance of attention in natural scenes: The role of global features on object search. Psychol. Rev. 2006, 113, 766–786. [Google Scholar] [CrossRef] [PubMed]
  29. Hoiem, D.; Efros, A.; Hebert, M. Putting objects into perspective. IEEE Conf. Comput. Vis. Pattern Recognit. 2006, 2, 2137–2144. [Google Scholar]
  30. Torralba, A. Contextual priming for object detection. Int. J. Comput. Vis. 2003, 53, 169–191. [Google Scholar] [CrossRef]
  31. Grauman, K.; Leibe, B. Context-based recognition. In Visual Object Recognition; Morgan & Claypool Publishers: Rapperswil, Switzerland, 2010; pp. 122–123. [Google Scholar]
  32. Redozubov, A. Holographic Memory: A Novel Model of Information Processing by Neuronal Microcircuits. In The Physics of the Mind and Brain Disorders; Springer Series in Cognitive and Neural, Systems; Opris, I., Casanova, M.F., Eds.; Springer International Publishing: Cham, Switzerland, 2017; Volume 11, pp. 271–295. [Google Scholar]
  33. Leigh, J.; Zee, D. A Survey of Eye Movements: Characteristics and Teleology. In The Neurology of Eye Movements, 5th ed.; University Press: Oxford, UK, 2015; pp. 10–25. [Google Scholar]
  34. Bosking, W.; Zhang, Y.; Schofield, B.; Fitzpatrick, D. Orientation Selectivity and the Arrangement of Horizontal Connections in Tree Shrew Striate Cortex. J. Neurosci. 1997, 17, 2112–2127. [Google Scholar] [CrossRef]
  35. Mergenthaler, K.; Engbert, R. Microsaccades are different from saccades in scene perception. Exp. Brain Res. 2010, 203, 753–757. [Google Scholar] [CrossRef] [PubMed]
  36. Engbert, R. Microsaccades: A microcosm for research on oculomotor control, attention, and visual perception. Prog. Brain Res. 2006, 154, 177–192. [Google Scholar] [PubMed]
  37. Bishop, C. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
  38. Blasdel, G.; Salama, G. Voltage-sensitive dyes reveal a modular organization in monkey striate cortex. Nature 1986, 321, 579–585. [Google Scholar] [CrossRef] [PubMed]
  39. Bonhoeffer, T.; Grinvald, A. Iso-orientation domains in cat visual cortex are arranged in pinwheel-like patterns. Nature 1991, 353, 429–431. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Computational scheme of the context module.
Figure 2. Learning icons set.
Figure 3. Structures diagram of the model.
Figure 4. Context learning algorithm example. The context creates a set of transformation rules for every set pixel in the incoming information and, as a result, produces 6 and 4 transformation rule hypotheses for the 2 pictures. The consolidation of the transformation rules combines the previous experience, clarifying the hypotheses for every incoming set (value 1) pixel (highlighted in orange). The pixels reset (changed from 1 to 0) during clarification are highlighted in red.
Figure 5. Information interpretation algorithm. The incoming information is disassembled into its set pixels. Then, the rules received in the learning mode are applied to it, producing interpretations with some accuracy. Set (value 1) pixels are highlighted in orange.
Figure 6. Galaxy images actual interpretations.
Figure 7. Image recognition errors with xy transformation distribution (a–p).
Figure 8. 32 × 1 px horizontal slit.
Figure 9. Learning smooth icons set (100).
Figure 10. X transformation errors for slit.
Figure 11. X transformation errors distribution (blue line) for slit depending on α.
Table 1. Interpretation results. Image thumbnails from the original table are omitted here; each row lists two images side by side, as in the original layout.

Image | # of Transformations | Correctly Recognized Transformations | Average Interpretation Coherence || Image | # of Transformations | Correctly Recognized Transformations | Average Interpretation Coherence
(thumbnail) | 289 | 1.0 | 0.94520545 || (thumbnail) | 289 | 1.0 | 0.9917355
(thumbnail) | 289 | 1.0 | 1.0 || (thumbnail) | 289 | 1.0 | 1.0
(thumbnail) | 289 | 1.0 | 1.0 || (thumbnail) | 289 | 1.0 | 0.92086333
(thumbnail) | 289 | 1.0 | 0.9917355 || (thumbnail) | 289 | 1.0 | 0.8955224
(thumbnail) | 289 | 1.0 | 0.99224806 || (thumbnail) | 289 | 1.0 | 1.0
(thumbnail) | 289 | 1.0 | 0.99310344 || (thumbnail) | 289 | 1.0 | 1.0
(thumbnail) | 289 | 1.0 | 1.0 || (thumbnail) | 289 | 1.0 | 1.0
(thumbnail) | 289 | 1.0 | 0.9382716 || (thumbnail) | 289 | 1.0 | 0.99224806
Table 2. Interpretation errors. Image thumbnails from the original table are omitted here; each row lists two images side by side, as in the original layout.

Image | Image Recognition Errors | Transformation Recognition Errors | Image and Transformation Recognition Errors || Image | Image Recognition Errors | Transformation Recognition Errors | Image and Transformation Recognition Errors
(thumbnail) | 5.10% | 8.54% | 5.10% || (thumbnail) | 3.42% | 11.59% | 3.42%
(thumbnail) | 5.81% | 8.75% | 5.81% || (thumbnail) | 14.03% | 14.26% | 14.03%
(thumbnail) | 12.35% | 14.10% | 12.35% || (thumbnail) | 8.03% | 13.59% | 8.03%
(thumbnail) | 1.29% | 6.34% | 1.29% || (thumbnail) | 5.60% | 12.70% | 5.60%
(thumbnail) | 1.33% | 7.60% | 1.33% || (thumbnail) | 11.55% | 11.75% | 11.55%
(thumbnail) | 11.46% | 13.57% | 11.46% || (thumbnail) | 33.26% | 25.30% | 33.26%
(thumbnail) | 3.51% | 5.69% | 3.51% || (thumbnail) | 2.80% | 5.26% | 2.80%
(thumbnail) | 8.61% | 10.42% | 8.61% || (thumbnail) | 7.81% | 11.36% | 7.81%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
