Scalable data caching requires scalable querying – here’s how to achieve it

July 14, 2021

Modern fintech applications have to process enormous volumes of data, and that places new demands on their performance and scalability.

One of the most crucial tasks of a fintech application is retrieving data quickly for processing. Typically, data sources are slow. One technique to improve performance is in-memory caching of the required data. This means the application can get the data from its own primary memory (RAM), instead of having to make a round trip to a slow data source.

The data requirements of modern applications are complex. Often, applications need to query data using arbitrarily complex conditions: GREATER THAN, LESS THAN, BETWEEN, IN, and so on, combined through multiple AND/OR clauses. Traditionally, this need has been met by the data sources themselves. However, if we use RAM as the source of large volumes of data, we need to query RAM in a similar way. Moreover, as the data cache expands, the querying capabilities on the cache also need to scale.
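
To make that concrete, here is roughly what such a query could look like in code. The Trade record below is a hypothetical cached object invented purely for illustration, and it is reused in the sketches throughout this article:

```java
import java.util.function.Predicate;

public class QueryExample {
    // A hypothetical cached object, used throughout the sketches in this article.
    record Trade(String instrument, String trader, long quantity, double price) {}

    public static void main(String[] args) {
        // instrument = 'EUR/USD' AND (quantity BETWEEN 1000 AND 10000 OR price > 1.20)
        Predicate<Trade> query = t ->
                t.instrument().equals("EUR/USD")
                && ((t.quantity() >= 1_000 && t.quantity() <= 10_000) || t.price() > 1.20);

        Trade sample = new Trade("EUR/USD", "alice", 5_000, 1.1850);
        System.out.println(query.test(sample)); // prints: true
    }
}
```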

Cache withdrawal tax

In a naïve implementation, all the cached data objects are iterated and examined against the query filter conditions. The overall time this takes is directly proportional to the number of data objects cached. If the cache holds millions of objects, this approach can be quite slow.
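
As a minimal sketch of the naïve approach, using the same hypothetical Trade record, every query walks the entire cache:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.Predicate;

public class NaiveCacheQuery {
    record Trade(String instrument, String trader, long quantity, double price) {}

    // Every query visits every cached object, so the cost grows linearly with cache size.
    static List<Trade> query(Collection<Trade> cache, Predicate<Trade> filter) {
        List<Trade> result = new ArrayList<>();
        for (Trade t : cache) {          // O(n) scan over all cached objects
            if (filter.test(t)) {
                result.add(t);
            }
        }
        return result;
    }
}
```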

The following chart shows how the time taken increases with the number of objects in an application using the naïve approach:

An old war horse to the rescue

To speed up our cache query, we need to reduce the number of candidate data objects to be scanned. We can do this using a popular technique from the world of databases: indexing.

Indexing is a simple idea. It’s like maintaining the index of a book so that we can go directly to the page we’re interested in. In software terms, we can keep data objects grouped by certain data attributes. If the query contains a filter clause on one of those attributes, we can directly get the group belonging to that attribute value. Then we can filter the returned objects according to the rest of the query conditions. Fewer objects scanned means a faster query.
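
Here is one possible sketch of that idea, again using the hypothetical Trade record: trades are grouped by their instrument attribute, and a query that filters on instrument only scans the matching group:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class IndexedCache {
    record Trade(String instrument, String trader, long quantity, double price) {}

    // Index: trades grouped by the 'instrument' attribute.
    private final Map<String, List<Trade>> byInstrument = new HashMap<>();

    void add(Trade t) {
        byInstrument.computeIfAbsent(t.instrument(), k -> new ArrayList<>()).add(t);
    }

    // Query with an equality clause on instrument plus arbitrary remaining filters.
    List<Trade> query(String instrument, Predicate<Trade> remainingFilters) {
        List<Trade> candidates = byInstrument.getOrDefault(instrument, List.of());
        List<Trade> result = new ArrayList<>();
        for (Trade t : candidates) {      // scan only the matching group
            if (remainingFilters.test(t)) {
                result.add(t);
            }
        }
        return result;
    }
}
```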

Six of one, and a half dozen of the other

In a complex application, a cache is often shared by different subsystems. In these scenarios, the query submitted to the cache may not be guaranteed to contain a specific filter clause. To speed up multiple possible queries, we can maintain multiple indexes on the cached set of data objects.

Further, if the structure of the query is not fully controlled (for example, if the query is generated by the application user), it can contain complex conditions connected by multiple ‘AND’ and ‘OR’ clauses. To choose an index for such a query, we need to find the largest set of filter clauses that are logically ANDed with the rest of the query. We can then safely use any of these filtering attributes to read the corresponding index and apply the remaining filters to the returned objects. Of the possible attributes, the best choice is the one whose index returns the fewest objects.
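
As a rough sketch of that selection step (the Index interface and IndexedClause record below are hypothetical names invented for illustration): among the clauses ANDed with the rest of the query, we pick the one whose index returns the smallest group, then apply the remaining filters to it:

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

public class IndexSelection {
    record Trade(String instrument, String trader, long quantity, double price) {}

    // Hypothetical view of a single index: given a clause's value, return the matching group.
    interface Index {
        List<Trade> lookup(Object value);
    }

    // One top-level AND clause that can be answered by an index.
    record IndexedClause(Index index, Object value) {}

    static List<Trade> query(List<IndexedClause> andClauses, Predicate<Trade> remainingFilters) {
        if (andClauses.isEmpty()) {
            throw new IllegalArgumentException("no indexed clause; fall back to a full scan");
        }
        // Read each candidate group and keep the smallest one.
        List<Trade> candidates = andClauses.stream()
                .map(c -> c.index().lookup(c.value()))
                .min(Comparator.comparingInt(List::size))
                .orElseThrow();
        // Apply the rest of the query only to that group.
        return candidates.stream().filter(remainingFilters).toList();
    }
}
```

In practice, an index would typically report the size of a group without materializing it, so the cheapest group can be chosen before any objects are read.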

It’s important to note that maintaining multiple indexes increases the overall memory requirement. Also, whenever the cache is updated, all these indexes must be kept in sync. For overall performance and scalability, we have to choose our data structure carefully to ensure it is efficient in these respects.

We can implement an index in different ways, depending on the characteristics of the query filter clauses.

A hash-hash affair

If a filter clause on an attribute is always in the equality form ( = ), we can keep the data objects grouped by that attribute in a hash map. A hash map is highly efficient in retrieving objects that correspond to a given key value.

The overall time this approach takes is independent of the total number of cached objects. Therefore, this approach ensures high performance.
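
For instance, still assuming the hypothetical Trade record, an equality index on the trader attribute can be nothing more than a hash map from trader to that trader’s group of trades:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class EqualityIndex {
    record Trade(String instrument, String trader, long quantity, double price) {}

    // Build the equality index once; in practice it would be kept in sync as the cache changes.
    static Map<String, List<Trade>> indexByTrader(Collection<Trade> cache) {
        return cache.stream().collect(Collectors.groupingBy(Trade::trader));
    }

    // trader = 'alice': a single hash lookup, constant time on average,
    // regardless of how many trades the cache holds.
    static List<Trade> tradesFor(Map<String, List<Trade>> byTrader, String trader) {
        return byTrader.getOrDefault(trader, List.of());
    }
}
```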

A map hung on a tree

If an attribute can appear with different conditions in a filter clause, such as GREATER THAN, LESS THAN, and so on, we can keep the grouped data objects in a sorted list or a binary search tree, ordered by that attribute. Any comparison operator on the attribute can then be evaluated efficiently on such a data structure to quickly locate the relevant data objects.

With data structures like these, the time taken to locate the grouped objects is usually proportional to the logarithm of the number of possible attribute values. That makes queries fast, because the logarithm of even a very large number is small.
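
A possible sketch uses Java’s TreeMap (a red-black tree, so locating a key is logarithmic in the number of distinct attribute values), keyed here by a hypothetical price attribute:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class RangeIndex {
    record Trade(String instrument, String trader, long quantity, double price) {}

    // Trades grouped by price, kept sorted by the price attribute.
    private final NavigableMap<Double, List<Trade>> byPrice = new TreeMap<>();

    void add(Trade t) {
        byPrice.computeIfAbsent(t.price(), k -> new ArrayList<>()).add(t);
    }

    // price > threshold: finding the starting point costs O(log k), where k is the
    // number of distinct price values, independent of the total cache size.
    List<Trade> priceGreaterThan(double threshold) {
        List<Trade> result = new ArrayList<>();
        byPrice.tailMap(threshold, false).values().forEach(result::addAll);
        return result;
    }

    // price BETWEEN low AND high (inclusive).
    List<Trade> priceBetween(double low, double high) {
        List<Trade> result = new ArrayList<>();
        byPrice.subMap(low, true, high, true).values().forEach(result::addAll);
        return result;
    }
}
```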

The following chart shows the time taken by a query filtering an increasing number of data objects. As data volume increases, query time without tree-based indexes grows much faster than query time with them. In fact, with tree-based indexes the time taken remains relatively constant:

Group some and then group some more

If the cache query generally contains a specific combination of filtering attributes, then data objects can be grouped using the combination as well. In such an index, known as a composite index, objects are grouped using the first attribute. Then the grouped object subsets are further grouped using the second attribute, and so on. This reduces the number of objects to be scanned even further. Any of the data structures we’ve already described can be used to implement a composite index.
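
One way to sketch a composite index on a hypothetical (instrument, trader) combination is a map of maps: the outer map groups trades by instrument, and each inner map groups that subset by trader:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompositeIndex {
    record Trade(String instrument, String trader, long quantity, double price) {}

    // Composite index on (instrument, trader): group by instrument first,
    // then group each subset by trader.
    private final Map<String, Map<String, List<Trade>>> byInstrumentAndTrader = new HashMap<>();

    void add(Trade t) {
        byInstrumentAndTrader
                .computeIfAbsent(t.instrument(), k -> new HashMap<>())
                .computeIfAbsent(t.trader(), k -> new ArrayList<>())
                .add(t);
    }

    // instrument = ? AND trader = ? : two hash lookups narrow the candidates
    // to the smallest relevant group before any remaining filters are applied.
    List<Trade> find(String instrument, String trader) {
        return byInstrumentAndTrader
                .getOrDefault(instrument, Map.of())
                .getOrDefault(trader, List.of());
    }
}
```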

ION ARC

ION’s ARC technology powers the data aggregation and reporting capabilities of our products. Thanks to its scalable distributed architecture, ARC can extract, transform, and load huge volumes of data in memory across several machine nodes. To answer complex data queries quickly, it uses various techniques, including indexing, to filter, aggregate, and report on the distributed data. We also provide a flexible user interface that allows you to easily design your own interactive reporting dashboards and view this data in various forms.

Learn more about ARC and the applications it powers.