Dynamic Content-Oriented Indexing and Replication for High-Performance Storage and Analysis of Big Data in the IPFS Network
DOI:
https://doi.org/10.14529/jsfi250206
Keywords:
dynamic indexing, IPFS, big data, replication, content-oriented indexing, high-performance storage
Abstract
This paper presents an architecture for dynamic, content-oriented indexing and adaptive replication that enables high-performance storage and analysis of big data on IPFS. We first outline key gaps of vanilla IPFS for analytics – no global content search, non-guaranteed persistence without coordinated pinning, static replication, and highly variable retrieval latency – and address them with two components: (1) a two-tier distributed index (per-attribute/keyword inverted lists as IPFS objects plus a lightweight catalog that maps search keys to index CIDs via DHT/IPNS or CRDT-based dissemination); and (2) an adaptive replication service that aggregates access telemetry and adjusts replica counts and placement using hysteresis thresholds and topology-aware selection. The contribution is a theoretical proposal and architectural blueprint; no prototype or experimental results are reported here. We discuss integration with analytical engines through two paths: a pragmatic FUSE mount that exposes IPFS content as a local filesystem to Spark/Flink, and prospective native connectors that parallelize block reads over the IPFS API. For tabular datasets, dataset metadata (schema, partitioning, file CIDs) is maintained in IPFS to support versioning and reproducibility. A plan for comparative evaluation versus HDFS, Ceph, and S3 (e.g., TPC-DS and subsets of Common Crawl) is outlined. Expected benefits are faster content discovery, higher throughput under skew and multi-tenant load, and improved resilience, with modest index/coordination overheads. The approach combines the openness of a decentralized P2P substrate with the manageability required by enterprise-scale analytics.
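The hysteresis-based adjustment of replica counts mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the telemetry metric, the thresholds, or the step size, so the names `ReplicaPolicy` and `next_replica_count` and all numeric values below are hypothetical and serve only as an illustration, not as the authors' design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaPolicy:
    """Hypothetical policy parameters; none of these values come from the paper."""
    scale_up_rate: float = 100.0    # requests/min above which a replica is added
    scale_down_rate: float = 20.0   # requests/min below which a replica is removed
    min_replicas: int = 2
    max_replicas: int = 8


def next_replica_count(current: int, access_rate: float, policy: ReplicaPolicy) -> int:
    """Return the target replica count for one object.

    The gap between the two thresholds is the hysteresis band: while the
    access rate stays inside it, the count is left unchanged, so small
    fluctuations in load do not add and drop replicas repeatedly.
    """
    if access_rate > policy.scale_up_rate:
        target = current + 1
    elif access_rate < policy.scale_down_rate:
        target = current - 1
    else:
        target = current
    # Clamp to the allowed range.
    return max(policy.min_replicas, min(policy.max_replicas, target))


if __name__ == "__main__":
    policy = ReplicaPolicy()
    replicas = 2
    # Assumed access rates aggregated per telemetry window.
    for rate in [150.0, 150.0, 60.0, 60.0, 10.0]:
        replicas = next_replica_count(replicas, rate, policy)
        print(f"rate={rate:6.1f} req/min -> replicas={replicas}")
```

The sketch covers only the replica count. Placement, that is, which peers hold the replicas and how topology influences the choice, is left out.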
License
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.