Ideal directory structure for web application

I'm about to create a user-based website and will have to store photos, docs and other data for each user.
If I take a silly number like 1 000 000 000 users, I believe that one folder with 1 000 000 000 subfolders won't be the fastest thing in the world! So I was thinking of creating something like
1st level : [a-z]
2nd level : [a-z]
3rd level : [a-z]
Therefore bobby will be in /b/o/b/by
But this also means that it won't be spread equally, because there will be very few users starting with a z and many more with m, s, l ...
So I was thinking of using a user ID
such as "000000000001", "000000000002" etc.
1st level : [000-999]
2nd level : [000-999]
3rd level : [000-999]
Therefore the data of user 000000000001 will be stored in /data/000/000/000/001
Then I will be sure to have a maximum of 1000 folders in each level.
What do you guys think about it? What should I do or not do?
The server will be running CentOS 5.4 with ext3 on RAID 1; if the I/O gets too bad
I will probably go for RAID 10.
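The ID-based layout described above can be sketched as follows. This is only an illustration: the function name `user_dir` and its parameters are made up here, and it assumes numeric IDs zero-padded to 12 digits and split into groups of three, as in the /data/000/000/000/001 example.

```python
from pathlib import PurePosixPath


def user_dir(user_id: int, root: str = "/data", width: int = 12, chunk: int = 3) -> PurePosixPath:
    """Map a numeric user ID to a nested directory path.

    The ID is zero-padded to `width` digits and cut into `chunk`-digit
    pieces, so each directory holds at most 10**chunk entries.
    """
    if user_id < 0 or user_id >= 10 ** width:
        raise ValueError("user id does not fit in the padded width")
    padded = str(user_id).zfill(width)
    parts = [padded[i:i + chunk] for i in range(0, width, chunk)]
    return PurePosixPath(root, *parts)


print(user_dir(1))
```

With sequential IDs the directories fill evenly, which is the property the letter-based scheme lacks.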

A hash function provides a way to distribute large amounts of data across an easily searchable structure.
See this related question: Why use hashing to create pathnames for large collections of files?
And also try looking through Google results for Directory Hashing.
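As a minimal sketch of directory hashing, assuming SHA-256 and three levels of two hex characters each (both are arbitrary choices for illustration, and `hashed_dir` is a hypothetical helper name):

```python
import hashlib
from pathlib import PurePosixPath


def hashed_dir(username: str, root: str = "/data", levels: int = 3, chars: int = 2) -> PurePosixPath:
    """Derive a nested directory from a hash of the username.

    Each level uses `chars` hex digits of the digest, so a level has at
    most 16**chars entries regardless of how usernames are distributed.
    The username is assumed to be safe to use as a directory name.
    """
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()
    parts = [digest[i * chars:(i + 1) * chars] for i in range(levels)]
    return PurePosixPath(root, *parts, username)


print(hashed_dir("bobby"))
```

The same name always maps to the same path, and names that share a prefix (bobby, bob, bobo) are scattered across different directories.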


Is it possible to use ClickHouse to implement an efficient union-find algorithm?

I have a typical union-find problem where I have to group records, but it involves multiple files with hundreds of billions of records.
Can I somehow use the ClickHouse database to solve it?
Edit - minimal reproducible example:
I have three columns (item_id, from, to) which represent graph nodes.
I want to create groups (id, group_id, item_id) which assign each item to the group of the disjoint set it belongs to.
item_id  from  to
0        101   102
1        102   103
2        104   105
id  group_id  item_id
0   0         0
1   0         1
2   1         2
There are only two groups #0 (101->102->103) and #1 (104->105).
The problem with an in-memory implementation is that there are too many records, and I want ClickHouse (or some other solution) to take care of filesystem caches.
Without knowing more about your specific data and questions, it is tricky to provide a definitive answer. In general, this represents a moderate size for ClickHouse. UNION is fully supported. Your best bet is to simply try it: loading or generating data is straightforward, and SQL queries can usually be translated from PostgreSQL/MySQL easily.
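For reference, here is a minimal in-memory sketch of the grouping the question describes, run on the three example rows. It is plain union-find with path halving, not a ClickHouse feature, and it does not by itself solve the out-of-memory scale; the function names are made up for illustration.

```python
def find(parent, x):
    """Return the representative of x, shortening the path as we go (path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the sets containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def group_items(edges):
    """edges: (item_id, from, to) tuples. Returns (id, group_id, item_id) rows."""
    parent = {}
    for _, src, dst in edges:
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)
        union(parent, src, dst)

    group_ids = {}  # set representative -> dense group number, in order of first use
    rows = []
    for row_id, (item_id, src, _) in enumerate(edges):
        root = find(parent, src)
        group_id = group_ids.setdefault(root, len(group_ids))
        rows.append((row_id, group_id, item_id))
    return rows


edges = [(0, 101, 102), (1, 102, 103), (2, 104, 105)]
print(group_items(edges))  # [(0, 0, 0), (1, 0, 1), (2, 1, 2)]
```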

Android/Java: Is there a fast way to filter large data saved in a list? And how to get high-quality pictures with small storage space on the server?

I have two questions.
The first one is:
I have a large amount of data coming from the server, which I save in a list. The customer can filter this data with 7 filters, two of them driven by a TextWatcher. This makes the filtering operation slow: it takes 4 seconds each time.
I tried to put the filter keywords (like length or width ...) in one if statement with && between them,
but it didn't give me a result. I also tried to replace the TextWatcher with a Spinner, but that did not help.
I'm using one for loop.
So the question is: how can I apply multiple filters to a list containing up to 2000 rows with minimal or no slowdown?
The second is:
I save from 2 to 8 pictures on the server in string form.
The question is: when I get these pictures from the server, how can I show them in high quality?
When I show them I can see the pixels, and this is not good for the customer.
I don't want these pictures to take a lot of space on the server, and at the same time I want them in good quality when I restore them for display.
I'm using Android/Java.
Thank you
The answer to my first question: if you want to use a filter (like when you are using an online clothes shop and want to filter by lowest price), you should use a hash map, not an ordinary list; it will be faster.
The answer to my second question: if you want to store images in a database, you should save each one as a link, not as a string or any other data type.
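The hash map idea from the first answer could look roughly like this. It is a sketch in Python rather than Android/Java, and the rows, field names and helper functions are all invented for illustration: exact-match filters become a dictionary lookup, and the remaining filters are applied together in a single pass.

```python
from collections import defaultdict

# Hypothetical rows; the field names are made up for this example.
ROWS = [
    {"name": "Oak board", "category": "wood", "length": 200, "width": 30},
    {"name": "Pine board", "category": "wood", "length": 120, "width": 20},
    {"name": "Steel rod", "category": "metal", "length": 200, "width": 2},
]


def build_index(rows, key):
    """Group rows by one field so an exact-match filter is a dictionary lookup."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def apply_filters(rows, min_length=None, max_width=None, text=""):
    """Apply all active filters in one pass; a filter left at its default is skipped."""
    needle = text.strip().lower()
    return [
        row for row in rows
        if (min_length is None or row["length"] >= min_length)
        and (max_width is None or row["width"] <= max_width)
        and (not needle or needle in row["name"].lower())
    ]


by_category = build_index(ROWS, "category")          # built once, when the data arrives
wood_long = apply_filters(by_category["wood"], min_length=150)
print([row["name"] for row in wood_long])  # ['Oak board']
```

The index is built once when the list is loaded, so each filter change only scans the rows that already match the exact-match field.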

Application for filtering a database in a short period of time

I need to create an application that would allow me to get phone numbers of users matching specific conditions as fast as possible. For example, we've got 4 columns in a SQL table (region, income, age, and a 4th with the phone number itself). I want to get phone numbers from the table with a specific region and income. Just making a SQL query won't help because it takes a significant amount of time. The database updates once per day and I have some time to prepare the data as I wish.
The question is: how would you make the process of getting phone numbers with specific conditions as fast as possible? O(1) in the best scenario. Consider storing values from the SQL table in RAM for the fastest access.
I came up with the following idea:
For each phone number create something like a bitset: 0 if the particular condition is false and 1 if the condition is true. But I'm not sure I can implement it for columns with non-boolean values.
Create a vector with phone numbers.
Create a vector with the phone numbers' bitsets.
To get phone numbers, iterate over the second vector and compare each bitset with the required one.
It's not O(1) at all. And I still don't know what to do about non-boolean columns. I thought maybe it's possible to do something good with std::unordered_map (all phone numbers are unique) or improve my idea with vectors and masks.
P.S. The SQL table consumes 4 GB of memory and I can store up to 8 GB in RAM. There are 500 columns.
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
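As a small illustration of the composite index, here is a sketch using SQLite through Python's standard sqlite3 module. The table name, column types and sample rows are assumptions made for the example; the real database and schema will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (phone TEXT PRIMARY KEY, region TEXT, income INTEGER, age INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        ("555-0001", "north", 50000, 30),
        ("555-0002", "north", 50000, 41),
        ("555-0003", "south", 50000, 25),
        ("555-0004", "north", 70000, 52),
    ],
)
# One index covering both filter columns, so the lookup does not scan the table.
conn.execute("CREATE INDEX idx_region_income ON users (region, income)")


def phones_for(connection, region, income):
    """Return the phone numbers matching an exact region and income."""
    cursor = connection.execute(
        "SELECT phone FROM users WHERE region = ? AND income = ? ORDER BY phone",
        (region, income),
    )
    return [row[0] for row in cursor]


print(phones_for(conn, "north", 50000))  # ['555-0001', '555-0002']
```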
If you really want it to be fast I think you should consider Elasticsearch. Think of every phone number in the DB as a doc with properties (your columns).
You will need to reindex the table once a day (or in real time), but when it's time to search you just use Elasticsearch filters to find the results.
Another option is to have an index for every column. In this case the engine will do an Index Merge to increase performance. I would also consider using MEMORY tables. If you write to this table, consider having a read replica just for reads.
To optimize your table, save your queries somewhere and add indexes (on multiple columns) just for the top X most popular searches, depending on your memory limitations.
You can use NVMe for your DB disk (if you can't load it into memory).
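The "one index per column" idea can also be mimicked in RAM, which is closer to what the question asks for. The sketch below is an assumption-laden illustration, not what any particular database engine does internally: each column gets a dictionary from value to the set of matching phone numbers, and a query intersects the sets. The helper names and sample rows are hypothetical.

```python
from collections import defaultdict


def build_column_indexes(rows, columns):
    """For each column, map every distinct value to the set of phones having it."""
    indexes = {column: defaultdict(set) for column in columns}
    for row in rows:
        for column in columns:
            indexes[column][row[column]].add(row["phone"])
    return indexes


def lookup(indexes, **conditions):
    """Intersect the per-column sets for the requested column=value pairs."""
    sets = [indexes[column].get(value, set()) for column, value in conditions.items()]
    if not sets:
        return set()
    sets.sort(key=len)  # start from the smallest set to keep intersections cheap
    return set.intersection(*sets)


rows = [
    {"phone": "555-0001", "region": "north", "income": 50000},
    {"phone": "555-0002", "region": "north", "income": 50000},
    {"phone": "555-0003", "region": "south", "income": 50000},
]
indexes = build_column_indexes(rows, ["region", "income"])  # rebuilt after the daily update
print(sorted(lookup(indexes, region="north", income=50000)))  # ['555-0001', '555-0002']
```

Non-boolean columns need no special handling here because the value itself is the dictionary key; each single-column lookup is O(1) on average, and the cost of a query is dominated by the size of the smallest matching set.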

I need help designing my C++ console application

I have a task to complete.
There are 4000+ CSV files of two types, related to each other.
The two types are:
1. Country2.csv
2. Security_Name.csv
Contents of Country2.csv:
Company Name;Security Name;;;;Final NOS;Final FFR
Contents of Security_Name.csv:
Date;Close Price;Volume
There are multiple countries, and for each country multiple security files.
Now I need to READ them, do some CALCULATION, and then WRITE the output to other files.
Read both files, Country 2.csv and Security.csv, and extract all the data from them.
For example :
Read France 2.csv and extract Security_Name, Final NOS, Final FFR
Then read Security.csv (the one which matches the Security_Name) and extract Date, Close Price, Volume
The calculations are basically finding the median of the extracted values, which is quite simple.
For Example:
Monthly Median Traded Values
Daily Traded Value of a Security ... and so on
Based on the month I need to sort the output into two different files with the following formats:
If Month % 3 = 0
Save it as MONTH_NAME.csv in the following format:
Security name; 12-month indicator; 3-month indicator; FOT
Save it as MONTH_NAME.csv in the following format:
Security Name; Monthly Median Traded Value Ratio; Number of days Volume > 0
My question is: how do I design my application in such a way that it is maintainable and the flow of data throughout the execution is seamless?
So first thing. Based on the kind of data you are looking to generate, I would probably be looking at moving this data to a SQL db if at all possible. This is "one SQL query" kind of stuff. And far more maintainable than C++ that generates CSV files from CSV files.
Barring that, I would probably look at using datamash and/or perl. On a Windows platform, you could do this through Cygwin or WSL. Probably less maintainable, but so much easier it's not too much of an issue.
That said, if you're looking for something moderately maintainable, C++ could work. The first thing I would do is design my input classes. Data-centric, but it can work. It sounds like you could have a Country class, a Security class, and a SecurityClose class...or something along those lines. You can think about whether a Security class should contain a collection of SecurityClose objects (data), or whether the data should just be "loose" and reference the Security it belongs to. Same with the Country->Security relationship.
Once you've decided how all that's going to look, you want something (likely a function) that can tokenize a CSV line. So "1,2,3" gets turned into a vector<string> with the contents "1" "2" "3". Then, each of your input classes should have a constructor or initializer that takes a vector<string> and populates itself. You might need to pass higher level data along too, like the filename if you want the security data to know which security it belongs to.
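The tokenize-then-populate shape described above can be sketched as follows. It is written in Python for brevity rather than C++, and several things are assumptions: the `;` separator (taken from the file headers in the question), the field types, the helper names, and the reading of "traded value" as close price times volume.

```python
from dataclasses import dataclass
from statistics import median


def tokenize(line, sep=";"):
    """Split one CSV line into trimmed fields (no support for quoted fields)."""
    return [field.strip() for field in line.rstrip("\n").split(sep)]


@dataclass
class SecurityClose:
    security: str       # passed in from the file name, as the answer suggests
    date: str
    close_price: float
    volume: int

    @classmethod
    def from_tokens(cls, security, tokens):
        """Populate one record from a tokenized Date;Close Price;Volume line."""
        date, close_price, volume = tokens[:3]
        return cls(security, date, float(close_price), int(volume))


def median_traded_value(closes):
    """Median of close price times volume (assumed meaning of 'traded value')."""
    return median(c.close_price * c.volume for c in closes)


lines = ["2020-01-02;10.0;100", "2020-01-03;20.0;10", "2020-01-06;5.0;100"]
closes = [SecurityClose.from_tokens("ACME", tokenize(line)) for line in lines]
print(median_traded_value(closes))  # 500.0
```

Keeping parsing in one function and construction in the class means a change in file layout touches only `from_tokens`, which is the maintainability point the answer is making.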
That's basically most of the battle there. Once you've pulled your data into sensibly organized classes, the rest should come more easily. And if you run into bumps, hopefully you can ask specific design or implementation questions from there.

Inner workings of Elasticsearch?

I want to learn how Elasticsearch works. I have concerns about the scalability of my design. I have 50 million documents. Every document has around 50 string properties, 45 integer properties and 5 datetime properties.
My concern is this: when I query ES with a predicate containing 8 fields and 3 sorts based on date and integer values, how does ES perform? What happens in the background, so that I can ensure performance when the system reaches 500 million documents?
The link blackpop provided in the comment is a good start to understand what's going on. But you don't need to understand everything to make things work. The good thing about Elasticsearch is that it's elastic. Meaning, it scales very well, so if you need more performance you just add more RAM/CPU/servers and maybe configure a cluster (well, at least then you should learn something about shards and nodes).
By the way, your scenario does not seem to be a very hard task for Lucene (on which ES is based), if you need queries to run in under a second or so. We use similar settings with > 200 M docs on a single mid-range server (around 2500 euro). I would encourage you to run real-life tests on your desktop/laptop, indexing 50 M docs. We did this, too.