Gustavo 'Gus' Carreno

Official launch of the 1 Billion Row Challenge in Object Pascal


2 minutes ago, Alexander Sviridenkov said:

HTML Library / SQL framework: 0.33s. Zero lines of code)

 


 

 

Hey Alexander,

 

I quite like your sense of humour 😁 !!

 

Doesn't quite satisfy the rules, and a line of SQL is still a line, but yeah, good one !!

 

Cheers,

Gus

1 hour ago, Gustavo 'Gus' Carreno said:

Doesn't quite satisfy the rules, and a line of SQL is still a line, but yeah, good one !!

 

Why not? The whole app in the screenshot is written in Object Pascal; AFAIK it will even compile on your Ubuntu.

5 minutes ago, Attila Kovacs said:

Why not? The whole app in the screenshot is written in Object Pascal; AFAIK it will even compile on your Ubuntu.

Welp, it has to be a command-line program I can time with hyperfine, as stated in the rules.

It needs to output to STDOUT, as stated in the rules.

It needs to be pure Object Pascal with no external libs or package dependencies, as stated in the rules.

 

Must I go on 😉 ?

 

Cheers,

Gus


Best of luck with that. The issue with these challenges isn't the problem they aim to solve, but rather, who on earth has the time for them.

3 minutes ago, Attila Kovacs said:

Best of luck with that. The issue with these challenges isn't the problem they aim to solve, but rather, who on earth has the time for them.

Hey Attila,

 

Thank you very much !!!

 

I don't see that as an issue. I see it as a fact of life, like the fact that everyone has a personal life.
Then you choose to make the time for it or not, and that depends on your schedule and your will to participate.

I'm not putting a gun to anyone's head, just proposing a fun, and quite optional, exercise in programming.

 

Cheers,

Gus


I wonder how much of the time depends on where the file is read from: RAM Disk, SSD, HDD and for the latter whether it's already in the cache or not.

 

2 (American billion) = 2.000.000.000 lines of about 15 characters makes it about 30.000.000.000 bytes, that's 30 Gig of data to read, split into lines, then split into name and value and then aggregate by name.

 

32 bit Delphi won't be able to handle that with a StringList because it won't fit into memory; I wonder whether there are any bugs in the RTL that would prevent that with a 64 bit Delphi program. But anyway: using a StringList is probably not the most efficient way of reading the data. Plain old ReadLn would likely do the trick faster. Some kind of buffering might speed it up, and maybe parsing based on a PChar pointer rather than strings.

 

Then comes selecting a suitable data structure, probably some hash-based dictionary.
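Just to make the thinking aloud a bit more concrete, roughly what I have in mind, as an untested sketch (assuming the usual "name;value" line format; all names are made up and details like the decimal separator are glossed over):

uses
  SysUtils, Generics.Collections;

type
  TStationStats = record
    Min, Max, Sum: Double;
    Count: Int64;
  end;

procedure Aggregate(const AFileName: string);
var
  F: TextFile;
  Line, Name: string;
  Value: Double;
  SepPos: Integer;
  Stats: TStationStats;
  Map: TDictionary<string, TStationStats>;
begin
  Map := TDictionary<string, TStationStats>.Create;
  try
    AssignFile(F, AFileName);
    Reset(F);
    try
      while not Eof(F) do
      begin
        // plain old ReadLn, one line at a time
        ReadLn(F, Line);
        SepPos := Pos(';', Line);
        Name := Copy(Line, 1, SepPos - 1);
        // real code would pass a TFormatSettings with '.' as DecimalSeparator
        Value := StrToFloat(Copy(Line, SepPos + 1, MaxInt));
        if Map.TryGetValue(Name, Stats) then
        begin
          if Value < Stats.Min then Stats.Min := Value;
          if Value > Stats.Max then Stats.Max := Value;
          Stats.Sum := Stats.Sum + Value;
          Inc(Stats.Count);
        end
        else
        begin
          Stats.Min := Value;
          Stats.Max := Value;
          Stats.Sum := Value;
          Stats.Count := 1;
        end;
        Map.AddOrSetValue(Name, Stats);
      end;
    finally
      CloseFile(F);
    end;
    // ... write min/mean/max per station to STDOUT here ...
  finally
    Map.Free;
  end;
end;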

 

The rest is not much of a challenge.

Quote

I wonder how much of the time depends on where the file is read from: RAM Disk, SSD, HDD and for the latter whether it's already in the cache or not.

 

I'm performing tests on both an SSD and an HDD and the results reflect that: https://github.com/gcarreno/1brc-ObjectPascal#results

I'm using hyperfine to run the program 10 times. This gives the system (Ubuntu 23.10, 64-bit) the opportunity to cache what it needs to cache.
The specs of my machine are listed in the GitHub repository.
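For reference, the invocation is roughly this, where the entry's binary and the measurements file names are just placeholders:

hyperfine --runs 10 './entry ./measurements.txt'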

 

Quote

2 (American billion) = 2.000.000.000 lines of about 15 characters makes it about 30.000.000.000 bytes, that's 30 Gig of data to read, split into lines, then split into name and value and then aggregate by name.

 

The input file has 1 (American) billion = 1.000.000.000 lines and is ~16GiB in size.
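(Back-of-the-envelope: 1.000.000.000 lines at an average of roughly 17 bytes per line, newline included, is about 17 GB of raw text, i.e. the ~16GiB above.)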

 

Quote

32 bit Delphi won't be able to handle that with a StringList because it won't fit into memory; I wonder whether there are any bugs in the RTL that would prevent that with a 64 bit Delphi program. But anyway: using a StringList is probably not the most efficient way of reading the data. Plain old ReadLn would likely do the trick faster. Some kind of buffering might speed it up, and maybe parsing based on a PChar pointer rather than strings.

 

Yeah, sorry, I don't have the necessary knowledge to even comment on that 😅

 

Quote

Then comes selecting a suitable data structure, probably some hash-based dictionary.

 

Yeah, agreed!

 

Quote

The rest is not much of a challenge.

 

That depends on the opinion you have about using threads and their implicit complexity. But yeah, the real challenge is to make it as blazingly fast as you can!!
And the time to match, or beat, is one second. That comes from the results of the original challenge, written in Java.
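For the curious, the gist of the threaded approach, as I understand it: split the file into one byte range per thread, nudge each boundary forward to the next line break so no line gets cut in half, let each thread aggregate its own range, and merge the results at the end. A very rough, untested sketch of just the boundary part (not taken from any actual entry):

uses
  Classes;

type
  TInt64Array = array of Int64;

// Split the file into ThreadCount byte ranges, moving each split
// point to just past the next line feed so ranges hold whole lines.
procedure ComputeChunkBounds(Stream: TStream; ThreadCount: Integer;
  out Bounds: TInt64Array);
var
  I: Integer;
  P: Int64;
  B: Byte;
begin
  SetLength(Bounds, ThreadCount + 1);
  Bounds[0] := 0;
  Bounds[ThreadCount] := Stream.Size;
  for I := 1 to ThreadCount - 1 do
  begin
    P := (Stream.Size div ThreadCount) * I;
    Stream.Position := P;
    repeat
      Stream.ReadBuffer(B, 1);
      Inc(P);
    until B = 10; // 10 = LF
    Bounds[I] := P;
  end;
end;

// Thread K then processes bytes Bounds[K] .. Bounds[K+1] - 1.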

 

Cheers,

Gus


Hey dummzeuch,

 

BTW... Writing 5 paragraphs with a conjecture of how to do it and then dismissing the entire thing as being "not much of a challenge" is a bit of a crappy dismissal, no?

Instead of just resting on your thought experiment, why don't you put your money where your mouth is and prove what you claim?

 

Shooting from the hip is rather easy, but making an entry and proving your chops is something entirely different, right?

 

I'm a bit miffed and my choice of words may seem harsh, but the lack of usefulness or any point in the answers I got just gave me a very bad brogrammer-machismo vibe that I've only seen on Stack Overflow.

 

If this is the type of welcome you peeps extend to a newcomer... I dunno... It's pretty toxic...

Even if this isn't the type of thing the regulars here have an interest in, a sense of community is the least one could expect, no?

 

I deeply regret the thought I had of attempting to post here! I just hope that name-calling and dumb shaming are not the next things I'm to be dealt...

 

Cheers,

Gus

8 minutes ago, Gustavo 'Gus' Carreno said:

If this is the type of welcome you peeps extend to a newcomer... I dunno... It's pretty toxic...

It's Sunday man. You haven't even met the hardcore yet. Be patient. Perhaps you could create a table to track how long it takes until you become angry again 😛


Hey Attila,

 

Quote

It's Sunday man. You haven't even met the hardcore yet. Be patient. Perhaps you could create a table to track how long it takes until you become angry again 😛

 

I am known to go off on tangents and angry rants, sometimes over not much.

 

I'm also involved in a bunch of other communities like Telegram (English and Portuguese, Lazarus and Delphi), Discord (3 servers for Lazarus and Delphi) and the Lazarus Forums.

In none of those have I ever had such a response.
In all of the above I try really hard to welcome the padawans that wander in with the most devilishly incomplete and weird questions, trying to hang on to the patience of a saint.
Heck, I even got made MVP by Ian because of that alone!!

 

When I come to a new place and I'm greeted this way, welp, my short fuse got lit, burned down and passed the spark to the gunpowder!!

I probably need to apologise for the words I've used. But I'm not apologising for the message conveyed!

 

Cheers,

Gus

2 minutes ago, Brian Evans said:

It would have been received better if you hadn't left out the background of the challenge - that it was originally a Java challenge and has subsequently been picked up by other languages.

 

GitHub - gunnarmorling/1brc: 1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

Hey Brian Evans,

 

I was only trying to be brief, since the README file on the GitHub repository has all the needed information, plus the necessary attributions.

 

I left the correct trail to follow for the ones interested in getting to the bottom of it all.

I don't think that me leaving a breadcrumb trail should be used as an excuse to just dismiss the whole thing entirely.

 

But again, maybe due to my short fuse, even though it did take me more than a couple of hours to come back and write that less-than-honourable post, I will apologise for the type of wording I used. But not for the content itself!!

 

Cheers,

Gus

25 minutes ago, Gustavo 'Gus' Carreno said:

Writing 5 paragraphs with a conjecture of how to do it ... Instead of just resting on your thought experiment, why don't you put your money where your mouth is and prove what you claim?

I can't be bothered, sorry. I was only "thinking aloud". Maybe I shouldn't have written it as a comment, though.

Just now, dummzeuch said:

I can't be bothered, sorry. I was only "thinking aloud". Maybe I shouldn't have written it as a comment, though.

Hey Dummzeuch,

 

I really enjoyed the way you laid it out, truly.

 

It's just that the last paragraph put a flame to my short fuse.

 

If you had just said something along the lines of: "Nice thing to have a go at, if only I ever had the time for it."

I would just be beaming with contentment and would never have shot off my big mouth.
I would have been quite grateful for your input and gone off to do something else. The end...

 

Cheers,

Gus

10 hours ago, Attila Kovacs said:

Why not? The whole app in the screenshot is written in Object Pascal; AFAIK it will even compile on your Ubuntu.

Because he forgot to mention how long it takes to populate the weather_station table with one billion records. I'm guessing many minutes, unless he has very performant hardware. Running the query is only half of the story. The post is funny, though.

9 hours ago, Gustavo 'Gus' Carreno said:

Hey Brian Evans,

 

I was only trying to be brief, since the README file on the GitHub repository has all the needed information, plus the necessary attributions.

 

I left the correct trail to follow for the ones interested in getting to the bottom of it all.

I don't think that me leaving a breadcrumb trail should be used as an excuse to just dismiss the whole thing entirely.

 

But again, maybe due to my short fuse, even though it did take me more than a couple of hours to come back and write that less-than-honourable post, I will apologise for the type of wording I used. But not for the content itself!!

 

Cheers,

Gus

It is missing WHY this specific task was chosen and WHY somebody might want to tackle it. Without either WHY, the task itself seems silly and not worth much time. Read the blog post and README from the point of view of somebody who has never heard of the "1 Billion Row Challenge". Only by following and reading some of the LINKS in the README would they find or deduce answers for the two WHYs.

 

This observation is not really meant as criticism but as feedback on why the response here has been so lackluster: at first look it seems like a very silly contest, so it got silly and "who cares" answers.


Hey Brian,

 

Okydokes, I get it!!

 

I completely forgot to account for the fact that I'm a 53-year-old person living in an era where the average attention span is... oopsss, it's gone!!

 

And for that I deeply apologise !! I shoulda known better, cuz I do have 2 kids that show those symptoms and I completely forgot about that fact.

 

Sorry!!

 

Cheers,

Gus


Hey Gus,

It seems like a fun challenge that people who have the time and interest will look at and participate in. We've got a couple of months, so there's plenty of time to get involved if you're so inclined.

 

One question: the .CSV in the repository has only 44,691 entries. So, the idea is that we need to run the generator program first to generate the 1B file, right?  I suppose this could also, then, be used to generate smaller files for development testing.

 

Thanks for the links and for bringing it to the Delphi community!


Hey Cornelius,

 

Quote

It seems like a fun challenge that people who have the time and interest will look at and participate in.

Absolutely!!

 

Quote

We've got a couple of months, so there's plenty of time to get involved if you're so inclined.

Correctomundo!!

 

Quote

One question: the .CSV in the repository has only 44,691 entries. So, the idea is that we need to run the generator program first to generate the 1B file, right?
I suppose this could also, then, be used to generate smaller files for development testing.

You are correct. Like any well-behaved Linux command (well, at least I made an effort on the Lazarus side), if you run it with the `-h` or `--help` param it will print its usage.
The Delphi one has the same behaviour. And we also made an effort to make the Delphi and Lazarus sides of things match in terms of generation.
The main objective of having a generator is that anyone can practice with the exact same content.
The other objective is simply that the file containing the full 1 billion rows is ~16GiB. No way we were going to store that in a free GitHub repository.

$ ./bin/generator -h
Generates the measurement file with the specified number of lines

USAGE
  generator <flags>

FLAGS
  -h|--help                      Writes this help message and exits
  -v|--version                   Writes the version and exits
  -i|--input-file <filename>     The file containing the Weather Stations
  -o|--output-file <filename>    The file that will contain the generated lines
  -n|--line-count <number>       The amount of lines to be generated ( Can use 1_000_000_000 )

 

The input and output files are required, as is the number of lines to generate.
The input file is the one you mentioned, with the ~44K entries.
The output file is of your choice.

The number of lines can be in the normal base-10 format, or use underscores as the thousands separator, as shown in the usage printed above.
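So, for example, a full-size run would be something along these lines (file names here are just placeholders, use whatever matches your checkout):

./bin/generator -i weather_stations.csv -o measurements.txt -n 1_000_000_000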

 

Most of us have been running tests on files of about 100 million rows, but that's just an example.

 

Quote

Thanks for the links and for bringing it to the Delphi community!

You're more than welcome !!

 

Hope you can make the time to participate and have a ton of fun while doing it!!

Cheers,

Gus

 

On 3/10/2024 at 9:06 AM, Alexander Sviridenkov said:

HTML Library / SQL framework: 0.33s. Zero lines of code)

So assuming that your code scales linearly it will only take 92 days for 1 billion rows

18 minutes ago, Stefan Glienke said:

So assuming that your code scales linearly it will only take 92 days for 1 billion rows 

USA Billion 😉 

43 minutes ago, Stefan Glienke said:

So assuming that your code scales linearly it will only take 92 days for 1 billion rows

Scaling is not linear there. A real 10^9-line file (16.5 GB) is processed in 15 minutes (936 sec) on a Ryzen 5 4600H notebook (single thread).

18 hours ago, kolbasz said:

Because he forgot to mention how long it takes to populate the weather_station table with one billion records. I'm guessing many minutes, unless he has very performant hardware. Running the query is only half of the story. The post is funny, though.

There are no tables. The query is executed directly on the CSV file.

